Comparison of heat maps showing residence price generated using interpolation methods

MARK WONG

KTH ROYAL INSTITUTE OF TECHNOLOGY

MARKWONG@KTH.SE

Master in Computer Science

Date: September 3, 2017

Supervisor: Joel Brynielsson

Examiner: Olle Bälter

Swedish title: Jämförelse av färgdiagram för bostadspriser genererade med hjälp av interpolationsmetoder

Abstract

In this report we attempt to provide insights into how interpolation can be used for creating heat maps showing residence prices for different residence markets in Sweden. More specifically, three interpolation methods are implemented and then applied to three Swedish residence markets. These three residence markets have varying characteristics such as size and residence type. Data on residence sales and the physical definitions of the residence markets were collected. As residence sales are never identical, they were preprocessed to make them comparable. For comparison, a so-called external predictor was used as an extra parameter for the interpolation method. In this report, distance to nearest public transportation was used as an external predictor.

The interpolated heat maps were compared and evaluated using both quantitative and qualitative approaches. Results show that each interpolation method has its own strengths and weaknesses, and that using an external predictor results in better heat maps compared to only using residence price as predictor.

Summary (Sammanfattning)

This report attempts to provide insights into how interpolation can be used to create heat maps of residence prices for different residence markets in Sweden. More specifically, three interpolation methods are implemented and then applied to three different Swedish residence markets. These three residence markets differ in character with respect to size and residence type. Residence sales data and the physical definitions of the residence markets were collected. Since residence sales are never identical, they are first processed in order to make them comparable. An external predictor, which is an extra parameter for interpolation methods, was also investigated. In this report, the distance to the nearest public transportation was used as the external predictor.

The interpolated heat maps were compared and evaluated using both a quantitative and a qualitative method. The results show that each interpolation method has its strengths and weaknesses, and that using an external predictor always resulted in a better heat map compared to using only residence price as predictor.

Contents

1 Introduction
    1.1 Purpose and problem statement
2 Background
    2.1 Thesis client
    2.2 Data
        2.2.1 Polygons
        2.2.2 Residence transactions
    2.3 Related work
3 Theory
    3.1 Interpolation methods
        3.1.1 Inverse distance weighting
        3.1.2 Thin plate smoothing spline
        3.1.3 Kriging
4 Methodology
    4.1 Overview
    4.2 Data collection and preprocessing
        4.2.1 Booli data
        4.2.2 Valueguard HOX indices
        4.2.3 Google Maps public transportation data
        4.2.4 Outliers
    4.3 Parameter selection
        4.3.1 Parameter for Inverse distance weighting
        4.3.2 Parameter for Thin plate smoothing spline
        4.3.3 Model selection for Kriging
        4.3.4 Grid size of polygon
    4.4 Evaluation
        4.4.1 K-fold cross validation
        4.4.2 Root-mean-squared error
        4.4.3 Visual evaluation
    4.5 Chosen interpolation models
5 Results
    5.1 Residential area #1
    5.2 Residential area #2
    5.3 Residential area #3
    5.4 Expert judgement
    5.5 Computation time
6 Discussion
    6.1 Quantitative and qualitative results
    6.2 Data sets
    6.3 Sustainability and ethics

List of Figures

2.1 Polygon data structure of the city district area Södermalm, Sweden
2.2 The polygon from Figure 2.1 projected on OpenStreetMap
3.1 Semi-variance (circles) and fitted model (line)
4.1 Flow chart of methodology
4.2 Root-mean-squared error for different p-values
5.1 Distribution of observations for Södertälje
5.2 Distribution of observations for Södertälje
5.3 Interpolated map of Södertälje using Kriging with price
5.4 Interpolated map of Södertälje using Kriging with price and distance to transport
5.5 Interpolated map of Södertälje using IDW with price
5.6 Interpolated map of Södertälje using TPS with price
5.7 Distribution of observations for Eskilstuna
5.8 Distribution of observations for Eskilstuna
5.9 Interpolated map of Eskilstuna using Kriging with price
5.10 Interpolated map of Eskilstuna using Kriging with price and distance to transport
5.11 Interpolated map of Eskilstuna using IDW with price
5.12 Interpolated map of Eskilstuna using TPS with price
5.13 Distribution of observations for Stockholm
5.14 Distribution of observations for Stockholm
5.15 Interpolated map of Stockholm using Kriging with price
5.16 Interpolated map of Stockholm using Kriging with price and distance to transport
5.17 Interpolated map of Stockholm using IDW with price
5.18 Interpolated map of Stockholm using TPS with price

List of Tables

2.1 Fictive example of a residence transaction of an apartment
2.2 Fictive example of a residence transaction of a house
4.1 Residence transaction data
4.2 Chosen interpolation models
5.1 RMSE for residential area #1
5.4 Computation time for residential area #1 (Södertälje) measured in ms
5.5 Computation time for residential area #2 (Eskilstuna) measured in ms
5.6 Computation time for residential area #3 (Stockholm) measured in ms

1 Introduction

Purchasing and selling residences is often regarded as one of the biggest decisions in an individual’s life. Deciding on what and when to buy or sell can have a long-term effect on one’s personal finances. In recent years, the residence market has been a recurring topic among citizens of Sweden. This is understandable since the residence prices for apartments in Sweden have risen by more than 247% since 1996 [1]. It is therefore of great interest for society that the information available on the residence market is transparent and fair. Thanks to the Internet and its users we are today able to find and store more data than ever concerning residence sales. All this data could be used in combination with interpolation in an attempt to create a more transparent residence market.

Finding an accurate and efficient way to predict residence prices is therefore of great interest both for buyers and sellers of residences. Historically there have been numerous ways to predict the price of a residence, the most common one being to consult a real estate agent. What real estate agents often do is check recent sales of residences with similar characteristics such as building type, operational costs, size of the living area, etc. With the help of this data the real estate agents are then able to calculate a theoretical price for a residence. Besides the cost of consulting a real estate agent, it is also a tedious and time-consuming activity that can be made more efficient.

Interpolation is an emerging field in the domain of price prediction of residences. Its use has shown great potential, and research on it is constantly growing [2]. Historically it has been used for prediction of locations of natural resources and meteorological variables (e.g., rainfall) [3]. Interpolation has also been shown to be valuable for tasks like residence price prediction [2, 4, 5, 6]. What all of the above have in common is that the variables are, both in theory and in practice, geographically dependent. Occurrences of natural resources such as gold and ores have been shown to be geographically dependent. In other words, if a location contains a large amount of gold there is a high probability that neighbouring areas contain gold as well.

When discussing residence sales, this report will henceforth include all categories of residences including apartments, townhouses, houses, etc. It is also important to mention that the real estate market in Sweden is further divided into mainly two forms of ownership. The first form is the right to rent an apartment. This form implies that you are given a contract with the right to rent an apartment for a monthly cost. It does not require any initial payment. The second form is co-operative building society dwelling (sv. bostadsrätt), which entitles the owner to use the apartment for a monthly cost. This form requires an initial one-time payment as well, and the dwelling can at any time be traded on the real estate market. When discussing real estate prices, one refers to the second form. This report will henceforth only include co-operative building society dwellings when discussing apartments.

In Sweden a residential area tends to have clusters of residences that belong to the same category. As with natural resources, residence prices have also shown similar trends. The characteristics and prices of residences in Sweden are in many ways dependent on location and its surroundings. The sold price and residence type for each category often correlate geographically, meaning that two identical apartments situated within 50 meters of each other will differ less in price than either of them differs from a third identical apartment located 1 km away. This correlation is why interpolation of residence prices is sensible.

solution, mass prediction can be done for a large set of residences, which will hopefully contribute to a more transparent and efficient real estate market.

1.1 Purpose and problem statement

The aim is to produce accurate heat maps of housing areas in Sweden in terms of price. The heat maps should preferably be able to differentiate between areas where there is a significant price difference even though they are geographically close. One example is when one area has close access to transportation while the neighbouring area does not. This can for example happen when the areas are separated by water. This kind of factor should be captured and reflected in the heat maps.

This report aims to explicitly answer the following question:

• Which interpolation method is best suited for the task to generate

2 Background

This chapter presents the thesis client and contains a description of the data that has been used. It then presents related work both within the domain of this report and related to other domains.

2.1 Thesis client

This report was conducted at Booli Search Technologies AB (henceforth "Booli"). Booli develops products and services in the domain of residence markets. Their main product is the web service Booli.se, which lets users, e.g., search for residences up for sale and access sold prices for all types of residences in Sweden. Currently they are the leading service provider of residence sales in terms of number of listings in Sweden. Booli has databases that contain data regarding a large part of all residence transactions in Sweden from 2012 onwards. For some areas there is data from 2008 with varying quality. These databases are populated daily with new data from residence sales. An accurate solution for mass residence price prediction is in high demand by the thesis client. Booli has an existing model for predicting residence prices and wishes to improve the way the model takes geographical price variation into account. With the findings in this thesis, Booli could in the future refine its approach to offer more accurate residential price prediction.

2.2 Data

Booli has access to various types of data. There is for example a database containing a large part of residence transactions, which includes the corresponding sold price, address, date, etc. There is also a database that contains polygon data for various areas in Sweden. This data is available via API calls.

Figure 2.1: Polygon data structure of the city district area Södermalm, Sweden

2.2.1 Polygons

A polygon can be described as a set of geographical points that together make up a geographical shape when connected. In Figure 2.1 a polygon of the city district area Södermalm, Sweden is shown. Booli has access to polygons of varying geographical sizes, ranging from country, province (sv. landskap) and municipality to urban area. For country, province and municipality there exists an official definition and the corresponding polygons are therefore obtained from the Swedish Lantmäteriet. For urban areas, Booli has in cooperation with Swedish real estate agencies developed unofficial polygons for in-house use. There exists for instance no official definition of what makes up the geographical area Södermalm. Generally, the same applies for all city district areas in Sweden. It is also important to note that Booli has access to polygons that describe even smaller areas. Sub-areas of Södermalm such as Skanstull, Medborgarplatsen and Slussen are available as well. These are also developed together with real estate agencies. In addition to expert judgement, Booli also lets visitors of the web service Booli.se map out urban areas with the purpose of refining these areas.

Figure 2.2 shows the polygon from Figure 2.1 projected on a map offered by OpenStreetMap.

Figure 2.2: The polygon from Figure 2.1 projected on OpenStreetMap

2.2.2 Residence transactions

Booli has extensive data, roughly 1 million observations, covering a large part of the residence transactions in Sweden since 2012. The type of data depends on residence type. For object type "apartment" the most relevant data describing a transaction is shown in Table 2.1. The corresponding data for object type "house" is shown in Table 2.2.

Table 2.1: Fictive example of a residence transaction of an apartment

Field               Value
soldPrice           3950000 (SEK)
soldDate            2016-03-12 20:57:21
publishedDate       2016-03-06 12:10:10
objectType          Apartment
livingArea          52 (m²)
squareMeterPrice    75961.54 (SEK/m²)
longitude           59.31149
latitude            18.07435
streetAddress       Götgatan 114
rentalCost          3213 (SEK)
rooms               2
constructionYear    1926
floor               3
namedAreas          Södermalm
areaId              115341 (Booli ID-system)
distanceWater       369 (m)

Table 2.2: Fictive example of a residence transaction of a house

Field               Value
soldPrice           6950000 (SEK)
soldDate            2016-08-21 12:53:01
publishedDate       2016-07-30 21:01:43
objectType          Villa
livingArea          145.2 (m²)
plotArea            283 (m²)
longitude           59.30406
latitude            18.19527
streetAddress       Vesslevägen 12
operatingCost       3710 (SEK)
rooms               6
constructionYear    1975
namedAreas          Nacka

Naturally, the data available and the data of interest vary with residence type. The majority of data columns are identical for all residence types, but some columns differ. Typically, a house has additional area (excluding the living area) in the form of a garden, parking space, etc., which is an important factor for price prediction. A house also has an operating cost which includes heating and electricity costs. Whether a house has access to geothermal heating could also be an important factor. For apartments, on the other hand, factors such as rental cost and which floor the apartment is located on affect the price.

Since residences are never exactly identical, an important step is to normalize them prior to interpolating. The primary factor that needs to be adjusted is the sold price. The sold price of, e.g., a small apartment will almost always be less than that of a large apartment. Since an apartment building consists of apartments of varying sizes, one way to tackle this problem is to instead use the sold price per square meter. This makes comparisons between apartments of different sizes more accurate. Houses, on the other hand, are not as dependent on the size of the living area. Two different houses that are geographically close and of the same size but with different plot areas can differ greatly in price. Another factor to consider is the date of a residence transaction. Due to inflation and changes in market price, the sold price needs to be adjusted accordingly. Valueguard is a Swedish private company that in cooperation with KTH has developed the Nasdaq OMX Valueguard-KTH Housing Index (HOX) for different residential areas in Sweden [7]. The indices are offered with the purpose of providing an efficient way to track the real estate market. They are free to use and are divided into the following categories:

Each category is offered for both apartments and houses and could be utilized to reduce the relevance of the date of a residence sale.
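The two adjustments discussed above, price per square meter and index-based date adjustment, can be sketched as follows. This is a minimal sketch under stated assumptions: the function names are hypothetical, the index is assumed to be a simple ratio scale, and the values in `hox_index` are invented placeholders rather than real HOX figures.

```python
def price_per_square_meter(sold_price: float, living_area: float) -> float:
    """Sold price divided by living area (SEK per square meter)."""
    if living_area <= 0:
        raise ValueError("living_area must be positive")
    return sold_price / living_area


def adjust_to_reference(price: float, index_at_sale: float,
                        index_at_reference: float) -> float:
    """Rescale a price from the sale period to a reference period via a price index."""
    return price * index_at_reference / index_at_sale


# Placeholder index values keyed by "YYYY-MM"; NOT real HOX figures
hox_index = {"2016-03": 200.0, "2017-01": 220.0}

# Sold price and living area taken from the fictive example in Table 2.1
sqm_price = price_per_square_meter(3950000, 52)
adjusted = adjust_to_reference(sqm_price, hox_index["2016-03"], hox_index["2017-01"])
```

With the placeholder values the index rose by 10% between the two months, so the adjusted price per square meter is 10% higher than the original one.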

2.3 Related work

Interpolation has previously been applied for prediction in similar spatial problems with great success. For the specific domain of this report, mainly kriging, thin plate splines and inverse distance weighting have been investigated. The background and theory related to these interpolation methods are explained in Chapter 3. A few reports have compared the aforementioned interpolation methods, both within this domain and in other domains, with promising results [3, 2, 8, 4, 9, 10, 5, 6]. However, it has not been investigated whether the results obtained in other domains could be applied to the domain of this report, which is residence price prediction.

A few previous papers investigate the possibility of predicting meteorological variables such as rainfall [8] and evaporation [3]. Hiemstra and Sluiter [3] compare several interpolation methods including inverse distance weighting (IDW), thin plate spline (TPS) and kriging in terms of production of accurate heat maps. This is done using cross-validation and comparing their root-mean-squared error (RMSE) and mean error (ME). They mention that these metrics can only indicate how the interpolation methods perform relative to each other and cannot be seen as an absolute measure of performance. To solve this, the metrics are divided by the standard deviation of the observations [3]. The authors also include a more subjective evaluation method: expert judgement on the visual appearance of the heat maps is factored into their conclusion. They conclude that there is a clear trend in that thin plate spline and kriging were the best interpolation methods. Further, Hiemstra and Sluiter [3] mention that IDW interpolates maps that show a tendency towards more local patterns. A local pattern is a pattern occurring only in a specific area that differs from the general pattern.
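The metrics mentioned above can be sketched as follows. This is a minimal sketch, assuming that the error is defined as prediction minus observation and that the population standard deviation is used for the normalization; neither choice is stated in the text, and the function names are hypothetical.

```python
import math
from statistics import pstdev
from typing import Sequence


def mean_error(observed: Sequence[float], predicted: Sequence[float]) -> float:
    """Mean error (ME): average signed difference, positive means over-prediction."""
    return sum(p - o for o, p in zip(observed, predicted)) / len(observed)


def rmse(observed: Sequence[float], predicted: Sequence[float]) -> float:
    """Root-mean-squared error (RMSE)."""
    return math.sqrt(sum((p - o) ** 2 for o, p in zip(observed, predicted)) / len(observed))


def normalized(metric: float, observed: Sequence[float]) -> float:
    """Divide a metric by the standard deviation of the observations.

    Assumption: population standard deviation (pstdev) is used here.
    """
    return metric / pstdev(observed)
```

A normalized RMSE well below 1 indicates that the predictions are closer to the observations than the overall spread of the data, which makes values comparable across data sets.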

compares to a multivariate approach (cokriging). The auxiliary variables used in this study are surface area, amplitude, age, number of rooms and whether the apartments have central heating. The amplitude is defined by Chica-Olmo [2] as the quotient between the surface area of the apartment and the number of rooms. The report focuses only on interpolating heat maps for the city of Granada and has a rather small sample size of only 287 apartments in the data set. The most important finding from this report is that cokriging as well as univariate kriging are both viable methods for carrying out mass residence prediction [2]. Chica-Olmo [2] also concludes that for the study area, more expensive residential areas often contain good social, hospital, educational and commercial services.

McCluskey et al. [4] investigate several methods for mass residence price prediction. They conclude that methods that incorporate spatial correlation yield more accurate predictions. Further, they arrive at the conclusion that both location and the structural characteristics of a residence are key variables that affect the accuracy of predicting residence prices. This strengthens the conclusions from [2], where it was found that including characteristics of a residence (e.g., surface area and number of rooms) improved the accuracy of heat maps.

Chen and Liu [11] investigate IDW for the purpose of interpolating rainfall distribution in the middle of Taiwan. Different variations of IDW with different p-values were tested. The authors limited the p-values to the range from zero to five, in increments of 0.1. The IDW models with different p-values were evaluated with cross-validation and the corresponding RMSE for the purpose of finding the optimal p-value [11]. Chen and Liu [11] further arrive at the conclusion that the accuracy of rainfall prediction depends greatly on finding the optimal p-value. The authors also argue that the number of observations influences the accuracy as well, and that having a large spread of observations is favorable. In summary, IDW is deemed a suitable interpolation method for rainfall prediction [11].

3 Theory

This chapter presents an explanation and discussion of interpolation. Each interpolation method used in the report is introduced and explained. The chapter also explains how these methods can be tuned for better results.

3.1 Interpolation methods

Interpolation is a method in numerical analysis for predicting unseen data based on known data. Interpolation in the scope of this thesis project corresponds to the process of finding estimates of some variable for unseen locations using locations where there is known data. There exist various methods of interpolation of differing complexity. One of the most basic ones is global mean interpolation, which is the process of taking the global mean of the data and assigning it to all unknown locations [3]. This method could be effective for smaller areas where there are few local trends. Another method is the nearest neighbour approach, which assigns to each unassigned location the value of the nearest known data point [3]. A slightly more advanced method is inverse distance weighting interpolation (IDW), which can be tuned with a single parameter to balance local and global trends [12].
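The two basic methods described above can be sketched in a few lines. This is a minimal sketch, assuming that each observation is a tuple (x, y, value) in planar coordinates; the function names are hypothetical.

```python
import math
from typing import Sequence, Tuple

Observation = Tuple[float, float, float]  # (x, y, value)


def global_mean(observations: Sequence[Observation]) -> float:
    """Global mean interpolation: the same value is assigned to every location."""
    return sum(value for _, _, value in observations) / len(observations)


def nearest_neighbour(observations: Sequence[Observation], x: float, y: float) -> float:
    """Nearest neighbour interpolation: value of the closest known observation."""
    nearest = min(observations, key=lambda o: math.hypot(o[0] - x, o[1] - y))
    return nearest[2]
```

These two methods are the extremes that IDW moves between when its single parameter is varied.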

Due to the relative simplicity of IDW, it will serve as a baseline for this thesis project. In the following sections IDW and two methods for more advanced interpolation will be discussed. In particular, they will be described in a theoretical sense.

3.1.1 Inverse distance weighting

Inverse distance weighted interpolation (IDW) is a deterministic interpolation method that works on the assumption that the distance between two geographical points affects their similarity [3]. This means that the further apart two points are, the less similar they will be. To interpolate an unknown observation $\hat{Z}(s_0)$ at a location $s_0$, IDW assigns weights to all known observations and then produces an interpolated value as their weighted average. The weight of a known observation is related to the inverse of its Euclidean distance to the location $s_0$ [3]. This is described as follows [12]:

$$\hat{Z}(s_0) = \frac{\sum_{i=1}^{n} w(s_i)\, Z(s_i)}{\sum_{i=1}^{n} w(s_i)}, \tag{3.1}$$

where $\hat{Z}(s_0)$ is the interpolated value at location $s_0$, $Z(s_i)$ is the known observation at location $s_i$, $w(s_i)$ is the weight returned from the weight function at location $s_i$, and $n$ is the number of observations in the data.

The weight function is described in the following way:

$$w(s_i) = \lVert s_i - s_0 \rVert^{-p}, \tag{3.2}$$

where $\lVert \cdot \rVert$ indicates the Euclidean distance and $p$ is an inverse distance weighting power.
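Equations 3.1 and 3.2 translate almost directly into code. The following is a minimal sketch, assuming planar coordinates and observations given as (x, y, value) tuples. Returning the observed value when the target coincides with an observation is an assumption made here to avoid division by zero; the function name `idw` is hypothetical.

```python
import math
from typing import Sequence, Tuple

Observation = Tuple[float, float, float]  # (x, y, value)


def idw(observations: Sequence[Observation], x0: float, y0: float, p: float = 2.0) -> float:
    """Inverse distance weighted estimate at (x0, y0), following Eq. 3.1 and 3.2."""
    numerator = 0.0
    denominator = 0.0
    for x, y, z in observations:
        distance = math.hypot(x - x0, y - y0)
        if distance == 0.0 and p > 0:
            # The weight would be infinite: return the observed value directly
            return z
        weight = distance ** (-p)  # Eq. 3.2
        numerator += weight * z
        denominator += weight
    return numerator / denominator  # Eq. 3.1
```

For two observations at equal distance from the target, the weights are equal and the estimate is simply their average.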

The power $p$ is the only tunable parameter in IDW and can only assume non-negative values. As $p$ grows large, IDW converges to one-nearest-neighbour interpolation [12]. Conversely, a $p$ value of zero results in global mean interpolation. According to Bivand et al. [12], $p$ most commonly defaults to 2, but an optimal value can be found using cross-validation.
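A search for the power parameter by cross-validation can be sketched as follows. This is a minimal sketch, assuming leave-one-out cross-validation with RMSE as the criterion and a candidate grid from 0 to 5 in steps of 0.1 (the grid described for Chen and Liu [11] in Section 2.3). The observations are invented for illustration and all function names are hypothetical.

```python
import math


def idw(observations, x0, y0, p):
    """Inverse distance weighted estimate at (x0, y0) with power p."""
    numerator = denominator = 0.0
    for x, y, z in observations:
        distance = math.hypot(x - x0, y - y0)
        if distance == 0.0 and p > 0:
            return z  # target coincides with an observation
        weight = distance ** (-p)
        numerator += weight * z
        denominator += weight
    return numerator / denominator


def loocv_rmse(observations, p):
    """Leave-one-out RMSE: predict each observation from all the others."""
    squared_error = 0.0
    for i, (x, y, z) in enumerate(observations):
        rest = observations[:i] + observations[i + 1:]
        squared_error += (idw(rest, x, y, p) - z) ** 2
    return math.sqrt(squared_error / len(observations))


def best_power(observations, candidates):
    """Candidate power with the lowest leave-one-out RMSE."""
    return min(candidates, key=lambda p: loocv_rmse(observations, p))


candidates = [round(0.1 * k, 1) for k in range(51)]  # 0.0, 0.1, ..., 5.0

# Invented observations (x, y, value), for illustration only
observations = [(0, 0, 10.0), (1, 0, 12.0), (0, 1, 11.0),
                (1, 1, 13.0), (2, 2, 20.0), (3, 1, 18.0)]
p_opt = best_power(observations, candidates)
```

The same function also illustrates the two limits: with p = 0 every weight equals one and the estimate is the global mean, while a very large p makes the nearest observation dominate.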

Contrary to kriging, IDW ignores the autocorrelation of observations and could therefore lead to inaccurate predictions if the observation locations are strongly clustered [12].

3.1.2 Thin plate smoothing spline

the energy function in Equation 3.3 [13] where $n$ is the number of data points:

$$E_{tps}(f) = \sum_{i=1}^{n} (y_i - f(x_i))^2. \tag{3.3}$$

An important aspect of TPS is that it seeks an exact interpolation, meaning that $y_i = f(x_i)$ for every data point $(x_i, y_i)$ [13].

Thin plate smoothing spline (TPSS) is a variant of TPS that introduces an additional component in the form of a smoothing parameter for regularization [13]. TPSS is often used for interpolation of spatially correlated surfaces due to its simplicity and its ability to produce accurate predictions [14]. Unlike TPS, a TPSS does not need an exact interpolation for all data points. Instead it seeks the function $f(x)$ that minimizes:

$$E_{tps,smooth}(f) = \sum_{i=1}^{n} (y_i - f(x_i))^2 + \lambda \iint \left[ \left(\frac{\partial^2 f}{\partial x_1^2}\right)^2 + 2\left(\frac{\partial^2 f}{\partial x_1 \partial x_2}\right)^2 + \left(\frac{\partial^2 f}{\partial x_2^2}\right)^2 \right] dx_1\, dx_2, \tag{3.4}$$

for $i \in \{1, \ldots, n\}$, where $n$ is the number of observations, $\lambda$ is a fixed smoothing parameter and $f$ is an unknown deterministic function.

The solution to this minimization problem is the function that describes the TPSS [14]. For the two-dimensional case, we can assume the model described in Equation 3.5, where $y$ is the response variable, $x$ are covariates, $f$ a smoothing function and $\epsilon$ an error term that is assumed to be independent for different data points:

$$y_i = f(x_{1i}, x_{2i}) + \epsilon_i. \tag{3.5}$$

By tuning the value of $\lambda$ it is possible to adjust the balance between local and global accuracy of the TPSS [3]. That is, $\lambda$ regulates the trade-off between closeness of fit to the data and smoothness of $f$ [15]. The parameter $\lambda$ is always a non-negative scalar. When it is set to zero, the result is interpolation without any smoothing. Conversely, as it tends to infinity, the result is a plane which is the least-squares fit of the data [16]. This parameter must be chosen a priori. There exist several methods for choosing the optimal value of the parameter [16]. Generally, the proposed method is to find the value that minimizes the generalized cross-validation (GCV) [3, 16, 17].
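The effect of the smoothing parameter can be illustrated with a small dense-matrix fit. This is a minimal sketch under stated assumptions: it uses the common two-dimensional thin plate kernel $r^2 \log r$ with an affine term, adds the smoothing value to the diagonal of the kernel matrix, and solves the system by plain Gaussian elimination. The `smoothing` argument is proportional to, but not necessarily equal to, $\lambda$ in Equation 3.4, since the constant depends on how the kernel is normalized. All names are hypothetical.

```python
import math


def solve(a, b):
    """Solve a x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [b[i]] for i, row in enumerate(a)]  # augmented matrix
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            factor = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= factor * m[col][c]
    x = [0.0] * n
    for r in range(n - 1, -1, -1):
        x[r] = (m[r][n] - sum(m[r][c] * x[c] for c in range(r + 1, n))) / m[r][r]
    return x


def tps_kernel(r):
    """Two-dimensional thin plate kernel r^2 log r, defined as 0 at r = 0."""
    return 0.0 if r == 0.0 else r * r * math.log(r)


def fit_tpss(points, values, smoothing=0.0):
    """Fit a thin plate (smoothing) spline and return a predictor f(x, y)."""
    n = len(points)
    a = [[0.0] * (n + 3) for _ in range(n + 3)]
    for i, (xi, yi) in enumerate(points):
        for j, (xj, yj) in enumerate(points):
            a[i][j] = tps_kernel(math.hypot(xi - xj, yi - yj))
        a[i][i] += smoothing  # smoothing = 0 gives exact interpolation
        for k, term in enumerate((1.0, xi, yi)):  # affine part of the spline
            a[i][n + k] = term
            a[n + k][i] = term
    coef = solve(a, list(values) + [0.0, 0.0, 0.0])
    w, (c0, c1, c2) = coef[:n], coef[n:]

    def predict(x, y):
        radial = sum(wj * tps_kernel(math.hypot(x - xj, y - yj))
                     for wj, (xj, yj) in zip(w, points))
        return c0 + c1 * x + c2 * y + radial

    return predict
```

With `smoothing=0.0` the fitted surface passes through every data point, and data that already lie on a plane are reproduced for any smoothing value, consistent with the two limits described above.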

$O(n^3)$ operations where $n$ is the number of observations [14]. There exist other faster but more complex methods for this computation which could be more sensible for larger data sets [14].

3.1.3 Kriging

Kriging is a form of linear regression originally developed as a way to estimate the distribution of minerals based on samples of already found minerals. Since it is a general method of statistical interpolation, it has been applied to various areas including hydrogeology, residence price prediction, etc. [3, 4, 17, 2]. Kriging differs from linear regression in the sense that residuals are kept, with the assumption that they are spatially correlated [3]. This is a key assumption, as kriging is often used for problems where variables are believed to be spatially correlated.

When discussing kriging, it is imperative to know that it is a generic term that includes several types of sub-methods. Depending on which assumptions and knowledge about the data are available, different types of kriging apply. Before the types of kriging can be explained it is important to note that autocorrelation is always assumed. Further, autocorrelation in this case does not depend on the actual location $s$ but only on the distance $h$ between two locations [3]. Hence, $s_1$ and $s_1 + h$ are autocorrelated in the same manner as $s_2$ and $s_2 + h$.

To simplify, assume we have an expression to determine a variable of interest (e.g., residence price at location $s$):

$$Z(s) = \mu(s) + \epsilon(s),$$

where $Z$ is the variable of interest, $\mu$ is the trend factor and $\epsilon$ is the autocorrelated residual. The $s$ term indicates a location that can be given by, e.g., a pair of coordinates.

The trend factor $\mu$ is seen as a deterministic function of location that describes how the variable of interest changes with location. This factor can be set to a constant, $\mu(s) = m$. For cases when $m$ is unknown we arrive at ordinary kriging (OK). In rare cases where $m$ can be assumed to have a known value we have simple kriging (SK) [3].

covariance function is defined as $C(h) = C(0) - \gamma(h)$, where $\gamma(h)$ is the variogram function for distance $h$ [3]. The term variogram is used interchangeably with semi-variogram and this report will henceforth use the latter [2]. The semi-variogram can be found by computing the sample semi-variogram together with the input data set [3]. The sample semi-variogram is given by:

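$$\hat{\gamma}(h) = \frac{1}{2n} \sum_{i=1}^{n} \left( \hat{e}(x_i) - \hat{e}(x_i + h) \right)^2$$

where $\hat{\gamma}(h)$ is the estimated semi-variance for distance $h$, $\hat{e}(x_i)$ and $\hat{e}(x_i + h)$ are estimated residuals for those locations and $n$ is the number of observation pairs separated by distance $h$.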

The semi-variogram can then be used to compute the semi-variance for different values of distances, $h$, and plotted graphically. An example is shown in Figure 3.1 [12]. As it would be highly cumbersome to plot all different pair distances in the data set, a common solution is to group the distances into lag bins. For example, one could define one lag bin as distances greater than 100 meters but less than 200 meters.

Figure 3.1: Semi-variance (circles) and fitted model (line). Plot title: Experimental variogram and fitted variogram model; x-axis: Distance; y-axis: Semi-variance. Fitted model: Mat, nugget 186742185241, sill 971977523085, range 152, kappa 0.7.
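The grouping of pair distances into lag bins can be sketched as follows. This is a minimal sketch, assuming planar coordinates, equally wide bins and a constant trend, so that raw values are used in place of the estimated residuals. Each bin is normalized by the number of pairs that fall into it. The function name is hypothetical.

```python
import math
from collections import defaultdict


def empirical_semivariogram(observations, bin_width):
    """Sample semi-variance per lag bin.

    observations: list of (x, y, value) tuples.
    Returns {bin_index: (semi_variance, pair_count)}, where bin k covers
    distances in [k * bin_width, (k + 1) * bin_width).
    """
    squared_sums = defaultdict(float)
    pair_counts = defaultdict(int)
    for i in range(len(observations)):
        xi, yi, zi = observations[i]
        for j in range(i + 1, len(observations)):
            xj, yj, zj = observations[j]
            h = math.hypot(xi - xj, yi - yj)
            k = int(h // bin_width)
            squared_sums[k] += (zi - zj) ** 2
            pair_counts[k] += 1
    return {k: (squared_sums[k] / (2 * pair_counts[k]), pair_counts[k])
            for k in sorted(pair_counts)}
```

Returning the pair count alongside each semi-variance is useful because bins with few pairs give unreliable estimates.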

For shorter distances the semi-variance is lower, and it increases with distance. This is as expected: the closer two locations are, the more similar their values should be. However, the semi-variance reaches a point where further distance no longer increases it. This value is known as the sill [3]. Besides the sill, there are two more terms, nugget and range, that together define a semi-variogram. The initial semi-variance where the fitted model crosses the y-axis is called the nugget. This value represents the minimum variation between two locations. As the semi-variance always assumes a positive scalar value, it is only defined for distances greater than zero. The range is the distance where the model reaches the sill. Since some models only approach the sill asymptotically, the range is typically taken as the distance where the model has reached 95% of the asymptotic value of the sill.
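How nugget, sill and range shape a model can be illustrated with the exponential model, one of the common model types. This is a minimal sketch, assuming the parameterization in which the exponent is scaled by 3 so that the curve covers about 95% of the rise from nugget to sill at the given range. The function name and parameter names are hypothetical.

```python
import math


def exponential_model(h, nugget, sill, practical_range):
    """Exponential semi-variogram model.

    The value is 0 at h = 0 by definition, jumps to the nugget for h just
    above 0, and approaches the sill asymptotically as h grows.
    """
    if h <= 0:
        return 0.0
    partial_sill = sill - nugget
    # The factor 3 makes the curve reach about 95% of the partial sill
    # at h = practical_range, since 1 - exp(-3) is roughly 0.95
    return nugget + partial_sill * (1.0 - math.exp(-3.0 * h / practical_range))
```

With a nugget of zero, the model value at the range equals about 95% of the sill, matching the convention described above.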

The result of kriging depends greatly on how well the model fits the semi-variogram. There exist various model types, but most practical studies use the exponential, spherical, Gaussian, Matérn or power models [12]. When a model is set, interpolation can be done. To predict the value at an unknown location we take a weighted sum of all the values in the data set, with weights $\lambda_i$ based on the semi-variance. It is important to note that these weights do not depend solely on the distance but also on the model [2]. To make sure that the estimator is unbiased, Equation 3.6 must hold:

$$\sum_{i=1}^{n} \lambda_i = 1. \tag{3.6}$$
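The constraint in Equation 3.6 is commonly enforced by adding a Lagrange multiplier to the linear system that determines the weights. The following is a minimal sketch of ordinary kriging at a single location under stated assumptions: planar coordinates, a semi-variogram model `gamma` supplied by the caller, and a dense Gaussian elimination solver. The exponential model with zero nugget, unit sill and range 2 is an arbitrary choice for illustration, and all names are hypothetical.

```python
import math


def solve(a, b):
    """Solve a x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [b[i]] for i, row in enumerate(a)]  # augmented matrix
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            factor = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= factor * m[col][c]
    x = [0.0] * n
    for r in range(n - 1, -1, -1):
        x[r] = (m[r][n] - sum(m[r][c] * x[c] for c in range(r + 1, n))) / m[r][r]
    return x


def ordinary_kriging(observations, x0, y0, gamma):
    """Return (prediction, weights) at (x0, y0) for observations (x, y, value)."""
    n = len(observations)
    a = [[0.0] * (n + 1) for _ in range(n + 1)]
    b = [0.0] * (n + 1)
    for i, (xi, yi, _) in enumerate(observations):
        for j, (xj, yj, _) in enumerate(observations):
            a[i][j] = gamma(math.hypot(xi - xj, yi - yj))
        a[i][n] = 1.0  # column for the Lagrange multiplier
        a[n][i] = 1.0  # row enforcing that the weights sum to one (Eq. 3.6)
        b[i] = gamma(math.hypot(xi - x0, yi - y0))
    b[n] = 1.0
    weights = solve(a, b)[:n]  # the last unknown is the Lagrange multiplier
    prediction = sum(w * z for w, (_, _, z) in zip(weights, observations))
    return prediction, weights


def gamma(h):
    """Hypothetical exponential model: nugget 0, sill 1, practical range 2."""
    return 0.0 if h == 0 else 1.0 - math.exp(-3.0 * h / 2.0)


obs = [(0.0, 0.0, 10.0), (1.0, 0.0, 12.0), (0.0, 1.0, 11.0), (2.0, 2.0, 20.0)]
prediction, weights = ordinary_kriging(obs, 0.5, 0.5, gamma)
```

With a zero nugget, predicting at the location of an observation returns that observation exactly, since all weight is then placed on it.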

Lastly, if auxiliary information is available for all cells in a grid and assumed to have a correlation with the variable of interest, it is possible to use kriging with external drift (KED) [18]. Typical auxiliary information is for example distance to water and altitude, which are both metrics possible to calculate for all cells in any grid [3]. KED is identical to ordinary kriging with the exception that the covariance matrix must be extended to include the value of the auxiliary variable [19].

4 Methodology

This chapter explains the methodology used in this study. The first section provides an overview of the methodology. The second section describes the data collection and data preprocessing processes. The third section describes the parameter selection for the interpolation implementations and how the parameters for each interpolation method are set. The fourth section describes a quantitative metric and a qualitative method for evaluation of the interpolation methods. The fifth and final section describes the finally chosen interpolation methods.


4.1 Overview

Figure 4.1: Flow chart of methodology


4.2 Data collection and preprocessing

The data needed for the methods comes primarily from three sources. The first, and arguably the most important, is the residence transaction data and the polygon data of residential areas, both of which can be retrieved from the Booli API. To be able to compare residence transactions from different time periods, a residence price index is also needed, which is retrieved from Valueguard. Depending on the residence type and the location, different indices from Valueguard have been used. The method for choosing the correct index is described in Section 4.2.2. The third source is the coordinates of subway and train station entrances, which can be retrieved graphically from Google Maps.

4.2.1 Booli data

The main data of interest from Booli is the polygon data that is used as a grid for the interpolation and the residence transaction data. The residence transaction data contains various fields, but for this report only a subset of these were of interest. The fields are described in Table 4.1.

Table 4.1: Residence transaction data

| Field | Value |
|---|---|
| soldPrice | 3950000 (SEK) |
| soldDate | 2015-03-12 20:57:21 |
| objectType | Apartment |
| livingArea | 52 (m²) |
| latitude | 59.31149 |
| longitude | 18.07435 |
| streetAddress | Götgatan 114 |
| namedAreas | Södermalm |
| areaId | 115341 (Booli ID-system) |

The raw data shown in Table 4.1 depicts one observation in the data set. For each residential area, all residence transactions belonging to that area were retrieved. This data was then preprocessed. Depending on the residential area, different objectType values were of interest. Residential areas often contain clusters of the same objectType, but in some cases there are exceptions. The city district Södermalm in Stockholm, for example, is heavily dominated by apartments, yet it contains a few houses. It is therefore important that such anomalies are filtered out, as they would otherwise affect the results and the interpolated maps.

Depending on the residential area, different variations of the price predictor were used. For this thesis, the residential areas were divided into two categories depending on the type of residences. For residential areas that largely consist of apartments, the price predictor is the soldPrice divided by the livingArea, because the price of an apartment is heavily correlated with the size of its living area. For residential areas dominated by villas, townhouses, etc., the price predictor is the soldPrice. Note that this thesis only investigates these two categories: the residence transaction data for a residential area contains either only apartments or only villas, townhouses, etc.
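The choice of price predictor can be sketched as follows. The function name and the category label "apartments" are hypothetical and only serve the illustration; the field names and the example values are those of Table 4.1.

```python
def price_predictor(transaction, area_category):
    """Return the value to interpolate for one residence transaction.

    area_category is a hypothetical label: "apartments" for areas dominated
    by apartments, anything else for areas of villas, townhouses, etc.
    """
    if area_category == "apartments":
        # Price per square metre of living area.
        return transaction["soldPrice"] / transaction["livingArea"]
    return transaction["soldPrice"]


# The observation from Table 4.1.
sale = {"soldPrice": 3950000, "livingArea": 52, "objectType": "Apartment"}

print(round(price_predictor(sale, "apartments")))  # price per square metre
print(price_predictor(sale, "houses"))             # raw sold price
```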

As the transaction date affects the residence price, the soldPrice and soldDate fields were used as inputs to calculate an index-adjusted residence price, making it possible to compare residence transactions from different time periods. The Valueguard HOX indices are available from January 2005 with a start index value of 100. For each subsequent month up to March 2017, Valueguard has calculated a new index value. To adjust historic residence prices, the most recent index value was divided by the index value for the month of the historic transaction, which gives the price change factor. The adjusted residence price is then the historic residence price multiplied by the price change factor, as given in Equation 4.1:

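$$\mathrm{Price}_{\mathrm{adjusted}} = \frac{\mathrm{Index}_{\mathrm{current}}}{\mathrm{Index}_{\mathrm{historic}}} \cdot \mathrm{Price}_{\mathrm{historic}}. \tag{4.1}$$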
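A minimal sketch of this adjustment is given below. The index values are hypothetical placeholders and not actual HOX values, and the function name and the month-keyed dictionary are assumptions made for the example.

```python
def adjust_price(sold_price, sold_date, index_by_month, current_month):
    """Scale a historic price to the level of current_month (Equation 4.1).

    sold_date is a string such as "2015-03-12 20:57:21"; its first seven
    characters ("2015-03") select the index value of the sale month.
    """
    index_historic = index_by_month[sold_date[:7]]
    index_current = index_by_month[current_month]
    price_change_factor = index_current / index_historic
    return price_change_factor * sold_price


# Hypothetical index values; the real HOX series is not reproduced here.
hox = {"2015-03": 180.0, "2017-03": 225.0}

adjusted = adjust_price(3950000, "2015-03-12 20:57:21", hox, "2017-03")
print(adjusted)
```

With these placeholder values the price change factor is 1.25, so the historic price is scaled up by 25%.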


street addresses that contain multiple residences.

As this thesis project involves spatial data given by geographical coordinates, the programming language R was used, as it offers a large amount of readily available functionality for handling this type of data. The coordinates in Table 4.1 are given in the WGS84 reference system, which is one of the default systems and is widely used in, for example, GPS devices. The WGS84 reference system is a global system which, while commonly used, has flaws [12]. As one moves north from the equator, the inaccuracy increases. For Sweden (the geographical area of this study) there exists an official alternative to WGS84 called SWEREF99 (Swedish Reference Frame 1999) [20]. SWEREF99 was created by the Swedish authority Lantmäteriet and is a reference system covering Sweden. It can reduce the positional inaccuracy by up to 10 meters compared to WGS84 for areas within Sweden [20]. Since SWEREF99 offers a more accurate reference system than WGS84 and this thesis project only considers areas within Sweden, all coordinates used were transformed to conform to SWEREF99. This transformation was done using the CRS() and proj4string() methods available from the sp library in R.

As with the coordinates of the residences, the coordinates of the polygons also needed to be transformed to conform to SWEREF99. The raw polygon obtained from Booli was divided into a grid so that interpolation could be done for each cell in the grid. The sp library in R offers methods for dividing irregular spatial objects, such as polygons, into grids of n equally sized cells. Grid sizes were chosen according to the level of detail needed for each polygon: the larger the polygon, the more cells were needed to capture the residence price differences, and the smaller the polygon, the fewer. The size of the polygon is therefore the main factor when deciding the value of n.
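The idea of dividing an irregular polygon into equally sized cells can be illustrated with the sketch below. It is not the sp implementation used in this study: it is a plain Python approximation that takes a cell size instead of a cell count n, and the polygon and function names are hypothetical.

```python
import math


def point_in_polygon(x, y, polygon):
    """Ray-casting test: is (x, y) inside the polygon (list of vertices)?"""
    inside = False
    n = len(polygon)
    for i in range(n):
        x1, y1 = polygon[i]
        x2, y2 = polygon[(i + 1) % n]
        # Only edges that straddle the horizontal line through y can be crossed.
        if (y1 > y) != (y2 > y):
            x_cross = x1 + (y - y1) * (x2 - x1) / (y2 - y1)
            if x < x_cross:
                inside = not inside
    return inside


def grid_cell_centres(polygon, cell_size):
    """Centres of equally sized square cells whose centre lies in the polygon."""
    xs = [vx for vx, _ in polygon]
    ys = [vy for _, vy in polygon]
    n_cols = math.ceil((max(xs) - min(xs)) / cell_size)
    n_rows = math.ceil((max(ys) - min(ys)) / cell_size)
    centres = []
    for row in range(n_rows):
        cy = min(ys) + (row + 0.5) * cell_size
        for col in range(n_cols):
            cx = min(xs) + (col + 0.5) * cell_size
            if point_in_polygon(cx, cy, polygon):
                centres.append((cx, cy))
    return centres


# Hypothetical L-shaped area in projected coordinates (metres).
area = [(0, 0), (200, 0), (200, 100), (100, 100), (100, 200), (0, 200)]
cells = grid_cell_centres(area, cell_size=10)
print(len(cells))
```

Halving the cell size roughly quadruples the number of cells, which is why larger polygons or finer detail quickly lead to large values of n.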

4.2.2 Valueguard HOX indices

The Valueguard HOX indices are available for free and can be obtained in CSV format [7]. As there are different indices that cover different geographical areas in Sweden with varying specificity, the index for each residential area was chosen manually by picking the narrowest and most specific index (see the list of indices in Section 2.2.2) that is applicable to that area. In other words, the mid-sized city index would be chosen for the residential area Södertälje in favor of the Sweden index, as Södertälje is first and foremost a mid-sized city but at the same time a part of Sweden.

4.2.3 Google Maps public transportation data

According to Chica-Olmo [2], expensive residential areas tend to be close to public transportation. It was therefore of interest to see how including this in the interpolation methods affected the predictions. For this project, the Euclidean distance from each cell in the gridded polygon to its nearest transportation location was computed. It is important to mention that the Euclidean distance might not be optimal, as it usually differs from the real-life walking distance. All public transportation stations were given equal weight in the interpolation methods regardless of their size and popularity.

The spatial data of public transportation locations was retrieved from the Google Maps web application. For this thesis project only subway and train stations were considered. For a given residential area, the public transportation locations within the area were included as data. As a given subway or train station typically has multiple entrances that can be located far from each other, all the entrances of each subway and train station were included. This spatial data was also transformed to SWEREF99, as Google Maps coordinates are given in WGS84.
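The distance computation described in this section can be sketched as below. The coordinates are hypothetical and are assumed to be projected coordinates in metres, so that straight-line distances are meaningful; the function name is introduced for the example.

```python
import math


def distance_to_nearest_station(cell, station_entrances):
    """Euclidean distance from a grid cell centre to its closest entrance.

    Coordinates are assumed to be projected (metres), e.g. after the
    transformation to SWEREF99, so that straight-line distances make sense.
    """
    cx, cy = cell
    return min(math.hypot(cx - sx, cy - sy) for sx, sy in station_entrances)


# Hypothetical cell centres and station entrances, in metres.
cells = [(0.0, 0.0), (400.0, 300.0)]
entrances = [(300.0, 400.0), (450.0, 300.0)]

dist = [distance_to_nearest_station(c, entrances) for c in cells]
print(dist)
```

Because every entrance is treated as a separate point, a station with several entrances shortens the distance for all cells near any one of them, and no entrance is weighted more than another.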

4.2.4 Outliers

assumes that the variable follows a normal distribution [21]. As residence prices are assumed to follow a log-normal distribution [22], the data was first log-transformed, so that it approximately follows a normal distribution, before box plots and histograms were used.
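As an illustration of this step, the sketch below log-transforms the prices and flags values outside the whiskers of a box plot. The 1.5 × IQR whisker rule is an assumption (it is the conventional box plot rule); the threshold actually used in this study is not stated here, and the prices are hypothetical.

```python
import math
import statistics


def log_iqr_outliers(prices, k=1.5):
    """Flag prices whose logarithm lies outside the box plot whiskers.

    The whiskers are assumed to sit k * IQR below the first quartile and
    above the third quartile of the log-transformed prices.
    """
    logs = [math.log(p) for p in prices]
    q1, _, q3 = statistics.quantiles(logs, n=4)
    iqr = q3 - q1
    lower, upper = q1 - k * iqr, q3 + k * iqr
    return [p for p, lp in zip(prices, logs) if lp < lower or lp > upper]


# Hypothetical sold prices in SEK; the last one is deliberately extreme.
prices = [2.0e6, 2.2e6, 2.5e6, 2.8e6, 3.0e6, 3.3e6, 3.6e6, 4.0e6, 60.0e6]
print(log_iqr_outliers(prices))
```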

4.3 Parameter selection

The full implementation of the interpolation methods and statistical analysis was done in the R programming language. The interpolation implementation was heavily based on functions available from the following R libraries: automap, fields and gstat.

The interpolated results and maps were then exported and graphically visualised for evaluation purposes in QGIS, a free and open source geographical information system.

4.3.1 Parameter for Inverse distance weighting

Inverse distance weighting has only one tunable parameter, p, which determines how much impact nearby observations have. The optimal value of p is the one that minimizes the root-mean-squared error (RMSE), and it can be found using cross-validation. Figure 4.2 shows an example of finding the optimal value of p for a specific data set.

Figure 4.2: Root-mean-squared error for different p-values

Figure 4.2 shows the calculated RMSE for p-values ranging from p = 1 to p = 7 in steps of 0.2. For this data set there is an evident minimum at p = 2. This is consistent with Bivand et al. [12], who state that p = 2 is the default value for IDW.
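The search for p can be sketched as follows. This is a simplified stand-in and not the R implementation used in this study: it uses leave-one-out cross-validation for brevity, the observations are hypothetical, and all function names are introduced for the example.

```python
import math


def idw_predict(x, y, observations, p):
    """Inverse distance weighted estimate at (x, y) with power p."""
    weighted_sum = 0.0
    weight_total = 0.0
    for ox, oy, value in observations:
        distance = math.hypot(x - ox, y - oy)
        if distance == 0.0:
            return value  # exact hit on an observation
        weight = 1.0 / distance ** p
        weighted_sum += weight * value
        weight_total += weight
    return weighted_sum / weight_total


def loo_rmse(observations, p):
    """Leave-one-out RMSE: predict each observation from all the others."""
    squared_errors = []
    for i, (ox, oy, value) in enumerate(observations):
        others = observations[:i] + observations[i + 1:]
        squared_errors.append((idw_predict(ox, oy, others, p) - value) ** 2)
    return math.sqrt(sum(squared_errors) / len(squared_errors))


def best_power(observations, p_min=1.0, p_max=7.0, step=0.2):
    """Scan p from p_min to p_max and return the value with the lowest RMSE."""
    n_steps = round((p_max - p_min) / step)
    candidates = [round(p_min + i * step, 1) for i in range(n_steps + 1)]
    return min(candidates, key=lambda p: loo_rmse(observations, p))


# Hypothetical observations: (x, y, price per square metre).
sample = [
    (0.0, 0.0, 50000.0), (100.0, 0.0, 52000.0), (0.0, 100.0, 55000.0),
    (100.0, 100.0, 61000.0), (50.0, 50.0, 54000.0), (150.0, 50.0, 60000.0),
    (50.0, 150.0, 59000.0),
]
print(best_power(sample))
```

A larger p concentrates the weight on the closest observations, while a smaller p spreads it more evenly, so the scan trades local detail against smoothness.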

4.3.2 Parameter for Thin plate smoothing spline

Thin plate smoothing spline has only one tunable parameter, λ, which affects the local and global accuracy of the resulting interpolated maps [3]. The most commonly used way of choosing a value that balances local and global accuracy is generalized cross-validation (GCV), so the smoothing parameter for each data set is the value that minimizes GCV [3, 16, 17]. The λ parameter was computed using the Tps function in the fields library in R by setting the GCV parameter.

4.3.3 Model selection for Kriging

model that best fits the semi-variance found at different distances. The model can be fitted either by visual judgement or with a fitting algorithm [3]. This report uses a fitting algorithm from the automap package in R, which operates by choosing the model that minimizes the least-squares fitting error between the model and the semi-variogram.

4.3.4 Grid size of polygon

The grid size is the number of equally sized cells that fit inside a polygon. Depending on how large a polygon is, different grid sizes are suitable. By adjusting the grid size one adjusts the level of detail in the residential area that the interpolation methods are able to capture. Finding a sufficient grid size for each residential area is therefore very important. In this thesis, the grid size of each polygon was chosen as the minimum number for which each street address lies in a separate cell, which was determined through an iterative trial-and-error process.

4.4 Evaluation

The interpolation methods presented were evaluated for different sets of residential areas of different characteristics in terms of size, residence type and population. The quantitative metric used for comparing the relative performance of the interpolation methods was their corresponding root-mean-squared error.

To reduce the influence of data set size and to limit overfitting, k-fold cross-validation was used. Kohavi [23] investigated several cross-validation methods and concluded that k = 10 was optimal.

4.4.1 K-fold cross validation

allow all data to be used for training and testing purposes. By calculating the average value of the performance metric over all iterations, a fairer estimate can be found.
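The way k-fold cross-validation lets every observation serve for both training and testing can be sketched as below. The shuffling seed and the function name are assumptions made for the example; the data set size matches the 1155 transactions reported for residential area #1.

```python
import random


def k_fold_splits(n_observations, k=10, seed=0):
    """Yield (train_indices, test_indices) for each of the k folds."""
    indices = list(range(n_observations))
    random.Random(seed).shuffle(indices)
    # Deal the shuffled indices into k folds of (almost) equal size.
    folds = [indices[i::k] for i in range(k)]
    for i, test_indices in enumerate(folds):
        train_indices = [
            idx for j, fold in enumerate(folds) if j != i for idx in fold
        ]
        yield train_indices, test_indices


splits = list(k_fold_splits(1155, k=10))
print(len(splits), len(splits[0][0]), len(splits[0][1]))
```

Each observation appears in exactly one test fold and in the training set of the other nine iterations, and the performance metric is then averaged over the ten iterations.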

4.4.2 Root-mean-squared error

To determine the performance of the interpolation methods quantitatively, the root-mean-squared error (RMSE) was used. The expression for the RMSE is [3]:

$$\mathrm{RMSE} = \sqrt{\frac{1}{n}\sum_{i=1}^{n}\left(\hat{Z}_{cv,i} - Z_i\right)^2}, \tag{4.2}$$

where $n$ is the number of observations, $\hat{Z}_{cv,i}$ is the cross-validation estimate for observation $i$, and $Z_i$ is the true value for observation $i$.

As the data for each residence market varied in availability and size, cross-validation was used to mitigate the risk of overfitting, since it allows the whole data set to be used for both training and testing.
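Equation 4.2 translates directly into code. The sketch below is a plain Python version with hypothetical input values; it is not the R code used in this study.

```python
import math


def rmse(estimates, true_values):
    """Root-mean-squared error between cross-validation estimates and truth."""
    if len(estimates) != len(true_values) or not true_values:
        raise ValueError("inputs must be non-empty and of equal length")
    squared_errors = [(z_hat - z) ** 2 for z_hat, z in zip(estimates, true_values)]
    return math.sqrt(sum(squared_errors) / len(squared_errors))


# Hypothetical estimates and true prices in MSEK.
print(rmse([3.5, 4.1, 2.9], [3.8, 4.0, 3.3]))
```

Because the errors are squared before averaging, a few large misses raise the RMSE more than many small ones.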

4.4.3 Visual evaluation


4.5 Chosen interpolation models

Four variations of interpolation methods were used for the residential areas. The variations are described in Table 4.2.

Table 4.2: Chosen interpolation models

| Interpolation method | Description |
|---|---|
| idw~price | Inverse distance weighting with price as predictor |
| kriging~price | Kriging with price as predictor |
| kriging~price+dist | Kriging with price as predictor and distance to nearest transportation station as external predictor |
| tps~price | Thin plate smoothing spline with price as predictor |


Results

This chapter presents the results obtained from the interpolation methods described in Section 4.5 and their corresponding evaluations as described in Section 4.4 for three different residential areas. The same color scale is used for all heat maps in this chapter. The color scale runs from red through yellow to blue, in decreasing order of value.

5.1 Residential area #1

The first residential area is the suburban industrial city Södertälje, Sweden, which mainly contains clusters of houses. The total area of Södertälje is 26 km². Expert judgement and fine-tuning were used to find a grid size able to capture geographical differences at street address level. The grid size is the number of equally sized cells that fit inside the polygon. A grid size of n = 20000 was found to be sufficient for this residential area. The number of residence transactions in the data set was 1155. The price distribution of these is shown in Figure 5.1, where the blue vertical line shows the mean price. The mean sold price in the data set was 3.8 MSEK. The Valueguard HOX mid-sized city index was chosen for this area.

Figure 5.1: Distribution of observations for Södertälje (histogram; x-axis: sold price, y-axis: frequency)

in 850,000, which is slightly better than or equal to the results of the two kriging models. Lastly we have the map of TPS with price shown in Figure 5.6. The resulting map is able to detect a high number of regional differences within Södertälje. For areas where there are no observations TPS interpolates a gradient. This behaviour is shown clearly in the southern parts of Södertälje around the edges.

Table 5.1 shows that the RMSE for TPS with price as predictor was 900,000. Overall the largest difference in RMSE was between TPS and IDW. IDW performed 5.6% better compared to TPS.
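The percentage quoted above follows from the two RMSE values stated in the text for TPS and IDW. The short sketch below reproduces the arithmetic; the function name is introduced for the example.

```python
def relative_improvement(rmse_reference, rmse_other):
    """Fractional reduction in RMSE of rmse_other relative to rmse_reference."""
    return (rmse_reference - rmse_other) / rmse_reference


rmse_tps = 900000  # TPS with price as predictor, residential area #1
rmse_idw = 850000  # IDW with price as predictor, residential area #1

print(f"{100 * relative_improvement(rmse_tps, rmse_idw):.1f}%")
```

The same calculation applies to the comparisons reported for the other residential areas, although small deviations can arise there because the reported RMSE values are rounded.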

Table 5.1: RMSE for residential area #1

| tps~price | kriging~price | kriging~price+dist | idw~price |
|---|---|---|---|
| 900,000 | 860,000 | 850,000 | 850,000 |

Figure 5.3: Interpolated map of Södertälje using Kriging with price


Figure 5.5: Interpolated map of Södertälje using IDW with price


5.2 Residential area #2

The second residential area is the city of Eskilstuna, Sweden, which mainly contains clusters of houses. The total area of Eskilstuna is 30 km². Expert judgement and fine-tuning were once again used to find a grid size able to capture geographical differences at street address level. As the area of Eskilstuna is nearly the same as that of Södertälje, a grid size of n = 20000 was used for Eskilstuna as well. The number of residence transactions in the data set was 1523. The price distribution of these is shown in Figure 5.7, where the blue vertical line shows the mean price. The mean sold price in the data set was 2.9 MSEK. The Valueguard HOX mid-sized city index was chosen for this area.

Figure 5.7: Distribution of observations for Eskilstuna (histogram; x-axis: sold price, y-axis: frequency)

that the houses are rather clustered. The areas where there are no observations are either apartment areas or have no residences at all. Figures 5.9 and 5.10 show kriging with price and kriging with price and distance. Visually, the interpolated maps perform similarly. According to experts, both kriging models manage to capture the price differences that are present in Eskilstuna. As in residential area #1 in Section 5.1, kriging with price and distance once again manages to interpolate more unique prices for areas where there are no observations. In Table 5.2 we see that including distance as a predictor reduced the overall RMSE from 760,000 to 750,000, a decrease of 10,000 or 1.3%. IDW (shown in Figure 5.11) once again resulted in a map that is visually very general and less smooth compared to the two kriging models. In Table 5.2 we see that IDW with price as predictor resulted in 770,000, which is slightly higher than the two kriging models. Lastly we have the map of TPS with price shown in Figure 5.12. The resulting map was able to detect a high number of regional differences within Eskilstuna. As with residential area #1 in Section 5.1, TPS interpolates a gradient for areas where there are no observations. This behaviour is shown for instance in the southern part of Eskilstuna.

Table 5.2 shows that the RMSE for TPS with price as predictor was 790,000. Overall the largest difference in RMSE was between TPS and kriging with price and distance. The latter performed 5.0% better compared to TPS in terms of RMSE.


Figure 5.8: Distribution of observations for Eskilstuna


Figure 5.10: Interpolated map of Eskilstuna using Kriging with price and distance to transport


Figure 5.12: Interpolated map of Eskilstuna using TPS with price

5.3 Residential area #3

The third residential area is the city of Stockholm, Sweden, which contains a mixture of residence types. For this area the chosen residence type was apartments. The total area of Stockholm is 189 km². Due to

Figure 5.13: Distribution of observations for Stockholm (histogram; x-axis: sold price per square metre, y-axis: frequency)

for the two kriging models. Lastly we have the map of TPS with price shown in Figure 5.18. The resulting map was once again able to detect a high number of regional differences within Stockholm. As with the previous residential areas, TPS interpolated a gradient for areas where there are no observations. This is clear in the northern part of Stockholm in Figure 5.18. TPS resulted in an RMSE of 12000, which is the same as IDW but higher than the kriging variations. Overall the largest difference in RMSE was between TPS (or IDW) and kriging with price and distance. The latter performed 18% better in terms of RMSE.

Table 5.3: RMSE for residential area #3

| tps~price | kriging~price | kriging~price+dist | idw~price |
|---|---|---|---|
| 12000 | 10000 | 9900 | 12000 |

Figure 5.15: Interpolated map of Stockholm using Kriging with price


Figure 5.17: Interpolated map of Stockholm using IDW with price


5.4 Expert judgement

Experts on the three residence markets were unanimous in their judgement. They concluded that all interpolation methods generated interesting and accurate heat maps. The maps from the two kriging variations were always preferred, as they managed to capture all major residence price differences. For example, the two kriging variations put more emphasis on the Tuna Park and Borsökna areas in residential area #2 (Section 5.2), which the experts describe as slightly more expensive house areas. TPS was also able to capture these areas but was judged to be too general and too inclusive. The central parts of Eskilstuna (the area around the train station) are for example not very well differentiated and are given an overall high predicted price. IDW did not capture this as well as the other methods but still resulted in a map that displayed relatively good price differences in Eskilstuna. The experts agreed that the IDW map was too general.

The expert judgement for the interpolated maps for residential area #1 in Section 5.1 was similar to Eskilstuna. The same strengths and weaknesses were found for each heat map. The kriging maps were once again the most preferred ones.

For residential area #3 in Section 5.3, the experts believed that the four maps were more alike in terms of accuracy. As with previous residential areas, IDW resulted in a map that was the most general but was still able to point out trend changes between the sub-areas in the residential area.

5.5 Computation time


Table 5.4: Computation time for residential area #1 (Södertälje) measured in ms

| tps~price | kriging~price | kriging~price+dist | idw~price |
|---|---|---|---|
| 6.6 · 10^3 | 3.0 · 10^4 | 2.8 · 10^4 | 2.1 · 10^2 |

Table 5.5: Computation time for residential area #2 (Eskilstuna) measured in ms

| tps~price | kriging~price | kriging~price+dist | idw~price |
|---|---|---|---|
| 1.6 · 10^4 | 1.0 · 10^5 | 1.1 · 10^5 | 2.0 · 10^1 |

Table 5.6: Computation time for residential area #3 (Stockholm) measured in ms

Discussion

This chapter discusses the methodology, the data sets used and the results. The first section discusses the results presented in the previous chapter. The second section discusses the data sets, the methodology used and how they affected the results.

6.1 Quantitative and qualitative results

All three interpolation methods showed promising results. Each method had its own strengths and weaknesses. For kriging, including the second predictor dist resulted in a lower RMSE than using only price as predictor for all three residential areas. The improvement was clear for residential areas #1, Södertälje, and #2, Eskilstuna. For residential area #3, Stockholm, the improvement was small (from 10000 to 9900 according to the values in Section 5.3). This could be due to the increased number of public transportation stations. For residential area #3, 100 stations were included, which is many more than for the first two residential areas. This could lead to issues, as the method used in this report assigns equal weight to all stations regardless of popularity.

Including distance to public transportation stations led to a greater decrease in RMSE for Södertälje (see Section 5.1) than for Eskilstuna (see Section 5.2). The percentage decrease was twice as large. A possible explanation is that citizens of Eskilstuna are probably less likely to use the train as a means of transportation in their daily life for logistical reasons. Citizens of Södertälje, on the other hand, are more likely to rely on public transportation, as Södertälje is a reasonable place to live if you need to commute to, for example, Stockholm. This could explain why the predictor dist had a comparatively larger impact on the RMSE for Södertälje.

Kriging was the best method in terms of accuracy, as both kriging variations resulted in low RMSE values compared to the other methods. On the other hand, kriging was the most computationally demanding method, and the computation time for interpolating a map differed greatly between the interpolation methods. Even though kriging generates the most accurate maps, it might not be a feasible option in some cases. Kriging would not be ideal for larger areas such as the whole of Sweden, since the fitted model would be too general, and depending on the hardware the computation time could be prohibitive. As kriging has been shown to be favourable in terms of accuracy for the tested residential areas, a larger map could still be formed by combining the outputs of several smaller kriging maps.

Quantitatively, IDW with price as predictor performed only slightly poorer than kriging for all residential areas tested. The resulting map was more general, though, in the sense that it was smooth only in the immediate vicinity of observations. For areas far away from observations, IDW interpolates very similar and general prices, which is not ideal as it does not correctly reflect the real residence market prices. On the other hand, IDW was by far the fastest interpolation method for all tested residential areas, surpassing both kriging and TPS by a significant factor. For cases where a fast and reliable interpolation is needed, IDW could be a good compromise.

Lastly, TPS generated maps that were always able to capture the more expensive areas in the tested residential areas. According to experts this was a strength of all TPS-generated maps. Unfortunately TPS was too general within these areas: where an expensive area contained several smaller sub-areas with different prices, TPS was unable to separate them.

6.2 Data sets


This is shown in Figures 5.1, 5.7 and 5.13.

The size of the residence data set will naturally have an effect on the computation time but will also affect the quality of the residence maps. The fewer observations there are, the larger the portion of the map that needs to be predicted, and vice versa. A larger data set also comes with the disadvantage of being more computationally demanding. As Hiemstra and Sluiter [3] mention, the number of observations required depends largely on the distribution of the observations. Having a sufficient geographical spread is particularly important when the residential area exhibits small local details. For the residential areas tested we seem to have sufficient data to capture local details. In Figure 5.2 we see that there are evident clusters of residences. These clusters are, for example, well represented in Figures 5.3 and 5.4. We also see that both kriging variations interpolated very general prices for areas where there are no observations. For these areas TPS is still able to interpolate different prices in the form of a gradient, as seen in Figure 5.6. This supports the findings and theory from [3], which state that kriging is particularly dependent on having sufficient and well-distributed data.


6.3 Sustainability and ethics

The results and discussions in this thesis report do not raise any sustainability or ethical issues. The Swedish residence transaction data (which this thesis relies heavily on) is publicly available on various web pages on the Internet. The residence transaction data used is also not connected to any personal entity and can therefore be considered non-sensitive information. The same conclusion holds for the coordinates of the public transportation stations used in this study as well.


Conclusion

Interpolation methods are viable for generating heat maps showing residence price. The methods tested in this report all showed individual strengths and weaknesses and could be viable choices depending on the purpose. The kriging variations consistently resulted in low root-mean-squared error (RMSE) and generated the most visually accurate maps but required the longest computation time. Including distance to public transportation stations always led to a decrease in RMSE. According to experts, the resulting heat maps were in addition slightly more visually correct than those from kriging with only price as predictor. Even though an interpolation takes longer using kriging than using the other methods, kriging could still be used in situations where an interpolation does not need to be done frequently. In a situation where heat maps need to be interpolated more frequently, inverse distance weighting (IDW) with price as predictor would be preferred over thin plate spline (TPS), as TPS is close to kriging in terms of computation time.

7.1 Future work

This thesis project can in the future be expanded in various ways. One suggestion is to investigate how using a more sophisticated method for calculating distances to various areas of interest affects the results. Calculating the real walking distance to such areas instead of the Euclidean distance could lead to greater decreases in RMSE and also heat maps of even better quality. Including a weight system for the public transportation station depending on the popularity of the

[1] Bostadspriser i Riket – Svensk Mäklarstatistik [Swedish residence broker statistics]. https://www.maklarstatistik.se/omrade/riket/#/villor. Accessed: 2017-05-31.

[2] Jorge Chica-Olmo. Prediction of housing location price by a multivariate spatial method: Cokriging. Journal of Real Estate Research, 29(1):91–114, 2007.

[3] Paul Hiemstra and Raymond Sluiter. Interpolation of Makkink evaporation in the Netherlands. Technical report, TR-327, De Bilt: KNMI, 2011.

[4] William J McCluskey, William G Deddis, Ian G Lamont, and Richard A Borst. The application of surface generated interpolation models for the prediction of residential property values. Journal of Property Investment & Finance, 18(2):162–176, 2000.

[5] Beatriz Larraz and Javier Población. An online real estate valuation model for control risk taking: A spatial approach. Investment Analysts Journal, 2013(78):83–96, 2013.

[6] Robin A Dubin. Predicting house prices using multiple listings data. Journal of Real Estate Finance and Economics, 17(1):35–59, 1998.

[7] Nasdaq OMX Valueguard-KTH Housing Index (HOX). http://www.valueguard.se/stockholmbr. Accessed: 2017-06-02.

[8] Michael F Hutchinson. Interpolating mean rainfall using thin plate smoothing splines. International Journal of Geographical Information Systems, 9(4):385–403, 1995.

[9] Geoffrey M Laslett. Kriging and splines: an empirical comparison of their predictive performance in some applications. Journal of the American Statistical Association, 89(426):391–400, 1994.

[10] Stephen J Jeffrey, John O Carter, Keith B Moodie, and Alan R Beswick. Using spatial interpolation to construct a comprehensive archive of Australian climate data. Environmental Modelling & Software, 16(4):309–330, 2001.

[11] Feng-Wen Chen and Chen-Wuing Liu. Estimation of the spatial rainfall distribution using inverse distance weighting (IDW) in the middle of Taiwan. Paddy and Water Environment, 10(3):209–222, 2012.

[12] Roger Bivand, Edzer Pebesma, and Virgilio Gómez-Rubio. Applied Spatial Data Analysis with R. Springer Science+Business Media, 1st edition, 2008.

[13] David Eberly. Thin-Plate Splines. Geometric Tools Inc, 2002:116, 2002.

[14] Penelope Hancock and Michael F Hutchinson. Spatial interpolation of large climate data sets using bivariate thin plate smoothing splines. Environmental Modelling & Software, 21(12):1684–1694, 2006.

[15] Simon N Wood. Thin plate regression splines. Journal of the Royal Statistical Society: Series B (Statistical Methodology), 65(1):95–114, 2003.

[16] Arpan Ghosh. Efficient thin plate spline interpolation and its application to adaptive optics. Master’s thesis, Johannes Kepler Universität Linz, 2010.

[17] Eric PJ Boer, Kirsten M de Beurs, and A Dewi Hartkamp. Kriging and thin plate splines for mapping climate variables. International Journal of Applied Earth Observation and Geoinformation, 3(2):146–154, 2001.

[18] Tomislav Hengl, Gerard B M Heuvelink, and Alfred Stein. Comparison of kriging with external drift and regression-kriging. Technical note, ITC, 51, 2003.

[20] SWEREF 99 – Lantmäteriet. https://www.lantmateriet.se/Kartor-och-geografisk-information/GPS-och-geodetisk-matning/Referenssystem/Tredimensionella-system/SWEREF-99/. Accessed: 2017-06-02.

[21] Carroll Croarkin, Paul Tobias, and Chelli Zey. Engineering statistics handbook. NIST iTL, 2002.

[22] Takaaki Ohnishi, Takayuki Mizuno, Chihiro Shimizu, and Tsutomu Watanabe. On the evolution of the house price distribution. Technical report, Institute of Economic Research, Hitotsubashi University, 2010.