Evaluation of Calibration Methods to Adjust for Infrequent Values in Data for Machine Learning


Degree Thesis in Microdata Analysis

Level: Master of Science (MSc) in Business Intelligence

Author: Felipe Dutra Calainho

Supervisor: Ilias Thomas

Co-supervisor: Jerker Westin

Examiner: Siril Yella

Subject/main field of study: Microdata Analysis

Course code: MI4001

Credits: 30 ECTS

Date of examination: June 12, 2018

At Dalarna University it is possible to publish the student thesis in full text in DiVA.

The publishing is open access, which means the work will be freely accessible to read and download on the internet. This will significantly increase the dissemination and visibility of the student thesis.

Open access is becoming the standard route for spreading scientific and academic information on the internet. Dalarna University recommends that both researchers as well as students publish their work open access.

I give my/we give our consent for full text publishing (freely accessible on the internet, open access):

Yes ☒ No ☐


Contact: Felipe Dutra Calainho (felipecalainho@gmail.com)


Abstract

The performance of supervised machine learning algorithms is highly dependent on the distribution of the target variable. Infrequent values are more difficult to predict, as there are fewer examples for the algorithm to learn patterns that contain those values.

These infrequent values are a common problem with real data, being the object of interest in many fields such as medical research, finance and economics, just to mention a few.

Problems regarding classification have been comprehensively studied. For regression, on the other hand, few contributions are available. In this work, two ensemble methods from classification are adapted to the regression case. Additionally, an existing oversampling technique, SmoteR, is tested. The aim of this research is therefore to examine the influence of oversampling and ensemble techniques on the accuracy of regression models when predicting infrequent values.

To assess the performance of the proposed techniques, two data sets are used: one concerning house prices and the other concerning patients with Parkinson’s disease. The findings corroborate the usefulness of the techniques for reducing the prediction error of infrequent observations. In the best case, the proposed Random Distribution Subsets Ensemble reduced the overall RMSE by 8.09% and the RMSE for infrequent values by 6.44% when compared with the best performing benchmark for the housing data set.

Key words: Data mining, resampling, ensemble.


Acronyms

AUC Area Under the Curve.

CV Cross-Validation.

ENN Edited Nearest Neighbours.

FIM Discharge Functional Independence Measure.

ICC Intra-Class Correlation.

kNN k-Nearest Neighbours.

LOOCV Leave One Out Cross-Validation.

LR Linear Regression.

MAD Mean Absolute Deviation.

MAE Mean Absolute Error.

MARS Multivariate Adaptive Regression Spline.

ML Machine Learning.

MSE Mean Squared Error.

NRMSE Normalized Root Mean Squared Error.

PCA Principal Component Analysis.

RBF Radial Basis Function.

RDSE Random Distribution Subsets Ensemble.

RF Random Forest.

RMSE Root Mean Squared Error.

RU Random Under-sampling.

Smote Synthetic Minority Over-sampling Technique.

SmoteR Synthetic Minority Over-sampling Technique for Regression.

SVM Support Vector Machine.


SVR Support Vector Regression.

TRS Treatment Response Scale.

UDSE Uniform Distribution Subsets Ensemble.

XGB Extreme Gradient Boosting.


1 Introduction

Society faces day-to-day challenges such as oil spill detection in satellite radar images (Kubat et al. 1998), bio-medical applications (Li et al. 2016), prediction of forest fires (Torgo et al. 2015) and fraud detection (Fawcett & Provost 1997). More recently, these challenges have been mitigated with the aid of Machine Learning (ML). ML algorithms use data to perform tasks that help alleviate these difficulties, e.g. forecasting, analysis, classification and clustering. However, this ever-growing area of science and technology still has much to develop.

The performance of ML algorithms is sometimes limited, and part of the reason lies in the idiosyncrasies of the data. Real data often do not follow a uniform distribution: some values occur more frequently than others. This happens in discrete cases (classification) as well as in continuous cases (regression).

An example for classification is data regarding credit card fraud (Bhattacharyya et al. 2011), where only a small percentage of credit card transactions are fraudulent. In regression, an example is data regarding patients with Parkinson’s disease (Thomas et al. 2017), where Treatment Response Scale ratings below -2 and higher than 1 are infrequent, which causes the predictions to be less accurate for such values.

This work aims to evaluate different approaches to deal with one such peculiarity, the presence of infrequent events, and to propose the best alternative. It also seeks to improve the performance of machine learning algorithms when observations are not evenly distributed among the possible values, by testing and comparing different techniques.

Infrequent cases in data may create a majority bias in classification and a mean bias in regression. These biases make it harder to predict the more extreme or rare events. Several techniques exist for reducing the impact of this bias on classification models (He & Garcia 2009, Galar et al. 2012, Sun et al. 2015, Díez-Pastor et al. 2015), but not many for the regression case (Torgo et al. 2015, Mendes-Moreira et al. 2012). The aim of this project is to examine approaches for dealing with infrequent values in regression problems.

1.1 Proposed Approach

The present work investigates whether resampling, ensembling or hybrid techniques improve the accuracy of regression models with infrequent values. Additionally, it is of interest to verify whether any of the techniques employed performs better than the others across distinct data sets. It is also tested whether the prediction results are statistically different from one another. To avoid any bias related to the machine learning algorithm used, as Torgo et al. (2015) pointed out, three algorithms are used: Support Vector Regression (SVR), Extreme Gradient Boosting (XGB) and Random Forest (RF).

Due to time limitations, the focus is on resampling, ensemble and hybrid techniques only. This work proposes the adaptation of two ensemble techniques to the regression problem, which I name Random Distribution Subsets Ensemble (RDSE) and Uniform Distribution Subsets Ensemble (UDSE). These were originally developed by Sun et al. (2015) and Díez-Pastor et al. (2015), respectively. The resampling techniques focus on under-sampling and the Synthetic Minority Over-sampling Technique for Regression (SmoteR). Other approaches, such as extreme value theory, empirical Bayes and alternative statistical approaches, are not addressed and are of interest for future work.

As a means of assessing the performance of the proposed techniques, the Root Mean Squared Error (RMSE), the RMSE for the tails, the correlation between predicted and actual values, and the Mean Absolute Error (MAE) are used to measure the accuracy of the models.
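As an illustration of these measures, the following Python sketch computes the RMSE, the MAE and the RMSE restricted to tail observations. The function names and the toy values are assumptions made for this example and are not taken from the experiments. The tail rule (TRS below -2 or above 1) follows the Parkinson example given in the Introduction.

```python
import math


def rmse(actual, predicted):
    """Root Mean Squared Error over paired observations."""
    squared = [(a - p) ** 2 for a, p in zip(actual, predicted)]
    return math.sqrt(sum(squared) / len(squared))


def mae(actual, predicted):
    """Mean Absolute Error over paired observations."""
    return sum(abs(a - p) for a, p in zip(actual, predicted)) / len(actual)


def tail_rmse(actual, predicted, is_tail):
    """RMSE computed only on observations whose actual value is in a tail."""
    pairs = [(a, p) for a, p in zip(actual, predicted) if is_tail(a)]
    if not pairs:
        raise ValueError("no tail observations found")
    return rmse([a for a, _ in pairs], [p for _, p in pairs])


def parkinson_tail(y):
    """Hypothetical helper: tail rule for the TRS target."""
    return y > 1 or y < -2


# Toy values: the model is exact in the centre and off by 1 in both tails.
actual = [0.0, -0.5, 0.5, -2.5, 1.5]
predicted = [0.0, -0.5, 0.5, -1.5, 0.5]

overall_rmse = rmse(actual, predicted)
overall_mae = mae(actual, predicted)
tails_rmse = tail_rmse(actual, predicted, parkinson_tail)
print(overall_rmse, overall_mae, tails_rmse)
```

In this toy case the tail RMSE is larger than the overall RMSE, which is the pattern the tail-specific measure is meant to expose.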

To compare and check the effects of the proposed techniques, a Wilcoxon rank sum test and the Bland-Altman method are used.

The rest of this work is structured as follows. In Section 2, former works on both classification and regression problems are discussed. Section 3 describes the data sets used (Section 3.1) and presents the feature selection and sampling techniques, as well as the algorithms and ensemble approaches employed. The results are presented and discussed in Section 4. Finally, Section 5 presents the final discussion and conclusion of this thesis.


2 Literature Review

To present the framework that underpins this work, the concept of imbalanced data in classification is introduced first. Then I discuss the concept of infrequent data in regression. Finally, the techniques that have been employed for dealing with infrequent values in regression problems are presented.

According to Japkowicz (2000), a data set is considered imbalanced when one class is represented by a large number of examples whereas the other is represented by a small number. He & Garcia (2009) argue that, technically, any data set that has an unequal distribution between its classes can be considered imbalanced. However, the authors add that, for a data set to be considered imbalanced, it should exhibit significant between-class imbalance, on the order of 100:1, 1,000:1 or 10,000:1, for example.

The first study, to my knowledge, that has the imbalanced data set problem as its main focus is Japkowicz (2000). In this paper, the author mentions previous works in which imbalanced data was an obstacle to achieving their goals. These studies present the fields in which data imbalance disturbs the performance of certain classifiers. However, they do not discuss whether every imbalance is harmful, or to what degree the different types of imbalance alter classification performance (Japkowicz 2000).

Data imbalance is a characteristic that has already been acknowledged and studied in different areas. Some of the fields where the negative impact of this problem was recognized are: fraud detection (Fawcett & Provost 1997); oil spill detection in satellite radar images (Kubat et al. 1998); direct marketing (Ling & Li 1998); defect prediction in software (Tan et al. 2015); bio-medical applications (Li et al. 2016) and web author identification (Vorobeva 2016), just to mention a few.

It is important to consider that the minority class is usually the one of greater interest from a learning point of view, and misclassifying it is an expensive error (Elkan 2001). This becomes more evident when the aforementioned areas of study are taken into consideration. Consider oil spill detection using radar images as an example. Oil spills are uncommon events; therefore, out of thousands of images only a few will contain oil spills (Kubat et al. 1998). As the purpose of the study is to detect oil spills, the minority class (radar images containing oil spills) is the one of greater interest, and misclassifying it is an expensive error.

Additionally, imbalanced data will often lead to the so-called accuracy paradox of predictive analytics. Borrowing an example from Zhu (2007, p.119), consider an insurance company that received 10,000 insurance claims, of which 9,850 are not fraudulent and the remaining 150 are fraudulent. A model that considers all cases as non-fraudulent will have an accuracy of 98.5%. However, misclassifying frauds will incur very high expenses for the company.

In the case under analysis, a type II error (false negative) is detrimental to the continuity of the business, whereas a type I error (false positive) is mostly a nuisance for the company’s customers. This example demonstrates that high accuracy does not necessarily imply high quality predictions. This is also known as the accuracy paradox.
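The arithmetic of this example can be checked with a few lines of Python. The sketch below assumes the trivial model that labels every claim as non-fraudulent; the variable names are illustrative only.

```python
total_claims = 10_000
fraudulent = 150
legitimate = total_claims - fraudulent  # 9,850 non-fraudulent claims

# Confusion matrix of a model that predicts "non-fraudulent" for every claim.
true_negatives = legitimate
false_negatives = fraudulent  # every fraud is missed (type II errors)
true_positives = 0
false_positives = 0

accuracy = (true_positives + true_negatives) / total_claims
sensitivity = true_positives / (true_positives + false_negatives)

print(f"accuracy    = {accuracy:.1%}")     # 98.5%
print(f"sensitivity = {sensitivity:.1%}")  # 0.0%
```

The accuracy is high only because the majority class dominates the count; the sensitivity shows that not a single fraudulent claim is detected.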

There are several methods to overcome the hurdles of data imbalance in data mining. He & Garcia (2009) separate these methods into three main categories: sampling methods, cost-sensitive methods and kernel-based methods. Galar et al. (2012) add a fourth category, ensemble methods. Several works that use different methods for coping with data imbalance are presented next.

2.1 Discrete Case

Bach et al. (2017) searched for the best method for dealing with imbalanced data regarding osteoporotic patients. Their data set contained 729 patients, of which 675 (92.6%) had no fractures (negative cases). The other 54 (7.41%) patients had reported at least one fracture (positive cases). The authors tried two under-sampling methods, Edited Nearest Neighbours (ENN) and Random Under-sampling (RU), and one over-sampling method called Synthetic Minority Over-sampling Technique (Smote). The combinations of the two under-sampling methods with the over-sampling method, as well as no sampling (raw data), were also tested. The re-sampled data sets were then used with several different classifiers, namely C4.5, k-Nearest Neighbour, Naive Bayes, Bagging, AdaBoost, MultiBoost, RandomSubSpace and Random Forest.

Bach et al. (2017) proposed three experiments: determining the performance of the tested classifiers for all analyzed balancing levels; finding the classification method with the best average performance across all balancing levels; and finding the balancing level that achieves the required classification precision. The study concluded that Random Forest was the overall best classifier for all balancing levels. The best balancing method was Smote, increasing the minority class by 300% and complemented by ENN. The Random Forest with raw data obtained 0 sensitivity, 0 specificity and an Area Under the Curve (AUC) of 0.54. In comparison, the Random Forest with Smote 300% plus ENN obtained 0.95 sensitivity, 0.98 specificity and an AUC of 0.99.

Sun et al.’s (2015) work focuses on a novel ensemble method that splits the original data set into several balanced subsets. These subsets are used to train a number of classifiers, which are later combined by a specific ensemble rule. For balancing the data, two methods are considered, ClusterBal and SplitBal. The authors argue that sampling methods alter the original class distribution of the data, stating that over-sampling might lead to overfitting and under-sampling may neglect some potentially useful information. The authors compared the performance of several classifiers, specifically Naive Bayes, C4.5, RIPPER (Cohen 1995), Random Forest, SMO (Platt 1998) and IBK (Aha et al. 1991).

Sun et al. (2015) also tested the performance of several ensemble rules, namely MaxDistance, MinDistance, ProDistance, MajDistance and SumDistance. The authors performed these tests on 46 highly imbalanced binary data sets from the KEEL data set repository (Alcalá-Fdez et al. 2009, 2011). These 46 imbalanced data sets vary in number of instances, attributes and class imbalance ratios. The results discussed below refer to all of these data sets.

Furthermore, Sun et al. (2015) proposed to investigate six questions: which ensemble rule performs best with ClusterBal; which ensemble rule performs best with SplitBal; which combination of data balancing method and ensemble rule performs best; is their class imbalance data classification method more efficient than the external methods (sampling methods); is their class imbalance data classification method more effective than the internal methods; is the value 1 added to distance in their ensemble rules reasonable.

Sun et al.’s (2015) findings show that the best ensemble rule for both ClusterBal and SplitBal was MaxDistance, with an average AUC of 0.9034 and 0.9044, respectively, across all of the aforementioned classifiers. The comparison between the two balancing methods was performed using the MaxDistance ensemble rule, as it was the best for both balancing methods. ClusterBal had a better AUC average with Naive Bayes and SMO. On the other hand, SplitBal had a better AUC average with C4.5, RIPPER and Random Forest. IBK results from both balancing methods were not statistically different, being considered equivalent.

Finally, Sun et al. (2015) show that the SplitBal balancing method is also better than several under- and over-sampling methods, i.e. RU, random over-sampling, Smote, RU-Boost and SMOTE-Boost. Hence, SplitBal + MaxDistance with Random Forest had the best overall result, with an average AUC of 0.9309.

Díez-Pastor et al. (2015) propose to split the original data set into several subsets, each with a distinct, randomly chosen balance between the classes. Their approach is similar to that of Sun et al. (2015), except that the subsets have random class proportions instead of balanced ones.
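The difference between the two ways of building subsets can be sketched in Python as follows. This is a simplified illustration and not the procedure of either cited paper: the function names are hypothetical, the balanced split simply chunks the majority class, and a class that needs to grow is filled by random duplication, which is an assumption made here to keep the example short.

```python
import random


def balanced_subsets(majority, minority):
    """Chunk the majority class into pieces of minority size and pair
    each piece with all minority cases, giving balanced subsets."""
    size = len(minority)
    chunks = [majority[i:i + size] for i in range(0, len(majority), size)]
    return [chunk + minority for chunk in chunks]


def random_proportion_subset(majority, minority, rng):
    """Build one subset of the original size with a randomly drawn class split.

    Assumption: a class that must grow is filled by random duplication.
    """
    total = len(majority) + len(minority)
    n_major = rng.randint(2, total - 2)  # randomly chosen class balance
    n_minor = total - n_major

    def draw(pool, n):
        if n <= len(pool):
            return rng.sample(pool, n)  # under-sample the class
        return pool + rng.choices(pool, k=n - len(pool))  # duplicate cases

    return draw(majority, n_major) + draw(minority, n_minor)


# Toy data: 12 majority ("neg") and 3 minority ("pos") cases.
majority = [("neg", i) for i in range(12)]
minority = [("pos", i) for i in range(3)]

balanced = balanced_subsets(majority, minority)
rng = random.Random(0)
randomised = [random_proportion_subset(majority, minority, rng) for _ in range(4)]

for subset in randomised:
    n_pos = sum(1 for label, _ in subset if label == "pos")
    print(f"size={len(subset)}, minority share={n_pos / len(subset):.2f}")
```

In the balanced variant every subset has a 1:1 class ratio, whereas in the random variant the ratio changes from subset to subset while the subset size stays fixed.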

Díez-Pastor et al. (2015) divided the experiments of their study into three families and proposed a specific Random Balance method for each family. The methods created are Ensemble Random Balance (E-RB) for the data-processing family, Bagging Random Balance (BAG-RB) for the bagging family and Random Balance Boost (RB-B) for the boosting family. Each RB method is compared with several other methods, which vary according to its family. As an example, E-RB is compared to ensemble plus Smote, ensemble plus RU and partitioning plus ensemble, among others.

Díez-Pastor et al. (2015) performed their experiments on two collections of data sets: 20 data sets from the HDDT collection¹ and 66 data sets from the KEEL collection (Alcalá-Fdez et al. 2011). In total, there are 86 different data sets that vary in number of observations, features and imbalance ratio. C4.5 was selected as the standard classifier in all ensembles, being later compared with 1-Nearest Neighbor (1-NN) and Support Vector Machine (SVM) using the Gaussian kernel.

Díez-Pastor et al. (2015) assess the performance of the experiments using three different measures: AUC, F-measure and geometric mean. The final checks use the average rank of each method for each measure, as well as a combined rank that merges the ranks of all measures. To check whether the methods are statistically different, the authors use the Hochberg test. The best method was RB-B, with a combined average rank of 3.5320. For the classifier analysis, C4.5 outperformed SVM and 1-NN, also using the combined average rank as the performance measure.

2.2 Continuous Case

Thus far, several techniques to deal with data imbalance for classification problems were presented. Nonetheless, as Mendes-Moreira et al. (2012) point out, successful classification techniques are often not directly applicable to regression. Additionally, Torgo & Ribeiro (2009) affirm that standard error measures like Mean Squared Error (MSE) and Mean Absolute Deviation (MAD), often used to measure model accuracy in regression problems, are not sufficient for these tasks. Torgo & Ribeiro (2009) claim that these standard error measures treat all prediction errors equally throughout the domain of the target variable.

Therefore, Torgo & Ribeiro (2009) propose to adapt two well-known statistics from classification, i.e. precision and recall, to the regression case. In order to modify these statistics, Torgo & Ribeiro (2009) recommend the use of a relevance function φ(). This continuous function maps the target variable domain onto a [0, 1] scale of relevance, where 0 represents minimum and 1 maximum relevance. Furthermore, Torgo & Ribeiro (2009) point out that the notion of relevance is inversely proportional to the probability density function of the target variable.

¹ Data sets available at: https://www3.nd.edu/~dial/hddt/


From the relevance function, Torgo & Ribeiro (2009) formulate the regression-adapted precision and recall as follows:

$$
\text{recall} = \frac{\sum_{i:\hat{z}_i=1,\,z_i=1} (1 + u_i)}{\sum_{i:z_i=1} (1 + \phi(y_i))} \tag{1}
$$

and

$$
\text{precision} = \frac{\sum_{i:\hat{z}_i=1,\,z_i=1} (1 + u_i)}{\sum_{i:\hat{z}_i=1,\,z_i=1} (1 + \phi(y_i)) + \sum_{i:\hat{z}_i=1,\,z_i=0} (2 - p(1 - \phi(y_i)))} \tag{2}
$$

where p is a weight discerning the types of errors, u_i is the raw utility score, φ(y_i) is the relevance value of the target of observation i, while ẑ and z are binary properties associated with being in the presence of a rare extreme case (Torgo et al. 2015).
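Equations (1) and (2) can be transcribed directly into Python. The sketch below is only a transcription of the two sums: the function names are hypothetical, the utility scores u_i and relevance values φ(y_i) are taken as given inputs, and the toy values are invented for illustration.

```python
def utility_recall(z_true, z_pred, utility, relevance):
    """Equation (1): utility of correctly signalled rare cases divided by
    the relevance mass of all actual rare cases."""
    numerator = sum(1 + u for zt, zp, u in zip(z_true, z_pred, utility)
                    if zp == 1 and zt == 1)
    denominator = sum(1 + r for zt, r in zip(z_true, relevance) if zt == 1)
    return numerator / denominator


def utility_precision(z_true, z_pred, utility, relevance, p):
    """Equation (2): the same numerator, divided by the relevance mass of the
    hits plus a penalty term for false alarms weighted by p."""
    numerator = sum(1 + u for zt, zp, u in zip(z_true, z_pred, utility)
                    if zp == 1 and zt == 1)
    hits = sum(1 + r for zt, zp, r in zip(z_true, z_pred, relevance)
               if zp == 1 and zt == 1)
    false_alarms = sum(2 - p * (1 - r)
                       for zt, zp, r in zip(z_true, z_pred, relevance)
                       if zp == 1 and zt == 0)
    return numerator / (hits + false_alarms)


# Toy values (assumed): four observations, two of them actual rare cases.
z_true = [1, 1, 0, 0]            # z: observation is a rare extreme case
z_pred = [1, 0, 1, 0]            # z-hat: prediction signals a rare extreme case
utility = [0.8, 0.0, 0.0, 0.0]   # u_i, only used where both flags are 1
relevance = [0.9, 1.0, 0.2, 0.1] # phi(y_i)

recall = utility_recall(z_true, z_pred, utility, relevance)
precision = utility_precision(z_true, z_pred, utility, relevance, p=0.5)
print(round(recall, 4), round(precision, 4))
```

With these toy inputs the recall is 1.8 / 3.9 and the precision is 1.8 / 3.5, which can be verified by hand against the two equations.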

Based on the relevance function φ(), Torgo et al. (2015) developed an adaptation of the Smote over-sampling technique for the regression case called SmoteR. Torgo et al. (2015) explain that the goal of this technique is to shift the target distribution toward the regions where the relevance function has higher values.

Torgo et al. (2015) used three regression algorithms, specifically Multivariate Adaptive Regression Spline (MARS), SVM and Random Forest. This decision was taken to avoid any algorithm-dependent bias that might distort the experimental analysis of the effects of resampling (Torgo et al. 2015). Also, the authors used in the tests: two values of over-sampling (200%, 500%); four values of under-sampling (50%, 100%, 200%, 300%); five nearest neighbors for case generation; and a relevance threshold of 0.7.
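To make the idea of case generation concrete, the following Python sketch creates synthetic cases by interpolating between a rare case and one of its nearest rare neighbours. It is a simplified illustration under stated assumptions and not the implementation of Torgo et al. (2015): the function names, the step-shaped relevance function and the parameter n_new are hypothetical, rare cases from both tails are pooled together, and the synthetic target is assumed to be a distance-weighted average of the two seed targets.

```python
import math
import random


def synthesize_case(seed, neighbour, rng):
    """Interpolate one synthetic (features, target) pair between two cases."""
    (x1, y1), (x2, y2) = seed, neighbour
    t = rng.random()
    x_new = [a + t * (b - a) for a, b in zip(x1, x2)]
    d1, d2 = math.dist(x_new, x1), math.dist(x_new, x2)
    if d1 + d2 == 0:
        y_new = (y1 + y2) / 2
    else:
        # Assumption: the closer seed receives the larger weight.
        y_new = (d2 * y1 + d1 * y2) / (d1 + d2)
    return x_new, y_new


def smoter_like(cases, relevance, threshold, k, n_new, rng):
    """Generate n_new synthetic cases per rare case (relevance >= threshold)."""
    rare = [case for case in cases if relevance(case[1]) >= threshold]
    synthetic = []
    for seed in rare:
        others = [case for case in rare if case is not seed]
        neighbours = sorted(others, key=lambda c: math.dist(c[0], seed[0]))[:k]
        if not neighbours:
            continue
        for _ in range(n_new):
            synthetic.append(synthesize_case(seed, rng.choice(neighbours), rng))
    return synthetic


def relevance(y):
    """Hypothetical step relevance: full relevance above 1 or below -2."""
    return 1.0 if (y > 1 or y < -2) else 0.0


cases = [
    ([0.0, 0.0], 0.0),
    ([1.0, 1.0], 1.5),
    ([2.0, 1.0], 2.0),
    ([5.0, 5.0], -0.2),
    ([1.5, 0.5], 1.8),
]
synthetic = smoter_like(cases, relevance, threshold=0.7, k=5, n_new=2,
                        rng=random.Random(42))
print(len(synthetic), "synthetic cases generated")
```

The threshold of 0.7 and the five neighbours mirror the settings reported above; with only three rare cases in the toy data, each seed has at most two neighbours to choose from.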

For testing this new technique, Torgo et al. (2015) applied it to 18 regression data sets that vary from 198 to 9517 observations and have between 6 and 38 features. Three repetitions of 10-fold Cross-Validation (CV) were performed. The best performing technique was under-sampling 50%, followed by SmoteR 200% + under-sampling 50% and SmoteR 500% + under-sampling 50%, with median F-scores of 0.2592, 0.2413 and 0.2409, respectively. The F-score that Torgo et al. (2015) use is calculated from the precision and recall adaptations for regression of Torgo & Ribeiro (2009). Additionally, Torgo et al. (2015) used pairwise comparisons with the Wilcoxon test and a Bonferroni correction. These tests show that the three best ranking techniques are all statistically different from one another.

Sprint et al. (2015) carried out a study that investigates the contribution of wearable sensor data to predict the Discharge Functional Independence Measure (FIM) score, which indicates whether a patient will be discharged from the rehabilitation facility or not.

This study used data from 20 patients, collected by inertial measurement units (wearable sensors) and cognitive assessments, while the patient performed a test in an ambulatory circuit. Additionally, data from 4936 patients that did not participate in the wearable sensor study was also used.

Sprint et al. (2015) use three algorithms for the FIM motor score prediction: Linear SVM, Linear Regression and Random Forest. These results were compared to a Linear SVM trained on a data set that underwent SmoteR over-sampling. SmoteR was performed using all data, as well as using only the extremes.

The results of Sprint et al. (2015) are obtained with Leave One Out Cross-Validation (LOOCV). It is also worth pointing out that the accuracy measures used by Sprint et al. (2015) were RMSE and Normalized Root Mean Squared Error (NRMSE). The average performance of the Linear SVM for predicting the FIM motor score was higher than that of Linear Regression and Random Forest, even though Linear Regression obtained the lowest RMSE value of all tests.

Sprint et al.’s (2015) results show that the models that use SmoteR are slightly worse than the ones without it. The results regarding the relevance of SmoteR are inconclusive, as the differences between results are minimal and no statistical test was performed to assess the difference between models. Sprint et al. (2015) claim that the results from the algorithms using SmoteR suggest that additional ambulatory circuit participant data might improve the prediction accuracy.


3 Methods

This section will discuss and explain the data and techniques being tested as well as present the evaluation methods used to assess their effect. Three distinct algorithms are used together with the proposed techniques. The reason behind this choice is to reduce any bias in the performance evaluation of the proposed techniques due to different algorithm characteristics. Additionally, to minimize any data dependence or bias, two datasets are used.

Figure 1: Thesis workflow, explained in detail throughout this section

Figure 1 shows the methodology that was applied for both data sets in this work.

In order to compare the effects of the sampling and ensemble techniques, the original data set, with neither sampling nor ensemble, was used as a benchmark. The results of these techniques are shown and discussed in Section 4.

All sampling and ensemble techniques, as well as the models with the original data set, were tested with all the feature selection methods proposed. This was performed to ensure that no bias related to the features used is introduced. Additionally, this choice allows a discussion of the impact of the feature selection in the performance of the models.


3.1 Data

In this work two data sets are used: the Parkinson’s Disease data set from Thomas et al. (2017) and the King County house sales data set from Kaggle². A brief description of both data sets is available in Table 1. Further information is available in Subsections 3.1.1 and 3.1.2, respectively.

Table 1: Description of the Data Sets

Data set      # Obs.   # Tail Obs.   # Variables   Target     Target Range
Parkinson     229      29            101           TRS mean   -3 to 2.33*
King County   21,613   1,465         21            price      75,000 to 7,700,000

*: actual range, possible values are between -3 and +3. # (Tail) Obs.: number of (tail) observations. # Variables: total number of variables, including predictors, target and id variable before data cleansing.

3.1.1 Parkinson’s Disease Data Set

The Parkinson’s data set used by Thomas et al. (2017) was collected in accordance with the Helsinki declaration, and the study was approved by the Uppsala (Sweden) regional ethics board committee. The data were collected from 19 patients with Parkinson’s disease as well as 22 healthy participants. In the present thesis, only data regarding individuals with Parkinson’s disease are considered.

The data set consists of 229 observations of the patients at up to 14 different elapsed times after taking levodopa and 1 prior to medication. Note that levodopa is a medicine used for the treatment of Parkinson’s disease. At each time point, the patients were video recorded performing a 20 second hand pronation-supination movement with each hand, starting with the right hand.

Three distinct movement disorder specialists evaluated the recordings (presented in random order) in terms of the Treatment Response Scale (TRS), a scale ranging from -3 (under-medicated) to +3 (over-medicated), where 0 represents close to properly medicated.

The target variable (TRS mean) is the mean of their assessments. It is important to note that the maximum value of TRS mean actually found in the data is +2.33.

In the data set, there is an identification variable (subject), two time-related variables (timeslot and new time) and a measure of levodopa levels in the blood (LD). The remaining 96 features (p1 to p96) are Shimmer3 wrist sensor signals collected as the patients performed the tasks. From the 96 sensor variables, p33 to p40 are excluded from the modeling, resulting in 88 features overall.

Figure 2: TRS mean histogram

² Available for download at: www.kaggle.com/harlfoxem/housesalesprediction/data

In this study it is relevant to identify the tails of the distributions in order to perform the sampling techniques proposed. For the Parkinson data set, tail observations were considered to be those that have target values higher than 1 or lower than -2, which amount to 12.66% of the total number of observations. Figure 2 displays the histogram of TRS mean.

3.1.2 King County Housing Data Set

King County is located in the state of Washington, in the northwest of the United States. It is composed of 35 cities, with the county seat in Seattle. The King County data set contains data regarding houses sold in the county between May 2014 and May 2015. Table 2 contains the description of all variables in the King County data set and Figure 3 displays the histogram for the target variable (price).

The King County data set distribution is centered on houses with lower values and only very expensive houses are in the tail. Hence, tail observations are taken as those with price higher than 1,000,000. These constitute 6.78% of the observations in this data set.
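The tail definitions of both data sets can be written as simple rules, and the reported tail shares follow from the counts in Table 1. The Python sketch below is illustrative only: the function names are hypothetical and the short list of TRS values is invented to show how the rule is applied.

```python
def parkinson_tail(trs_mean):
    """Tail rule for the Parkinson data set: above 1 or below -2."""
    return trs_mean > 1 or trs_mean < -2


def king_county_tail(price):
    """Tail rule for the King County data set: price above 1,000,000."""
    return price > 1_000_000


def tail_share(values, is_tail):
    """Fraction of observations that fall in the tail."""
    return sum(1 for v in values if is_tail(v)) / len(values)


# Shares implied by the counts reported in Table 1.
parkinson_share = 29 / 229
king_county_share = 1465 / 21613
print(f"Parkinson tail share:   {parkinson_share:.2%}")    # 12.66%
print(f"King County tail share: {king_county_share:.2%}")  # 6.78%

# Applying the rule to a few invented TRS values.
example_share = tail_share([-2.5, 0.0, 1.5, 0.5], parkinson_tail)
```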


Table 2: King County Data Set Features

Variable        Description
date            Date of the home sale
bedrooms        Number of bedrooms
bathrooms       Number of bathrooms, where .5 accounts for a room with a toilet but no shower
condition       Index from 1 to 5 on the condition of the apartment
floors          Number of floors
grade           Index from 1 to 13, where 1-6 is a low level, 7 is an average level and 11-13 is a high quality level of construction and design
id              Unique ID for each home sold
lat             Latitude
long            Longitude
price           Price of each home sold
sqft above      Square footage of the interior housing space above ground level
sqft basement   Square footage of the interior housing space below ground level
sqft living     Square footage of the interior living space of the house
sqft living15   Square footage of interior living space of the nearest 15 neighbors
sqft lot        Square footage of the land space
sqft lot15      Square footage of the land lots of the nearest 15 neighbors
view            Index from 0 to 4 of how good the view of the property is
waterfront      Dummy variable: whether or not the apartment overlooks the waterfront
yr built        Year the house was originally built
yr renovated    Year of the last renovation of the house
zipcode         Zipcode area that the house is in


Figure 3: King County house prices histogram

3.2 Feature selection

Four distinct approaches were used for selecting the variables used as predictors in the models. The reasoning behind using distinct feature selection methods is to reduce any bias induced by these methods in the results. The methods for feature selection used in this work are: Principal Component Analysis (PCA), Random Forest (RF), correlation and correlation on the tails. Details of these techniques are presented below.

The final variables selected by each approach for both data sets are displayed in Table 3. It is important to note that for the King County data set, sqft above is highly correlated with sqft living, since it is a subset of it. For this reason, whenever both variables are selected by any of the selection approaches, sqft above is removed to avoid collinearity.

During the development of the current work, the possibility of merging the features selected by each method was also considered. This merge could be by popular vote, meaning that the features selected by the majority of the methods would be used. Other types of merges are also possible and could have a positive impact on model accuracy. These ideas were not implemented because of time limitations, but they are recorded here so that they can be used or debated in future studies.


3.2.1 Principal Component Analysis

PCA is a statistical procedure that transforms the original, possibly correlated variables into linearly uncorrelated principal components. For both data sets, the five features that contribute the most to the two most informative principal components were selected as features for the machine learning models. Selecting a fixed number of top-ranked variables keeps the selections comparable: because the target variables of the two data sets are on very different scales, the scores produced by PCA and RF are on different scales and are not directly comparable.

3.2.2 Random Forest

RF is a tree-based machine learning algorithm that repeatedly samples features and observations from the original data set and constructs Decision Trees whose outputs are combined for the final prediction of the RF. Variable importance is calculated from the mean decrease in accuracy caused by removing each feature. The six most important variables were selected for each data set.

3.2.3 Correlation and Tail Correlation

The third and fourth approaches are based on the correlation between predictor variables and between these and the target. When the predictor is a categorical variable, as occurs in the King County data set, the Intra-Class Correlation (ICC) is used. The objective is twofold: to identify predictors that are correlated among themselves, keeping only one from each group of correlated variables to avoid collinearity, and to select predictors that exhibit either positive or negative correlation with the target variable.

For the standard correlation approach all observations are considered, whereas for the tail correlation approach, only observations in the tails of the target variable distribution are considered. As mentioned in Section 3.1, tail observations are those more extreme than 1 or -2 for the Parkinson data set and higher than 1,000,000 for the King County data set.

In the standard correlation selection approach, a threshold of -0.5/+0.5 is used for both the Parkinson’s and the King County data sets. With this threshold, 6 variables from the King County data were preselected for modeling, all of which are positively correlated with the target variable (price). As aforementioned, sqft above was removed for having high correlation (0.88) with sqft living. The remaining predictors did not exhibit correlation among themselves more extreme than -0.76/+0.76.

With regard to the Parkinson’s data set, only one variable (p10) had a correlation with the target (TRS mean) more extreme than the threshold. Hence, an additional analysis of the correlation between variables was performed. Variables with correlation among predictors higher than 0.7 or lower than -0.7 were preselected. From these, the 7 variables with the most appearances (all above 11) in the preselection were further analyzed. Those that had correlation among themselves more extreme than 0.75 or -0.75 were excluded.

The final selection for the Parkinson’s data using the correlation approach is composed of 4 variables (see Table 3).

Moving on to the tail correlation selection method, a first attempt was made using the same configuration as in the standard correlation approach. In the case of the Parkinson’s data set, 5 variables were selected and the thresholds were not changed any further.

For the King County data set, the tail correlation approach resulted in only one variable being selected (sqft living). Two further attempts were performed, with thresholds of -0.45/+0.45 and -0.4/+0.4. No additional variable was selected in the second attempt. In the last trial, four more variables were selected, so that the selection included both sqft living and sqft above, the latter being removed to avoid collinearity. Hence, for this approach the modeling of the two data sets is not based on comparable thresholds for variable selection.

Table 3: Features Selected

| Selection Approach | Parkinson’s Data | King County Data |
|---|---|---|
| PCA | stcomAcc, stdcd2accZ, stdcd2comAcc, stddwtcomAcc, stdcd2accX, mrotY, mrotZ, meandwtrotZ, stddwtrotZ, strotZ | sqft living, grade, bathrooms, sqft living15, sqft basement, zipcode, view, long, yr built |
| Random Forest | mcomAcc, stddwtcomAcc, meandwtcomAcc, staccY, aecomRot, aerotY | lat, grade, sqft living, long, sqft living15 |
| Correlation | staccX, staccY, m21comAcc, stdcd1comAcc | sqft living, grade, sqft living15, bathrooms, waterfront |
| Tail Correlation | mcomRot, m21comAcc, stddwtaccZ, stdcd1accZ, meandwtcomRot | sqft living, bathrooms, grade, waterfront |

Note: the variable sqft above was removed from the King County selection for all selection methods to avoid collinearity.


3.3 Algorithms

Three algorithms are used in this work, namely Linear Regression (LR), SVR and XGB. In all cases, the model is such that the variables selected by each feature selection approach explain the target variable, i.e. TRS mean for the Parkinson’s data set and price for the King County data.

3.3.1 Linear Regression

The simplest decision support algorithm used is LR, which serves as a baseline for the more complex SVR and XGB. LR is a statistical technique to model the linear relationship between a target (dependent) variable and one or more predictors (independent variables).

This technique does not require parameter tuning.

Equation 3 presents the mathematical formulation of a multiple linear regression model with $k$ independent predictor variables $x_1, \dots, x_k$, the residual term $\varepsilon$ and one response variable $y$.

$$y = \beta_0 + \beta_1 x_1 + \cdots + \beta_k x_k + \varepsilon \tag{3}$$

For $n$ observations on the $k + 1$ variables we have the following:

$$y_i = \beta_0 + \beta_1 x_{i1} + \cdots + \beta_k x_{ik} + \varepsilon_i, \quad i = 1, \dots, n \tag{4}$$

To generate the linear model by least-squares regression, it is necessary to fit a hyperplane in $(k + 1)$-dimensional space that minimizes the sum of squared residuals (Bremer 2012):

$$\sum_{i=1}^{n} e_i^2 = \sum_{i=1}^{n} \left( y_i - \beta_0 - \sum_{j=1}^{k} \beta_j x_{ij} \right)^2 \tag{5}$$

To accomplish this, we take the derivatives of Equation 5 with respect to the model parameters $\beta_0, \dots, \beta_k$, set them equal to zero and derive the least-squares normal equations that the parameter estimates $\hat{\beta}_0, \dots, \hat{\beta}_k$ must satisfy.

In this work, linear regression was performed with R’s built-in lm function, which fits the linear model; the predict function was then used to predict the unknown values.
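To make the least-squares step concrete, the normal equations can be solved directly on a toy example. The sketch below is written in Python for illustration only (the thesis used R’s lm, which is not reproduced here); the helper names `solve_linear_system`, `fit_ols` and `predict` are hypothetical, and the toy data are generated from a known linear relation so that the estimates can be checked.

```python
def solve_linear_system(a, b):
    """Solve a x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [b[i]] for i, row in enumerate(a)]  # augmented matrix
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            factor = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= factor * m[col][c]
    x = [0.0] * n
    for i in reversed(range(n)):  # back substitution
        s = sum(m[i][j] * x[j] for j in range(i + 1, n))
        x[i] = (m[i][n] - s) / m[i][i]
    return x

def fit_ols(rows, y):
    """Estimates (beta_0, ..., beta_k) from the normal equations X'X b = X'y."""
    design = [[1.0] + list(r) for r in rows]  # prepend the intercept column
    p = len(design[0])
    xtx = [[sum(d[i] * d[j] for d in design) for j in range(p)] for i in range(p)]
    xty = [sum(d[i] * yi for d, yi in zip(design, y)) for i in range(p)]
    return solve_linear_system(xtx, xty)

def predict(beta, row):
    """Prediction for one observation with predictors `row`."""
    return beta[0] + sum(b * x for b, x in zip(beta[1:], row))

# Toy data generated exactly from y = 1 + 2*x1 - 3*x2
rows = [(0, 0), (1, 0), (0, 1), (1, 1), (2, 1)]
y = [1 + 2 * a - 3 * b for a, b in rows]
beta = fit_ols(rows, y)
print(beta)
```

Solving the normal equations explicitly is convenient for exposition; statistical software generally relies on more numerically stable matrix decompositions.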

3.3.2 SVM

Support Vector Machines are usually applied to classification problems. However, the method has an extension that can be used for regression, the SVR. The latter is a method for finding a general nonlinear function that models the relation between predictors and response as closely as possible, with minimal error.

Cristianini & Shawe-Taylor (2005) note that SVR is mainly characterized by: 1. the existence of a non-linear function which is learned by a linear learning machine in a Kernel-induced feature space; 2. the capacity of the system is controlled by a parameter that is independent of the space’s dimensionality. Still according to the authors, the SVR learning algorithm minimises a convex functional and has a sparse solution.

As Schölkopf & Smola (2002) highlight, SVMs were first developed for pattern recognition. When Support Vectors were generalized to the case of regression estimation, it was crucial to find a way in which SVR would retain the SVM’s sparseness properties.

For this purpose, Vapnik developed the ε-insensitive loss function, a margin of tolerance that does not penalize errors below some ε ≥ 0 chosen a priori (Vapnik 1995).
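The behaviour of the ε-insensitive loss can be illustrated with a few lines of Python (an illustrative sketch, not the kernlab implementation; the function name `eps_insensitive_loss` and the numbers used are hypothetical).

```python
def eps_insensitive_loss(y_true, y_pred, epsilon):
    """Zero inside the epsilon tube, growing linearly with the error outside it."""
    return max(0.0, abs(y_true - y_pred) - epsilon)

inside = eps_insensitive_loss(1.0, 1.1, epsilon=0.15)   # error smaller than epsilon
outside = eps_insensitive_loss(1.0, 1.5, epsilon=0.15)  # error larger than epsilon
print(inside, outside)
```

Observations whose error stays within the tube contribute nothing to the loss, which is what allows the SVR solution to remain sparse.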

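The resulting dual representation of the SVR ε-regression model, as implemented in kernlab (Karatzoglou et al. 2004), is described as:

$$
\begin{aligned}
\min_{\alpha, \alpha^*} \quad & \frac{1}{2} (\alpha - \alpha^*)^\top Q (\alpha - \alpha^*) + \varepsilon \sum_{i=1}^{l} (\alpha_i + \alpha_i^*) + \sum_{i=1}^{l} y_i (\alpha_i - \alpha_i^*) \\
\text{s.t.} \quad & 0 \le \alpha_i, \alpha_i^* \le C, \quad i = 1, \dots, l, \qquad \sum_{i=1}^{l} (\alpha_i - \alpha_i^*) = 0.
\end{aligned}
\tag{6}
$$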

What distinguishes SVR is the possibility of performing a non-linear regression on data in its original space using a technique called the Kernel Trick. With this approach, a hyperplane is constructed in a feature space in which a linear regression of the data is possible and the margin of the regression is minimized. This is done via a transformation $\phi(x)$ such that $\phi : X \to F$, where $F$ is a feature space of $X$, the Kernel function being $K(x_i, x_j) = \phi(x_i)^\top \phi(x_j)$, $K : X \times X \to \mathbb{R}$.


Figure 4: Kernel Trick

The Kernel function transforms the data into a higher dimensional feature space to make it possible to perform the linear regression. Any symmetric and positive semi-definite function that satisfies the Mercer (1909) condition can be used as a Kernel function (Cortes & Vapnik 1995, Smola & Schölkopf 2004).

The R package kernlab provides nine options of Kernel functions for SVM and SVR: Radial Basis Function (RBF), Polynomial, Linear, Sigmoid, Laplacian, Bessel, ANOVA RBF, Spline and String. In this study the RBF Kernel was employed, and the cost (C) and epsilon parameters were tuned for the original Parkinson’s and King County data sets. The best parameters found were then used for all other SVR models for these data sets. The function ksvm from the R package kernlab was used to fit the SVR models (Karatzoglou et al. 2004).
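As an illustration of what the RBF Kernel computes, the sketch below evaluates it on three toy points. It is written in Python rather than R and does not call kernlab; the parametrisation exp(-sigma * ||x - z||^2) is assumed here, and the function names and the value of `sigma` are hypothetical.

```python
import math

def rbf_kernel(x, z, sigma=1.0):
    """RBF Kernel value exp(-sigma * squared Euclidean distance)."""
    sq_dist = sum((a - b) ** 2 for a, b in zip(x, z))
    return math.exp(-sigma * sq_dist)

def gram_matrix(points, sigma=1.0):
    """Matrix of pairwise Kernel values K[i][j] = K(x_i, x_j)."""
    return [[rbf_kernel(p, q, sigma) for q in points] for p in points]

points = [(0.0, 0.0), (1.0, 0.0), (0.0, 2.0)]
K = gram_matrix(points, sigma=0.5)
for row in K:
    print(row)
```

The resulting matrix is symmetric with ones on the diagonal, and its entries decay as the points move apart, so nearby observations influence each other more strongly.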

3.3.3 XGB

The name gradient boosting comes from the use of a gradient descent algorithm that minimizes the loss when adding new models. XGB, when used for regression, generates several weak learners that are regression trees. Training proceeds iteratively, generating new trees that predict the residuals or errors of preceding trees. These new trees are then combined with the previous trees to assemble a final prediction.³

³ Reference available at: https://docs.aws.amazon.com/sagemaker/latest/dg/xgboost-HowItWorks.html


XGB is a derivation of the Tree boosting machine learning method, created by Chen & Guestrin (2016). It is a sparsity-aware supervised learning algorithm that uses a weighted quantile sketch to perform tree learning. The algorithm relies on data compression and sharding to build a scalable boosting system. It aims to predict the target variable by combining the results of simpler models, namely decision trees; hence, XGB is an ensemble algorithm.

This algorithm minimizes an objective function that combines a convex loss function with a cost for model complexity, using a gradient descent algorithm. It is trained by iteratively adding new trees to the model. To better understand the XGB methodology, it is important to note that this technique is based on a previous model called Tree Ensemble. Chen & Guestrin (2016) further mention that this model has one disadvantage: “The tree ensemble model includes functions as parameters and cannot be optimized using traditional optimization methods in Euclidean space”. Eq.(7) displays the mathematical formulation of the Tree Ensemble model.

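$$\mathcal{L}(\phi) = \sum_{i} l(y_i, \hat{y}_i) + \sum_{k} \Omega(f_k), \quad \text{where } \Omega(f) = \gamma T + \frac{1}{2} \lambda \lVert w \rVert^2 \tag{7}$$

$\hat{y}_i$ is the predicted value and $y_i$ is the target. $l$ is a differentiable convex loss function that measures the difference between the prediction and the target. $\Omega$ penalizes the complexity of the model, where $T$ is the number of leaves of a tree and $w$ its vector of leaf weights, and helps to avoid over-fitting.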

Alternatively, Gradient Tree Boosting is trained in an additive way. The prediction of the $i$-th instance at the $t$-th iteration is given by $\hat{y}_i^{(t)}$.

$$\mathcal{L}^{(t)} = \sum_{i=1}^{n} l\left(y_i, \hat{y}_i^{(t-1)} + f_t(x_i)\right) + \Omega(f_t) \tag{8}$$

According to the authors: “we greedily add the $f_t$ that most improves our model according to Eq.(7)” (Chen & Guestrin 2016, p. 786). The optimization of the objective is achieved via second-order approximation.

$$\mathcal{L}^{(t)} \simeq \sum_{i=1}^{n} \left[ l(y_i, \hat{y}_i^{(t-1)}) + g_i f_t(x_i) + \frac{1}{2} h_i f_t^2(x_i) \right] + \Omega(f_t), \tag{9}$$

where $g_i = \partial_{\hat{y}^{(t-1)}} l(y_i, \hat{y}^{(t-1)})$ and $h_i = \partial^2_{\hat{y}^{(t-1)}} l(y_i, \hat{y}^{(t-1)})$. Then the constant terms are removed to obtain the simplified objective at step $t$.

$$\tilde{\mathcal{L}}^{(t)} = \sum_{i=1}^{n} \left[ g_i f_t(x_i) + \frac{1}{2} h_i f_t^2(x_i) \right] + \Omega(f_t). \tag{10}$$


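The instance set of leaf $j$ is defined as $I_j = \{ i \mid q(x_i) = j \}$, where $q$ is the tree structure that maps an instance to its leaf. Eq.(10) can be rewritten by expanding $\Omega$:

$$
\begin{aligned}
\tilde{\mathcal{L}}^{(t)} &= \sum_{i=1}^{n} \left[ g_i f_t(x_i) + \frac{1}{2} h_i f_t^2(x_i) \right] + \gamma T + \frac{1}{2} \lambda \sum_{j=1}^{T} w_j^2 \\
&= \sum_{j=1}^{T} \left[ \Big( \sum_{i \in I_j} g_i \Big) w_j + \frac{1}{2} \Big( \sum_{i \in I_j} h_i + \lambda \Big) w_j^2 \right] + \gamma T.
\end{aligned}
\tag{11}
$$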

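Eq.(12) is then used to calculate the optimal value. It can also be used as a scoring function of $q$, i.e. a measure of the quality of the tree structure.

$$\tilde{\mathcal{L}}^{(t)}(q) = -\frac{1}{2} \sum_{j=1}^{T} \frac{\left( \sum_{i \in I_j} g_i \right)^2}{\sum_{i \in I_j} h_i + \lambda} + \gamma T \tag{12}$$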

Depending on the complexity of the task, it may be impossible to enumerate all the possible tree structures $q$. To work around this problem, a greedy algorithm is used instead. This algorithm starts from a single leaf and iteratively adds branches to the tree.

Assume $I = I_L \cup I_R$, where $I_L$ and $I_R$ are the instance sets of the left and right nodes after the split, respectively. Then the loss reduction function is given by:

$$\mathcal{L}_{split} = \frac{1}{2} \left[ \frac{\left( \sum_{i \in I_L} g_i \right)^2}{\sum_{i \in I_L} h_i + \lambda} + \frac{\left( \sum_{i \in I_R} g_i \right)^2}{\sum_{i \in I_R} h_i + \lambda} - \frac{\left( \sum_{i \in I} g_i \right)^2}{\sum_{i \in I} h_i + \lambda} \right] - \gamma \tag{13}$$
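The loss reduction of Eq.(13) can be evaluated by hand for a small example. The Python sketch below is illustrative only and is not taken from the xgboost package; it assumes the squared-error loss l = ½(y − ŷ)², for which g_i = ŷ_i − y_i and h_i = 1, and it uses the optimal leaf weight −G/(H + λ) given by Chen & Guestrin (2016). Function names and toy values are hypothetical.

```python
def leaf_score(g_sum, h_sum, lam):
    """One term of Eq.(13): squared gradient sum over (hessian sum + lambda)."""
    return g_sum ** 2 / (h_sum + lam)

def optimal_leaf_weight(g_sum, h_sum, lam):
    """Leaf weight that minimises the second-order objective for a fixed leaf."""
    return -g_sum / (h_sum + lam)

def split_gain(g, h, left_idx, right_idx, lam=1.0, gamma=0.0):
    """Loss reduction of Eq.(13) for a candidate split into left/right instance sets."""
    gl, hl = sum(g[i] for i in left_idx), sum(h[i] for i in left_idx)
    gr, hr = sum(g[i] for i in right_idx), sum(h[i] for i in right_idx)
    return 0.5 * (leaf_score(gl, hl, lam) + leaf_score(gr, hr, lam)
                  - leaf_score(gl + gr, hl + hr, lam)) - gamma

# Toy targets forming two clusters; the current prediction is 0 everywhere
y = [1.0, 1.2, 5.0, 5.4]
y_hat = [0.0, 0.0, 0.0, 0.0]
g = [p - t for t, p in zip(y, y_hat)]  # gradients of the squared-error loss
h = [1.0] * len(y)                     # hessians of the squared-error loss

gain = split_gain(g, h, left_idx=[0, 1], right_idx=[2, 3], lam=1.0, gamma=0.0)
print(gain)
```

A positive gain means the split reduces the regularised objective; raising `gamma` makes splits harder to accept, and with `lam` set to 0 the optimal leaf weight reduces to the mean residual of the leaf.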

In this work, caret’s train function (Kuhn 2017) was used to optimize the parameters max_depth, eta, colsample_bytree, subsample, nrounds, min_child_weight and gamma. Once the best parameters were found, the function xgboost from the R package xgboost (Chen et al. 2017) was used to build the models.

The best parameters for each data set and feature set, for both SVR and XGB are displayed in Table 4.

3.4 Sampling Methods

Two distinct sampling procedures are used and compared with the original data sets as a baseline: SmoteR Over and SmoteR Balance. Both methods are applications of SmoteR, which is discussed below.

3.4.1 SmoteR

SmoteR was created by Torgo et al. (2015) and is an adaptation to regression of the Smote algorithm for classification created by Chawla et al. (2002). SmoteR takes five arguments: the data set, the relevance threshold, the over- and undersampling percentages and the number k of neighbours to use for synthetic data generation.

In order to properly work for regression cases, Torgo et al. (2015) proposed the use of a relevance function φ(y) ranging from 0 to 1 to give higher weights to the relevant instances (infrequent observations) and lower weights to observations with frequent values.

The default relevance function is the inverse function of the probability density function.

Based on the weights attributed to each observation in the data set, the rare cases are selected according to the user-supplied threshold. SmoteR finds the k-Nearest Neighbours (kNN) of the selected observations and generates synthetic instances that mimic them, in the proportion of oversampling requested by the user.

The approach is similar for the frequent observations; however, the percentage of undersampling is calculated with respect to the number of synthetic cases generated for the infrequent observations rather than with respect to the number of frequent instances. The method then returns a data set that combines the data obtained for both frequent and infrequent observations.

For both SmoteR techniques tested in this work, the oversampling happens after the feature extraction phase. This choice ensures that the synthetic data generated by SmoteR is only used for training. For each CV iteration, new synthetic data is created and used solely for training. This procedure guarantees that synthetic data is never used as test data during the CV, so predictions are made only for real data.
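The generation of a single synthetic case can be sketched as follows. This is a simplified Python illustration of the interpolation idea behind SmoteR for numeric predictors only, not the UBL code used in this work: the new case is placed at a random point between a rare seed case and one of its nearest neighbours, and its target is a distance-weighted average of the two targets. All function names and toy values are hypothetical.

```python
import math
import random

def euclidean(a, b):
    return math.sqrt(sum((u - v) ** 2 for u, v in zip(a, b)))

def k_nearest(seed_idx, X, k):
    """Indices of the k nearest neighbours of X[seed_idx], excluding itself."""
    others = [i for i in range(len(X)) if i != seed_idx]
    return sorted(others, key=lambda i: euclidean(X[seed_idx], X[i]))[:k]

def synthesize(seed_x, seed_y, nb_x, nb_y, rng):
    """Interpolate one synthetic case between a seed and one of its neighbours."""
    frac = rng.random()
    new_x = [s + frac * (n - s) for s, n in zip(seed_x, nb_x)]
    d_seed = euclidean(new_x, seed_x)
    d_nb = euclidean(new_x, nb_x)
    if d_seed + d_nb == 0:  # seed and neighbour coincide
        return new_x, (seed_y + nb_y) / 2
    # the closer the new case is to a parent, the more that parent's target counts
    new_y = (d_nb * seed_y + d_seed * nb_y) / (d_seed + d_nb)
    return new_x, new_y

rng = random.Random(0)
X_rare = [[1.0, 2.0], [1.5, 2.5], [4.0, 0.0]]  # toy rare cases
y_rare = [10.0, 12.0, 30.0]

nb = rng.choice(k_nearest(0, X_rare, k=2))
new_x, new_y = synthesize(X_rare[0], y_rare[0], X_rare[nb], y_rare[nb], rng)
print(new_x, new_y)
```

Because both the predictors and the target are interpolated, every synthetic case lies between two observed rare cases, which is why the choice of k matters when only a few rare cases are available in a fold.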

3.4.2 SmoteR Balance

SmoteR Balance is a variation of SmoteR in which the sampling method returns an approximately balanced data set based on the weights allocated to the observations. The R package UBL provides the function SmoteRegress, which implements this sampling approach if the argument C.perc is set to extreme (Branco et al. 2016).

Observations that return a φ(y) greater than the threshold, which is set to 0.7 in all cases, are used as archetypes for generating new cases. As mentioned before, the new cases are generated based on the k nearest neighbours. Initially, the proposition was to use a k equal to 2% of the data set size, rounded up.

For the Parkinson’s data set, 2% of the data set size is 5, hence k=5. However, after running the code it was noted that in some folds of the cross validation it was not possible to use k equal to 5, so the algorithm automatically reduces the value of k until a valid value is found. Therefore, the value of k varies across the folds of the cross validation for the Parkinson’s data set. For the King County data set, 2% of the data set size rounded up is equal to 433, hence k=433. This value of k was used in all cases, as it was always a valid option.

Figure 5: Effect of the SmoteR Balance technique over the TRS Mean histogram

In order to generate the roughly balanced data, this function performs both over and undersampling of the original data set. The core of this approach is to flatten the frequency bumps in the original data so that synthetic data is generated to compensate for rare values and undersampling is performed to reduce the number of observations in the peaks of the distribution (high density regions).

Branco et al.’s (2016) SmoteRegress function returns a data set with a mix of real and synthetic values. The output contains n − 1 observations, where n is the size of the original data set. The effect of this function can be seen in Figure 5.

3.4.3 SmoteR Over

SmoteR Over is another implementation of SmoteR, in which the sampling method returns synthetic data that imitate the infrequent values. As of May 2018, this function is not implemented in Branco et al.’s (2016) UBL package for R. However, P. Branco, one of the developers of SmoteR (Torgo et al. 2015) and of UBL, provides a function named Smote.exsRegress on GitHub⁴ that performs this type of sampling.

As in SmoteR Balance, the threshold is 0.7. The Smote.exsRegress function is still experimental and lacks the functionality that automatically reduces k when the supplied value is not valid, as happens in SmoteR Balance. Therefore, when the input k is not valid, the program raises a fatal error and aborts. The only valid k value found for the Parkinson’s data set was 3. For the King County data set k is equal to 433, the same as in SmoteR Balance.

To control the number of examples to be generated, a number (N) greater than 1 must be supplied as an argument to the function. The number of extra (synthetic) observations generated by Smote.exsRegress is equal to (N − 1) × number of infrequent observations. For example, if synthetic data amounting to 25% of the infrequent observations is to be generated, N should be set to 1.25.

N equal to 1 means that no synthetic data will be generated. N equal to 2, on the other hand, implies that as many synthetic observations as there are infrequent events in the data set will be generated, so the data set used for modeling will contain double the number of infrequent events (original plus synthetic).
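The relation between N and the amount of synthetic data is simple arithmetic, illustrated below in Python. The function name `n_synthetic` is hypothetical, and rounding to the nearest integer is an assumption, since the rounding rule applied by Smote.exsRegress is not stated here.

```python
def n_synthetic(n_infrequent, N):
    """Number of synthetic cases: (N - 1) times the number of infrequent cases."""
    if N < 1:
        raise ValueError("N must be at least 1")
    return round((N - 1) * n_infrequent)  # rounding rule is an assumption

# With 40 infrequent observations (a made-up count):
for N in (1.0, 1.25, 2.0):
    print(N, n_synthetic(40, N))
```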

SmoteR Over returns a larger data set composed of the complete original data plus the synthetic data. The distribution of the resulting data set will vary according to the percentage of oversampling to be performed.

For both SmoteR Balance and SmoteR Over, the cross validation procedure is such that, at each iteration, the training and test data sets are selected, the sampling techniques are performed on the training set and testing is performed on the selected test data set. No sampling technique is applied to the test data, which remains unaltered.
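The procedure can be sketched as a cross validation loop in which the resampling step only ever receives the training part of each fold. The Python code below is a generic illustration, not the R code of this work: `duplicate_rare` is a naive stand-in for the SmoteR functions, the mean predictor is a stand-in for LR, SVR or XGB, and all names are hypothetical.

```python
import random

def k_fold_indices(n, k, seed=0):
    """Split the indices 0..n-1 into k shuffled folds."""
    idx = list(range(n))
    random.Random(seed).shuffle(idx)
    return [idx[i::k] for i in range(k)]

def cross_validate(X, y, k, resample, fit, predict):
    """Out-of-fold predictions; `resample` is applied to the training data only."""
    preds = [None] * len(y)
    for test_idx in k_fold_indices(len(y), k):
        test = set(test_idx)
        train_idx = [i for i in range(len(y)) if i not in test]
        X_tr, y_tr = resample([X[i] for i in train_idx], [y[i] for i in train_idx])
        model = fit(X_tr, y_tr)
        for i in test_idx:  # the test observations are left untouched
            preds[i] = predict(model, X[i])
    return preds

def duplicate_rare(X_tr, y_tr, threshold=8):
    """Naive stand-in for SmoteR: duplicate training cases with a large target."""
    extra = [(x, t) for x, t in zip(X_tr, y_tr) if t > threshold]
    return X_tr + [x for x, _ in extra], y_tr + [t for _, t in extra]

X = [[float(i)] for i in range(10)]
y = [float(i) for i in range(10)]
preds = cross_validate(
    X, y, k=5,
    resample=duplicate_rare,
    fit=lambda X_tr, y_tr: sum(y_tr) / len(y_tr),  # mean predictor as placeholder
    predict=lambda model, x: model,
)
print(preds)
```

Keeping the resampling inside the loop is what prevents synthetic or duplicated cases from leaking into the data on which the error measures are computed.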

3.4.4 Relevance Function

When applying the aforementioned functions to the Parkinson’s data set, an error was raised with regard to the default relevance function. This error caused all the relevance values (φ(y)) to be equal to zero. It is plausible that the reason behind this error was the nature of the distribution of the target variable: TRS mean from the Parkinson’s data set has a bimodal distribution, which is somewhat harder to model and may not be supported by the functions. This relevance function was used in all cases except for the SmoteR balance for the King County data set.

Following the directions on how a relevance function should behave, i.e. it outputs values between 0 and 1, where observations with values closer to 0 are the least relevant and those closer to 1 are the most relevant, this work proposes the use of the following relevance function instead:

$$\phi(y_i) = \frac{1}{\sqrt{\dfrac{\text{frequency}_i}{\text{min frequency}}}}, \quad \text{where } 0 < \phi(y_i) \le 1, \ \forall i \tag{14}$$

Here, frequency_i is the number of instances of the unique value taken by observation i and min frequency is the minimum frequency across all observations. The output (φ(y)) is a vector containing the computations of φ(y_i) for all observations.

This function was chosen because it is simple and easy to implement. Infrequent observations return values closer to 1 and more frequent observations return values closer to 0. As the SmoteR functions use a threshold, there is no need for the φ(y) output values to actually be equal to zero; the function only tends to zero as the frequency tends to infinity. Infrequent values often appear only once; if the most infrequent value appears twice, the min frequency term rescales the output so that it still equals 1.
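A minimal Python sketch of this relevance function is given below, reading Eq.(14) as one over the square root of frequency_i divided by min frequency. It is an illustration rather than the R code used in this work; the function name `relevance` and the toy target values are hypothetical.

```python
from collections import Counter
import math

def relevance(y):
    """Relevance of each observation: 1 / sqrt(frequency_i / min frequency)."""
    freq = Counter(y)
    min_freq = min(freq.values())
    return [1.0 / math.sqrt(freq[v] / min_freq) for v in y]

y = [0, 0, 0, 0, 1, -2]  # toy target: the value 0 is frequent, 1 and -2 are rare
phi = relevance(y)
rare = [v for v, p in zip(y, phi) if p > 0.7]  # threshold used in this work
print(phi, rare)
```

In this toy example the value that appears four times receives a relevance of 0.5 and falls below the 0.7 threshold, while the two values that appear once receive a relevance of 1 and are treated as rare.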

3.5 Ensemble Methods

For the ensemble methods, all combinations of selected features and sampling methods are used for each ensembling approach. Additionally, the learning algorithm running within the ensemble is either LR, SVR or XGB. Three distinct ensemble approaches are used: uniform, random and tails.

Throughout this work a discussion emerged regarding two other possible ensemble methods. One is an extension of the feature selection discussion from Section 3.2. This as yet unexplored ensemble method would randomly sample features for each algorithm within the ensemble. Hence, all models in the ensemble would use different combinations of features, providing an overall cross-check across the data.

The second new ensemble method is to ensemble different machine learning algorithms. A possibility would be to have one algorithm that is better at predicting the tails and another that has a good overall prediction, for example SVR and XGB. The ensemble would then combine these different machine learning algorithms into a single ensemble algorithm. These two additional ensembling methods were not used in this work due to time limitations, but they are worth mentioning as they open possibilities for future work or discussion.

As mentioned in Section 1, Subsection 1.1, the uniform and random ensemble approaches are adaptations to the regression case of Sun et al.’s (2015) and Díez-Pastor et al.’s (2015) ensemble methods for classification, respectively. These adaptations are a proposition of the current work. Details of the proposed implementation are discussed below.


3.5.1 Uniform

The UDSE approach takes as input the data set, the number (k) of folds in the CV and a number (N) corresponding to the number of samples and, thus, to the number of learning algorithms that will be built in each fold. In all cases an N equal to 1000 was used.

Initially, the data is divided into k folds. In each iteration, one fold is allocated as test data and the remaining k-1 folds as a preselection of training data, and N machine learning algorithms are trained and tested. All learning algorithms within the same fold are tested on the same test set.

With regard to the training data set, for each learning algorithm, a uniform sampling (with replacement) is performed across the existing unique values of the target variable from the preselected data, such that the training data for the specific machine learning algorithm is uniformly distributed throughout the range of possible values. The number of observations to be sampled from each value is equal to the minimum frequency of all observed values in the training set for each fold.

Once the N machine learning algorithms have been trained and tested in every fold, all observations in the original data set will have been used for both training and testing. The result is a data frame containing N columns with the predicted values for the test data set from each algorithm trained. The final prediction of the ensemble is computed as the mean of the predicted values for each observation across the algorithms.

Figure 6 illustrates the sampling that takes place within each fold of the CV. In the Figure, the original data set has a standard normal distribution and, once a single observation of each unique value is selected, the training data for all learning algorithms is uniform across the possible values of the target. For each learning algorithm a distinct training set is presented.

In order to assess the actual size of the training set, the observations within each fold must be considered. However, it is noteworthy that, for the Parkinson’s data set, TRS mean has 16 unique values, a minimum frequency of 1 and a maximum of 39. In the King County data set, on the other hand, price has 4028 unique values with a minimum frequency of 1.
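The uniform sampling and the averaging of the N predictions within one fold can be sketched as follows. This is a Python illustration of the procedure described above, not the R implementation of this work; the function names are hypothetical and a mean predictor stands in for LR, SVR or XGB.

```python
import random
from collections import defaultdict

def uniform_sample(y_train, rng):
    """Indices that give every unique target value the same number of draws."""
    by_value = defaultdict(list)
    for i, v in enumerate(y_train):
        by_value[v].append(i)
    m = min(len(ix) for ix in by_value.values())  # minimum frequency in the fold
    sample = []
    for ix in by_value.values():
        sample.extend(rng.choices(ix, k=m))  # sampling with replacement
    return sample

def ensemble_predict(X_train, y_train, X_test, fit, predict, n_learners, seed=0):
    """Train n_learners models on uniform samples and average their predictions."""
    rng = random.Random(seed)
    per_learner = []
    for _ in range(n_learners):
        idx = uniform_sample(y_train, rng)
        model = fit([X_train[i] for i in idx], [y_train[i] for i in idx])
        per_learner.append([predict(model, x) for x in X_test])
    return [sum(col) / n_learners for col in zip(*per_learner)]

# Toy fold: the value 0 is frequent, the values 1 and 3 are rare
X_train = [[float(i)] for i in range(9)]
y_train = [0, 0, 0, 0, 0, 0, 1, 1, 3]
sample = uniform_sample(y_train, random.Random(1))
final = ensemble_predict(
    X_train, y_train, X_test=[[0.0], [1.0]],
    fit=lambda X, y: sum(y) / len(y),   # mean predictor as placeholder learner
    predict=lambda model, x: model,
    n_learners=10,
)
print(sorted(y_train[i] for i in sample), final)
```

Since the minimum frequency in the toy fold is 1, each training sample holds exactly one observation per unique target value, which mirrors the situation described for both data sets.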

3.5.2 Random

The RDSE approach takes as input the data set, the number (k) of folds in the CV, the sample size (S) and a number (N) corresponding to both the number of samplings and the number of learning algorithms to be built in each fold. In all cases an N equal to 1000 and an S equal to two times the number of unique values were used.


Figure 6: Illustration of how the original distribution of the target variable in each fold is transformed into N uniform subsets, which will be used to train N learning algorithms

The RDSE ensemble approach is similar to the UDSE ensemble approach. However, at each fold N random samples of user-defined size are taken with replacement from the preselected training data. This differs from the UDSE, in which the sample sizes are determined by the minimum frequency of the unique values. Apart from this, the remainder of the RDSE approach is the same as in the UDSE. Figure 7 demonstrates that the random sampling results in training sets with different distributions for each machine learning algorithm in each fold.

3.5.3 Tails

The tails ensemble approach is simpler and has only two learning algorithms, one using the features selected by correlation and the other using the features selected by tail correlation, both trained once for each fold in the CV. Unlike the previous ensemble methods, the entirety of the training set is used for training in each fold.

As in the previous ensemble techniques, at the end of the training and testing for each fold there are two predictions, which are averaged into a single final prediction. The algorithm is coded such that it is also possible to perform a weighted average of the learning algorithm predictions.


Figure 7: Illustration of how the original distribution of the target variable in each fold is transformed into N randomly distributed subsets, which will be used to train N learning algorithms

3.6 Performance Measures

The performance measures for the proposed models are RMSE, RMSE tails, the correlation between actual and predicted values of the target variable, and MAE (Shmueli et al. 2016).

To analyze the best techniques to deal with infrequent values in regression problems, the sampling and ensemble techniques are ranked according to the lowest RMSE for each data set. Additionally, the lowest tail RMSE and MAE, as well as the highest correlation between prediction and actual target are highlighted and discussed.

3.6.1 RMSE

RMSE is a frequently used error measure for regression problems. It reflects the average magnitude of the error in a prediction model and is a measure of accuracy. As the error $(y_j - \hat{y}_j)$ is squared, positive and negative errors are equally represented in the final measure. It can be used to compare the forecasting errors of different models for a specific data set, but it cannot be used to compare models applied to different data sets, as it is scale-dependent (Hyndman & Koehler 2006).

$$RMSE = \sqrt{\frac{1}{n} \sum_{j=1}^{n} (y_j - \hat{y}_j)^2} \tag{15}$$

In Equation 15, $y_j$ is the actual value, $\hat{y}_j$ is the predicted value and $n$ is the total number of observations. The smaller the RMSE, the better the model.

3.6.2 RMSE Tails

RMSE Tails is the same measure as in the previous Subsection, but it is computed only for the observations whose actual value y_j lies in the tails: smaller than -2 or greater than 1 for TRS Mean in the Parkinson’s data set, and greater than 1 million for price in the King County data set.
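Both measures can be written compactly. The Python sketch below is illustrative (the thesis computations were done in R); the function names are hypothetical, the tail rule follows the Parkinson’s thresholds stated above, and the actual and predicted values are made-up numbers.

```python
import math

def rmse(actual, predicted):
    """Root mean squared error over all observations (Equation 15)."""
    return math.sqrt(sum((a - p) ** 2 for a, p in zip(actual, predicted)) / len(actual))

def rmse_tails(actual, predicted, is_tail):
    """RMSE restricted to observations whose actual value lies in the tails."""
    pairs = [(a, p) for a, p in zip(actual, predicted) if is_tail(a)]
    if not pairs:
        raise ValueError("no tail observations")
    return rmse(*zip(*pairs))

def parkinson_tail(v):
    """Tail rule for TRS Mean: smaller than -2 or greater than 1."""
    return v < -2 or v > 1

actual = [-2.5, -1.0, 0.0, 0.5, 1.5]
predicted = [-1.5, -1.0, 0.0, 0.5, 1.0]
print(rmse(actual, predicted), rmse_tails(actual, predicted, parkinson_tail))
```

In this toy example all of the error sits in the two tail observations, so the tail RMSE is larger than the overall RMSE, which is the situation the measure is meant to expose.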

3.6.3 Correlation Between Model Output and Actual Value

The correlation between model predictions and the actual values was also implemented as a performance measure. It aids in the analysis of whether the model follows the trends of the actual values. The closer this correlation is to +1, the better the model tracks the actual values, since predictions and actual values then move together in direction and in proportion.

A correlation close to 0 implies no linear relation between target and predictions. On the other hand, a correlation close to -1 signifies that there is an opposing relationship between actual and predicted values. Thus, negative correlation means that the model captures a trend that is the opposite of the actual trend.

3.6.4 Mean Absolute Error

MAE also measures the average magnitude of the errors in a prediction model, but in a different manner than RMSE. As can be seen in Equation 16, its formula is very similar to the RMSE formula (Equation 15). The main difference is that MAE takes into consideration the absolute error instead of the squared error used in RMSE. The smaller the MAE, the better the model. MAE is also a scale-dependent accuracy measure (Hyndman & Koehler 2006).

$$MAE = \frac{1}{n} \sum_{j=1}^{n} |y_j - \hat{y}_j| \tag{16}$$

MAE and RMSE are similar, yet distinct measures. Comparing them helps to identify whether all errors have the same magnitude (MAE=RMSE) or whether the magnitude of the errors varies across observations (MAE<RMSE). MAE is always smaller than or equal to RMSE.
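The relation between the two measures can be checked on two small sets of made-up errors with the same MAE. The Python sketch below is only an illustration and the function names are hypothetical.

```python
import math

def mae(actual, predicted):
    """Mean absolute error (Equation 16)."""
    return sum(abs(a - p) for a, p in zip(actual, predicted)) / len(actual)

def rmse(actual, predicted):
    """Root mean squared error (Equation 15)."""
    return math.sqrt(sum((a - p) ** 2 for a, p in zip(actual, predicted)) / len(actual))

actual = [0.0, 0.0, 0.0, 0.0]
even_errors = [1.0, -1.0, 1.0, -1.0]   # every error has magnitude 1
uneven_errors = [0.0, 0.0, 0.0, 4.0]   # one large error, the rest exact

print(mae(actual, even_errors), rmse(actual, even_errors))      # equal
print(mae(actual, uneven_errors), rmse(actual, uneven_errors))  # RMSE is larger
```

Both sets have an MAE of 1, but the RMSE doubles when the error is concentrated in a single observation, which is why the gap between the two measures is informative for infrequent values.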


3.7 Comparison between Models

In order to check the differences between the models that utilize the proposed techniques and the benchmark, the Bland-Altman plot and the Wilcoxon signed-rank test are implemented.

3.7.1 Bland-Altman method

The Bland-Altman plot is a tool for visualizing the divergence between two measurements. In this work, it is used to compare the predictions of the best models using the proposed techniques with the respective predictions obtained with the original data set. It is a two-dimensional scatter plot that compares the differences between two quantitative measurements.

Myles & Cui (2007) state that the Bland-Altman method calculates the mean difference (or “bias”) between two methods of measurement and the 95% limits of agreement around the computed mean difference. It is expected that the 95% limits include 95% of the differences between the two measurement methods.

The assessment of the agreement between the measurements is based on visual analysis. A smaller distance between the limits indicates better agreement between the measures analyzed. How small is small enough is a context-dependent matter. A way of assessing the suitability of the measurements is to consider whether a difference as extreme as that indicated by the 95% limits would significantly impact the interpretation of the output (Myles & Cui 2007).
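The quantities behind the plot can be computed as shown below. This Python sketch is illustrative only; it assumes the common convention of placing the limits of agreement at the bias ± 1.96 sample standard deviations of the differences, and the function name and the two prediction vectors are hypothetical.

```python
import statistics

def bland_altman(a, b):
    """Pairwise means, differences, bias and 95% limits of agreement."""
    diffs = [x - y for x, y in zip(a, b)]
    means = [(x + y) / 2 for x, y in zip(a, b)]
    bias = statistics.mean(diffs)
    sd = statistics.stdev(diffs)  # sample standard deviation of the differences
    return means, diffs, bias, (bias - 1.96 * sd, bias + 1.96 * sd)

pred_proposed = [1.0, 2.0, 3.0, 4.0]   # made-up predictions of one model
pred_original = [1.5, 1.5, 3.5, 3.5]   # made-up predictions of the baseline
means, diffs, bias, (lower, upper) = bland_altman(pred_proposed, pred_original)
print(bias, lower, upper)
```

The plot itself is the scatter of `diffs` against `means`, with horizontal lines at the bias and at the two limits.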

3.7.2 Wilcoxon signed-rank test

Wilcoxon (1945) developed a test to compare the results of different treatments. It is a ranking method in which the ranks 1, 2, 3, ..., n are substituted for the actual numerical data, which helps to assess the approximate significance of the differences in experiments. In our case, the treatments are the techniques proposed to deal with infrequent observations in a data set. This test was performed because Torgo et al. (2015) use it in their work to check the differences between their proposed models, which combine different values of over- and undersampling. The test helps to assess whether the models are statistically different or not.
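The ranking step of the test can be illustrated as follows. The Python sketch below computes only the signed-rank sums, not the p-value, and is not the procedure used to obtain the results of this work; the function name and the paired error values are made up. Zero differences are discarded and tied absolute differences receive their average rank.

```python
def signed_rank_sums(a, b):
    """Sums of the ranks of positive and negative paired differences."""
    diffs = [x - y for x, y in zip(a, b) if x != y]  # zero differences are dropped
    order = sorted(range(len(diffs)), key=lambda i: abs(diffs[i]))
    ranks = [0.0] * len(diffs)
    pos = 0
    while pos < len(order):
        end = pos
        while end + 1 < len(order) and abs(diffs[order[end + 1]]) == abs(diffs[order[pos]]):
            end += 1
        avg_rank = (pos + end) / 2 + 1  # average of the 1-based ranks pos+1..end+1
        for j in range(pos, end + 1):
            ranks[order[j]] = avg_rank
        pos = end + 1
    w_plus = sum(r for r, d in zip(ranks, diffs) if d > 0)
    w_minus = sum(r for r, d in zip(ranks, diffs) if d < 0)
    return w_plus, w_minus

errors_a = [3, 5, 2, 8, 6]  # made-up absolute errors of one technique
errors_b = [1, 6, 2, 4, 3]  # made-up absolute errors of another technique
w_plus, w_minus = signed_rank_sums(errors_a, errors_b)
print(w_plus, w_minus)
```

The more unbalanced the two rank sums are, the stronger the evidence that one technique systematically yields larger errors than the other; the significance is then obtained from the distribution of the smaller sum.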


Table 4: Tuned Parameters for Both SVR and XGB Models

| Algorithm | Data set | Feature Set | Best Parameters |
|---|---|---|---|
| SVR | Parkinson | PCA | epsilon: 0.15151515; C: 1 |
| SVR | Parkinson | RF | epsilon: 0.2171717172; C: 1 |
| SVR | Parkinson | Correlation | epsilon: 0.0353535354; C: 1 |
| SVR | Parkinson | Tail Correlation | epsilon: 0.2424242425; C: 1 |
| SVR | King County | PCA | epsilon: 0.2222222223; C: 12 |
| SVR | King County | RF | epsilon: 0.1666666667; C: 23 |
| SVR | King County | Correlation | epsilon: 0.3333333334; C: 12 |
| SVR | King County | Tail Correlation | epsilon: 0.3888888889; C: 12 |
| XGB | Parkinson | PCA | eta: 0.3; max_depth: 12; subsample: 0.75; colsample_bytree: 0.6; nrounds: 100; min_child_weight: 2; gamma: 0 |
| XGB | Parkinson | RF, Correlation, Tail Correlation | eta: 0.3; max_depth: 12; subsample: 1; colsample_bytree: 0.8; nrounds: 150; min_child_weight: 2; gamma: 0 |
| XGB | King County | PCA, RF, Correlation, Tail Correlation | eta: 0.3; max_depth: 3; subsample: 0.75; |


Contents

Acronyms

1 Introduction

1.1 Proposed Approach

2 Literature Review

2.1 Discrete Case

2.2 Continuous Case

3 Methods

3.1 Data

3.2 Feature selection

3.3 Algorithms

3.4 Sampling Methods

3.5 Ensemble Methods

3.6 Performance Measures

3.7 Comparison between Models