
Uppsala University

Department of Informatics and Media

Predicting Multimodal Rehabilitation Outcomes using Machine Learning

Alexandru Cheltuitor & Niklas Jones-Quartey

Course: Bachelor's Degree Project
Level: C
Semester: Spring 2020
Date: 2020-06-13
Supervisor: Simone Callegari


Abstract

Chronic pain is a complex health issue and a major cause of disability worldwide. Although multimodal rehabilitation (MMR) has been recognized as an effective form of treatment for chronic pain, some patients do not benefit from it. If treatment outcomes could be reliably predicted, then patients who would benefit more from MMR could be prioritized over others.

Machine learning has been proven capable of accurately predicting outcomes in other healthcare related domains. Therefore, this study aims to investigate its use for predicting outcomes of MMR, using data from the Swedish Quality Registry for Pain Rehabilitation (SQRP). XGBoost regression was used for this purpose, and its predictive performance was compared to Ridge regression. Twelve models were trained on SQRP data for each algorithm, in order to predict pain and quality of life related outcomes. The results show similar performances for both algorithms, with mean cross-validated R² values of 0.323 and 0.321 for the XGBoost and Ridge models respectively. The average root mean squared errors of 6.744 for XGBoost and 6.743 for Ridge were similar as well. Since XGBoost performed no better than a less computationally expensive method, the results of this study do not support its use for MMR outcome prediction. However, machine learning has the potential to be more effective for this purpose, through the use of different hyperparameter values, correlation-based feature selection or other machine learning algorithms.

Keywords: Machine learning, XGBoost, regression, Multimodal Rehabilitation, SQRP, chronic pain, treatment outcome, prediction


Table of Contents

1 Introduction
1.1 Problem Domain
1.2 Research Aim
1.3 Delimitations
2 Theory
2.1 Multimodal Rehabilitation (MMR)
2.2 Swedish Quality Registry for Pain Rehabilitation (SQRP)
2.2.1 Multidimensional Pain Inventory (MPI)
2.2.2 Hospital Anxiety and Depression Scale (HADS)
2.2.3 The Short Form Health Survey (SF36)
2.2.4 The European Quality of Life instrument (EQ-5D)
2.2.5 Health-related Quality of Life (Hr-QoL)
2.3 Multicollinearity
2.4 Linear and Ridge Regression
2.5 Machine Learning
2.5.1 Preprocessing
2.5.1.1 Correlation-based Feature Selection (CFS)
2.5.1.2 KNN Imputation
2.5.2 Grid Search
2.5.3 Decision Trees
2.5.4 Gradient Boosting and Boosted Regression Trees (BRT)
2.5.5 eXtreme Gradient Boosting (XGBoost)
2.5.6 Evaluation
2.5.6.1 R²
2.5.6.2 MSE and RMSE
2.5.6.3 Cross-validation
2.5.6.4 Feature Importance
3 Method
3.1 Research strategy
3.1.1 Problem identification and motivation
3.1.2 Define the objectives for a solution
3.1.3 Design and Development
3.1.4 Demonstration
3.1.5 Evaluation
3.1.6 Communication
3.2 Machine Learning Approach
3.3 Data
3.3.1 Data Collection
3.3.2 Data Exploration
3.3.2.1 Multicollinearity Analysis
3.3.3 Data Preprocessing
3.3.3.1 Feature Reduction
3.3.3.2 Data Transformation
3.3.3.3 Outlier Detection
3.3.3.4 Target Data Extraction
3.3.3.5 Imputation
3.4 Features
3.5 Targets
3.6 Model Selection
3.6.1 XGBoost Regressor
3.6.2 Baseline Regressor
3.7 Model Evaluation
4 Analysis and Results
4.1 Multicollinearity
4.2 Grid Search Results
4.2.1 XGBoost Parameters
4.2.2 Baseline Parameters
4.3 Model Performance
4.4 Model Analysis
5 Discussion
5.1 Model Performance
5.2 Model Analysis
5.3 Future Research Implications
6 Conclusion
6.1 Future research
References
Appendix


1 Introduction

Chronic pain is a complex health issue and a major cause of disability globally. Several chronic pain conditions, including lower back pain and neck pain, were confirmed in 2016 to have been leading causes of disability since 1990 (Mills, Nicolson & Smith 2019). Consequently, a large number of people are affected not only by long-lasting, recurrent pain, but also by secondary symptoms such as depression, anxiety, insomnia, fatigue and a reduced ability to work (Mills, Nicolson & Smith 2019; Milton, Börsbo, Rovner, Lundgren-Nilsson, Stibrant-Sunnerhagen & Gerdle 2013).

Chronic pain has an impact not only on the patients themselves, but also on hospitals and society in general. For instance, the estimated costs of chronic pain treatment in Sweden amounted to 7.5 billion SEK in 2003. Furthermore, the estimated costs in lost production, due to pain induced sick leave, were 80 billion SEK during that same year (SBU 2006).

Because of the complexity of chronic pain as a health issue, treatment involving several medical disciplines can be effective in combating it. An example of such a treatment is multimodal rehabilitation (MMR), which has been recognized as an effective interdisciplinary treatment for chronic pain. MMR is a form of treatment that combines aspects of physiotherapy, psychotherapy and medicine. It is performed within programmes run by specialized Swedish clinics in order to treat chronic pain. (Nationella Kvalitetsregister 2019)

Data related to MMR outcomes, such as a reduced level of pain severity, has been collected and stored in the Swedish Quality Registry for Pain Rehabilitation (SQRP) since 1998 (Nationella Kvalitetsregister 2019). The SQRP is one of many quality registries in Sweden that collects patient data in order to improve Swedish healthcare (Nationella Kvalitetsregister 2016).

The registry contains pre-treatment, post-treatment and 12-month follow up values for several MMR outcomes. SQRP data has enabled research to be carried out for the purpose of improving MMR programmes (Molander, Dong, Äng, Enthoven & Gerdle 2018).

Although MMR is often successful in reducing pain and pain related symptoms, previous research has indicated that it is less beneficial for some patients (Gerdle, Åkerblom, Jansen, Enthoven, Ernberg, Dong, Stålnacke, Äng & Boersma 2019a; Gerdle, Åkerblom, Stålnacke, Jansen, Enthoven, Ernberg, Dong, Äng & Boersma 2019b; Molander et al. 2018). Therefore, a more targeted provision of treatment could enable patients who would benefit more from MMR to be selected over others (Gerdle et al. 2019a).

Machine learning techniques have been widely researched within the healthcare domain, with the aim of predicting and improving treatment outcomes (Senders, Staples, Karhade, Zaki, Gormley, Broekman, Smith & Arnaout 2018a). By exposing the algorithms to large amounts of information, new patterns could be learned to support practitioners in their decision making, thus improving the quality of healthcare (Senders, Arnaout, Karhade, Dasenbrock, Gormley, Broekman & Smith 2018b). Machine learning has been studied within domains similar to MMR.

Staartjes, Quddusi, Klukowska & Schröder (2020), for instance, conducted a pilot study to assess classification accuracy for different diagnostic categories in patients with lower back pain. Another study, by Santana, Cifre, Novaes de Santana & Montoya (2020), aimed to classify chronic pain conditions with the help of convolutional neural networks. Despite its prevalence within medical research, machine learning has not been utilized within MMR related studies.

1.1 Problem Domain

MMR programmes are expensive and time-consuming (Ringqvist, Dragioti, Björk, Larsson & Gerdle 2019). Since 2009, Swedish providers of MMR have received financial compensation when a patient completes treatment, due to a government initiative to reduce the rate of sickness-induced work absence (Enthoven, Molander, Öberg, Stålnacke, Stenberg & Gerdle 2017). Although these efforts are successful for many, some patients do not benefit from MMR (Ringqvist et al. 2019), and the initiative thereby fails to achieve its purpose for those patients.

Using cluster analysis, previous research has identified subgroups of patients with varying responses to treatment. Several studies have concluded that patients with the worst pre-treatment symptoms improve the most through treatment (Gerdle et al. 2019a; Gerdle et al. 2019b; Ringqvist et al. 2019). These subgroups tend to have a more complex clinical situation, as indicated by their more severe psychological symptoms, whereas those who benefit less tend to have a less complex collection of symptoms (Ringqvist et al. 2019). These findings have the potential to provide beneficial information to MMR clinics for their pre-treatment assessments of potential MMR patients. However, detailed assessments would be required (Gerdle et al. 2019a), adding to the already time-consuming process of selection and participation in MMR.

Rather than evaluating patients more thoroughly prior to treatment, the prediction of treatment outcome is an alternative that has been researched within other healthcare related domains (Hatton, Paton, McMillan, Cussens, Gilbody & Tiffin 2019; Pearson, Pisner, Meyer, Shumake & Beevers 2019; Senders et al. 2018a). It is a promising approach, since a patient's suitability for a treatment can be determined by predicting its outcome (Pearson et al. 2019).

Predicting outcomes of MMR using pre-treatment variables, however, has proven to be difficult. Ringqvist et al. (2019), for instance, aimed to investigate correlations and to determine whether pre-treatment variables could predict MMR outcomes. A multivariate improvement score, derived from multiple SQRP variables, was used as an outcome measure. While the authors were successful in fulfilling other objectives of their study, their OPLS regression yielded an R² value of 0.08 for their outcome variable. Consequently, they stated that outcome could not be predicted to a clinically useful extent with their method.

Machine learning based solutions for predicting treatment outcome have outperformed simpler statistical methods in several studies. In a comparison of predictive performance by Ulenberg, Belka & Bączek (2016), for instance, an SVM regressor outperformed both OPLS regression and multiple linear regression. Pearson et al. (2019) achieved similar results by comparing two machine learning models to a linear autoregressive model, with the former yielding lower root mean squared errors. Although advanced machine learning algorithms usually outperform simpler methods, it is also possible for them to perform equally (Deng, Fannon & Eckelman 2018; Hayward, Alvarez, Ruiz, Sullivan, Tseng & Whalen 2010). For instance, Deng, Fannon and Eckelman (2018) reported that gradient boosting performed very similarly to linear regression in predicting a variable. The study by Hayward et al. (2010) is a similar example of machine learning performing equally to simpler methods (linear and logistic regression) in some cases; furthermore, linear and logistic regression achieved higher R² values than the machine learning algorithms in predicting tumor sizes. This implies that the suitability of a machine learning technique for predicting a certain type of data needs to be determined through research.

If a machine learning model can accurately predict treatment outcomes, it has the potential to be used in clinical practice as a tool for decision making support, in order to optimize treatment provisioning (Pearson et al. 2019; Senders et al. 2018a). The selection of patients for MMR programmes is currently suboptimal, since some patients do not benefit from it (Ringqvist et al. 2019). Therefore, machine learning has potential utility within the domain of MMR, if outcomes can be predicted as accurately as they have been in other domains of healthcare.

1.2 Research Aim

For machine learning to be used in clinical practice for the purpose of outcome prediction, its predictive performance should be superior to that of a simpler, less resource-intensive method. This aligns with the principle of selecting the simplest of two alternative models, unless the more complex model has lower prediction errors (Tan, Steinbach, Karpatne & Kumar 2020, p. 177). Thus, if the machine learning solution does not outperform the simpler method, the former would be less preferable to use for the purpose of MMR outcome prediction in clinical practice. If the opposite is proven, however, this indicates that such a machine learning model may be of practical value. If so, further research could be conducted in order to determine whether its prediction accuracy is sufficient for use within clinical practice, and whether it can be used within clinical decision support systems. Therefore, this study aims to answer the following research question:

Can supervised machine learning outperform a simpler regression method in predicting outcomes of MMR?

In order to enable this comparison, a prototype that utilizes a machine learning algorithm to predict MMR outcomes has been constructed. In terms of this study, the resulting prototype is to be interpreted as a proof of concept, for the purpose of evaluating the use of machine learning to predict outcomes of MMR. Thus, the objective of this study is to develop and evaluate the prototype, in order to answer the research question and to draw conclusions on the usefulness of machine learning within the context of predicting MMR outcomes.

There are several potential benefits of this study. Firstly, it provides evidence for or against a machine learning approach being potentially useful for predicting MMR outcome using SQRP data, as well as evidence for the chosen machine learning algorithm being more, equally or less effective than the simpler algorithm for this purpose. Secondly, it provides further insight regarding the predictability of post-treatment SQRP variables. Lastly, it may provide useful prospective knowledge for implementing information systems for decision support within health care. As described in the previous subchapter, such an implementation could enable healthcare professionals to prioritize certain patients over others, which could be useful for MMR clinics.

1.3 Delimitations

This study has several delimitations. Firstly, SQRP data used in this study consists of patient data collected between the years of 2008 and 2015. This could potentially have an impact on the study’s relevance and results due to the possibility of outdated, missing and/or incomplete data.

Another delimitation concerns the choice of the machine learning algorithm. Several algorithms were considered, and the comparison could also have included other more computationally expensive methods. However, such alternatives were not explored due to time and resource limitations.

Two delimitations were imposed on how data was preprocessed in this study, in order to limit its scope. The first concerns the use of imputation to replace missing values. Although the imputation method used could have been optimized by comparing the effectiveness of different hyperparameter values, the default was used instead. The other delimitation is the use of a simple method for outlier removal, rather than one of the more complex methods used in previous SQRP studies.

Lastly, this study does not aim to construct a fully functional prototype. Therefore, no user interface will be designed and constructed, since it is not intended to be adopted in clinical practice in its current form.


2 Theory

This chapter introduces and explains the underlying concepts and theories of this thesis. The chapter begins with an explanation of MMR, followed by the SQRP and important concepts related to it. Subsequently, relevant information related to statistics and machine learning is presented.

2.1 Multimodal Rehabilitation (MMR)

MMR is a form of treatment for chronic pain that is carried out in specialized Swedish clinics. It is based on the biopsychosocial model, because of its underlying assumption that biological, psychological and social factors are imperative to the patient’s well-being. Because of this, the patient’s state of mind is deemed an important factor to consider within treatment. Therefore, both pain and mental aspects are considered during MMR (SBU 2010). The latter is achieved by including psychotherapy within the programmes, typically cognitive behavioral therapy (CBT), which is an important aspect of the rehabilitation process (Ringqvist et al. 2019).

MMR programmes are long-running processes that include non-medicinal treatments of pain. Examples of typical measures are CBT, pain related education, life counseling and physical fitness training (Lemming 2016). The purpose of MMR is not to completely remove pain, but to reduce it and its impact on the patient's mental health and ability to work (Lemming 2016; SBU 2006). To achieve this, several clinicians design a personally tailored MMR programme for the patient. These clinicians form a team that lasts for the duration of the treatment process, usually consisting of two to four therapists (SBU 2010). The types of clinicians within a team can vary; professionals such as doctors, physical therapists, occupational therapists, psychologists and nurses can be involved. The content of an MMR programme also varies, depending on the patient and the expertise of the involved specialists (SBU 2006). Furthermore, the length of programmes differs, ranging from less than 30 hours to over 100 hours in total, where the latter is considered an intensive programme (SBU 2015). Despite the differences between programmes, MMR in general has been shown to be more effective than standard treatment for chronic pain. For instance, MMR for chronic lower-back pain is more effective than treatment in non-specialist clinics for reducing pain and pain related activity inhibitions (SBU 2015).

2.2 Swedish Quality Registry for Pain Rehabilitation (SQRP)

The SQRP includes data gathered from Swedish specialized pain rehabilitation clinics. SQRP data consists of demographic data, data related to the patient's past and current life, as well as clinical data related to pain symptoms and psychological symptoms. The registry contains background variables for each patient, such as age, sex, level of education, country of birth and days of unemployment (Milton et al. 2013; Gerdle et al. 2019b). The SQRP also contains data related to the characteristics of the patient's pain, including variables for the level of pain the patient has felt during the past week, the duration of pain and the number of anatomical regions the patient has experienced pain in (Milton et al. 2013). The majority of SQRP data, however, is a product of surveys that are answered by patients before treatment, directly after treatment and at a 12 month follow-up (Nationella Kvalitetsregister 2019). The questionnaires and variables relevant to this study are described in the following subsections.

2.2.1 Multidimensional Pain Inventory (MPI)

The MPI questionnaire contains 61 items for a number of subscales, and the SQRP contains a variable for each subscale. The survey is divided into three parts.

The first part contains questions regarding chronic pain, mental health and the patient's life. Two of its five subscales are related to the impact of pain on the patient's life: Pain Interference and Life Control. The former measures pain related interference in a patient's daily life, and the latter measures the extent to which the patient feels a sense of control over his or her life. It also contains the Pain Severity subscale, which measures the level of pain the patient is experiencing. The final two subscales are related to perceived distress and perceived social support from a significant other. (Milton et al. 2013)

The second part of the MPI consists of questions regarding the patient's perception of a significant other's reactions to his or her pain induced suffering. It contains three subscales: Distracting Responses, Punishing Responses and Solicitous (caring and protective) Responses.

The final part of the questionnaire concerns the extent to which a patient carries out activities, and it can be combined into the General Activity Index (GAI). (Milton et al. 2013)

2.2.2 Hospital Anxiety and Depression Scale (HADS)

HADS is a measurement scale used to measure symptoms of anxiety and depression. It consists of seven items for each of its two subscales (HADS-A and HADS-D), both of which range from 0 to 21. These items consist of a set of questions that each patient has to answer, and each answer has a value of 0, 1, 2 or 3. For each subscale, these values are summed up to obtain a final score. Usually, a score of 7 or less on a subscale indicates no presence of anxiety or depression, a score between 8 and 10 indicates possible signs, and a score of 11 or more indicates severe symptoms. (Milton et al. 2013)
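As a toy illustration of this scoring rule, consider the following sketch; the helper name and the item answers are hypothetical, not part of the SQRP.

```python
# A toy illustration of HADS subscale scoring; seven items, each 0-3,
# are summed and mapped to the bands described above. Hypothetical data.
def hads_severity(items):
    score = sum(items)  # subscale score, 0-21
    if score <= 7:
        return score, "no indication"
    if score <= 10:
        return score, "possible symptoms"
    return score, "severe symptoms"

print(hads_severity([2, 1, 3, 2, 1, 2, 1]))  # -> (12, 'severe symptoms')
```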

2.2.3 The Short Form Health Survey (SF36)

SF36 measures a patient's perceived physical and mental health. SF36 variables have a standardized scale of 0 to 100, where 100 represents the best perceived health. The SQRP contains individual SF36 variables that measure perceived physical health, including pain intensity, and perceived mental health. Additionally, it contains the variables SF36-PCS (Physical Component Summary) and SF36-MCS (Mental Component Summary), which are summaries of the individual variables. (Molander et al. 2018)


2.2.4 The European Quality of Life instrument (EQ-5D)

The EQ-5D tool gives a generic measure of health across a spectrum of different disease types. It consists of two parts. The first part is a questionnaire that measures the patient's perceived state of health in regards to mobility, normal activities and mental health; the answers are summarized into an index value (EQ5D-index). The second part is a visual analogue scale on which the patient rates their own overall health, on a scale from 0 to 100 (EQ-VAS). A higher EQ-VAS value corresponds to a better perception of health. (Molander et al. 2018)

2.2.5 Health-related Quality of Life (Hr-QoL)

Hr-QoL indicates the overall impact of chronic pain on the patient's quality of life, and it has been recognized as an outcome domain for chronic pain treatment by several committees. Both EQ-5D and SF36 variables are measures of Hr-QoL aspects. (Molander et al. 2018)

2.3 Multicollinearity

SQRP data is affected by multicollinearity to some extent (Gerdle et al. 2019a; Gerdle et al. 2019b; Ringqvist et al. 2019). Multicollinearity is a result of two or more independent variables being highly correlated. Consequently, the results of regression analysis become more difficult to interpret, since multicollinearity leads to unstable regression coefficients and overestimated standard errors for individual variables. It is therefore a major issue for regression analysis that seeks to explain rather than predict (Schreiber-Gregory 2018). Although it is less of an issue for prediction, it should still be dealt with before performing regression, for instance by increasing the number of samples used for fitting the model (Ibid.). Other methods include omitting highly correlated independent variables, or using Principal Component Analysis (PCA) to combine such variables (Daoud 2017). Correlated independent variables can be identified by calculating their Variance Inflation Factor (VIF); a variable is considered to be highly correlated with others if it has a VIF value of 5 or higher (Daoud 2017).
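As a sketch of such a VIF check, the following computes VIFs with statsmodels and flags variables at the threshold of 5; the column names and values are hypothetical, not SQRP variables.

```python
# A minimal sketch of a VIF check; the two pain variables are deliberately
# near-duplicates, so their VIFs will be high. Column names are hypothetical.
import pandas as pd
from statsmodels.stats.outliers_influence import variance_inflation_factor
from statsmodels.tools.tools import add_constant

X = pd.DataFrame({
    "pain_severity":     [4.2, 5.1, 3.8, 6.0, 5.5, 4.9],
    "pain_interference": [4.0, 5.3, 3.5, 6.2, 5.4, 5.0],
    "age":               [34, 52, 41, 29, 60, 45],
})

Xc = add_constant(X)  # VIFs are computed on a design matrix with an intercept
vifs = pd.Series(
    [variance_inflation_factor(Xc.values, i) for i in range(1, Xc.shape[1])],
    index=X.columns,
)
print(vifs[vifs >= 5])  # variables considered highly correlated (VIF >= 5)
```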

An alternative to using PCA or omitting such variables is to use algorithms that can handle multicollinearity. One such algorithm is ridge regression, which is widely used and can effectively handle multicollinear data (Schreiber-Gregory 2018).

Machine learning algorithms can also be capable of handling multicollinearity within data. For instance, Dipnall, Pasco, Berk, Williams, Dodd, Jacka & Meyer (2016) showed that boosted regression trees can overcome the issue of multicollinearity by reducing the number of selected features during boosting.


2.4 Linear and Ridge Regression

Linear regression is a statistical method that attempts to model linear relationships between one or more independent variables and one dependent variable (target). This linear method can be used to predict the value of a dependent variable using a single independent variable, or using multiple independent variables (multiple linear regression). The equation of multiple linear regression is presented below (Equation 1), where each $X_n$ is an independent variable, $\beta_0$ is the intercept, each $\beta_n$ is a regression coefficient (describing the change in $Y$ resulting from a change of 1 in $X_n$), $\varepsilon$ is the model's residuals and $Y$ is the dependent variable.

$$Y = \beta_0 + \beta_1 X_1 + \beta_2 X_2 + \dots + \beta_n X_n + \varepsilon$$

Equation 1. Multiple linear regression.

Although it is a relatively simple method to implement and understand, linear regression assumes that the input variables are all independent from each other. As a result, it is vulnerable to the effects of multicollinearity that are described in chapter 2.3.

Ridge regression is a linear method that can predict a dependent variable using several independent variables, similarly to multiple linear regression. A key difference, however, is that it can also handle multicollinearity. As described by Mason & Brown (1975) “The purpose of ridge regression is to reduce the high variances of the estimated coefficients…”. This method is considered to be a superior approach in comparison to completely leaving out highly correlated features, due to the possibility of losing important information when performing the latter (Schreiber-Gregory 2018; Mason & Brown 1975).

Ridge regression makes use of regularization. Regularization can be understood as a penalty added to model parameters, in order for the model to generalize the data and avoid overfitting, which is considered a side effect of multicollinearity (Schreiber-Gregory 2018). A large enough penalty makes the model more resilient to multicollinearity; an overly large penalty, however, can result in an under-fitted model (Schreiber-Gregory 2018).

Ridge regression has several hyperparameters. The Ridge implementation in the Python package Scikit-learn (Scikit-learn 2020-05-02), for instance, has parameters with default values that can be changed prior to execution. The alpha parameter is a positive float that controls the strength of the regularization penalty: the larger the alpha value, the stronger the penalty and the more the variance of the coefficients is reduced. fit_intercept and normalize are two parameters that take boolean values. The former is true by default and determines whether the intercept of the model is calculated (normalizing the data in the process); if false is passed, the data is expected to be already normalized. The latter normalizes the data, though only if fit_intercept is true. The implementation has several other parameters as well.
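A minimal fitting sketch, assuming Scikit-learn's Ridge; the synthetic data and the alpha value are illustrative.

```python
# A minimal sketch of fitting scikit-learn's Ridge; a larger alpha means a
# stronger regularization penalty on the coefficients.
import numpy as np
from sklearn.linear_model import Ridge

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 3))
y = 2.0 * X[:, 0] - 1.0 * X[:, 1] + rng.normal(scale=0.5, size=100)

model = Ridge(alpha=1.0, fit_intercept=True)
model.fit(X, y)
print(model.coef_, model.intercept_)  # shrunken coefficients and intercept
print(model.predict(X[:5]))           # predictions for the first five samples
```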


2.5 Machine Learning

Machine learning (ML) branches from the domain of artificial intelligence, and is a broad term often used to describe mathematical algorithms that, applied through computer based methods, uncover complex relationships in data and provide highly accurate predictions without the need to be explicitly programmed (Obermeyer & Emanuel 2016). As described by Patel, Khalaf and Aizenstein (2016), ML is "...a group of methods (that are) used to develop prediction models from empirical data to make accurate predictions about new data.". Such models are normally used to aid humans in multiple domains, such as medicinal diagnosis (Manogaran, Vijayakumar, Varatharajan, Malarvizhi Kumar, Sundarasekar & Hsu 2018), treatment outcome prediction (Hatton et al. 2019), fraud detection (Herland, Bauder & Khoshgoftaar 2020), CRM (Aluri, Price & McIntyre 2019) and the automotive industry (Awan, Saleem, Minerva & Crespi 2020). ML is usually grouped into either supervised or unsupervised learning (Rokach & Maimon 2008, pp. 4-5; Senders et al. 2018a).

Supervised learning often refers to learning algorithms that are trained on desired outputs, referred to as labeled data, thus allowing them to predict unseen data based on previously learned labels. Unsupervised learning refers to learning algorithms that are trained on unlabeled data. In this case, the algorithms rely on the structural properties of the data, clustering samples based on feature similarity. Within supervised learning, regression and classification methods can be applied (Rokach & Maimon 2008, p. 5). The key difference between them is that regression attempts to predict a continuous value (e.g. a real estate price) while classification attempts to classify the data (e.g. customer churn/no-churn or email spam/not-spam). The success of supervised learning is usually dependent on its parameters. These parameters, often called hyperparameters, differ from one algorithm to another (Jiménez et al. 2009; Tan et al. 2020, p. 188).

Yet, ML does not come without its own set of difficulties, as ML algorithms can be prone to overfitting and underfitting. Underfitting occurs when a trained model is too simple, meaning that the model is unable to map the relation between the input and output features (Tan et al. 2020, p. 169). Overfitting occurs when a model fits perfectly to a training set but fails to predict new unseen data, thus widening the gap between training error and test error (Tan et al. 2020, p. 169).

2.5.1 Preprocessing

Data preprocessing consists of a group of methods and techniques that are used to clean and reduce the amount of data. This is done because, most of the time, data is imperfect and can contain missing values, noise, and irrelevant or redundant data (García, Ramírez-Gallego, Luengo, Benítez & Herrera 2016).

Preprocessing involves data preparation and feature reduction. The former consists of several methods. One such method is data transformation, which comprises multiple techniques, the most common being feature binarization and discretization. Simply put, discretization is the process of transforming continuous values into categorical ones, and binarization is the process of transforming either continuous or categorical values into binary representations (Jerez, Molina, García-Laencina, Alba, Ribelles, Martín & Franco 2010; Tan et al. 2020, p. 83). Other methods for data preparation are data cleaning, in order to fix errors within the data, removal of outliers, and imputation to replace missing data (García et al. 2016). Feature reduction consists of lowering the number of features without reducing model performance, in order to remove redundant and/or irrelevant features (Tan et al. 2020, pp. 76-77).

2.5.1.1 Correlation-based Feature Selection (CFS)

CFS is based on the assumption that input features should be correlated with the target and uncorrelated with each other. It entails the selection of such features, thereby excluding the others. At its heart, the method is based on the following hypothesis: "Good feature subsets contain features highly correlated with the class, yet uncorrelated with each other.". CFS achieves this by ranking subsets of features instead of individual features. (Hall 2000)

CFS is capable of improving the performance of several types of ML algorithms, including Naive Bayes (a probabilistic learner), decision tree regressors and k-Nearest Neighbors (Hall 2000; Hayward et al. 2010). Hall (2000) concluded that CFS drastically reduces the size of the dataset while still maintaining or improving the performance of the previously mentioned ML algorithms.
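To illustrate the principle, the sketch below implements a simplified greedy forward search using Hall's merit heuristic; it is an illustration under these simplifying assumptions, not Hall's exact implementation, and the example data is synthetic.

```python
# A simplified CFS-style greedy forward search; merit rewards subsets whose
# features correlate with the target but not with each other (Hall 2000).
import numpy as np
import pandas as pd

def merit(subset, X, y):
    k = len(subset)
    r_cf = np.mean([abs(X[f].corr(y)) for f in subset])  # feature-target correlation
    r_ff = 0.0 if k == 1 else np.mean(
        [abs(X[a].corr(X[b])) for a in subset for b in subset if a != b]
    )  # average feature-feature correlation
    return (k * r_cf) / np.sqrt(k + k * (k - 1) * r_ff)

def cfs(X, y):
    selected, remaining, best = [], list(X.columns), -np.inf
    while remaining:
        scores = {f: merit(selected + [f], X, y) for f in remaining}
        f, score = max(scores.items(), key=lambda kv: kv[1])
        if score <= best:
            break  # no candidate improves the subset merit
        selected.append(f)
        remaining.remove(f)
        best = score
    return selected

# Example: two mutually correlated informative features and one noise feature.
rng = np.random.default_rng(0)
a = rng.normal(size=100)
X = pd.DataFrame({"f1": a,
                  "f2": a + rng.normal(scale=0.1, size=100),
                  "f3": rng.normal(size=100)})
y = pd.Series(a + rng.normal(scale=0.5, size=100))
print(cfs(X, y))  # the greedily selected feature subset
```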

2.5.1.2 KNN Imputation

KNN imputation is an imputation method that uses the k-Nearest Neighbors algorithm to replace missing data. Since it is KNN-based, it takes an argument (k) for the number of neighboring samples to find for each missing value. These k neighbors are the non-missing samples that are closest to the missing value, and the optimal k-value to use is often determined by cross-validation (see chapter 2.5.6.3). Each missing value is replaced with the mean (for numerical values) or the mode (for non-numerical values) of the identified neighbors. (Jerez et al. 2010)
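For numerical data, this corresponds to what Scikit-learn's KNNImputer does, as sketched below; the small matrix is illustrative.

```python
# A minimal sketch of KNN imputation; the missing value (NaN) is replaced
# by the mean of its k nearest non-missing neighbors.
import numpy as np
from sklearn.impute import KNNImputer

X = np.array([[1.0, 2.0],
              [3.0, np.nan],
              [4.0, 5.0],
              [5.0, 6.0]])

imputer = KNNImputer(n_neighbors=2)  # k = 2 neighboring samples
print(imputer.fit_transform(X))      # NaN becomes the neighbors' mean
```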

2.5.2 Grid Search

Grid search is a technique used to choose appropriate model hyperparameters. It is often seen as an important step, since model performance usually depends on hyperparameter tuning. It works by testing different sets of parameter values for a model, in order to find the best combination based on prediction accuracy. There are multiple variations of grid search, the most common being deterministic and stochastic grid searches. (Jiménez et al. 2009)

Deterministic grid search makes complete use of the hyperparameter range: for each individual parameter, an exhaustive list of possible values is provided to the grid search (Jiménez et al. 2009). As an example, the learning_rate parameter of XGBoost (chapter 2.5.5) has a minimum value of 0 and a maximum value of 1. If a deterministic grid search were employed, values spanning the whole range from 0 to 1 would be provided to the grid, making the search as exhaustive as possible within reasonable boundaries.

Stochastic (or random) grid search instead uses hyperparameter values that are chosen at random rather than evenly spread (Jiménez et al. 2009). In an example similar to the one above, values for learning_rate between 0 and 1 would be provided at random for this particular grid search.
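The two variants can be sketched with Scikit-learn's GridSearchCV (deterministic) and RandomizedSearchCV (stochastic); the parameter ranges below are illustrative, not the grids used in this study.

```python
# A minimal sketch of deterministic vs. stochastic hyperparameter search;
# the parameter values are illustrative.
from scipy.stats import uniform
from sklearn.model_selection import GridSearchCV, RandomizedSearchCV
from xgboost import XGBRegressor

# Deterministic: every listed combination is evaluated.
param_grid = {"learning_rate": [0.01, 0.05, 0.1, 0.3], "max_depth": [3, 5, 7]}
grid = GridSearchCV(XGBRegressor(), param_grid, cv=5)

# Stochastic: learning_rate is drawn at random from the interval [0, 1].
param_dist = {"learning_rate": uniform(0, 1), "max_depth": [3, 5, 7]}
rand = RandomizedSearchCV(XGBRegressor(), param_dist, n_iter=20, cv=5, random_state=0)
# grid.fit(X, y); rand.fit(X, y)  # given training data X, y
```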

2.5.3 Decision Trees

Decision trees can be used for either regression or classification purposes. They are called decision trees because they resemble a tree-like shape during the decision making process, consisting of test nodes and leaf nodes. Each test node, also known as an internal node, splits the data based on a certain feature value (Rokach & Maimon 2008, pp. 8-9). A leaf node, in turn, contains the final decision value for regression, or class label for classification. For example, if a classification tree were used to classify whether a person is able to buy alcoholic beverages, based on their age, then a test node would look at the value of the age feature and split the data based on the observed value. If the person is underage, the label "Not allowed" would be applied; otherwise the label "Allowed" is applied, since the person is within the legal age of alcohol consumption. Decision trees continue splitting on such conditions until a node is pure or the maximum allowed tree depth is reached. A node is considered pure when the tests result in a single outcome for all objects, such as all persons being tested as underage in the previous example. If a node is pure, the value of the resulting leaf node becomes the predicted value for those objects. (Tan et al. 2020, pp. 139-147)

Although the main goal of decision trees is of a predictive nature (as with most ML algorithms), they are also used for exploratory purposes, since the decision making path is comparable to how humans make decisions and is thus easy to interpret (Rokach & Maimon 2008, pp. 6-7).
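As an illustration of these splitting rules, the following sketch fits a shallow regression tree with Scikit-learn and prints its test nodes and leaf values; the synthetic data is purely illustrative.

```python
# A minimal sketch of a regression tree; export_text prints each internal
# (test) node as a condition and each leaf as a predicted value.
from sklearn.datasets import make_regression
from sklearn.tree import DecisionTreeRegressor, export_text

X, y = make_regression(n_samples=200, n_features=3, random_state=0)
tree = DecisionTreeRegressor(max_depth=2, random_state=0)  # shallow, for readability
tree.fit(X, y)
print(export_text(tree))
```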

2.5.4 Gradient Boosting and Boosted Regression Trees (BRT)

Gradient boosting is an ML technique that is used to improve general model performance. It does so by repeatedly running a weaker algorithm (most often a simple decision tree) on multiple random training sets. During this boosting process, new decision trees are created based on previous ones. At the end of the boosting process, all previously created trees are aggregated in order to predict a class or a continuous value. Boosting redistributes the weights of training samples based on incorrect predictions by previous trees, thus putting more emphasis on the parts that are more difficult to predict, which results in increased general performance. (Freund & Schapire 1996)

Although decision tree boosting is a general term used when boosting is applied to decision trees, it often refers to classification. Boosted Regression Trees (BRT) refer instead to boosting applied to decision trees for regression purposes. BRTs take a somewhat different approach compared to classification, as BRT is a form of functional gradient descent: it attempts to minimize the loss function by subsequently creating new trees. A loss function calculates the errors of the prediction model, and the model adjusts its behavior accordingly; thus, the lower the value of the loss function, the better the model is at making correct predictions. Each new tree attempts to reduce the loss function as much as possible, hence the term "gradient descent": as new trees are created, the loss function is minimized by descending through the trees. The initial tree is the one that maximally reduces the loss function, whereas subsequently created trees focus on reducing residual values, that is, variation that is not explained by the model. (Elith, Leathwick & Hastie 2008)

2.5.5 eXtreme Gradient Boosting (XGBoost)

XGBoost is a parallel tree boosting algorithm that can be used for either classification or regression, and is capable of handling large amounts of sparse data without being limited by low computational power (Chen & Guestrin 2016; XGBoost 2020-04-22). XGBoost is a variation of gradient boosting that improves scalability, portability, memory efficiency and predictive performance compared to standard gradient boosting (Chen & Guestrin 2016). XGBoost builds multiple decision trees in order to predict classes for classification or values for regression, where every tree is evaluated using a scoring function.

Several hyperparameters can be configured for XGBoost regression prior to model training. The n_estimators parameter is an integer that defines the number of trees in the model. Tree depth can be limited with the max_depth parameter, which can be increased to make trees more complex but more prone to overfitting. For boosted regression trees, there is a performance trade-off between the number of trees and tree complexity (Elith, Leathwick & Hastie 2008); thus, if the max depth is increased, for instance, the n_estimators value should be decreased. In order to control overfitting, the splitting of trees into new leaf nodes can be restricted with the min_child_weight and gamma parameters, both of which have a minimum value of 0. Unlike the previous parameters, several parameters with a value range of 0 to 1 exist as well. One such parameter is the learning_rate, which shrinks feature weights and can prevent overfitting. Another is subsample, which determines what fraction of the training data is randomly sampled in each iteration; a lower value can also prevent overfitting. Others are the three colsample (column sample) parameters, which determine the ratio of column subsampling by tree, level or node. (XGBoost 2020-04-24)
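A minimal configuration sketch using the Scikit-learn-compatible XGBRegressor follows; the parameter values are illustrative and are not the values selected by the grid search in this study.

```python
# A minimal sketch of configuring the XGBoost hyperparameters described
# above; the values are illustrative, not tuned recommendations.
from xgboost import XGBRegressor

model = XGBRegressor(
    n_estimators=300,      # number of trees in the model
    max_depth=4,           # limits tree complexity
    learning_rate=0.1,     # shrinks feature weights each boosting step
    min_child_weight=1,    # restricts splitting into new leaf nodes
    gamma=0.0,             # minimum loss reduction required for a split
    subsample=0.8,         # fraction of training rows sampled per iteration
    colsample_bytree=0.8,  # fraction of columns sampled per tree
)
# model.fit(X_train, y_train); model.predict(X_test)  # given training data
```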

2.5.6 Evaluation

Classification and regression methods use different evaluation metrics, due to the nature of their outputs. In regression analysis, metrics such as the coefficient of determination (R²), adjusted R², Mean Squared Error (MSE), Mean Absolute Error (MAE) and Root Mean Squared Error (RMSE) are used. Some of these methods are described below.

2.5.6.1 R²

R² (Equation 2) is a statistical measure of how well a regression line fits the data. It represents the proportion of variance in the output value that is predictable from the input values. The value of R² is normally between 0 and 1, where 0 represents an absence of predictability, while 1 represents complete predictability of the output value based on the input values.

$$R^2 = 1 - \frac{SS_{\mathrm{res}}}{SS_{\mathrm{tot}}}$$

Equation 2. R², where $SS_{\mathrm{res}}$ is the residual sum of squares from the regression, which measures how far the observations lie from the model, and $SS_{\mathrm{tot}}$ is the total sum of squares, which measures the variation in the observed data.

2.5.6.2 MSE and RMSE

MSE measures the average of the squared errors, with the goal of evaluating the level of error in model predictions. This is achieved by calculating the distances from the data points to the regression line and squaring them, so that negative values do not cancel positive ones (data points below the regression line have negative errors, while data points above it have positive errors). The lower the MSE value, the closer the fit is to the data. RMSE (Equation 3) is the square root of MSE. It is used to express the errors in the same unit as the data. (Salkind 2010, p. 1287)

$$\mathrm{RMSE} = \sqrt{\frac{1}{n}\sum_{i=1}^{n}(p_i - x_i)^2}$$

Equation 3. RMSE, where $(p_i - x_i)^2$ is the squared difference between the predicted and observed values and $n$ is the number of observations.
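Both measures, together with R² from the previous subsection, can be computed as in the following sketch; the observed and predicted values are illustrative.

```python
# A minimal sketch of computing R², MSE and RMSE with scikit-learn;
# the observed (y_true) and predicted (y_pred) values are illustrative.
import numpy as np
from sklearn.metrics import mean_squared_error, r2_score

y_true = np.array([3.0, 5.0, 2.5, 7.0])
y_pred = np.array([2.8, 5.4, 2.9, 6.5])

mse = mean_squared_error(y_true, y_pred)  # average squared error
rmse = np.sqrt(mse)                       # same unit as the target (Equation 3)
r2 = r2_score(y_true, y_pred)             # explained variance share (Equation 2)
print(mse, rmse, r2)
```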

2.5.6.3 Cross-validation

Cross-validation (CV) is a method used for model evaluation. Depending on the CV strategy, data is usually partitioned into a training and a test set, where the training set is used to train the model and the test set is used to validate how well the model performs. This process is repeated a number of times, with a different set of data selected as the test set for each run, as displayed in Figure 1. For k-fold cross-validation, the data is first split into k sets of equal size, and the CV process is then repeated for k iterations. (Patel, Khalaf & Aizenstein 2016)

(18)

Figure 1: Example of a 3-fold cross-validation. In the first run, the first set is used as the test set and the second and third sets as training sets.
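As a sketch of how such an evaluation can be run, the following uses Scikit-learn's cross_val_score with k = 3, mirroring Figure 1; the Ridge regressor and synthetic data are illustrative.

```python
# A minimal sketch of 3-fold cross-validation (as in Figure 1) with
# scikit-learn; the regressor and the synthetic data are illustrative.
from sklearn.datasets import make_regression
from sklearn.linear_model import Ridge
from sklearn.model_selection import cross_val_score

X, y = make_regression(n_samples=150, n_features=5, noise=10.0, random_state=0)
scores = cross_val_score(Ridge(), X, y, cv=3, scoring="r2")
print(scores)         # one R² value per fold
print(scores.mean())  # the cross-validated mean R²
```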

2.5.6.4 Feature Importance

Since XGBoost is a tree based algorithm, it can generate a list of the most important features used during boosting. This list of feature importances is derived from the chosen importance metric. XGBoost accepts one of the following metrics for calculating feature importance: gain, weight, cover, total gain and total cover (XGBoost 2020-05-07). Some of the metrics are described below, and a short retrieval sketch follows the list.

● Weight represents the number of times a single feature was used for decision making during decision tree creation within the XGBoost model. The more a feature is used for splitting, the more importance it gets. (Shi, Wong, Li, Palanisamy & Chai 2019)

● Gain measures the decrease of node impurity. For each feature, gain is determined by calculating the average decrease in impurity resulting from its test node splits (Shi et al. 2019).
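A minimal retrieval sketch, assuming Scikit-learn's XGBoost wrapper; the synthetic data and fitted model are illustrative, and unnamed features appear as f0, f1 and so on.

```python
# A minimal sketch of retrieving feature importance from a fitted XGBoost
# model; the synthetic regression data here is purely illustrative.
from sklearn.datasets import make_regression
from xgboost import XGBRegressor

X, y = make_regression(n_samples=200, n_features=4, random_state=0)
model = XGBRegressor(n_estimators=50).fit(X, y)

booster = model.get_booster()
print(booster.get_score(importance_type="weight"))  # split counts per feature
print(booster.get_score(importance_type="gain"))    # average gain per split
```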
