Hospital Readmission Risk Prediction Using Machine Learning


LOUISE ABRAHAMSSON KWETCZER

KTH ROYAL INSTITUTE OF TECHNOLOGY


Master in Computer Science, specializing in Machine Learning
Date: June 10, 2020
Supervisor at KTH: Pawel Herman
Supervisor at Cambio: Marcus Petersson
Examiner: Erik Fransén
School of Electrical Engineering and Computer Science
Host company: Cambio Healthcare Systems
Swedish title: Prediktion av återinskrivningsrisk med hjälp av maskininlärningsmetoder


Abstract

Hospital readmissions result in a higher risk of mortality, more difficult treatments and increased costs for the healthcare sector. Decreasing the number of readmissions is therefore of interest. This thesis investigates the usefulness of machine learning for hospital readmission risk prediction, and further examines which features are important for predicting the readmission risk. Several feature selection methods and machine learning algorithms were examined in order to find the features that had the most impact on the readmission risk. The methods were applied to data collected from Uppsala Akademiska Sjukhus, consisting of information stored in electronic health records. Two different definitions of readmission were considered; one proved to be much harder to model. The features were selected using a combination of recursive feature elimination and statistical testing. The number of previous admissions was the most predictive feature, and counting previous admissions over multiple time windows enhanced the performance of the models further. The importance of features was evaluated based on their ability to separate the two classes using different classifiers. Logistic regression had the best performance and a short training time. The AUC score, recall and precision of the best model were 0.708, 0.311 and 0.439, respectively, yielding a model with modest discriminative capabilities. The difficulty of predicting whether a patient would be readmitted was found to vary depending on which disease the patient suffered from. For example, a model devised specifically for heart failure patients performed better than the model trained on the entire dataset.


Sammanfattning

Hospital readmissions result in a higher risk of mortality, more difficult treatments and increased costs for the healthcare sector; the number of readmissions should therefore be reduced. This degree project investigates the possibility of using machine learning for readmission prediction, and further aims to identify the variables that affect the readmission risk. Several methods for selecting important variables, as well as different machine learning algorithms, were examined in order to find the variables with the greatest impact on the readmission risk. The methods were applied to data consisting of electronic health records collected from Uppsala Akademiska Sjukhus. Two different definitions of readmission were used, one of which proved much harder to model. The variables were selected through recursive elimination and statistical tests. The variable that affected the result the most was the number of previous admissions, and counting previous admissions over different time periods improved the result further. The importance of a variable was evaluated by examining the ability of different classifiers to separate the two classes. Logistic regression was the model that gave the best results and had the shortest training time. The best model had an AUC score of 0.708, recall of 0.311 and precision of 0.439, corresponding to a model with modest discriminative capabilities. The difficulty of predicting whether a patient was readmitted or not was found to vary greatly depending on which disease the patient had. Creating a model specific to heart failure patients gave better results than using the model built to make predictions on the whole dataset.


Acknowledgements

I would like to express my gratitude to my supervisors at Cambio Healthcare Systems and KTH, your respective guidance and advice have been of great importance.

I would like to thank Associate Professor Pawel Herman at KTH for taking the time to provide good feedback and insights during this project. Your comments and suggestions have been of great value to me.

I would like to thank everyone at Cambio for providing a comfortable atmosphere to work in and for showing interest in this project. Finally, I would like to extend a huge thank you to my supervisor at Cambio, Marcus Petersson. You have shown a great interest in this project and have always been available to answer any questions. Your input and support have been greatly appreciated.

Louise Abrahamsson Kwetczer
Stockholm, April 2020


Contents

1 Introduction
  1.1 Research question
  1.2 Scope and delimitations
  1.3 Thesis outline
2 Background
  2.1 Feature selection
    2.1.1 Filter methods
    2.1.2 Wrapper methods
    2.1.3 Pros and cons with the feature selection methods
  2.2 Classification methods
    2.2.1 Logistic regression
    2.2.2 Random forest
    2.2.3 Support vector machine
    2.2.4 Multilayer perceptron
  2.3 Metrics
    2.3.1 Basic metrics
  2.4 Statistical hypothesis testing
  2.5 Related work
    2.5.1 Readmission risk prediction for a specific patient group
    2.5.2 All-cause readmissions
    2.5.3 Summary
3 Methods
  3.1 Definition of readmission
  3.2 Data explanation
  3.3 Resampling
  3.4 Experimental procedure
4 Results
  4.1 Initial data analysis
    4.1.1 Filtering with univariate independent statistical tests
    4.1.2 Correlations
  4.2 Feature categories
  4.3 Comparison of readmission definitions
    4.3.1 Relaxed definition – readmitted within 30 days
    4.3.2 Strict definition – readmitted within 30 days with the same diagnosis
    4.3.3 Comparison of definitions
  4.4 Heart failure
  4.5 Comparison of diagnoses
  4.6 Resampling ratio and threshold variation
5 Discussion
  5.1 Feature importance
  5.2 Readmission definitions
  5.3 Diagnosis specific models versus general model
  5.4 Risk prediction versus classification
  5.5 Ethics, sustainability and social impact
6 Conclusion
  6.1 Future work
Bibliography
A ROC and precision recall curves
  A.1 ROC curves

Introduction

Healthcare is a central part of every modern society, and nations invest substantial resources in their healthcare systems each year. Even so, an aging and growing population puts considerable strain on the system and on hospitals. With a limited number of doctors, hospital beds, surgeons, etc., it is crucial that these resources are utilised as efficiently as possible.

As societies become more digitized, so do hospitals. Nowadays, all patient records and administrative data, e.g. surgeon availability and room availability, are kept in electronic format. This has allowed for the development of multiple decision support systems, helping doctors and other administrative personnel make informed decisions on how to utilize the available resources more efficiently [1]. Some examples are diagnostic support systems, patient management systems and alarm systems [2], [3]. Naturally, these decision support systems have evolved, and today there is huge interest in applying machine learning methods within healthcare in order to develop more advanced support systems. More advanced algorithms, faster computers and larger amounts of data are all factors that have led to the increased utilisation of machine learning methods.

In this project, readmissions, or re-hospitalizations, in a Swedish hospital are examined. Readmission means that a patient has been admitted to and discharged from a hospital and is admitted again within a certain number of days. Hospital readmission risk prediction can be used to identify high-risk patients who are in need of more treatment, which can lead to better utilization of hospital resources. Unplanned readmissions result in a higher risk of death and longer length of stay [4]. Therefore, they are commonly used as an indicator of hospital performance. Hence, by using machine learning to predict the readmission risk of a patient, a doctor is able to make a more informed decision and thus increase hospital performance.

The ultimate goal would be to incorporate the predictions as decision support for medical personnel. If a machine learning algorithm can make use of a patient's available medical history, as well as present vital parameters, to produce a prediction within a few seconds, this would be of great use to hospital staff, who most likely do not have time to go through a patient's entire medical history. Today, doctors decide when a patient is discharged by looking at the patient's status and drawing on years of experience; machine learning could further support such decisions. Multiple studies have been conducted in this area; however, the majority of previous projects focus on patients suffering from a specific disease, e.g. congestive heart failure or diabetes [5]. The most common model used for predicting the readmission risk is logistic regression, but in recent years deep learning techniques have been utilized more frequently [6], [7]. One limiting factor of the performance of a machine learning algorithm is the information available in the features used to make predictions. In previously published works, different features have been reported as important depending on which disease was investigated: some report that the notes in the medical journals are important [6], others the number of previous readmissions [8], or the length of stay and age [9].

1.1 Research question

In order to better understand the applicability of machine learning to hospital readmission risk prediction, this degree project aims at investigating the relevance of medical features for predicting the risk of readmission. The research question is thus:

What features are of key importance to the prediction of risk for hospital readmission?

In this project, a dataset with a wider range of diseases was utilised, to enable the investigation of general models and to facilitate insights into the major predictive parameters.


1.2 Scope and delimitations

Feature selection is not a new topic; multiple methods have been developed to find the features that matter most for a given classification problem. To limit the scope of this project, features are mainly evaluated by their usefulness for the predictions, by examining the performance of classifiers. Another limitation is that the available data consisted of information accessible at the end of a treatment. Hence, the models were trained on all data and cannot be compared with a doctor's assessment halfway through the treatment. Further, the data was obtained from one hospital in Sweden, which might lead to the model being fitted to certain social or geographic factors. Finally, the available data consisted of information provided in electronic health records¹, EHR, which include a wide range of data, for example demographics, medical history, laboratory test results, vital parameters, and personal statistics such as weight. No free-text journal data was used in this project. The free text includes all notes taken by a doctor and provides information about the patient's health situation that might not be available in the structured EHR data.

1.3 Thesis outline

Chapter 2 gives an introduction to the theory used in this project and summarizes related work in the area of readmission risk prediction. Chapter 3 explains the available data and describes the definitions of readmission used in the project; furthermore, it explains the experiments performed to investigate the research question. Chapter 4 presents the results of the experiments together with statistical significance testing. Chapter 5 discusses the findings and looks at the ethical and sustainability impacts of readmission risk prediction. Finally, Chapter 6 concludes the findings and answers the posed research question; it also presents some recommendations for future work.

¹ What is an electronic health record (EHR)?,


Background

The following chapter outlines the theoretical background of the methods used in this thesis as well as related work. Section 2.1 introduces feature selection methods, Section 2.2 goes through classification algorithms, Section 2.3 describes evaluation metrics and Section 2.4 focuses on statistical hypothesis testing. Section 2.5 outlines relevant work done by other research groups within the subject.

2.1 Feature selection

There are three main families of feature selection methods: filter, wrapper and embedded methods. In this project, filter and wrapper methods were utilised. Filter methods use univariate analysis to select features: they measure the predictive relationship between an independent variable and the dependent variable regardless of model, for example through correlations. Wrapper methods select features based on how useful each feature is when used to train a model.

2.1.1 Filter methods

P-value tests A p-value is the probability of obtaining a result at least as extreme as the one observed, assuming that some null hypothesis is true. A smaller p-value thus indicates that the null hypothesis should be rejected, as the observed outcome would be unlikely had it been true.

One way of doing feature selection is to compare the distributions of a parameter between the different classes. What we want to investigate is whether we can reject the hypothesis that two samples originate from the same distribution when they belong to different classes. Three different statistical tests were used depending on the type of variable. The tests were not used to perform statistical inference; rather, they were used to guide the feature selection. Each variable was treated independently.

Binomial test First, binary variables are considered. Consider N random binary variables {X_i}_{i=1}^N and form

Y = Σ_{i=1}^N X_i

If the {X_i} are i.i.d. and P(X_i = 1) = p, then

Y ~ Bin(N, p)

Hence, a sum of i.i.d. random binary variables is binomially distributed [10]. With N given for any set of binary variables, e.g. for a certain diagnosis, only p for that specific group is unknown. Hypothesize then that p = p_0, where p_0 is the observed overall readmission rate. Then, using the CDF of the binomial distribution and given the set of observed variables {x_i}_{i=1}^N and y = Σ_i x_i, the p-value is calculated as

p-value = P(Y ∉ [E[Y] − z, E[Y] + z])
        = 1 + F_Y(N p_0 − z) − F_Y(N p_0 + z − 1)

where

z = |y − E[Y]| and Y ~ Bin(N, p_0)
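As an illustration, the two-sided test above can be computed directly from the binomial CDF. This is a minimal sketch assuming SciPy is available; the helper name `binomial_pvalue` and all counts are invented for illustration, not taken from the thesis data.

```python
# Sketch of the two-sided binomial p-value formula above,
# using scipy.stats.binom for the CDF F_Y.
from scipy.stats import binom

def binomial_pvalue(y, N, p0):
    """p-value for observing y successes out of N under H0: p = p0."""
    mean = N * p0                 # E[Y] under the null hypothesis
    z = abs(y - mean)
    Y = binom(N, p0)
    # P(Y outside [E[Y] - z, E[Y] + z]) = 1 + F(N*p0 - z) - F(N*p0 + z - 1)
    return 1 + Y.cdf(mean - z) - Y.cdf(mean + z - 1)

# a group whose observed rate (40/100) is far above p0 = 0.25
p_low = binomial_pvalue(y=40, N=100, p0=0.25)
# a group whose rate is close to p0, so the null is not rejected
p_high = binomial_pvalue(y=26, N=100, p0=0.25)
```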

Chi-squared For categorical variables, a χ²-test can be used for significance testing. This section is based on the theory in [11]. Unlike the binomial test, the χ²-test is not exact; instead, the central limit theorem is used to argue that the test statistic approaches the χ² distribution.

Consider n random variables X_i, taking values 1, 2, . . . , k. The test statistic is then evaluated as

Γ = Σ_{i=1}^k (O_i − E_i)² / E_i

where O_i denotes the number of observed occurrences of value i and E_i is the expected number of occurrences of value i. By the central limit theorem, as n → ∞, O_i → E_i, which results in Γ approximately being a χ²-distributed random variable. The significance level is then easily computed using the χ² distribution.

Denote the observed occurrences taking value i by R_i and N_i for the readmitted and non-readmitted patients, respectively. In this case, the null hypothesis is that the readmitted patients have a similar distribution to the non-readmitted ones. Hence, the expected number of occurrences, E_i, is given by αN_i for

α = (total number of readmitted patients) / (total number of non-readmitted patients)
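The test as described can be sketched in a few lines. This assumes SciPy is available; the category counts are invented illustration values, not data from the thesis.

```python
# Chi-squared test where the non-readmitted counts N_i define the
# expected distribution E_i = alpha * N_i for the readmitted group.
import numpy as np
from scipy.stats import chi2

readmitted = np.array([30, 50, 20])        # R_i, observed per category
non_readmitted = np.array([300, 400, 300])  # N_i, observed per category

alpha = readmitted.sum() / non_readmitted.sum()
expected = alpha * non_readmitted           # E_i
gamma = ((readmitted - expected) ** 2 / expected).sum()  # test statistic
dof = len(readmitted) - 1
p_value = chi2.sf(gamma, dof)               # P(chi2(dof) > gamma)
```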

Two-sample Kolmogorov–Smirnov test For continuous random variables, a two-sample Kolmogorov–Smirnov test can be applied for significance testing. This section is based on the theory in [12]. Given two samples of empirical cumulative distribution functions, CDFs, in this case from readmitted and non-readmitted patients, the test examines the likelihood of the two samples being drawn from the same underlying distribution.

The test statistic is given by

D_nm = sup_x |F_n(x) − G_m(x)|

where F and G are the two empirical CDFs of n and m samples. The null hypothesis is rejected at significance level α = 0.05 if

D_nm > 1.36 √((n + m) / (nm)).
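The same test, including the α = 0.05 rejection threshold above, can be sketched with SciPy; the two samples are synthetic stand-ins for a continuous feature of the two patient groups, not thesis data.

```python
# Two-sample Kolmogorov-Smirnov test on synthetic data.
import numpy as np
from scipy.stats import ks_2samp

rng = np.random.default_rng(0)
a = rng.normal(loc=0.0, scale=1.0, size=500)   # "non-readmitted" feature
b = rng.normal(loc=0.5, scale=1.0, size=300)   # "readmitted", shifted mean

stat, p = ks_2samp(a, b)                       # D_nm and its p-value
n, m = len(a), len(b)
critical = 1.36 * np.sqrt((n + m) / (n * m))   # threshold at alpha = 0.05
reject = stat > critical                       # agrees with p < 0.05 here
```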

Correlation Correlation is a measure of how linearly dependent two variables are. This measure can give an indication of how predictive a variable may be, and hence it is often used as a basic statistical measure. Since correlation is a linear measure, it cannot capture non-linear dependencies between variables and can therefore miss important relationships. Correlation is given by

corr(X, Y) = E[(X − μ_X)(Y − μ_Y)] / (σ_X σ_Y)

where X, Y are two random variables and μ and σ denote the mean and standard deviation of the variables.
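The blind spot mentioned above can be demonstrated with NumPy on synthetic data: a quadratic dependence is strong but has near-zero correlation.

```python
# Pearson correlation as a filter score; the quadratic example
# shows the non-linear dependence that correlation misses.
import numpy as np

rng = np.random.default_rng(1)
x = rng.normal(size=1000)
y_linear = 2 * x + rng.normal(scale=0.1, size=1000)  # linear dependence
y_quadratic = x ** 2                                  # strong, but non-linear

r_lin = np.corrcoef(x, y_linear)[0, 1]     # close to 1
r_quad = np.corrcoef(x, y_quadratic)[0, 1]  # close to 0 despite dependence
```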


2.1.2 Wrapper methods

Wrapper methods use a classification algorithm to evaluate the usefulness of different sets of features. A classifier is trained on different subsets of the given features and each subset is evaluated using a measure, for example the accuracy or the AUC score. The feature subset that yields the optimal performance is selected.

Recursive feature elimination, RFE RFE is a backward elimination method that fits a model to the entire set of features and then recursively removes the least important feature. After removing a feature, the model is re-built and the next feature is removed; this goes on until a desired number of features remain. To identify the least important feature, an importance score for each predictor is computed at each iteration. RFE combined with cross-validation can be used to select the optimal number of features: at each iteration the model is evaluated using cross-validation, and the optimal number of features is reached when the score attains its maximum. The metric to optimize is selected depending on the problem.

The RFE algorithm is compatible with any model that provides an importance score, i.e. where the importance of each feature can be measured; hence, for example, logistic regression, SVM and random forest are all supported. However, the model used to select the features can strongly influence which features are selected, as well as how many, if the regularization is too weak and overfitting occurs.
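RFE with cross-validation, as described above, can be sketched with scikit-learn's `RFECV` on a synthetic binary problem; the dataset and parameter values are illustrative assumptions, not the thesis setup.

```python
# Recursive feature elimination with cross-validated selection of
# the number of features, using AUC as the metric to optimize.
from sklearn.datasets import make_classification
from sklearn.feature_selection import RFECV
from sklearn.linear_model import LogisticRegression

X, y = make_classification(n_samples=500, n_features=20,
                           n_informative=5, random_state=0)
selector = RFECV(LogisticRegression(max_iter=1000),
                 step=1, cv=5, scoring="roc_auc")
selector.fit(X, y)
# selector.support_ marks the retained features,
# selector.n_features_ is the cross-validated optimum
```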

2.1.3 Pros and cons with the feature selection methods

Filter methods are very fast and completely model agnostic, but they only compare one feature at a time to the target values. Combinations of features that are only important together are therefore overlooked: for example, if a patient suffers from two diseases that are easy to treat individually but much harder to treat together, filter methods would not find this relationship.

In comparison to filter methods, wrapper methods are significantly slower at selecting features; however, they might find relationships between variables that are impossible to find using filter methods. Wrapper methods are not model agnostic and they can suffer from overfitting.


2.2 Classification methods

This section gives an overview of the classifiers used in this work. They are mainly chosen based on previous work as outlined in Section 2.5.

2.2.1 Logistic regression

Logistic regression is the most utilised modelling technique when it comes to readmission risk prediction [5]. A logistic regression model uses a logistic function to constrain the output of the model between zero and one, which makes it suitable for classification. The logistic function is given by

P(y = 1) = 1 / (1 + exp(−(wᵀx + b)))

where x is the input and w, b are model parameters. The output is the modelled probability of the input belonging to a certain class [13].

Interpretation To interpret the meaning of the weights, one can rearrange the terms in the above equation:

log(P(y = 1) / (1 − P(y = 1))) = log(P(y = 1) / P(y = 0)) = wᵀx + b

The term P(y = 1)/P(y = 0) is called the odds. The model thus models the log-odds as a linear function of the input; by applying the exponential, a unit increase in feature x_j multiplies the odds by exp(w_j).

Training Like most machine learning methods, the parameters are optimized with respect to some loss function. Given a set of data points {(x_i, y_i)}_i, where x_i is the input and y_i is the true output, let ŷ_i denote the output of the logistic regressor. Then, the parameters w, b are chosen according to

w*, b* = argmin_{w,b} ( −Σ_i [ y_i log ŷ_i + (1 − y_i) log(1 − ŷ_i) ] )

otherwise known as the log-loss function. The minimization problem is solved iteratively until the parameters converge, typically using a coordinate descent algorithm.
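Fitting such a model can be sketched with scikit-learn; the synthetic class imbalance and the `class_weight` choice are illustrative assumptions hinting at the readmission setting, not the thesis configuration.

```python
# Logistic regression on an imbalanced synthetic dataset,
# evaluated with AUC as in the readmission literature.
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=1000, weights=[0.9, 0.1],
                           random_state=0)  # ~10% positive class
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)
clf = LogisticRegression(max_iter=1000,
                         class_weight="balanced").fit(X_tr, y_tr)
auc = roc_auc_score(y_te, clf.predict_proba(X_te)[:, 1])
```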


2.2.2 Random forest

Random forest is an ensemble method that combines multiple decision trees in order to make a prediction. By combining multiple weak learners, more stable and accurate predictions can be made. Further, ensemble methods reduce variance and are less prone to overfitting.

Given a training set {X_i}_{i=1}^N, M subsets are created by selecting a random number of samples with replacement from the original dataset. On each subset m ∈ {1, . . . , M}, a model f_m is trained. Given a new unknown sample x₀, the final prediction f̂ is then the average of the individual classifiers [14]:

f̂ = (1/M) Σ_{m=1}^M f_m(x₀)
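The averaging above can be checked directly with scikit-learn's implementation, whose forest probability is the mean over its trees; the dataset is synthetic illustration data.

```python
# Verifying that a random forest's prediction is the average of
# its individual trees' predictions.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

X, y = make_classification(n_samples=400, random_state=0)
forest = RandomForestClassifier(n_estimators=50, random_state=0).fit(X, y)

p_forest = forest.predict_proba(X[:5])
p_mean = np.mean([tree.predict_proba(X[:5])
                  for tree in forest.estimators_], axis=0)
# p_forest and p_mean agree: the ensemble averages its trees
```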

Decision tree A decision tree is best described as a sequence of questions: depending on the answer to one question, new questions are asked, thus constructing a tree. By following the trajectory of questions and answers, data points are classified using the leaf nodes of the tree. This section is based on the theory in [15].

The tree is constructed by deciding, at each node, which question to ask. This is decided based on the information gain of each potential question, i.e. how much it reduces the uncertainty in the dataset. The entropy, which is a measure of the uncertainty of a dataset T, is defined as

Entropy(T) = −Σ_{t∈T} p(t) log₂ p(t)

The information gained by knowing the value of some feature A is then given by

Gain(A) = Entropy(T) − Σ_{v∈Values(A)} (|T_v| / |T|) Entropy(T_v)

where T_v is the subset where feature A takes value v. Thus, during the construction of the decision tree, the feature to make a decision on at each node is given by

argmax_{A∈Features} Gain(A)

The construction is either terminated once the entropy of a subset is zero, or when the tree has reached some maximum depth.

When evaluating a sample, the trajectory decided by the tree is followed until a leaf node is reached. The resulting label is decided using the majority class of the training subset of that leaf node. An estimated probability can also be output by comparing the class sizes found in the leaf node.
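The entropy and gain definitions above can be computed directly on a toy split; the four-sample labels and features are invented to make the extremes visible.

```python
# Entropy and information gain, computed from the definitions above.
import numpy as np

def entropy(labels):
    _, counts = np.unique(labels, return_counts=True)
    p = counts / counts.sum()
    return -np.sum(p * np.log2(p))

def gain(labels, feature_values):
    total = entropy(labels)
    for v in np.unique(feature_values):
        subset = labels[feature_values == v]
        total -= len(subset) / len(labels) * entropy(subset)
    return total

y = np.array([0, 0, 1, 1])
a = np.array([0, 0, 1, 1])   # perfectly predictive feature: gain = 1 bit
b = np.array([0, 1, 0, 1])   # uninformative feature: gain = 0
```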

2.2.3 Support vector machine

A support vector machine (SVM) is a binary classifier that attempts to create a separating surface between the two classes. The following section is based on the outline of the method in [14].

Separating hyperplane Consider a set of data points x_i ∈ R^n, each with a corresponding label t_i ∈ {−1, 1}. A popular idea in the machine learning community is to separate the two classes using a hyperplane

wᵀx_i − b = 0

and then, for a given sample x_i, classifying its label using

y_i = 1 if wᵀx_i − b > 0, y_i = −1 if wᵀx_i − b < 0

If the dataset is linearly separable, the aim is then to choose w, b such that

t_i(wᵀx_i − b) > 0 ∀(x_i, t_i) (2.1)

Margins Given a linearly separable problem, there is an infinite number of solutions. For any solution w, b of Equation 2.1,

w̃ = 2w, b̃ = 2b

is another solution. One approach is to maximize the margins of the hyperplane, see Figure 2.1. Replace the constraint in Equation 2.1 by

t_i(wᵀx_i − b) ≥ 1 (2.2)

Maximizing the margin is then equivalent to minimizing

||w||² = wᵀw


Figure 2.1: Separating hyperplane together with maximized margins.

Dual problem Using Lagrange optimization, one can derive the dual problem of maximizing

Σ_i α_i − (1/2) Σ_i Σ_j α_i α_j t_i t_j x_iᵀx_j (2.3)

under the constraints

Σ_i α_i t_i = 0, α_i ≥ 0 ∀i (2.4)

Kernels and SVM Of course, most datasets are not linearly separable. However, through a non-linear transformation into a high-dimensional space, a dataset is much more likely to be linearly separable. Thus, each sample x_i is transformed through some non-linear function

f : R^n → R^m, m > n

and the problem is instead considered using z_i = f(x_i).

Kernel                  K(x_i, x_j)
Polynomial of degree d  (x_iᵀx_j + 1)^d
Radial basis            exp(−γ ||x_i − x_j||²)

Table 2.1: Common kernels; d is an integer, γ is a scale parameter.

Furthermore, in the dual problem of Equations 2.3, 2.4, only scalar products of two samples are required. Therefore, it is of practical benefit to use transforms where the scalar product can be computed efficiently in the low-dimensional space instead. Such low-dimensional implementations, which represent high-dimensional scalar products, are called kernels. A few common kernels can be seen in Table 2.1.
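The benefit of a kernel can be seen on a dataset that is not linearly separable; this sketch uses scikit-learn's `SVC` on synthetic concentric circles, with an arbitrarily chosen γ, not a configuration from the thesis.

```python
# Linear vs. RBF-kernel SVM on concentric circles, which no
# hyperplane in the original space can separate.
from sklearn.datasets import make_circles
from sklearn.svm import SVC

X, y = make_circles(n_samples=300, noise=0.1, factor=0.4, random_state=0)
acc_linear = SVC(kernel="linear").fit(X, y).score(X, y)   # near chance
acc_rbf = SVC(kernel="rbf", gamma=2.0).fit(X, y).score(X, y)  # near perfect
```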

2.2.4 Multilayer perceptron

A multilayer perceptron (MLP) is a feedforward artificial neural network (ANN) architecture. The idea behind ANNs is to construct a vast connected network of neurons. Each neuron has a weight, a bias and a nonlinear activation function, and receives the outputs of neurons in previous layers as inputs. Thanks to the nonlinear activation functions, a neural network is able to learn nonlinear functions and decision boundaries. The weights and biases are learnt using gradient descent with respect to some loss function. This section is inspired by [14].

One network layer An MLP consists of multiple layers of neurons, each layer having an associated weight matrix, bias and activation function. Denote these by w^(l) ∈ R^{n×m}, b^(l) ∈ R^m and h^(l), respectively, for layer l with n inputs and m outputs. The outputs of each layer are then given by

z_j^(l) = h^(l)( Σ_i w_ji^(l) z_i^(l−1) + b_j^(l) )

where z_i^(l−1) are the outputs from the previous layer; in the first layer, these are simply the inputs to the network. For example, for a network with one hidden layer, the total output is given by

z_k^(2) = h^(2)( Σ_{j=1}^M w_kj^(2) z_j^(1) + b_k^(2) )
        = h^(2)( Σ_{j=1}^M w_kj^(2) h^(1)( Σ_{i=1}^D w_ji^(1) z_i^(0) + b_j^(1) ) + b_k^(2) )

where D is the number of neurons in the input layer and M is the number of neurons in the hidden layer.

Common activation functions are

ReLU(x) = max{0, x}
tanh(x) = (eˣ − e⁻ˣ) / (eˣ + e⁻ˣ)
sigmoid(x) = eˣ / (eˣ + 1)


Training Training a neural network typically consists in minimizing some loss function. The training relies on predetermined data points {(x_i, y_i)}_i, where x_i is the input and y_i is the true output. Denote the neural network by f(x, w, b) for some given input x. Then, the error is defined by

E(w, b) = Σ_i e(f(x_i, w, b), y_i)

where e is the error function, e.g. e(z, y) = (z − y)².

The parameters of the neural network are then tuned using methods based on gradient descent,

w_{τ+1} = w_τ − η ∇_w E(w_τ, b_τ)
b_{τ+1} = b_τ − η ∇_b E(w_τ, b_τ)

In practice, methods based on stochastic gradient descent are used to speed up the learning process.

Binary classification Binary classification is a common problem in machine learning and specific network architectures have been developed for this purpose, specifically the loss function and the activation function of the final layer.

In order to keep the outputs constrained to the range [0, 1], a sigmoid activation is used. The loss function is the log-loss, defined by

e(z, y) = −y log z − (1 − y) log(1 − z)

where y ∈ {0, 1} and z is the network output. These choices allow the network output to be treated as a probability. Furthermore, the log-loss is much more suitable than the popular mean squared error in a binary setting.
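An MLP of this shape can be sketched with scikit-learn's `MLPClassifier`, which uses ReLU hidden layers, a logistic output and the log-loss by default; the dataset and layer sizes are illustrative assumptions, not the thesis architecture.

```python
# A small MLP for binary classification; predict_proba exposes
# the sigmoid output, constrained to [0, 1].
from sklearn.datasets import make_moons
from sklearn.neural_network import MLPClassifier

X, y = make_moons(n_samples=400, noise=0.2, random_state=0)
mlp = MLPClassifier(hidden_layer_sizes=(16, 16), activation="relu",
                    max_iter=2000, random_state=0).fit(X, y)
proba = mlp.predict_proba(X)[:, 1]   # probability-like output in [0, 1]
acc = mlp.score(X, y)
```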

2.3 Metrics

Model evaluation techniques are used to assess the effectiveness of a model and to find the most suitable algorithm for a particular problem. The metrics used in this project are outlined below. The most common classification scores are accuracy, recall and precision. The plain accuracy score works well when the two classes are balanced, but for highly imbalanced datasets it can be misleading. For example, in the case of the strict readmission definition, the minority class only occurs in 6% of the cases; hence, a model that predicted no patient as readmitted would reach an accuracy of 94%. This looks like a good accuracy, but the model is performing very poorly. In such cases, recall, precision, balanced accuracy and the F1 score are more useful.

2.3.1 Basic metrics

A predicted sample falls into one of four categories depending on its class label and predicted label, as shown in Figure 2.2. In this case, a true positive is a sample correctly predicted as readmitted and a false positive is a sample wrongly predicted as readmitted. Likewise, a true negative is a sample correctly classified as not readmitted and, finally, a false negative is a sample incorrectly classified as not readmitted. The number of samples in each category is denoted TP (true positives), FP (false positives), TN (true negatives) and FN (false negatives). These four counts are used to define the basic metrics outlined below.

Figure 2.2: Confusion matrix showing the four possible combinations of pre-dicted and actual class labels. TN = True negative, TP = True positive, FN = False negative, FP = False positive.

Accuracy Defined as

(TP + TN) / (TP + TN + FP + FN)

the accuracy reports the overall accuracy of a model.


Balanced accuracy The balanced accuracy gives the average prediction accuracy of the two classes and is defined as

(1/2) (TP / (TP + FN) + TN / (TN + FP)).

Precision The precision reports the accuracy of the positive predictions made by the model, defined by

TP / (TP + FP).

Recall The recall reports the coverage of actual positive samples. It is given by

TP / (TP + FN).

Specificity The specificity reports the coverage of actual negative samples and is given by

TN / (TN + FP).

F1-score The F1-score is the harmonic mean of precision and recall, useful when dealing with imbalanced datasets. It is given by

2TP / (2TP + FP + FN).

Receiver operating characteristic (ROC) curve and area under curve (AUC) AUC is a metric that has been used in the majority of previous studies on readmission risk prediction. It measures the area under the ROC curve. The receiver operating characteristic curve plots the true positive rate against the false positive rate at various probability thresholds, and hence shows the trade-off between these two rates. The AUC score ranges between 0 and 1 and indicates how well the model can separate the two classes; a higher score indicates higher discriminative power. An AUC score of 0.5 means that the model cannot separate the two classes, and a score below 0.5 means the model predicts more negative samples as positive and vice versa. The true positive rate, TPR, and the false positive rate, FPR, are given by

TPR = TP / (TP + FN), FPR = FP / (FP + TN)
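The basic metrics above reduce to a few lines of arithmetic on the four confusion-matrix counts; the counts here are invented to echo the imbalanced setting, not results from the thesis.

```python
# Basic metrics from confusion-matrix counts. Note how accuracy
# stays high while recall reveals the poor minority-class coverage.
TP, FP, TN, FN = 30, 40, 900, 70

accuracy = (TP + TN) / (TP + TN + FP + FN)
balanced = 0.5 * (TP / (TP + FN) + TN / (TN + FP))
precision = TP / (TP + FP)
recall = TP / (TP + FN)          # also the true positive rate, TPR
specificity = TN / (TN + FP)
f1 = 2 * TP / (2 * TP + FP + FN)
fpr = FP / (FP + TN)             # false positive rate
```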


Precision-recall curve and average precision score (AP) The precision-recall curve shows the precision plotted against the recall at different probability thresholds, much like the ROC curve. The average precision score is an approximation of the area under this curve. Its value ranges between 0 and 1, and for low thresholds it approaches the underlying minority class rate.

2.4 Statistical hypothesis testing

In order to compare the results, statistical hypothesis testing is performed. Hypothesis testing is used to examine whether there is sufficient evidence to reject a null hypothesis. Two methods, the Kruskal-Wallis H test and the Wilcoxon signed-rank test, are used in this project to compare the performance of different models. The Kruskal-Wallis H test is used to compare multiple groups, while the Wilcoxon signed-rank test compares two groups with each other. A significant Kruskal-Wallis H test result states that at least one group differs from the others, but not which one. A post hoc analysis between two groups is required to further determine which groups differ; to this end, the Wilcoxon signed-rank test is used.

Kruskal-Wallis H test Kruskal-Wallis H test is used to test whether samples originate from the same distribution. The null hypothesis states that the median of the observations in each group is the same. The test can be used to compare multiple groups of samples. The theory in this section is based on [16]. Consider $C$ different groups of samples, with $n_i$ samples in the $i$th group and a total of $N = \sum n_i$ samples. The test statistic, $H$, is computed using a ranking of the samples. The samples are ranked based on the magnitude of the observed values: the smallest value is given rank one, the next rank two, and so on; if the same value occurs more than once, the tied observations are given the mean of the ranks for the tied values. If the value of each sample is unique, i.e. there are no ties, the test statistic $H$ is given by

$$H = \frac{12}{N(N+1)} \sum_{i=1}^{C} \frac{R_i^2}{n_i} - 3(N+1)$$

where $R_i$ is the sum of the ranks in the $i$th group. If there are ties, the test statistic is divided by the correction term

$$1 - \frac{\sum_{i=1}^{G} (t_i^3 - t_i)}{N^3 - N}$$


where $t_i$ is the number of times the $i$th tied value occurs and $G$ is the total number of distinct values that are tied. For large sample sizes, $H$ is approximately distributed as $\chi^2(C-1)$, where $C-1$ is the degrees of freedom. The value of $H$ is compared to a critical value $H_c$ in order to decide whether to reject the null hypothesis or not.
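As a sketch, the unadjusted $H$ statistic can be computed directly from the formula above, assuming no tied values; the three sample groups are made-up numbers:

```python
# Minimal Kruskal–Wallis H computation (no tie correction).
def kruskal_wallis_h(groups):
    # Pool and rank all observations; rank 1 = smallest value.
    pooled = sorted(v for g in groups for v in g)
    rank_of = {v: i + 1 for i, v in enumerate(pooled)}  # assumes no ties
    n_total = len(pooled)
    h = 0.0
    for g in groups:
        r_i = sum(rank_of[v] for v in g)  # rank sum R_i of group i
        h += r_i ** 2 / len(g)
    return 12.0 / (n_total * (n_total + 1)) * h - 3 * (n_total + 1)

# Three illustrative, perfectly separated groups with no tied values
h = kruskal_wallis_h([[1.1, 2.3, 3.5], [4.2, 5.0, 6.1], [7.3, 8.4, 9.9]])
```

In practice a library routine such as `scipy.stats.kruskal` would be used, since it also applies the tie correction and returns the p-value.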

Wilcoxon signed-rank test Wilcoxon signed-rank test is used to test whether two paired sets of observations originate from the same distribution. Let $n$ be the number of pairs $(x_i, y_i)$, where $i \in 1, \ldots, n$. The null hypothesis of the Wilcoxon signed-rank test states that the differences $d_i = x_i - y_i$ are symmetric about zero. This section is based on the theory in [17]. The test is performed according to the following steps¹:

Step 1: Compute the differences $x_i - y_i$ and the absolute differences $|x_i - y_i|$.

Step 2: Rank the differences according to the absolute difference, with the smallest difference set to 1.

Step 3: Compute the sum of the negative ranks and the sum of the positive ranks. The test statistic, $W$, is the smaller of the two sums.

Step 4: For large sample sizes, $n$, the test statistic $W$ is approximately normally distributed with

$$\mu_W = \frac{n(n+1)}{4} \quad \text{and} \quad \sigma_W = \sqrt{\frac{n(n+1)(2n+1)}{24} - \frac{\sum t^3 - \sum t}{48}}$$

where $t$ represents the number of times the same absolute difference occurs.

Step 5: Finally, the z-score is computed according to $z = \frac{W - \mu_W}{\sigma_W}$, and the p-value can be computed using the standard normal distribution. If the p-value is smaller than a specified significance level, the null hypothesis can be rejected.
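Steps 1-3 can be sketched as follows, computing only the $W$ statistic and assuming no tied absolute differences; the paired scores are illustrative:

```python
# Wilcoxon signed-rank W statistic (steps 1-3; assumes no tied |d|).
def wilcoxon_w(x, y):
    # Step 1: signed differences, dropping zero differences.
    diffs = [xi - yi for xi, yi in zip(x, y) if xi != yi]
    # Step 2: rank by absolute difference, smallest = rank 1.
    ranked = sorted(diffs, key=abs)
    # Step 3: sum positive and negative ranks; W is the smaller sum.
    pos = sum(r + 1 for r, d in enumerate(ranked) if d > 0)
    neg = sum(r + 1 for r, d in enumerate(ranked) if d < 0)
    return min(pos, neg)

w = wilcoxon_w([0.70, 0.68, 0.72, 0.71, 0.69],
               [0.66, 0.69, 0.67, 0.65, 0.62])
```

For the full test (tie handling and p-values), a routine such as `scipy.stats.wilcoxon` would normally be used.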

2.5 Related work

Many studies have investigated the use of machine learning methods to predict the risk of readmission. Various models and methods have been used, yielding varying results. The review article written by Artetxe et al. [5] examined 77 different studies, where 68% used logistic regression or some other

¹Charles Zaiontz. Wilcoxon Signed-Ranks Test.


kind of regression method, 13% used survival analysis and 18% used machine learning techniques for classification, with decision trees and support vector machines, SVM, being most frequently used. 64 of the included studies reported an AUC score lower than 0.75, which is equivalent to modest predictive performance².

2.5.1 Readmission risk prediction for a specific patient group

Chronic Obstructive Pulmonary Disease

Most of the studies focus on a specific patient group, which allows for investigating more disease-specific features as well as a higher readmission rate than in the general case. A higher readmission rate results in a less imbalanced problem, which facilitates training. Min et al. [18] looked at Chronic Obstructive Pulmonary Disease. They used different scores as features: the HOSPITAL score and the LACE Index, which both are scores computed from Electronic Health Records, EHR. Furthermore, Min et al. included diagnosis, procedure and pharmacy codes as features. They compared the predictive ability of various machine learning techniques including logistic regression, logistic regression with L1 or L2 penalization, random forest, SVM, gradient boosted decision trees, multi-layer perceptron and deep models like convolutional neural networks and recurrent neural networks, including long short-term memory and gated recurrent units. The model that yielded the best score was the gradient boosted decision trees with an AUC score of 0.654; the best deep model gave a score of 0.650 and included a gated recurrent unit. No single model outperformed the rest, all of them produced similar results.

Diabetes

Ramírez et al. [19] investigated readmission prediction for diabetic patients. They used logistic regression, single-layer perceptron, multi-layer perceptron and random forest. To reduce the dimensionality of the data, principal component analysis was conducted, which resulted in 45 features being used instead of 100 while still preserving 98% of the variance. They also used oversampling of the minority class to deal with the imbalanced dataset. Their random forest model got an AUC score of 0.9999. Comparing this result to the other models,

²Tom Tape, The Area Under an ROC Curve, url: http://gim.unmc.edu/dxtests/roc3.htm


the second best yielded an AUC score of 0.6458. It is unclear whether the reported AUC score of 0.9999 concerns a training or test dataset. They argued that the main contributing factors to the result were the preprocessing of the data by grouping the diagnosis codes, dimensionality reduction and oversampling, as well as transforming the problem into a classification problem rather than a probability prediction problem. Hammoudeh et al. [20] also investigated diabetes. They used convolutional neural networks and their model yielded an AUC score of 0.95. They preprocessed the data by removing completely irrelevant information such as patient number; moreover, variables with many missing values or small variation in the values were removed.

Heart failure

Another patient group that is frequently investigated is heart failure patients. Mahajan et al. [21] investigated the risk of readmission in this patient group. Their dataset contained clinical data, administrative and psychosocial factors including demographics, pre-index admission factors, comorbidities and concurrent procedures. They used ten different “base learners”, including random forest, AdaBoost, logistic regression, decision trees, neural network, naïve Bayes etc., and a “meta learner” which took a set of the predictions from the base learners and made the final prediction. They concluded that the ensemble models were at least as good as the best base model. The highest AUC score among the base learners came from extra trees with 0.6993, and the ensemble method using gradient boosting yielded 0.6987. AbdelRahman et al. [9] also focused on heart failure. They used a three-step approach consisting of preprocessing, systematic model development and risk factor analysis. In the preprocessing step, variables that were missing in more than 50% of the cases were removed. For the remaining variables, two ways of handling missing values were used: complete-case analysis, where incomplete data was removed from the dataset, and mean-based imputation, where the missing values were replaced by the mean/mode of the complete values using a K-means algorithm. However, in their final model only complete data was used and they did not fill in the missing values. The data included demographics, comorbidities, laboratory tests, vital signs and healthcare utilization during the previous six months. Noteworthy is that they had significantly less data than other studies; their training set only consisted of approximately 750 instances and the test data of 188, including data with missing values.
To evaluate the significance of variables, a statistical test was used, with a p-value of ≤ 0.001 indicating a significant variable and ≥ 0.1 an irrelevant variable; values in between were considered


to be moderately significant. The best model was a voting classifier averaging multi-nominal logistic regression and a voting feature intervals classifier. Their model received an AUC score of 0.79, and out of 42 risk factors, discharge disposition (where the patient was discharged to), age and indicators of anemia were the most significant; length of stay was also found to be an important feature.

Another group that focused on heart failure patients was Ashfaq et al. [7]. They used data from a Swedish hospital between 2012-2016. Their work focused on building a predictive model using long short-term memory, LSTM, to take the sequential nature of readmissions into account and hence use more information in the model. They separated the features into human derived and machine derived. Human derived features included comorbidities, length of stay, demographics, type of admission, medication, Charlson comorbidity index, number of prior emergency care visits, number of prior admissions and number of prior outpatient visits. Machine derived features included all clinical codes, procedures, medications and lab tests. To deal with the imbalance of the dataset, a cost term was added to the loss function with three times larger cost for misclassifying a readmission visit than a non-readmission visit. They received an AUC score of 0.76 when using LSTM with human and machine derived features simultaneously. They found that capturing the sequential order of the visits increased the AUC score by 26% compared to a memory-less network like the multi-layer perceptron.

Comparison of different patient groups

A study that compared the ability to predict the readmission risk for specific patient groups to the entire dataset containing all patients was done by Futoma et al. [22]. They expected the ICD-10 codes as well as the broader diagnosis group codes to be the most informative. They used the codes and additional background variables, e.g. age, gender, race, length of stay, number of admissions in the past year and a few more, to make predictions. They used logistic regression, logistic regression with a multi-step heuristic approach for variable selection, penalized logistic regression, random forest and support vector machine. They tested their models on specific patient groups according to the diagnosis group code and on the entire dataset. The random forest and the penalized logistic regression performed best. The mean AUC was 0.69; however, it ranged between 0.57-0.95, which indicated that some groups were easier to model and some were much harder. They did see a better result when modelling a specific group rather than modelling the entire dataset, which they


hypothesized was caused by the homogeneity of a specific group.

2.5.2 All-cause readmissions

There are studies that do not focus on a specific patient group but on all-cause readmissions. Below, some of the most influential studies are outlined. A study conducted by Ben-Assuli and Padman [23] focused on patients with a large number of repeated hospital visits, more precisely patients having visited the emergency department seven times in the past four-year period. They compared logistic regression, boosted decision trees, support vector machines, Bayes point machine and neural network. Features included EHR data, health maintenance organization, ED unit, hospitals, creatinine results, differential diagnosis, length of stay, days between referrals, age, gender, total number of ED visits and total number of readmissions. The best AUC score was 0.92 and was achieved by the boosted decision trees model. This study extends many other studies by not only making the prediction based on one visit but on all previous visits in the past four years. Jamei et al. [8] used artificial neural networks to predict the readmission risk. Their features consisted of encounter reason, hospital problems, procedures, medications, providers, discharge, socioeconomic variables, information related to the admission, lab results, comorbidities, basic demographics, health history, inpatient visits, vital parameters and payer. They tested logistic regression, random forest and neural networks. The best performing model was a two-layer network with a hidden layer containing half the number of nodes of the input layer and dropout between all layers. First they trained the model with 1667 features, then they re-trained the model with the top N features that correlated with 30-day readmission. Using N=100 features the model achieved 95% of the optimal precision.
They found that the features most correlated with 30-day readmission were the number of inpatient visits in the past 12 months, the number of inpatient visits in the past 6 months, the Charlson comorbidity index and the number of inpatient visits in the past 4 months. They managed to get an AUC score of 0.78; however, precision was only 0.23, recall 0.59, and the training time was significantly higher, 2650 s compared to 60 s for logistic regression. Lin et al. [6] also explored the usage of recurrent neural networks with LSTM, similar to Ashfaq et al. [7]; however, they did not use historical data but data from the past 48 hours to investigate readmissions to the intensive care unit, ICU, and their dataset consisted of people older than 18 years. To extract a time series, they used test results from the last 48 hours of the ICU stay. They dealt with missing values using a method called last-observation-carried-forward imputation. They used ICD codes, demographic


information and health care provider notes as features. The notes were found to have a big impact on the readmission risk. They reported an AUC score of 0.791.

Multiple predictions

Tabak et al. [24] tried to give predictions at admission as well as at discharge, and hence take advantage of the information available at different points in time. The information available at admission came from the EHR system; they used a score called the Acute Laboratory Risk of Mortality Score, ALaRMS, which comprises demographic data and admission laboratory test results, e.g. age, gender, laboratory test results and the number of admissions in the previous 90 days. The discharge model also included socioeconomic status, length of stay, admission source, discharge disposition and discharge diagnosis. They fitted a multivariable logistic regression model. From the early readmission model it could be seen that a higher ALaRMS score led to a higher risk of readmission and that one previous admission within the 90 days doubled the risk; the AUC score was 0.697. When administrative data available at discharge was added, the AUC score increased to 0.722. They found that the feature ‘disease group’ enhanced the discriminative power of the model.

Prediction of risk category

All the above articles tackle readmission risk prediction only as a classification problem; however, Shulan et al. [25] also tried to split their prediction into five categories: ‘very low risk’, ‘low risk’, ‘moderate risk’, ‘high risk’ and ‘very high risk’. They used logistic regression to predict readmissions and only included variables with p-values less than 0.1 in the final regression model to avoid overfitting. They added more and more features to show the features’ impact on the result. The model containing the highest number of features got the highest AUC score, 0.80. They found that their model overestimated the low-risk category and underestimated the number of readmissions in the high-risk category.

2.5.3 Summary

A lot of different models and methods have been tested in previous works; logistic regression and its variants are most frequently used, but in later years artificial neural networks have become more common. Readmission risk prediction has almost exclusively been investigated as a binary classification problem and not a probability prediction problem. Most studies have yielded modest predictive power. Only a few studies have considered the fact that the data is imbalanced between the number of patients being readmitted and not readmitted. Important features are EHR data and demographics, with length of stay and number of previous readmissions having a high correlation with 30-day readmission risk.


Methods

This chapter describes the definitions of readmission used in this project and gives a description of the dataset used. Furthermore, an overview of the conducted experiments is presented.

3.1 Definition of readmission

In this project, two different definitions of readmission were used, a stricter one and a more relaxed one, stated in Table 3.1. Both definitions are of interest when investigating readmissions. Definition 2 is more restrictive; by using it, it is guaranteed that the cause of the second admission is the same as that of the first. There is, however, a risk that a second admission is caused by a disease related to the first admission but not necessarily the same disease. Definition 1, on the other hand, is not as restrictive, and it captures the above-mentioned related diseases. However, using definition 1 there is a risk that unrelated diseases will be classified as readmissions when they are two separate events having nothing to do with each other. The following example illustrates this dilemma. Assume that a person was admitted after suffering a thrombus and was given anticoagulants as treatment; within 30 days the patient is again admitted, but this time because of nausea. Nausea is a known side effect of the anticoagulants. According to definition 2 this is not a readmission, since the patient is not suffering from the same disease, but we know that the second admission was caused by the treated thrombus. In this case definition 1 would suit better. On the other hand, assume that a person was admitted after suffering a thrombus, and within 30 days this patient slips and fractures a leg. According to definition 1 this would be a readmission even though these two


events have nothing to do with each other, hence in this case definition 2 is more suited.

The readmission rates for definitions 1 and 2 are 19.8% and 6.0%, respectively. Throughout this report, definition 1 is referred to as the relaxed definition and definition 2 as the strict definition.

    Definition of readmission                              Readmission rate (%)
1   The patient is admitted to a hospital within
    30 days of a prior hospital discharge.                 19.8
2   Same as 1, but with the additional requirement
    that the diagnosis codes of both visits are the same.  6.0

Table 3.1: The two definitions of readmission used in this project. Definition 1 is referred to as the relaxed definition and definition 2 as the strict definition.
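As a sketch, the two definitions can be expressed as a small predicate; the function and field names below are illustrative, not taken from the project's implementation:

```python
from datetime import date

# Sketch of the two readmission definitions in Table 3.1.
def is_readmission(discharge, next_admission, dx_prev, dx_next, strict=False):
    within_30 = 0 < (next_admission - discharge).days <= 30
    if strict:  # definition 2: additionally require the same diagnosis code
        return within_30 and dx_prev == dx_next
    return within_30  # definition 1 (relaxed)

# A new admission 19 days after discharge, with a different diagnosis:
relaxed_hit = is_readmission(date(2020, 1, 1), date(2020, 1, 20), "I50", "J18")
strict_hit = is_readmission(date(2020, 1, 1), date(2020, 1, 20), "I50", "J18",
                            strict=True)
```

The same pair of visits thus counts as a readmission under the relaxed definition but not under the strict one, mirroring the gap between the 19.8% and 6.0% rates.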

3.2 Data explanation

The available data consists of EHR data collected from a Swedish hospital over a time period of 5 years. The data consists of vital parameters, chemistry tests, radiology investigations, diagnosis codes, procedure codes, prescriptions, social situation, emergency priority as well as dates and lengths of previous visits. The dataset was anonymized using UUID coding, which meant that the patients' integrity was respected. In addition, all time stamps were removed; the time of discharge was set to zero and all measurements and tests were given a time relative to time zero. The dataset did not consist of unique patients but rather of unique contacts with the hospital. Hence, a patient could have visited the hospital multiple times and each visit would be reported as a unique event, meaning that multiple samples originated from the same patient but were collected at different points in time. There was, however, no way of knowing whether two unique contacts came from the same person or how many contacts a person had had.

Before starting the modelling process, the data had to be pre-processed. The original data were stored in multiple csv files, each including information about a specific category of the EHR data. Most of the data had a time dependence, e.g. the vital parameters were measured multiple times during a visit, a patient had multiple procedures, radiology investigations and chemistry tests conducted during a visit, and all diagnosis codes were given at each visit. Thus, focus was put into building a feature extraction pipeline manually extracting features with relevant information and saving them in a single table.

In the following sections, a more detailed description of the features is given. Only medical data from the present visit were used, except for the previous number of admissions and visits. In Figure 3.1 an overview of the dataset is shown. The first and second columns show the category and sub-category of each variable, whereas the third column specifies how the variable has been encoded when multiple measurements exist.

Patient information

The patient information includes all features having a one-to-one relation with the contact, including the gender, age, day of week at discharge, social status, way of admission, reason for admission and length of stay. These features were encoded as binary or categorical depending on their nature. The age of a patient was given in 10-year intervals. A patient's social status included whether the patient was married or living alone, as well as some other categories. However, there was no information on whether a patient had home care or not. One of these features, the length of stay, will not be available until the patient is discharged; however, it is included since in a future decision support system the length of stay up to the present point could be used.

Vital parameters

The vital parameters include pulse, temperature, respiration frequency, blood pressure, BMI and indirect oximetry. These parameters were measured multiple times during each contact. As can be seen in Figure 3.1, the first and last values, the mean and the variance were extracted. The first and last values can give an indication of the trend; a patient can start with an abnormal value which, after treatment, returns to the normal range. The mean and variance of a parameter were extracted to give an understanding of the stability of a parameter. A high variance can indicate that a patient has recovered from an unhealthily high value to a value in the normal range. In addition to the four features mentioned above, two boolean features were added to describe the indirect oximetry parameter: whether a patient had received oxygen, and a hypoxia indicator. Furthermore, the vital parameters also included awareness and NEWS scores. To measure the awareness of a patient the AVPU scale is used; it includes the categories Alert, Voice, Pain and Unresponsive, and is a way for health care professionals to rank a patient's consciousness. The NEWS score, National Early Warning Score, is a value between 0 and 20 that indicates the severity of a patient's illness based on the vital parameters pulse, blood pressure, temperature, respiration, awareness and oxygen saturation. If a patient has measurements that deviate from the normal range, the patient receives a higher score.

It is important to note that the vital parameters had a considerable amount of missing values; each feature had ∼ 65% missing values, which is a significant number. The missing values were imputed using the median over all other contacts. By imputing the values in this manner, a lot of the variability in the feature was lost and the usability of the feature was degraded. Nonetheless, the values were imputed instead of removing the samples from the dataset in order to maintain enough samples. The significant number of missing values was due to the fact that the vital parameters were documented and saved in another way during the time period that the dataset was extracted from, and hence were much harder to access and bring forth in a structured way.
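Median imputation of the kind described above can be sketched with pandas; the column names and values are illustrative, not the thesis's actual schema:

```python
import pandas as pd

# Impute missing vital-parameter values with the column median.
vitals = pd.DataFrame({
    "pulse_mean": [72.0, None, 88.0, None],
    "temp_last": [36.8, 38.1, None, 37.0],
})
# fillna with a Series maps each column name to its (NaN-skipping) median.
vitals_imputed = vitals.fillna(vitals.median())
```

As the section notes, this collapses all imputed rows onto a single value per column, which is exactly where the loss of variability comes from.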

Diagnosis codes, procedure codes, prescriptions and radiology investigations

Diagnosis codes, procedure codes and prescriptions are structured in a hierarchical manner. The diagnosis codes use the ICD-10-SE system; each code starts with a letter specifying the main category, followed by numbers specifying the exact diagnosis. In order to reduce the feature space, only the first three characters of the diagnosis codes were used, resulting in ∼ 1100 unique codes. The procedure codes follow the ICD-10-PCS system; in this case the entire code was used. For prescriptions the ATC system was used; here only the first character was utilized, resulting in 14 groups. The radiology investigations do not have a hierarchical structure, hence each code was used as a feature. Since there could exist multiple codes in each category, the codes were encoded as binary: either the contact had the specific code or not. In addition, the number of codes in each category was added as a feature, e.g. the number of diagnoses a patient had at the present visit or the number of procedures that had been conducted during the visit. The numbers of diagnoses, procedures, radiology investigations and prescriptions can work as indicators of a patient's overall health situation.
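The hierarchical-code reduction can be sketched as follows; the ICD-10 codes shown are arbitrary examples and the `dx_` feature naming is an assumption for illustration:

```python
# Truncate ICD-10 codes to their first three characters and
# one-hot encode code presence for a single contact.
codes = ["I509", "I501", "J189", "E119"]  # illustrative codes of one visit
truncated = {c[:3] for c in codes}        # "I50" groups I509 and I501
features = {f"dx_{c}": 1 for c in sorted(truncated)}
n_diagnoses = len(codes)                  # count feature described above
```

Truncating to three characters merges related codes (here the two I50 variants) into one feature, which is what shrinks the space to roughly 1100 unique codes.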

(38)

Figure 3.1: Summary of the dataset used in this project. The last column specifies how the features were encoded if multiple measurements existed.


Chemistry tests

Another set of features that had to be handled were the chemistry data, i.e. lab results. Since different lab tests are conducted for different diseases, there were a lot of missing values for each test. Two ways of encoding the chemistry data were tested. One way was to use the actual result of a test and impute the missing values with the median of the existing values. The other way was to encode each feature as binary; the value in itself did not matter, either the patient had had the test done or not. By imputing the missing values with the median, the same problem arises as for the vital parameters: part of the variability of the feature is lost.

If the data were not anonymized, normal-range values could be used whenever a value is missing; this would give more information on whether the test results deviate from the normal range.

Preceding contacts

Preceding contacts consisted of all previous contacts a patient had had with the healthcare system, together with a time stamp of how long ago the contact had occurred. The frequency of admissions as well as the number of contacts a patient has had with the healthcare system have been shown to be very important predictors in hospital readmission risk prediction. To extract informative features giving both a long- and short-term view of a patient's previous medical encounters, the number of inpatient visits and the inter-admission times during the past five years, one year and six months were saved, together with the number of outpatient visits during the past five years. For all contacts resulting in an admission during the past five years, the mean length of stay was computed.
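Counting prior admissions over several look-back windows can be sketched as below; the day offsets (negative, relative to the index discharge at day 0) and window lengths are illustrative:

```python
import pandas as pd

# Days of previous admissions relative to the index discharge (day 0).
prior_admission_days = pd.Series([-30, -100, -200, -400, -1200])

def count_within(days, window_days):
    # Number of prior admissions inside the given look-back window.
    return int((days >= -window_days).sum())

n_6m = count_within(prior_admission_days, 182)      # past six months
n_1y = count_within(prior_admission_days, 365)      # past year
n_5y = count_within(prior_admission_days, 5 * 365)  # past five years
```

Each window becomes one feature, so the model sees both a short- and long-term view of the patient's admission history.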

To begin with, all categories were used, resulting in a sparse feature space. To avoid overfitting and insignificant results, features that were present in less than 1% of the contacts were excluded.

3.3 Resampling

As could be seen in Table 3.1, the readmission rate for definition 1 was 19.8% and for definition 2 6.0%. This constitutes an imbalanced problem. Machine learning models are built to handle problems with a relatively equal number of observations in each class. The imbalance problem can be tackled using a resampling method. Two basic techniques are oversampling and undersampling.


Oversampling duplicates samples from the minority class to increase the number of minority class samples while keeping the majority class untouched. Undersampling works in the opposite way: samples from the majority class are removed from the dataset while the minority class is untouched. Both methods change the prior knowledge of the class distributions during the training phase, while the original distributions are kept in the testing phase. Changing the prior too much will change the probability predictions. This problem is more evident using definition 2; to make the problem more balanced in this case, the prior has to be changed to a higher degree.
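A minimal random-oversampling sketch, assuming a simple list representation of the two classes and a hypothetical `oversample` helper (in practice a library such as imbalanced-learn would be used):

```python
import random

# Duplicate minority samples until the minority class reaches
# `ratio` times the majority-class size (ratio assumed >= minority share).
def oversample(majority, minority, ratio=1.0, seed=0):
    rng = random.Random(seed)
    target = int(len(majority) * ratio)
    extra = [rng.choice(minority) for _ in range(target - len(minority))]
    return majority + minority + extra

# Illustrative 80/20 class split oversampled to a 1:1 training prior.
data = oversample(majority=[0] * 80, minority=[1] * 20, ratio=1.0)
```

The `ratio` parameter corresponds to how strongly the training prior is shifted; as the section notes, definition 2 requires a larger shift than definition 1.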

3.4 Experimental procedure

In this section, the experimental procedure is described. The model performance was evaluated using AUC, precision, recall, F1-score, accuracy, balanced accuracy, AP score and specificity. Furthermore, all results were tested for statistical significance using the Kruskal-Wallis H test or the Wilcoxon signed-rank test: depending on whether more than two or exactly two groups of results were to be compared, the Kruskal-Wallis H test or the Wilcoxon signed-rank test was used, respectively.

The feature selection was done in multiple steps to reduce the feature space and find the most important features. To begin with, a univariate analysis was performed using auxiliary statistical tests as filter methods, described in Section 2.1.1. Furthermore, RFE together with cross-validation on training data was used as an alternative feature selection method. The RFE was run with a logistic regressor as the wrapper classifier.
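The RFE-with-cross-validation step can be sketched with scikit-learn's `RFECV` wrapped around a logistic regression; the synthetic data and hyper-parameters below are placeholders, not the project's actual setup:

```python
from sklearn.datasets import make_classification
from sklearn.feature_selection import RFECV
from sklearn.linear_model import LogisticRegression

# Synthetic stand-in data: 10 features, of which 4 are informative.
X, y = make_classification(n_samples=300, n_features=10,
                           n_informative=4, random_state=0)

# Recursively eliminate features, picking the count that maximizes
# cross-validated AUC with a logistic-regression wrapper.
selector = RFECV(LogisticRegression(max_iter=1000), step=1, cv=5,
                 scoring="roc_auc")
selector.fit(X, y)
selected = selector.support_  # boolean mask of the retained features
```

Using AUC as the selection criterion matches the evaluation metric emphasized throughout the thesis.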

Following the initial results, a small set of basic features was constructed. This basic set was then used to evaluate the importance of different feature categories, e.g. diagnoses, chemistry tests and vital parameters. By extending the basic set with features from a specific category, the performance differences could be examined. The relaxed definition, i.e. that a patient was readmitted within 30 days, was used together with a logistic regressor.

Then, different classifiers were trained and tested on a feature set obtained by RFE, consisting of features from all different categories. Each definition of readmission had its own set of features. The performances of the different classifiers were compared in order to investigate the usability of machine learning for readmission risk predictions. Furthermore, a comparison of the performance of the best classifiers using the two different definitions of readmission was also conducted. The classifiers used were SVM, logistic regression, random forest and MLP. Grid searches were conducted to find the best possible hyper-parameters for each model.

Moreover, the model performances within different diagnoses were investigated to see if a general set of features has the same predictive capabilities under different diagnoses. Then, a specific set of features for heart failure patients was developed in a similar fashion as above. The specific features were then compared to the general set of features to examine the sources of any model performance differences; here, only logistic regression was evaluated.

Finally, with the dataset being heavily imbalanced, the effect of the resampling methods used was evaluated. By varying the resampling ratio as well as the decision threshold, the impact on the recall and precision scores was examined.
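Varying the decision threshold to capture more readmitted patients can be sketched as follows; the predicted probabilities and labels are invented for illustration:

```python
# Lowering the decision threshold trades precision for recall.
probs = [0.9, 0.6, 0.4, 0.35, 0.2, 0.1]   # illustrative model outputs
labels = [1, 1, 1, 0, 1, 0]               # 1 = readmitted

def recall_at(threshold):
    preds = [int(p >= threshold) for p in probs]
    tp = sum(p and l for p, l in zip(preds, labels))
    fn = sum((not p) and l for p, l in zip(preds, labels))
    return tp / (tp + fn)

r_default = recall_at(0.5)  # standard 0.5 cut-off
r_low = recall_at(0.3)      # lowered cut-off captures more readmissions
```

The lowered threshold flags more visits as readmissions, raising recall at the cost of an extra false positive, which is the trade-off the final experiment explores.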


Results

In the following sections, the results of the experiments performed in this project are given. The purpose of the experiments was to investigate the importance of various features, to evaluate the usability of the two different definitions, and to compare model performance. As outlined in Section 3.4, this was done in three different steps. First (Section 4.1), a univariate analysis was conducted on each unique feature to guide the feature selection. Then (Section 4.2), features belonging to different categories, e.g. diagnoses and chemistry tests, were collected in various feature sets and a model was trained and evaluated using each set; this step provided some insights into the information available in each category. Finally (Section 4.3), a mixture of all features was used to compare different models and the two definitions of readmission. The importance of the given feature set was further evaluated using subsets of the original data, each subset consisting of patients suffering from a specific disease (Sections 4.4, 4.5). In the end (Section 4.6), an experiment was conducted to see how the resampling ratio affected the performance of a model and how the classification threshold could be altered to capture more readmitted patients.

4.1 Initial data analysis

In this section, filter methods based on univariate statistical tests and correlation results are presented. The initial data analysis was performed to gain an understanding of the given dataset and to further guide the feature selection. Although the results provide some insights into feature importance, the analysis is only concerned with each feature in isolation.


4.1.1 Filtering with univariate independent statistical tests

At first, the data was analysed using the methods presented in Section 2.1.1. By applying the selected statistical tests, 87 features reported p-values below the chosen significance level of 0.05 when using the strict definition of readmission. Using the relaxed one resulted in 190 features; 72 features were present in both sets. In Figures 4.1 and 4.2, the distributions of a few of these are visualized using the relaxed definition. It is apparent that even though the differences are statistically significant at the level of univariate first-order statistics, the overlaps between the distributions are large. Similar overlap was observed for all features reporting low p-values, no matter which definition of readmission was used. Although the results do demonstrate some interesting characteristics, for example the difference in distribution depending on gender, the overlapping distributions also highlight the difficulty in discriminating between the two classes based on individual features.
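A filter of this kind can be sketched per feature with a univariate test; the Mann-Whitney U test and the synthetic data below are illustrative assumptions (the thesis's actual tests are those of Section 2.1.1):

```python
import numpy as np
from scipy.stats import mannwhitneyu

rng = np.random.default_rng(0)
# Toy data: 5 features, only feature 0 truly differs between the classes.
y = rng.integers(0, 2, size=400)
X = rng.normal(size=(400, 5))
X[:, 0] += y  # inject a real class difference into feature 0

# Keep features whose two class-conditional samples differ at the 0.05 level.
selected = [j for j in range(X.shape[1])
            if mannwhitneyu(X[y == 0, j], X[y == 1, j],
                            alternative="two-sided").pvalue < 0.05]
print(selected)
```

With many features, some of the selections at the 0.05 level are expected to be false positives, which is one reason the filter results only guided, rather than fixed, the feature selection.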

4.1.2 Correlations

In Tables 4.1 and 4.2, the top ten correlations between a feature and the boolean variable indicating readmission are presented, using the strict and relaxed definitions, respectively. Much like in the previous section, the results merely highlight the difficulty in discriminating between the two classes. Features related to the number of previous admissions clearly appear as the most important, whereas the other correlations are small and could simply be due to noise.

Feature                                           Correlation
Number of previous admissions, past year          0.118
Number of previous admissions, past five years    0.109
Number of previous admissions, past six months    0.107
Mean inter-admission time, past six months        0.099
Mean inter-admission time, past year              0.099
Mean inter-admission time, past five years        0.087
Has diagnosis 148                                 0.070
Last NEWS value                                   0.070
Number of prescriptions                           0.068
Has diagnosis 261                                 0.064

Table 4.1: Top ten largest empirical correlations observed, using the strict definition, i.e. a patient being readmitted within 30 days and having the same diagnosis at both admissions.
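Since the readmission variable is binary, the Pearson correlation with a feature reduces to the point-biserial correlation. A minimal sketch on synthetic data (the variable names and distributions are illustrative assumptions, not the thesis data):

```python
import numpy as np

rng = np.random.default_rng(1)
readmitted = rng.integers(0, 2, size=1000)      # 0/1 readmission label
# Toy feature loosely tied to the label, e.g. a count of previous admissions.
n_prev_adm = rng.poisson(1 + 2 * readmitted)

# Pearson correlation against a 0/1 target == point-biserial correlation.
r = np.corrcoef(n_prev_adm, readmitted)[0, 1]
print(round(r, 3))
```

Computing this for every feature and sorting by absolute value yields ranking tables of the kind shown in Tables 4.1 and 4.2.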


Figure 4.1: Distributions of a few parameters for readmitted and not readmitted patients. In (a) the age distribution is visualized: the blue histogram shows the distribution of patients not readmitted, the pink (opaque) histogram shows the distribution of readmitted patients, and the purple sections indicate the overlap of the two distributions. Panels (b)-(f) show smoothed distributions, where the blue curve shows the distribution of not readmitted patients and the red curve the readmitted patients. The distributions are shown for the following features: (b) initial systolic blood pressure, (c) chemistry test 496, (d) number of previous admissions during the last year, (e) mean time between visits during the last year for patients being readmitted, (f) number of diagnoses. Due to anonymization it is not known what chemistry test 496 stands for.


Figure 4.2: Distributions of a few parameters for readmitted and not readmitted patients. Panels (a)-(d) visualize the distributions of some binary variables: (a) gender, (b) has procedure 656, (c) has diagnosis 261, and (d) social factor 17. Due to anonymization it is not known what procedure 656, diagnosis 261 and social factor 17 stand for.

Feature                                               Correlation
Number of previous admissions, past year              0.233
Number of previous admissions, past six months        0.227
Number of previous admissions, past five years        0.220
Mean inter-admission time, past six months            0.189
Mean inter-admission time, past year                  0.189
Mean inter-admission time, past five years            0.167
Number of previous visits (including non-admissions)  0.138
Number of diagnoses                                   0.105
Number of prescriptions                               0.097
Has diagnosis 739                                     0.083

Table 4.2: Top ten largest empirical correlations observed, using the relaxed definition, i.e. a patient being readmitted within 30 days.

