
DEGREE PROJECT IN COMPUTER SCIENCE AND ENGINEERING,
SECOND CYCLE, 30 CREDITS
STOCKHOLM, SWEDEN 2018

Conformal survival predictions at a user-controlled time point

The introduction of time point specialized Conformal Random Survival Forests

JELLE VAN MILTENBURG

KTH ROYAL INSTITUTE OF TECHNOLOGY


Abstract

The goal of this research is to expand the field of conformal predictions using Random Survival Forests. The standard Conformal Random Survival Forest can predict with a fixed certainty whether something will survive up until a certain time point. This research is the first to show that there is little practical use in the standard Conformal Random Survival Forest algorithm. It turns out that the confidence guarantees of the conformal prediction framework are violated if the Standard algorithm makes predictions for a user-controlled fixed time point. To solve this challenge, this thesis proposes two algorithms that specialize in conformal predictions for a fixed point in time: a Fixed Time algorithm and a Hybrid algorithm. Both algorithms transform the survival data that is used by the split evaluation metric in the Random Survival Forest algorithm. The algorithms are evaluated and compared along six different set prediction evaluation criteria. The prediction performance of the Hybrid algorithm outperforms that of the Fixed Time algorithm in most cases. Furthermore, the Hybrid algorithm is more stable than the Fixed Time algorithm when the prediction job extends to various time points. The hybrid Conformal Random Survival Forest should thus be considered by anyone who wants to make conformal survival predictions at user-controlled time points.



Acknowledgements

First, I would like to thank my supervisor Anders Vesterberg at Scania A.B. for the rich discussions that we had. The planned and unplanned conversations with Anders always led to new insights to continue the work. Also, his genuine interest in this project was a great motivation throughout the research. All in all, it was a pleasure to work together with Anders.

Furthermore, I would like to express my appreciation towards my supervisor Prof. Henrik Boström. Henrik guided me through difficulties in the process, and he never failed to steer my research in the right direction when obstacles emerged. Additionally, his remarks on my work allowed me to take the quality of the thesis to a higher level. Henrik's guidance not only benefited the quality of the thesis, it also helped me develop as a researcher.

Finally, I would like to thank my examiner Assoc. Prof. Sarunas Girdzijauskas for his critical feedback, which helped to improve the overall quality of the thesis.

Contents

0.1 Acronyms and abbreviations

1 Introduction
  1.1 Background
  1.2 Problem statement
  1.3 Purpose
  1.4 Goal
  1.5 Environmental sustainability
  1.6 Methodology
  1.7 Delimitations
  1.8 Outline

2 Related work
  2.1 Survival analysis
    2.1.1 Motivation
    2.1.2 Survival data
    2.1.3 Censoring
    2.1.4 Treatment of censored data
    2.1.5 Event probability
  2.2 Random Survival Forest
    2.2.1 Split
    2.2.2 Predicting
  2.3 Conformal Prediction
    2.3.1 Motivation
    2.3.2 The framework
    2.3.3 Applying conformal prediction to the Random Survival Forest

3 Methodology
  3.1 Algorithm design
  3.2 Evaluation
    3.2.1 Evaluation metrics
    3.2.2 Experimental setup

4 Algorithm proposal
  4.1 Intuition
  4.2 Algorithm design
  4.3 Data treatment for the Fixed Time algorithm
  4.4 Data treatment for the Hybrid algorithm

5 Results
  5.0.1 Parameter values
  5.1 Accuracy Error
  5.2 ε-independent unobserved criteria
  5.3 ε-dependent unobserved criteria
  5.4 ε-independent observed criterion
  5.5 ε-dependent observed criterion
  5.6 Base case

6 Conclusion
  6.1 Discussion

List of Figures

2.1 Example of pre-processing of Type I right censored data. First the data is gathered. Then the data gathering stops at time t_c, while some instances have not yet experienced the event: these instances are censored at the censoring time t_c. Finally, the time points are normalized such that each instance has a time point with respect to a globally similar starting time point.
2.2 The steps of the Random Survival Forest algorithm.
2.3 Example of the selection of the best split in the creation of a binary tree.
2.4 Example of a grown binary tree. In practice these trees grow much taller.
2.5 Example of a Random Survival Forest, consisting of multiple binary trees.
2.6 The process of event probability prediction.
2.7 A visualization of a conformal prediction framework in the context of survival analysis.
2.8 Example of how a new data instance i is classified for the two different class labels based on its p-values at different time points.
3.1 The count of two classes of the event indicator, y = 0 and y = 1, at different time points. For (c) the y-axis of the figure is manually adjusted for readability purposes. The actual data of (c) show a count of 490 instances at time 180 with class label y = 0.
4.1 Data treatment of the Fixed Time algorithm to align the data instances to a fixed time point.
4.2 Data treatment of the Hybrid algorithm to align the data instances to a fixed time point.
4.3 Data treatment of different algorithms to align the data instances to a fixed time point.
5.1 The average Accuracy Error (AE), the average deviation of the accuracy from the given confidence threshold, of different algorithms on data sets with a different censoring ratio, when evaluated by fixed time point predictions.
5.2 The average Accuracy Error (AE), the average deviation of the accuracy from the given confidence threshold, of different algorithms on data sets with a different censoring ratio, when evaluated by various time point predictions. These various time points come from the OOB-data and are randomly drawn from the same probabilistic data distribution as the data sets.
5.3 The average Accuracy Error (AE), the average deviation of the accuracy from the given confidence threshold, of different algorithms with different given confidence thresholds, when evaluated by fixed time point predictions.
5.4 The average Unconfidence (U), the average second highest p-value, of different algorithms on data sets with a different censoring ratio, when evaluated by fixed time point predictions.
5.5 The average Number (N), the average number of class labels in the set prediction, of different algorithms on data sets with a different censoring ratio, when evaluated by fixed time point predictions.
5.6 The average Multiples (M), the percentage of prediction sets with more than one class label, in the set prediction of different algorithms on data sets with a different censoring ratio, when evaluated by fixed time point predictions.
5.7 The Observed Unconfidence criterion (OU), the average highest p-value among false class labels, of different algorithms on data sets with a different censoring ratio, when evaluated by fixed time point predictions.
5.8 The Observed Multiple criterion (OM), the percentage of predictions where the prediction set contains false class labels, of different algorithms on data sets with a different censoring ratio, when evaluated by fixed time point predictions.
5.9 The error score of different criteria of different algorithms.

List of Tables

2.1 Example input data of a survival analysis algorithm, where t denotes the time point, y denotes the class label and x denotes a feature.
3.1 Algorithm efficiency criteria.
3.2 Descriptive statistics about the different data sets. 'p25', 'p50' and 'p75' refer to the corresponding percentiles of the data.
3.3 A summary of parameters that are used in the experiments.
4.1 Example input data of a survival analysis algorithm after data treatment of the Fixed Time algorithm, where t_f denotes the fixed time point, y_f denotes the class label that corresponds to the fixed time point, t denotes the original time point, y denotes the original class label and x denotes a feature.
4.2 Example input data after alternative data treatment.
5.1 The average criteria score of every criterion by different algorithms when evaluated by fixed time point predictions. The result is the average of experiments with the different data sets, the different confidence thresholds and the different fixed time points.

0.1 Acronyms and abbreviations

The thesis makes extensive use of acronyms and abbreviations to make the work more readable. Every term is explained when it is used for the first time. The set of acronyms and abbreviations can be found below.

CHF       Cumulative Hazard Function
CP        Conformal Prediction
CRSF      Conformal Random Survival Forest
ICP       Inductive Conformal Prediction
ML        Machine Learning
OOB-data  Out-of-bag data
RSF       Random Survival Forest
RF        Random Forest
SE        Split Evaluation
TCP       Transductive Conformal Prediction

Chapter 1

Introduction

Machine learning algorithms can use data from the past to generate a model that predicts behavior in the future. The field of machine learning has made way for decision making optimization in all industries across society [1]. Machine learning is applied to decision making for predicting, for example, the optimal moment to perform maintenance on an engine. This prediction can be either too lenient or too strict, leading to a lack of maintenance or to unnecessary maintenance, respectively. Machine learning algorithms can predict whether an engine breaks down before a given time point if they have access to data about such breakdown events for similar engine instances. For cases like maintenance planning, the certainty of each prediction is crucial, since bad maintenance planning is costly. Ideally, an algorithm predicts with a fixed certainty whether an instance will break down after a specific time point. After all, it is hard to base decisions on a prediction with an unknown or variable certainty. In this research we explore ways to predict with a fixed certainty whether a new data instance will experience an event before a specific moment.

1.1 Background

Predicting event occurrences with a fixed certainty comes with a twofold challenge: the event does not occur for every data instance in the training data, and predictions usually come with an unknown and uncontrollable confidence. These challenges have each been researched before.

The first challenge addresses the fact that an event does not always occur for every instance in the data that is used to build a predictive model. In other words, the training data at hand consists of instances that experienced the event at a specific point in time, as well as instances that did not experience the event. In the latter case it is called a censored data point.

The certainty of a model is usually expressed at the level of the whole model, for example as "the model classified 88% of the test instances correctly and is therefore 88% accurate". However, the confidence of a single point prediction may vary significantly depending on the feature values. Some instances are simply harder to predict than others. This makes a point prediction not sufficiently informative to be used in decision making.

Experience from training data could also be used to output a prediction with a fixed confidence. Shafer and Vovk [7] proposed a framework that relaxes the output space from a point prediction to a set prediction $\Gamma^\epsilon$, for example $\Gamma^\epsilon = \{2, 3, 4\}$. This relaxation makes it possible to fix the confidence of every prediction. This framework is called Conformal Prediction (CP). Conformal prediction is a machine learning framework in which successive prediction errors are probabilistically independent. Using this, one can provide a set prediction for any given confidence threshold $\epsilon$: the probability of the real value falling in the set $\Gamma^\epsilon$ is $\epsilon$.

The use of conformal prediction becomes intuitive when one thinks about a scenario with patients A and B. A has a rare disease, and B has a well known disease. Circumstances could lead a doctor to estimate both their survival times as the point predictions $\hat{y}_A = \hat{y}_B = 10$. However, a prediction set with a fixed certainty might be more useful in this case, for example $\Gamma^\epsilon_A = [2, 20]$ and $\Gamma^\epsilon_B = [9, 11]$. With a prediction set, the doctor can incorporate both the set and the confidence of every prediction in his decision making, where a larger set means a greater uncertainty. In short, conformal prediction can be very useful for decision making because of the fixed confidence of every individual prediction. Conformal prediction is therefore a compelling technique to deal with the second challenge.

1.2 Problem statement

This research builds on the method recently proposed by Boström, Gurung, Asker, Karlsson, Lindgren, and Papapetrou [8]. Their method, a Conformal Random Survival Forest (CRSF), combines the robust Random Survival Forest algorithm [6] with a conformal prediction framework. This allows a user-determined confidence for every survival prediction [7]. The error rate is proven to be in line with the confidence level. The research by Boström, Gurung, Asker, Karlsson, Lindgren, and Papapetrou [8] covers a standard case in which the Conformal Random Survival Forest outputs a conformal prediction of the event indicator of new instances. The CRSF, however, is built for generic purposes, meaning that it is trained to predict the event indicator of instances at a wide variety of time points.

1.3 Purpose

The CRSF has already proven its value in predicting data in a generic setting. The performance effects of a time point specialized CRSF algorithm are yet unknown. The challenge remains to find a way to create a specialized model that can predict survival event indicators for a specific time point. This thesis explores the effects of a time-specific survival algorithm in a conformal prediction framework. The purpose can be summarized as finding an answer to the following research question:

What are the performance effects when a Conformal Random Survival Forest is trained to make conformal survival predictions for a specific time point? The answer to this question will advance the field of CRSF algorithms.

1.4 Goal

1.5 Environmental sustainability

Environmental sustainability has been defined as 'maintenance of capital' in widely accepted research [10]. The results of this thesis can help to improve maintenance planning of vehicles, or of other objects that need maintenance. This will lead to more effective maintenance and therefore fewer breakdowns and replacements. This research therefore contributes directly to the environmental sustainability impact that KTH Royal Institute of Technology strives to achieve.

1.6 Methodology

Two new algorithms are proposed to make conformal survival predictions for specific time points. The performance of the proposed methods is measured with a quantitative analysis and compared with the state-of-the-art. The comparison of the performance of survival algorithms in a conformal prediction framework is not straightforward. Vovk, Nouretdinov, Fedorova, Petej, and Gammerman [11] recently published a study that functions as a guideline for set prediction evaluations. The evaluation of the proposed algorithms therefore relies heavily on the criteria published in the work of Vovk, Nouretdinov, Fedorova, Petej, and Gammerman [11]. The statistical analysis of the proposed method uses public data sets with known properties, as well as a data set from a real-life use case, gathered by Scania A.B. Although the individual tests involve large volumes of quantitative data, only six different data sets are used. This means that the research methodology can be classified as both quantitative and qualitative.

1.7 Delimitations

This research treats survival analysis as a binary classification problem, meaning that the class labels are either 0 (no event occurred) or 1 (an event occurred).

1.8 Outline

Chapter 2

Related work

In this chapter the background of survival analysis, Random Survival Forests and conformal prediction is described. These domains combined cover a specific field in machine learning. Machine learning is typically considered a sub-field within computer science. The goal of machine learning is stated by Simeone in his introduction to machine learning as 'the development of efficient pattern recognition methods that are able to scale well with the size of the problem domain and of the data sets' [12]. First, survival analysis is described. This includes the motivation of this algorithm family, the data it can work with and the prediction type it can output. Second, the Random Survival Forest (RSF) is described in more detail. The RSF is the specific survival analysis algorithm that is adopted in this research. Third, conformal prediction as a machine learning framework is motivated and explained.

2.1 Survival analysis

2.1.1 Motivation

This research tries to make an algorithm that is able to predict whether or not an event happens before a time point. This event does not always occur for every instance in the training data. Thus the data at hand consists of instances for which the event occurred before a specific point in time, as well as instances for which the event did not occur before a specific point in time. In the latter case it is called a censored data point.

Survival analysis algorithms belong to the family of algorithms that are used when 'the time until an event occurs' is the variable one is trying to predict. The handling of censored data points is generally taken on by survival analysis algorithms, since standard machine learning techniques are often not able to turn censored data into an accurate predictive model. Survival analysis is a branch of the statistical learning domain. It involves labeled data and is therefore a form of supervised learning. Depending on the task, survival analysis can be both a classification problem and a regression problem. Scientists generally speak of classification problems if the target variable is discrete, and of regression problems if the target variable is continuous. We speak of a regression problem when one is trying to predict when an event will occur, but of a classification problem when the time is fixed and we are interested in whether the event will occur at that time. In this research we focus on survival analysis as a binary classification problem.

2.1.2 Survival data

Survival data is structured and well defined. It consists of three different elements: time $t$, event indicator $y$ and input feature vector $x$. We denote $t_i$ as the normalized time for subject $i$, and $y_i$ as the event indicator, where

$$y_i = \begin{cases} 1 & \text{if the event was observed} \\ 0 & \text{if the instance was censored} \end{cases}$$

Assumption A-1. A subject reports the event at the same time as it is experienced.

This assumption makes evaluating survival models practical. A subject that reports an event after a fixed time $t$ can now be labeled as '0' at time $t$. Without this assumption, the same subject could have experienced the event before the fixed time $t$ but only reported it after $t$, making the interpretation of this subject at time $t$ ambiguous. The resulting data can be visualized as in Table 2.1.

Table 2.1: Example input data of a survival analysis algorithm, where t denotes the time point, y denotes the class label and x denotes a feature.

  t    y   x1  ...  xm
  65   1   B   ...  34
  41   0   C   ...  26
  102  0   B   ...  43
  97   1   A   ...  34

2.1.3 Censoring

The data on which a survival model is based does not only include instances for which the event actually occurred. An instance is called right-censored when a time is given, but the event has not actually occurred up until that point in time [13]. This data is valuable, since it confirms the subject to be 'alive' or 'not broken' at that point in time.

There are three general types of right-censoring. First of all, fixed type I censoring occurs when the data is gathered in a study with a fixed end time C. Not all subjects that participated in the study necessarily observe the event within this fixed time. This means that the subjects for which the event did not occur are censored at time C.

Second, random type I censoring occurs when subjects enter the study at different points in time, so that each subject has its own censoring time. This type of right-censoring is the most common in the use cases of this thesis.

Third, type II censoring occurs when the data gathering study is designed such that it ends when a certain number of events has been reported. If this threshold is lower than the total number of subjects, all the other subjects are automatically right-censored.

To treat censored data, we build on the following assumption.

Assumption A-2. The fact that a data point is censored does not influence the probability of an event. This means that censoring is caused by one of the types listed above, and not by anything else such as impending failure.

This assumption is necessary because otherwise the right-censored data points could no longer be seen as points in time where no event has happened. The survival function can only be created when this assumption holds.

2.1.4 Treatment of censored data

This study concerns only type I censoring: the study ends at time $t_c$. The data is therefore pre-processed such that it can be used by survival analysis algorithms. The most important adjustment is the normalization of time. This ensures that $t_0$ is the same for every instance, and that every time point is expressed relative to a common starting point.

Figure 2.1: Example of pre-processing of Type I right censored data. First the data is gathered. Then the data gathering stops at time $t_c$, while some instances have not yet experienced the event: these instances are censored at the censoring time $t_c$. Finally, the time points are normalized such that each instance has a time point with respect to a globally similar starting time point.
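To make this pre-processing concrete, the following is a minimal Python sketch of the treatment shown in Figure 2.1, assuming absolute delivery and event times per subject; the function and variable names are illustrative and not taken from the thesis implementation.

```python
import numpy as np

def preprocess_type1(delivery_times, event_times, t_c):
    """Pre-process Type I right-censored data (cf. Figure 2.1).

    delivery_times: absolute time each subject started being observed
    event_times:    absolute event time per subject, or np.nan when no
                    event was seen before the study ended at time t_c
    Returns normalized times t and event indicators y (1 = observed,
    0 = censored at t_c).
    """
    delivery = np.asarray(delivery_times, dtype=float)
    events = np.asarray(event_times, dtype=float)

    # Subjects without an event before t_c are censored at t_c.
    filled = np.where(np.isnan(events), np.inf, events)
    observed = filled <= t_c
    end = np.where(observed, filled, t_c)

    # Normalize so every instance starts at the same time point t_0 = 0.
    return end - delivery, observed.astype(int)
```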

2.1.5 Event probability

The event probability can be described using a set of interlinked functions. These functions only make sense if we make the following assumption.

Assumption A-3. The training and test data are drawn from an unknown underlying distribution. Each input instance is characterized by an input vector $x$, a time point $t$ and an event indicator $y$. The distribution is specific but constant: no data drift occurs during the data gathering. This automatically assumes exchangeability of instances.

Instances $z_1, \ldots, z_n$ are exchangeable if, for every permutation $\tau$ of the integers $1, \ldots, n$, the instances $w_1, \ldots, w_n$, where $w_i = z_{\tau(i)}$, have the same joint probability distribution as $z_1, \ldots, z_n$ [14]. The exchangeability of instances is what the conformal prediction framework relies on for its validity guarantees.

First of all, the death density function $f(t|x)$ is a curve fitted to the histogram that represents the death occurrences at different time points, where $t$ represents time and $x$ the feature vector. The death distribution function

$$F(t|x) = 1 - S(t|x) \quad (2.2)$$

is formed by taking the area under that curve, representing the event probability as a function of $t$.

$$S(t|x) = Pr(T > t \mid x) = e^{-H(t|x)} \quad (2.3)$$

is the survival function, representing the survival probability as a function of $t$. The survival function is non-increasing, since the probability of surviving past time $t$ decays. Furthermore, $S(0|x) = 1$ and $S(\infty|x) = 0$, meaning that all subjects survive past time 0 and will observe the event eventually. The survival function is smooth in theory. In practice the function is continuous but not differentiable, because time is recorded on a discrete scale. The survival function is expressed in terms of the cumulative hazard function (CHF)

$$H(t|x) = \int_0^t h(u|x)\,du \quad (2.4)$$

which describes the accumulated probability up to time $t$ that the event of interest occurs. The hazard function $h(t|x)$ is the instantaneous probability of an event happening at time $t$. The hazard function is expressed as:

$$h(t|x) = \lim_{\Delta t \to 0} \frac{Pr(t \le T < t + \Delta t \mid T \ge t, x)}{\Delta t} \quad (2.5)$$

2.2 Random Survival Forest

The Random Survival Forest (RSF) algorithm [15] extends the Random Forest (RF) algorithm [16] to fit right-censored survival data. It is able to uncover complex data structures and patterns while staying within an accepted range of computational complexity.

The algorithm grows random survival trees out of independent bootstrap samples, each excluding on average 37% of the data. The excluded data is called the out-of-bag data (OOB data) [15]. The splits in the random survival trees are based on a random subset of size $p$ of the input features. Each tree is grown to full size, where a node is turned into a terminal (leaf) node $g \in G$ if there is a minimum number of event occurrences in one of the child nodes, or if the number of training instances decreases below a certain threshold. The CHF is calculated for each tree, and averaged to retrieve the ensemble CHF. The earlier excluded OOB data can then be used to calculate the prediction error of this final CHF. The algorithm is summarized in Figure 2.2.

1. Draw B bootstrap samples; the excluded samples are the out-of-bag data (OOB data).
2. Grow a survival tree from each bootstrap sample, splitting at each node based on a split metric.
3. Grow the tree to full size, with the constraint that every terminal node has at least one event occurrence.
4. Calculate the CHF for each tree; average to obtain the ensemble CHF.
5. Use the OOB data to calculate the prediction error.

Figure 2.2: The steps of the Random Survival Forest algorithm.
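Step 1 can be sketched as follows (an illustrative, NumPy-based sketch, not the thesis implementation); indices never drawn for a tree, about 37% on average, form that tree's OOB data.

```python
import numpy as np

rng = np.random.default_rng(0)

def bootstrap_with_oob(n_instances, n_trees):
    """Step 1 of the RSF algorithm: draw B bootstrap samples.

    Each sample draws n_instances indices with replacement; the
    indices never drawn form the out-of-bag data for that tree.
    """
    samples = []
    for _ in range(n_trees):
        in_bag = rng.integers(0, n_instances, size=n_instances)
        oob = np.setdiff1d(np.arange(n_instances), in_bag)
        samples.append((in_bag, oob))
    return samples
```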

2.2.1 Split

Harrell's concordance index (C-index) is an example of a split evaluation metric that uses all recorded survival times for the evaluation [17]. Such metrics are computationally expensive, due to the examination of multiple time points. This research uses a cheaper split evaluation (SE) metric that examines only one time point per instance, since this work focuses on instances with one specific time point, which is also in line with Boström's research on survival analysis [8]. This metric, referred to as the SE-min splitting rule, minimizes the squared error of the predicted survival probabilities in the child nodes resulting from a split. The estimator is calculated as:

$$SE(L, R) = \sum_{i=1}^{n_l} \left(\hat{y}_i^l(t_i^l) - y_i^l\right)^2 + \sum_{i=1}^{n_r} \left(\hat{y}_i^r(t_i^r) - y_i^r\right)^2 \quad (2.6)$$

where $l$ is the left and $r$ the right child node, containing the instances $L = \{(x_1^l, t_1^l, y_1^l), \ldots, (x_{n_l}^l, t_{n_l}^l, y_{n_l}^l)\}$ and $R = \{(x_1^r, t_1^r, y_1^r), \ldots, (x_{n_r}^r, t_{n_r}^r, y_{n_r}^r)\}$ respectively. Each child node should contain at least one event occurrence, that is, $y_i^l = y_j^r = 1$ for at least one $i$ and one $j$. The model is thereby penalized for making errors in both branches. It is intuitive that the model will favor splits that result in smaller errors when predicting the event probability for any time $t$.

Figure 2.3: Example of the selection of the best split in the creation of a binary tree.
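A minimal sketch of the SE-min rule of Equation 2.6, assuming the predicted event probabilities of the two child nodes are already available; the names are illustrative.

```python
import numpy as np

def se_split(pred_l, y_l, pred_r, y_r):
    """SE-min splitting rule (Equation 2.6): sum of squared errors of
    the predicted probabilities in the left and right child nodes."""
    pred_l, y_l = np.asarray(pred_l), np.asarray(y_l)
    pred_r, y_r = np.asarray(pred_r), np.asarray(y_r)
    return np.sum((pred_l - y_l) ** 2) + np.sum((pred_r - y_r) ** 2)

# The best split is the candidate minimizing SE(L, R), cf. Figure 2.3:
# best = min(candidate_splits, key=lambda c: se_split(*c))
```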

Each instance ends up in a terminal node; the survival function of that terminal node then estimates the event probability. Such a tree can be seen in Figure 2.4.

Figure 2.4: Example of a grown binary tree. In practice these trees grow much taller.

Figure 2.5: Example of a Random Survival Forest, consisting of multiple binary trees.

2.2.2 Predicting

The goal of the survival analysis is to predict whether an instance will survive up to a specific time. This research focuses on predicting the correct label of an instance, given the time point of that instance and a feature vector. In standard machine learning, this is done by generating an ensemble cumulative hazard. Here, we use the Nelson-Aalen estimator to estimate the hazard function [18, 19]. We denote $e_{g,i}$ as the number of events and $r_{g,i}$ as the number of instances at risk, at time $t_{g,i}$:

$$\hat{H}_g(t) = \sum_{t_{g,i} \le t} \frac{e_{g,i}}{r_{g,i}} \quad (2.7)$$

The CHF is estimated separately per terminal node; the CHF estimate for an input feature vector $x$ that ends up in terminal node $g$ is therefore

$$\hat{H}_g(t|x) = \hat{H}_g(t) \quad (2.8)$$

The survival probability can then be calculated with the Kaplan-Meier estimator [20]:

$$S_g(t) = \prod_{t_{g,i} \le t} \left(1 - \frac{e_{g,i}}{r_{g,i}}\right) \approx e^{-\hat{H}_g(t|x)} \quad (2.9)$$

In this research, we focus on event probabilities, the complement of the survival probability:

$$\hat{y}_g = 1 - S_g(t) \quad (2.10)$$

We can make predictions of the event probabilities based on these formulas and historic data. Figure 2.6 shows the full process of an event probability prediction.

1. A new instance $i$ arrives with feature vector $x_i$.
2. We want to predict the survival probability of the instance at time point $t$.
3. We drop $i$ into all survival trees, where it ends up in a terminal node based on $x_i$.
4. The survival probability is then calculated for each terminal node, based on $\hat{H}(t)$ of that terminal node.
5. $\hat{y}_i$ is then calculated by averaging all survival probabilities.

Figure 2.6: The process of event probability prediction.
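The estimators of Equations 2.7-2.10 and the prediction process of Figure 2.6 can be sketched as follows; this is a simplified illustration under the stated formulas, not the thesis implementation.

```python
import numpy as np

def nelson_aalen_chf(event_times, n_events, n_at_risk, t):
    """Nelson-Aalen CHF estimate of Equation 2.7 for one terminal node:
    the sum of e_i / r_i over all event times up to and including t."""
    mask = np.asarray(event_times) <= t
    return np.sum(np.asarray(n_events)[mask] / np.asarray(n_at_risk)[mask])

def predicted_event_probability(chfs_per_tree):
    """Figure 2.6 with Equations 2.9-2.10: approximate survival per tree
    as exp(-H_hat) in the reached terminal node, average the survival
    probabilities over the forest, and return the event probability."""
    survival = np.exp(-np.asarray(chfs_per_tree, dtype=float))
    return 1.0 - survival.mean()

# Example: CHF estimates at time t from the terminal nodes of three trees.
print(predicted_event_probability([0.4, 0.6, 0.5]))  # about 0.39
```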

2.3 Conformal Prediction

2.3.1 Motivation

The estimated confidence of a new prediction $\hat{y}(t|x)$ is usually based on the accuracy of the whole model. However, feature vectors can differ significantly in terms of label variance; in other words, some instances are harder to predict than others. Therefore, the certainty of a prediction differs for every new prediction, depending on the input feature vector.

When conformal prediction was first introduced, it relied on Transductive Conformal Prediction (TCP) [21]. This means that the underlying model had to be retrained to determine the set prediction of every new instance. Inductive Conformal Prediction (ICP) was developed to solve this computational problem [22]. It uses a calibration set, separated from the training set, which allows a single underlying model. That model is generated with fewer training instances, but the method is computationally feasible on a larger scale.

2.3.2 The framework

The ICP framework relies on an underlying model $H$ and a calibration set $C = \{(x_1, y_1), \ldots, (x_n, y_n)\}$, where $x_i$ is an input feature vector and $y_i \in Y$ is the class label; in this research $y_i \in \{0, 1\}$. The non-conformity measure, denoted $A(H, x, y)$, is a function that calculates the unlikelihood of $y$ as the label for $x$, as estimated by model $H$. A higher score means a higher unlikelihood. In the context of survival analysis, however, one can also add the time $t$ to the non-conformity measure, resulting in $A(H, x, t, y)$. With the non-conformity measure, a p-value can be calculated for $x$ and $y$ at time $t$, given a model $H$ and calibration set $C$. The p-value is calculated as:

$$p_{x,t,y} = \frac{l + e \cdot U[0, 1]}{|C| + 1} \quad (2.11)$$

where $l = \sum_{(x_c, t_c, y_c) \in C} \mathbb{1}\big(A(H, x, t, y) < A(H, x_c, t_c, y_c)\big)$ is the number of instances in the calibration set whose non-conformity score exceeds the non-conformity score of $x$ and $y$, $e = \sum_{(x_c, t_c, y_c) \in C} \mathbb{1}\big(A(H, x, t, y) = A(H, x_c, t_c, y_c)\big)$ is the number of instances in the calibration set whose non-conformity score equals it, and $U[0, 1]$ is a uniform random number that breaks ties.
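A minimal sketch of Equation 2.11, assuming the non-conformity scores of the calibration set are pre-computed; the smoothing term $e \cdot U[0, 1]$ breaks ties at random.

```python
import numpy as np

rng = np.random.default_rng(0)

def p_value(score, calibration_scores):
    """Smoothed conformal p-value of Equation 2.11.

    score:              non-conformity score A(H, x, t, y) of the test pair
    calibration_scores: scores A(H, x_c, t_c, y_c) over the calibration set
    """
    cal = np.asarray(calibration_scores)
    l = np.sum(cal > score)    # calibration scores exceeding the test score
    e = np.sum(cal == score)   # ties, broken by a uniform random draw
    return (l + e * rng.uniform()) / (len(cal) + 1)
```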

Figure 2.7: A visualization of a conformal prediction framework in the context of survival analysis.

We can now set a confidence level $d$ and retrieve our set prediction for instance $x$ as

$$\hat{Y}(x) = \{y : y \in Y \wedge p_{x,t,y} < d\} \quad (2.12)$$

which includes all labels $y$ whose p-values are less than the confidence threshold $d$. By applying A-3, the set prediction contains the real label with a probability greater than or equal to the confidence level, $P[y \in \hat{Y}(x, t)] \ge d$. This guarantee does not apply to individual class labels; the error probability of specific labels can exceed $1 - d$. Mondrian cross-conformal prediction can be applied to extend the guarantee to specific class labels [23, 24]. In Mondrian cross-conformal prediction, the calibration set is divided based on the class labels, and the p-values are then calculated for each specific label with the corresponding calibration partition. This, however, does not mean that each individual set prediction has a probability $d$ of including the real label $y$: one can for example retrieve a prediction set with all existing labels, or an empty set. For those individual predictions, the chance of including $y$ is 100% and 0%, respectively.

A conformal predictor cannot be evaluated with the evaluation metrics of ordinary machine learning frameworks, like accuracy, because the error probability is fixed and will not differ from the confidence threshold. Alternative evaluation methods are discussed in Section 3.2.

2.3.3 Applying conformal prediction to the Random Survival Forest

In this research, we apply conformal prediction to the RSF algorithm described in Section 2.2. The most essential part of the application is the choice of non-conformity measure. The non-conformity measure in binary classification problems is often the estimated probability that a label is incorrectly classified by the underlying model [8]. Therefore, this research uses that non-conformity measure and applies it to the RSF algorithm:

$$A(H, x, t, y) = \mathbb{1}\{y = 0\}\, H(x, t) + \mathbb{1}\{y = 1\}\,(1 - H(x, t)) \quad (2.13)$$

where $H(x, t)$ denotes the estimated probability that $y = 1$ before time $t$, considering input features $x$, as output by the RSF. With this non-conformity measure $A(H, x, t, y)$, model $H$, calibration set $C$, and confidence threshold $d$, a prediction can be calculated for the input vector $x$ and time $t$ as:

$$P(x, t) = \{y : y \in \{0, 1\} \wedge p_{x,t,y} < d\} \quad (2.14)$$
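Putting Equations 2.11, 2.13 and 2.14 together, the set prediction for a single instance could be sketched as follows (illustrative, not the thesis implementation).

```python
import numpy as np

rng = np.random.default_rng(0)

def nonconformity(h_xt, y):
    """Equation 2.13: the estimated probability that label y is wrong,
    where h_xt = H(x, t) is the RSF estimate of P(y = 1 before t)."""
    return h_xt if y == 0 else 1.0 - h_xt

def p_value(score, cal_scores):
    """Smoothed conformal p-value (Equation 2.11)."""
    cal = np.asarray(cal_scores)
    l, e = np.sum(cal > score), np.sum(cal == score)
    return (l + e * rng.uniform()) / (len(cal) + 1)

def prediction_set(h_xt, cal_scores, d):
    """Equation 2.14: all labels whose p-value is below threshold d."""
    return {y for y in (0, 1)
            if p_value(nonconformity(h_xt, y), cal_scores) < d}
```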

Figure 2.8: Example of how a new data instance i is classified for the two different class labels based on its p-values at different time points.

Chapter 3

Methodology

The methodology of this research should provide the strategy to answer the research question "What are the performance effects when a Conformal Random Survival Forest is trained to make conformal survival predictions for a specific time point?". To answer this question, a new Conformal Random Survival Forest algorithm should be designed and a method should be proposed to measure the performance effects. The methodology of this research therefore consists of two parts. First, the process of designing a new algorithm is explained. Second, the evaluation of the algorithms is described, in which both the evaluation criteria and the experimental setup are illustrated.

3.1 Algorithm design

This research aims to find a way to train a Conformal Random Survival Forest to make conformal survival predictions for a specific time point. To achieve this, the standard CRSF algorithm needs to be adjusted. Inspiration for the adjustments comes from the research by Boström, Gurung, Asker, Karlsson, Lindgren, and Papapetrou [8] and Vovk, Gammerman, and Shafer [23]. Two different changes are implemented based on two different intuitions, leading to the Fixed Time algorithm and the Hybrid algorithm. The changes involve the use of a fixed time point $t_f$. The different algorithms together with this new time point are explained in more detail in Chapter 4, but the terms will already be used in the experimental design.

The algorithms are implemented in Python 3, on top of the standard CRSF algorithm. This way, the standard CRSF algorithm and the two new CRSF algorithms can be compared reliably. The comparison of different set prediction algorithms is not straightforward; the next section describes the challenges of evaluating set predictors and ways to overcome them.

3.2 Evaluation

The two new algorithms are designed; the challenge now remains to compare their performance with each other and with the state-of-the-art. This section covers two parts of the evaluation of the proposed algorithms. First, the evaluation metrics are described that enable an analytic comparison between the different algorithms. Thereafter, the experimental setup is described, in which the data sets and parameters are outlined.

3.2.1 Evaluation metrics

Common evaluation metrics for classification algorithms, such as accuracy or error probability, cannot be applied when evaluating CRSFs, because the error probability is determined by the confidence threshold given by the user.

The different metrics which are informative about the performance of the CRSF algorithm are listed and described below.

Accuracy Error

The Accuracy Error of a conformal predictor can be calculated from the fraction of predictions that include the right class label among the total number of instances. By design, this fraction should match the confidence threshold $d$. The metric can be formulated as:

$$AE = \left| d - \left(1 - \frac{1}{|Y|} \sum_{i \in Y} \mathbb{1}\big((y_i = 1 \wedge 1 \notin \hat{y}_i) \vee (y_i = 0 \wedge 0 \notin \hat{y}_i)\big)\right) \right| \quad (3.1)$$

where $Y$ is the set of test instances, $y_i \in \{0, 1\}$ is the true class label of instance $i$, and $\hat{y}_i \subseteq \{0, 1\}$ is the set prediction for instance $i$. The optimal result would be $AE = 0$.

Number

The number criterion $N$ refers to the number of labels that are on average included in the set predictions. We formulate this metric as:

$$N = \frac{1}{|Y|} \sum_{i \in Y} |\Gamma_i^\epsilon| \quad (3.2)$$

Ideally $N = 1$, since prediction sets of size 1 are the easiest to interpret: the predicted label in the singleton set is the real label with probability $\epsilon$. The criterion is $\epsilon$-dependent, since $\epsilon$ determines the size of $\Gamma_i^\epsilon$: more elements will be included in the prediction sets as $\epsilon$ increases, in order to establish the higher confidence needed to include the true class label.

Sum

The sum criterion $S$ measures the average sum of the p-values of all class labels. It is $\epsilon$-free and lower values are preferable. However, the CRSF gives an S-value of 1 by design: the CRSF outputs two p-values that always add up to 1. Therefore we do not take this criterion into account.

Unconfidence

The unconfidence criterion $U$ measures the average second highest p-value. In the binary case, this is formulated as:

$$U = \frac{1}{|Y|} \sum_{i \in Y} \min_y \max_{y' \ne y} p_i^{y'} \quad (3.4)$$

U-values are $\epsilon$-free and preferably small. A high value means that there is another label, besides the most likely one, that the algorithm also considers likely. This makes the algorithm more unconfident, which is undesirable.

Multiple

The multiple criterion $M$ measures the percentage of prediction sets that include more than one class label. For binary classification, this can be calculated as

$$M = \frac{1}{|Y|} \sum_{i \in Y} \mathbb{1}\{|\Gamma_i^\epsilon| > 1\} \quad (3.5)$$

M-values are $\epsilon$-dependent, since the chance of multiples increases when $\epsilon$ is increased: more elements will be included in the prediction sets to establish the higher confidence needed to include the true class label. Small M-values are preferable: the lower this value, the closer the prediction sets approximate point predictions, which are easier to interpret. If $M$ is too high, very few prediction sets consist of one prediction; such sets are especially uninformative in a binary classification problem like the CRSF's.

Observed unconfidence

The observed unconfidence criterion $OU$ measures, for each instance, the largest p-value $p_i^y$ of the false class labels $y \ne y_i$. This statistic can be expressed as:

$$OU = \frac{1}{|Y|} \sum_{i \in Y} \max_{y \ne y_i} p_i^y \quad (3.6)$$

Smaller OU-values are preferable, and OU-values are $\epsilon$-free. A high OU-value means that the algorithm gives high p-values to false class labels, making them more likely to end up in the set prediction, which is undesirable.

Observed multiple

The observed multiple criterion $OM$ measures the percentage of predictions where the prediction set contains false class labels. The OM can be calculated as

$$OM = \frac{1}{|Y|} \sum_{i \in Y} \mathbb{1}\{\Gamma_i^\epsilon \setminus \{y_i\} \ne \emptyset\} \quad (3.7)$$

Lower values are preferred, since false labels in a prediction set only hinder the interpretation. Note that it is also possible to output a prediction set of size one that contains one false label. OM-values are $\epsilon$-dependent, since the chance of multiples increases when $\epsilon$ is increased: more elements will be included in the prediction sets to establish the higher confidence needed to include the true class label.

The individual criteria are not able to evaluate a set predictor on their own. The different evaluation metrics together, on the other hand, should give a good indication of the efficiency of a conformal predictor. The different efficiency criteria are summarized in Table 3.1.

Table 3.1: Algorithm efficiency criteria.

  ε-independent                 ε-dependent
  S (Sum)                       N (Number)
  U (Unconfidence)              M (Multiple)
  OU (Observed Unconfidence)    OM (Observed Multiple)
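The criteria of this section can be sketched for a binary conformal predictor as follows (an illustrative sketch, not the thesis implementation; S is omitted since it is 1 by design, and the inputs are assumed to be the set predictions, the per-label p-values and the true labels).

```python
import numpy as np

def criteria(pred_sets, p_values, y_true, d):
    """Evaluation criteria of Section 3.2.1 for a binary conformal
    predictor.

    pred_sets: list of set predictions, e.g. [{0}, {0, 1}, ...]
    p_values:  array of shape (n, 2), p-value per class label
    y_true:    list of true labels in {0, 1}
    d:         confidence threshold
    """
    p = np.asarray(p_values)
    acc = np.mean([y in s for s, y in zip(pred_sets, y_true)])
    return {
        "AE": abs(d - acc),                             # Eq. 3.1
        "N":  np.mean([len(s) for s in pred_sets]),     # Eq. 3.2
        "U":  np.mean(p.min(axis=1)),                   # Eq. 3.4 (binary)
        "M":  np.mean([len(s) > 1 for s in pred_sets]), # Eq. 3.5
        "OU": np.mean([p[i, 1 - y] for i, y in enumerate(y_true)]),  # Eq. 3.6
        "OM": np.mean([len(s - {y}) > 0
                       for s, y in zip(pred_sets, y_true)]),         # Eq. 3.7
    }
```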

3.2.2 Experimental setup

We now have evaluation metrics to measure how an algorithm performs in absolute terms. These metrics are not informative without a comparison with the state-of-the-art; only such a comparison leads to a conclusion about the performance of an algorithm. In this research, we use as the vanilla model the Conformal Random Survival Forest as given by Boström, Gurung, Asker, Karlsson, Lindgren, and Papapetrou [8]. The performance in several scenarios is compared. The experiment uses five different algorithmic parameters: data set, algorithm, evaluation data, confidence and time point. The different parameters are further motivated and described in the following subsections.

Data set

The experiment uses six different survival data sets. This means that the instances of these sets have both an event indicator and a time point value. The data sets vary in size between 195 and 10000 instances, with an average of 706 instances, as shown in Table 3.2. The number of features varies between 6 and 24, with an average of 14 features. The features include both numerical and categorical values.

Table 3.2: Descriptive statistics about the different data sets. 'p25', 'p50' and 'p75' refer to the corresponding percentiles of the data.

  Data set       # Instances   # Features   % Censored   Max. Time   Min. Time

  Data set       p25 Time   p50 Time   p75 Time   St.dev. Time   Mode Time
  Actg           174        257        300        90             293
  Gbcs           798        1338       1826       619            740
  Grace          11         177        180        80             180
  Pharynx        238        445        782        418            112
  Steer bearing  808        1540       2213       814            1817
  Whas           296        632        1365       705            1

Figure 3.1: The count of the two classes of the event indicator, y = 0 and y = 1, at different time points, for the data sets (a) Actg, (b) Gbcs, (c) Grace, (d) Pharynx, (e) Steer bearing and (f) Whas. For (c) the y-axis of the figure is manually adjusted for readability purposes. The actual data of (c) show a count of 490 instances at time 180 with class label y = 0.

A large percentage of the data points is censored, as represented in both Table 3.2 and Figure 3.1. This percentage is especially high in the Actg and Steer bearing data sets. One could question the quality of any predictive model trained with one heavily over-represented class.

Algorithm

Three types of algorithms are tested. First, there is the standard Conformal Random Survival Forest algorithm. This is the vanilla model, which functions as a benchmark. Second, we use the Fixed Time algorithm. This model uses the split function as described in Section 4.3, where some instances are discarded in the split evaluation. Third, we test the Hybrid algorithm as described in Section 4.4. This model uses fixed time data points where possible, but uses the original time stamp and event indicator in cases where this transformation is not possible.

Evaluation data

The performance of the different algorithms depends on the data that is used for the evaluation. In the first case, we use the original data. This means that all the algorithms are evaluated on their ability to predict the event indicator at variable times. Second, the fixed time data is used for evaluation. This means that a specific time point is set, and the original data is adjusted according to Section 4.3 (including the discarding of data points where necessary). Third, the hybrid data is used for the evaluation of the algorithms, where the original data is transformed according to Section 4.4.

Confidence

The experiments are run with four different confidence thresholds: 0.75, 0.80, 0.90 and 0.95 (see Table 3.3). These four levels should provide us with a saturated amount of insight into the effects of the confidence level on the performance of the algorithms.

Time point

The fixed time point determines the prediction job of the algorithms. When the fixed time point is low, the predictions will be short-term, and vice versa. To test the stability of the algorithms over different fixed time points, we test with three different values. The data sets do not share the same time point distribution; to compare the results, we express the fixed time point as a percentile. The three different time points refer to the 25th, 50th and 75th percentiles of the time points in each data set. These three fixed time points should give saturated insights into the effect of the time point on the performance of the algorithms.

The parameters are summarized in Table 3.3. By these means, we are able to analytically assess the performance of each algorithm.

Table 3.3: A summary of parameters that are used in the experiments.

  Parameter         Experimental values
  Dataset           [actg, gbcs, grace, pharynx, whas]
  Algorithm         [standard, fixed time, hybrid]
  Evaluation data   [standard, fixed time, hybrid]
  Confidence        [0.75, 0.80, 0.90, 0.95]
  Fixed time point  [p25, p50, p75]
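The experiment grid can be enumerated with a few lines of Python (an illustrative sketch; with the six data sets of Table 3.2 the grid contains the 648 experiments reported in Chapter 5).

```python
from itertools import product

datasets = ["actg", "gbcs", "grace", "pharynx", "steer bearing", "whas"]
algorithms = ["standard", "fixed time", "hybrid"]
evaluation_data = ["standard", "fixed time", "hybrid"]
confidences = [0.75, 0.80, 0.90, 0.95]
fixed_time_points = ["p25", "p50", "p75"]

grid = list(product(datasets, algorithms, evaluation_data,
                    confidences, fixed_time_points))
print(len(grid))  # 6 * 3 * 3 * 4 * 3 = 648 experiments
```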

Chapter 4

Algorithm proposal

This chapter goes through the different steps that are taken to design the two algorithms that this thesis proposes. The algorithm proposal is split into the intuition behind the algorithms and the specifications of the data treatment, and thus of the changes in the CRSF algorithm itself.

4.1 Intuition

The performance of the CRSF algorithm relies heavily on the fact that the split metric penalizes the model's mistakes for predictions of all training instances. The splits in every tree in the survival forest are therefore picked as the splits that perform best on the training data, over all time points. The CRSF will thus be optimized to make predictions for instances at any time t, since all instances are considered in the splitting metric.

When one knows that future predictions will only be made for one specific time point $t_f$, we can incorporate this in the splitting metric. We can penalize the model for mistakes in predictions of instances at time point $t_f$. This leads to a tree in which all splits perform best for the specific time point $t_f$. Such a time-fixed model could outperform the general model when it comes to predicting the event probability of an instance at time $t_f$.

4.2 Algorithm design

The original splitting metric was given as Equation 2.6. The squared error is calculated from the difference between the predicted label $\hat{y}_i$ and the real class label $y_i$ at variable time $t_i$. The new algorithm does not deal with variable time points, and considers only the difference between $\hat{y}_i$ and $y_i$ at the fixed time point $t_f$. The SE-metric can therefore be reformulated as:

$$SE(L, R) = \sum_{i=1}^{n_l} \left(\hat{y}_i^l(t_f) - y_i^l\right)^2 + \sum_{i=1}^{n_r} \left(\hat{y}_i^r(t_f) - y_i^r\right)^2 \quad (4.1)$$

where the variables are adopted from Equation 2.6 and $t_f$ is the fixed time point for which the model is specialized. This new split evaluation metric should specialize every survival tree for predictions at time point $t_f$, but it has a problem: the training data generally does not contain instances with exactly the time point $t_i = t_f$, so the time-specialized split metric would have no instances left to evaluate, since none of the instances would have the matching time point. Two different data treatments can take care of this problem: a Fixed Time algorithm and a Hybrid algorithm.

4.3 Data treatment for Fixed Time algorithm

The data treatment starts with the input data table, as described in Table 2.1. Section 4.2 describes that the split evaluation metric should only take into account the instances where $t_i = t_f$. To retain a large training data set, we compose a new label $y_i^f(t_f)$ for every instance, with respect to $t_f$. The state $y_i^f$ at time $t_f$ can be deduced with knowledge of $t_i$ and $y_i(t_i)$. The rule for generating the new label is:

$$y_i^f(t_f) = \begin{cases} 1 & \text{(category 1) if } y_i(t_i) = 1 \wedge t_f \ge t_i \\ 0 & \text{(category 2) if } t_f \le t_i \\ \text{discard} & \text{(category 3) otherwise} \end{cases} \quad (4.2)$$

where $y_i(t_i)$ is the class label of instance $i$ at the variable time point $t_i$, and $y_i^f(t_f)$ is the class label of instance $i$ at the fixed time point $t_f$. This can be justified by taking a closer look at the three scenarios. First scenario: a reported event ($y_i(t_i) = 1$) at a time before the fixed time point ($t_f \ge t_i$) means that the event would also have been seen at any later point in time, and thus $y_i^f(t_f) = 1$. Second scenario: an instance that is reported after the fixed time point ($t_f \le t_i$) would not have seen the event at that fixed time point, because that would contradict assumption A-1; thus in that case $y_i^f = 0$. Third scenario: a censored instance ($y_i(t_i) = 0$) reported before the fixed time point ($t_i < t_f$) provides no information on whether the event happened in the time range $(t_i : t_f]$. This information does not guarantee a state of $y_i$ at time point $t_f$, and we can therefore not give a class label $y_i^f(t_f)$. The easiest way to deal with these instances is to discard them in the split evaluation metric: they are not taken into account when the splits are evaluated. It is important to note that the instances are not discarded for the construction of the survival functions in the leaf nodes of the survival forest. Figure 4.1 displays the data treatment of the Fixed Time algorithm.

Figure 4.1: Data treatment of the Fixed Time algorithm to align the data instances to a fixed time point.

The data set is now processed such that the split evaluation metric is able to use the data of different points in time to penalize a model on its performance for predictions at the specific time point $t_f$. This algorithm will be referred to as the Fixed Time algorithm. The new data, extending Table 2.1, can be found in Table 4.1.

Table 4.1: Example input data of a survival analysis algorithm after data treatment of the Fixed Time algorithm, where tf denotes the fixed time point, yf denotes the class label that corresponds to the fixed time point, t denotes the original time point, y denotes the original class label and x denotes a feature.

  tf   yf   t    y   x1  ...  xn
  90   1    65   1   B   ...  34
  90   -    41   0   C   ...  26
  90   0    102  0   B   ...  43
  90   0    97   1   A   ...  34
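The labeling rule of Equation 4.2 can be sketched as follows (illustrative, not the thesis implementation; None marks the discarded third category). The example reproduces the yf column of Table 4.1.

```python
def fixed_time_label(t_i, y_i, t_f):
    """Label rule of Equation 4.2 for the Fixed Time algorithm.

    Returns 1 if the event was observed at or before t_f, 0 if the
    instance is known to be event-free at t_f, and None when a censored
    instance gives no information about t_f (discarded in the split
    evaluation, but kept for the leaf survival functions).
    """
    if y_i == 1 and t_f >= t_i:
        return 1      # category 1: event seen before the fixed time point
    if t_f <= t_i:
        return 0      # category 2: still event-free at the fixed time point
    return None       # category 3: censored before t_f, no information

# Reproduces the yf column of Table 4.1 (t_f = 90):
rows = [(65, 1), (41, 0), (102, 0), (97, 1)]
print([fixed_time_label(t, y, 90) for t, y in rows])  # [1, None, 0, 0]
```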

4.4 Data treatment for the Hybrid algorithm

In Section 4.3 we argued that an instance should be discarded in the split evaluation if it is censored but reported before the fixed time point. Depending on the data and on $t_f$, we thereby lose training data that is used in making the splits of each survival tree. A second way of dealing with the data points in the third category is to leave them as they are. We construct a hybrid time point $t_h$ and the corresponding hybrid class label with the following rule:

$$(t_i^h, y_i(t_h)) = \begin{cases} (t_f, 1) & \text{(category 1) if } y_i(t_i) = 1 \wedge t_f \ge t_i \\ (t_f, 0) & \text{(category 2) if } t_f \le t_i \\ (t_i, y_i(t_i)) & \text{(category 3) otherwise} \end{cases} \quad (4.3)$$

For instances in the third category, $t_h = t_i$ and thus $y_i(t_h) = y_i(t_i)$. This way, the model can also use these instances in the split evaluation.

Figure 4.2: Data treatment of the Hybrid algorithm to align the data instances to a fixed time point.

Even though this penalizes errors in predictions of instances at variable time points, this alternative data treatment could lead to better splits, since no training data is lost. Furthermore, this alternative way of dealing with these instances could lead to a CRSF that is more robust in settings where most instances fall into the third scenario. An example of the data after the alternative data treatment can be found in Table 4.2.

Table 4.2: Example input data after alternative data treatment.

  th   yh   tf   yf   t    y   x1  ...  xn
  90   1    90   1    65   1   B   ...  34
  41   0    90   -    41   0   C   ...  26
  90   0    90   0    102  0   B   ...  43
  90   0    90   0    97   1   A   ...  34
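The hybrid rule of Equation 4.3 differs from Equation 4.2 only in the third category, which keeps the original data point. A sketch (illustrative, not the thesis implementation):

```python
def hybrid_label(t_i, y_i, t_f):
    """Hybrid rule of Equation 4.3: align instances to t_f where
    possible, otherwise keep the original (time, label) pair."""
    if y_i == 1 and t_f >= t_i:
        return (t_f, 1)       # category 1
    if t_f <= t_i:
        return (t_f, 0)       # category 2
    return (t_i, y_i)         # category 3: keep the original data point

# Reproduces the (th, yh) columns of Table 4.2 (t_f = 90):
rows = [(65, 1), (41, 0), (102, 0), (97, 1)]
print([hybrid_label(t, y, 90) for t, y in rows])
# [(90, 1), (41, 0), (90, 0), (90, 0)]
```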

The Fixed Time algorithm aligns all data instances to the same fixed time point, at the cost of discarding some instances. The Hybrid algorithm does not align all instances to the same time point, but it is able to preserve all instances. Figure 4.3 compares the different data treatments.

Figure 4.3: Data treatment of different algorithms (original split evaluation data, Fixed Time algorithm data treatment, Hybrid algorithm data treatment) to align the data instances to a fixed time point.

Chapter 5

Results

The results can be divided into different sections. First of all, we present the results of the experiment in a base case. This gives a good overview of what the different metrics mean with respect to each other. Second, the results of the influence of the different parameters on the performance are given. This provides more insight into the general effect of the new algorithms.

5.0.1 Parameter values

The 648 different experiments lead to a lot of data points. These data points can be displayed at a granular or a more aggregated level, depending on the insights they should provide. A more granular display of results could lead to more in-depth findings, but would also lead to less reliable results, since every individual data point has more influence on the result. During the analysis of the results, it became clear that there was no obvious reason to split the data up by confidence level or fixed time point. Therefore, the graphs and tables shown in this section consist of the average results of the experiments over the different confidence levels and fixed time points, unless otherwise reported. This means that a data point in a graph represents the average value of (4 confidence thresholds × 3 fixed time points =) 12 experiments.

5.1 Accuracy Error

Figure 5.1: The average Accuracy Error (AE), the average deviation of the accuracy from the given confidence threshold, of different algorithms on data sets with a different censoring ratio, when evaluated by fixed time point predictions.


Figure 5.2: The average Accuracy Error (AE), the average deviation of the accuracy from the given confidence threshold, of the different algorithms on data sets with different censoring ratios, when evaluated by various time point predictions. These various time points come from the OOB-data and are randomly drawn from the same probabilistic data distribution as the data sets.


Figure 5.3: The average Accuracy Error (AE), the average deviation of the accuracy from the given confidence threshold, of the different algorithms with different given confidence thresholds, when evaluated by fixed time point predictions.

5.2 ε-independent unobserved criteria


Figure 5.4: The average Unconfidence (U), the average second highest p-value, of the different algorithms on data sets with different censoring ratios, when evaluated by fixed time point predictions.

This means that the Hybrid algorithm has higher second highest p-values. Lower values are preferred for this metric. The Standard algorithm thus seems to perform better when it comes to unconfidence, and hence on the ε-independent unobserved criteria. A side note is that the Standard algorithm violates the AE guarantees, and is therefore harder to compare with the other algorithms.
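For reference, the Unconfidence criterion can be computed directly from the matrix of test set p-values. The sketch below is a minimal NumPy illustration with made-up numbers; the variable names are our own.

```python
import numpy as np

# p_values: one row per test instance, holding the conformal p-value
# of class label 0 and of class label 1 (made-up example values).
p_values = np.array([[0.62, 0.08],
                     [0.11, 0.47],
                     [0.33, 0.29]])

# Unconfidence (U): the average second-largest p-value per instance.
U = np.sort(p_values, axis=1)[:, -2].mean()  # (0.08 + 0.11 + 0.29) / 3
```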

5.3 ε-dependent unobserved criteria


This criterion reflects the ability to predict survival for fixed time points. The Fixed Time algorithm performs best on this criterion for most of the data sets. The Standard algorithm performs worse than the Fixed Time algorithm, but usually better than the Hybrid algorithm.

Figure 5.5: The average Number (N), the average number of class labels in the set prediction, of the different algorithms on data sets with different censoring ratios, when evaluated by fixed time point predictions.

The set predictions are better if they have a low number of Multiples (M). Figure 5.6 shows the ratio of set predictions that include more than one value. A prediction with more than one class label is not easy to interpret, so a lower M is preferable. The Standard algorithm has on average a lower percentage of multiples. This could partially explain the higher Accuracy Error that the Standard algorithm makes: it does not include enough labels in the prediction set. The Hybrid algorithm performs better than the Fixed Time algorithm.
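Both ε-dependent criteria in this section can be derived from the prediction sets in a few lines. The sketch below assumes the same p-value matrix layout as before, and uses the standard conformal inclusion rule: a class label enters the prediction set when its p-value exceeds the significance level ε = 1 − confidence.

```python
import numpy as np

def n_and_m(p_values, confidence):
    """Average set size (N) and fraction of multiple-label sets (M)."""
    epsilon = 1.0 - confidence
    # Boolean matrix: True where a class label is in the prediction set.
    in_set = p_values > epsilon
    set_sizes = in_set.sum(axis=1)
    N = set_sizes.mean()        # average Number of labels per set
    M = (set_sizes > 1).mean()  # fraction of sets with Multiple labels
    return N, M
```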


Figure 5.6: The average Multiples (M), the percentage of prediction sets with more than one class label, of the different algorithms on data sets with different censoring ratios, when evaluated by fixed time point predictions.

5.4 ε-independent observed criterion


Figure 5.7: The Observed Unconfidence criterion (OU), the average p-value of the false class label, of the different algorithms on data sets with different censoring ratios, when evaluated by fixed time point predictions.

5.5 ε-dependent observed criterion


Figure 5.8: The Observed Multiple criterion (OM), the percentage of predictions where the prediction set contains false class labels, of the different algorithms on data sets with different censoring ratios, when evaluated by fixed time point predictions.

5.6 Base case

We constructed a base case by taking the results of all data sets, all confidence levels and all time points. We used the 'fixed time' data for evaluation, because this simulates the task of predicting the survival at one specific time point. The results are displayed in Table 5.1, which shows how each algorithm scores on the different criteria in the described testing environment.



Table 5.1: The average criteria score of every criterion by the different algorithms when evaluated by fixed time point predictions. The result is the average of experiments with the different data sets, the different confidence thresholds and the different fixed time points.

    Criteria   Fixed Time   Hybrid   Standard
    AE         0.021        0.017    0.273
    S          1.000        1.000    1.000
    N          1.622        1.413    1.260
    U          0.173        0.236    0.144
    M          0.623        0.414    0.320
    SE         0.624        0.414    0.381
    OM         0.313        0.187    0.221
    OU         0.355        0.319    0.423

Figure 5.9: The error score of the different criteria of the different algorithms, when evaluated by fixed time point predictions.

                  M      U      N      S      OM     OU     AE
    Standard      0.36   0.10   1.32   1.00   0.27   0.31   0.22
    Fixed Time    0.64   0.15   1.63   1.00   0.35   0.30   0.02
    Hybrid        0.45   0.17   1.45   1.00   0.23   0.24   0.02


Chapter 6

Conclusion

The goal of this research was to contribute to the field of conformal predictions using Random Survival Forests. The original Conformal Random Survival Forest was studied to predict whether something will survive up until a certain time point. More specifically, the CRSF can include two class labels in its set prediction: '1' to denote that the instance will break down, and '0' to denote that the instance will not break down. A CRSF outputs set predictions that include the correct class label with a confidence that matches the user-given confidence threshold on average.
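As a sketch of how such a set prediction is formed from the two p-values, using the standard conformal inclusion rule p > ε with ε = 1 − confidence (the function name is our own):

```python
def prediction_set(p_label_0, p_label_1, confidence):
    """Include a class label when its p-value exceeds the significance
    level epsilon = 1 - confidence."""
    epsilon = 1.0 - confidence
    return {label for label, p in ((0, p_label_0), (1, p_label_1))
            if p > epsilon}

# At 95% confidence only label 0 is included; at 99% both labels are.
assert prediction_set(0.40, 0.03, confidence=0.95) == {0}
assert prediction_set(0.40, 0.03, confidence=0.99) == {0, 1}
```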

The standard CRSF is trained to make such predictions for various time points. This algorithm is referred to as the Standard algorithm. In practice, however, the algorithm could be used to make predictions for one fixed time point. It is intuitive to imagine that a model can improve its performance on specific time point predictions if it is trained to do so. The performance effects of a time point specialized CRSF had not yet been researched.

This thesis proposes two algorithms that specialize in making conformal predictions for a fixed point in time. Both algorithms transform the original survival data that is used by the split evaluation metric in the Random Survival Forest algorithm. The data points are transformed such that they share the same fixed time point. The event label, which indicates whether an instance has experienced the event, is then adjusted accordingly based on the information of the original data point.
