Aggregating predictions using
Non-Disclosed Conformal Prediction
Robin Carrión Brännström
Department of Statistics Uppsala University
Supervisors:
Johan Bring Fan Yang-Wallentin
2019
Abstract
When data are stored in different locations and pooling of such data is not allowed, predictive modeling suffers an informational loss. In this thesis, a new method called Non-Disclosed Conformal Prediction (NDCP) is adapted to a regression setting, such that predictions and prediction intervals can be aggregated from different data sources without interchanging any data. The method builds upon the Conformal Prediction framework, which produces predictions with confidence measures on top of any machine learning method. The method is evaluated on regression benchmark data sets using Support Vector Regression, with different sizes and settings for the data sources, to simulate real-life scenarios. The results show that the method produces conservatively valid prediction intervals, even though in some settings the individual data sources do not manage to create valid intervals. NDCP also creates more stable intervals than the individual data sources. Thanks to its straightforward implementation, data owners who cannot share data but would like to contribute to predictive modeling would benefit from using this method.
Keywords: Conformal Prediction, Non-Disclosed Conformal Prediction, Support Vector Re-
gression, Reliable Machine Learning
Contents
1 Introduction
2 Background
2.1 Different confidence predictors
2.2 Research question
3 Method
3.1 Conformal Prediction
3.2 Inductive Conformal Prediction
3.3 Cross-Conformal Prediction
3.4 Non-Disclosed Conformal Prediction
3.5 Support Vector Regression
3.6 Combination of prediction intervals
3.7 Evaluation measures
3.8 Experimental setup
4 Results
4.1 Experiment 1: Random sources of equal sizes
4.2 Experiment 2: Random sources of unequal sizes
4.3 Experiment 3: Non-random sources of equal sizes
5 Discussion
5.1 Experiment 1: Random sources of equal sizes
5.2 Experiment 2: Random sources of unequal sizes
5.3 Experiment 3: Non-random sources of equal sizes
5.4 Dispersion of prediction interval width
5.5 Error sources
5.6 Further research
6 Conclusion
Acknowledgments
References
Appendix
1 Introduction
When using statistical and machine learning methods for predictive modeling, it is often beneficial to use as much data as possible. If data are stored in different locations, the common approach is to pool the data before proceeding with any analyses. However, pooling data is not always possible, for example due to data secrecy. Yet even when sharing data between different sources is not possible, there are occasions when data owners may want to contribute to predictive models without disclosing any data (Spjuth, Carlsson and Gauraha 2018). Federated learning models (e.g. Shokri and Shmatikov (2015) for deep learning) have been developed for this purpose but are usually complex to implement in practice.
When making predictions using machine learning methods, a typical drawback is that the user is not provided with any – or any useful – measures of confidence for a given prediction (Shafer and Vovk 2007). The common approach to evaluating predictions is to split the data into training and test sets, train the algorithm on the training set and then evaluate the model. The evaluation is performed by predicting observations in the test set and then using some fit measure, for example root mean squared error (RMSE) or error rate, to determine how well the model performs on average (James et al. 2013). In other words, new predictions are expected to behave similarly to past predictions. But what if a new observation which is to be predicted is unlike the observations in the training and test sets used to evaluate the model? Can model fit measures then be generalized to this new observation? For this, predictions need to be accompanied by a confidence measure. Conformal Prediction measures how unusual a new observation is compared to previous observations; an unusual observation will be given a larger prediction region, while a new observation similar to previous ones will obtain a smaller prediction region. Conformal Prediction has the advantage that – under mild assumptions – it can be used with any machine learning method, such as Decision Trees, Boosting and Neural Networks, in both classification and regression problems (Vovk, Gammerman and Shafer 2005).
Spjuth, Carlsson and Gauraha (2018) propose a new approach to the problem of not being able to pool data which are located in different places, using the Conformal Prediction framework, in a new method called Non-Disclosed Conformal Prediction (NDCP). This method has been shown to yield acceptable results in binary classification but has not yet been formulated in a regression setting.
The outline of this thesis is as follows. In Section 2 the background concepts are introduced, as well as the research question. In Section 3 the framework of Conformal Prediction and all relevant variants are introduced, and the regression method as well as the experimental setup are presented. The results are presented in Section 4, followed by a discussion in Section 5 and a conclusion in Section 6.
2 Background
In situations where data are located in different places and pooling of such data is not possible, the available information is often not used to its full potential. There are, however, situations when different data owners would like to share information without sharing their data. One example can be found in the pharmaceutical industry, where different companies hold their own data on, for example, chemical compounds from different drug discovery projects. These data are very valuable for the companies, which makes them unwilling to share them with others. Sometimes these companies want to cooperate without disclosing data to others, for example in collaborative efforts or precompetitive alliances (Spjuth, Carlsson and Gauraha 2018). A method which can aggregate predictions from different sources would therefore be beneficial for actors in such situations.
Additionally, the ability to predict with confidence is crucial in several areas, and it is therefore important that this method can provide predictions with confidence measures.
Predicting with confidence is a well-studied area within traditional statistical methods, where it has always been important to be able to construct, for example, prediction intervals. The most common methods sometimes require strong parametric assumptions about the data, and/or are more suitable in low dimensions (Balasubramanian, Ho and Vovk 2014). When predictive power outweighs interpretability it is common to consider more complex optimization methods, such as Support Vector Machines and Neural Networks. Many of these methods do not require any parametric assumptions about the data and also work well in higher dimensions; hence, new frameworks have to be considered when predicting with confidence using these more complex methods.
Machine learning methods have traditionally placed little focus on confidence measures for single observations. Instead, the literature has focused on performance through measures such as RMSE in regression settings, and accuracy, misclassification rate etc. when classification techniques are used. However, it is often important to know the uncertainty of a single prediction. An obvious example can be found in cancer diagnostics. If some model predicts that a patient does not have cancer, how certain can we be about this prediction? If the prediction is very uncertain, then perhaps the patient should undergo further examinations (ibid.).
2.1 Different confidence predictors
Many algorithms have been proposed for the purpose of producing confidence measures for predictions in the machine learning context. Popular approaches are Bayesian methods, Probably Approximately Correct (PAC) approaches, as well as ad hoc methods such as quantile regression.
Confidence predictions using Bayesian methods require correctly specified priors to be able to produce prediction intervals with correct coverage probability. When the prior is specified correctly, the method performs well, sometimes slightly better than the Conformal Prediction framework. However, this difference is often negligible (Melluish et al. 2001).
In the case when the prior is misspecified, which might not be known in advance, the results from this method can be misleading. Experiments in Melluish et al. (ibid.) show that Bayesian intervals with a confidence level of 90% can in reality have an error rate of between 20% and 40% when the prior is misspecified. In real-world applications a prior is often chosen arbitrarily, which can become a serious issue when drawing conclusions from Bayesian prediction intervals.
PAC theory approaches, on the other hand, do not make any assumptions about the underlying distribution. These intervals are, however, often not useful in practice due to probability estimates exceeding 1 (Proedrou et al. 2002). Another drawback is that the bounds obtained represent the overall error rather than properties of individual predictions (Papadopoulos, Gammerman and Vovk 2010).
There are also algorithms developed specifically for individual methods, such as quantile regression. This specific method is justified by asymptotics, and for this reason does not always perform well in practice (Rosenfeld, Mansour and Yom-Tov 2018).
Conformal Prediction will be introduced in Section 3 and has several advantages compared to other methods developed for the purpose of predicting with confidence.
Important advantages are (Lei et al. 2018):
• No parametric assumptions.
• Intervals are always valid.
• Performs very well in high dimensions compared to conventional methods.
2.2 Research question
NDCP was proposed by Spjuth, Carlsson and Gauraha (2018) in a classification setting.
In this thesis, NDCP will be modified to work in a regression setting. The method will then be tested and evaluated: if there exist data at multiple locations which cannot be pooled, is it possible – using the Conformal Prediction framework – to aggregate predictions from all individual data sources for more precise results? To examine this, three different settings, which are supposed to represent different real-life scenarios, will be investigated. The first scenario is when all data sources are of the same size. The second is a scenario where the sizes of the data sources differ. The third is a scenario where different data sources have different distributions of the data. In summary, one main question and three sub-questions will be answered:
• Can predictions be aggregated using NDCP for more reliable results?
– How do different numbers of data sources affect the results?
– How do different sizes of the data sources affect the results?
– How do different distributions of data in the data sources affect the results?
3 Method
3.1 Conformal Prediction
A Conformal Predictor is a confidence predictor; instead of only outputting bare point predictions, it also outputs some confidence measure related to the prediction. Conformal Prediction will only be discussed in a regression setting in this thesis, however it can also be applied to classification problems (Vovk, Gammerman and Shafer 2005).
The concept of Conformal Prediction is to measure how ”strange” a new observation which is to be predicted is, compared to previous observations. This strangeness measure, the nonconformity measure, measures the difference between a new observation and previous data. If a new observation is very nonconforming, i.e. very dissimilar compared to previous data, its prediction should be given high uncertainty. If a new observation instead is conforming, then it can be predicted with high accuracy. These nonconformity scores are converted into p-values¹, which makes it possible to estimate the confidence of predictions (Balasubramanian, Ho and Vovk 2014).
A major advantage of this method is the lack of parametric or any other strict assumptions; the only imposed assumption is that the observations are i.i.d. This assumption can also be relaxed into the weaker assumption of exchangeability, implying that observations can in fact be dependent but should not follow any particular order (Vovk, Gammerman and Shafer 2005). Since most standard prediction algorithms already assume that observations are i.i.d., no additional assumptions need to be imposed (Papadopoulos and Haralambous 2011). The Conformal Prediction framework can be applied to any machine learning method, that is, any underlying algorithm.
A confidence predictor has two important performance measures: validity and efficiency. It is valid if, in the long run, its errors do not exceed the significance level δ at the chosen confidence level 1 − δ. This means that a 95% prediction interval must not fail to include the true label in more than 5% of the cases. Furthermore, it is efficient if the prediction interval it produces is as narrow as possible. Conformal Predictors are always valid (Vovk, Gammerman and Shafer 2005), hence the focus lies on the efficiency of the predictor.
¹ Not to be confused with ordinary p-values from hypothesis testing. Defined in Equation 3.
Conformal Prediction was originally proposed as Transductive Conformal Prediction (TCP), constructed to work in an online setting; the model needs to be re-trained with every new observation. This leads to very inefficient computations, especially when dealing with large sets of data or very complex models. For this reason a modification named Inductive Conformal Prediction (ICP) was proposed by Papadopoulos et al. (2002).
In the following subsection, ICP will be introduced thoroughly.
3.2 Inductive Conformal Prediction
Inductive Conformal Prediction (ICP) is a method introduced by Papadopoulos et al. (ibid.) to handle the computational inefficiency of TCP – the computational efficiency of ICP is almost as good as that of the underlying algorithm. ICP does come with some loss of efficiency compared to TCP, since not all available data can be used. However, this loss of efficiency is negligible (Papadopoulos 2008). When dealing with large data sets or very complex models, ICP is the only feasible option. A formal introduction of ICP follows.
Consider a training set {(x_1, y_1), . . . , (x_l, y_l)} of l observations, where x_i ∈ R^p are the attributes (independent variables) and y_i ∈ R are the labels (dependent variables), i = 1, . . . , l. The attributes of a new observation are x_{l+g} ∈ R^p. The only assumption made is that all observations are i.i.d. The data is then divided into two subsets:
• A proper training set {(x_1, y_1), . . . , (x_m, y_m)} with m < l elements
• A calibration set {(x_{m+1}, y_{m+1}), . . . , (x_l, y_l)} with k = l − m elements
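As a minimal sketch of this split (assuming NumPy; the sizes l = 100 and m = 75 are arbitrary illustrative values, and all variable names are hypothetical), the two index sets can be drawn as follows:

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical sizes: l = 100 training observations, of which m = 75
# form the proper training set and the remaining k = l - m = 25 the
# calibration set.
l, m = 100, 75
idx = rng.permutation(l)
proper_train_idx = idx[:m]   # used to fit the underlying algorithm
calibration_idx = idx[m:]    # used only to compute nonconformity scores
print(len(proper_train_idx), len(calibration_idx))  # 75 25
```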
The first step is to train the underlying algorithm on the proper training set. When the model has been trained, it is applied to the calibration set, where the nonconformity measure for each pair (x_{m+i}, y_{m+i}), i = 1, . . . , k, is calculated. A common nonconformity measure for regression problems is the absolute value of the residuals,

α_{m+i} = |y_{m+i} − ŷ_{m+i}|, i = 1, . . . , k, (1)

where ŷ_{m+i} are the predictions on the calibration set, using the underlying algorithm.
The nonconformity score for a new observation x_{l+g} is defined as

α_{l+g} = |y − ŷ_{l+g}|, (2)

where ŷ_{l+g} is the prediction of the new observation using the underlying algorithm. A single nonconformity score does not provide any useful information on its own; instead it must be compared to the nonconformity scores of other observations to determine how nonconforming the new observation is. This is done by calculating the p-value of the unobserved label y, defined as

p(y) = #{i = m + 1, . . . , m + k, l + g : α_i ≥ α_{l+g}} / (k + 1). (3)
Equation 3 is the fraction of nonconformity scores α_i, in the calibration set, that are greater than or equal to the new observation's nonconformity score α_{l+g}. The value is bounded between 1/(k + 1) and 1. If it takes a small value close to its lower bound 1/(k + 1), the observation is very nonconforming. A large value, close to 1, means that it is very conforming.
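The p-value in Equation 3 can be computed directly from the calibration scores. The following sketch assumes NumPy; the scores and the function name are illustrative, not part of the thesis:

```python
import numpy as np

def icp_p_value(cal_scores, new_score):
    """ICP p-value (Equation 3): the fraction of scores, including the
    new observation's own, at least as large as the new score."""
    k = len(cal_scores)
    count = np.sum(cal_scores >= new_score) + 1  # +1 counts i = l + g itself
    return count / (k + 1)

# Hypothetical calibration residuals |y - y_hat| (Equation 1)
cal_scores = np.array([0.2, 0.5, 0.1, 0.9, 0.4])

# A conforming candidate label gives a small score and a large p-value
print(icp_p_value(cal_scores, 0.15))  # 5/6 ≈ 0.83
# A nonconforming one gives a large score and a small p-value
print(icp_p_value(cal_scores, 1.2))   # 1/6 ≈ 0.17
```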
Consider also some confidence level 1 − δ. Given the significance level δ, the prediction region

{y : p(y) > δ}, (4)

is the interval containing all labels with a p-value greater than δ. This interval captures the true label of a new observation with confidence 1 − δ (Vovk, Gammerman and Shafer 2005).
The nonconformity score defined in Equation 1, as the absolute value of the residuals, will give equal-sized prediction intervals for all predicted observations. This measure can be extended to take into account the predictive accuracy of the underlying algorithm, hence giving the intervals different widths (Papadopoulos et al. 2002). This modification also leads to tighter prediction intervals on average (Papadopoulos and Haralambous 2011). A normalized nonconformity measure is defined as

α_i = |y_i − ŷ_i| / exp(µ_i). (5)
Here, µ_i is the prediction of the logarithm of the absolute residuals in the calibration set, produced by the underlying algorithm. The denominator in Equation 5, exp(µ_i), is considered an estimate of the accuracy of the underlying algorithm. Also, the use of the logarithm ensures that the estimates are always positive (Papadopoulos et al. 2002). For clarification, an error model is first trained on the logarithm of the absolute residuals in the training set. Next, the logarithms of the absolute residuals in the calibration set are predicted. This ensures that the nonconformity scores are normalized using the accuracy of the underlying algorithm. The result is larger prediction intervals for ”difficult” observations and smaller ones for ”easy” observations. The error model is often constructed to be less complex than the main model, such that it only captures important variations and is not affected by noise (Papadopoulos and Haralambous 2011).
When constructing prediction intervals, the vector α of nonconformity scores is sorted in ascending order. Considering the 95th percentile of this vector means that 95% of the observations have nonconformity scores smaller than or equal to this particular score. In other words, to obtain a confidence level 1 − δ, the largest nonconformity score α_{(m+s)} is chosen such that setting α_{l+g} = α_{(m+s)} still gives p(y) > δ. The prediction interval for the g-th new observation is constructed as

(ŷ_{l+g} − α_{(m+s)} exp(µ_{l+g}), ŷ_{l+g} + α_{(m+s)} exp(µ_{l+g})), (6)

where s is calculated such that the nonconformity score corresponding to the desired confidence level is selected, as

s = ⌊δ(k + 1)⌋. (7)
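Equations 5–7 can be put together in a short sketch (assuming NumPy). The constant accuracy estimate `sigma_new` and the calibration scores are hypothetical stand-ins, and δ is assumed large enough that s ≥ 1:

```python
import numpy as np

def icp_interval(y_hat, cal_scores, delta, sigma_new=1.0):
    """ICP prediction interval for one new observation (Equations 6-7).

    cal_scores : (normalized) nonconformity scores from the calibration set
    sigma_new  : accuracy estimate exp(mu_{l+g}) for the new observation;
                 1.0 recovers the unnormalized measure of Equation 1
    """
    k = len(cal_scores)
    s = int(np.floor(delta * (k + 1)))   # Equation 7
    alpha = np.sort(cal_scores)[k - s]   # largest score still giving p(y) > delta
    return y_hat - alpha * sigma_new, y_hat + alpha * sigma_new

# Hypothetical calibration scores 0.01, 0.02, ..., 0.99 (k = 99)
cal_scores = np.arange(1, 100) / 100.0
lo, hi = icp_interval(y_hat=5.0, cal_scores=cal_scores, delta=0.05)
print(lo, hi)  # roughly (4.05, 5.95): the score 0.95 is selected
```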
In Figure 1 the scheme for evaluating the ICP is presented. Note that in real-life applications, the test set corresponds to the new observations which are to be predicted.
Figure 1 Overview of ICP. Using the proper training set, both an ordinary model and an error model are created. The point predictions are created using the ordinary model. Both models are then applied to the calibration set, where nonconformity scores are calculated. The interval for a new observation is created using the error model and the nonconformity scores. The solid lines represent data, the dashed lines represent model training and prediction.
3.2.1 Calibration set size
Sample size recommendations for the calibration set are still an open research area with no definitive answer. The ideal proportion when splitting the training set into the proper training set and calibration set depends on the learning curve of the given algorithm (Vovk, Gammerman and Shafer 2005). This splitting comes with a trade-off: a calibration set which is too small will give a high variance of confidence due to the lack of nonconformity scores, making the calibration unreliable. A proper training set which is too small may instead lead to a downward bias in confidence, since nonconformity scores based on a too small proper training set produce less confident predictions (ibid.). The consequences of using an improper proportion for the calibration set are pronounced when dealing with small sample sizes. Linusson et al. (2014) suggest using a small calibration set relative to the data available, roughly 15–30%. Vovk (2015) applies a 2:1 split, referring to the standard proportion for dividing a training set. Papadopoulos et al. (2002) use a calibration set of roughly 25% of the training data.
3.3 Cross-Conformal Prediction
The efficiency loss of ICP compared to TCP can be greater when the sample size is small. This is due to the fact that ICP only uses a subset of the available data to train the underlying algorithm. Cross-Conformal Prediction (CCP) was proposed by Vovk (2015) and introduced into the regression setting by Papadopoulos (2015) as a remedy for not being able to train models using all available data, by combining ICP with cross-validation.
In Section 3.2 we saw that the training set was divided into a calibration set and a proper training set. The underlying algorithm was then trained on the proper training set, whereas the nonconformity scores were calculated from the calibration set. CCP repeats this procedure K times: the training set is partitioned into K folds S_1, . . . , S_K. The nonconformity score of an observation (x_i, y_i) ∈ S_k is calculated by the underlying algorithm trained on the data ⋃_{m≠k} S_m, m = 1, . . . , K (ibid.).
Essentially, ICP is performed K times using cross-validation. This means that there will be K predictions accompanied by K prediction intervals. The final output of CCP consists of one point prediction, created by calculating the mean of the K point predictions, and the median of the lower and upper bounds respectively. Experiments by Papadopoulos (2015) show that an increasing number of folds yields tighter intervals. However, this decrease in interval width is not always significant. Sample size must also be considered when deciding how many folds should be used, since a large number of folds together with a small sample size implies a too small calibration set.
When using CCP the underlying algorithm must be trained K times instead of once. Computation time will grow, but CCP is still much more computationally efficient than TCP (ibid.).
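The CCP procedure can be sketched end-to-end. To keep the example self-contained, the underlying ”algorithm” below is deliberately trivial (it predicts the training-fold mean); in the thesis SVR plays this role, and any regression method could be substituted. The data and all names are hypothetical:

```python
import numpy as np

rng = np.random.default_rng(0)

def ccp_predict(y, delta=0.1, K=5):
    """CCP sketch: run ICP once per fold, then combine by taking the
    mean point prediction and the median lower/upper bounds."""
    folds = np.array_split(rng.permutation(len(y)), K)
    preds, lowers, uppers = [], [], []
    for k in range(K):
        cal_idx = folds[k]
        train_idx = np.concatenate([folds[m] for m in range(K) if m != k])
        y_hat_k = y[train_idx].mean()               # stand-in underlying model
        cal_scores = np.abs(y[cal_idx] - y_hat_k)   # Equation 1 on fold k
        s = int(np.floor(delta * (len(cal_scores) + 1)))
        alpha = np.sort(cal_scores)[len(cal_scores) - s]
        preds.append(y_hat_k)
        lowers.append(y_hat_k - alpha)
        uppers.append(y_hat_k + alpha)
    return np.mean(preds), np.median(lowers), np.median(uppers)

y = rng.normal(loc=10.0, scale=2.0, size=200)   # hypothetical labels
y_hat, lo, hi = ccp_predict(y, delta=0.1, K=5)
print(lo < y_hat < hi)  # True
```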
3.4 Non-Disclosed Conformal Prediction
Non-Disclosed Conformal Prediction was suggested in Spjuth, Carlsson and Gauraha (2018), allowing binary predictions to be aggregated from different data sources without interchanging any data between them. The method was proposed using TCP together with Random Forest. In this thesis, the method is extended to the case where the dependent variable is continuous, by first of all replacing TCP with ICP for two reasons: ICP is much more computationally efficient, and it makes it possible to use normalized nonconformity scores. This enables the prediction intervals for different observations to have different widths. The underlying algorithm used for regression NDCP is Support Vector Regression (presented in Section 3.5), but the implementation using any other regression method is analogous.
The design of NDCP is straightforward, which makes it easy to implement. Consider K sources, each holding a disjoint dataset D_k, k = 1, . . . , K, possibly of different sizes. The aim is to predict the label of a new observation, x_new. Individual ICPs or CCPs are applied in each of the K data sets, resulting in K different point predictions and prediction intervals. These are then transferred to a common location A, where they are combined. This results in aggregated predictions and prediction intervals, where no data have been disclosed between the different data sources. The details are presented in Algorithm 1.
Algorithm 1: Non-Disclosed Conformal Prediction (NDCP)
Input: K data sources D_1, . . . , D_K; test example x_new
Output: Point prediction and prediction interval for x_new
for each D_k, k ∈ {1, . . . , K} do
    With ICP/CCP, compute prediction interval I_k for test example x_new;
    Compute point prediction ŷ_k for test example x_new;
    Transfer I_k and ŷ_k to the common location A;
end
Combine all intervals and predictions into I and ŷ;
return I, ŷ
3.5 Support Vector Regression
One of the advantages of the Conformal Prediction framework is the lack of any strict assumptions. The underlying algorithm should therefore be similar in terms of assumptions. Support Vector Regression (SVR) is a commonly used method when dealing with regression problems and i.i.d. data.
Support Vector Machines (SVMs) were first introduced as a classification method (Cortes and Vapnik 1995). The decision boundary in SVMs, which separates different classes, is created using only a subset of all data. This subset consists of observations close to the decision boundary, the support vectors. Observations not included in this subset are not considered to carry any information regarding the position of the decision boundary. This property was later transferred into the regression setting using the ε-insensitive loss function (Vapnik 2000). In SVR the task is to find a function f(x) as flat as possible – to avoid overfitting – that deviates by at most ε from the true values y_i in the training data. This means that errors will not be taken into account if they are smaller than ε. Only observations with errors greater than ε, the support vectors, will contribute to the model in terms of minimizing errors.
Consider some training data {(x_1, y_1), . . . , (x_l, y_l)}, where x_i ∈ R^p are the attributes (independent variables) and y_i ∈ R are the labels (dependent variable). The aim is to find a function that approximates all y_i with at most ε deviation from the true values. Consider the linear SVR function

f(x) = ⟨w, x⟩ + b, (8)
where ⟨·, ·⟩ denotes the dot product and w the normal vector to the hyperplane. Dot products are used to be able to generalize to non-linear cases later. The ”as flat as possible” is expressed by minimizing the norm of w, ||w||. Minimizing this norm is equivalent to finding the largest margin in the SVM case, where the decision boundary is set to be as wide as possible. In regression, making the margin as large as possible, under the condition that the value y of all examples deviates less than the required accuracy ε, implies minimizing model complexity. This problem can be stated as the following convex optimization problem (Smola and Schölkopf 2004):
minimize (1/2) ||w||²
subject to y_i − (⟨w, x_i⟩ + b) ≤ ε
           y_i − (⟨w, x_i⟩ + b) ≥ −ε. (9)
This implies that the difference between y_i and the fitted function is to be smaller than ε and larger than −ε, i.e., that all y_i should lie inside the ”ε-tube”. This is however not always a good solution. If there exist points outside the ε-tube, one option would be to increase ε such that it includes all points. The other option is to allow for errors; some y_i can be permitted to lie outside the ε-tube. This can be done in the same way as the soft margin is used in SVMs: by assigning slack variables ξ_i, ξ_i* to observations outside of the tube. The so-called ε-insensitive loss function |ξ|_ε,

|ξ|_ε = { 0 if |ξ| ≤ ε; |ξ| − ε otherwise }, (10)
is the distance between an observation lying outside the tube and the border of the tube.
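Equation 10 is a one-liner in code; this sketch and its toy values are purely illustrative:

```python
def eps_insensitive_loss(residual, eps):
    """Equation 10: zero inside the eps-tube, linear beyond it."""
    return max(0.0, abs(residual) - eps)

print(eps_insensitive_loss(0.3, eps=0.5))   # 0.0 -- inside the tube, no cost
print(eps_insensitive_loss(-1.2, eps=0.5))  # 0.7 -- pays only the excess over eps
```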
Including slack variables, the formulation of the optimization problem is now

minimize (1/2) ||w||² + C Σ_{i=1}^{l} (ξ_i + ξ_i*)
subject to y_i − (⟨w, x_i⟩ + b) ≤ ε + ξ_i
           (⟨w, x_i⟩ + b) − y_i ≤ ε + ξ_i*
           ξ_i, ξ_i* ≥ 0, (11)
where the cost C > 0 is the trade-off between the flatness of the function and the amount of deviations larger than ε that are tolerated (ibid.). A low C-value will allow more observations to lie outside of the ε-tube, while a high value prioritizes including all observations within ε. In Figure 2 we can see how only observations outside the grey-shaded area – the ε-tube – contribute to the cost C, with the amount ξ.
Figure 2 ε-insensitive loss function for a linear SVR (Schölkopf and Smola 2002).
The optimization problem in Equation 11 is in its primal form. To solve this problem with more ease, as well as to be able to extend it to nonlinear problems, it is instead rewritten in its dual form, using Lagrange multipliers (Smola and Schölkopf 2004):
maximize W(α, α*) = −ε Σ_{i=1}^{l} (α_i + α_i*) + Σ_{i=1}^{l} (α_i − α_i*) y_i − (1/2) Σ_{i,j=1}^{l} (α_i − α_i*)(α_j − α_j*) ⟨x_i, x_j⟩
subject to Σ_{i=1}^{l} (α_i − α_i*) = 0 and α_i, α_i* ∈ [0, C], (12)
where α and α* are the Lagrange multipliers associated with the constraints of Equation 11. Using the solutions of this dual form optimization, the regression estimate takes the form (ibid.)

f(x) = Σ_{i=1}^{l} (α_i − α_i*) ⟨x_i, x⟩ + b. (13)

Note that to estimate the parameters α_i, α_i*, i = 1, . . . , l, and b, the dot products of all pairs of training observations are calculated. To evaluate the function in Equation 13, the dot products between a new observation and all training observations are considered (James et al. 2013).
Moving into the nonlinear setting, one option would be to preprocess the variables into some higher dimension, for example by raising them to some power. However, this would become computationally infeasible, which is why one of the main building blocks of SVM – the kernel trick – is used instead; the transformation is done implicitly by mapping via kernels. By replacing the inner product with a valid kernel function, one can implicitly perform a non-linear mapping to a high-dimensional feature space (Cristianini and Shawe-Taylor 2000). For a kernel to be valid, the function k(x, x′) should correspond to a dot product in some feature space F (Smola and Schölkopf 2004). In other words, a kernel returns the result of a dot product performed in some space.
Equations 12 and 13 involve x only through inner products. This means that it suffices to know a kernel function k(x, x′) = ⟨Φ(x), Φ(x′)⟩ instead of Φ explicitly, where Φ is a mapping from X to an inner product feature space F (ibid.). Replacing the dot products in Equations 12 and 13 with the kernel function gives the final optimization problem
maximize W(α, α*) = −ε Σ_{i=1}^{l} (α_i + α_i*) + Σ_{i=1}^{l} (α_i − α_i*) y_i − (1/2) Σ_{i,j=1}^{l} (α_i − α_i*)(α_j − α_j*) k(x_i, x_j)
subject to Σ_{i=1}^{l} (α_i − α_i*) = 0 and α_i, α_i* ∈ [0, C], (14)
and the final regression estimate

f(x) = Σ_{i=1}^{l} (α_i − α_i*) k(x_i, x) + b. (15)
This means that in the nonlinear setting the optimization is performed by finding the flattest function in the feature space, using kernels, instead of the flattest function in input space (ibid.). Some popular choices of kernels are (Hastie, Tibshirani and Friedman 2009)

Linear: K(x, x′) = ⟨x, x′⟩
Polynomial: K(x, x′) = ⟨x, x′⟩^d
Radial basis function (RBF): K(x, x′) = exp(−γ ||x − x′||²) (16)
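The three kernels in Equation 16 translate directly into code (a sketch assuming NumPy; the parameter values d and γ and all names are illustrative):

```python
import numpy as np

def linear_kernel(x, z):
    return x @ z                                  # <x, x'>

def polynomial_kernel(x, z, d=3):
    return (x @ z) ** d                           # <x, x'>^d

def rbf_kernel(x, z, gamma=0.5):
    return np.exp(-gamma * np.sum((x - z) ** 2))  # exp(-gamma ||x - x'||^2)

x = np.array([1.0, 2.0])
z = np.array([0.5, -1.0])
print(linear_kernel(x, z))      # -1.5
print(polynomial_kernel(x, z))  # -3.375
print(rbf_kernel(x, x))         # 1.0 -- identical points are maximally similar
```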
SVR performs well only if its hyperparameters are tuned well. Therefore, all parameters, such as C, ε and any kernel parameters, are in practice selected through grid search and cross-validation (Cristianini and Shawe-Taylor 2000).
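In practice this tuning takes a few lines with a library such as scikit-learn (assumed available here); the data and parameter grids below are illustrative, not those used in the thesis experiments:

```python
import numpy as np
from sklearn.model_selection import GridSearchCV
from sklearn.svm import SVR

rng = np.random.default_rng(1)
X = rng.normal(size=(120, 4))                      # hypothetical attributes
y = X @ np.array([1.0, -2.0, 0.5, 0.0]) + rng.normal(scale=0.1, size=120)

# Grid over C, epsilon and the RBF kernel width gamma (Equation 16)
param_grid = {"C": [0.1, 1, 10], "epsilon": [0.01, 0.1], "gamma": ["scale", 0.1]}
search = GridSearchCV(SVR(kernel="rbf"), param_grid, cv=5)
search.fit(X, y)
print(sorted(search.best_params_))  # ['C', 'epsilon', 'gamma']
```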
3.6 Combination of prediction intervals
Improving predictions by combining several models is a well-studied area, including for example model averaging and stacking (Hastie, Tibshirani and Friedman 2009). The principle is that there exists a set of candidate models for a specific training set. These models may be of the same type with different parameters, or different models performing the same task. After each model has been trained, the predictions can be weighted together in different ways. In committee methods, the simple average of predictions is used. Another weighting option is to weight the predictions using some fit measure, for example BIC (ibid.). The idea is that by combining several models, with slightly different information, a stronger model in terms of predictive accuracy is obtained. The crucial difference in the above-mentioned setting, compared to the one in this thesis, is that the models are trained on the same training set. Since the aim of this thesis is to combine predictions from models trained on different data sets, other approaches must be considered. However, the idea of combining several models into one that performs better in terms of predictive accuracy is retained.
The concept of combining models to obtain better predictive accuracy has been adapted in some settings in the Conformal Prediction framework (Carlsson, Eklund and Norinder 2014; Toccaceli and Gammerman 2019). These methods aim to address the problem of informational efficiency and are built upon the concept of combining p-values.
However, combining p-values leads to models which are not guaranteed to be valid (Linusson et al. 2017).
Assuming that the only information that can be exchanged between data sources is point predictions and prediction intervals makes it difficult to perform an efficient combination of intervals. A relatively naive approach is to simply take the average of all point predictions and intervals, which is the approach taken in the CCP methodology (Papadopoulos 2015). The idea is simple: averaging across valid intervals should give a valid averaged interval. Also, when evaluating results in the Conformal Prediction framework, it has been common to use the median instead of the mean, the reason being that if some prediction intervals are extremely wide or narrow, due to either noise or overfitting, the median will not be affected (Papadopoulos et al. 2002).
With this reasoning, aggregation of intervals will in this study be performed by calculating the median lower bound and the median upper bound, respectively. Point predictions will be aggregated using the mean, following the CCP methodology. A graphical illustration of how prediction intervals from three different sources are combined can be seen in Figure 3. Again, this is a relatively naive approach, since the only information that is allowed to be disclosed between data sources is point predictions and prediction intervals.
Figure 3 Illustration of how prediction intervals from three different sources are combined into an aggregated interval in NDCP. The NDCP interval is created using the median lower bound and the median upper bound.
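The combination rule described above can be sketched as follows. The function name ndcp_combine and the example numbers are illustrative, not taken from the thesis.

```python
import numpy as np

def ndcp_combine(point_preds, lower_bounds, upper_bounds):
    """Aggregate per-source predictions for one test object.

    Each argument is an array of length K, one entry per data source.
    """
    y_hat = np.mean(point_preds)    # point prediction: mean across sources
    lo = np.median(lower_bounds)    # interval: median lower bound
    hi = np.median(upper_bounds)    # and median upper bound
    return y_hat, (lo, hi)

# Three hypothetical sources predicting the same test object
y_hat, (lo, hi) = ndcp_combine([10.0, 11.0, 12.0],
                               [5.0, 6.0, 8.0],
                               [15.0, 16.0, 14.0])
```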
3.7 Evaluation measures
Different measures are used when evaluating predictions and prediction intervals. To examine point prediction performance, the root mean squared error (RMSE),

$$
\sqrt{\frac{1}{n} \sum_{i=1}^{n} (y_i - \hat{y}_i)^2},
$$

is used, where a lower value is preferred. To assess the empirical validity of the intervals, accuracy is measured as

$$
\frac{1}{n} \sum_{i=1}^{n} I_{y_i \in PI_i},
$$

where I is an indicator function taking the value 1 if the true value lies inside the prediction interval, and 0 otherwise. Finally, the informational efficiency will be measured through the width of the different intervals, considering both mean and median width.
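These three measures can be sketched in a few lines; the function names are illustrative.

```python
import numpy as np

def rmse(y_true, y_pred):
    # Root mean squared error of the point predictions
    return float(np.sqrt(np.mean((np.asarray(y_true) - np.asarray(y_pred)) ** 2)))

def accuracy(y_true, lower, upper):
    # Fraction of true values falling inside their prediction intervals
    y, lo, hi = map(np.asarray, (y_true, lower, upper))
    return float(np.mean((y >= lo) & (y <= hi)))

def widths(lower, upper):
    # Mean and median prediction interval width
    w = np.asarray(upper) - np.asarray(lower)
    return float(np.mean(w)), float(np.median(w))
```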
3.8 Experimental setup
The evaluation of NDCP is done by applying the method on benchmark data sets from
the UCI Machine Learning Repository (Dua and Graff 2019). The experiments will be
performed on the Concrete Compressive Strength data set (Yeh 1998) and repeated on the Wine data set (Cortez et al. 2009) to verify the results. To simulate the scenario where data are located in different places, the data are split into subsets representing different data sources. After a test set has been set aside, the training data are split in three different ways, described below.
• Random splits of equal sizes: Training data is randomly split into equally large data sources.
• Random splits of unequal sizes: Training data is randomly split into different sizes, where one data source is assigned approximately twice as many observations compared to the rest. This simulates the real life scenario where one data owner holds more data than others.
• Non-random splits of equal sizes: Training data is split by weighting of the response variable such that one data source gets a larger proportion of observations with a high value of the response variable. This means that none of the data sources will have data distributed identically to the test set. This simulates the real life scenario where data owners hold differently distributed data.
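The three splitting schemes above can be sketched as follows. This is a simplified illustration with invented function names; in particular, the non-random split here simply sorts by the response rather than weighting it, which is a stronger version of the skew described above.

```python
import numpy as np

rng = np.random.default_rng(1)

def split_equal(idx, k):
    # Random disjoint splits of (approximately) equal sizes
    return np.array_split(rng.permutation(idx), k)

def split_unequal(idx, k):
    # One source gets roughly twice as many observations as each of the rest
    idx = rng.permutation(idx)
    cuts = np.cumsum([2] + [1] * (k - 1)) / (k + 1)   # weights 2, 1, 1, ...
    return np.split(idx, (cuts[:-1] * len(idx)).astype(int))

def split_nonrandom(idx, y, k):
    # Sort by the response so the last source is dominated by high responses
    order = np.asarray(idx)[np.argsort(y)]
    return np.array_split(order, k)
```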
Since sample sizes become relatively small when dividing the data into subsets, CCP is used instead of ICP, in order to make use of as much of the information in the data as possible. A large number of folds is preferable, but since sample sizes decrease when the data are divided into several subsets, 4-fold CCP is a feasible approach in this situation. For a fair comparison, the same number of folds will be used for all sample sizes.
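As a rough sketch of how a K-fold CCP interval could be produced, the following uses plain absolute residuals as nonconformity scores; the thesis additionally normalizes the scores with an error model (Section 3.2), which is omitted here, and training the final model on all data is one common variant, not necessarily the exact implementation used in the thesis.

```python
import numpy as np
from sklearn.model_selection import KFold
from sklearn.svm import SVR

def ccp_interval(X, y, x_test, significance=0.05, k=4):
    # Pool nonconformity scores (absolute residuals) over all k calibration folds
    scores = []
    for train_idx, cal_idx in KFold(n_splits=k, shuffle=True, random_state=0).split(X):
        model = SVR().fit(X[train_idx], y[train_idx])
        scores.extend(np.abs(y[cal_idx] - model.predict(X[cal_idx])))
    # Interval half-width: (1 - significance) quantile of the pooled scores
    alpha = np.quantile(scores, 1 - significance)
    # Point prediction from a model trained on all data (one common variant)
    pred = SVR().fit(X, y).predict(np.asarray(x_test).reshape(1, -1))[0]
    return pred - alpha, pred + alpha
```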
The evaluation procedure is outlined below, and is graphically illustrated in Figure 4.
1. Randomly split the data set into a training set (90%) and a test set (10%)
2. Split the training set into K = 2, 4, 6 disjoint data sets of
   (a) random sources of equal sizes
   (b) random sources of unequal sizes
   (c) non-random sources of equal sizes
3. Perform CCP on each individual data set
4. Aggregate predictions from all K data sets using NDCP
5. Perform CCP on the pooled data from all K data sets
6. Repeat steps one to five 100 times
For NDCP to be useful, its performance should be better than the performance based on the individual data sources. Additionally, it is desirable that it performs close to the results of the pooled data.
Together with these results, a hypothetical Ideal NDCP will also be presented. This represents an NDCP with an ideal combination of intervals. The ideal intervals will be constructed by symmetrically decreasing the prediction interval widths, until a correct coverage probability has been obtained. Note that this is not possible to do in a real life scenario; it can only be done after the true labels have been revealed and is only to represent the results NDCP would yield in the case of an optimal symmetric interval combination.
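The construction of the Ideal NDCP intervals can be sketched as a post-hoc shrinking procedure; the function name, step count and grid-based search are illustrative choices, not the thesis's exact implementation.

```python
# Hypothetical construction of the "Ideal NDCP" intervals: symmetrically shrink
# all intervals by a common factor until empirical coverage would drop below the
# target. This needs the true labels, so it is only a post-hoc benchmark.
import numpy as np

def ideal_shrink(y_true, lower, upper, target=0.95, steps=1000):
    y, lo, hi = map(np.asarray, (y_true, lower, upper))
    mid, half = (lo + hi) / 2, (hi - lo) / 2
    best_lo, best_hi = lo, hi
    for factor in np.linspace(1.0, 0.0, steps):
        new_lo, new_hi = mid - factor * half, mid + factor * half
        coverage = np.mean((y >= new_lo) & (y <= new_hi))
        if coverage < target:
            break                      # shrinking further would undercover
        best_lo, best_hi = new_lo, new_hi
    return best_lo, best_hi
```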
A linear SVR is used for estimating the error models, as described in Section 3.2. In all other cases, a nonlinear SVR with an RBF kernel is used, and the parameters C and γ are optimized in each run and data set through grid search, selected by 10-fold cross-validation.
Figure 4 Illustration of the evaluation procedure. The hypothetical "Pooled" data set is divided into a training set (90%) and a test set (10%). The training set is then split into different subsets, simulating different data sources. Each data source constructs a CCP using SVR and then predicts the test observations. Predictions are finally aggregated using NDCP.
4 Results
The results in this section are based on the Concrete Compressive Strength data set and are divided into three parts, one for each of the three settings for splitting the data: random sources of equal sizes (Section 4.1), random sources of unequal sizes (Section 4.2) and non-random sources of equal sizes (Section 4.3).
To verify the results, all experiments have also been performed on the Wine data set.
Those results show the same patterns and can be found in the Appendix.
4.1 Experiment 1: Random sources of equal sizes
In Table 1, the results from 100 simulations applying random disjoint splits of equal sizes are presented. The table is divided into three parts, where each part represents a setting with a different number of data sources. Within each part, every row represents a different model: NDCP denotes the aggregated intervals. Ideal NDCP represents the NDCP with ideal interval width, i.e. how wide the intervals would have been if the accuracy did not exceed the desired confidence level; these intervals are constructed by symmetrically shrinking the intervals until a 95% accuracy has been attained. Furthermore, each Source is an individual data source and, finally, Pooled represents a model trained on all data sources pooled together.
Considering the columns, the accuracy gives the proportion of true values that have been captured inside the prediction intervals. In addition, the RMSE of the point predictions, the mean and median widths of the prediction intervals, and the number of observations in the training set are presented.
Looking at the setting with two sources in Table 1, we can see that the accuracy of all models is above or around 95%, which means that the intervals are valid. In terms of RMSE, the pooled model performs best, as expected. The individual sources perform slightly worse, and NDCP lies somewhere in-between. Considering the intervals, we see that Pooled has a width of approximately 22, whereas NDCP and the two individual sources have widths of approximately 27. Pooled performs best in all aspects except accuracy.
Comparing NDCP with the individual sources, we can see that they perform equally well in terms of interval width. However, there are differences in accuracy: the intervals of NDCP are conservative, capturing 97% of the true values when the coverage probability should be 95%. Ideal NDCP has a tighter interval width than the individual sources, performing close to the pooled data.
Considering four and six sources, the results follow the same pattern: Pooled always has the best performance, as expected, while all individual sources behave approximately equally. NDCP always performs better than the individual sources in terms of RMSE and accuracy. As the number of data sources increases, the individual sources perform worse and worse, resulting in higher RMSE and wider intervals, while still maintaining a correct accuracy. We can also see that the difference between Ideal NDCP and the individual sources seems to grow larger as the number of sources increases.
Table 1 Performance measures from Experiment 1, equally sized data sources, for models NDCP, Ideal NDCP, the individual equally sized data sources (2, 4 and 6) and Pooled. PI refers to the Prediction Intervals and n to the number of observations underlying the predictions.
2 sources Accuracy RMSE PI Mean Width PI Median Width n
NDCP 0.971 6.005 27.032 26.813 927
Ideal NDCP 0.950 6.005 24.437 24.575 927
Source1 0.957 6.467 27.098 26.965 463
Source2 0.960 6.331 26.966 26.748 464
Pooled 0.964 5.424 22.535 22.283 927
4 sources Accuracy RMSE PI Mean Width PI Median Width n
NDCP 0.978 6.860 32.091 31.674 927
Ideal NDCP 0.950 6.860 27.480 27.630 927
Source1 0.958 7.507 31.906 31.655 231
Source2 0.958 7.592 32.098 31.858 232
Source3 0.955 7.545 32.000 31.793 232
Source4 0.957 7.622 32.522 32.186 232
Pooled 0.962 5.463 22.267 22.052 927
6 sources Accuracy RMSE PI Mean Width PI Median Width n
NDCP 0.977 7.396 34.854 34.492 927
Ideal NDCP 0.950 7.396 29.674 29.853 927
Source1 0.952 8.241 35.317 35.010 155
Source2 0.953 8.171 35.013 34.711 155
Source3 0.952 8.236 35.492 35.229 154
Source4 0.953 8.189 35.269 34.876 154
Source5 0.952 8.266 34.780 34.525 154
Source6 0.949 8.273 34.763 34.547 155
Pooled 0.967 5.485 22.838 22.601 927
In Figure 5, the dispersion of the prediction interval widths for two, four and six data sources is presented for Pooled, NDCP and one of the individual data sources. The vertical axis represents the interval width. Considering Pooled, we can see that it has a relatively low dispersion of widths compared to all others, as expected. Regardless of the number of data sources, NDCP has a lower dispersion of interval widths compared to the individual source. This difference does, however, increase with the number of data sources and is very distinct with six data sources. It is also worth recalling that while NDCP has lower dispersion than the individual source, its prediction intervals are also conservative.
Figure 5 Dispersion of prediction interval widths for Pooled, NDCP and one individual data source, for two, four and six data sources. The vertical axis shows the interval width.