Academic year: 2021

Share "A Cross-Validation Approach to Knowledge Transfer for SVM Models in the Learning Using Privileged Information Paradigm"

Copied!
25
0
0

Loading.... (view fulltext now)

Full text

(1)

A Cross-Validation Approach to Knowledge Transfer for

SVM Models in the Learning Using Privileged Information

Paradigm

Submitted by

Fabian Söderdahl

A thesis submitted to the Department of Statistics in partial fulfillment of the requirements for the Master's degree in Statistics in the Faculty of Social Sciences

Supervisor

Johan Bring


ABSTRACT

The learning using privileged information paradigm has allowed support vector machine models to incorporate privileged information, variables available in the training set but not in the test set, to improve predictive ability. The subsequent introduction of the knowledge transfer method has enabled a practical application of support vector machine models utilizing privileged information. This thesis describes a modified knowledge transfer method inspired by cross-validation which, unlike the current standard knowledge transfer method, does not create the knowledge transfer function and the approximated privileged features used in the support vector machines on the same observations. The modified method, the robust knowledge transfer, is described and evaluated against the standard knowledge transfer method and is shown to be able to improve the predictive performance of the support vector machines for both binary classification and regression.


Contents

1 Introduction

1.1 Aim and research question

1.2 Outline of the thesis

2 Support Vector Machines

2.1 Support Vector Machine

2.1.1 The optimization problem

2.1.2 Kernels

2.2 Support Vector Regression (SVR)

3 The Learning Using Privileged Information Paradigm

3.1 Notation

3.2 The SVM+ Implementation

3.3 Knowledge Transfer

3.4 Robust Knowledge Transfer

4 Method

4.1 Models and Evaluation Criteria


1 Introduction

Models used for prediction are utilized in many real-life applications. Consider the following simplified example from Vapnik & Izmailov (2015) of a prediction problem for binary classification:

(i) A set of n cancer biopsy images (as image pixels) is given. Furthermore, for each image the resulting outcome, cancer or not cancer, is provided. The goal is to use this data to create a decision rule which can be used for predicting the outcome of future biopsy images, which were not part of the original dataset.

There are several methods available to solve the problem in (i). Possible models may be logistic regression, neural networks or support vector machines. For each method the training set is supplied to a classification algorithm and a decision rule is created. Afterwards, the decision rule can be applied to new biopsy images and their outcome can be predicted. Now, consider the following modification of the initial problem in (i):

(ii) A set of n cancer biopsy images and their corresponding outcomes are provided. In addition, each biopsy image is also supplied with a grading given by an expert pathologist indicating the expert view of the degree of cancer in the biopsy. The goal is to use this data to create a decision rule for predicting the outcome of future biopsy images. However, the expert grading will not be available for future biopsy images.

Most traditional models used for prediction assume symmetry between the variables in the data used to create the decision rule, the training set, and the data used to apply the decision rule, the test set. The models require that both datasets contain the same variables. Applying a traditional classification model in (ii), the additional information available when training the model is discarded and possibly valuable information for prediction is lost. This situation, where there is additional information during training, can arise for multiple reasons. For example, the information can be expensive to acquire or there may be a time lag before it is available.


The SVM+ model (described in section 3.2) can improve predictive performance, but the algorithm is hard to scale to larger samples. As an alternative, Vapnik & Izmailov (2015) introduced the concept of knowledge transfer. The main concept is to transfer the information from the space of privileged information to the space of the standard training features. Methods for knowledge transfer proposed in Vapnik & Izmailov (2016a) showed improvements compared to only using the standard features.

One notable feature of the knowledge transfer method is that all observations in the training set are used both to create a knowledge transfer function, which transfers privileged features from privileged space to standard feature space, and to create the approximated privileged features used to train the SVM model. For prediction models, it is advised not to both estimate a model and perform predictions on the same data, as the model can become overfitted to the training set and not generalize well to other sets of data (Arlot et al., 2010). Cross-validation is often used to prevent this issue. The question is how to introduce the cross-validation approach in the current knowledge transfer framework. This thesis describes a new implementation of knowledge transfer inspired by cross-validation and evaluates its performance in comparison to the currently available method.

1.1 Aim and research question

The primary aim of this thesis is to describe a modified method for knowledge transfer under the learning using privileged information paradigm, the robust knowledge transfer. The described method will be compared to the existing method for knowledge transfer in order to answer the following question:

• Can the robust knowledge transfer method improve the predictive performance of SVM models compared to the current standard knowledge transfer method?

1.2 Outline of the thesis


2 Support Vector Machines

Support vector machines are a set of supervised learning algorithms. The following section contains an overview of the two learning algorithms belonging to this set which are both considered in this thesis. The first algorithm is the support vector machine (SVM), used for binary classification. The second algorithm is support vector regression (SVR), used for prediction of continuous values. For clarification of notation, in this thesis support vector machines (plural) refers to the whole family of support vector machine prediction models, of which the support vector machine and support vector regression are two subgroups.

2.1 Support Vector Machine

The Support Vector Machine (SVM) is a model for binary classification. The current form of SVM is a non-linear generalization of the generalized portrait algorithm (Vapnik & Lerner, 1963; Vapnik & Chervonenkis, 1964). The originally proposed algorithm was a linear classifier and Boser et al. (1992) suggested using the kernel trick which extended the model to non-linear classification. The earliest implementations of SVM were developed for linearly separable data and Cortes & Vapnik (1995) proposed an extension to SVM for non-linearly separable data.

2.1.1 The optimization problem

The SVM works by constructing a decision rule, which separates the data into two classes. The goal of SVM is to, from a given set of functions, find the function which best approximates the unknown decision rule. Best in this case refers to the function that gives the smallest probability of error in classification. The approximation of the unknown decision rule is created by, given a set of training data, constructing a separating hyperplane which divides the data into the two classes. Formally the SVM algorithm is described as follows. The training set is given in the form of pairs

$$(x_1, y_1), \ldots, (x_\ell, y_\ell), \quad x_i \in \mathcal{X},\ y_i \in \mathcal{Y} \quad (1)$$

which are generated by an unknown but fixed probability measure P(x, y) = P(y|x)P(x). The training set contains ℓ decision vectors from the n-dimensional decision space X = ℝⁿ and the corresponding ℓ decision labels from the label space Y = {−1, 1}.


The goal is to find the separating hyperplane with the largest margin (or distance) between the classes. For observation i the functional margin is defined as

$$\gamma_i = y_i(w^T x_i + b) \quad (2)$$

where w is called the weight vector and b is the bias. If γ_i is larger than 0, the prediction for example i is correct. The geometric margin is obtained by scaling the functional margin by ‖w‖. The goal can now be expressed as maximizing the geometric margin, which leads to the following optimization problem (the intermediary steps to the final optimization problem are not described in this thesis). Two cases can be considered: the separable case and the non-separable case. In the separable case one must solve the following problem with a convex quadratic objective and only linear constraints

$$\underset{w,b}{\text{minimize}} \quad \frac{1}{2}\|w\|^2 \quad (3)$$
$$\text{subject to} \quad y_i[w^T x_i + b] \ge 1, \quad i = 1, \ldots, \ell \quad (4)$$

In the non-separable case the optimization problem is similar, but here slack variables ξ_i are introduced which allow an example to lie inside the margin

$$\underset{w,b}{\text{minimize}} \quad \frac{1}{2}\|w\|^2 + C\sum_{i=1}^{\ell}\xi_i \quad (5)$$
$$\text{subject to} \quad y_i[w^T x_i + b] \ge 1 - \xi_i, \quad i = 1, \ldots, \ell \quad (6)$$
$$\xi_i \ge 0, \quad i = 1, \ldots, \ell \quad (7)$$

The constant C controls the trade-off between minimizing ‖w‖² and keeping most examples with a functional margin of at least 1. A simplified optimization problem is acquired by constructing the Lagrangian for the previous problem, yielding a primal problem and the following dual formulation

$$\underset{\alpha}{\text{maximize}} \quad \sum_{i=1}^{\ell}\alpha_i - \frac{1}{2}\sum_{i=1}^{\ell}\sum_{j=1}^{\ell} y_i y_j \alpha_i \alpha_j x_i^T x_j \quad (8)$$
$$\text{subject to} \quad \sum_{i=1}^{\ell} y_i \alpha_i = 0 \quad (9)$$
$$0 \le \alpha_i \le C \quad (10)$$


The dual solution is then substituted into the primal problem for finding b. Lastly, for an observation x_t in the test set, a prediction ŷ_t is performed using

$$\hat{y}_t = \operatorname{sgn}\left(\sum_{i=1}^{\ell} \alpha_i y_i x_i^T x_t + b\right) \quad (11)$$

2.1.2 Kernels

Using kernels as suggested in Boser et al. (1992) allows for efficiently estimating non-linear classifiers without explicitly mapping the data to a higher-dimensional feature space. The kernel replaces the dot product x_i^T x_j in equations (8) and (11). In the models used in this thesis

the best of three kernels considered for each evaluation was chosen. The linear kernel:

$$K(x_i, x_j) = x_i^T x_j + c \quad (12)$$

The polynomial kernel:

$$K(x_i, x_j) = (\alpha x_i^T x_j + c)^d \quad (13)$$

The radial basis function kernel (RBF kernel):

$$K(x_i, x_j) = \exp\left(-\frac{\|x_i - x_j\|^2}{2\sigma^2}\right) \quad (14)$$
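The three kernels in equations (12)-(14) can be written directly as functions. This is a minimal sketch, with the hyperparameter values c, α, d and σ as illustrative choices:

```python
import numpy as np

def linear_kernel(xi, xj, c=0.0):
    # Equation (12): K(xi, xj) = xi^T xj + c
    return xi @ xj + c

def polynomial_kernel(xi, xj, alpha=1.0, c=1.0, d=2):
    # Equation (13): K(xi, xj) = (alpha * xi^T xj + c)^d
    return (alpha * (xi @ xj) + c) ** d

def rbf_kernel(xi, xj, sigma=1.0):
    # Equation (14): K(xi, xj) = exp(-||xi - xj||^2 / (2 sigma^2))
    return np.exp(-np.sum((xi - xj) ** 2) / (2 * sigma ** 2))
```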

2.2 Support Vector Regression (SVR)

An extension of support vector machines to the regression case was introduced by Drucker et al. (1997). The method follows similar principles as the classification case, with some differences. A margin of tolerance, ε, is introduced. The goal is to fit a hyperplane such that the maximum number of observations lie within ±ε distance from the hyperplane while keeping flatness, that is, minimizing ‖w‖ (Smola & Schölkopf, 2004). Slack variables ξ_i, ξ_i^* allow points to lie outside of the margin of tolerance. The optimization problem to be solved is


Once again the Lagrangian of the previous problem yields the primal problem and the following dual formulation, which is a simplified optimization problem

$$\underset{\alpha}{\text{maximize}} \quad -\frac{1}{2}\sum_{i,j=1}^{\ell}(\alpha_i - \alpha_i^*)(\alpha_j - \alpha_j^*)K(x_i, x_j) - \varepsilon\sum_{i=1}^{\ell}(\alpha_i + \alpha_i^*) + \sum_{i=1}^{\ell} y_i(\alpha_i - \alpha_i^*) \quad (19)$$
$$\text{subject to} \quad \sum_{i=1}^{\ell}(\alpha_i - \alpha_i^*) = 0 \quad (20)$$
$$\alpha_i, \alpha_i^* \in [0, C] \quad (21)$$

Note that in equation (19) the kernel has been placed directly in the formula as described in section 2.1.2. The solutions to the optimization problems can be found similarly to the SVM case, with slight modifications (Smola & Schölkopf, 2004). The prediction ŷ_t for an observation x_t in the test set is performed using

$$\hat{y}_t = \sum_{i=1}^{\ell}(\alpha_i - \alpha_i^*)K(x_i, x_t) + b \quad (22)$$

Again one can note that the kernel K(x_i, x_t) has been placed directly in the formula, as described in section 2.1.2.
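As a usage sketch of the SVR described in this section (not code from the thesis), scikit-learn's `SVR` exposes C, ε and the kernel choice directly; the data and hyperparameter values here are illustrative:

```python
import numpy as np
from sklearn.svm import SVR

# Noisy 1-D regression data as a placeholder example.
rng = np.random.default_rng(0)
X = rng.uniform(-1, 1, size=(200, 1))
y = np.sin(3 * X[:, 0]) + 0.05 * rng.normal(size=200)

# C trades off flatness against margin violations; epsilon is the
# margin of tolerance; the RBF kernel corresponds to equation (14).
model = SVR(kernel="rbf", C=1.0, epsilon=0.1).fit(X, y)
y_hat = model.predict(np.array([[0.5]]))
```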


3 The Learning Using Privileged Information Paradigm

The main concept of learning using privileged information (LUPI) is to utilize additional information in the training set, called privileged information. This privileged information is not available in the test set.

3.1 Notation

In the classification case, the outcome space is defined as Y = {−1, 1}. In the regression case, the outcome space is defined as Y = ℝ. The object space for standard features is denoted X = ℝⁿ, where n is the number of standard features. The object space for privileged features is X* = ℝᵐ, where m is the number of privileged features.

In order to properly describe the knowledge transfer and robust knowledge transfer methods, some matrix notation is introduced. The outcome vector is denoted Y. The ℓ × n standard features design matrix is denoted X and the ℓ × m privileged features design matrix is denoted X*. The vector and two matrices are defined as

$$Y = \begin{pmatrix} y_1 \\ y_2 \\ \vdots \\ y_\ell \end{pmatrix}, \quad X = \begin{pmatrix} x_1^1 & x_1^2 & \cdots & x_1^n \\ x_2^1 & x_2^2 & \cdots & x_2^n \\ \vdots & \vdots & & \vdots \\ x_\ell^1 & x_\ell^2 & \cdots & x_\ell^n \end{pmatrix}, \quad X^* = \begin{pmatrix} x_1^{*1} & x_1^{*2} & \cdots & x_1^{*m} \\ x_2^{*1} & x_2^{*2} & \cdots & x_2^{*m} \\ \vdots & \vdots & & \vdots \\ x_\ell^{*1} & x_\ell^{*2} & \cdots & x_\ell^{*m} \end{pmatrix}$$

Note that x_i refers to the vector of standard features for observation i and x_i^j refers to the j-th standard feature for observation i. Lastly, let subscripts in parentheses such as X_(k) denote a partition k of X by rows, and let a negative subscript in parentheses, X_(−k), denote all partitions except k. If K denotes the total number of partitions, then

$$X = \begin{pmatrix} X_{(1)} \\ X_{(2)} \\ \vdots \\ X_{(K)} \end{pmatrix}$$
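The partition notation can be illustrated with a short sketch; `numpy.array_split` also handles folds of unequal size when ℓ is not divisible by K. The 6 × 2 matrix and K = 3 are arbitrary illustrative choices:

```python
import numpy as np

def partition(X, K):
    """Split X row-wise into K (near-)equal folds X_(1), ..., X_(K)."""
    return np.array_split(X, K)

def complement(folds, k):
    """X_(-k): all folds except fold k (1-indexed as in the text)."""
    return np.vstack([f for i, f in enumerate(folds, start=1) if i != k])

X = np.arange(12).reshape(6, 2)     # 6 observations, 2 features
folds = partition(X, 3)             # three folds of 2 rows each
X_minus_1 = complement(folds, 1)    # the 4 rows outside fold 1
```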

3.2 The SVM+ Implementation


The algorithm was denoted SVM+ and was shown to improve the predictive performance compared to SVM on only the standard features. In particular, it improved the rate of convergence of the error rate to the optimal solution, indicating that the SVM+ could perform better especially on training sets with a smaller number of observations.

An algorithm for solving the SVM+ optimization problem was presented in Pechyony et al. (2010). The optimization problem that needs to be solved for the SVM+ algorithm is more difficult than that of the original SVM. The algorithm is also difficult to scale, making it impractical for datasets with a larger number of observations.

3.3 Knowledge Transfer

To address the scalability issue of the SVM+ model, Vapnik & Izmailov (2015) introduced the knowledge transfer method. Using knowledge transfer, the optimization problem that needs to be solved is the same as the optimization problems in section 2, allowing it to be solved by standard SVM algorithms. The setting is similar to section 2.1.1, but instead of pairs, a set of triplets is given

$$(x_1, x_1^*, y_1), \ldots, (x_\ell, x_\ell^*, y_\ell), \quad x_i \in \mathcal{X},\ x_i^* \in \mathcal{X}^*,\ y_i \in \mathcal{Y} \quad (23)$$

which are generated according to a fixed but unknown probability measure P(x, x*, y) = P(x*, y|x)P(x). The training set consists of ℓ decision vectors from the n-dimensional space X = ℝⁿ, the corresponding ℓ privileged vectors from the m-dimensional privileged space X* = ℝᵐ and the corresponding ℓ decision labels from the label space Y = {−1, 1}, or in the regression case Y = ℝ. The aim of the SVM model is the same as previously. However, there is now an added aim for knowledge transfer, which is to create a rule to transfer the information from privileged space X* to decision space X.

The proposed method (Vapnik & Izmailov, 2016a) is to create m multivariate regression functions φ_i, one for each privileged feature as the dependent variable, using the standard features as independent variables. With φ_i(x_j^1, ..., x_j^n) denoting the approximation of privileged feature i for observation j, the matrix of approximated privileged features is

$$\Phi(X) = \begin{pmatrix} \phi_1(x_1^1, \ldots, x_1^n) & \cdots & \phi_m(x_1^1, \ldots, x_1^n) \\ \phi_1(x_2^1, \ldots, x_2^n) & \cdots & \phi_m(x_2^1, \ldots, x_2^n) \\ \vdots & \ddots & \vdots \\ \phi_1(x_\ell^1, \ldots, x_\ell^n) & \cdots & \phi_m(x_\ell^1, \ldots, x_\ell^n) \end{pmatrix} \quad (24)$$

As a reminder, the original training set is [Y, X, X*]. The matrix of approximated features is combined with the outcome vector and the standard features design matrix, giving the modified training set [Y, X, Φ(X)]. From the specified modified training set it is clear that, after creating the knowledge transfer function, the privileged features are no longer required for estimating the SVM model.

The SVM algorithm is applied to the modified training set, creating an (n + m)-dimensional decision rule. The knowledge transfer rule is also applied to the test data and approximated privileged features are estimated for the test set. The test data, with the approximated privileged features, is then applied to the decision rule and the results are evaluated. Figure 1 gives a flow chart of the process of fitting a SVM model with knowledge transfer.
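The knowledge transfer procedure above can be sketched in a few lines. This is a minimal illustration, assuming multiple linear regression as the transfer method (one of the regression methods used later in the thesis) and synthetic placeholder data for X, X*, y and the test set:

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.svm import SVC

# Placeholder data: 3 standard features, 2 privileged features.
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 3))                     # standard features
X_star = X @ rng.normal(size=(3, 2))              # privileged features
y = np.sign(X[:, 0] + X[:, 1])                    # binary labels in {-1, 1}
X_test = rng.normal(size=(20, 3))

# Step 1: fit one regression per privileged feature (the functions phi_i).
transfer = [LinearRegression().fit(X, X_star[:, i])
            for i in range(X_star.shape[1])]

# Step 2: approximate the privileged features for train and test data.
def phi(regs, Z):
    return np.column_stack([reg.predict(Z) for reg in regs])

X_mod = np.hstack([X, phi(transfer, X)])          # [X, Phi(X)]
X_test_mod = np.hstack([X_test, phi(transfer, X_test)])

# Step 3: train a standard SVM on the modified training set and predict.
svm = SVC(kernel="rbf").fit(X_mod, y)
y_hat = svm.predict(X_test_mod)
```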


3.4 Robust Knowledge Transfer

Cross-validation is a commonly applied method in classification models. While there are different types of cross-validation, the general idea is to partition the observations in the data and estimate the model on some partitions while evaluating it on the remaining partitions (Trevor et al., 2009). In the knowledge transfer method in section 3.3, the same standard features are used as input both in the regression models and in the consequent prediction of privileged features based on those regressions. This could result in overfitting the knowledge transfer rule to the training data, which in turn will affect the training of the SVM model.

The proposed method in this thesis is called robust knowledge transfer (RKT) and is based on a version of K-fold cross-validation. The main concept behind the method is to refrain from using the same examples for creating a knowledge transfer rule and for predicting the privileged features with that rule. The intuition is that by avoiding this, the transfer rule will generalize better to the test data.

The first step in RKT partitions the training set into K equal folds. For each partition [X_(k), X*_(k)], k = 1, ..., K, the m knowledge transfer regressions described in section 3.3 are performed. The resulting knowledge transfer function i for partition k is denoted with a superscript as φ_i^k and, consequently, the matrix of approximated features using rule k as Φ^k. The resulting knowledge transfer rule for k is applied to the partitions not in k. This yields K modified design matrices [X_(−k), Φ^k(X_(−k))].

Each modified design matrix k is used to train a SVM model. This results in K different decision rules. The end result of the training phase is thus K knowledge transfer rules and corresponding decision rules. In the test phase, each knowledge transfer rule and decision rule k is applied to the test data. This yields K different sets of predictions. The last step is to combine the predictions into one value. For the classification case, the resulting predictions are weighted using probabilities calculated according to Platt et al. (1999). Letting p_i^k(−1) be the probability of the true value y_i being −1 and p_i^k(1) the probability of the true value being 1 for individual i from decision rule k, the combined prediction is

$$\hat{y}_i = \operatorname{sgn}\left(\sum_{k=1}^{K}\left[(-1)\times p_i^k(-1) + p_i^k(1)\right]\right) \quad (25)$$

In the regression case, the combined prediction for individual i is defined as

$$\hat{y}_i = \frac{1}{K}\sum_{k=1}^{K}\hat{y}_i^k \quad (26)$$


In other words, it is the average value of the K different predictions. Figure 2 shows a flow chart of fitting a SVM model with robust knowledge transfer using K = 3 partitions.
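The training and test phases above can be sketched for the regression case, where the K predictions are simply averaged. This is a minimal illustration, assuming linear regression as the transfer method, K = 3 folds and synthetic placeholder data (the thesis uses kernel ridge regression and K = 10 on the real datasets):

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.svm import SVR
from sklearn.model_selection import KFold

# Placeholder data: 3 standard features, 2 privileged features.
rng = np.random.default_rng(1)
X = rng.normal(size=(90, 3))
X_star = X @ rng.normal(size=(3, 2))               # privileged features
y = X[:, 0] + 0.1 * rng.normal(size=90)
X_test = rng.normal(size=(15, 3))

K = 3
models = []
for _, fold_idx in KFold(n_splits=K, shuffle=True, random_state=0).split(X):
    # Fit the m transfer regressions on fold k only ...
    transfer = [LinearRegression().fit(X[fold_idx], X_star[fold_idx, i])
                for i in range(X_star.shape[1])]
    phi = lambda Z, t=transfer: np.column_stack([r.predict(Z) for r in t])
    # ... then train the SVM on the observations outside fold k,
    # augmented with the fold-k approximated privileged features.
    rest = np.setdiff1d(np.arange(len(X)), fold_idx)
    svr = SVR(kernel="rbf").fit(np.hstack([X[rest], phi(X[rest])]), y[rest])
    models.append((phi, svr))

# Test phase: apply each of the K rule pairs and average the predictions.
y_hat = np.mean([svr.predict(np.hstack([X_test, phi(X_test)]))
                 for phi, svr in models], axis=0)
```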


4 Method

In the following section the different scenarios used in the evaluation are presented. The evaluation criteria are described and an overview of the datasets used is given.

4.1 Models and Evaluation Criteria

In order to evaluate the performance of the robust knowledge transfer, four different scenarios will be considered. These four scenarios are the following:

1. SVM Standard: The SVM algorithm is trained and evaluated only with standard features. Privileged features are not included in this scenario.

2. SVM KT: Privileged features are transferred from privileged space X* to decision space X using appropriate regression methods on all training data. The SVM is trained and evaluated on standard features and approximated privileged features.

3. SVM RKT: The training set is partitioned into K partitions and privileged features are transferred from privileged space to decision space according to the robust knowledge transfer method. The SVM is trained on K sets containing standard features and approximated privileged features and evaluated on the resulting combined prediction.

4. SVM Complete: The SVM algorithm is trained and evaluated on standard and privileged features. In this scenario the privileged features are treated as if they are available at both training and test time.

The four scenarios are evaluated on both binary classification and regression datasets. The main measure for evaluating the models differs between these two cases. In the binary classification case, the percentage error rate (ER) is used, defined as

$$\text{ER} = \frac{\text{Number of incorrectly classified observations}}{n} \times 100 \quad (27)$$

where n is the number of examples in the test set. In the regression case, the root mean squared error (RMSE) is used to evaluate the performance. Let ŷ_i be the predicted value and y_i the actual value for observation i in the test set. The RMSE is then defined as

$$\text{RMSE} = \sqrt{\frac{\sum_{i=1}^{n}(\hat{y}_i - y_i)^2}{n}} \quad (28)$$


Each evaluation is run 50 times where at each iteration the observations in the test set are randomly chosen. The results are presented as the average ER or RMSE over all iterations.

In Vapnik & Izmailov (2017) and Izmailov et al. (2017), an additional evaluation measure is used. As the SVM Complete contains more information than the SVM KT and SVM RKT, the ER/RMSE should be higher for both knowledge transfer methods compared to the SVM Complete. Using the same reasoning the knowledge transfer methods should have a lower ER/RMSE compared to the SVM Standard. Let A be the error of SVM RKT or SVM KT, B denote the error of SVM Standard and C denote the error of SVM Complete. The LUPI-specific gain is then defined as

$$\text{LUPI-gain} = \frac{B - A}{B - C} \quad (29)$$

The denominator is the difference in error between the model with only standard features and a theoretical model which has standard features and fully available privileged features (for both the training and test set). The LUPI-gain thus measures how much of the possible improvement from the Standard model to the Complete model is recovered by utilizing knowledge transfer.
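Equation (29) is straightforward to compute; the example below plugs in the Parkinsons error rates from Table 3, with B the SVM Standard error, C the SVM Complete error and A the error of the method being evaluated:

```python
def lupi_gain(a, b, c):
    """Fraction of the Standard-to-Complete improvement recovered,
    equation (29): (B - A) / (B - C)."""
    return (b - a) / (b - c)

# Parkinsons error rates from Table 3.
gain_kt = lupi_gain(a=10.31, b=11.69, c=8.05)   # SVM KT, approx. 0.38
gain_rkt = lupi_gain(a=8.82, b=11.69, c=8.05)   # SVM RKT, approx. 0.79
```

A method that matched the Complete model exactly (A = C) would score a LUPI-gain of 1, while a method no better than the Standard model (A = B) would score 0.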

4.2 Datasets

An overview of the datasets used for evaluation is given in table 1. There are few publicly available datasets specifically for the LUPI paradigm. For the classification datasets Ionosphere and kc2, privileged features were chosen based on Izmailov et al. (2017). For Parkinsons and all regression datasets, privileged features were chosen based on mutual information in accordance with Vapnik & Izmailov (2017). The datasets used for evaluation of the models are taken from [1] the UCI Machine Learning Repository (Dua & Graff, 2017) and [2] Harrison Jr & Rubinfeld (1978).

Table 1. Overview of datasets used for evaluation

Type Dataset Size Decision Features Privileged Features

Classification Ionosphere [1] 351 30 4

Classification Parkinsons [1] 195 12 10

Regression Red Wine Data [1] 1599 6 5


5 Results

This section presents the results from the evaluations on the simulated dataset and the four datasets shown in section 4.2. First the evaluations on classification datasets are presented, followed by the results on regression datasets.

5.1 Classification

The first evaluation is based on simulated data according to Vapnik & Izmailov (2016a). Two standard features, (x_i^1, x_i^2), i = 1, ..., ℓ, are each randomly drawn from a uniform distribution with minimum −1 and maximum 1. A privileged feature is calculated as x_i^3 = x_i^1 + x_i^2 + 0.01 × W, where W ~ N(0, 1). The label is computed as y_i = sgn(x_i^1 + x_i^2). The test set is constructed using the same distribution and contains 10,000 observations. The number of examples ℓ in the training set is varied over 30, 40, 60 and 80. The SVM is estimated using an RBF kernel and knowledge transfer is performed using multiple linear regression. The SVM RKT is evaluated using K = 5 folds, due to the small number of observations in the training sets considered.
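The simulation setup above can be sketched as follows; the function and variable names are illustrative, and ℓ = 30 matches the smallest training size considered:

```python
import numpy as np

def simulate(ell, rng):
    x1 = rng.uniform(-1, 1, ell)                  # standard feature 1
    x2 = rng.uniform(-1, 1, ell)                  # standard feature 2
    x3 = x1 + x2 + 0.01 * rng.normal(size=ell)    # privileged feature
    y = np.sign(x1 + x2)                          # label in {-1, 1}
    return np.column_stack([x1, x2]), x3, y

rng = np.random.default_rng(2)
X_train, x_star, y_train = simulate(30, rng)
X_test, _, y_test = simulate(10_000, rng)         # privileged unused at test
```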

Table 2. Average percentage error rate for classification on simulated data (50 iterations). SVM RKT uses K = 5 folds.

Training size SVM Standard SVM KT SVM RKT SVM Complete

30 8.09 7.24 6.10 6.91

40 6.33 5.50 4.92 5.29

60 5.39 4.69 4.39 4.53

80 4.19 3.43 3.32 3.45

Table 2 presents the average percentage error rates on the simulated data. Both methods utilizing knowledge transfer show smaller percentage error rates compared to the SVM Standard model. The SVM RKT also shows improvement in error rate compared to the SVM KT for all training sizes considered. The difference in error rate between the two methods decreases as the training size increases.


The authors attribute this finding to the specification of the privileged feature. The noise introduced in the calculation of the privileged feature is filtered out by the regression during the knowledge transfer, yielding a better predictor of the outcome.

For each of the two real-life datasets considered, 25% of the observations were randomly chosen as test data and the remaining observations were assigned as training data. The four scenarios were each computed 50 times; each time a new training and test set were randomly chosen. The SVM models were estimated using an RBF kernel and the regression method for knowledge transfer was kernel ridge regression with an RBF kernel. The robust knowledge transfer uses K = 10 folds.

Table 3. Average percentage error rate and average LUPI gain for classification on Ionosphere and Parkinsons datasets (50 iterations). SVM RKT uses K = 10 folds.

Dataset Measure SVM Standard SVM KT SVM RKT SVM Complete

Ionosphere Error rate 6.31 6.24 6.05 5.52

Ionosphere LUPI gain - 8.8% 32.4% -

Parkinsons Error rate 11.69 10.31 8.82 8.05

Parkinsons LUPI gain - 38.0% 78.9% -

The results in table 3 show that, for both datasets considered, the SVM Complete model performs best; however, both knowledge transfer methods have smaller error rates compared to the SVM Standard. The SVM RKT model also yields lower error rates compared to the SVM KT model. For the Ionosphere dataset the SVM KT recovers only 9 percent as seen by the LUPI-gain, while the SVM RKT recovers 32 percent. For the Parkinsons data the difference is even larger, where SVM RKT recovers 79 percent and SVM KT 38 percent, a difference of 41 percentage points.

5.2 Regression


Table 4. Average RMSE and LUPI gain for regression using Boston Housing Data (50 iterations). SVM RKT uses K = 10 folds.

Measure Training size SVM Standard SVM KT SVM RKT SVM Complete

RMSE 50 8.27 8.14 7.70 5.30

LUPI gain 50 - 4.4% 19.1% -

RMSE 100 7.66 7.57 7.26 4.47

LUPI gain 100 - 2.6% 12.4% -

RMSE 150 7.51 7.35 6.95 3.95

LUPI gain 150 - 4.4% 15.6% -

RMSE 379 6.90 6.20 6.19 3.31

LUPI gain 379 - 19.4% 19.8% -

The results for the evaluation on the Boston Housing data are given in table 4. The knowledge transfer methods show some improvements over the SVM Standard, although the improvements as measured by LUPI gain are not as large as in the classification cases. The SVM RKT has lower RMSE compared to the SVM KT; for the training size of 50 the difference is most clear, especially looking at the LUPI-gain measure. However, as the training size increases, the differences between the methods become small. For the largest training size of 379, there is no real difference between the methods.


Table 5. Average RMSE and LUPI gain for regression on Red Wine Dataset (50 iterations). SVM RKT is estimated using K = 10 folds.

Measure Training size SVM Standard SVM KT SVM RKT SVM Complete


6 Discussion

This thesis describes a modified method for knowledge transfer, the robust knowledge transfer, inspired by cross-validation. The method is evaluated against the currently available standard knowledge transfer method from Vapnik & Izmailov (2016a) for various datasets and training sizes. The results in section 5 show that there can be improvements in predictive performance by taking the cross-validation approach in the robust knowledge transfer compared to the current standard knowledge transfer. It is important to note that the results do not prove that the robust knowledge transfer method is in general better than the standard knowledge transfer method. Instead, the results show that there can be improvements using the robust knowledge transfer, but further research would need to examine in which scenarios the use of robust knowledge transfer is motivated.

The results give some indication that the improvements of the robust knowledge transfer over the standard knowledge transfer, in decreased ER and RMSE, become smaller when the training size is increased. When the training size is increased, the size of each fold is also increased, and the folds will start to behave more like the full training set (Trevor et al., 2009). This could be why the robust knowledge transfer and the standard knowledge transfer appear to converge as the training size is increased.

There can also be other factors than the cross-validation approach alone that contribute to the improved performance. The RKT method yields K different predictions which in turn are combined into a single prediction. For binary classification, probabilities from Platt et al. (1999) are utilized to combine the K predictions. Vapnik & Izmailov (2016b) point out some drawbacks of those probabilities and present another method for combining multiple decision rules. Utilizing more suitable methods for combining the predictions in the robust knowledge transfer could lead to even better predictive performance.


7 Conclusions

This thesis describes a modified method of knowledge transfer incorporating cross-validation methods. The proposed method, the robust knowledge transfer, is evaluated in both classification and regression scenarios on publicly available machine learning datasets. The results show that the robust knowledge transfer can improve predictive performance, that is, yield lower error rates or RMSE, compared to the current standard knowledge transfer. The results also indicate that the improvements over the standard knowledge transfer may be larger when the training size is smaller.


Acknowledgements


References

Arlot, S., Celisse, A. et al. (2010), ‘A survey of cross-validation procedures for model selection’, Statistics Surveys 4, 40–79.

Boser, B. E., Guyon, I. M. & Vapnik, V. N. (1992), A training algorithm for optimal margin classifiers, in ‘Proceedings of the fifth annual workshop on Computational learning theory’, ACM, pp. 144–152.

Cortes, C. & Vapnik, V. (1995), ‘Support-vector networks’, Machine Learning 20(3), 273–297.

Drucker, H., Burges, C. J., Kaufman, L., Smola, A. J. & Vapnik, V. (1997), Support vector regression machines, in ‘Advances in Neural Information Processing Systems’, pp. 155–161.

Dua, D. & Graff, C. (2017), ‘UCI machine learning repository’.
URL: http://archive.ics.uci.edu/ml

Harrison Jr, D. & Rubinfeld, D. L. (1978), ‘Hedonic housing prices and the demand for clean air’, Journal of environmental economics and management 5(1), 81–102.

Izmailov, R., Lindqvist, B. & Lin, P. (2017), Feature selection in learning using privileged information, in ‘2017 IEEE International Conference on Data Mining Workshops (ICDMW)’, IEEE, pp. 957–963.

Pechyony, D., Izmailov, R., Vashist, A. & Vapnik, V. (2010), Smo-style algorithms for learning using privileged information., in ‘DMIN’, pp. 235–241.

Platt, J. (1998), ‘Sequential minimal optimization: A fast algorithm for training support vector machines’.

Platt, J. et al. (1999), ‘Probabilistic outputs for support vector machines and comparisons to regularized likelihood methods’, Advances in large margin classifiers 10(3), 61–74.

Smola, A. J. & Schölkopf, B. (2004), ‘A tutorial on support vector regression’, Statistics and Computing 14(3), 199–222.


Vapnik, V. (2006), Estimation of dependences based on empirical data, Springer Science & Business Media.

Vapnik, V. & Chervonenkis, A. (1964), ‘A note on one class of perceptrons’, Automation and Remote Control 25.

Vapnik, V. & Izmailov, R. (2015), Learning with intelligent teacher: Similarity control and knowledge transfer, in ‘International Symposium on Statistical Learning and Data Sciences’, Springer, pp. 3–32.

Vapnik, V. & Izmailov, R. (2016a), Learning with intelligent teacher, in ‘Symposium on Conformal and Probabilistic Prediction with Applications’, Springer, pp. 3–19.

Vapnik, V. & Izmailov, R. (2016b), ‘Synergy of monotonic rules’, The Journal of Machine Learning Research 17(1), 4722–4754.

Vapnik, V. & Izmailov, R. (2017), ‘Knowledge transfer in SVM and neural networks’, Annals of Mathematics and Artificial Intelligence 81(1-2), 3–19.

Vapnik, V. & Lerner, A. Y. (1963), ‘Recognition of patterns with help of generalized portraits’, Avtomat. i Telemekh 24(6), 774–780.
