Investigation of multivariate prediction methods for the analysis of biomarker data

Graduate thesis project
LITH-IFM-EX-06/1556–SE

Aron Hennerdal

Supervisors: Kerstin Nilsson, Senior Research Scientist, AstraZeneca R&D Södertälje;
Hugh Salter, Associate Director, AstraZeneca R&D Södertälje
Examiner: Bengt Persson, Professor of Bioinformatics, Linköping University

2006-01-24


Abstract

This thesis describes predictive modelling of biomarker data stemming from patients suffering from multiple sclerosis. Improvements of multivariate analyses of the data are investigated, with the goal of increasing the capability to assign samples to the correct subgroups from the data alone.

The effects of different preceding scalings of the data are investigated, and combinations of multivariate modelling methods and variable selection methods are evaluated. Attempts are made at merging the predictive capabilities of the method combinations through voting procedures. A technique for improving the results of PLS-modelling, called bagging, is evaluated.

The best of the tried methods of multivariate analysis are found to be partial least squares (PLS) and support vector machines (SVM). It is concluded that the scaling has little effect on the prediction performance for most methods. The method combinations have interesting properties: the default variable selections of the multivariate methods are not always the best. Bagging improves performance, but at a high cost. No reasons for drastically changing the work flows of the biomarker data analysis are found, but slight improvements are possible. Further research is needed.


Contents

1 Introduction
  1.1 Biomarkers
  1.2 Multiple sclerosis
  1.3 Proteomics
  1.4 Predictive modelling
  1.5 Thesis objectives
2 Background
  2.1 Multivariate analysis
    2.1.1 Partial least squares
    2.1.2 Random forest
    2.1.3 Neural networks
    2.1.4 Support vector machines
    2.1.5 Variable selection
    2.1.6 Cross validation
    2.1.7 Bagging
  2.2 Assessing prediction
3 Materials and methods
  3.1 Computer hardware and software
  3.2 Datasets
  3.3 Notations and abbreviations
  3.4 Experiments and investigations
    3.4.1 Scaling
    3.4.2 Variable selection
    3.4.3 Unweighted voting by method combinations
    3.4.4 Weighted voting by method combinations
    3.4.5 Bagging
4 Results and discussion
  4.1 Scaling
  4.2 Variable selection
  4.3 Unweighted voting by method combinations
  4.4 Weighted voting by method combinations
  4.5 Bagging
5 Conclusions
6 Recommendations
7 Acknowledgements
A Parameter settings
B Scaling methods
C Combinations of methods
D Unweighted voting
E Weighted voting
F Bagging

List of Figures

1 2D gel electrophoresis
2 PLS notations
3 PLS illustration
4 Simple classification tree
5 Basic neuron
6 Radial basis network
7 SVM illustration
8 Parameter tuning, PLS
9 Determine cutoff value
10 PLS, scaling
11 RF, scaling
12 PNN, scaling
13 GRNN, scaling
14 SVMr, scaling
15 SVMl, scaling
16 Method combinations: group A
17 Method combinations: group B
18 Method combinations: group C
19 Method combinations: group D
20 Difference in class prediction rates
21 Regression value output
22 Genetic algorithms, group A
23 Selected variables, group A
24 Variable selection comparison, vip and ttest
25 Sample predictions, group A
26 Unweighted voting, group A
27 Unweighted voting, group B
28 Unweighted voting, group C
29 Unweighted voting, group D
30 Weighted voting
31 Bagging, group A

List of Tables

1 Subdivision of dataset
2 Abbreviations
3 Model parameters
4 Method combinations, group A
5 Method combinations, group B
6 Method combinations, group C
7 Method combinations, group D

1 Introduction

There is an ongoing project at AstraZeneca with the objective of finding an easily measurable feature in patients with multiple sclerosis (MS), a neurodegenerative and so far incurable autoimmune disease, for easier diagnoses and treatment follow-ups. An approach towards finding such a feature, or biomarker, is to monitor the levels of proteins in certain tissues from said patients and to try to identify characteristics that imply presence or absence of the disease. This is done through experimental techniques within the field of proteomics and subsequent computer based analysis of the data.

This graduate thesis was done as a part of the larger project, with the objective to investigate and improve some of the operating procedures of the project's computer modelling. A more detailed description of the thesis can be found at the end of this chapter. The following sections constitute a background to AstraZeneca's MS biomarker project and its associated techniques.

1.1 Biomarkers

The idea behind biomarkers is to find a relatively simple and easily measurable indicator of a certain biological condition. The target condition can be part of a normal biological process, a pathogenic process or the response to therapy of some kind. A requirement of a biomarker is that it is feasible to measure objectively and that it follows the target condition closely over time (Bielekova & Martin 2004). Any measurement, or set of measurements, can become a biomarker once its interpretation is backed by appropriate studies. Biomarkers are used extensively in modern biology and medicine in a number of different areas, from pregnancy tests to cholesterol level monitoring (LaBaer 2005). The biomarkers discussed in this thesis are biochemical biomarkers, but are referred to simply as "biomarkers" for legibility.

Ideally, a biomarker should reflect a disease in all steps of its progress and also the net effect of any treatment or therapy the patient undergoes. Especially the last requisite is hard to fulfil, since most treatments have both beneficial and adverse effects whose mechanisms and origins can be very different from those of the disease itself. Diseases such as MS, which have a complicated and diverse set of symptoms and mechanisms, make the task of finding an all-encompassing marker all the more difficult. Any biomarker found usually reflects one or a few of the relevant mechanisms of the disease (Bielekova & Martin 2004).


1.2 Multiple sclerosis

Multiple sclerosis is a neurological disease with symptoms including partial paralysis, disturbances in vision and eye movements, sensory anomalies and fatigue. The life expectancy of MS patients is considered to be decreased by about 10 years in comparison to healthy persons, although the disease itself is rarely the direct cause of death. More common is a combination of MS-induced general weakness together with another disease such as pneumonia or a kidney infection. The main feature of MS is lesions, or plaques, in the brain and spinal cord of the patient. These plaques are regions of tissue where the axons of the nerve cells have lost their enclosing myelin sheaths due to an only partially understood auto-immune reaction. The demyelination hampers nerve propagation, with consequences depending on the locations of the plaques (Kesselring 1997).

1.3 Proteomics

The term proteomics has its origin in a combination of the words "protein" and "genome", and has come to signify a complete characterisation of the proteins and their concentrations in a cell or a tissue. Ideally this would encompass every protein in all tissues at each and every point in time, a task somewhat extensive for most research groups. Usually a specific type of sample is chosen and the time dimension is severely restricted (Lawrance, Klopcic & Wasinger 2005).

One technique for characterising and quantifying the protein content of a liquid solution is two-dimensional gel electrophoresis. The proteins are separated on a polyacrylamide gel according to isoelectric point (isoelectric focusing, IEF) in the first dimension and according to molecular weight (sodium dodecyl sulfate electrophoresis, SDS-PAGE) in the second. Since the separation is based on two fully independent parameters, the proteins distribute themselves relatively uniformly across the gel (O'Farrel 1975). The result is a square of gel speckled with spots of different intensities, where each spot theoretically represents one unique protein (see figure 1).

A gel is made for each sample taken and the gels are photographed individually. These images are processed in special computer software in order to extract relevant data and remove noise and other unwanted effects. The first step is to filter the images and normalise the spot intensity across the gels to achieve comparability. Thereafter the spots are modelled using two-dimensional Gaussian functions, which give an idealised model of the spots for each image on which all subsequent computations are based.

Figure 1: Example of the result from two-dimensional gel electrophoresis (O'Farrel 1975)

The intensity of each spot is proportional to the relative level of the corresponding protein and is approximated by the volume under the Gaussian in the idealised image. The next step is to identify the same protein across gels so that any changes or differences become apparent. The software uses several algorithms to achieve this "matching" of spots automatically. A particularly clear and distinct gel with many visible spots is taken as reference or "master image", and the spots on it are matched to spots in other gels. Thereafter, so far unmatched spots present in more than a certain fraction of all other gels, usually around two thirds, are added to the master image and the matching procedure is repeated for those spots. The resulting master image contains a spot for each of most – ideally all – of the available proteins in the samples. These spots and the matched counterparts in the gels are used to create a dataset by inserting their respective intensity values into a table or matrix. In practice, not all spots in the master gel can be assigned a value for each and every gel in the experiment. Spots may be missing altogether from some of the gels, they may not be visible in comparison to other more prominent spots, or they might not have been matched to the master gel by the software algorithms. All these factors contribute to incompleteness in the collected data, which somehow has to be dealt with during later computer processing (Garrels 1989).


1.4 Predictive modelling

In predictive modelling, a mathematical model is built from samples with given values and class properties. The idea is that the model should be able to predict the class of new and unknown samples based on their values alone. It also provides valuable information in the form of patterns in the data and relations between variables. In the MS biomarker project this translates to deciding whether a certain sample comes from an individual suffering from MS or from a healthy person.

To validate a model, a part of the dataset is removed and reserved as a test-set. The remaining data is used to build the mathematical model, and the test-set is thereafter used to evaluate the model's performance.

1.5 Thesis objectives

This thesis aims to improve the methods used in the MS biomarker project and investigate alternative multivariate methods and their usefulness and prediction performance on this kind of data. The work is based on previous results from said project (Nilsson 2005).

In broad outline these topics were investigated:

- Different multivariate methods and their prediction performance on the given data

- Scaling methods to improve prediction performance

- Variable selection and its effects on prediction performance

- Combinations of multivariate methods and variable selection methods

- Improvement of prediction performance using bagging

2 Background

2.1 Multivariate analysis

Multivariate analysis is, as the name implies, the technique of assessing and evaluating many variables together, as opposed to univariate analysis where only one variable (at a time) is considered. When measuring properties of e.g. chemical reactions, features of electronics or, as in this case, proteomics data, a multitude of measurements from many sensors are generated, yielding a multitude of variables with measured values. Using univariate methods, each variable is investigated separately and only the variables containing the most information are kept in a simplified model of the data. Multivariate analyses assess all variables at once, and thus also include information stemming from the variables' influence on each other. Nevertheless, the results from many multivariate techniques can be used to reduce the number of variables necessary to describe the data at hand, in as non-destructive a manner as possible, in order to simplify its interpretation (Eriksson, Johansson, Kettaneh-Wold & Wold 2001).

2.1.1 Partial least squares

Partial least squares (PLS) is a family of multivariate predictive modelling methods with the important characteristic of performing well on data where the number of variables greatly exceeds the number of available samples (Boulesteix 2004), as well as on data that is noisy and collinear (Wold, Sjöström & Eriksson 2001).

There are several algorithms for PLS-regression; among the more renowned are Nonlinear Iterative Partial Least Squares (NIPALS) and SIMPLS. The one given here is the NIPALS-algorithm as described by de Jong (1993).

Given a dataset X and a set of response variables Y, NIPALS iteratively calculates A orthogonal score vectors t_1, ..., t_A for X and A corresponding scores u_1, ..., u_A for Y. The goal is to maximise the covariance between each t_a and u_a while keeping the t-vectors orthogonal to each other (see figure 2). A is usually referred to as the number of components in the PLS-model and is chosen beforehand.

Figure 2: Notations used to describe PLS-modelling (Wold et al. 2001). The labels "Structure descriptors" and "Activity measurements" in the picture are specific to the example in the referred article. "obj.s" refers to objects, or in this case, samples.

The matrices X and Y are mean centred, yielding X_0 and Y_0 respectively. Thereafter w_1, t_1, c_1 and u_1 are calculated by iterating the following computations:

$$\mathbf{w}_1 \propto \mathbf{X}_0^T \mathbf{u}_1 \quad (1)$$

$$\mathbf{t}_1 = \mathbf{X}_0 \mathbf{w}_1 \quad (2)$$

$$\mathbf{c}_1 \propto \mathbf{Y}_0^T \mathbf{t}_1 \quad (3)$$

$$\mathbf{u}_1 = \mathbf{Y}_0 \mathbf{c}_1, \quad (4)$$

where the operator ∝ indicates (here) that the operands should be proportional to each other and that the left operand should be normalised to unit length. One of the columns of Y_0 is used to initiate u_1, and steps (1) to (4) are repeated until the scores t_1 and the weights w_1 converge to stable values according to some measure and tolerance level.

These newly obtained vectors are used to deflate X_0 and Y_0, generating new matrices X_1 and Y_1:

$$\mathbf{X}_1 = \mathbf{X}_0 - \mathbf{t}_1\mathbf{p}_1^T \quad (5)$$

$$\mathbf{Y}_1 = \mathbf{Y}_0 - \mathbf{t}_1\mathbf{c}_1^T, \quad (6)$$

where

$$\mathbf{p}_1 = \frac{\mathbf{X}_0^T\mathbf{t}_1}{\mathbf{t}_1^T\mathbf{t}_1}. \quad (7)$$

p_1 is a vector of loadings of the scores in t_1 on the variables in X_0, and thus describes how strongly the original X-variables are related to the first score vector t_1 (de Jong 1993).

The deflation of Y is not strictly necessary; the result will be the same with or without it (Wold et al. 2001).

Steps (1) to (7) are repeated, with an increment of each subscript by one, A times (de Jong 1993). An illustration of the PLS-components and other features can be found in figure 3.


Figure 3: Illustration of scores and loadings in PLS-modelling (Wold et al. 2001).

The resulting model can be described concisely in matrix notation as

$$\mathbf{T} = \mathbf{X}\mathbf{W} \quad (8)$$

$$\mathbf{X} = \mathbf{T}\mathbf{P}^T + \mathbf{E} \quad (9)$$

$$\mathbf{Y} = \mathbf{T}\mathbf{C}^T + \mathbf{F}, \quad (10)$$

where T, W, P and C are the matrices that have the A vectors with corresponding lower-case designations as columns. E and F are the residuals, i.e. the part of the data that is not explained by the model. The classes of a set of new samples X* can now be predicted using the weights of the calculated model. A combination of (8) and (10) yields

$$\mathbf{Y} = \mathbf{X}\mathbf{W}\mathbf{C}^T + \mathbf{F}.$$

The new class predictions Y* can thus be calculated from X* and the model's weight matrices through

$$\mathbf{Y}^* = \mathbf{X}^*\mathbf{W}\mathbf{C}^T$$

(Wold et al. 2001).
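To make the algorithm concrete, here is a minimal MATLAB sketch of the NIPALS procedure in equations (1)-(7), with the prediction step from the combination of (8) and (10). It was written for this text and is not the PLS_Toolbox routine actually used in the thesis; it assumes X and Y are already mean centred, and the convergence tolerance and iteration cap are illustrative choices.

```matlab
function [W, C, P] = pls_nipals(X, Y, A)
% Minimal NIPALS sketch following equations (1)-(7).
% X (n x p) and Y (n x m) are assumed mean centred; A = no. of components.
[n, p] = size(X);  m = size(Y, 2);
W = zeros(p, A);  C = zeros(m, A);  P = zeros(p, A);
for a = 1:A
    u = Y(:, 1);                            % initiate u with a column of Y
    t = zeros(n, 1);
    for iter = 1:500                        % iterate (1)-(4) until stable
        w = X' * u;  w = w / norm(w);       % (1) w proportional to X'u
        t_new = X * w;                      % (2) X-scores
        c = Y' * t_new;  c = c / norm(c);   % (3) c proportional to Y't
        u = Y * c;                          % (4) Y-scores
        if norm(t_new - t) < 1e-10 * norm(t_new), break, end
        t = t_new;
    end
    t = t_new;
    pa = (X' * t) / (t' * t);               % (7) loadings
    X = X - t * pa';                        % (5) deflate X
    Y = Y - t * c';                         % (6) deflate Y (optional)
    W(:, a) = w;  C(:, a) = c;  P(:, a) = pa;
end
end

% New, equally centred samples are then predicted, as in the text above, by
% Ystar = Xstar * W * C';
```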

2.1.2 Random forest

Binary trees can be used for regression and classification (see figure 4). Each node in the tree represents a division, or split, of the samples in the dataset according to some criterion based on the data itself. A commonly used criterion is to choose one or a few variables and investigate where in their range to put a split. The split is chosen so that the sums of squares of distances to the mean values of the respective groups are minimised. These divisions are repeated until the leaf nodes contain fewer samples than some value chosen in advance. The resulting tree is then pruned, minimising a cost-complexity function, in order to avoid over-fitting. New samples are classified by following the criteria of the tree's nodes down from the root to a leaf, where the predicted class can be determined (Hastie, Tibshirani & Friedman 2001).

Figure 4: Example of a simple classification tree with two variables (x and y) and three classes (1, 2 and 3). New samples are classified by beginning at the root (top of the tree) and following branches depending on sample properties down to a leaf where the predicted class can be read.

Random forest is based on bootstrap sets, which in general are sets of samples taken randomly with replacement from the original dataset. A prediction model is built for each of the bootstrap sets, and the models are used in different manners to examine the behaviour of the original dataset (Hastie et al. 2001).

Random forest works by creating a large number of classification trees from the input data and letting them vote for each sample to give a class prediction. Each tree is constructed from a bootstrap set of the training data and is not pruned in any way. The node divisions of each tree are based on m randomly chosen variables from the training-set. This m is the only user-defined parameter of the random forest models. New samples are classified by passing them down all trees, noting the individual predictions and finally assigning the majority result. The performance of the random forest depends on the prediction accuracy of the individual trees and the correlation between the trees; increased correlation gives a larger prediction error. In addition, random forest can provide an estimation of the generalisation error during the build-up of the trees using out-of-bag estimates. This works by testing each tree with the roughly one third of the training samples that are not included in its bootstrap sample – the out-of-bag samples. This gives an overestimate of the prediction error during the construction of the forest (Breiman 2001).
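The bootstrap-and-vote principle can be illustrated with a deliberately simplified MATLAB sketch in which each "tree" is reduced to a decision stump (a single split). This toy version, written for this text, is not Breiman's full algorithm or the RandomForest toolbox used in the thesis, but it shows the roles of the bootstrap sets, the m random variables and the majority vote.

```matlab
function yhat = stump_forest(X, y, Xtest, ntrees, m)
% Toy "random forest" of decision stumps. X: n x p training data,
% y: n x 1 class labels coded -1/+1, m: no. of variables tried per tree.
[n, p] = size(X);
votes = zeros(size(Xtest, 1), ntrees);
for k = 1:ntrees
    idx = ceil(n * rand(n, 1));              % bootstrap set, with replacement
    Xb = X(idx, :);  yb = y(idx);
    vars = randperm(p);  vars = vars(1:m);   % m randomly chosen variables
    best_err = inf;
    for v = vars                             % best single split on these vars
        for thr = unique(Xb(:, v))'
            for sgn = [-1 1]
                err = mean(sgn * (2 * (Xb(:, v) > thr) - 1) ~= yb);
                if err < best_err
                    best_err = err;  bv = v;  bthr = thr;  bsgn = sgn;
                end
            end
        end
    end
    votes(:, k) = bsgn * (2 * (Xtest(:, bv) > bthr) - 1);
end
yhat = sign(sum(votes, 2));                  % majority vote over all stumps
yhat(yhat == 0) = 1;                         % arbitrary tie-break
end
```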

2.1.3 Neural networks

Figure 5: A neuron, the basic component of a neural network (Demuth & Beale 1998).

Neural networks are loosely based on the composition and function of the human brain. A common form is the feed forward network, which is built up from layers of neurons (figure 5). Each neuron takes multiple inputs and produces a single output based on the combination of inputs and the neuron's internal weight vector. The inputs for the first layer are the sample vectors of the dataset, and the inputs for the subsequent layers are the previous layer's outputs. The output of the final layer can be interpreted to give a prediction of the class of the input sample. Each neuron is usually made to comprise a transfer function to keep the output inside a specified range. When using the logsig-function,

$$f(n) = \frac{1}{1 + e^{-n}},$$

the output is in the range [0, 1] for any input n.

The network is trained to map the samples of a training-set to their corresponding classes. This is done by putting the training samples through the network and comparing the output with the real classes. The weights of the network are updated according to some algorithm. A common sort of update is termed back-propagation and works by updating the weights in the opposite direction of the gradient of the prediction error E, i.e. the direction where the error decreases the fastest:

$$\mathbf{w}_{new} = \mathbf{w} - \alpha \cdot \nabla E, \quad \alpha \in [0, 1],$$

where w is the vector of weights and α is the learning rate.

Radial basis networks constitute a special form of feed forward neural networks. They consist of two layers, an initial radial basis layer and a second linear layer (figure 6).

Figure 6: The structure of a radial basis network (Demuth & Beale 1998).

The radial basis layer utilises a Gaussian radial basis function centred at the origin as transfer function, with the distance between the input vector p and the weight vector w of the neuron, together with a bias b, as argument:

$$f(\mathbf{p}) = e^{-b \cdot \|\mathbf{w} - \mathbf{p}\|^2}$$

The effect of this is that the neuron whose weight vector is closest to the input vector gives the largest output value. The maximum output of 1 occurs when p is equal to w. The bias is set according to the user-specified parameter spread, which is the distance between input and weight vector that yields 0.5 as output. The linear layer combines the outputs of the radial basis layer to form a common output. Radial basis networks are faster to construct than traditional feed-forward networks, but may need more neurons.

Two different kinds of radial basis networks are distinguished, depending on the transfer function of the final layer of neurons in the network. In a generalised regression neural network (GRNN), the transfer function used is f(n) = n, which simply passes on the raw floating point numbers from the neuron. To be interpreted as a class prediction, this must be evaluated with the aid of a threshold, or cutoff value.

In a probabilistic neural network (PNN), however, the transfer function returns a vector of equal size to the input vector, but with a one in place of its largest element and zeros elsewhere. This makes for a kind of competition between the neurons of the previous layer, which provided the input, where the strongest input is the sole contributor to the output. The output is thus limited to the values 1 and 0 respectively, and can be interpreted as a class prediction (Demuth & Beale 1998).
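As an illustration of the radial basis mechanics, the forward pass of a GRNN-style network fits in a few lines of MATLAB: the training samples act as the weight vectors of the radial basis layer, and the linear layer forms an activation-weighted average of the training targets. This sketch, written for this text, follows the bias definition above (activation 0.5 at distance spread) rather than the exact Neural Networks Toolbox implementation.

```matlab
function yhat = grnn_predict(Xtrain, ytrain, Xtest, spread)
% Forward pass of a GRNN-style radial basis network.
ntr = size(Xtrain, 1);
b = log(2) / spread^2;          % exp(-b*spread^2) = 0.5, as defined above
yhat = zeros(size(Xtest, 1), 1);
for i = 1:size(Xtest, 1)
    d2 = sum((Xtrain - repmat(Xtest(i, :), ntr, 1)).^2, 2);
    a = exp(-b * d2);                        % radial basis layer activations
    yhat(i) = sum(a .* ytrain) / sum(a);     % linear layer: weighted average
end
end
```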

2.1.4 Support vector machines

A support vector machine (SVM) tries to fit a high-dimensional hyper-plane so as to separate samples from two classes with the largest possible margin. The margin is in this case the orthogonal distance between two hyper-planes parallel to the separating hyper-plane, going through the samples closest to the separating hyper-plane (see figure 7).

Figure 7: Illustration of the maximum margin, hyper-planes and support vectors in a two-dimensional classification problem (Amendolia et al. 2003)

With sample vectors x_i, their corresponding class values y_i and the SVM's weight vector w and bias b, this can be described by

$$y_i(\mathbf{x}_i^T\mathbf{w} + b) - 1 \geq 0, \quad y_i = \pm 1, \quad \forall i. \quad (11)$$

The samples for which equality holds in (11) are called support vectors and are the points that define the SVM. Other points outside the hyper-plane boundaries can be moved around arbitrarily (as long as they don't cross the boundaries) without changing the SVM.

It can be shown that maximising the SVM's margin is the same as minimising $\|\mathbf{w}\|^2$ while keeping the inequality in (11) valid. The minimisation and the constraint in (11) can be combined into an optimisation problem using constraints on introduced Lagrange multipliers α_i. This eventually yields the task of minimising

$$L_P = \frac{1}{2}\|\mathbf{w}\|^2 - \sum_{i=1}^{l} \alpha_i y_i(\mathbf{x}_i^T\mathbf{w} + b) + \sum_{i=1}^{l} \alpha_i \quad (12)$$

with respect to w and b, while keeping the derivatives of L_P with respect to the α_i equal to 0 and α_i ≥ 0. This is equivalent to maximising L_P under another set of constraints: the gradient of L_P with respect to w and b is kept equal to 0, and α_i ≥ 0. This gives that

$$\mathbf{w} = \sum_i \alpha_i y_i \mathbf{x}_i \quad \text{and} \quad \sum_i \alpha_i y_i = 0.$$

These solutions inserted into (12) produce

$$L_D = \sum_i \alpha_i - \frac{1}{2}\sum_{i,j} \alpha_i \alpha_j y_i y_j \mathbf{x}_i^T\mathbf{x}_j. \quad (13)$$

The label has changed in (13) from P to D to distinguish it from (12). Minimising L_P under the first set of constraints corresponds to maximising L_D under the second.

So far it has been assumed that the two classes are separable. When this is not the case, the SVM constraints have to be relaxed somewhat, leading to the soft margin SVM. The harsh demands of (11) are lessened by introducing slack variables ξ_i, which gives

$$y_i(\mathbf{x}_i^T\mathbf{w} + b) - 1 + \xi_i \geq 0, \quad y_i = \pm 1, \quad \xi_i \geq 0, \quad \forall i. \quad (14)$$

The optimisation problem becomes

$$L_D = \sum_i \alpha_i - \frac{1}{2}\sum_{i,j} \alpha_i \alpha_j y_i y_j \mathbf{x}_i^T\mathbf{x}_j \quad (15)$$

under the constraints

$$0 \leq \alpha_i \leq C \quad \text{and} \quad \sum_i \alpha_i y_i = 0,$$

where C is a constant parameter to be determined depending on the data being modelled.

By introducing so-called kernel functions to the SVM, it can be made to perform well also on non-linear problems, i.e. where a curved separating hyper-plane is needed. This can be done by mapping the data into a high-dimensional space, H, through a mapping Φ:

$$\Phi: \mathbb{R}^d \mapsto \mathcal{H}.$$

This would however be extremely computationally intensive for large datasets. It can be avoided by noticing that the function to be optimised in (15) only contains the input data as dot-products between input vectors, $\mathbf{x}_i^T\mathbf{x}_j$. By replacing these dot-products by a kernel function K such that $K(\mathbf{x}_i, \mathbf{x}_j) = \Phi(\mathbf{x}_i) \cdot \Phi(\mathbf{x}_j)$, only K needs to be used in the construction of the SVM and all laborious explicit mappings are circumvented. A family of such kernel functions are the radial basis functions (see also radial basis networks in section 2.1.3),

$$K(\mathbf{x}_i, \mathbf{x}_j) = e^{-\gamma\|\mathbf{x}_i - \mathbf{x}_j\|^2},$$

where γ is a constant. This type of kernel in fact implicitly maps the data into a space of infinite dimension (Burges 1998).
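In code the kernel trick is just a substitution: wherever a dot-product between samples would appear, the kernel value is used instead. The MATLAB sketch below evaluates the decision function of a trained soft margin SVM with a radial basis kernel; the support vectors, their multipliers alpha and the bias b are assumed to come from a solver of (15), such as the OSU-SVM toolbox of section 3.1 (training itself is omitted here).

```matlab
function f = svm_decision(Xsv, ysv, alpha, b, Xtest, gamma)
% Decision values f(x) = sum_i alpha_i*y_i*K(x_i, x) + b for a trained
% SVM with radial basis kernel; the predicted class is sign(f).
nsv = size(Xsv, 1);
f = zeros(size(Xtest, 1), 1);
for i = 1:size(Xtest, 1)
    d2 = sum((Xsv - repmat(Xtest(i, :), nsv, 1)).^2, 2);
    K = exp(-gamma * d2);              % kernel replaces the dot-products
    f(i) = sum(alpha .* ysv .* K) + b;
end
end
```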

2.1.5 Variable selection

One of the goals of the MS biomarker project is to decrease the number of necessary protein intensity measurements – decrease the number of variables of the model(s) – while maintaining high prediction performance. To achieve this some sort of measure of variable importance for prediction is needed.

PLS provides two such measures. The PLS-regression coefficients, B, are calculated from the PLS-model's weights (see section 2.1.1) through the relation

$$\mathbf{B} = \mathbf{W}\mathbf{C}^T.$$

B reflects variable importance in that it has large values for variables that are influential in the modelling of Y and smaller values for less influential ones. This, however, does not take into account a variable's importance for the modelling of X. The other PLS-based method, on the other hand, does just that: variable importance for projection (VIP) is a weighted sum of squares of the PLS-weights w_i, and thus a summary of the importance of a variable for the modelling of both X and Y (Wold et al. 2001).

Random forest has its own importance measure based on the out-of-bag estimate (see section 2.1.2) of the prediction error.

The commonly used t-test of statistical analysis may be used to estimate variable importance. For each variable, the hypothesis that the mean values of the two groups of samples are equal is tested. The mean values are considered to come from normal distributions with unknown but equal standard deviations. The variables can be ranked by the size of their p-values, i.e. the probability of observing the obtained value by chance if the hypothesis is true (Blom 1989). Note that this is a univariate technique.
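As a sketch of how such a ranking can be computed in base MATLAB (pooled standard deviation under the equal-variance assumption described above): since the p-value decreases monotonically with |t| for fixed group sizes, sorting by |t| gives the same variable order as sorting by p-value.

```matlab
function order = ttest_rank(X1, X2)
% Rank variables by a two-sample t-test with pooled variance.
% X1, X2: samples (rows) by variables (columns) for the two groups.
n1 = size(X1, 1);  n2 = size(X2, 1);
v1 = var(X1, 0, 1);  v2 = var(X2, 0, 1);
sp = sqrt(((n1 - 1) * v1 + (n2 - 1) * v2) / (n1 + n2 - 2)); % pooled std
t = (mean(X1, 1) - mean(X2, 1)) ./ (sp * sqrt(1/n1 + 1/n2));
[ignore, order] = sort(-abs(t));   % largest |t| (smallest p-value) first
end
```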

Genetic algorithms provide a way to choose variables by importance through a form of simulated evolution. A population of variable subsets is formed, often randomly, where each individual is a vector with as many elements as there are variables in the original set. Each element is either a '1', indicating that the variable in question is included in the subset, or a '0', meaning that it is excluded. These subsets are then evaluated by some method, and a fraction of the vectors representing the subsets with the best performance are kept; the rest are discarded. The chosen vectors are allowed to breed, i.e. they are copied, so that the original number of vectors is restored. The vectors are mixed through a scheme of cross-overs: an element in a vector is chosen randomly and one of the resulting halves is exchanged with the corresponding part of another vector, preserving the length of the vectors. In order to maintain diversity in the population, a form of mutation is usually also included: a small number of randomly chosen elements in the vectors have their values inverted. The steps of evaluating, breeding, crossing over and mutating are repeated until some predefined condition is reached (Lucasius & Kateman 1993).
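A minimal MATLAB sketch of such a genetic algorithm follows. The fitness function evaluate (for instance the cross-validated prediction rate of a model restricted to the subset) is a user-supplied placeholder, and the population size, selection fraction and mutation rate are illustrative choices, not the settings used in this work.

```matlab
function best = ga_select(nvar, npop, ngen, evaluate)
% Genetic algorithm for variable selection. Each individual is a binary
% vector: 1 = variable included, 0 = excluded. 'evaluate' is a function
% handle returning a fitness value for a logical row vector.
pop = rand(npop, nvar) > 0.5;                % random initial population
for g = 1:ngen
    fit = zeros(npop, 1);
    for i = 1:npop, fit(i) = evaluate(pop(i, :)); end
    [ignore, idx] = sort(-fit);              % best individuals first
    keep = pop(idx(1:floor(npop / 2)), :);   % keep the better half
    nkeep = size(keep, 1);
    child = keep(ceil(nkeep * rand(npop - nkeep, 1)), :);  % breed: copy
    for i = 1:size(child, 1)                 % cross-over with a random mate
        mate = keep(ceil(nkeep * rand), :);
        cut = ceil(nvar * rand);             % random cross-over point
        child(i, cut:end) = mate(cut:end);
    end
    flip = rand(size(child)) < 0.01;         % mutation: invert ~1% of bits
    child(flip) = ~child(flip);
    pop = [keep; child];
end
fit = zeros(npop, 1);
for i = 1:npop, fit(i) = evaluate(pop(i, :)); end
[ignore, ibest] = max(fit);
best = pop(ibest, :);                        % best variable subset found
end
```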

2.1.6 Cross validation

A widely used method to estimate prediction performance for data with few available samples, and to avoid over-fitting, is cross validation. When data is plentiful, a test-set of samples is separated from the data and a model is built from the remaining data. The model's classification performance is thereafter evaluated with the test-set. In the case of a sparse supply of data, however, this may not be practical or possible. In cross validation the same dataset is used for both learning and testing. Modelling and prediction are done multiple times, with different parts of the data as training-set and test-set each time. For each cross validation round, each sample in the dataset is assigned to either the training-set or the test-set, whereafter a model is built and evaluated. The overall performance is estimated from the mean prediction rate over all cross validation rounds. Additionally, this gives a better estimate of the prediction error than just one test would have (Hastie et al. 2001).

Figure 8: Parameter tuning; prediction performance for different numbers of variables and different numbers of PLS-components.

Cross validation was used to determine parameter settings for the multivariate methods and to study the prediction performance for subsets of variables. An example of this is the plot of prediction performance versus number of PLS-components shown in figure 8. Other parameters set through these kinds of investigations are the m-value of random forest, the spread-values of the radial basis networks and the γ- and C-values of the SVMs.
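Schematically, the cross validation used throughout this work can be written as the following MATLAB sketch; train_fcn and predict_fcn are placeholders for any of the model-building and prediction routines above, and the two-thirds/one-third split is an illustrative choice, not necessarily the ratio used in the experiments.

```matlab
function rate = cross_validate(X, y, nrounds, train_fcn, predict_fcn)
% Repeated random division into training-set and test-set; the overall
% performance is the mean prediction rate over all CV-rounds.
n = size(X, 1);
rates = zeros(nrounds, 1);
for r = 1:nrounds
    perm = randperm(n);
    ntr = round(2 * n / 3);                  % e.g. two thirds for training
    tr = perm(1:ntr);  te = perm(ntr+1:end);
    model = train_fcn(X(tr, :), y(tr));
    yhat = predict_fcn(model, X(te, :));
    rates(r) = mean(yhat == y(te));          % prediction rate this round
end
rate = mean(rates);
end
```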

2.1.7 Bagging

The term bagging is an acronym for "bootstrap aggregating" and is a method for improving prediction performance when the data contains few samples. In bagging, models built on bootstrap sets (see section 2.1.2) are evaluated individually on a test-set, and a common estimate of the prediction performance is computed by letting the bootstrap models "vote" for class predictions. Each sample in the original dataset is assigned a class by majority rule from the bootstrap predictions. It can be shown that bagging improves performance if the procedure of constructing the predictive models is unstable, that is, small changes in input may yield large changes in the model (Breiman 1996).
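In code, bagging is a thin layer around any model-building routine. The sketch below reuses the hypothetical train_fcn/predict_fcn placeholders from the cross validation example and lets the bootstrap models vote by majority, with classes coded -1/+1.

```matlab
function yhat = bag_predict(X, y, Xtest, nboot, train_fcn, predict_fcn)
% Bagging: build one model per bootstrap set and let the models vote.
n = size(X, 1);
votes = zeros(size(Xtest, 1), nboot);
for k = 1:nboot
    idx = ceil(n * rand(n, 1));        % bootstrap set, with replacement
    model = train_fcn(X(idx, :), y(idx));
    votes(:, k) = predict_fcn(model, Xtest);
end
yhat = sign(sum(votes, 2));            % majority rule over bootstrap models
yhat(yhat == 0) = 1;                   % arbitrary tie-break
end
```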

2.2 Assessing prediction

The terms "prediction performance" and "prediction rate" are used interchangeably in this work, and refer to the relative number of correctly predicted samples, i.e.

$$\text{p.perf.} = \frac{\text{no. of correctly predicted samples}}{\text{total no. of samples}}.$$

As a measure of the prediction performance for a specific class, the ratio between the number of correct predictions for the class and the total number of instances of the class is used, i.e.

$$\text{p.perf.}_{\text{class } x} = \frac{\text{no. of correctly predicted samples of class } x}{\text{total no. of samples of class } x}.$$

When the classes in question are of unequal sample size, many multivariate methods tend to neglect the smaller class in favour of the larger. This can be compensated for by choosing a somewhat biased cutoff value for the output of the method. An example is given in figure 9. Here the method is PLS, which gives values in the approximate range [0, 1]. An output close to 1 for a test sample would give one class as prediction, while an output closer to 0 would give the other class. The exact value that is used as a threshold for the output is the cutoff value. In figure 9, the prediction performance for class 'A' is plotted along with the prediction performance of the control group and the total prediction performance. Note that for cutoff values closer to 0, a large fraction of the test samples are predicted to be of class 'A', hence the high prediction rate for class 'A' and the correspondingly low rate for the control class. Where the three curves meet, the prediction rates for the specific classes equal each other as well as the total prediction rate, and no class is neglected or favoured. In this work the cutoff value is, where possible, chosen so that both classes are equally well predicted, which, as can be seen in the figure, may give a somewhat lower, but altogether more informative, total prediction rate.
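The cutoff selection can be sketched as a simple sweep: compute the two class-wise prediction rates for a grid of candidate cutoff values and keep the value where the rates are closest to equal (where the curves of figure 9 meet). A minimal MATLAB illustration, with classes coded 0/1 as for the PLS output described above:

```matlab
function cut = equalising_cutoff(yout, y)
% Choose the cutoff for a regression-valued output (approx. range [0,1])
% so that both classes are as equally well predicted as possible.
% yout: model outputs for the test samples; y: true classes (0 or 1).
cands = 0:0.01:1;
gap = zeros(size(cands));
for i = 1:length(cands)
    pred = yout > cands(i);
    r1 = mean(pred(y == 1));           % prediction rate for class 1
    r0 = mean(~pred(y == 0));          % prediction rate for class 0
    gap(i) = abs(r1 - r0);             % distance between the class rates
end
[ignore, ibest] = min(gap);
cut = cands(ibest);
end
```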


Figure 9: The cutoff value of the regression value for class prediction is chosen to equalise the different prediction rates. The plot stems from PLS with 5 components and variable selection through vip. The mean prediction rates over 700 CV-rounds are used.

3 Materials and methods

3.1 Computer hardware and software

• A stationary PC with a 2.8 GHz CPU and 1.2 GB RAM.

• MATLAB 6.5 with the Statistics Toolbox and the Neural Networks Toolbox, from The MathWorks Inc., USA.

• PLS_Toolbox 3.0, a toolbox for MATLAB from Eigenvector Research Inc., USA.

• OSU-SVM, a freely distributed toolbox for MATLAB developed at Ohio State University, USA.

• RandomForest, a freely distributed toolbox for MATLAB written by Leo Breiman and Adele Cutler.

• Spotfire DecisionSite, a visualisation tool from Spotfire AB, Sweden.

3.2 Datasets

The data on which this thesis is based were collected at Karolinska Institutet in one of their studies. The set consists of 164 samples including technical controls, samples from MS-patients and samples from a control group of 36 persons. The samples were processed as described in section 1.3 and the extracted data was collected into a matrix-structure in MATLAB.

Elements without value in the data matrix were given the intensity of the background, in this case 2. Thereafter the matrix was log-transformed and adjusted for batch variations (Nilsson 2005).

This set was divided into a number of patient subgroups as seen in table 1. Again in accordance with the studies of Nilsson (2005), only variables with missing data in less than 60% of the samples were used.

The scaling method analysis (see section 3.4.1) was conducted on sample group A (the whole set), while the method combinations (see section 3.4.2) were run for each subgroup in turn. The control group of 36 persons was the same for all analyses.

3.3 Notations and abbreviations

Certain abbreviations are used in the following sections as well as in the plots in the appendices. These are listed in table 2.

Subgroup   Samples   Variables
A          62        911
B          17        933
C          36        920
D          9         944

Table 1: Division of the dataset into subgroups

Abb.    Meaning                                            Described in
PLS     partial least squares                              2.1.1
RF      random forest                                      2.1.2
PNN     probabilistic neural network                       2.1.3
GRNN    generalised regression neural network              2.1.3
SVMr    support vector machine with radial basis kernel    2.1.4
SVMl    support vector machine with linear kernel          2.1.4
vip     variable importance for projection                 2.1.5
reg     regression coefficient vector (b)                  2.1.5
imp     random forest importance measure                   2.1.5
ttest   common t-test                                      2.1.5
ga      genetic algorithms                                 2.1.5

Table 2: Abbreviations


3.4 Experiments and investigations

3.4.1 Scaling

Five different methods for scaling of the variables before modelling were tried in order to maximise the predictive capabilities of the models.

• Mean centring: the mean value of each variable is made equal to 0; implemented as the MATLAB-function mncn in the PLS_Toolbox.

• Variance normalisation: the variance of each variable is made equal to 1; implemented as a MATLAB-function by Kerstin Nilsson.

• Auto scaling: both of the preceding scalings; implemented as the MATLAB-function auto in the PLS_Toolbox.

• Pareto scaling: each variable's mean is made equal to 0 and its variance is set to its standard deviation (the square root of the variance); implemented as a MATLAB-function by Kerstin Nilsson.

• Mnmx-scaling: the data is scaled so that the maximum value is 1 and the minimum is -1; implemented as the MATLAB-functions premnmx and postmnmx in the Neural Networks Toolbox.

PLS and RF were evaluated for the first four scaling methods. The mnmx-scaling was added for the evaluation of the neural networks and the SVMs. An investigation without any preceding scaling was conducted as reference for all six modelling methods.

The analysis was conducted through repeated division of the dataset's samples into train- and test-sets via cross validation (see sections 1.4 and 2.1.6), followed by evaluation of the predictions of the test-set samples. The variables of the test-sets were scaled using the same parameters that were used for the scaling of the training-set.
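The listed scalings are straightforward to express; the following MATLAB sketch, written for this text rather than taken from the toolboxes mentioned above, shows four of them and applies the parameters estimated on the training-set to the test-set, as described in the previous paragraph.

```matlab
function [Xtr, Xte] = scale_data(Xtr, Xte, method)
% Column-wise scaling; parameters are estimated on the training-set and
% the test-set is scaled with those same parameters.
ntr = size(Xtr, 1);  nte = size(Xte, 1);
mu = mean(Xtr, 1);  sd = std(Xtr, 0, 1);
mn = min(Xtr, [], 1);  mx = max(Xtr, [], 1);
switch method
    case 'mean'    % mean centring: each variable gets mean 0
        Xtr = Xtr - repmat(mu, ntr, 1);
        Xte = Xte - repmat(mu, nte, 1);
    case 'auto'    % auto scaling: mean 0 and variance 1
        Xtr = (Xtr - repmat(mu, ntr, 1)) ./ repmat(sd, ntr, 1);
        Xte = (Xte - repmat(mu, nte, 1)) ./ repmat(sd, nte, 1);
    case 'pareto'  % pareto scaling: variance becomes the standard deviation
        Xtr = (Xtr - repmat(mu, ntr, 1)) ./ repmat(sqrt(sd), ntr, 1);
        Xte = (Xte - repmat(mu, nte, 1)) ./ repmat(sqrt(sd), nte, 1);
    case 'mnmx'    % mnmx-scaling: training data mapped into [-1, 1]
        Xtr = 2 * (Xtr - repmat(mn, ntr, 1)) ./ repmat(mx - mn, ntr, 1) - 1;
        Xte = 2 * (Xte - repmat(mn, nte, 1)) ./ repmat(mx - mn, nte, 1) - 1;
end
end
```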

3.4.2 Variable selection

The multivariate prediction methods of sections 2.1.1-2.1.4 were combined with the variable selection methods of section 2.1.5. The combinations were evaluated and compared using the same training-sets and test-sets through 100 CV-rounds. This was done for all four sample subgroups, although genetic algorithms were only included in the analysis of group A due to their long computation time, resulting in 30 combinations for the prediction of group A and 24 each for B, C and D.


3.4.3 Unweighted voting by method combinations

Each of the previously described combinations of a prediction method and a variable selection method was allowed to cast a "vote" for the class prediction of each sample. A simple majority decision was made for each test-sample to form a combined prediction. Genetic algorithms were excluded as variable selection method, again due to long computation times.

Additionally, a similar voting including only the best performing prediction methods, i.e. PLS, SVMr and SVMl, combined with each variable selection method except genetic algorithms, was performed.
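The majority decision itself is a one-liner once the individual predictions are collected; a MATLAB sketch with classes coded -1/+1, where the matrix P (an illustrative construction) holds one column of predictions per method combination:

```matlab
function yvote = unweighted_vote(P)
% P: n_test x n_combinations matrix; column j holds the -1/+1 class
% predictions of method combination j for each test sample.
yvote = sign(sum(P, 2));     % simple majority decision per sample
yvote(yvote == 0) = 1;       % break ties, e.g. towards the larger class
end
```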

3.4.4 Weighted voting by method combinations

The same procedure as above was performed with the exception that the majority vote was replaced by a weighted vote. The output predictions from the method-combinations were fed to a feed forward neural network (see section 2.1.3). The data was divided into three sets each CV-round: half of the samples were used to train the models, a quarter was put through the models and the results were used as input for the network, and the last quarter was used as test-set to validate the neural network.

3.4.5 Bagging

PLS combined with each of the variable selection methods vip, reg and ttest, enhanced by the bagging technique (see section 2.1.7), was compared with the same combinations without bagging. Since bagging is very time consuming, the experiment was only conducted for group A versus the control group.

4 Results and discussion

The parameter settings for the multivariate models used in the computations are found in table 3 in appendix A.

4.1 Scaling

Plots of prediction performance for group A versus the control group are collected in appendix B.

The scaling procedure does not seem to affect the performance in any significant way for PLS and RF, as seen in figures 10 and 11 respectively. Noteworthy is, however, that without any preceding mean centring, PLS will need an extra component, since the first component will "point out" the data from the origin and not provide any useful information. In figure 10, four components were used with the variance scaling (uvn), three components when no scaling was used, and two components for the mean-, auto- and pareto-scalings. Variance scaling is often used with PLS in order to equalise the variables when no prior information on their relative importance is available. PLS will otherwise favour the variables with high variance, and the model will be based on them to a greater extent. These considerations made auto-scaling the method of choice for further PLS-modelling, given the lack of unambiguous empirical results. Auto-scaling was also used with RF, for simplicity and easier comparisons.

PNN and GRNN worked best with the mnmx-scaling, as seen in figures 12 and 13. GRNN becomes a bit unstable for the other scalings with the setting of the spread-constant used to produce the result in the figure. Other settings give stable results for all scalings, but slightly worse prediction rates. The SVMs exhibit reasonably uniform results regardless of scaling method as seen in figures 14 and 15. Auto-scaling was chosen for subsequent analyses in compliance with the choice for PLS.

4.2 Variable selection

Tables 4-7 as well as figures 16-19 in appendix C convey the results of the experiment.

The results are fairly heterogeneous, but some conclusions may be drawn. The best performing methods are PLS and the SVMs, which combined with vip and ttest as variable selection methods constitute the overall best combinations.

The choice of prediction method largely determines the results, while the variable selection methods are of minor importance. The curves cluster according to prediction method (colours in figures 16-19) and have shapes characteristic of the selection method (symbols).

It is interesting to note that RF performs better combined with practically every variable selection method other than its "own" variable importance measure, imp. This is true for all four sample subgroups. In general, imp seems to work less well than the others and gives rise to some peculiar shapes in the plots.

The high prediction rates of PNN and GRNN for group D in table 7 and figure 19 should be taken with a pinch of salt. As the difference in size between the two target classes grows larger, the multivariate models tend to classify a bigger and bigger part of the samples as the larger class, as described in section 2.2. In the case of group D, where the control group is three times the size of D, this is especially evident: predicting every sample as a control sample would yield a prediction performance of 0.75, which may seem fairly high but is not very useful. This is compensated for by setting the cutoff value of the model output a bit biased towards the smaller class. Since PNN produces binary output (see section 2.1.3), this compensation is not possible and the smaller class tends to be neglected, as illustrated in figure 20. GRNN, on the other hand, should give output values in the range [0, 1], but the values tend to aggregate around 0 and 1, making its output as good as binary. This effect, with PLS-ttest as comparison, is shown in figure 21. The shown plot is strictly speaking a histogram, but for visibility the points are interconnected with lines instead of being shown as bars.

The result from the genetic algorithm selection method is shown in a separate plot (figure 22). Due to the nature of the method, data for only a few variable subsets were obtained (all, 200, 100 and 75 variables). As seen in the plot, the performance is fairly good in comparison with other combinations for all multivariate models. Its practicality is however severely limited owing to the long computation times necessary.

The variable selection methods choose variables quite similarly, with the exception of imp, as seen in the heat map in figure 23. The horizontal lines representing variables have a redder nuance the more times the variable has been selected; a green line indicates a variable that is left out in all CV-rounds. The variables are sorted according to vip's selections. The data was collected during 100 CV-rounds, and 100 variables were selected each time. The same was done for the other three sample subgroups, with similar results. An interesting feature is that vip and ttest select almost exactly the same variables despite having different origins: the vip selection stems from a multivariate PLS-model, while the ttest selection is based purely on univariate characteristics of the data. The similarity in variable selection is shown in figure 24. In this scatter plot, each square represents a variable. The square's position along the horizontal axis is proportional to the ratio at which it was chosen by ttest, and its position along the vertical axis gives the ratio according to vip. As can be seen in the figure, the points line up nicely along the diagonal, indicating similar values for both selection methods.

A sample-wise visualisation of the prediction performance of the method combinations was constructed and can be seen in figure 25. Each horizontal row of the heat map represents a specific sample and is divided into a number of fields, one for each method combination. Red colour indicates prediction rates close to 1, whereas lower rates are shown in greener nuances. The heat map is clustered according to Spotfire's hierarchical clustering algorithm. It is apparent from this clustering that there are samples that are predicted very accurately by one group of model-selection combinations, while others predict the same samples wrongly. The thought arises that a merge of two or more method combinations might thus be able to perform better than the individual combinations alone. This is the background for, and motivation of, the "voting" experiments described in sections 3.4.3 and 3.4.4.

4.3 Unweighted voting by method combinations

Plots of prediction performance for the groups A-D are found in appendix D. The results from the two variants of unweighted voting are shown, i.e. with all models and with the three best performing models. The curve labelled "unwv all" gives the prediction rate for the voting of all the combinations, and the combination of PLS, SVMr and SVMl together with vip, reg, imp and ttest is labelled "unwv best". A few of the best performing single method combinations are also included in the plots for comparison.

As is apparent from the plots, no obvious improvement has come of the technique. The voted performance is as good or slightly worse than the best single method combination. In a few cases a slight improvement can be seen (for small variable subset sizes in A, B and D). The technique does however provide a more or less stable high performance for each of the sample subgroups – the best single combination is on the contrary different from case to case. It also maintains a high prediction rate for fairly small variable subsets, although the actual number of variables used for the voting may be a little higher. This is because the different selection methods have chosen a number of variables each, and thus not the same ones. The overlap is however significant (see figure 23) and the overall effect should be small.


4.4 Weighted voting by method combinations

The prediction performance of the neural network trained on the data is shown in figure 30 in appendix E.

The results are disappointing and a little surprising. The weighting approach seems fine in theory but, as can be seen in the plot, the prediction performance is worse than many or all of the contributing method combinations. Especially group C versus the control group is poorly predicted, with achieved rates little better than pure chance.

The cause of the inadequate performance is not entirely clear, and further investigations need to be conducted. One reason may be lack of data, since the procedure requires two divisions of the data instead of just one, resulting in three independent sets for each CV-round: the used method combinations need a training-set, and the subsequent training of the neural network requires a second. Another reason may be the neural network itself; it would be interesting to test other kinds of networks or methods.

4.5 Bagging

PLS with bagging gives a small but consistent improvement of prediction performance, as seen in figure 31, appendix F. The diagram shows PLS together with ttest as variable selection method; vip and reg gave similar results.

The improvement appears even more minuscule when considering that a round of bagging consists of a number of bootstrap sets of the training-set, in this case 100, and a subsequent PLS-model for each: a procedure 100 times more expensive computation-wise than ordinary PLS-modelling. Additionally, this procedure suffers from the same potential problem as the unweighted voting: the actual number of chosen variables in each subset may be a little higher in bagging.

Studies by Mevik, Segtnan & Næs (2004) confirm the rather marginal improvement of PLS through bagging.

5 Conclusions

The choice of scaling method seems to be important only for the MATLAB neural networks, where the mnmx-scaling gives the best performance.

The best performing methods are PLS and SVM. These, combined with vip and ttest, predict the samples best; up to about 80% prediction performance can be achieved. Overall, the variable selection methods are the "small" part of this performance, since they seem to choose more or less the same variables. The only outsider, Random forest's importance measure imp, unfortunately changes the prediction performance for the worse.

The unweighted voting procedure, although a promising idea, at best marginally improves performance in comparison with the otherwise best performing model-selection combination. It does, however, provide a consistently good prediction performance for all tested sample subgroups.

The weighted voting experiment did not yield the expected result, but further investigations are needed. The prediction performance is lower than for many of the individual method combinations. Reasons may include lack of data.

The bagging approach improves the performance of PLS marginally but consistently, at a high cost in computation time.

One of the goals of the larger MS biomarker project is to decrease the number of needed variables, in this case measured protein levels, with maintained high prediction performance. This is possible down to about 100 variables using PLS or SVM or a voting procedure. For smaller variable subsets the performance tends to drop and/or the smaller class is neglected. It is thus not feasible to isolate a single protein, or even a few proteins, as a biomarker for multiple sclerosis; rather, a collection of at least a hundred proteins is required.

6 Recommendations

Nothing in this work suggests any radical change in the current data analysis work flow of the biomarker project. The so far used PLS-modelling with preceding auto-scaling of the data gives a stable and high prediction rate, and although improvements can be made, they come at an increased cost in computation time. SVMs have shown promising results in this study, and further research into the application of their various forms on the data might prove fruitful. An analysis in the same way as in this work, but with different imputation of missing data and other preprocessing, might be worth considering.

The combinations of multivariate methods and variable selection methods have shown interesting results, which should be investigated further. The attempts at merging them into a single better performing procedure have yet to give significant results, but should, in my view, not be abandoned.

There are also other ways of improving the multivariate analysis that have not been considered in this work, mainly due to time considerations. There are other ensemble methods than bagging that might be worth looking into. Additionally, there are interesting variations of PLS-modelling, such as orthogonal projections to latent structures (O-PLS), described by Trygg & Wold (2002).

7 Acknowledgements

There are a number of people I would like to thank for assisting me in this thesis work in various ways.

A lot of gratitude goes to Kerstin Nilsson and Hugh Salter at AstraZeneca R&D in Södertälje for their assistance, instructions and willingness to answer questions. I would also like to thank Bo Franzén and Jan Ottervald for their proof-reading of the more laboratory-oriented parts of the thesis.

I thank my examiner Bengt Persson at the IFM-department of Linköping University.

Additionally, a thank you goes to AstraZeneca for giving me the opportunity to learn and try my skills through this thesis work.


References

Amendolia, S., Cossu, G., Ganadu, M., Golosio, B., Masala, G. & Mura, G. (2003), 'A comparative study of k-nearest neighbour, support vector machine and multi-layer perceptron for thalassemia screening', Chemometrics and Intelligent Laboratory Systems 69, 13–20.

Bielekova, B. & Martin, R. (2004), 'Development of biomarkers in multiple sclerosis', Brain 127, 1463–1478.

Blom, G. (1989), Sannolikhetsteori och statistikteori med tillämpningar, Bok C, Studentlitteratur, Lund. ISBN 91-44-03594-2, pp. 252–272.

Boulesteix, A.-L. (2004), 'PLS dimension reduction for classification with microarray data', Statistical Applications in Genetics and Molecular Biology 3(1).

Breiman, L. (1996), 'Bagging predictors', Machine Learning 24, 123–140.

Breiman, L. (2001), 'Random forests', Machine Learning 45(1), 5–32.

Burges, C. J. (1998), 'A tutorial on support vector machines for pattern recognition', Data Mining and Knowledge Discovery 2, 121–167.

de Jong, S. (1993), 'SIMPLS: an alternative approach to partial least squares regression', Chemometrics and Intelligent Laboratory Systems 18, 251–263.

Demuth, H. & Beale, M. (1998), Neural Network Toolbox, 3rd edn, The MathWorks Inc., Natick, USA.

Eriksson, L., Johansson, E., Kettaneh-Wold, N. & Wold, S. (2001), Multi- and Megavariate Data Analysis, Umetrics Academy, Umeå. ISBN 91-973730-1-X.

Garrels, J. I. (1989), 'The QUEST system for quantitative analysis of two-dimensional gels', Journal of Biological Chemistry 264(9), 5269–5282.

Hastie, T., Tibshirani, R. & Friedman, J. (2001), The Elements of Statistical Learning, Springer-Verlag. ISBN 0-387-95284-2, pp. 214–221, 246–249, 266–278.

Kesselring, J., ed. (1997), Multiple Sclerosis, Cambridge University Press. ISBN 0-521-48018-3, pp. 7, 8, 13, 71, 121.

LaBaer, J. (2005), 'So, you want to look for biomarkers', Journal of Proteome Research 4, 1053–1059.

Lawrance, I. C., Klopcic, B. & Wasinger, V. C. (2005), 'Proteomics: An overview', Inflammatory Bowel Diseases 11(10), 927–936.

Lucasius, C. & Kateman, G. (1993), 'Understanding and using genetic algorithms part 1: Concepts, properties and context', Chemometrics and Intelligent Laboratory Systems 19, 1–33.

Mevik, B.-H., Segtnan, V. H. & Næs, T. (2004), 'Ensemble methods and partial least squares regression', Journal of Chemometrics 18, 498–507.

Nilsson, K. (2005), So far unpublished work in the framework of the MS biomarker project at AstraZeneca.

O'Farrel, P. H. (1975), 'High resolution two-dimensional electrophoresis of proteins', Journal of Biological Chemistry 250(10), 4007–4021.

Trygg, J. & Wold, S. (2002), 'Orthogonal projections to latent structures (O-PLS)', Journal of Chemometrics 16, 119–128.

Wold, S., Sjöström, M. & Eriksson, L. (2001), 'PLS-regression: a basic tool of chemometrics', Chemometrics and Intelligent Laboratory Systems 58, 109–130.

A Parameter settings

(a) Cutoff values

        A      B      C      D
PLS     0.40   0.37   0.48   0.23
RF      0.36   0.34   0.5    0.21
GRNN*   0.4    0.4    0.04   0.4
SVMr    0.16   0.31   0.025  0.54
SVMl    0.16   0.30   0.035  0.54

(b) Other parameters

        scaling   components   m-value   spread   γ       C
PLS     auto      2
RF      auto                   5
PNN     mnmx                             1.0
GRNN    mnmx                             1.0
SVMr    auto                                      10^-4   16
SVMl    auto                                      10^-4   16

Table 3: Parameters used for the building of models. See the respective method descriptions for details. * Difficulties arose when trying to establish cutoff values for GRNN, see section 4.2 and figure 21.

B Scaling methods

Figure 10: PLS: Prediction performance for different scalings; panels: (a) different scalings, (b) same plot enlarged.

Figure 11: RF: Prediction performance for different scalings; panels: (a) different scalings, (b) same plot enlarged.

Figure 12: PNN: Prediction performance for different scalings.

Figure 13: GRNN: Prediction performance for different scalings.

Figure 14: SVMr: Prediction performance for different scalings (auto-scaling hidden under pareto-scaling).

Figure 15: SVMl: Prediction performance for different scalings.

C Combinations of methods

(a) All variables

        vip    reg    imp    ttest   ga
PLS     0.79   0.79   0.79   0.79    0.79
RF      0.63   0.61   0.62   0.63    0.61
PNN     0.66   0.66   0.66   0.66    0.66
GRNN    0.66   0.66   0.66   0.66    0.66
SVMl    0.80   0.80   0.80   0.80    0.80
SVMr    0.80   0.80   0.80   0.80    0.80

(b) 200 variables

        vip    reg    imp    ttest   ga
PLS     0.75   0.70   0.71   0.76    0.73
RF      0.67   0.64   0.62   0.67    0.64
PNN     0.67   0.55   0.64   0.65    0.64
GRNN    0.66   0.54   0.63   0.64    0.63
SVMl    0.75   0.71   0.71   0.76    0.73
SVMr    0.77   0.71   0.72   0.77    0.74

(c) 100 variables

        vip    reg    imp    ttest   ga
PLS     0.74   0.70   0.66   0.74    0.71
RF      0.67   0.64   0.62   0.69    0.67
PNN     0.68   0.58   0.61   0.67    0.66
GRNN    0.66   0.57   0.59   0.66    0.64
SVMl    0.73   0.70   0.65   0.73    0.71
SVMr    0.75   0.71   0.65   0.75    0.73

Table 4: Prediction performance of all method combinations, group A and control group

(a) All variables

        vip    reg    imp    ttest
PLS     0.82   0.82   0.82   0.82
RF      0.67   0.69   0.67   0.68
PNN     0.64   0.64   0.64   0.64
GRNN    0.64   0.64   0.64   0.64
SVMl    0.82   0.82   0.82   0.82
SVMr    0.81   0.81   0.81   0.81

(b) 200 variables

        vip    reg    imp    ttest
PLS     0.77   0.65   0.72   0.77
RF      0.72   0.65   0.67   0.74
PNN     0.70   0.54   0.64   0.71
GRNN    0.70   0.53   0.63   0.71
SVMl    0.75   0.66   0.69   0.75
SVMr    0.76   0.71   0.73   0.77

(c) 100 variables

        vip    reg    imp    ttest
PLS     0.74   0.65   0.69   0.74
RF      0.75   0.66   0.68   0.75
PNN     0.72   0.59   0.65   0.73
GRNN    0.72   0.57   0.65   0.73
SVMl    0.74   0.67   0.69   0.74
SVMr    0.76   0.71   0.68   0.76

Table 5: Prediction performance of all method combinations, group B and control group

(a) All variables

        vip    reg    imp    ttest
PLS     0.73   0.73   0.73   0.73
RF      0.60   0.60   0.59   0.60
PNN     0.62   0.62   0.62   0.62
GRNN    0.65   0.65   0.65   0.65
SVMl    0.73   0.73   0.73   0.73
SVMr    0.74   0.74   0.74   0.74

(b) 200 variables

        vip    reg    imp    ttest
PLS     0.71   0.59   0.66   0.71
RF      0.62   0.58   0.60   0.61
PNN     0.64   0.53   0.58   0.63
GRNN    0.66   0.53   0.59   0.66
SVMl    0.71   0.59   0.66   0.71
SVMr    0.71   0.57   0.67   0.71

(c) 100 variables

        vip    reg    imp    ttest
PLS     0.70   0.59   0.64   0.69
RF      0.62   0.58   0.59   0.61
PNN     0.64   0.55   0.58   0.63
GRNN    0.65   0.53   0.56   0.65
SVMl    0.69   0.59   0.62   0.69
SVMr    0.69   0.58   0.64   0.68

Table 6: Prediction performance of all method combinations, group C and control group

(a) All variables

        vip     reg     imp     ttest
PLS     0.72    0.72    0.72    0.72
RF      0.64    0.63    0.65    0.62
PNN     0.80*   0.80*   0.80*   0.80*
GRNN    0.80*   0.80*   0.80*   0.80*
SVMl    0.72    0.72    0.72    0.72
SVMr    0.72    0.72    0.72    0.72

(b) 200 variables

        vip     reg     imp     ttest
PLS     0.71    0.63    0.73    0.70
RF      0.71    0.63    0.67    0.71
PNN     0.81*   0.75*   0.80*   0.82*
GRNN    0.81*   0.74*   0.80*   0.82*
SVMl    0.73    0.71    0.72    0.74
SVMr    0.76    0.77    0.83    0.77

(c) 100 variables

        vip     reg     imp     ttest
PLS     0.70    0.65    0.64    0.69
RF      0.74    0.68    0.64    0.74
PNN     0.82*   0.78*   0.78*   0.82*
GRNN    0.82*   0.78*   0.78*   0.82*
SVMl    0.76    0.74    0.66    0.76
SVMr    0.78    0.81    0.80    0.79

Table 7: Prediction performance of all method combinations, group D and control group. *The values for PNN and GRNN stem from unbalanced prediction rates (see section 2.2).

Figure 16: Prediction performance of all method combinations, group A and control group; panels: (a) method combinations, (b) same plot enlarged.

Figure 17: Prediction performance of all method combinations, group B and control group; panels: (a) method combinations, (b) same plot enlarged.

Figure 18: Prediction performance of all method combinations, group C and control group; panels: (a) method combinations, (b) same plot enlarged.

Figure 19: Prediction performance of all method combinations, group D and control group; panels: (a) method combinations, (b) same plot enlarged.

Figure 20: An illustration of differences in class prediction rates; panels: (a) severe difference in class prediction rates, (b) small difference in class prediction rates. PNN provides no possibility to compensate for class sizes, which makes it harder to predict the smaller class accurately (20(a)). This can be compensated for by changing the cutoff value, as done for PLS (20(b)).

Figure 21: An illustration of the difficulties in compensating for differences in class sizes associated with GRNN; panels: (a) the almost binary output of GRNN, (b) the more informative output of PLS. The cutoff point can be placed arbitrarily between the extremes in 21(a) without influencing the prediction performance for the individual classes. In 21(b), however, compensation is possible.


Figure 22: Prediction performance of all prediction methods combined with variable selection through genetic algorithms

Figure 23: Selected variables for group A versus the control group. Each horizontal line represents a variable and its colour the number of times it was selected; red indicates many times. 100 variables were selected in each of 100 CV-rounds.

Figure 24: A comparison of variable selection with vip and ttest for group A versus the control group. The ratio between the number of times a variable was selected and the total number of selections is shown, with the values for vip on the vertical axis and the values for ttest on the horizontal axis. The vip selection stems from a PLS-model with two components.


Figure 25: Prediction performance for the samples of group A and the control group. Red indicates high prediction performance, green indicates low.

D Unweighted voting

Figure 26: Prediction performance of unweighted voting, group A

Figure 27: Prediction performance of unweighted voting, group B

Figure 28: Prediction performance of unweighted voting, group C

Figure 29: Prediction performance of unweighted voting, group D

E Weighted voting

Figure 30: Prediction performance of weighted voting through a feed forward neural network, each group versus the control group. The best individual method combinations for group A are inserted for comparison.

F Bagging

Figure 31: Results of bagging compared to prediction without bagging; PLS-ttest, 100 CV-rounds, 100 bootstrap sets for each CV-round.
