A comparison of normalisation methods for peptide microarrays

(1)

U.U.D.M. Project Report 2009:3

Examensarbete i matematisk statistik, 30 hp

Handledare: Marie Reilly och Davide Valentini, MEB, KI Examinator: Jesper Rydén

Mars 2009

Department of Mathematics

A comparison of normalisation methods for peptide microarrays

Ole Brus

(2)

(3)

A comparison of normalisation methods for peptide microarrays

Ole Brus March 2009

Supervisors: Marie Reilly and Davide Valentini Examiner: Jesper Ryd´ en

Abstract

When measuring the response of peptide microarrays, the overall brightness can often differ due to differences in slide preparation or settings of instruments. If this is the case and the slides are to be compared they need to be normalised.

Today there is no agreed standard method of normalisation for peptide microarrays although several methods exist. Four of the methods in this thesis are adopted from genetic microarrays and this study is, to our knowledge, the first to apply and compare these methods for peptide microarrays. In this thesis the following four methods are discussed and compared; global mean/median normalisation, invariant set normalisation, quantile normalisation and linear model normalisation. The methods were compared using data from slides prepared with and without human serum from healthy individuals.

Sammanfattning

Mätningar av peptidmikromatriser inneh˚aller ofta variationer som uppst˚att p˚a grund av skillnader i preparering eller instrumentinställning mellan de olika mikromatriserna. Om s˚a är fallet behöver resultaten normaliseras. Idag används flera olika metoder för normalisering men ingen av dessa är att betrakta som standardmetod p˚a omr˚adet. Flera av metoderna har tidigare använts inom genetiska mikromatriser och s˚avitt vi vet appliceras och jämförs de här för första g˚angen p˚a peptidmikromatriser. I det här examensarbetet har fyra nor- maliseringsmetoder undersökts och jämförts; globalt medelvärde/median, ick- evarierande mängd, kvantil och normalisering med hjälp av en linjär modell.

Metoderna jämfördes med hjälp av data uppmätta p˚a mikromatriser b˚ade med och utan serum fr˚an friska personer.

(4)

Acknowledgements: I want to thank Marie Reilly who have supervised me and taught me a lot about research and how to communicate my ideas. Davide Valentini has also supervised my work and provided valuable advice on programming and technical issues. Jesper Ryd´en has advised me on a number of topics and given me some good criticism. I’m also grateful to Mattias Linell for reading and commenting on an early draft of the thesis, and to Lovisa H¨ogberg who provided me with help, support and humour.

Finally both Anna Johansson and Karin Sundstr¨om have provided me with encouragement when I needed it.

(5)

1 Introduction

Progress in the diagnosis and treatment of infectious diseases relies on improv- ing the understanding of the human immune response to infections, vaccinations and treatments. For example, to understand tuberculosis and enable the devel- opment of new vaccines, useful knowledge can be gained from studying the differences in the response of the immune system in healthy and tuberculosis- infected persons when they encounter components of the tuberculosis bacterium.

To investigate the immune system in fine detail one can use ‘peptide microarrays’ that simultaneously measure the immune response to hundreds or thousands of peptides (peptide microarrays are further explained in Section 1.1.3). Before the results of such tests can be compared, the arrays for different individuals need to be normalised, to remove systematic errors introduced through differences in preparation and measurement of the slides. Once the data has been normalised to remove such technical variation, the results from healthy and ill people are compared to find biological differences.

In this thesis five methods for normalisation are described, four of which are implemented and compared to each other, whereas the last is rejected for practical reasons. Several of these methods have been developed for genetic microarrays but are here applied to peptide microarrays from studies of the immune system. Currently there is no generally accepted method for normalising peptide arrays.

1.1 Biochemical concepts

To get a basic understanding of how peptide microarrays work a few concepts will be introduced and explained here.

1.1.1 Proteins and peptides

Proteins take part in numerous biological processes in all living creatures. Hav- ing many different sizes, shapes and purposes they can be found in all living tissue in a great number of variations. They can, for example, act as catalysts, transport smaller molecules or be a part of DNA replication.

The structure of a protein can be seen as a number of interconnected chains of amino acids. Shorter chains are called peptides. In general proteins are longer and more complex and can consist of multiple peptide chains.

1.1.2 Antibodies and antigens

Antibodies are molecules that form a vital part of the immune system by recog- nising and binding to different types of intruders in the body such as bacteria or viruses.

The antibodies bind to certain peptides, called antigens on the surface of the bacterias or virus . Many types of antibodies exist and each may only recognise one or a few different antigens. When an antibody encounters a matching antigen the antibody binds to the intruding bacteria or virus. Using the bounded antigen the immune system can recognise the intruder and dispose of it [1].

(7)

1.1.3 Peptide microarrays

To introduce peptide microarrays a fictitious simplified example will be used.

This example illustrates how peptide microarrays work, but on a scale of only a single peptide in contrast to a microarray which contains several thousand peptides.

Example:

To test if a certain sample contains an antibody of interest one could add the sample to a test tube whose inside is coated with antigens that react with the specific antibody. If the antibody is present it will bind to the antigens and unbound antibodies are removed by washing. A fluorescent solution that binds to the antibodies is added and any excess is removed by washing. The fluorescent intensity can then be used as a measure of the concentration of the specific antibody in the sample (see Fig. 1).

Figure 1: Left: peptides fixed on the walls of a test tube. Middle: antibodies have bound to the peptides. Right: after addition of the fluorescent solution the peptides are detectable as light signals.

The same principle of testing a sample for an antibody can be expanded to understand how we test for multiple antibodies at once using a peptide microarray. A peptide microarray has many peptides fixed in specific positions on a small glass slide and the concentrations of antibodies that bind to these are measured in a similar fashion as in the simple experiment above. A peptide microarray slide may contain several thousand peptides printed in fixed positions.

Each slide, or array, often contains several identical subarrays, with each subarray further divided into blocks containing a lattice of positions where the peptides have been fixed (see Fig. 2). These positions are called spots. Some peptides can appear in multiple spots to provide replicated observations.

Negative controls are spots that should not give any signal (often nothing is printed at the spot) and positive control spots contain substances that are expected to give a high signal regardless of the specimen being tested. Positive and negative control spots are used to assess if the array is functioning correctly and are not to be analysed with the rest of the data.

(8)

Figure 2: Example of a slide, or array, with data consisting of two subarrays (b), each containing 24 blocks (a) with each block containing a lattice of 18 × 18 spots.

The microarray is incubated with the sample to be analysed so that the antibodies in the sample bind to the antigens on the slide. After this a solution containing fluorescent molecules that bind to the antibodies is added, and the fluorescence can then be detected and quantified.

1.1.4 Need for normalisation

Several sources of errors can obscure the measurements of the peptide responses including: slide production differences, differences in chemical preparation, measurement differences, a systematic differences between blocks or subarrays. To be able to compare the data acquired from different biological specimens, these technical errors need to be removed before the biological differences can be validly analysed (see for example [2]). The different methods in this thesis remove errors from various different sources, but all are designed to at least remove any subarray effect.

2 Methods

2.1 Data collection

2.1.1 Slides and samples

The slides used were second generation microarrays from JPT, Berlin [3]. Each slide had two subarrays containing 7776 spots arranged in 24 blocks of 18 × 18 spots. One such slide is depicted in Fig. 2. To calibrate the normalisation methods six slides with only buffer solution and eight slides incubated with serum from eight healthy individuals were used. The buffer slides were used to identify false positive peptides (see Section 2.4) and the eight serum treated slides were used for evaluating the normalisation methods. The data have previously been used in Gaseitsiwe et al. [4], where details of the procedures for preparing the slides can be found.

2.1.2 Spot classification and measurements

A GenePix 4000B microarray scanner was used to quantify the signal intensity of each spot on the slide. The fluorescence was recorded for the spot itself, called foreground, and the area around it, called background (see Fig. 3). The signals were read at a wavelength of 635 nm, corresponding to the wavelength of the

(9)

fluorescence. The data was exported as genepix ASCII files with a separate file for each subarray.

After scanning the slides, the quality of an individual spot was classified by the GenePix software with ‘0’ for a good spot, ‘−50’ if the spot was not found,

‘−75’ if nothing was printed on the spot (i.e. a negative control) and ‘−100’ if the spot was ‘bad’. The criterion for ‘bad’ spots was defined by the user. The

‘bad’ criterion used in this project was the same as in Nahtman et al. [5]:

Spot is bad if and only if

F635 mean > 1.5 × F635 median and

F635 median > 40 , (1)

where F635 ‘median’ denotes the median signal at wavelength 635 nm of the pixels in the foreground and ‘F635 mean’ is the corresponding mean value.

Figure 3: Description of areas of measurement for signal strength from one spot: the fluorescence in the dark grey area gives the foreground signal while the black area gives the background, the white circles denote the positions of surrounding spots and the light grey is a transition zone and is not used for any measurement.

2.2 Response variable

To measure the responses of peptides in a peptide microarray, a number of response variables have been suggested. These often use the logarithmic scale as the data are highly skewed. Quintana et al. [6] use

log2(foreground signal − background signal)

whereas Nahtman et al. [5] take the ratio of the foreground and backround signals instead of the difference:

log₂(foreground signal/background signal).

In this thesis the second of these measurements is used and we will refer to this measurement as the ‘response index’.

2.3 Data

The information from each peptide subarray was imported into the R statistical package and treated as a set of vectors. The imported data contained information about intensity of foreground signals, intensity of background signals, spot classifications and names of the peptides.

(10)

2.4 Control spots

As described in Section 1.1.3 each subarray may contain both positive and negative control spots. The negative control spots can be used to characterise the levels of response that can be attributed to experimental noise. A very high response from a negative control spot would suggest ‘spill over’ from a neighbouring spot.

Both the positive and negative control spots can be used to highlight areas of arrays or whole arrays that are not reliable. For example if all positive controls in a block give almost no signal there is probably something wrong with the block in question and one would disregard the responses from any peptides there.

The positive control spots are also useful on buffer arrays, i.e. arrays with no human sample. In the data from these slides, positive controls can be compared with the non-control spots to determine which peptides are giving a false positive response.

For the slides used in this thesis, the slide manufacturer has provided both positive and negative control spots on the subarrays. In addition, those peptides that give a strong signal on the buffer slides were considered as ‘false positives’, as they gave of a strong signal in the absence of human serum. We defined false positive peptides as those for which at least one spot on at least one of the buffer slides had a response greater than a specified value, h. The value of h can be chosen in two ways by plotting all background values against the response index, and secondly by plotting a QQ-plot both of which are illustrated in Fig. 4. By visual inspection of these figures h = 0.2 was chosen since that is approximately where the density in the left figure decreases and at the same time where the sudden change of the QQ-plot occurs.

The removal of control spots, false positives and spots flagged as bad resulted in subarrays of approximately 6600 spots each.

2.5 Oligonucleotide arrays

The microarrays used in genetic research have fractions of genes, called oligonucleotides, instead of peptides placed at the different spots. These arrays were the earliest types of microarrays to be used and many methods of normalisation have been developed for them. In this thesis some of these methods were applied to peptide microarrays even though the two array types differ in several ways.

On oligonucleotide arrays each gene has two response values calculated from a combination of measurements of the multiple oligonucleotides that constitute the gene (see Fig. 5). The first is calculated for the gene whose concentration the experiment is designed to estimate, this value is called the perfect match, abbreviated PM. A second value is calculated for a gene that has the same oligonucleotides except for a single nucleotide , and this value is called the mismatch: MM. The mismatch value is compared to the perfect match to express how much of the specific gene is present in the sample compared to similar genes. In analysis of oligonucleotide arrays it is common to define the response variable as a ratio of perfect match and mismatch similar to our response index for peptide arrays which is the ratio of foreground and background response (see Section 2.2).

The normalisation algorithms for oligonucleotide arrays use the properties

(11)

Figure 4: Left: to identify false positive peptides the background value is plotted vs the response index, Middle: a QQ-plot of the response index variables and a magnification of the sudden change (right). The data comes from buffer slides where the expected response is due to only random experimental noise. A line at y = 0.2 is added to all three plots as that is the chosen threshold for the false positive peptides.

Figure 5: One gene consists of several oligonucleotides, for each oligonucleotide one perfect match (PM) and one mismatch (MM) with a single nucleotide of difference.

(12)

of the oligonucleotides and since these are different from those of peptide microarrays some methods are not directly applicable.

2.6 Summary arrays

Some normalisation methods adopted from genetic microarrays require data that has an equal number of replications of each spot type. To be able to use these methods the mean of each spot type was calculated for each subarray, these new arrays are called ‘summary arrays’. Thus for each subarray a summary array is created with only one value for each spot type on the subarray (see Fig. 6). Those peptides that had all their replications marked as bad on the subarray will be missing on the summary array.

Figure 6: One subarray containing multiple spots for the same peptide, ‘pept k’ and the corresponding ‘summary array’ where all ‘pept k’ are combined by taking their mean; ‘mean k’.

2.7 Normalisation methods

2.7.1 Global mean/median normalisation

Global mean normalisation is a method that normalises each subarray without any influence from the other subarrays. This is done by subtracting the mean of each summary array from the corresponding subarray (for use of this method see Quintana et al. [6]). For each subarray a of length N and corresponding summary array ζ of length M this means:

a =





 a₁

... a_N





 , ζ =¯ ζ₁+ · · · + ζ_M M

so that the global mean normalisation for subarray a is:

a_GM=







a₁− ¯ζ ... aN− ¯ζ







(13)

This method ensures that the mean of the values in each subarray is close to zero, thus making the subarrays comparable for their average level of response.

The global median differs from the global mean method only in that it subtracts the median (instead of the mean) of each summary array from the corresponding subarray, thus giving a median value close to zero for all the subarrays. For the comparisons in this thesis the global median was used.

2.7.2 Invariant set normalisation

The basic idea of invariant set normalisation is to first select one subarray as a baseline array, and then compare it to each of the other subarrays to find those peptides that are relatively constant for each pair. These sets are called

‘invariant sets’. The final step is fitting a normalisation curve for each invariant set and using it to normalise the subarray from which it originated.

The invariant set method was originally developed and used by Li and Wong [2]. This method was developed to normalise oligonucleotide arrays and is therefore not directly applicable to peptide arrays. In particular the program developed by Gautier [7] does not accommodate different numbers of spots for each feature. To overcome this problem, summary arrays were created as described in Section 2.6 by calculating the mean response of the spots with the same peptide to create arrays with only one record for each peptide.

A baseline array δ can be chosen in the following four ways, all of which are implemented in [7] and the first one described in the original paper [2]:

1. Selecting one summary array by first finding the mean of all spots on each of the summary arrays, α₁, . . . , α_n, and then selecting the median of these values, αk, whereafter the summary array from which this value originated (summary array k) is selected as the baseline array.

2. Selecting one summary array by first taking the median of all spots on each of the summary arrays, β1, . . . , βn, and then selecting the median of these values, βk, whereafter the summary array from which this value originated (summary array k) is selected as the baseline array.

3. Creating a new summary array by taking the mean response of each peptide type across all summary arrays and using this as the baseline array.

4. Creating a new summary array by taking the median response of each peptide type across all summary arrays and using this as the baseline array.

For each summary array, ζ, an invariant set is then found using the chosen baseline array. This is done by first considering all peptides to be potentially invariant, except those that are seen as bad in either the array to be normalised or the baseline array. Peptides are then removed from the potentially invariant set by the following steps:

1. The ranks of ζ and δ are calculated (with the lowest rank, 1, given to the smallest value on each of the arrays). These ranks are here called E and B respectively denoting the array being examined and the baseline array.

2. Calculate the average intensity rank (Ri) for each peptide i as Ri= ^Bⁱ_2M^+Eⁱ where M is the number of potentially invariant peptides. The proportional

(14)

rank difference (D_i) for each peptide is also calculated as: D_i = ^|Bⁱ_M^−Eⁱ^|. In array form this gives:

R =







B1+E1

2M...

B_M+E_M 2M





 and: D =







|B1−E1| M...

|B_M−E_M| M







3. The next step is to remove those peptides from the invariant set that fulfill the following condition (see [7]):

(Hj− L0) × R + Lj≤ D i.e.

(Hj− L0)B + E

2M + Lj ≤|B − E|

M (2)

where H and L, abbreviations for high and low, are weight constants that determine how selective the algorithm is, with lower values of H and L increasing the proportion of rejected peptides. The ‘j’ denotes which repetition of the above steps that is currently being calculated. H and L are restricted by: 0 < L ≤ H. L0and H0are constants chosen by the user and H1= 10 × H0, L1= 10 × L0and matches the values used in [7]. The relative difference between H and L determines how large a weight will be given to the high ranking peptides in the criteria for keeping the peptides in the potentially invariant set. The bigger the difference the more likely it is that the high ranking peptides will be kept in the potentially invariant set.

4. Update L and H for j > 1 as follows:

Hj+1=

0.9 × Hj if Hj> H0

Hj otherwise ,

L_j+1=

0.9 × L_j if L_j> L₀ L_j otherwise

If a certain (predefined) number of peptides is removed from the invariant set at step 3 a new cycle consisting of step 2 to 4 is repeated. If fewer peptides have been removed, the invariant set is considered stable.

Examples:

To illustrate the invariant set selection a few examples are presented below:

If a peptide has the same rank on E and B, equation (2) indicates that the peptide will never be rejected since H is larger then L:

(H_j− L₀)i + i

2M + L_j≤ |i − i|

M

⇔

(15)

H_j+ L_j− L₀≤ 0,

where i stands for the rank on E and B. If, instead, a peptide is ranked in the middle of ζ and ranked as the maximum of δ (or vice versa) equation (2) gives:

(Hj− L0)M/2 + M

2M + Lj≤ |M/2 − M | M

⇔ (H_j− L₀)3M/2

2M + L_j≤ M/2 M

⇔ (H_j− L₀)3

4 + L_j≤ 1

2 (3)

In contrast if we have a peptide ranked in the middle of either ζ or δ and having the lowest response in the other array we get:

(Hj− L0)M/2 + 1

2M + Lj≤ |M/2 − 1|

M For a large M this is approximately

(Hj− L0)M/2

2M + Lj≤ M/2 M

⇔ (Hj− L0)1

4 + Lj≤ 1

2 (4)

On the first cycle of the steps above, (3) will give:

(10H0− L0)3

4 + 10L0≤1 2

⇔ (10H0− L0)3

2 + 20L0= 15H0+ 18.5L0≤ 1.

While equation (4) gives:

(10H0− L0)1

4 + 10L0≤1 2

⇔ (10H0− L0)1

2+ 20L0= 5H0+ 19.5L0≤ 1.

Therefore; the higher the ranks of a peptide on the two arrays the less likely it is to be rejected from the invariant set.

For those peptides that are in the invariant set a piecewise linear running median line with the values of the summary array on the x axis and the baseline values on the y axis is fitted for each invariant set. This line is to be used as a normalisation curve for the corresponding subarray.

(16)

The normalised values are computed for each subarray by plotting the normalisation curve and putting the values of the subarray to be normalised, a, on the x axis. The normalised values are those y values of the normalisation curve that corresponds to the x values of a [7].

In step 3 above the lower requirement of a high ranking peptide to be kept in the invariant set is a consequence of the relatively few high responding peptides.

If equal requirements are used for all peptides regardless of response the fit of the normalisation curve will be rather poor for the higher valued peptides since very few high ranking peptides will be chosen for the invariant set.

To calculate the normalised responses using the invariant set method, we used the program written by Gautier, available as open source code from bioconductor [7]. Some modifications were needed since the original program was written for genetic microarrays. Gautier’s program assumes the same number of replications for each oligonucleotide, so first a summary array was created as explained in Section 2.6 so that there was only one observation for each peptide type.

For selecting the baseline array method 4 was used. This ensures robustness in the baseline array since outliers will have little impact because the method uses the median response of each peptide type. It also results in each peptide having a value on the baseline array (unless the peptide is marked as bad in all spots on all slides).

The constants L and H were set to 0.003 and 0.007 respectively, the same as used in the program by Gautier [7]. The number of peptide rejections needed for a reiteration of the sub algorithm in step 3 above was set to 0 as no higher was needed on account of execution speed; to normalise the sixteen subarrays of the eight individuals took about 40 seconds.

2.7.3 Least variant set

The ‘least variant set’ method of normalisation is described in a paper by Calza et al. [8] and implemented (by Calza et al.) in [9]. The idea of least variant set is much like that of invariant set, i.e. finding a subset of peptides whose ranks vary little between the subarrays, and using this subset to normalise the data. Unlike the invariant set that uses pairwise comparisons to find a set for each subarray, the least variant set uses information from all subarrays simultaneously to find a set that have low dispersion across all subarrays.

Least variant set normalisation was developed for oligonucleotide arrays and uses the following model:

Eij = µ + ci+ dj+ ij,

where Eij is the log₂of the perfect match response value for gene i on subarray j, µ is the overall mean, ciis the effect of gene i, and djis the effect of subarray j. The model is fit using M-estimation, a generalisation of maximum likelihood estimation. From the estimation of the covariance matrix V and the estimated c (denoted by ˆc), a χ² test statistic is calculated as: χ²= ˆc^TV⁻¹ˆc.

The identification of the least variant set is then performed by fitting a quantile function of the χ² test statistic to a function of the logarithm of the residual standard deviation. Details can be found in [8].

The least variant set method presupposes multiple measurements of each gene for the gene parameter to be reliably estimated and is thus not applicable

(17)

to the peptide arrays which have very few, if any, repetitions of each peptide.

Therefore the method was not studied further.

2.7.4 Quantile normalisation

Quantile normalisation is presented in a paper by Bolstad et al. [10]. The idea of the normalisation is to give all subarrays the same distribution. This is done by taking the subarrays to be normalised a₁, . . . , a_n and sorting each subarray, producing a_{1 sort}, . . . , a_{N sort}. The sorted subarrays are then combined to form a matrix A_sort = (a_{1 sort}, . . . , a_{N sort})^T. Thereafter the mean of each row in A_sort is calculated and the result put in all columns of the corresponding row in matrix A_mean, with A_meanhaving the same dimensions as A. After this A_mean is split up column wise into n arrays, a_{1 mean}, . . . , a_{N mean}. Lastly the arrays a1 mean, . . . , aN mean are rearranged to reverse the sorting from a1, . . . , aN to a1 sort, . . . , aN sort, this produces the normalised arrays a1norm, . . . , aN norm.

Example:

A simple numerical example of this is:

a1=



 1 4 2



, a2=



 5 7

−2





Sorting the arrays give:

a1 sort=



 1 2 4



, a2 sort=





−2 5 7



, Asort=



 1, 5 4, 7 2, −2





Taking the values of A row wise and replace them with the mean of the same row results in:

Amean=





1−2 2 ,¹⁻²₂

2+5 2 ,²⁺⁵₂

4+7 2 ,⁴⁺⁷₂



=





−0.5, −0.5 3.5, 3.5 5.5, 5.5





Splitting up the matrix into arrays:

a1 mean=





−0.5 3.5 5.5



, a2 mean=





−0.5 3.5 5.5





Thereafter the columns only need to be rearranged to the same order as the original arrays:

(18)

a1 norm=





−0.5 5.5 3.5



, a2 norm=



 3.5 5.5

−0.5





The quantile normalisation was implemented with the help of a program by Smyth [11]. Since some spots were deemed to be bad and thus are not to be included in the normalisation some subarrays are of different length. To handle subarrays of different length Smyth used an algorithm that imputes data if a subarray has missing values. This is done by using the data in the shortened subarray to approximate a subarray of the same length as a subarray with no bad data. The first step of the imputation is to take the available (i.e. non bad)

Figure 7: Illustration of an imputation of missing responses for quantile normalisation. The subarray here should have five responses, but one of them is missing, therefore the other four points are plotted as open circles with their response as y values and the x values being equidistant points between 0 and 1. Straight lines are drawn between neighbouring circles. The new imputed data set is then obtained by taking five equidistant points on the x axis between 0 and 1 and drawing vertical lines to the line connecting the neighbours, choosing the corresponding y values.

data for the subarray and order it. The second step is to create an array, u, of the same length as this subarray with one point at 0, one in 1 and the rest spread out evenly between. Thirdly each subarray is ordered and the values are plotted on a graph with the response value as y values and the values from u with corresponding positions as x values. Then each point in the plot is connected to its neighbour by straight lines, the resulting line is called L. After this another array, w of the same length as the full subarray (including bad spots) is created. The first value of w is set to 0, the last to 1, and the values in between evenly spread out between 0 and 1. The imputed array then consists of those y values of L that have x values equal to w (see Fig. 7). In order to impute data in this fashion it is assumed that the missing data (marked as bad) is randomly distributed across the subarrays.

(19)

2.7.5 Linear model

Normalisation using linear models is done by fitting the peptide response data to the model:

Yij = µ + e1ij+ · · · + eKij+ ij (5) where Yijis the response variable for spot i on array j, µ is the overall average, e1

to eKare effects chosen by the user to be fitted in the model and is the residual.

Examples of fixed effects are systematic effects of array or block. By using a least square estimation to fit equation (5) we obtain estimates of µ, e₁, · · · , e_K. By subtracting the estimated effects e₁ to e_K and µ the normalised responses are ˆ = Y − µ − e_1ij− · · · − e_Kij.

This approach to normalising peptide microarrays was first used by Naht- man et al. [5].

We implemented the linear model with a block effect (g), a slide effect (f ), subarray effect (s) and interaction between slide and subarray (f × s):

Yij = µ + gi+ fj+ si+ fj× si+ ij

3 Results

To compare the results of the different normalisation procedures the normalised data was analysed slidewise since each slide represents measurements from a single person. In order to give each peptide equal weight the mean of each peptide type on each slide was used, similar to the summary arrays but on a slide level. If this aggregation over peptide type would not have been used the peptides with more replications on each slide would be represented by more positions and thus be given a larger weight. The data normalised by the different methods were investigated to find in what manner the methods altered the raw data, (e.g. reducing/preserving outliers, behaviour of distributions), and also to compare the execution time.

To visualise the results of the different normalisation methods two plots were produced: a plot where the unnormalised data was plotted versus the data normalised by the different normalisation methods (Fig. 8), and one where boxplots of the responses on the unnormalised slides and the slides normalised by the different methods were drawn (Fig. 9).

Examination of these plots shows that the global median method only adjusts each individual subarray to slightly higher values by adding a small positive constant. This is not suprising since the median of each subarray is negative and rather close to zero already (see Fig. 9).

For the invariant set method, it can be seen that the different invariant sets give normalised slides that are similar to each other around a response equal to zero. On the edges however there is more dispersion of the data, because of the relatively low density of the data near the highest and lowest values of the subarrays resulting in relatively bad fit of the normalisation line. In Fig. 9 one sees that the invariant set normalisation removes somewhat more outliers than the other methods, and the effect of this can also be seen in Fig. 8 as the extreme values of the unnormalised data are not matched in the invariant set.

The quantile regression plots reveal some of the properties of the quantile normalisation, namely that the peptides with the same rank on each subarray

(20)

Figure 8: Comparison of unnormalised data to data normalised by various methods.

Each point is the average response of all replications of a peptide type on a slide. Top row: (a) unnormalised data vs global median, (b) unnormalised data vs invariant set.

Bottom row: (c) unnormalised data vs quantile, (d) unnormalised data vs linear model.

A line y = x is added to all four plots for reference.

in the unnormalised data are averaged to the same level. Therefore several of the peptides with the same values in the quantile normalised data have different values in the unnormalised data (but equal rank). These show up as a band of points with the same value on the y axis but different values on the x axis. The same phenomenon observed for the invariant set occurs here as well: towards the extremes of the data the plot is more scattered while having lower dispersion around the middle. This pattern appears because of the lower density around the edges, giving large differences around the lowest and highest ranked values.

The middle ranked values sit close together and will not change much between unnormalised data and quantile normalised. Quantile normalisation results in all subarrays having very similar distributions so it is not surprising that the slides have similar distributions and boxplots (see Fig. 9).

In the last frame of Fig. 8, unnormalised versus linear model, the normalised data is shifted slightly upwards and is clustered around (and mainly a bit above) a straight line y = x. The boxplot shows that the linear model normalised data, much like the global median, is similar to the unnormalised data since the estimated fixed effects are rather small.

In Fig. 9 it is easy to see that global median, invariant set and quantile normalisations all have median values that are approximately equal across all

(21)

Figure 9: Comparison of boxplots for each slide; the response of each peptide type is averaged for each slide.

slides. The linear model is fit by least squares giving it similar mean values, not medians.

In the normalisation methods discussed in this thesis neither the mean nor the median value is generally equal to zero. This is to be expected since the algorithms do not shift the subarrays to make zero the mean or median.

A quick look at the execution times for 16 subarrays of 7776 spots each indicates that the global median is the slowest algorithm requiring 73 seconds.

Second slowest was the invariant set taking 42 seconds, followed by the linear model with 11 seconds, and lastly the quantile normalisation with 9 seconds.

4 Discussion

This thesis compares four methods of normalisation for peptide microarrays by implementing them and examining the results. To our knowledge this is the first time such a comparison has been done for peptide microarrays. The thesis adopts normalisation methods designed for oligonucleotide arrays and adapted

(22)

them for peptide microarrays.

Some of the measurements of the peptide microarrays were not considered reliable and were therefore excluded from the normalisation process, if some peptide(s) were then absent from one or more arrays, this would give rise to an imbalance in the size of the data material. This problem was solved using different methods for the different normalisation methods. Another source of imbalance was that some of the peptides in the subarrays were present multiple times. In order to give each type of peptide equal weight new arrays were created whose values were the averages of all repetitions for each peptide on a subarray.

In order to implement the methods a program needed to be written for the global median. For computing the invariant set normalisation adjustments were made to a previously existing program. For the other two methods existing software programs were used.

The normalisation methods used were partly adopted from other microarray applications, namely the invariant set and quantile normalisations that were developed for oligonucleotide microarrays. In a similar way it should be possible to use the methods described in this thesis in other, similar applications.

One big advantage of the programs used to calculate the invariant set and quantile normalisation is that they are available for free as open source software.

This means that anyone can download the source code of the programs and improve them or modify them to fit their own applications.

All methods in this thesis can handle missing data to varying degrees, as long as the data is missing at random. Global mean/median normalisation is not so sensitive to missing data. Since the different subarrays are not compared with each other, differences in length problems will not occur. If only a smaller part of the data set is unusable the invariant set normalisation can be used without imputing the missing data. In order to use quantile normalisation the data set needs to have equal number of peptide replications represented in all subarrays that are to be normalised. If this is not the case, data must be imputed in order to use the method. The linear model can be used with missing data, but less usable data will give a less reliable result.

When running programs for larger sets of data, execution times can often be be a limiting factor. With the data in this thesis, consisting of 16 subarrays of 7776 peptide measurements (with controls), this is not a big issue, as running all the methods takes only a couple of minutes. Therefore the normalisation methods do not produce a big problem in terms of execution time unless the data set used is significantly larger.

If one of the first two methods is used for selecting the baseline array in the invariant set method one of the summary arrays will be chosen as baseline array. This means that all peptides on the chosen summary array that are unreliable measurements will also be unreliable in the baseline array. If instead, as in this thesis, one of the last two methods is chosen to select the baseline, that baseline will have a value for each peptide unless the peptide has no reliable measurements on any subarray. This means that the initial potentially invariant set will be larger and therefore more probable to give a better invariant set.

To determine how well a normalisation method works is not easy. In order to compare one method with another a reference or ‘gold standard’ is needed.

For peptide microarrays no such set currently exists, and therefore comparing normalisation methods implemented with data from different sources can be difficult since that which works for one data set might not be as good for another.

(23)

Therefore a reference data set that can be used to determine the quality of peptide microarray normalisation methods is needed.

The methods described in this thesis need to be tested on other peptide microarray data sets since no common reference set has been used. The set used, coming from measurements of healthy individuals in a vaccination trial, might not be representative for other peptide microarray experiments and further testing of the methods is needed.

To be able to use the programs for normalisation that were used in this thesis some specific skills are needed. Mainly the user needs some experience of statistical software, particulary handling a command window interface and some knowledge of the programming language/environment R. This knowledge is needed to be able to make slight adjustments to the code and operate R. Fur- thermore, in order to understand the programs and make necessary adjustment a certain amount of statistical knowledge is needed.

In order to use several of the methods in this thesis, summary arrays were constructed. A tradeoff of using these is the loss of information due to reducing the amount of data by simply taking the mean of each peptide type on a subarray so that information about each individual measurement is not used.

The quantile normalisation method requires subarrays of equal length, otherwise the missing data must be imputed. In order to impute data it is assumed that the data to be replaced (which is considered bad by some criterion) needs to be distributed randomly. If the data is not missing at random the imputation algorithm can give an unrepresentative imputed set.

The slides from which the data in this thesis originates have their spots fixed in a non randomised manner, for example the corners of each block are always control peptides. This lack of randomisation can contribute to errors in the measurements, for example some spots may give a background value that is higher because of contamination by the signal from the neighbouring spots.

If the spot positions were to be randomised a more accurate measurement for the peptides could be obtained since the effect of a neighbouring spot could be calculated.

If normalisation, unlike in this thesis, is performed on a group of slides incubated with serum from both ill and healthy people in order to compare the results one can choose to either normalise both groups together or each group separately. If the groups are normalised separately any possible mean difference between the groups because of differences in peptides present is conserved.

However, if they are normalised together the result can be compared to other slides where it is unknown whether the subject that contributed the serum is healthy or not. Another advantage with normalising them together would be the increased amount of data, decreasing the uncertainty of the normalisation.

This would require the usage of the same model for all the data regardless of whether it originates from an ill or a healthy person.

To conclude this thesis shows that some methods developed for genetic microarrays are also applicable to peptide microarrays. The original programs that were modified for usage in this thesis can be found online at http://bioconductor.org.

To use these programs the open source programming environment R is also needed (http://r-project.org). The quantile normalisation and linear model softwares needed no modification to be applicable to the data used.

(24)

5 References

[1] Christopher K Mathews, Kensal E Van Holde, and Kevin G Ahern. Bio- chemistry. Addison Wesley Longman, 2000.

[2] Cheng Li and Wing H Wong. Model-based analysis of oligonucleotide arrays: model validation, design issues and standard error application.

Genome Biology, 2(8):0032.1–0032.11, August 2001.

[3] JPT Peptide Technologies GmbH Berlin, Germany.

[4] Simani Gaseitsiwe, Davide Valentini, Shahnaz Mahdavifar, Isabelle Magal- haes, Daniel F Hoft, Johannes Zerweck, Mike Schutkowski, Jan Andersson, Marie Reilly, and Markus J. Maeurer. Pattern recognition in pulmonary tuberculosis defined by high content peptide microarray chip analysis repre- senting 61 proteins from m. tuberculosis. PLoS ONE, 3(12):1–8, December 2008.

[5] Tatjana Nahtman, Alexander Jernberg, Shahnaz Mahdavifar, Johannes Zerweck, Mike Schutkowski, Markus Maeurer, and Marie Reilly. Valida- tion of peptide epitope microarray experiments and extraction of quality data. Journal of Immunological Methods, 328:1–13, December 2007.

[6] Francisco J Quintana, Peter H Hagedorn, Gad Elizur, Yifat Merbl, Eytan Domany, and Irun R Cohen. Functional immunomics: Microarray analysis of IgG autoantibody repertoires predicts the future response of mice to induced diabetes. Proceedings of the National Academy of Sciences of the United States of America, 101:14615–14621, October 2004.

[7] Laurent Gautier.

R-program ‘normalize.AffyBatch.invariantset’, package ‘affy’, http://bioconductor.org/packages/2.3/bioc/html/affy.html.

[8] Stefano Calza, Davide Valentin, and Yudi Pawitan. Normalization of oligonucleotide arrays based on the least-variant set of genes. BMC Bioin- formatics, 9(140), 2008.

[9] Stefano Calza, Alexander Ploner, and Yudi Pawitan. Flush, 2008.

R-package version 1.1.1, http://www.meb.ki.se/~yudpaw/.

[10] Ben M Bolstad, Rafael A Irizarry, Magnus ˚Astrand, and Terence P Speed.

A comparison of normalization methods for high density oligonucleotide array data based on variance and bias. Bioinformatics, 19(2):185–193, 2003.

[11] Gordon Smyth.

R-program ‘normalizeQuantiles’, package ‘limma’, http://bioinf.wehi.edu.au/limma/.

A comparison of normalisation methods for peptide microarrays

U.U.D.M. Project Report 2009:3

Examensarbete i matematisk statistik, 30 hp

Handledare: Marie Reilly och Davide Valentini, MEB, KI Examinator: Jesper Rydén

Mars 2009

Department of Mathematics

A comparison of normalisation methods for peptide microarrays

Ole Brus

A comparison of normalisation methods for peptide microarrays

Ole Brus March 2009

Supervisors: Marie Reilly and Davide Valentini Examiner: Jesper Ryd´ en

Abstract

Sammanfattning

Contents

1 Introduction

1.1 Biochemical concepts

2 Methods

2.1 Data collection

2.2 Response variable

2.3 Data

2.4 Control spots

2.5 Oligonucleotide arrays

2.6 Summary arrays

2.7 Normalisation methods

3 Results

4 Discussion

5 References