Peptide mapping by capillary/standard LC/MS and multivariate analysis

UPTEC X 04 036 ISSN 1401-2138 AUG 2004

RAGNAR STOLT

Master’s degree project

UPTEC X 04 036 Date of issue 2004-08

Author

Ragnar Stolt

Title (English)

Peptide mapping by capillary/standard LC/MS and multivariate analysis

Title (Swedish)

Abstract

The potential of LC/MS peptide mapping combined with multivariate analysis was investigated using IgG1 as a model protein. Five batches of IgG1 were exposed to different levels of an oxidizing agent.

A method to detect differences between the batches using solely MS data was developed and successfully applied. Four peptide fragments containing methionine residues were found to represent the most significant differences and characterized using MS/MS. In order to evaluate different computational strategies Principal Component Analysis (PCA) was used. Attempts were also made in order to use the information from the whole LC/MS space.

Keywords

Peptide Mapping, LC/MS, PCA, PTM, IgG1, Genetic Algorithms, Matlab Programming

Supervisors

Rudolf Kaiser

AstraZeneca, Analytical Development, Södertälje

Scientific reviewer

Per Andrén

Uppsala University, Laboratory for Biological and Medical Mass Spectrometry

Project name

Sponsors

Language

English

Security

ISSN 1401-2138

Classification

Supplementary bibliographical information

Pages

47

Biology Education Centre, Biomedical Center, Husargatan 3, Uppsala

Box 592, S-75124 Uppsala, Tel +46 (0)18 4710000, Fax +46 (0)18 555217

Molecular Biotechnology Programme

Uppsala University School of Engineering

Summary

In the pharmaceutical industry it is important to develop analytical methods that can detect small differences between samples of a protein drug. It must be possible to map the changes that are introduced in the protein when it is, for example, stored at room temperature for a long time.

These changes can alter the properties of the protein and possibly also lead to an immune response with serious consequences.

Traditionally, proteins have been characterized in protein chemistry by, among other methods, so-called peptide mapping. Peptide mapping involves enzymatically cleaving a protein and analyzing the resulting peptide fragments with liquid chromatography. Each resulting chromatogram then corresponds to a fingerprint of the protein, and small differences between samples can be traced through small changes in the fingerprints.

In this way it can be determined whether any differences exist, but not what they consist of. This study aims to further extend the possibilities of peptide mapping by analyzing the peptide fragments with a mass spectrometer.

Small changes in the form of oxidation were introduced in a model protein. Using traditional statistical methods, stochastic non-significant differences were filtered out, and the changes could be characterized by tandem mass spectrometry as oxidation of methionine.

Great emphasis has been placed on developing algorithms that can handle the large and complicated data sets that mass spectrometric data constitute.

Degree project, 20 credits, in the Molecular Biotechnology Programme

Uppsala University, August 2004

1 Introduction
1.1 Model Protein, Immunoglobulin G1
1.2 Peptide Mapping
1.2.1 Digestion
1.3 Reversed Phase High Performance Liquid Chromatography (RP-HPLC)
1.4 Mass Spectrometry (MS)
1.4.1 Ion Source
1.4.2 Time of Flight Analyzer
1.4.3 Tandem Mass Spectrometry
1.4.4 Hybrid Quadrupoles Time of Flight
1.4.5 The Detector
1.5 Data Analysis
1.5.1 Normalization
1.5.2 Confidence Interval
1.5.3 Principal Component Analysis (PCA)
1.5.4 Genetic Algorithms
1.5.5 Wavelet Transformation
2 Material and Methods
2.1 Equipment and Chemicals
2.1.1 Chemicals
2.1.2 Equipment
2.2 Methods
2.2.1 Oxidation of Model Protein
2.2.2 Digestion of Model Protein
2.2.3 RP-HPLC
2.2.4 LC/MS
2.2.5 Design of Experiment
2.3 Data Analysis
2.3.1 Importing Data to Matlab
2.3.2 Approach 1: Collapsed Time Scale
2.3.2.1 Normalization
2.3.2.2 Principal Component Analysis (PCA)
2.3.2.3 Confidence Interval
2.3.2.4 Finding Oxidized Fragments
2.3.3 Approach 2: Timescale
2.3.3.1 Wavelet Denoising
2.3.3.2 Preprocessing Using Genetic Algorithms and Normalization
2.3.3.3 Bucketing
2.3.3.4 Confidence Interval
3 Results
3.1 Data Analysis
3.1.1 Approach 1: Collapsed Time Scale
3.1.1.1 Normalization
3.1.1.2 Principal Component Analysis (PCA)
3.1.1.2.1 Normalization with Normalization Parameter
3.1.1.2.2 Evaluating Auto Scaling
3.1.1.2.3 Comparing Normalization Techniques
3.1.1.3 Confidence Interval
3.1.2 Approach 2: Time Scale
3.1.2.1 Wavelet Denoising
3.1.2.2 Preprocessing
3.1.2.3 Confidence Interval
3.1.2.4 Bucketing
3.2 Tandem Mass Spectrometry
4 Discussion
5 Acknowledgements
6 References

1 Introduction

Today a number of different recombinant proteins are available on the pharmaceutical market.

The breakthrough for recombinant techniques is often associated with the release of insulin produced in E. coli in 1982 [1]. Right from the beginning it has been important to develop methods to characterize and analyze recombinant proteins.

Recombinant techniques are complicated by posttranslational modifications (PTMs). Eukaryotic organisms, including humans, have developed a complex system for PTMs, and vital proteins will not function properly if these PTMs are missing. In contrast, prokaryotic organisms such as E. coli largely lack this machinery. Pharmaceutical companies therefore have to be able to detect differences between the product and the native form of the drug candidate protein. Differences from the native form can lead to dysfunction of the protein drug and to an unwanted immune response with hazardous consequences.

There is also a great need to investigate the quality of a protein drug. What kind of modifications are introduced in the protein when it is, for example, stored at room temperature for days? Perhaps a couple of amino acids will be oxidized and others will undergo deamidation or deglycosylation. These questions need to be answered before a new protein drug is commercialized.

A common method to detect differences between protein batches is peptide mapping, using RP-HPLC [2]. To facilitate data analysis a multivariate approach can be successful. Principal component analysis (PCA) is often used [3] to model variations in the data set, making it easier to detect e.g. outliers and to produce information concerning system reproducibility. It is also important to minimize stochastic and system drift variations especially when looking for small differences in the data set. Otherwise it can be difficult to separate non-chemical variations from true physical differences in the protein.

The UV data collected from the HPLC is however often not sufficient to disclose small variations in the data set. Furthermore, the UV chromatogram does not give any qualitative information. With this kind of data it is not possible to answer the question “Where on the protein are the modifications located and what do they consist of?”. To extend the possibilities of peptide mapping, the univariate approach has to be abandoned and more physical information describing the properties of the protein needs to be gathered.

One way to increase the amount of available information is to use an LC/MS system, which gathers information not only in the time domain but also in the m/z domain. The result is a bivariate peptide map instead of the traditional univariate UV map. Mass data (m/z) can also give qualitative information about the parts of the protein where the modifications are situated. Using MS/MS these parts can be analyzed further, and by comparison with a reference batch the individual differing amino acids can be detected.

This project focuses on studying LC/MS peptide maps and developing computational methods to separate true chemical differences from noise without any a priori information.

Found differences will be characterized using MS/MS.

1.1 Model protein, Immunoglobulin G1

Immunoglobulin G (IgG1, κ) was chosen as model protein. IgG is very important to the immune defense system and is the most abundant antibody, with approximately 13.5 mg/ml in serum [4]. IgG binds to foreign molecules and thereby activates other members of the immune defense system.

IgG consists of two kinds of chains, a smaller light chain and a larger heavy chain, each present in two copies (fig. 1). The chains are held together by a total of four disulfide bonds.

The molecular mass of the IgG molecule used in this study is 145 kDa (without any PTMs) and there are 450 amino acids.

There is an N-linked glycosylation site on each heavy chain.

To evaluate the possibilities of an LC/MS peptide map, small chemical changes were introduced by oxidizing IgG. Comparing batches with different amounts of added oxidizing agent should reveal some information about the potential of the analytical LC/MS system.

The amino acid most sensitive to oxidizing agents is methionine. Oxidation of methionine produces methionine sulfoxide [5] in a reversible reaction. This oxidation corresponds to the addition of one oxygen atom, resulting in a mass increase of 16 Da. A further increase in the concentration of oxidizing agent can irreversibly oxidize methionine sulfoxide to methionine sulfone. There are six methionine residues in the amino acid sequence.
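The 16 Da mass increase translates into a charge-dependent shift on the m/z axis. The following is a minimal sketch of that arithmetic, assuming protonated ions of the form [M + zH]; the peptide mass of 1500 Da and the function names are hypothetical and chosen only for illustration.

```python
# Monoisotopic mass of one oxygen atom (Da); the text rounds this to 16 Da.
OXYGEN_MASS = 15.9949
# Mass of the charge-carrying proton (Da).
PROTON_MASS = 1.00728


def mz(neutral_mass, charge):
    """m/z of a protonated ion [M + zH] with the given charge z."""
    return (neutral_mass + charge * PROTON_MASS) / charge


def oxidized_mz(neutral_mass, charge, n_oxygen=1):
    """m/z after adding n_oxygen oxygen atoms.

    n_oxygen=1 corresponds to methionine sulfoxide,
    n_oxygen=2 to methionine sulfone.
    """
    return mz(neutral_mass + n_oxygen * OXYGEN_MASS, charge)


# Hypothetical peptide mass, used only for illustration.
peptide_mass = 1500.0
for z in (1, 2, 3):
    shift = oxidized_mz(peptide_mass, z) - mz(peptide_mass, z)
    print(f"z={z}: m/z shift on oxidation = {shift:.4f}")
```

The shift equals the added mass divided by the charge, so a doubly charged peptide with one oxidized methionine appears about 8 m/z units above its unmodified form.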

1.2 Peptide Mapping

Peptide mapping is a method used to create a “fingerprint” specific for a certain protein. The protein is digested with a suitable enzyme and the peptide fragments are separated using e.g. Reversed Phase High Performance Liquid Chromatography (RP-HPLC). Traditionally a UV detector is used for data collection. In this study a mass detector was used.

1.2.1 Digestion

The digestion method has to be compatible with the chemical conditions required by the HPLC system and the mass spectrometer. It is important to develop a digestion routine with high reproducibility in order to be able to compare the results from different runs. The enzyme used has to digest the protein into a suitable number of peptide fragments. Too many, too small fragments risk obstructing the data analysis and decrease the signal-to-noise ratio. Too few fragments decrease the amount of information that can be gathered from a peptide map.

Figure 1: Immunoglobulin G1

1.3 Reversed Phase High Performance Liquid Chromatography (RP-HPLC)

RP-HPLC is a widely used and well-established tool for the analysis and purification of biomolecules, e.g. a protein digest. The system uses high pressure to force a mobile phase through a column packed with porous microparticles. Particle sizes typically range between 3 and 50 µm. The smaller the particle diameter, the more pressure is generated in the system.

The particle pore size generally ranges between 100 and 1000 Å. Small-pore silicas may sometimes separate small or hydrophilic peptides better than large-pore silicas [6].

The most common columns are packed with silica particles to which different alkylsilane chains are chemically attached. Butyl (C4), octyl (C8) and octadecyl (C18) silane chains are the most commonly used. C4 is generally used for proteins and C18 for small molecules. The idea is that large proteins with many hydrophobic moieties need shorter chains on the stationary phase for sufficient hydrophobic interaction. The choice of column diameter depends on the required sample load and the flow rate. Small-bore columns (1.0 and 2.1 mm i.d.) can improve sensitivity and reduce solvent usage. Column length does not significantly affect most polypeptide separations [6]. To shorten the analytical cycle time, short columns with high flow rates and fast gradients can be used at the expense of resolution.

An HPLC system optimized for columns with small inner diameters and low flows is called micro-HPLC. A micro-HPLC system has narrow capillaries, typically 50 µm i.d. The pumps commonly work with a split flow, enabling low flows with high accuracy. The main advantages of micro-HPLC are reduced consumption of mobile phase solvent and high sensitivity, which makes it possible to load small amounts of sample and facilitates the connection to a mass spectrometer system.

In this form of liquid chromatography the stationary phase is non-polar and the mobile phase relatively polar.

Analytes will thus be separated mainly according to their hydrophobic properties. During a gradient separation two different solvents are used as mobile phase. One of the solvents is relatively hydrophilic and the other is relatively organic (hydrophobic). The two solvents are mixed and the relative content of the organic solvent increases with time.

At the beginning of the gradient the analytes are attached to the solid phase through hydrophobic interaction. When the organic content of the mobile phase reaches a critical value, desorption takes place and the analytes pass through the column. The majority of peptides (10 to 30 amino acid residues in length) have reached their critical value when the gradient reaches 30% organic content. The separation is however also influenced by molecular size. Smaller molecules move more slowly through the column than larger ones, because smaller molecules have access to a larger volume of the column. The partitioning of the analytes between mobile and solid phase also affects the separation. However, it is quite safe to say that polar analytes elute first and non-polar analytes last.

Figure 2: The idea behind gradient separation with RP-HPLC. The polypeptide enters the column at injection, adsorbs to the hydrophobic surface, and desorbs from the stationary phase when the organic solvent reaches a critical concentration.

To obtain a separation based mainly on hydrophobic differences, an ion-pairing agent is often added to the mobile phase to serve one or more of the following functions: pH control, suppression of unwanted interactions between basic analytes and the silanol surface, suppression of unwanted interactions between analytes, or complexation with oppositely charged ionic groups. It has been shown [6] that addition of an ion-pairing agent has a dramatic beneficial effect on RP-HPLC, not only enhancing separation but also improving peak symmetry. Trifluoroacetic acid (TFA) is a widely used ion-pairing agent. It is volatile and has a long history of proven reliability.

When the HPLC system is connected to a mass spectrometer, the ion-pairing agent has to be chosen with care, because ion suppression reduces the sensitivity of the mass spectrometer system.

Temperature is another factor that can influence peptide separations. Higher temperature is associated with increased diffusion according to Einstein's diffusion constant D:

$$D = \frac{k_B T}{6 \pi \eta r} \qquad (1)$$

where η is the viscosity, k_B is Boltzmann's constant, T is the temperature and r is the radius of the diffusing particle.

It is however difficult to draw any general conclusions, because an increase in temperature has been shown to increase the resolution between certain analytes and decrease it between others [6]. Good reproducibility therefore requires strict temperature control.

RP-HPLC is one of the most widely used forms of chromatography, mainly because of its high resolution. Chromatographic resolution is defined as the ratio of the difference in retention time between two neighboring peaks A and B and the mean of their base widths:

$$R_S = \frac{\Delta t_R}{w_{av}} = \frac{2\,(t_R^B - t_R^A)}{w_A + w_B} \qquad (2)$$

where t_R is the retention time and w the base width. With RP-HPLC it is possible to separate peptides whose sequences differ by a single amino acid residue.

1.4 Mass Spectrometry (MS)

A mass spectrometer is an analytical instrument that separates and detects ions according to their mass-to-charge ratio m/z, from which their molecular weight can be determined. The device consists of three basic components: the ionization source, the mass analyzer/filter and the detector.

1.4.1 Ion Source

Ionization is an essential part of the mass spectrometric process. The molecules have to be charged and in the gas phase in order to be accelerated by the electric field inside the mass spectrometer. Several different ionization techniques have been developed. The techniques most often used for the analysis of peptides and proteins are matrix-assisted laser desorption/ionization (MALDI) and electrospray ionization (ESI). Both are so-called soft ionization techniques, which means that the molecules are ionized without fragmentation. In this study ESI was used.

ESI creates a fine spray of highly charged droplets in the presence of a strong electric field. The sample solution is injected at a constant flow, which makes ESI particularly useful when the sample solution is introduced by an LC system. If the LC flow is compatible with the mass spectrometer, an online LC/MS system is easily established. The charged droplets are introduced to the mass analyzer compartment together with dry gas, heat or both, which leads to solvent vaporization. As the droplets decrease in volume the electric field density increases; eventually repulsion exceeds surface tension and charged molecules start to leave the droplet via a so-called Taylor cone [7]. This process takes place at atmospheric pressure and is sometimes also called atmospheric pressure ionization (API).

Using ESI it is possible to study molecules with masses up to 150 000 Dalton, mainly because ESI generates multiply charged molecules, which means that a low upper m/z limit is sufficient for the analysis of large biomolecules. A typical detection limit using ESI is in the femtomole range [7].
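The effect of multiple charging on the required m/z range can be illustrated with a short calculation. This is a sketch under stated assumptions: ions are taken to be protonated, [M + zH], the 145 kDa mass is the value quoted for the model protein, and the upper limit of m/z 4000 is a hypothetical instrument limit chosen for illustration.

```python
# Mass of the charge-carrying proton (Da).
PROTON_MASS = 1.00728


def charge_state_mz(neutral_mass, charge):
    """m/z of a protonated ion [M + zH] with the given charge z."""
    return (neutral_mass + charge * PROTON_MASS) / charge


def min_charge_for_limit(neutral_mass, mz_limit):
    """Smallest charge state whose m/z falls at or below mz_limit."""
    if mz_limit <= PROTON_MASS:
        raise ValueError("mz_limit must exceed the proton mass")
    z = 1
    while charge_state_mz(neutral_mass, z) > mz_limit:
        z += 1
    return z


# 145 kDa protein from the text; the m/z limit is a hypothetical value.
protein_mass = 145000.0
z_min = min_charge_for_limit(protein_mass, mz_limit=4000.0)
print(z_min, round(charge_state_mz(protein_mass, z_min), 2))
```

Under these assumptions the intact protein becomes observable once it carries a few tens of charges, which is why a modest m/z range is enough for large biomolecules.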

1.4.2 Time of Flight Analyzer

The most commonly used analyzers are quadrupoles, Fourier transform ion cyclotron resonance analyzers and time of flight (TOF) analyzers. In this study a TOF analyzer with a reflectron was used. The TOF analyzer is the simplest construction. It is based on the idea that ions accelerated through the same electric field acquire the same kinetic energy for a given charge. The ions will then differ in velocity according to their charge-to-mass ratio:

$$v = \left(\frac{2 z U}{m}\right)^{1/2} \qquad (3)$$

where v is the velocity, z the charge and m the mass of the ion, and U is the accelerating voltage. The differences in velocity in turn lead to different flight times from the ion source to the detector. One advantage of TOF instruments is that no scanning of the m/z spectrum is necessary. Another advantage is that there is virtually no upper mass limit. However, the resolving power of TOF instruments is low. Resolving power, also called resolution, is defined as the ability of a mass spectrometer to distinguish between different m/z ratios at a certain peak height. For a single peak in the mass spectrum, resolution is commonly defined as the ratio between the m/z value and the full width of the peak at half maximum. A reflectron can improve the resolution of the analyzer.

A reflectron is a device with a graded electrostatic field. This so-called “ion mirror” redirects the ion beam towards the detector. Ions with greater kinetic energy penetrate deeper into the reflectron than ions with lower energy and therefore travel a longer path. This mechanism compensates for a wide distribution of initial kinetic energy and thus increases mass resolution.
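Equation (3) can be turned into a flight time by dividing a drift length by the velocity. The sketch below assumes SI units and writes the elementary charge explicitly, whereas equation (3) absorbs it into z. The accelerating voltage of 10 kV and the drift length of 1 m are hypothetical values, and a simple linear flight path without reflectron is assumed.

```python
import math

ELEMENTARY_CHARGE = 1.602176634e-19  # C
DALTON = 1.66053906660e-27           # kg


def ion_velocity(mass_da, charge, voltage):
    """Velocity (m/s) after acceleration through `voltage` volts, eq. (3)."""
    energy = charge * ELEMENTARY_CHARGE * voltage  # kinetic energy in joules
    return math.sqrt(2.0 * energy / (mass_da * DALTON))


def flight_time(mass_da, charge, voltage, length):
    """Time (s) to traverse a field-free drift region of `length` metres."""
    return length / ion_velocity(mass_da, charge, voltage)


# Hypothetical instrument parameters, for illustration only.
U, L = 10_000.0, 1.0
for mass, z in ((1000.0, 1), (2000.0, 2), (4000.0, 1)):
    t_us = flight_time(mass, z, U, L) * 1e6
    print(f"m={mass:.0f} Da, z={z}: flight time = {t_us:.2f} microseconds")
```

Ions with the same m/z arrive at the same time, and the flight time scales with the square root of m/z, so a fourfold increase in m/z doubles the flight time.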

1.4.3 Tandem Mass Spectrometry

The peptide has to be fragmented in order to determine its sequence. Fragmentation can be achieved by ion-molecule collisions in a process called collision induced dissociation (CID). The idea behind CID is to select the peptide ion of interest and introduce it into a collision cell containing a collision gas, often argon, which results in cleavage of the peptide backbone. From the resulting daughter ion spectrum the m/z values of the amino acids involved can be found. The peptide fragments can be divided into different series. When the charge is retained on the N-terminal side (fig. 3), the resulting series of fragments are called a_n, b_n and c_n. When the charge is retained on the C-terminal side, fragmentation can also occur at three different positions, giving the series x_n, y_n and z_n.
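The b and y series can be predicted from the residue masses, which is how a modification such as methionine oxidation is localized in a daughter ion spectrum. The following sketch computes singly charged b and y ions from standard monoisotopic residue masses. The peptide sequence MAGK and the helper names are hypothetical and serve only as an illustration.

```python
# Standard monoisotopic residue masses (Da).
RESIDUE_MASS = {
    "G": 57.02146, "A": 71.03711, "S": 87.03203, "P": 97.05276,
    "V": 99.06841, "T": 101.04768, "C": 103.00919, "L": 113.08406,
    "I": 113.08406, "N": 114.04293, "D": 115.02694, "Q": 128.05858,
    "K": 128.09496, "E": 129.04259, "M": 131.04049, "H": 137.05891,
    "F": 147.06841, "R": 156.10111, "Y": 163.06333, "W": 186.07931,
}
WATER = 18.01056
PROTON = 1.00728
OXIDATION = 15.9949  # one added oxygen atom


def residue_masses(sequence, mods=None):
    """Residue masses, with optional mass shifts keyed by 0-based position."""
    mods = mods or {}
    return [RESIDUE_MASS[aa] + mods.get(i, 0.0) for i, aa in enumerate(sequence)]


def b_ions(sequence, mods=None):
    """Singly charged b ions (charge retained on the N-terminal side)."""
    m = residue_masses(sequence, mods)
    return [sum(m[:i]) + PROTON for i in range(1, len(m))]


def y_ions(sequence, mods=None):
    """Singly charged y ions (charge retained on the C-terminal side)."""
    m = residue_masses(sequence, mods)
    return [sum(m[-i:]) + WATER + PROTON for i in range(1, len(m))]


# Hypothetical peptide with and without an oxidized N-terminal methionine.
peptide = "MAGK"
print("b:", [round(x, 4) for x in b_ions(peptide)])
print("y:", [round(x, 4) for x in y_ions(peptide)])
print("b (Met oxidized):", [round(x, 4) for x in b_ions(peptide, {0: OXIDATION})])
```

In this example every b ion contains the methionine and shifts by about 16 Da on oxidation, while the listed y ions do not contain it and stay unchanged. Comparing which fragment ions shift is what pins the modification to a specific residue.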

1.4.4 Hybrid Quadrupoles Time of Flight

A quadrupole device can be used to select the peptide ions that are to be investigated by CID.

A quadrupole consists of four parallel rods to which a direct current voltage and a radio frequency field are applied. When ions enter the quadrupole they start to oscillate, depending on the radio frequency field and their m/z value. Only ions with a particular m/z value are able to pass through the quadrupole; the rest collide with the rods. The quadrupole thus works as a mass filter. By scanning the radio frequency field an entire mass spectrum can be obtained.

Figure 3: Collision induced dissociation (CID), shown for a peptide with four residues (R1 to R4). Cleavage at the three possible backbone positions gives the a, b and c ions on the N-terminal side and the x, y and z ions on the C-terminal side. Ions of the b and y series often dominate the daughter ion spectrum.

The instrument used in this study is a quadrupole TOF hybrid (fig. 4). The quadrupole is used to select an ion of interest, which is fragmented in the collision cell. The resulting daughter ions are analyzed using a TOF device and a detector.

1.4.5 The Detector

The detector (fig. 5) converts the kinetic energy of the arriving ions to an electrical current. The amplitude of the current is correlated with the number of ions reaching the detector. Most detectors available today are based on the principle of electron multiplication. The detector in this particular instrument is a microchannel plate (MCP). An MCP detector consists of a large number of electron multiplier tubes.

When a charged particle collides with the tube wall, secondary electrons are emitted and reflected further down the tube, leading to a cascade of secondary electrons that stays well confined in space. The signal is typically amplified in the MCP detector by a factor of 10³ to 10⁴.

1.5 Data Analysis

1.5.1 Normalization

When using an LC/MS device, small differences in sample concentration, injection volume or loss of sensitivity will introduce variations in the data set that complicate the comparison between batches. These variations can however be compensated for by normalizing the data set. Most normalization techniques, e.g. for HPLC data, are based on an internal standard or an external standard. Normalization in this context means that the data set is divided by the area or height of the standard peak. The peak with the largest area in the chromatogram can often be used as standard peak with good results.

Figure 4: The concept behind a quadrupole-TOF system. Ions from the ESI source enter the quadrupole, the selected ion is fragmented in the collision cell (Ar), and the ion fragments pass through the TOF analyzer with reflectron to the detector.

Figure 5: Microchannel plate detector. Each channel is a hollow glass capillary with a secondary electron emission coating (labels in the figure: ions, photoelectron, secondary electrons).

However, when analyzing MS data with a large number (up to thousands) of m/z values, normalization is not a trivial task. Which m/z value should be chosen as standard peak in order to produce the most accurate normalization? What if an m/z value with large variation or, equally bad, with too little variation is chosen? Under these conditions the normalized data set will poorly represent the true values. A better approach is to calculate intensity quotients between the m/z values of a reference sample and a target sample. The mean of these quotients can be used as a normalization parameter.

Averaging the quotients works as a low pass filter (fig. 6): only significant trends in the data set are represented in the normalization parameter, which minimizes the impact of m/z values with large variation. This normalization technique works well provided that the number of m/z values is fairly large and the chemical differences between the batches are fairly small. Large chemical differences will slip through the low pass filter and give rise to a skewed normalization.
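The quotient-based normalization can be sketched in a few lines. This is an illustration under stated assumptions rather than the implementation used in the study: the quotient is taken as reference divided by target, the target is rescaled by multiplying with the mean quotient, and the intensity values are hypothetical.

```python
from statistics import mean


def normalization_parameter(reference, target):
    """Mean of the per-m/z intensity quotients reference/target.

    Both lists hold intensities at the same m/z values. Positions where the
    target intensity is zero are skipped to avoid division by zero.
    """
    quotients = [r / t for r, t in zip(reference, target) if t > 0]
    return mean(quotients)


def normalize(reference, target):
    """Rescale the target so that it is comparable with the reference."""
    factor = normalization_parameter(reference, target)
    return [t * factor for t in target]


# Hypothetical intensities at five m/z values; the target run has roughly
# half the overall signal of the reference run.
reference = [100.0, 250.0, 80.0, 400.0, 60.0]
target = [52.0, 120.0, 41.0, 205.0, 29.0]
print(round(normalization_parameter(reference, target), 3))
print([round(x, 1) for x in normalize(reference, target)])
```

Because every m/z value contributes one quotient, a single strongly deviating peak has little influence on the parameter as long as many m/z values are included.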

1.5.2 Confidence Interval

A classic way of treating the problem of stochastic variation between different samples of the same batch is to estimate a confidence interval. Assuming that the observed variable follows a normal distribution, it is fairly easy to calculate the limits within which the true mean value can be found with a certain probability.

When two batches are compared, the confidence interval for the difference in mean intensity of the measured m/z values gives useful information. If the calculated confidence interval ranges from a negative value to a positive value, i.e. includes zero, it is not possible to statistically declare the calculated difference a true difference. With this approach non-significant changes can be removed.

Figure 6: The low pass nature of a mean operation. ω corresponds to frequency and H to the transfer function. For the mean of n consecutive samples, the z-transform gives a transfer function with magnitude

$$\left|H(e^{j\omega})\right| = \frac{1}{n}\left|\frac{\sin(n\omega/2)}{\sin(\omega/2)}\right|,$$

which shows that only low frequency components slip through the filter. The plot shows the amplitude A against frequency (rad/s) for n = 5.
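The zero-inclusion test can be sketched as follows. This is a simplified illustration, not the procedure used in the study: it uses a normal quantile from the Python standard library, whereas a Student's t quantile would give a wider and more appropriate interval for the small number of replicates typical of such experiments. The replicate intensities are hypothetical.

```python
from math import sqrt
from statistics import NormalDist, mean, variance


def difference_interval(batch_a, batch_b, level=0.95):
    """Confidence interval for the difference in mean intensity (a - b).

    Uses a normal quantile as an approximation; see the note above.
    """
    diff = mean(batch_a) - mean(batch_b)
    std_err = sqrt(variance(batch_a) / len(batch_a)
                   + variance(batch_b) / len(batch_b))
    quantile = NormalDist().inv_cdf(0.5 + level / 2.0)
    return diff - quantile * std_err, diff + quantile * std_err


def is_significant(batch_a, batch_b, level=0.95):
    """True if the interval for the difference does not include zero."""
    low, high = difference_interval(batch_a, batch_b, level)
    return not (low <= 0.0 <= high)


# Hypothetical replicate intensities of one m/z value in three batches.
reference = [10.0, 10.2, 9.8, 10.1]
unchanged = [10.1, 9.9, 10.0, 10.2]
oxidized = [12.0, 12.1, 11.9, 12.2]
print(is_significant(reference, unchanged))
print(is_significant(reference, oxidized))
```

Applied to every m/z value in turn, this keeps only the values whose interval excludes zero as candidates for true chemical differences.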

1.5.3 Principal Component Analysis (PCA)

Principal component analysis is a multivariate projection method designed to extract and display the systematic variation of a data set [8]. The data set is composed of N observations and K variables. Examples of observations are samples of different batches or time points in a continuous process. The variables are often different kinds of analytical results, e.g. UV data, NIR data or m/z data.

Geometrically, the data set can be interpreted by representing each observation as a point in the K-dimensional orthogonal variable space (fig. 7), where each axis constitutes a variable. A new set of orthogonal variables is introduced, where each new variable minimizes the residual variance of the observations in the least squares sense. Minimizing the residual variance is equivalent to maximizing the variance of the observations along the new variable axis. These new variables are called principal components (PCs). It is possible to calculate as many PCs as there are variables.

The Euclidean distance between the projection of an observation on a PC and the PC center point is called the score value. Each observation is represented by a single score value on each PC.

The PC space has the same number of dimensions as the original variable space. However, the dimensionality can be reduced by including in the PCA model only the PCs that together describe most of the variance in the original data set. The degree of variance explained is called the cumulative variance. Two or three PCs are often sufficient, meaning that the original variable space with K dimensions has been reduced to a new variable space with two or three orthogonal axes without any significant loss of information. The eigenvalue of each PC is proportional to the variance explained by that particular PC and can thus be a useful tool when ranking the PCs.

Figure 7: Geometrical interpretation of PCA with only two variables, showing the axes PC 1 and PC 2, the score value and the residual variance of an observation. The cosine of δ corresponds to the loading value of variable 1.

Another important value besides the score value is the cosine of the angle between the original variable axis and the new PC axis. This value is proportional to the importance of the original variable for the direction of the PC and it is called the loading. Each original variable will give rise to a loading value.

In the resulting PCA plot it is easy to find relations between observations. It is also possible to collect information about e.g. outliers and classification. The loading plot, where the loadings are plotted in the PC space, reveals information about relationships between variables.

To facilitate the interpretation of the PCA plot, data is often mean-centered and auto scaled. Mean centering means that the average value of each variable is subtracted from the data set. After mean centering the mean value of each variable is zero. Auto scaling means that the standard deviation is calculated for each variable and that each variable is then multiplied by the resulting scaling factor 1/σ_i, where σ_i is the standard deviation of variable i and i = 1, 2, ..., K. Because all variables are put on a comparable footing, no variable is allowed to dominate over another because of its variance.

PCA is an efficient and nowadays common chemometrical method for decomposition of two-dimensional data sets. It is however important to emphasize that PCA poorly represents nonlinear correlations.
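The steps described above (mean centering, auto scaling, finding the direction of maximum variance, and computing scores and loadings) can be sketched with the Python standard library alone. Power iteration on the covariance matrix is used here purely for brevity and is an assumption of this sketch. It is not the algorithm used in the study, and it yields only the first PC. The data matrix is hypothetical.

```python
import math
import statistics


def autoscale(X):
    """Mean-center each variable (column) and scale it by 1/sigma_i."""
    columns = list(zip(*X))
    means = [statistics.mean(c) for c in columns]
    sigmas = [statistics.stdev(c) for c in columns]  # must be non-zero
    return [[(x - m) / s for x, m, s in zip(row, means, sigmas)] for row in X]


def covariance(Z):
    """K x K covariance matrix of already centered data (rows = observations)."""
    n, k = len(Z), len(Z[0])
    return [[sum(r[i] * r[j] for r in Z) / (n - 1) for j in range(k)]
            for i in range(k)]


def first_pc(C, iterations=200):
    """Leading eigenvector and eigenvalue of C by power iteration."""
    k = len(C)
    v = [1.0 / (i + 1) for i in range(k)]  # arbitrary start vector
    for _ in range(iterations):
        w = [sum(C[i][j] * v[j] for j in range(k)) for i in range(k)]
        norm = math.sqrt(sum(x * x for x in w))
        v = [x / norm for x in w]
    eigenvalue = sum(v[i] * sum(C[i][j] * v[j] for j in range(k))
                     for i in range(k))
    return v, eigenvalue


def pca_first_component(X):
    """Scores, loadings and explained variance fraction of the first PC."""
    Z = autoscale(X)
    C = covariance(Z)
    loadings, eigenvalue = first_pc(C)
    scores = [sum(z * p for z, p in zip(row, loadings)) for row in Z]
    explained = eigenvalue / sum(C[i][i] for i in range(len(C)))
    return scores, loadings, explained


# Hypothetical data: 4 observations (rows) x 3 variables (columns).
X = [[1.0, 2.1, 0.5], [2.0, 3.9, 0.4], [3.0, 6.2, 0.6], [4.0, 7.8, 0.5]]
scores, loadings, explained = pca_first_component(X)
print([round(s, 3) for s in scores])
print([round(p, 3) for p in loadings], round(explained, 3))
```

The loadings returned are the cosines between the original variable axes and the PC axis, and the explained fraction is the eigenvalue divided by the total variance, matching the ranking criterion described above.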

1.5.4 Genetic Algorithms

With RP-HPLC, subtle variations will be introduced in the chromatographic profiles despite identical experimental conditions. These variations can be due to e.g. small changes in TFA concentration (TFA is volatile), column temperature or degradation of the column silica. Since these variations do not represent a true change in the sample but still affect the chromatogram, they make it difficult to draw analytical conclusions. Peak shapes, retention times and baselines are all subject to small variations that are not related to the sample.

To compensate for these subtle variations, different alignment algorithms have been developed [10,21] that try to optimize the alignment between chromatograms by slightly altering peak shape and baseline structure.

Many different mathematical techniques for solving optimization problems have been described. If an explicit function describing the experimental system exists, optimization techniques such as Newton-Raphson or steepest descent can be used with success. These traditional iterative methods are however computationally demanding, and if the system is too complex to be described by an explicit function they will not be successful. The risk of finding a local optimum instead of the global optimum must also be considered when using these techniques.

Another approach to the optimization problem is to ignore explicit relations and search the solution space with biased stochastic methods. A genetic algorithm (GA) is a typical example of such a stochastic optimization method that can handle fairly large and complex systems without enormous computational power [11].

Genetic algorithms simulate biological evolution and consider populations of solutions rather than one solution at a time. A reproduction process that is biased towards better solutions forms the next population, and after a certain number of generations, or when a specific criterion is met, the optimum is hopefully found.

The first step in the genetic algorithm is to create an initial population. This can be done by using a priori information or just random initialization. The chromosomes of the created population can be, e.g. when studying energy minimization, coordinate vectors of the atoms involved.

The next step is to evaluate the chromosomes and to give each a specific value of fitness. The chromosomes that produce the best solutions are given the highest values of fitness. The next population of chromosomes is a combination of the chromosomes in the preceding generation. The number of offspring each chromosome produces is proportional to its value of fitness, i.e. chromosomes with higher fitness have greater impact on the qualities of the next generation than chromosomes with lower fitness. Mutations and cross-over effects are also introduced during the breeding process. These stochastic elements make it possible to escape a local optimum.

Aligning target chromatograms against a reference is a typical problem that could be solved using genetic algorithms [12,3]. To evaluate the fitness of a chromosome, the Euclidean distance between the two chromatograms that are to be aligned can be used.
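The steps above can be illustrated with a deliberately small sketch. It is not the alignment algorithm of [3] or [12]: the chromosome is assumed to be a single integer shift, the cross-over is a simple average of the two parents, and all names (`shift`, `distance`, `align_by_ga`) and parameter values are hypothetical choices made for this example only.

```python
import math
import random

def shift(signal, k):
    """Move a chromatogram k points to the right (left if k < 0), zero padded."""
    n = len(signal)
    return [signal[i - k] if 0 <= i - k < n else 0.0 for i in range(n)]

def distance(a, b):
    """Euclidean distance between two chromatograms of equal length."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def align_by_ga(reference, target, max_shift=10, pop_size=30,
                generations=40, mutation_rate=0.3, seed=0):
    rng = random.Random(seed)

    def fitness(chromosome):
        # Smaller distance to the reference gives higher fitness.
        return 1.0 / (1.0 + distance(reference, shift(target, chromosome)))

    # Initial population: "no shift" as a priori guess plus random shifts.
    population = [0] + [rng.randint(-max_shift, max_shift)
                        for _ in range(pop_size - 1)]
    for _ in range(generations):
        scores = [fitness(c) for c in population]
        best = population[scores.index(max(scores))]
        children = [best]  # elitism: the best chromosome always survives
        while len(children) < pop_size:
            # Parents are drawn with probability proportional to fitness.
            mother, father = rng.choices(population, weights=scores, k=2)
            child = (mother + father) // 2      # crude cross-over
            if rng.random() < mutation_rate:    # mutation
                child += rng.choice([-2, -1, 1, 2])
            children.append(max(-max_shift, min(max_shift, child)))
        population = children
    return max(population, key=fitness)

# Hypothetical test data: one Gaussian peak, the target eluting 6 points late.
reference = [math.exp(-((i - 50) / 4.0) ** 2) for i in range(100)]
target = shift(reference, 6)
best_shift = align_by_ga(reference, target)
aligned = shift(target, best_shift)
```

Because the unshifted target is part of the initial population and the best chromosome is always carried over, the aligned chromatogram can never end up further from the reference than the original target. A realistic alignment would need a richer chromosome, e.g. one shift per peak or segment.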

1.5.5 Wavelet Transformation

Normalization and genetic algorithms are not always sufficient when preprocessing LC/MS data. Stochastic noise often disturbs the interpretation of the chromatograms and introduces larger variations than acceptable. Furthermore, the alignment genetic algorithm produces a better result if the raw data is denoised.

Traditionally in the field of signal analysis, denoising and compression of time dependent data is done using different methods of Fourier transformation, e.g. Fast Fourier Transformation (FFT) or Discrete Fourier Transformation (DFT) [13]. These methods transform the signal from the time dependent space to a frequency dependent space (Fourier space). By using information from Fourier space a low pass filter can be applied and high frequency noise can easily be removed. The filtered signal can then be analyzed in the time dependent space via inverse Fourier transformation. Fourier space also reveals how the energy of the signal is distributed over different frequencies. Frequency components representing only a small part of the energy can be removed without losing any significant information, thus compressing the original signal.
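As an illustration of low pass filtering in Fourier space, the sketch below uses a naive DFT written out in plain Python. The function names (`dft`, `idft`, `low_pass`), the cutoff and the test signal are assumptions made for the example; the "noise" is a single deterministic high frequency sine so that the effect of the filter is easy to verify.

```python
import cmath
import math

def dft(x):
    """Naive discrete Fourier transform (O(n^2), fine for short signals)."""
    n = len(x)
    return [sum(x[t] * cmath.exp(-2j * math.pi * k * t / n) for t in range(n))
            for k in range(n)]

def idft(X):
    """Inverse DFT, returning the real part of the reconstructed signal."""
    n = len(X)
    return [sum(X[k] * cmath.exp(2j * math.pi * k * t / n)
                for k in range(n)).real / n
            for t in range(n)]

def low_pass(x, cutoff):
    """Zero every frequency component above `cutoff` cycles per record."""
    X = dft(x)
    n = len(X)
    for k in range(n):
        freq = min(k, n - k)  # bins k and n - k describe the same frequency
        if freq > cutoff:
            X[k] = 0
    return idft(X)

n = 64
slow = [math.sin(2 * math.pi * 2 * t / n) for t in range(n)]         # signal
fast = [0.3 * math.sin(2 * math.pi * 20 * t / n) for t in range(n)]  # "noise"
noisy = [s + f for s, f in zip(slow, fast)]
filtered = low_pass(noisy, cutoff=5)
```

Here the filtered signal coincides with the slow component up to rounding error, because the two components occupy separate frequency bins. Real chromatographic noise is spread over many frequencies, so the separation is never this clean.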

However, Fourier transformation is not capable of coping with non-stationary signals, where the nature of the signal’s frequency components changes over time. Solving this problem by applying Fourier transformation to small time portions of the signal is hazardous to resolution because of Heisenberg’s uncertainty principle: a good resolution in the time domain leads to a poor resolution in the frequency domain. A better approach when studying this type of signal, e.g. chromatograms, is to use wavelet analysis.

Wavelet analysis is a technique that, unlike Fourier transformation, preserves time information and is capable of revealing aspects of data like trends, discontinuities in higher derivatives and self-similarity. With wavelet transformation Heisenberg’s uncertainty principle does not cause the same problems, since wavelet analysis is a multiresolution technique where resolution is proportional to frequency [14].

A wavelet is a waveform of effectively limited duration that has an average value of zero (fig. 8). Wavelets tend to be irregular and asymmetric. Wavelet analysis can be summarized as the process of describing a signal via a number of shifted and scaled versions of the so-called mother wavelet.

Figure 8: Example of mother wavelet: Daubechies 2 (db2). Axes: A versus t.

Mathematically, wavelet transformation can be described as the inner product of the test signal with the basis functions {4}:

$$\mathrm{CWT}_x^{\psi}(\tau, s) = \Psi_x^{\psi}(\tau, s) = \frac{1}{\sqrt{|s|}} \int x(t)\, \psi^{*}\!\left(\frac{t-\tau}{s}\right) dt \tag{4}$$

where ψ corresponds to the mother wavelet, s is scale, τ is translation and x corresponds to the test signal. The basis functions are the scaled and translated versions of the mother wavelet.

This definition shows that wavelet analysis is a measure of similarity, in the sense of frequency content, between the basis functions and the signal itself. The calculated wavelet coefficient refers to the closeness of the signal to the wavelet at the current scale. Each coefficient has a scale and a translation component. The scale component describes the inverse of the frequency and the translation component describes the position in the time domain of the signal, i.e. together the coefficients describe the frequency components of the signal at all time points.

Using discrete implementations of CWT makes it easy to compress or denoise a signal via low pass filtering [15].
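A minimal sketch of discrete wavelet denoising is given below. For brevity it uses the Haar wavelet rather than Daubechies 2, a single decomposition level and hard thresholding; the function names and the threshold are assumptions for this example, and the signal length is assumed to be even.

```python
import math

def haar_forward(x):
    """One level of the Haar transform: approximation and detail coefficients."""
    s = math.sqrt(2.0)
    approx = [(x[i] + x[i + 1]) / s for i in range(0, len(x), 2)]
    detail = [(x[i] - x[i + 1]) / s for i in range(0, len(x), 2)]
    return approx, detail

def haar_inverse(approx, detail):
    """Rebuild the signal from one level of Haar coefficients."""
    s = math.sqrt(2.0)
    x = []
    for a, d in zip(approx, detail):
        x.extend([(a + d) / s, (a - d) / s])
    return x

def denoise(x, threshold):
    """Hard thresholding: drop detail coefficients not larger than `threshold`."""
    approx, detail = haar_forward(x)
    detail = [d if abs(d) > threshold else 0.0 for d in detail]
    return haar_inverse(approx, detail)

# A step-shaped "peak" with a small alternating disturbance on top.
noisy = [4.1, 3.9, 4.1, 3.9, 10.1, 9.9, 4.1, 3.9]
clean = denoise(noisy, threshold=0.5)
```

The small detail coefficients carry the rapid fluctuations, so removing them smooths the signal while the approximation coefficients keep the position of the peak in time.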

2 Material and Methods

2.1 Equipment and chemicals

2.1.1 Chemicals

IgG1 dissolved in 10 mM acetic acid¹. Guanidine-HCl, analytical grade, was purchased from ICN Biomedicals. NH₄HCO₃, analytical grade, was purchased from BDH Laboratory supplies. Trypsin was purchased from Promega and H₂O₂ from Acros Organics.

2.1.2 Equipment

Agilent 1100 micro-HPLC system

Micromass LCT

Micromass Quattro Ultima

¹ For confidentiality reasons the name of the producing company cannot be mentioned. This clone of IgG is slightly modified compared to native clones.

2.2 Methods

2.2.1 Oxidation of Model Protein

Hydrogen peroxide (H₂O₂) was chosen as oxidizing agent. It has been shown [5] that hydrogen peroxide gently oxidizes proteins.

Prior to oxidization the pH was set using ammonium bicarbonate, NH₄HCO₃ (AmBic): 30 µl 1 M AmBic was added to 250 µl IgG solution (1.1 µg/µl) [16].

The oxidizing agent was prepared from 45.5 µl H₂O₂ (35% w/v) in 955 µl H₂O and then diluted by adding 100 µl to 300 µl H₂O. This reagent is called 1:1 ox-agent and was diluted even further according to table 1.

Five batches, each with 50 µl pH-corrected IgG solution, were prepared according to the following scheme:

Batch nr   Added reagent    Added volume (µl)
1          H₂O              1.0
2          1:30 ox-agent    1.0
3          1:20 ox-agent    1.0
4          1:10 ox-agent    1.0
5          1:1 ox-agent     1.0

Table 1: Oxidization scheme.

The oxidized batches were incubated for 10 min at 4°C and evaporated in a speedvac (40 min, 32°C). The remaining pellets were stored at -75°C until further analysis.

2.2.2 Digestion of Model Protein

The pellets with more or less oxidized IgG were dissolved in 5 µl 6 M Guanidine-HCl by thorough vortexing in order to denature the protein. To enhance denaturation the batches were preincubated for 75 min at 65°C.

Prior to digestion 5 µl 1 M AmBic and 40 µl H₂O were added. The final pH was approximately 8, which corresponds to the pH optimum of the enzyme. The final Guanidine-HCl concentration was 0.6 M and the AmBic concentration 0.1 M. AmBic is volatile and the remaining salt concentrations should be low enough to avoid sensitivity loss during mass spectrometry. For the cleavage reaction 1.25 µl trypsin (1 µg/µl) was added. This corresponds to an enzyme-substrate ratio of 1:40 (w/w). The batches were incubated for 15 h at 37°C.

2.2.3 RP-HPLC

A systematic approach to the problem of optimizing HPLC separation is to use factorial design and a suitable optimization algorithm, e.g. multisimplex [17]. However, complex systems such as peptide digests are not easily handled with this approach within a limited period of time. In this study a more empirical trial and error method was used. The slope of the gradient, flow rate and column temperature are all variables that have to be considered when optimizing HPLC separation.

Acetonitrile with 0.05% TFA was used as organic mobile phase and water with 0.05% TFA as hydrophilic mobile phase.

The following conditions (table 2 and table 3) were found to give acceptable separation within a reasonable period of time and were used throughout the study:

Time (min)   % Organic mobile phase
0            2
30           62
35           90
38           90
40           2
50           2

Table 2: Gradient conditions.

Flow rate:            56 µl/min
Column:               Zorbax Extend-C18, 1 mm × 150 mm, 3.5 µm
Sample temperature:   8°C
Column temperature:   32°C

Table 3: System conditions.

2.2.4 LC/MS

The outlet capillary from the HPLC system was connected to the inlet of the mass spectrometer. The two systems were independently controlled from two different computers.

The operator program controlling the mass spectrometer was Masslynx 4.0; the corresponding program for the HPLC system was Chemstation (2002). The mass spectrometer was tuned and calibrated with NaI following the standard procedure described in [18]. The resolution was found to be 3700, which is regarded as low for this specific instrumental setup. Mass data were collected from m/z = 300 Th to m/z = 1500 Th at a rate of 30 centroid spectra per minute.

2.2.5 Design of Experiment

Three samples of each batch were planned for analysis, but due to a technical breakdown only two samples of batch 1 and batch 5 were run. Theoretically the concentration of digested protein in the samples should be about 7 picomol/µl, so 1 µl should be enough given the sensitivity of the mass spectrometer. However, an injection volume of 1 µl turned out to give almost no result at all. Instead 10 µl was used. In order to reduce the influence of systematic errors the samples were analyzed in a randomized order according to the following scheme (table 4):

Batch, sample   Randomized order   Injection volume (µl)
1               1                  10.0
1b              12                 10.0
1c              15                 Not run
2               3                  10.0
2b              9                  10.0
2c              2                  10.0
3               11                 10.0
3b              10                 10.0
3c              5                  10.0
4               8                  10.0
4b              6                  10.0
4c              4                  10.0
5               13                 10.0
5b              7                  10.0
5c              14                 Not run

Table 4: Injection scheme.

2.3 Data Analysis

2.3.1 Importing Data to Matlab

All data processing was carried out using Matlab 6.1 (Mathworks Inc. USA). Two-dimensional LC/MS data (fig. 9), with one m/z spectrum for each time point, were exported from Masslynx to ASCII format via software called Databridge. The ASCII file contains an array of approximately 160 000 elements where the m/z spectra are saved in ascending time order (fig. 10).

Figure 9: The nature of LC/MS data: intensity (I) as a function of m/z and time.

However, since only m/z values with intensity over the detection limit are collected, the spectra from different time points (fig. 10) will not have identical lengths. This has to be compensated for in order to obtain a matrix that represents the whole LC/MS space, with columns of equal length.

An algorithm with the following pseudo code was constructed to solve this problem.

Step 1 Find the m/z value with the highest intensity at time point n.

Step 2 Collect intensities from all time points (including the present) where this m/z value can be found within a small m/z window. If the m/z value cannot be found at a certain time point, insert zero intensity.

Step 3 Repeat steps 1 and 2 with the m/z value that has the second highest intensity, and so on, until all intensities at time point n have been analyzed.

Step 4 Remove all found intensities from the original data set. Repeat steps 1 to 4 for n = 1, 2, 3, ..., N-1, N (N is the number of time points).

Step 5 Sort the resulting matrix in ascending m/z order.

The size of the m/z window was set to ±0.125 Th. Prior to further analysis the actual peptide map was separated from additional data, i.e. the solvent peak in the beginning and the baseline after the last peptide peak. Only data from t = 5 min to t = 21 min was saved.
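The five steps can be sketched as follows. This is an illustration under stated assumptions, not the original Matlab implementation: the input format (one list of (m/z, intensity) pairs per time point), the function name `build_matrix` and the choice of the nearest peak when several fall inside the window are all hypothetical.

```python
def build_matrix(spectra, window=0.125):
    """spectra[t] is a list of (mz, intensity) pairs for time point t.

    Returns the m/z axis and a matrix with one row per m/z value and
    one column per time point.
    """
    remaining = [dict(spectrum) for spectrum in spectra]  # mz -> intensity
    rows = []
    for n in range(len(remaining)):
        # Step 3: keep going until time point n has no intensities left.
        while remaining[n]:
            # Step 1: most intense m/z value still present at time point n.
            mz_ref = max(remaining[n], key=remaining[n].get)
            row = []
            for spectrum in remaining:
                # Step 2: look for the same m/z value inside the window.
                hits = [mz for mz in spectrum if abs(mz - mz_ref) <= window]
                if hits:
                    mz_hit = min(hits, key=lambda mz: abs(mz - mz_ref))
                    # Step 4: remove the intensity once it has been used.
                    row.append(spectrum.pop(mz_hit))
                else:
                    row.append(0.0)
            rows.append((mz_ref, row))
    rows.sort(key=lambda item: item[0])  # Step 5: ascending m/z order
    mz_axis = [mz for mz, _ in rows]
    matrix = [row for _, row in rows]
    return mz_axis, matrix

# Three hypothetical time points with spectra of unequal length.
spectra = [
    [(500.0, 10.0), (600.0, 5.0)],
    [(500.1, 8.0)],
    [(600.05, 4.0), (700.0, 2.0)],
]
mz_axis, matrix = build_matrix(spectra)
```

In this small example the peaks at 500.0 and 500.1 end up in the same row, and time points where an m/z value is missing receive zero intensity, so every row has the same length.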

2.3.2 Approach 1: Collapsed Time Scale

The first approach to detect differences between the batches was to ignore the chromatographic scale and compare only m/z spectra. Each sample gives rise to an LC/MS matrix as described in 2.3.1. By projecting the data on the m/z axis using Simpson’s numerical integration method {5}, each value on the m/z axis becomes proportional to the amount of the corresponding peptide fragment. Instead of a matrix describing the whole LC/MS space, an m/z vector describes each sample.

Simpson’s rule is shown in {5}. R_T corresponds to the truncation error and h is the distance along the x-axis between two neighboring data points.

Figure 10: Importing data to Matlab. The ASCII file lists m/z and intensity (I) for time point n, n+1, n+2 and so on; the Matlab matrix has one row per m/z value and one column per time point 1, 2, 3, ..., N.

$$\int_a^b f(x)\,dx = \frac{h}{3}\left( f(x_0) + 4\sum_{k=1}^{n} f(x_{2k-1}) + 2\sum_{k=1}^{n-1} f(x_{2k}) + f(x_{2n}) \right) + R_T = S_n + R_T \tag{5}$$
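A minimal sketch of the composite Simpson's rule, applied to one row of an LC/MS matrix, could look as follows. The function name `simpson` is an assumption for this example, and the rule as written requires an odd number of equally spaced data points.

```python
def simpson(y, h):
    """Composite Simpson's rule for equally spaced samples y[0] ... y[2n]."""
    if len(y) < 3 or len(y) % 2 == 0:
        raise ValueError("Simpson's rule needs an odd number of points (>= 3)")
    odd = sum(y[1:-1:2])    # f(x_1), f(x_3), ..., f(x_{2n-1})
    even = sum(y[2:-1:2])   # f(x_2), f(x_4), ..., f(x_{2n-2})
    return h / 3.0 * (y[0] + 4.0 * odd + 2.0 * even + y[-1])

# Check against a known integral: x^2 from 0 to 2 equals 8/3.
h = 0.5
samples = [(i * h) ** 2 for i in range(5)]
area = simpson(samples, h)

# Collapsing the time scale: one integral per m/z row (hypothetical data).
lcms = [[0.0, 2.0, 4.0, 2.0, 0.0],
        [1.0, 1.0, 1.0, 1.0, 1.0]]
mz_vector = [simpson(row, 1.0) for row in lcms]
```

Simpson's rule is exact for polynomials up to third degree, which is why the quadratic test case is reproduced without truncation error.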


2.3.2.1 Normalization

Prior to normalization one has to make sure that the m/z vectors contain exactly the same m/z values. Otherwise different m/z values will be incorrectly normalized against each other. This is done by using the sorting algorithm described in 2.3.1 with m/z vectors instead of time points.

The normalization algorithm can be described by the following pseudo code:

Step 1: Calculate the quotient between the intensity of m/z value n in the reference m/z vector and in the target m/z vector. This is called the normalization quotient. The quotient is only saved if it is smaller than cutoff1 and if both the reference value and the target value are larger than cutoff2.

Step 2: Repeat step 1 for n = 1, 2, 3, ..., N-1, N (N is the number of m/z values).

Step 3: Calculate the mean of the resulting quotients. This is called the normalization parameter.

Step 4: Multiply all intensities of the target m/z vector with the normalization parameter. The resulting m/z vector is the new normalized m/z vector.

Cutoff1 is set to 10 to avoid the influence of quotients that result from division by a fairly small denominator. This situation occurs if an m/z value is close to noise. Cutoff2 is set to zero to avoid division by zero.
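The four steps translate almost directly into code. The sketch below assumes that the reference and target vectors have already been matched so that position n refers to the same m/z value in both; the function names are hypothetical.

```python
def normalization_parameter(reference, target, cutoff1=10.0, cutoff2=0.0):
    """Mean of the accepted reference/target quotients (steps 1 to 3)."""
    quotients = []
    for ref, tgt in zip(reference, target):
        if ref > cutoff2 and tgt > cutoff2:   # both values above cutoff2
            quotient = ref / tgt
            if quotient < cutoff1:            # reject noise-driven quotients
                quotients.append(quotient)
    if not quotients:
        raise ValueError("no m/z value passed the cutoffs")
    return sum(quotients) / len(quotients)

def normalize(reference, target, cutoff1=10.0, cutoff2=0.0):
    """Scale the target vector with the normalization parameter (step 4)."""
    parameter = normalization_parameter(reference, target, cutoff1, cutoff2)
    return [parameter * tgt for tgt in target]

reference = [100.0, 200.0, 0.0, 50.0]
target = [50.0, 100.0, 30.0, 1.0]
parameter = normalization_parameter(reference, target)
normalized = normalize(reference, target)
```

In this example the third pair is skipped because the reference value is zero, and the fourth because its quotient (50) exceeds cutoff1, so the normalization parameter is the mean of the two remaining quotients.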

Before the actual normalization, the normalization algorithm was run with no cutoff1 restraint in order to find any correlation between the size of the quotient and the m/z values.

It is also important to find out whether the variance of the intensities within the normalized matrix is correlated with the m/z values, and whether variance is correlated with the height of the intensities. However, since variance (the square of σ in {6}) depends on the height of the measured intensity, two m/z values of different intensities cannot be compared directly. A better approach is to calculate the relative m/z intensities within each batch, i.e. step 1 and step 2 of the algorithm above, and compare the variance between them. As reference the first normalized sample of each batch was chosen (i.e. samples 1, 2, 3, 4 and 5). The resulting matrix of relative intensity values is from here on called the relative intensity matrix. This approach makes it possible to compare variance between variables of different size.

2.3.2.2 Principal Component Analysis (PCA)

Principal component analysis was performed using built-in algorithms in Matlab, based on a singular value decomposition (SVD) method. The intensities at different m/z values were chosen as variables and each sample represented an individual object. Pre-treatment of data, meaning mean centering and auto scaling, was performed and evaluated.
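As a reminder of what an SVD-based PCA computes, the relations below give a sketch in standard notation. The symbols are not taken from the original text and are introduced here only for illustration: X is assumed to hold one row per sample (object) and one column per m/z value (variable) after mean centering and, optionally, auto scaling, T contains the scores, P the loadings and s_k the singular values on the diagonal of S.

```latex
X = U S V^{\mathsf{T}}, \qquad T = U S, \qquad P = V, \qquad
\text{explained variance of component } k = \frac{s_k^{2}}{\sum_j s_j^{2}}
```

Mean centering removes the average spectrum, while auto scaling gives every m/z value unit variance, so that intense and weak m/z values influence the model equally.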

2.3.2.3 Confidence Interval

At a confidence level of 95%, i.e. the interval is expected to contain the true mean with a probability of 95% provided that the measured variations are normally distributed, the confidence interval is calculated as follows:

$$\mu \in \bar{x} \pm 1.96 \cdot \frac{\sigma}{\sqrt{n}}, \qquad \sigma = \sqrt{\frac{n\sum x^{2} - \left(\sum x\right)^{2}}{n(n-1)}} \tag{6}$$

where μ is the true mean, x̄ is the mean value of the measured m/z intensities within each batch, σ represents the standard deviation and n is the sample size. The factor 1.96 is used because ±1.96 standard deviations enclose 95% of the area under the standard normal curve (Gaussian curve). Notice that an increase of the sample size makes the confidence interval narrower. In this study at most three samples per batch were collected.

The confidence interval of a difference between two stochastic variables can be calculated as follows:

$$\mu_x - \mu_y \in (\bar{x} - \bar{y}) \pm 1.96 \cdot \sqrt{\frac{\sigma_x^{2}}{n_x} + \frac{\sigma_y^{2}}{n_y}} \tag{7}$$

When comparing two batches one is called reference and the other target, where reference is defined as the batch believed to have undergone the smallest chemical change compared to the native state. The calculated confidence interval is used to filter out the non-significant differences.

2.3.2.4 Finding Oxidized Fragments

In order to confirm whether observed modifications are due to oxidization, an algorithm was constructed to swiftly filter out all differences that do not correspond to an increment in mass between reference and target. According to the oxidization reaction described in 1.1 this increment should correspond to the mass of oxygen or any of its possible m/z values.

The pseudo code describing this algorithm is as follows:

Step 1: Calculate the difference between reference and target intensities at m/z value n using the procedure described in 2.3.2.3.

Step 2: If a confirmed difference with opposite sign can also be found at m/z value n + window i, the found differences are defined as oxidizations. Repeat this step for i = 1, 2, 3.

Step 3: Repeat steps 1 and 2 for n = 1, 2, 3, ..., N-1, N (where N is the length of the m/z vector).

Window i is an m/z window that allows for variations in m/z, to compensate for the limited m/z reproducibility. The windows are defined in table 5:

i   Lower m/z   Upper m/z   z
1   15.9        16.1        1
2   7.95        8.03        2
3   5.3         5.6         3
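The confidence interval filter and the search for oxidized fragments can be sketched together as below. Several points are assumptions of this example rather than statements from the text: differences are taken as target minus reference, the non-oxidized fragment is expected to decrease while its oxidized counterpart increases, and the function names are hypothetical. The windows are those of table 5.

```python
import math
import statistics

def significant_difference(ref_values, tgt_values, z=1.96):
    """Target mean minus reference mean, or 0.0 if the interval includes zero."""
    diff = statistics.mean(tgt_values) - statistics.mean(ref_values)
    half_width = z * math.sqrt(
        statistics.variance(ref_values) / len(ref_values)
        + statistics.variance(tgt_values) / len(tgt_values))
    return diff if abs(diff) > half_width else 0.0

# (lower, upper) m/z increments for charge states z = 1, 2, 3.
OXIDATION_WINDOWS = [(15.9, 16.1), (7.95, 8.03), (5.3, 5.6)]

def find_oxidations(mz_axis, differences):
    """Pair each significant decrease with an increase one window higher."""
    pairs = []
    for n, (mz, diff) in enumerate(zip(mz_axis, differences)):
        if diff >= 0:
            continue  # assumed: the unmodified fragment decreases
        for charge, (low, high) in enumerate(OXIDATION_WINDOWS, start=1):
            for m in range(n + 1, len(mz_axis)):
                if low <= mz_axis[m] - mz <= high and differences[m] > 0:
                    pairs.append((mz, mz_axis[m], charge))
    return pairs

# Hypothetical data: a fragment at m/z 500 decreases in the target.
mz_axis = [500.0, 508.0, 516.0, 600.0]
differences = [-100.0, 40.0, 60.0, 0.0]
pairs = find_oxidations(mz_axis, differences)
```

`statistics.variance` uses the n - 1 denominator, which is algebraically the same as the expression for σ² in {6}. With only two or three samples per batch the normal quantile 1.96 gives a rather optimistic interval, so the filter should be read as a screening tool.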

2.3.3 Approach 2: Timescale

Collapsing the time scale means that all chromatographic information is lost. This might not be a problem when looking for chemical modifications that correspond to a fairly large change in m/z. However, modifications that correspond to a very small change in m/z, e.g. deamidation, which only gives rise to ∆(m/z) = 1 Th for z = 1 and ∆(m/z) = 0.5 Th for z = 2, will be much harder to characterize using this approach. It might be beneficial to include chromatographic changes in the analysis, since the combination of chromatographic and m/z information might enhance the possibilities of finding subtle modifications. To investigate this hypothesis the whole LC/MS space was analyzed. Using this approach each sample was represented by the entire LC/MS matrix described in 2.3.1.

Prior to further analysis the LC/MS matrices were sorted using the procedure described in 2.3.1, but with entire LC/MS matrices instead of single vectors. The idea was to make sure that every m/z value was represented in all matrices. Summing all m/z intensities for each time point yields the total ion current (TIC) chromatogram.

2.3.3.1 Wavelet Denoising

Chromatographic data was denoised using a level 1 decomposition with the mother wavelet Daubechies 2 (fig. 8). Built-in Matlab denoising and reconstruction algorithms were used.

Due to the low sampling frequency (only 30 m/z spectra per minute) no data compression was found to be necessary.

The difference d(t) between original TIC data and denoised TIC data is a time dependent function that corresponds to the wavelet operation. For all time points this difference is allocated onto the m/z intensities by adding d(t)/N (t = 5, 5.033, 5.066, ..., 21; N = number of m/z values) to each m/z intensity. This denoising procedure is only valid under the approximation that noise is mainly due to chromatographic, time dependent fluctuations, i.e. that all m/z values for a specific time point behave in an identical way. The result is called the denoised LC/MS matrix, and the difference between the denoised LC/MS matrix and raw data is called the shift surface.
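The allocation step can be sketched as follows. The sign convention is an assumption of this example: d(t) is taken as denoised TIC minus raw TIC, so that adding d(t)/N to every m/z intensity makes the column sums equal the denoised TIC. The denoised TIC is passed in as given; the function name is hypothetical.

```python
def allocate_tic_correction(matrix, tic_denoised):
    """Spread a TIC correction evenly over all m/z values of each time point.

    matrix[i][t] is the intensity of m/z value i at time point t.
    Returns the denoised matrix and the shift surface.
    """
    n_mz = len(matrix)
    n_time = len(matrix[0])
    tic_raw = [sum(matrix[i][t] for i in range(n_mz)) for t in range(n_time)]
    # d(t): change that the denoising made to the TIC at time point t.
    d = [tic_denoised[t] - tic_raw[t] for t in range(n_time)]
    denoised = [[matrix[i][t] + d[t] / n_mz for t in range(n_time)]
                for i in range(n_mz)]
    shift_surface = [[denoised[i][t] - matrix[i][t] for t in range(n_time)]
                     for i in range(n_mz)]
    return denoised, shift_surface

raw = [[1.0, 2.0],
       [3.0, 4.0]]                     # raw TIC: 4.0 and 6.0
denoised, shift_surface = allocate_tic_correction(raw, [5.0, 5.0])
```

Because every m/z value receives the same share, weak m/z values are changed as much in absolute terms as intense ones, which is the approximation stated in the text.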

2.3.3.2 Preprocessing Using Genetic Algorithms and Normalization

Many genetic algorithms constructed to handle the alignment problem are available. The algorithm chosen in this study is described in detail in [3]. It is a combination of a peak shift alignment and a baseline correction algorithm. Inputs to the algorithm are a reference chromatogram and a target chromatogram. The target chromatogram is aligned with the reference chromatogram. Sample 1 was chosen as reference throughout the study.

Table 5: Definition of oxidization windows (see 2.3.2.4).

The first step in the preprocessing procedure is to roughly align the target chromatograms against the reference chromatogram in order to facilitate normalization. This first alignment only uses the peak shift alignment part of the genetic algorithm with 200 generations and a population of 60 individuals. The resulting difference between original target and the aligned chromatogram was allocated onto the target LC/MS matrix according to the procedure described in 2.3.3.1.

The second step is to normalize the target LC/MS matrix against the reference matrix. This is done by collapsing the time scale (as in 2.3.2) for each sample and multiplying the normalization parameter (calculated as described in 2.3.2.1) with the target LC/MS matrix.

The chromatogram corresponding to the normalized LC/MS matrix is once again aligned against the reference chromatogram using genetic algorithms, this time not only the peak shift alignment part but also base line correction. Peak heights are allowed adjustments of maximum 10% of the total height. Both the peak shift alignment part of the algorithm and the base line correction part use 200 generations and a population of 60 individuals. Again the resulting difference between original target and the aligned chromatogram was allocated onto the target LC/MS matrix according to the procedure described in 2.3.3.1.

The resulting target matrix is called preprocessed matrix.

2.3.3.3 Bucketing

To further reduce the influence of chromatographic differences, bucketing of the LC/MS matrix might be a successful approach. Bucketing means that the time scale is divided into several buckets, i.e. the time scale within each bucket window is collapsed using the procedure described in 2.3.2. In this study a bucket window of 10 time points was evaluated. Bucketing reduces the chromatographic resolution, but when excellent chromatographic resolution is not required, as is often the case with LC/MS data, bucketing is a much simpler and more straightforward method than genetic algorithms.
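A minimal bucketing sketch is shown below. To keep it short, each bucket is collapsed by plain summation instead of Simpson's rule, which is a simplification made for this example; the function name `bucket` is hypothetical.

```python
def bucket(matrix, width=10):
    """Collapse every `width` consecutive time points of each m/z row."""
    return [[sum(row[start:start + width])
             for start in range(0, len(row), width)]
            for row in matrix]

# Two hypothetical m/z rows with 25 time points each.
lcms = [[1.0] * 25,
        [float(t) for t in range(25)]]
bucketed = bucket(lcms, width=10)
```

The last bucket is shorter when the number of time points is not a multiple of the window width. Small retention time shifts that stay inside one bucket no longer affect the result, at the price of lower chromatographic resolution.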

2.3.3.4 Confidence Interval

Two mean LC/MS matrices from different batches are compared by extending the procedure described in 2.3.2.3 to the entire LC/MS matrix. The result is also a matrix, where differences can be found not only on the m/z scale but also on the chromatographic scale.

3 Results

3.1 Data Analysis

Under the conditions used in this study the LC/MS space was found to be represented by a matrix with 1798 rows (m/z values) and 1050 columns (time points). The chromatographic TIC profile of the reference sample is shown in fig. 11. The time points between the arrows represent the peptide map, and LC/MS data from this region was saved for further analysis. The resulting LC/MS matrices were found to have 1798 m/z values and 458 time points and are from here on referred to as raw data.

3.1.1 Approach 1: Collapsed Time Scale

Prior to analysis all intensities lower than 200 counts were ignored in order to reduce the influence of noisy m/z values. The resulting LC/MS matrices were found to have 976 m/z values and 458 time points. Collapsing the time scale yields a unique vector with 976 m/z values for each sample.

Figure 11: Chromatographic TIC profile of the reference sample (sample 1); intensity I (counts/s) against time t (min).

3.1.1.1 Normalization

Figure 12 shows the normalization quotients with no cutoff1 restraint.

As can be seen in fig. 12 there appears to be a trend, however small, that a larger m/z value corresponds to a larger normalization quotient. The normalization quotients were fitted against the m/z values with straight lines using least squares algorithms [19]. The slopes of these lines are given in table 6.
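For reference, an ordinary least squares fit of a straight line can be written in a few lines. This is a generic sketch, not the routine of [19]; the function name `fit_line` and the data are hypothetical.

```python
def fit_line(x, y):
    """Ordinary least squares fit of y = slope * x + intercept."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept

# Hypothetical quotients lying exactly on a line with a small positive slope.
mz = [200.0, 400.0, 600.0, 800.0]
quotients = [1.1 + 0.0004 * value for value in mz]
slope, intercept = fit_line(mz, quotients)
```

The slope measures how much the normalization quotient changes per unit of m/z, and the intercept is the fitted quotient at m/z = 0.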

Figure 12: Normalization quotients and fitted linear curves. One panel per sample (1, 1b, 2, 2b, 2c, 3, 3b, 3c, 4, 4b, 4c, 5, 5b), each showing the normalization quotient against m/z (200 to 1600).

Fitted curve, sample     Slope      Interception with y-axis
1                        0          1.00
1b                       0.007      2.44
2                        0.0004     1.10
2b                       0.0003     1.40
2c                       0.0004     1.82
3                        0.0004     1.67
3b                       -0.0004    1.52
3c                       0.0005     1.25
4                        0.0003     1.52
4b                       -0.0003    1.51
4c                       0.0006     1.32
5                        0.0014     1.30
5b                       0.0010     1.39
Standard deviation       0.00048
Mean                     0.00041
Interval of confidence   -0.003, 0.0038

Table 6: Slope of linear curves describing the correlation between the m/z values and the normalization quotients.

Since the interval of confidence describing the mean of the slopes contains zero, no statistical trend can be found regarding the correlation of m/z values and normalization quotients. Therefore it should be safe to calculate an unbiased mean of the normalization quotients and use this as a normalization parameter.

Figure 13 shows the effects of normalization and it is obvious that normalization is necessary. Non-normalized intensity bars are not comparable to reference data. Figure 13 shows only a portion of the m/z spectrum but the trend is identical for most m/z values.

Figure 13: Example of the effects of normalization, showing intensity I (counts/s) against m/z for m/z 448 to 458. Normalized data sample 1b: blue bars. Raw data sample 1b shifted –0.1 m/z: red bars. Reference sample shifted –0.2 m/z: green bars.

Table 7 shows the normalization parameters when using cutoff1 = 10. All normalization parameters except that of the reference itself are larger than one, which means that the reference sample generally has the largest intensities. The large variation and range of the normalization parameters can be interpreted as low reproducibility.

Sample   Normalization parameter
1        1.0000
1b       2.8075
2        1.3614
2b       1.5992
2c       2.1263
3        1.8882
3b       1.2394
3c       1.5658
4        1.6897
4b       1.2209
4c       1.7092
5        1.9833
5b       1.9503

Table 7: Normalization parameters.

Note that the normalization parameter for sample 1 is 1.0000 since sample 1 is also the reference sample. In order to evaluate this normalization technique another identical data set was normalized by dividing each m/z vector by the highest intensity within each sample.

Linear approximations of the batch specific variance for the relative intensity matrix are shown in fig. 14. The slopes of the fitted curves are a measure of the size of the correlation. A large positive slope means that large m/z values have larger variance than small m/z values. Two slopes are negative and three are positive, so even without further statistical analysis it is clear that no correlation between variance and m/z values can be found in the data set.

Figure 14: Fitted curves of first order showing the correlation between m/z value and variance. The numbers correspond to different batches.

Figure 15: Intensity distribution of sample 1; intensity I (counts/s) against dps. dps corresponds to the number of m/z values.

Figure 16: Variance calculated for each intensity in the relative intensity matrix; variance against dps. Color code as in fig. 14. dps corresponds to the number of m/z values.