UPTEC X 04 036 ISSN 1401-2138 AUG 2004
RAGNAR STOLT
Peptide mapping by
capillary/standard LC/MS and multivariate analysis
Master’s degree project
UPTEC X 04 036 Date of issue 2004-08
Author
Ragnar Stolt
Title (English)
Peptide mapping by capillary/standard LC/MS and multivariate analysis
Title (Swedish)
Abstract
The potential of LC/MS peptide mapping combined with multivariate analysis was investigated using IgG1 as a model protein. Five batches of IgG1 were exposed to different levels of an oxidizing agent.
A method to detect differences between the batches using solely MS data was developed and successfully applied. Four peptide fragments containing methionine residues were found to represent the most significant differences and characterized using MS/MS. In order to evaluate different computational strategies Principal Component Analysis (PCA) was used. Attempts were also made in order to use the information from the whole LC/MS space.
Keywords
Peptide Mapping, LC/MS, PCA, PTM, IgG1, Genetic Algorithms, Matlab Programming
Supervisors
Rudolf Kaiser
AstraZeneca, Analytical Development, Södertälje
Scientific reviewer
Per Andrén
Uppsala University, Laboratory for Biological and Medical Mass Spectrometry
Project name Sponsors
Language
English
Security
ISSN 1401-2138 Classification
Supplementary bibliographical information
Pages
47
Biology Education Centre Biomedical Center Husargatan 3 Uppsala
Box 592 S-75124 Uppsala Tel +46 (0)18 4710000 Fax +46 (0)18 555217
Molecular Biotechnology Programme
Uppsala University School of Engineering
Peptide mapping by capillary/standard LC/MS and multivariate analysis
Ragnar Stolt
Summary
In the pharmaceutical industry it is important to develop analytical methods that can find small differences between samples of a protein drug. It must be possible to map which changes are introduced in the protein when it is, for example, stored at room temperature for a long time.
These changes can alter the properties of the protein and possibly also lead to an immune response with serious consequences.
Traditionally, proteins have been characterized in protein chemistry by, among other things, so-called peptide mapping. Peptide mapping consists of enzymatically cleaving a protein and analyzing the resulting peptide fragments by liquid chromatography. Each resulting chromatogram then corresponds to a fingerprint of the protein, and small differences between samples can be traced through small changes in the fingerprints.
In this way it can be determined whether any differences exist, but not what they consist of. This study aims to further improve the possibilities of peptide mapping by analyzing the peptide fragments with a mass spectrometer.
Small changes in the form of oxidation were introduced in a model protein. Using traditional statistical methods, stochastic non-significant differences were filtered out, and the changes could be characterized by tandem mass spectrometry as oxidation of methionine.
Great emphasis has been placed on developing algorithms that can handle the large and complicated data set that mass spectrometric data constitutes.
Degree project, 20 credits, in the Molecular Biotechnology Programme
Uppsala University, August 2004
Table of contents:
1 Introduction
1.1 Model Protein, Immunoglobulin G1
1.2 Peptide Mapping
1.2.1 Digestion
1.3 Reversed Phase High Performance Liquid Chromatography (RP-HPLC)
1.4 Mass Spectrometry (MS)
1.4.1 Ion Source
1.4.2 Time of Flight Analyzer
1.4.3 Tandem Mass Spectrometry
1.4.4 Hybrid Quadrupoles Time of Flight
1.4.5 The Detector
1.5 Data Analysis
1.5.1 Normalization
1.5.2 Confidence Interval
1.5.3 Principal Component Analysis (PCA)
1.5.4 Genetic Algorithms
1.5.5 Wavelet Transformation
2 Material and Methods
2.1 Equipment and Chemicals
2.1.1 Chemicals
2.1.2 Equipment
2.2 Methods
2.2.1 Oxidation of Model Protein
2.2.2 Digestion of Model Protein
2.2.3 RP-HPLC
2.2.4 LC/MS
2.2.5 Design of Experiment
2.3 Data Analysis
2.3.1 Importing Data to Matlab
2.3.2 Approach 1: Collapsed Time Scale
2.3.2.1 Normalization
2.3.2.2 Principal Component Analysis (PCA)
2.3.2.3 Confidence Interval
2.3.2.4 Finding Oxidized Fragments
2.3.3 Approach 2: Time Scale
2.3.3.1 Wavelet Denoising
2.3.3.2 Preprocessing Using Genetic Algorithms and Normalization
2.3.3.3 Bucketing
2.3.3.4 Confidence Interval
3 Results
3.1 Data Analysis
3.1.1 Approach 1: Collapsed Time Scale
3.1.1.1 Normalization
3.1.1.2 Principal Component Analysis (PCA)
3.1.1.2.1 Normalization with Normalization Parameter
3.1.1.2.2 Evaluating Auto Scaling
3.1.1.2.3 Comparing Normalization Techniques
3.1.1.3 Confidence Interval
3.1.2 Approach 2: Time Scale
3.1.2.1 Wavelet Denoising
3.1.2.2 Preprocessing
3.1.2.3 Confidence Interval
3.1.2.4 Bucketing
3.2 Tandem Mass Spectrometry
4 Discussion
5 Acknowledgements
6 References
1 Introduction
Today a number of different recombinant proteins are available on the pharmaceutical market.
The breakthrough for recombinant techniques is often associated with the release of insulin produced in E. coli in 1982 [1]. From the beginning it has been important to develop methods to characterize and analyze recombinant proteins.
Recombinant techniques are complicated by posttranslational modifications (PTMs). Eukaryotic organisms, humans in particular, have developed a complex system for PTMs, and vital proteins will not function properly if these PTMs are missing. In contrast, prokaryotic organisms such as E. coli do not perform any PTMs at all. Pharmaceutical companies must therefore be able to detect differences between the product and the native form of the drug candidate protein. Deviations from the native form can lead to dysfunction of the protein drug and to an unwanted immune response with hazardous consequences.
There is also a great need to investigate the quality of a protein drug. What kind of modifications are introduced in the protein when it is, for example, stored at room temperature for days? A couple of amino acids may be oxidized and others may undergo deamidation or deglycosylation. These questions need to be answered before a new protein drug is commercialized.
A common method to detect differences between protein batches is peptide mapping, using RP-HPLC [2]. To facilitate data analysis a multivariate approach can be successful. Principal component analysis (PCA) is often used [3] to model variations in the data set, making it easier to detect e.g. outliers and to produce information concerning system reproducibility. It is also important to minimize stochastic and system drift variations especially when looking for small differences in the data set. Otherwise it can be difficult to separate non-chemical variations from true physical differences in the protein.
The UV data collected from the HPLC are, however, often not sufficient to disclose small variations in the data set. Furthermore, the UV chromatogram does not give any qualitative information: it cannot answer the question “Where on the protein are the modifications located and what do they consist of?”. To extend the possibilities of peptide mapping, the univariate approach has to be abandoned and more physical information describing the properties of the protein needs to be gathered.
One way to increase the amount of available information is to use an LC/MS system, which gathers information not only in the time domain but also in the m/z domain, resulting in a bivariate peptide map instead of the traditional univariate UV map. Mass data (m/z) can also give qualitative information about the parts of the protein where the modifications are situated. Using MS/MS these parts can be analyzed further, and by comparison with a reference batch individual differing amino acids can be detected.
This project focuses on studying LC/MS peptide maps and developing computational methods to separate true chemical differences from noise without any a priori information.
Found differences will be characterized using MS/MS.
1.1 Model protein, Immunoglobulin G1
Immunoglobulin G (IgG1, κ) was chosen as model protein. The IgG molecule is very important to the immune defense system and is the most abundant antibody, with approximately 13.5 mg/ml in serum [4]. IgG binds to foreign molecules and thereby activates other members of the immune defense system.
IgG consists of two kinds of chains, a smaller light chain and a larger heavy chain, each represented twice (fig. 1). The chains are held together by a total of four disulfide bonds.
The molecular mass of the IgG molecule used in this study is 145 kDa (without any PTMs) and there are 450 amino acids.
There is an N-linked glycosylation site on each heavy chain.
To evaluate the possibilities of an LC/MS peptide map, small chemical changes were introduced by oxidizing IgG. Comparing batches with different amounts of added oxidizing agent should reveal some information about the potential of the analytical LC/MS system.
The amino acid most sensitive to oxidizing agents is methionine. Oxidation of methionine produces methionine sulfoxide [5] in a reversible reaction. This oxidation corresponds to the addition of an oxygen atom, resulting in a 16 Da mass increment. Increasing the concentration of oxidizing agent further can irreversibly oxidize methionine sulfoxide to methionine sulfone. There are six methionine residues in the amino acid sequence.
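The mass increment described above shows up in a mass spectrum as a shift in m/z that depends on the charge of the peptide ion. The following is a minimal sketch of that arithmetic, not part of the original method. It assumes the monoisotopic mass of oxygen (15.9949 Da, the "16 Da" of the text) and that sulfone formation adds a second oxygen atom; the function name `oxidation_mz_shift` is hypothetical.

```python
# Monoisotopic mass of one oxygen atom in Da (assumed value for the
# "16 Da" increment mentioned in the text).
OXYGEN_MASS = 15.9949


def oxidation_mz_shift(n_oxygen: int, charge: int) -> float:
    """Return the m/z shift caused by adding n_oxygen oxygen atoms
    to a peptide ion carrying the given number of charges."""
    if charge < 1:
        raise ValueError("charge must be a positive integer")
    return n_oxygen * OXYGEN_MASS / charge


if __name__ == "__main__":
    # Sulfoxide adds one oxygen, sulfone adds two (relative to methionine).
    for label, n_oxygen in (("sulfoxide", 1), ("sulfone", 2)):
        for charge in (1, 2, 3):
            shift = oxidation_mz_shift(n_oxygen, charge)
            print(f"{label:9s} z={charge}: +{shift:.3f} m/z")
```

A doubly charged peptide with one oxidized methionine is therefore expected roughly 8 m/z units above its unmodified counterpart, which is the kind of paired signal one would look for when comparing batches.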
1.2 Peptide Mapping
Peptide mapping is a method used to create a “fingerprint” specific for a certain protein. The protein is digested with a suitable enzyme and the peptide fragments are separated using e.g. Reversed Phase High Performance Liquid Chromatography (RP-HPLC). Traditionally a UV detector is chosen for data collection. In this study a mass detector was used.
1.2.1 Digestion
The digestion method has to be compatible with the chemical conditions necessary for the HPLC system and the mass spectrometer. It is important to develop a digestion routine with high reproducibility in order to be able to compare the results from different runs. The enzyme used has to digest the protein into a sufficient number of peptide fragments. Too many and too small fragments risk obstructing the data analysis, and the signal to noise ratio will decrease. Too few fragments decrease the amount of information that can be gathered from a peptide map.
Figure 1: Immunoglobulin G1
1.3 Reversed Phase High Performance Liquid Chromatography (RP-HPLC)
RP-HPLC is a widely used and well-established tool for the analysis and purification of biomolecules, e.g. a protein digest. The system uses high pressure to force a mobile phase through a column packed with porous micro particles. Particle sizes typically range between 3 and 50 µm. The smaller the particle diameter, the more pressure will be generated in the system.
The particle pore size generally ranges between 100 and 1000 Å. Smaller pore silicas may sometimes separate small or hydrophilic peptides better than larger pore silicas [6].
The most common columns are packed with silica particles to which different alkylsilane chains are chemically attached. Butyl (C4), octyl (C8) and octadecyl (C18) silane chains are the most commonly used. C4 is generally used for proteins and C18 for small molecules. The idea is that large proteins with many hydrophobic moieties need shorter chains on the stationary phase for sufficient hydrophobic interaction. The choice of column diameter depends on the required sample load and the flow rate. Small-bore columns (1.0 and 2.1 mm i.d.) can improve sensitivity and reduce solvent usage. Column length does not significantly affect most polypeptide separations [6]. To speed up the analytical cycle time, short columns with high flow rate and fast gradients can be used at the expense of resolution.
An HPLC system optimized for columns with small inner diameters and low flows is called micro-HPLC. A micro-HPLC system has narrow capillaries, typically 50 µm i.d. The pumps commonly work with a split flow, enabling low flow with high accuracy. The main advantages of micro-HPLC are reduced mobile phase solvent consumption and high sensitivity, which makes it possible to load low amounts of sample and facilitates the connection to a mass spectrometer system.
In this form of liquid chromatography the stationary phase is non-polar and the mobile phase relatively polar.
Analytes will thus be separated mainly due to their hydrophobic properties. During a gradient separation two different kinds of solvents are used as mobile phase. One of the solvents is relatively hydrophilic and the other is relatively organic (hydrophobic). The two solvents are mixed together and the relative content of the organic solvent increases with time.
At the beginning of the gradient the analytes are attached to the solid phase through hydrophobic interaction. When the organic content of the mobile phase reaches a critical value desorption will take place and the analytes will pass through the column. The majority of peptides (10 to 30 amino acid residues in length) have reached their critical value when the gradient reaches 30% organic content. The separation is however also influenced by molecular size. Smaller molecules will move slower through the column than larger ones because smaller molecules have access to a larger volume of the column. The partitioning of the analytes between mobile and solid phase will also impact the separation process. However it is quite safe to say that polar analytes elute first and non-polar analytes last.
Figure 2: The idea behind gradient separation with RP-HPLC. The polypeptide enters the column at injection, adsorbs to the hydrophobic surface, and desorbs from the stationary phase when the organic solvent reaches a critical concentration.
To get separation mainly based on hydrophobic differences an ion-pairing agent is often added to the mobile phase in order to serve one or more of the following functions: pH control, suppression of unwanted interactions between basic analytes and the silanol surface, suppression of unwanted interactions between analytes, or complexation with oppositely charged ionic groups. It has been shown [6] that addition of an ion-pairing agent has a dramatic beneficial effect on RP-HPLC, not only enhancing separation but also improving peak symmetry. Trifluoroacetic acid (TFA) is a widely used ion-pairing agent. It is volatile and has a long history of proven reliability.
When the HPLC system is connected to a mass spectrometer the ion-pairing agent must be chosen with care, because ion-pairing agents can cause ion suppression, which reduces the sensitivity of the mass spectrometer system.
Temperature can also influence peptide separations. Higher temperature is associated with increased diffusion according to Einstein’s diffusion constant D:

$$D = \frac{k_B T}{6 \pi \eta r} \tag{1}$$

where η is the viscosity, k_B is Boltzmann’s constant, T is the temperature and r is the radius of the diffusing particle.
It is however difficult to draw any general conclusions, because it has been shown that an increase in temperature increases resolution between certain analytes and decreases resolution between others [6]. For good reproducibility firm temperature control has to be applied.
RP-HPLC is one of the most widely used forms of chromatography, mainly because of its high resolution. Chromatographic resolution is defined as the ratio of the difference in retention time between two neighboring peaks A and B and the mean of their base widths:

$$R_S = \frac{\Delta t_R}{w_{av}} = 2\,\frac{t_{R,B} - t_{R,A}}{w_A + w_B} \tag{2}$$

where t_R is the retention time and w the base width. Using RP-HPLC it is possible to separate peptides whose sequences differ by only a single amino acid residue.
1.4 Mass Spectrometry (MS)
A mass spectrometer is an analytical instrument that determines the molecular weight of ions according to their mass to charge ratio m/z. The device consists of three basic components: the ionization source, the mass analyzer/filter and the detector.
1.4.1 Ion Source
Ionization is an essential part of the mass spectrometric process. The molecules have to be charged and in the gaseous phase in order to be accelerated by the electrical field inside the mass spectrometer. Several different ionization techniques have been developed. The techniques most often used for the analysis of peptides and proteins are matrix-assisted laser desorption/ionization (MALDI) and electrospray ionization (ESI). Both are so-called soft ionization techniques, which means that the molecules are ionized without fragmentation. In this study ESI was used.
ESI creates a fine spray of highly charged droplets in the presence of a strong electric field.
The sample solution is injected at a constant flow, which makes ESI particularly useful when the sample solution is introduced by an LC system. If the LC flow is compatible with the mass spectrometer an online LC/MS system is easily established. The charged droplets are introduced to the mass analyzer compartment together with dry gas, heat or both, which leads to solvent vaporization. When the droplets decrease in volume the electric field density increases, and eventually repulsion will exceed surface tension and charged molecules will start to leave the droplet via a so-called Taylor cone [7]. This process is conducted at atmospheric pressure and is sometimes also called atmospheric pressure ionization (API).
Using ESI it is possible to study molecules with masses up to 150 000 Dalton, mainly because ESI generates multiply charged molecules, which means that a low upper m/z limit is sufficient for the analysis of large biomolecules. A typical detection limit using ESI is in the femtomole range [7].
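The effect of multiple charging on the required m/z range can be illustrated with a short calculation. This is a sketch under stated assumptions: ions are assumed to be formed by protonation only, so that m/z = (M + z·m_p)/z with m_p the proton mass, and the charge states used are illustrative values, not measurements from this study. The function name `mz_of_charge_state` is hypothetical.

```python
# Mass of a proton in Da (assumed charge carrier in positive-mode ESI).
PROTON_MASS = 1.007276


def mz_of_charge_state(neutral_mass: float, charge: int) -> float:
    """Return m/z of a molecule of the given neutral mass (Da)
    carrying `charge` protons."""
    if charge < 1:
        raise ValueError("charge must be a positive integer")
    return (neutral_mass + charge * PROTON_MASS) / charge


if __name__ == "__main__":
    # 145 kDa is the mass of the model protein given in the text;
    # the charge states are hypothetical examples.
    for z in (1, 40, 50, 60):
        print(f"z = {z:2d}: m/z = {mz_of_charge_state(145_000.0, z):.1f}")
```

With a few tens of charges the signal falls at a few thousand m/z units, far below the m/z value of the singly charged molecule.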
1.4.2 Time of Flight Analyzer
The most commonly used analyzers are quadrupoles, Fourier transform ion cyclotron resonance analyzers and time of flight (TOF) analyzers. In this study a TOF analyzer with a reflectron was used. The TOF analyzer is the simplest construction. It is based on the idea that ions are accelerated through an electrical field so that ions of equal charge receive the same kinetic energy. The ions will then differ in velocity according to their mass to charge ratio:

$$v = \sqrt{2U \cdot \left(\frac{m}{z}\right)^{-1}} \tag{3}$$

where v is the ion velocity and U is the accelerating voltage. The differences in velocity will in turn lead to different flight times from the ion source to the detector. One advantage of TOF instruments is that no scanning of the m/z spectrum is necessary. Another advantage is that there is virtually no upper mass limit using TOF. However the resolving power of TOF instruments is low. Resolving power, also called resolution, is defined as the ability of a mass spectrometer to distinguish between different m/z ratios at a certain peak height. For a single peak in the mass spectrum, resolution is commonly defined as the ratio between the m/z value and the full width of the peak at half maximum. Analyzers with a reflectron can improve resolution.
A reflectron is a device with a gradient electrostatic field strength. This so-called “ion mirror” will redirect the ion beam towards the detector. Ions with greater kinetic energy will penetrate deeper into the reflectron than low energetic ions. This mechanism compensates for a wide distribution of initial kinetic energy and thus increases mass resolution.
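Equation (3) can be turned into a flight time by dividing a drift length by the velocity. The sketch below is only an illustration of that relation: the accelerating voltage (5 kV) and drift length (1 m) are hypothetical values, not the settings of the instrument used in this study, and the reflectron is ignored. To work in SI units, m/z is converted using the elementary charge and the atomic mass constant.

```python
import math

ELEMENTARY_CHARGE = 1.602176634e-19  # C
DALTON = 1.66053906660e-27           # kg


def ion_velocity(mz: float, voltage: float) -> float:
    """Velocity (m/s) of an ion with mass-to-charge ratio `mz`
    (Da per elementary charge) after acceleration through `voltage` (V),
    following equation (3): v = sqrt(2 U z / m)."""
    return math.sqrt(2.0 * ELEMENTARY_CHARGE * voltage / (mz * DALTON))


def flight_time(mz: float, voltage: float, path_length: float) -> float:
    """Time (s) needed to travel a field-free path of `path_length` metres."""
    return path_length / ion_velocity(mz, voltage)


if __name__ == "__main__":
    VOLTAGE = 5000.0   # V, hypothetical accelerating voltage
    PATH = 1.0         # m, hypothetical drift length
    for mz in (500.0, 1000.0, 2000.0):
        t_us = flight_time(mz, VOLTAGE, PATH) * 1e6
        print(f"m/z {mz:6.0f}: flight time {t_us:.2f} microseconds")
```

Because the flight time scales with the square root of m/z, a fourfold increase in m/z doubles the flight time.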
1.4.3 Tandem Mass Spectrometry
The peptide has to be fragmented in order to determine its sequence. Fragmentation can be achieved by inducing ion-molecule collisions in a process called collision induced dissociation (CID). The idea behind CID is to select the peptide ion of interest and introduce it into a collision cell with a collision gas, often argon, which breaks the peptide backbone. From the resulting daughter ion spectrum the m/z values of the amino acids involved can be found. The peptide fragments can be divided into different series. When charge is retained on the N-terminal side (fig. 3) the resulting series of fragments are called a_n, b_n and c_n. When charge is retained on the C-terminal side fragmentation can also occur at three different positions, giving the series x_n, y_n and z_n.
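As an illustration of how the b and y series locate a modification, the following sketch computes singly charged b and y ion m/z values for a short peptide with and without an oxidized methionine. It is an assumption-laden example, not the procedure used in this study: the peptide "AMK" is hypothetical, the residue masses are standard monoisotopic values, and the function name `fragment_ions` is invented for this example.

```python
# Monoisotopic residue masses in Da (standard values).
RESIDUE_MASS = {
    "G": 57.02146, "A": 71.03711, "S": 87.03203, "P": 97.05276,
    "V": 99.06841, "T": 101.04768, "C": 103.00919, "L": 113.08406,
    "I": 113.08406, "N": 114.04293, "D": 115.02694, "Q": 128.05858,
    "K": 128.09496, "E": 129.04259, "M": 131.04049, "H": 137.05891,
    "F": 147.06841, "R": 156.10111, "Y": 163.06333, "W": 186.07931,
}
WATER = 18.010565
PROTON = 1.007276
OXIDATION = 15.994915  # one added oxygen atom


def fragment_ions(sequence, oxidized_positions=()):
    """Return (b_ions, y_ions) as lists of singly charged m/z values.

    `oxidized_positions` holds zero-based indices of residues that
    carry one extra oxygen atom.
    """
    masses = [
        RESIDUE_MASS[aa] + (OXIDATION if i in oxidized_positions else 0.0)
        for i, aa in enumerate(sequence)
    ]
    # b_i: first i residues plus a proton.
    b_ions = [sum(masses[:i]) + PROTON for i in range(1, len(masses))]
    # y_i: last i residues plus water plus a proton.
    y_ions = [sum(masses[-i:]) + WATER + PROTON for i in range(1, len(masses))]
    return b_ions, y_ions


if __name__ == "__main__":
    native = fragment_ions("AMK")
    oxidized = fragment_ions("AMK", oxidized_positions={1})
    for name, (b, y) in (("native", native), ("oxidized", oxidized)):
        print(name, "b:", [round(v, 3) for v in b], "y:", [round(v, 3) for v in y])
```

Only the fragments that contain the modified residue shift by the mass of one oxygen atom, so comparing the two series against a reference batch narrows the modification down to a single residue.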
1.4.4 Hybrid Quadrupoles Time of Flight
A quadrupole device can be used to select the peptide ions that are to be investigated by CID.
A quadrupole consists of four parallel rods to which a direct current and a radio frequency electromagnetic field are applied. When ions reach the quadrupole they will start to oscillate depending on the radio frequency field and their m/z value. Only ions with a particular m/z value will be able to pass through the quadrupole; the rest will collide with the rods. Thus the quadrupole works as a mass filter. By scanning the radio frequency field an entire mass spectrum can be obtained.
Figure 3: Collision induced dissociation (CID). The scheme shows a peptide with four residues (R1 to R4) and the backbone cleavage positions that give the a, b and c fragments (charge retained on the N-terminal side) and the x, y and z fragments (charge retained on the C-terminal side). Ions of the b and y series often dominate the daughter ion spectrum.
The instrument used in this study is a quadrupole TOF hybrid (fig. 4). The quadrupole is used to select an ion of interest, which is fragmented in the collision cell. The resulting daughter ions are analyzed using a TOF device and a detector.
1.4.5 The Detector
The detector (fig. 5) converts the kinetic energy of the arriving ions to an electrical current.
The amplitude of the current is correlated with the number of ions reaching the detector. Most detectors available today build on the principle of electron multiplication. The detector in this particular instrument is called a microchannel plate (MCP). An MCP detector consists of a very large number of electron multiplier tubes.
When a charged particle collides with the tube wall secondary electrons will be emitted and reflected further down the tube, leading to a cascade of secondary electrons well gathered in space. The signal is amplified in the MCP detector by a factor of typically 10^3 to 10^4.
1.5 Data Analysis
1.5.1 Normalization
When using an LC/MS device, small differences in sample concentration, injection volume or loss of sensitivity will introduce variations in the data set that complicate the comparison between different batches. These variations can however be compensated for by normalizing the data set. Most normalization techniques, e.g. for HPLC data, are based on an internal standard or an external standard. Normalization in this context means that the data set is divided by the area or height of the standard peak. The peak with the largest area in the chromatogram can often be used as standard peak with good results.
Figure 4: The concept behind a Quadrupole-TOF system. Ions from the ESI source enter the quadrupole, the selected ion is fragmented in the collision cell (Ar), and the ion fragments pass through the TOF analyzer with reflectron to the detector.
Figure 5: Microchannel plate detector. Labels: ions, hollow glass capillary with secondary electron emission coating, secondary electrons, photoelectron.
However when analyzing MS data with a large number (up to thousands) of m/z values normalization is not a trivial task. Which m/z value should be chosen as standard peak in order to produce the most accurate normalization? What if an m/z value with large variation or, equally bad, with too little variation is chosen? Under these conditions the normalized data set will poorly represent the true values. A better approach is to calculate intensity quotients between the m/z values of a reference sample and a target sample. The mean of these quotients can be used as a normalization parameter.
Averaging the quotients works as a low pass filter (fig. 6), so that only significant trends in the data set are represented in the normalization parameter, thus minimizing the impact of m/z values with large variation. This normalization technique works well provided that the number of m/z values is fairly large and the chemical differences between the batches are fairly small. Large chemical differences will slip through the low pass filter and give rise to a skewed normalization.
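The quotient-based normalization described above can be sketched in a few lines. This is a minimal illustration, not the Matlab implementation used in the study. It assumes that the two samples are given as intensity lists over the same m/z values, that the quotient is taken as target divided by reference, and that m/z values with zero reference intensity are skipped; the function names are hypothetical.

```python
from statistics import mean


def normalization_parameter(reference, target):
    """Mean of the intensity quotients target/reference over all m/z
    values where the reference intensity is non-zero."""
    if len(reference) != len(target):
        raise ValueError("samples must cover the same m/z values")
    quotients = [t / r for r, t in zip(reference, target) if r > 0]
    if not quotients:
        raise ValueError("no usable m/z values in the reference sample")
    return mean(quotients)


def normalize(reference, target):
    """Scale the target sample so that it is comparable with the reference."""
    k = normalization_parameter(reference, target)
    return [t / k for t in target]


if __name__ == "__main__":
    # Hypothetical intensities: the target was injected at about twice the
    # amount, and one m/z value (the third) differs chemically.
    reference = [100.0, 250.0, 80.0, 400.0, 120.0]
    target = [201.0, 498.0, 240.0, 803.0, 239.0]
    print("normalization parameter:", round(normalization_parameter(reference, target), 3))
    print("normalized target:", [round(v, 1) for v in normalize(reference, target)])
```

In the example the single deviating m/z value pulls the parameter slightly above 2, which illustrates why the approach relies on many m/z values and small chemical differences.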
1.5.2 Confidence Interval
A classic way of treating the problem with stochastic variation between different samples of the same batch is to estimate a confidence interval. Assuming that the observed variable belongs to the normal distribution it is fairly easy to calculate the probability of finding the true mean value within the variation of the measured variable. Or the other way around: it is possible to calculate the limits within which the true mean value with a certain amount of probability can be found.
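The interval described above can be sketched for the common case where the variance is unknown and estimated from a few replicates, so that Student's t distribution is used. The example is an illustration under assumptions: the replicate intensities are invented, the 95% level is chosen for the example, and the critical values are taken from a standard t table for up to 10 degrees of freedom. The function name `confidence_interval_95` is hypothetical.

```python
import math
from statistics import mean, stdev

# Two-sided 95% critical values of Student's t distribution,
# indexed by degrees of freedom (standard table values).
T_CRIT_95 = {
    1: 12.706, 2: 4.303, 3: 3.182, 4: 2.776, 5: 2.571,
    6: 2.447, 7: 2.365, 8: 2.306, 9: 2.262, 10: 2.228,
}


def confidence_interval_95(values):
    """Return (lower, upper) limits of the 95% confidence interval
    for the true mean, assuming normally distributed values."""
    n = len(values)
    df = n - 1
    if df not in T_CRIT_95:
        raise ValueError("this sketch supports 2 to 11 replicates")
    half_width = T_CRIT_95[df] * stdev(values) / math.sqrt(n)
    centre = mean(values)
    return centre - half_width, centre + half_width


if __name__ == "__main__":
    # Hypothetical intensities of one m/z value in five replicate runs.
    replicates = [10.0, 12.0, 11.0, 13.0, 9.0]
    low, high = confidence_interval_95(replicates)
    print(f"95% confidence interval: {low:.2f} to {high:.2f}")
```

The interval narrows with the square root of the number of replicates, which is one reason replicate injections matter when small differences are sought.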
Comparing two batches the confidence interval for the differences in mean intensity of the measured m/z values will give some useful information. If the calculated confidence interval ranges from a positive value to a negative value, i.e. includes zero, it is not possible to
Figure 6: The low pass nature of a mean operation. ω corresponds to frequency and H to the transfer function. The z-transform shows that only low frequency components will slip through the filter. The plot shows the amplitude A against frequency (rad/s) for n = 5.

$$\bar{x}(n) = \frac{x(n) + x(n-1) + x(n-2) + \dots + x(1)}{n}$$

$$\bar{X}(z) = \frac{1 + z^{-1} + z^{-2} + \dots + z^{-(n-1)}}{n}\,X(z)$$

$$H(e^{j\omega}) = \frac{1}{n}\,\frac{\sin(n\omega/2)}{\sin(\omega/2)}\,e^{-j\omega(n-1)/2}$$