Building a standard operating procedure for the analysis of mass spectrometry data

UPTEC X 12 021

Master's thesis, 30 credits. October 2012

Building a standard operating procedure for the analysis of mass spectrometry data

Niklas Malmqvist


Bioinformatics Engineering Program

Uppsala University School of Engineering

UPTEC X 12 021

Date of issue 2012-09

Author

Niklas Malmqvist

Title (English)

Building a standard operating procedure for the analysis of mass spectrometry data

Abstract

Mass spectrometry (MS) is used in peptidomics to find novel endogenous peptides that may lead to the discovery of new biomarkers. Identifying endogenous peptides from MS is a time-consuming and challenging task; storing identified peptides in a database and comparing them against unknown peptides from other MS runs avoids redoing the identification. MS produces large amounts of data, making interpretation difficult. A platform to support the identification of endogenous peptides was developed in this project, including a library application for storing peptide data. Machine learning methods were also used to search for patterns in peptide abundance that could be correlated to a specific sample or treatment type, which can help focus the identification work on peptides of high interest.

Keywords

Mass spectrometry, database, spectra, peptide annotation, pattern recognition

Supervisors

Claes Andersson

Uppsala University

Scientific reviewer

Mats Gustafsson

Uppsala University

Language
English

ISSN 1401-2138

Pages 68

Biology Education Centre, Biomedical Center, Husargatan 3, Uppsala


Building a standard operating procedure for the analysis of mass spectrometry data

Niklas Malmqvist

Popular science summary (Populärvetenskaplig sammanfattning)

Mass spectrometry (MS) is used to determine the mass of molecules. A common application of MS is the analysis of the body's own (endogenous) peptides. A peptide is a part of a protein and acts as a signaling substance in the body, for example in hormone regulation.

By applying MS to peptides, one can also determine their composition of smaller building blocks: amino acids. This is called sequencing the peptides and is used for their identification. In this way, the understanding of the function and role of peptides in the body is increased. This knowledge can be used to understand the course of diseases and to design drugs.

In this project, a method was developed to facilitate the identification of endogenous peptides. A database application was created to store information from MS and to allow this information to be searched with new experimental data. This makes it possible to see which peptides have occurred in previous experiments and which amino acids they consist of. MS generates large amounts of data, which can make it difficult to get an overview of an experiment. One computer-aided way to analyze large amounts of data is pattern recognition. In this project, the method was used to investigate patterns in peptide occurrence for a given tissue. With this information, the identification work can then be focused on precisely those peptides that appear to be particularly significant.

Master's thesis, 30 credits
Bioinformatics Engineering Program, Uppsala University, June 2012


Contents

1 Preface
  1.1 Introduction
  1.2 Project goal
2 Background
  2.1 Mass spectrometry
    2.1.1 Ionization source
    2.1.2 Mass analyzer
    2.1.3 Detector
    2.1.4 Peptide identification
    2.1.5 File formats
  2.2 Existing tools and software
  2.3 Spectra processing and scoring
    2.3.1 Peak picking
    2.3.2 Spectral deconvolution
    2.3.3 Similarity score
    2.3.4 Significance measures
    2.3.5 Scoring in SpectraST
  2.4 Pattern recognition
    2.4.1 Classification
    2.4.2 Feature extraction
    2.4.3 Attribute importance
    2.4.4 Cross-validation
    2.4.5 Permutation test
3 Experimental setup and algorithmic solutions
  3.1 Platform layout
  3.2 Data set
  3.3 Peptide library
    3.3.1 Spectral library
    3.3.2 Annotation library
  3.4 Pattern recognition
    3.4.1 Data sub-setting and filtering
    3.4.2 Feature extraction - clustering
    3.4.3 Pattern recognition
4 Results
  4.1 Using the peptide library
    4.1.1 Importing spectral data
    4.1.2 Searching the library
  4.2 Performance
  4.3 Pattern recognition
    4.3.1 Data sub-setting and filtering
    4.3.2 Feature extraction, clustering
    4.3.3 Pattern recognition
    4.3.4 Attribute importance
5 Discussion
  5.1 Peptide library
  5.2 Pattern recognition
6 Future work
7 Acknowledgments
8 Bibliography
Appendix A Software setup
  A.1 Requirements
  A.2 Configure and build SpectraST
Appendix B Source code
Appendix C User manual


Keywords

Endogenous: Exists within a living organism (in the present context)
LCMS: Liquid Chromatography Mass Spectrometry
MGF: Mascot Generic Format
MS: Mass spectrometry
PCA: Principal Component Analysis
Peptidomics: The study of endogenous peptides, their discovery and identification
Proteomics: The study of proteins, in particular their structure and function
PTM(s): Post-translational modification(s)
Snap-frozen: Rapidly frozen (sample) moments after extraction
Stabilized: Rapidly heated (sample) moments after extraction
Taxotere: A cancer treatment drug
Trypsin: A protease that exists in the digestive system of many organisms


1 Preface

This is a master's thesis submitted to fulfill the requirements for a Master of Science degree in Bioinformatics Engineering from Uppsala University, Sweden. The project behind the thesis was conducted at Denator AB, Uppsala, Sweden under the supervision of Dr. Karl Sköld. The main supervisor was Dr. Claes Andersson at the Department of Medical Sciences, Uppsala University, and the scientific examiner was Prof. Mats Gustafsson at the same department.

1.1 Introduction

In peptidomics, research is focused on discovering and identifying endogenous peptides and their function. Peptides act as signal and regulatory substances in living organisms. Peptides can also undergo PTMs, and their changes in expression can be interpreted to gain knowledge about biological processes, for example diseases. Investigating and discovering endogenous peptides can therefore lead to new biomarkers for the diagnosis or treatment of disease. Mass spectrometry (MS) is an analysis tool for determining the molecular mass of sample analytes and is used to discover novel endogenous peptides [48]. A biological sample, for example a tissue sample, is analyzed by MS to obtain its peptide composition. It is crucial that the biochemical integrity of the sample is maintained in order to capture the true peptide contents. It has been shown that enzymatic activity in the samples leads to unwanted degradation products within one minute of extraction [43].

Denator provides a technology for stabilizing biological samples, the Stabilizor T1® (ST1) instrument [47]. Samples are stabilized prior to analysis by heat inactivation of all enzymes, effectively stopping the degradation of proteins and peptides. Previous studies have concluded that the ST1 is in fact essential for finding biomarkers related to disease [15]. Denator has recently launched a platform for peptidomics in collaboration with Gothenburg University. Customers can send samples to the facility for stabilization, preparation and liquid chromatography mass spectrometry (LCMS).

LCMS can be used to identify peptides that differ in concentration in different biological groups, for example patients or treatment groups.

Any peptide found in a sample needs to be identified, or annotated, in order to yield any kind of biological information. This is usually a time-consuming process which also requires experience and previous knowledge. The mass information gained from MS is used to find the amino acid composition of the peptide. The masses from the MS run are compared against the theoretical masses of amino acids; this is done using peptide search engines such as Mascot [24] or X!Tandem [49].

The identification process differs between peptides and proteins. Proteins are commonly identified via information from their cleavage products after enzymatic digestion, e.g. by trypsin in vitro.

These products are biologically inactive peptides that have specific terminal amino acids. In contrast, endogenous peptides do not provide any information on terminal amino acids, which makes their identification more challenging [14]. In addition, post-translational modifications (PTMs) add a new dimension of difficulty to the identification of endogenous peptides.

It is always possible that the same peptide is detected in different MS runs; however, this is not obvious without knowing the identity of the peptide. Once a peptide from a specific MS run has been annotated, it is desirable to be able to find that specific peptide in other MS runs without having to repeat the identification process. There is also no guarantee that an identity can be determined for a certain peptide, but it can still be of interest to observe an unknown peptide occurring in different MS runs, e.g. from different tissues. Finding annotations is time-consuming not only because of the long execution times of the search engines. Ambiguous identifications are frequently found, which require detailed inspection of the results in order to confirm that the identity is likely to be correct.

While MS is useful in the identification of new biomarkers, it also generates information on thousands of peptides from a single sample. The large amount of data creates problems in interpreting the results since it is not always obvious which peptides may be of interest.

In this project, a platform was developed to help the identification process of endogenous peptides and to find patterns in peptide abundance among different sample groups, in order to focus on potentially interesting peptides. The platform has two components: a database for storage of MS data and a machine learning component that investigates possible patterns in the data. Currently available tools for peptide analysis focus on known, identified peptides, while the idea in this project is to give a certain peptide an identity even if its amino acid sequence is unknown. The intention is to be able to discern between new peptides and peptides that have already been seen before, even if their exact identity is unknown. Available tools also use databases that contain annotated data, whether the database is public or local. An important aspect of this project is to be able to create a custom database that uses unannotated data.

1.2 Project goal

The project's goal is to construct a software platform that helps the identification of unknown peptides and is able to find a peptide-level signature, consisting of a certain combination of peptides, in order to discern between sample groups. A peptide-level signature is, for example, a difference in a certain peptide's abundance between untreated and treated specimens. The platform has two components: the spectral library application called Spectra Matching and Annotation Software Helper (SMASH), and a machine learning component that investigates possible patterns in the data.

The platform offers three key services:

1. Library construction: Build a library of peptides from mass spectrometry (MS) runs. These include both peptides annotated and unannotated by a protein search engine, e.g. Mascot. Unannotated peptides use their MS/MS spectra (peptide fragment spectra) as identifiers, rather than an amino acid sequence.

2. Peptide matching: Find peptides that have been present in previous experiments but may not have been identified by a search engine such as Mascot.

3. Pattern recognition: Machine learning methods to find patterns in peptide levels that are specific for a sample group and in that way find a signature for e.g. disease treatment effect.


2 Background

This section covers the theoretical background to the various aspects of the project.

2.1 Mass spectrometry

Mass spectrometry is used to analyze particle masses, elemental composition and chemical struc- ture of a target molecule. There are a number of different technologies used in MS and not all of them will be covered here since the aim is to give a short introduction to MS. A mass spectrometer has three main components: the ionization source, the mass analyzer and the detector. In short, the technology works by first ionizing the molecules in a sample, separating them according to their mass-to-charge ratio (m/z) and detecting the separated ions. The detected ion signals are stored in a mass spectrum, where their m/z ratios are plotted versus their relative abundance [4].

2.1.1 Ionization source

The samples can be introduced into the mass spectrometer either directly or via some form of chromatography, depending on the type of ionization source and the nature of the sample. There are a number of ionization methods, of which the most common are Electrospray Ionization (ESI) and Matrix Assisted Laser Desorption Ionisation (MALDI) [4]. Liquid Chromatography Mass Spectrometry (LCMS) is most common in peptidomics and is the setup used by Denator. In LCMS, a Liquid Chromatography (LC) system is coupled to the mass spectrometer, thereby combining the separation capabilities of LC, based on physical properties, with the mass spectrometer.

ESI works by dissolving the sample in a polar, volatile solvent and pumping it through a narrow stainless steel capillary, where a high voltage (3-4 kV) is applied at the capillary tip. The sample forms an aerosol of highly charged droplets when passed through the electric field formed by the high voltage. This process is aided by introducing an inert gas, such as nitrogen, that helps direct the flow from the capillary tip into the mass spectrometer. The droplet size is reduced by using volatile solvents since they evaporate easily. Reduction of droplet size is also aided by adding warm gas, called drying gas, that speeds up solvent evaporation. Charged sample ions are separated from the droplets and passed through a sampling cone into an intermediate vacuum region, and subsequently through an opening into the high-vacuum environment of the mass analyzer. See figures 1 and 2 for an overview of ESI and an illustration of the reduction in droplet size.

The idea behind MALDI is to apply laser light to a sample in order to ionize it. Sample preparation is done by mixing the sample in a volatile solvent, which is then mixed with a matrix. The compound of choice to be used as the matrix varies depending on the analyte. There are a few main criteria that need to be met: the matrix must be soluble with the analyte, it must not be chemically reactive with the analyte and it must strongly absorb light of the laser's wavelength. The role of the matrix is to transfer energy to the analyte in order to excite and thereby ionize it. In addition, the matrix protects the analyte from exposure to excessive energy from the laser that may


Figure 1 – Schematic overview of the Electrospray Ionization (ESI) technique. The sample is dissolved in a volatile solvent and passed through a small capillary with an electric field at the capillary tip. This forms an aerosol of charged sample ions which is passed into the mass spectrometer.

Figure 2 – Droplet evaporation during the Electrospray Ionization (ESI) process. The sample ions are released from the droplet as solvent evaporates, thereby reducing the droplet size.


where an anode or cathode is present. As the energy transferred from the matrix eventually causes the analytes to evaporate, they can then be introduced into a mass analyzer. The most common mass analyzer used with MALDI is the time-of-flight (TOF) analyzer, a combination referred to as MALDI-TOF.

The exact details behind the mechanism of MALDI are not entirely understood and are still subject to ongoing research [4, 27].

2.1.2 Mass analyzer

There exist a number of different mass analyzers, whose main function is to separate the ions formed in the ionization step by their m/z ratios. The most widely used mass analyzers include quadrupole ion traps and time-of-flight (TOF) analyzers. Two mass analyzers can be used in series, a setup called tandem mass spectrometry (tandem MS) [4]. In peptidomics, the tandem MS setup is used to first detect the peptides in a sample and then split the peptides into fragments. The peptide fragments are detected in the next MS run and give information on the amino acid sequence, which is needed to identify a peptide. Different peptides can have the same mass, and by splitting them into fragments it becomes possible to discern between them. Peptide fragment ions can exist as different types of ions: a, b, c or x, y, z, where the type is determined by where the peptide bond is cleaved and which fragment retains the charge. The a, b or c ions are formed when the charge is retained by the amino terminal, and the x, y or z ions are formed when the charge is retained by the carboxy terminal. [46]
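As a concrete illustration of the fragment ion types, the m/z values of singly charged b and y ions can be computed by summing standard monoisotopic amino acid residue masses. This is a minimal sketch (only a handful of residues are included, and the function names are illustrative, not part of the thesis software):

```python
# Monoisotopic residue masses in daltons for a few amino acids; extend as needed.
RESIDUE_MASS = {
    "G": 57.02146, "A": 71.03711, "S": 87.03203, "V": 99.06841,
    "L": 113.08406, "K": 128.09496, "R": 156.10111, "F": 147.06841,
}
PROTON = 1.00728   # mass of a proton (Da)
WATER = 18.01056   # mass of H2O (Da)

def b_ions(peptide):
    """Singly charged b-ion m/z values (charge retained at the amino terminal)."""
    total, out = 0.0, []
    for aa in peptide[:-1]:          # the b ion for the full peptide is not listed
        total += RESIDUE_MASS[aa]
        out.append(total + PROTON)
    return out

def y_ions(peptide):
    """Singly charged y-ion m/z values (charge retained at the carboxy terminal)."""
    total, out = WATER, []
    for aa in reversed(peptide[1:]):
        total += RESIDUE_MASS[aa]
        out.append(total + PROTON)
    return out
```

For a peptide such as "GAK", `b_ions` yields the b1 and b2 masses and `y_ions` the y1 and y2 masses; a search engine compares such theoretical ladders against observed MS/MS peaks.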

The TOF mass analyzer is based on the principle that the velocity of ionized analytes accelerated by a homogeneous, constant electric field is directly related to their m/z ratio. It is therefore possible to determine the masses of the analytes based on their arrival time at the detector. The analytes accelerated by the electric field are sent into a vacuum chamber that has no electric field or other means of applying force on the analytes. This chamber is referred to as the field-free region.

The principle behind TOF is described by equation 1.

$$t = \left(\frac{2md}{eE}\right)^{1/2} + L\left(\frac{m}{2eV_0}\right)^{1/2} \qquad (1)$$

where $m$ = mass of the particle, $e$ = electronic charge, $E$ = electric field applied in the source, $d$ = length of the acceleration region, $L$ = length of the field-free region and $V_0$ = accelerating potential [28]. A simplified MALDI-TOF schematic can be seen in figure 3 [4].
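Equation 1 can be evaluated numerically; the sketch below assumes SI units, and the instrument dimensions and voltages are illustrative values only, not taken from the thesis:

```python
import math

def time_of_flight(m, d, L, E, V0, e=1.602176634e-19):
    """Flight time per eq. 1: acceleration-region term plus field-free drift term.
    m in kg, d and L in m, E in V/m, V0 in V, e in C."""
    return math.sqrt(2.0 * m * d / (e * E)) + L * math.sqrt(m / (2.0 * e * V0))

DA = 1.66053907e-27  # kg per dalton

# Illustrative: singly charged ions of 1000 Da and 2000 Da in the same instrument.
t1 = time_of_flight(1000 * DA, d=0.02, L=1.0, E=1.0e6, V0=2.0e4)
t2 = time_of_flight(2000 * DA, d=0.02, L=1.0, E=1.0e6, V0=2.0e4)
```

Since both terms scale with the square root of the mass, doubling the mass of a singly charged ion lengthens the flight time by a factor of exactly $\sqrt{2}$, which is why arrival time can be mapped back to m/z.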

The quadrupole mass analyzer consists of four parallel rods, where each opposing pair has a different electrical charge (fig. 4). An alternating voltage is applied between the rod pairs, as well as a direct current voltage. Ions produced by the ionization source are passed into the middle of the four rods, and their motion depends on the oscillating electric field caused by the alternating voltage.

Consequently, only analytes of specific m/z ratios will have a stable trajectory and be able to pass through to the detector. The quadrupole technique therefore makes it possible to scan for ions within a specific m/z range. [9]


Figure 3 – Simplified schematic of MALDI-TOF mass spectrometry. The sample ions are released from the matrix, accelerated by an electric field into the field-free region and subsequently registered by the detector. V0 = accelerating potential, d = length of the accelerating region, L = length of the field-free region.

Figure 4 – A simplified schematic of the quadrupole mass analyzer. The oscillating electric field between the rods causes a stable or an unstable trajectory of the ions depending on their m/z ratio. Only ions with stable trajectories can pass through to the detector.


Figure 5 – An illustrative example of a mass spectrum. The peaks can represent signals from either peptides or peptide fragments if a tandem MS setup is used.

2.1.3 Detector

The detector records the signals from the ion current in the mass analyzer and stores the data in the form of mass spectra. The result from a mass spectrometer can be viewed as a spectrum of peptide signals, where the intensity of each peptide's signal is plotted versus its m/z ratio, called a mass spectrum. The mass spectrum is also referred to as an MS spectrum or an MS/MS spectrum when representing peptides or peptide fragments, respectively. An example of a mass spectrum can be seen in figure 5. Common detectors include the photomultiplier, the electron multiplier and the micro-channel plate.

In peptidomics research, it is of interest to know the amounts of peptides from an MS run. This information is usually key to understanding the posed biological question, for example whether a certain peptide occurs in any meaningful amount after drug treatment. Such information is obtained by quantification of the MS data. Data from LC-MS is quantified using the retention time, the time it takes for a sample to leave the LC column. Peptides are not momentarily released from the column, but elute during a small period of time, a retention window. The intensity signal for a given peptide over the length of its retention window is used to quantify that peptide. For this project, the quantification was done in the software Progenesis LC-MS [20] as it is the program used by Denator. Since it is a commercial program, details on the quantification process are not disclosed. Other quantification software exists, such as the freely available MaxQuant [22].
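The retention-window principle described above can be sketched as a simple integration of the intensity trace over the window. This illustrates the general idea only; it is not Progenesis LC-MS's proprietary method, and the function name is ours:

```python
def quantify(retention_times, intensities, window):
    """Abundance estimate for one peptide: trapezoidal integral of its
    intensity trace over the retention window (start, end)."""
    start, end = window
    area, prev = 0.0, None
    for t, i in zip(retention_times, intensities):
        if start <= t <= end:
            if prev is not None:
                pt, pi = prev
                area += 0.5 * (i + pi) * (t - pt)  # trapezoid between samples
            prev = (t, i)
    return area
```

For a symmetric elution peak sampled at times 0..4 with intensities 0, 2, 4, 2, 0 and window (1, 3), the integral covers the central part of the peak.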


2.1.4 Peptide identification

Peptide fragment spectra, or MS/MS spectra, are normally sent to a proteomics search engine such as Mascot [24] or the open-source X!Tandem [49]. It is often impossible to match all peptides in an MS sample using existing search engines, for several reasons: variable data quality and the number of possible combinations of peptide fragments when it comes to size, charge and terminal amino acids [11]. Post-translational modifications (PTMs) make the matching procedure even more complicated since they add to the number of possible amino acid variants.

Perhaps the most common type of peptides used in mass spectrometry is tryptic peptides, which originate from proteins that are cleaved using trypsin in vitro. This method yields peptides that in principle always end in specific amino acids: lysine (K) or arginine (R). The available peptide search engines work best when the peptides are more than seven amino acids long and have an ion charge of +2 or +3 [11]. Endogenous peptides do not have any specific terminal amino acids because they are cleaved inside the body under unknown conditions. This leads to a higher number of possible amino acid combinations and thus makes identification more difficult.

2.1.5 File formats

There exist several different file formats for storing mass spectra, both open and proprietary. The different formats have been developed for different purposes by different organisations, which unfortunately has led to a lack of standardization.

A common format is mzXML [31], an XML (eXtensible Markup Language) format developed by the Seattle Proteomics Center (SPC) [39]. Markup languages work by enveloping properties in the file in tags, which makes the file human-readable and easy to organize. All the extra data in the file used for markup increases the storage space required for the file, which is a downside of mzXML. Another XML format, similar to mzXML, is the mzData format, developed by the Human Proteome Organization (HUPO). In an attempt to create a standardized format and replace mzXML and mzData, the mzML format was created by SPC and HUPO. The mzML format was released in 2008, with revisions made in 2009, and has thus existed for a while [32]. Even though efforts have been made to create a standard format, older formats such as mzXML are still used and are supported in current software. An example illustrating the mzXML format is depicted in figure 6.

An example of a more compact open file format is the Mascot Generic Format (MGF) [25], developed by Matrix Science to be used with the Mascot search engine. This format is not an XML format and therefore takes up less storage space. An example of the MGF format can be seen in figure 7.
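Because MGF is plain text, a minimal reader is easy to sketch. The parser below handles only the basic layout shown in figure 7 (BEGIN IONS/END IONS blocks, KEY=value header lines, then m/z-intensity pairs); real MGF files have more header variants than this:

```python
def parse_mgf(text):
    """Parse spectra from Mascot Generic Format text.
    Returns a list of dicts with 'params' (TITLE, PEPMASS, ...) and 'peaks'."""
    spectra, current = [], None
    for line in text.splitlines():
        line = line.strip()
        if line == "BEGIN IONS":
            current = {"params": {}, "peaks": []}
        elif line == "END IONS":
            spectra.append(current)
            current = None
        elif current is not None and line:
            if "=" in line and not line[0].isdigit():
                key, _, value = line.partition("=")   # e.g. TITLE=..., PEPMASS=...
                current["params"][key] = value
            else:
                mz, intensity = line.split()[:2]      # pairwise peak numbers
                current["peaks"].append((float(mz), float(intensity)))
    return spectra
```

This kind of reader is the natural entry point for a spectral library application, since each parsed block maps directly to one stored spectrum.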

2.2 Existing tools and software

There are currently a number of tools available for proteomics and peptidomics, many of which are mentioned in recent literature reviews [30]. The following is a selection of the software that is


Figure 6 – An illustrative example of the mzXML format. The data for each spectrum is encapsulated in the "scan" tag, which is an instance of the detector scanning for peptides during the MS run. The peak m/z ratios and corresponding intensities are encoded in the "peaks" tag under "contentType".

Figure 7 – An illustrative example of the Mascot Generic Format (MGF). The data related to the spectrum is encapsulated in the BEGIN IONS and END IONS statements. The TITLE and PEPMASS describe the title of the spectrum and the peptide mass in daltons. The following pairwise numbers represent the m/z ratios of the peaks and the corresponding intensities.


considered to be of most relevance for the background of this project.

Mascot [24], X!Tandem [49] and Sequest [7] are examples of search engines that identify peptides from peptide ion spectra (MS/MS spectra). These search engines use databases containing theoretical amino acid masses to match against the peak masses of the observed (query) MS/MS spectra.

The database can be species-specific, and X!Tandem also allows for user-customized databases.

Proteins are cleaved artificially, in silico, which results in theoretical, calculated spectra that are matched against the query spectra. The quality of the match is determined by a significance measure, such as an expectation score, see section 2.3.4. The theoretical masses also include possible PTMs that may or may not be present on the amino acids. In the end this often results in a large number of possible combinations that need to be explored. The search parameters are configured by the user prior to initiating the search. These parameters include the possible PTMs to explore, the compound used to cleave the protein, the mass tolerance and whether peptides with certain charges should be ignored. By adjusting these settings, it is possible to narrow or widen the search space. The available search parameters vary depending on the search engine used.

Examples of tools that identify peptides by matching MS/MS spectra against a database of other MS/MS spectra are X!Hunter [6] and SpectraST [17]. The idea here is to create a local library of annotated peptide spectra that come from one of the search engines described above. These annotated spectra are then matched against raw, unannotated peptide spectra provided by the user. This enables an easy way to set up and use a custom reference database for annotations. However, the intention in this project is to store raw peptide spectra that might later become annotated, essentially using the spectrum itself as an identifier. X!Hunter and SpectraST do not provide a way to create a database that is able to hold raw spectra and add annotations over time. Although not the intended use of the software, it is possible to store raw spectra using SpectraST. It is however not possible to add annotations afterwards, but the software is able to match and score spectra using the methods described in sections 2.3.5 and 2.3.4.

2.3 Spectra processing and scoring

The processing of spectral data prior to matching has many aspects. Comparison of two spectra can be done qualitatively by a human observer. This is not possible for computers, which need a way to quantify results in order to determine whether something is similar or dissimilar. Therefore, a system that can process spectra and provide a measure of similarity is required to compare large amounts of MS data. Such a system has several aspects: peak picking, consensus spectrum creation, spectral deconvolution, a way to score spectral similarity and a way to assess the significance of that score.

2.3.1 Peak picking

Mass spectra often contain more peaks than expected from the ionization process. This unwanted signal data is referred to as noise. The presence of noise can greatly reduce the chances of finding the correct peptide and also increases ambiguity in the search results. For these reasons, it is preferable to select only a number of peaks from a spectrum that give the strongest signals and provide an optimal signal-to-noise ratio. A common method for picking peaks is to pick the X peaks that have the strongest intensities, in an effort to discard noise. Another method is to divide the spectrum into Y parts and then pick the X strongest peaks in each part. [34]
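Both peak-picking strategies above are easy to sketch on a spectrum stored as (m/z, intensity) pairs; the function names and the segmenting scheme (equal-width m/z segments) are our illustrative choices:

```python
def top_peaks(peaks, x):
    """Keep the X most intense peaks of a spectrum [(mz, intensity), ...],
    returned in m/z order."""
    return sorted(sorted(peaks, key=lambda p: p[1], reverse=True)[:x])

def top_peaks_per_segment(peaks, x, y, mz_max):
    """Divide the m/z range into Y equal segments and keep the X most
    intense peaks in each, so that quieter m/z regions are not emptied out."""
    width = mz_max / y
    kept = []
    for s in range(y):
        lo, hi = s * width, (s + 1) * width
        segment = [p for p in peaks if lo <= p[0] < hi]
        kept.extend(top_peaks(segment, x))
    return sorted(kept)
```

The per-segment variant trades a slightly worse global signal-to-noise ratio for better coverage across the whole m/z range.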

2.3.2 Spectral deconvolution

Different isotopes of an element occur naturally in every sample. They differ by only a few daltons (Da) in mass, resulting in the formation of a cluster of peaks in the spectrum, where each peak represents a specific isotope and charge state. To deconvolve a spectrum means to group these peaks together into one single peak and determine its effective mass and charge. [19]
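A heavily simplified version of this grouping step can be sketched as follows. Real deconvolution also infers the charge state from the isotope spacing; this sketch only merges peaks roughly 1 Da apart for singly charged ions and reports the cluster at its first (monoisotopic) m/z:

```python
def deconvolve(peaks, tol=1.003):
    """Naive isotope-cluster grouping for singly charged ions: each peak
    within ~1 Da of the previous peak joins its cluster; a cluster is
    reported at the monoisotopic m/z with the summed intensity."""
    merged = []  # entries: (monoisotopic mz, total intensity, last mz seen)
    for mz, intensity in sorted(peaks):
        if merged and mz - merged[-1][2] <= tol:
            mono, total, _ = merged[-1]
            merged[-1] = (mono, total + intensity, mz)
        else:
            merged.append((mz, intensity, mz))
    return [(mono, total) for mono, total, _ in merged]
```

An isotope envelope at 500, 501 and 502 m/z thus collapses into a single peak at 500 m/z carrying the combined intensity.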

2.3.3 Similarity score

Intensity signals in MS/MS spectra from different MS runs can be vastly different in magnitude and need to be normalized in some way in order to make data from separate runs comparable. If a spectrum is represented by a vector, where the vector elements are peak intensities, normalizing the spectrum intensities is often done by transforming the vector into a unit vector. The resulting vector will have a magnitude of one, regardless of the values of the original peak intensities. This is done by dividing every component $u_i$ of a vector $U$ by the vector's magnitude $\hat{U}$ (eq. 2).

$$\hat{U} = \sqrt{\sum_{i=0}^{n} u_i^2} \qquad (2)$$

Scoring a comparison between two spectra can be done in different ways. A popular method, used by for example X!Tandem [8], is to calculate the dot product of the normalized intensity vectors (eq. 3). Since the vectors are normalized to unit vectors, the score can range from 0 to 1, where 1 is an identical match. [16]

$$D = \sum_{i=0}^{n} I_{library,i} \, I_{query,i} \qquad (3)$$

where $I_{library,i}$ and $I_{query,i}$ are the normalized intensities of the $i$th bin in the intensity vectors that represent the library spectrum and the query spectrum, respectively.

There are other scoring functions based on empirically observed rules (Spectrum Mill [1]) or statistically derived fragmentation frequencies (PHENYX [10]), but using the dot product as described above has proven to be useful. [30]
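Equations 2 and 3 translate directly into code. The sketch below assumes the two spectra have already been binned onto the same m/z grid so that equal indices correspond to equal bins:

```python
import math

def normalize(vec):
    """Scale a peak-intensity vector to unit magnitude (eq. 2)."""
    mag = math.sqrt(sum(v * v for v in vec))
    return [v / mag for v in vec] if mag else vec

def dot_score(library, query):
    """Dot product of two unit-normalized intensity vectors (eq. 3):
    1.0 for an identical match, 0.0 when no bins overlap."""
    lib, qry = normalize(library), normalize(query)
    return sum(a * b for a, b in zip(lib, qry))
```

Identical spectra score 1.0 regardless of their absolute intensities, which is exactly the run-to-run comparability the normalization is meant to provide.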


2.3.4 Significance measures

A way to measure the significance of a match is to use an expectation value (e-value). The purpose of the e-value is to provide a measure of how likely it is that a certain score has arisen by chance. The lower the e-value, the less likely a hit is to have arisen by chance. This kind of measure is used in other types of search engines, such as BLAST. [29]

X!Tandem use a method based on a hypergeometric distribution of the dot product score to achieve an e-score. A hypergeometric distribution describes the probability of making a number of suc- cessful draws from a population of finite size without replacement. The resulting scoring scheme is called Hyperscore and essentially adds two factors to the dot product score:

Hyperscore=

n i=0

IiPi

!

∗ Nb! ∗ Ny! (4)

where Ii is the intensity of peak i, Pi is 0 or 1 depending on whether or not the peak exist in the theoretical spectra and Nband Nyis the number of b and y ions respectively. A spectrum is queried against all other spectra in the database and a distribution of the resulting Hyperscores is formed.

It is assumed that the true match, if present in the database, will recieve the highest Hyperscore.

Hence, the set of all scores but the highest is a sample from the null distribution of scores between the query and non-matching spectra in the database, i.e. reflect an incorrect match (fig. 8a). The sample is used to estimate the probability of obtaining a score at least as high as the highest scoring match and compute the e-score. The scores on the right side of the distribution are assumed to fall on a straight line when log-transformed from the argument that incorrect results are random. The e-value is calculated based on the intersection of the extrapolated straight line with the maximum Hyperscore, i (fig. 8b). The expectation score is then ei. [38]

The details of the actual implementation of the e-score in X!Tandem are not clearly stated [8], but the main concept of how it works as a significance measure is described above.
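The main concept can be illustrated with a small sketch: fit a straight line to the log-transformed tail of a null score distribution and extrapolate it to the top score. This is a simplified stand-in for X!Tandem's actual implementation, run on synthetic scores:

```python
import numpy as np

def expectation_value(scores):
    """Estimate an e-value for the top score: fit a straight line to the
    log10-transformed survival counts of the null scores and extrapolate
    it to the top score (a simplified X!Tandem-style scheme)."""
    scores = np.sort(np.asarray(scores, dtype=float))
    top, null = scores[-1], scores[:-1]          # top hit vs. null sample
    xs = np.unique(null)
    survival = np.array([(null >= s).sum() for s in xs])
    x, y = xs, np.log10(survival)                # survival is >= 1 everywhere
    upper = x >= np.median(x)                    # fit only the right-hand tail
    slope, intercept = np.polyfit(x[upper], y[upper], 1)
    return 10 ** (slope * top + intercept)

rng = np.random.default_rng(0)
null_scores = rng.exponential(scale=5.0, size=1000)
e = expectation_value(np.append(null_scores, 60.0))  # top score deep in the tail
print(e)
```

A top score far out in the tail yields a small e-value, i.e. a match unlikely to have arisen by chance.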

2.3.5 Scoring in SpectraST

The dot score (eq. 3) is also used to score spectra in SpectraST. The peak intensities are placed into bins, 1 m/z unit wide, and a fraction of each intensity is spread out to neighboring bins in order to match equivalent but slightly m/z-shifted peaks. However, SpectraST uses a different approach to assess the significance of a hit: the dot score difference between the two top hits is compared, a measure called ∆D (eq. 5).

\Delta D = \frac{D_1 - D_2}{D_1} \quad (5)

Figure 8 – (a) Score distribution for a given peptide matched against the entire database, where the top score is circled. (b) The expectation score is calculated based on the logarithm of the scores: a straight line is extrapolated and its intersection with the top score (circle) is used to calculate the e-score. The figures are adapted from [38].

If ∆D is large, the top hit clearly stands out from the other hits and is thus more likely to be correct. Another metric used by SpectraST is the dot bias (DB), which says to which degree the dot product is dominated by just a few matching peaks (eq. 6). DB gets a value of 1 if the dot product is due to a single peak, which is the case where any contributing score comes from only one vector element in each vector. Let the only contributing vector elements be x and y in each spectrum respectively. The numerator of eq. 6 is then the square root of the squared product of the elements representing that peak in both spectra, \sqrt{x^2 y^2} = xy, while the denominator is simply the product xy, resulting in xy/xy = 1. Conversely, DB gets a value of 1/\sqrt{b} \approx 0 if all peaks contribute equally to the dot score, where b is the number of bins: if all contributing vector elements are of size a and there are b bins in the vector, the numerator has the value \sqrt{b a^4} and the denominator b a^2, resulting in DB = \sqrt{b a^4}/(b a^2) = 1/\sqrt{b} \approx 0 for large values of b. Very large or very small DB values are typical for uncertain matches where the dot score is too high, caused by matching a few dominating peaks or by matching many small peaks that are likely noise. [16]

DB = \frac{\sqrt{\sum_j I_{\mathrm{library},j}^2 \, I_{\mathrm{query},j}^2}}{D} \quad (6)

The hits from SpectraST are ranked by using both the dot score and dot bias to calculate a so-called discriminant function, F (eq. 7) which is the SpectraST equivalent of an e-value.

F = 0.6D + 0.4∆D − b (7)

where the penalty term b is determined by the DB (eq. 8) [16].


b = \begin{cases} 0.12 & \text{if } DB < 0.1 \\ 0.12 & \text{if } 0.36 < DB \le 0.4 \\ 0.18 & \text{if } 0.4 < DB \le 0.45 \\ 0.24 & \text{if } DB > 0.45 \\ 0 & \text{for all other values of } DB \end{cases} \quad (8)

The authors of SpectraST derived the discriminant function from trial-and-error runs on a chosen test set. The parameters and form of the F function are assumed to be general enough for other datasets, and it is also possible to adjust them for the needs of a specific application. It is however unclear how to decide whether or not a given value of F is relevant; this seems to be left for the user to decide. [16]
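The scoring quantities above can be sketched as follows; the binned spectra, the runner-up score and the unit normalization are hypothetical stand-ins for illustration, not SpectraST's implementation:

```python
import math

def normalize(v):
    n = math.sqrt(sum(x * x for x in v))
    return [x / n for x in v]

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def dot_bias(a, b):
    # DB (eq. 6) for unit-normalized spectra
    return math.sqrt(sum((x * y) ** 2 for x, y in zip(a, b))) / dot(a, b)

def penalty(db):
    # b (eq. 8): piecewise penalty from the dot bias
    if db < 0.1 or 0.36 < db <= 0.4:
        return 0.12
    if 0.4 < db <= 0.45:
        return 0.18
    if db > 0.45:
        return 0.24
    return 0.0

def discriminant(d1, d2, db):
    # F (eq. 7): dot score plus Delta-D, penalized by the dot bias
    delta_d = (d1 - d2) / d1
    return 0.6 * d1 + 0.4 * delta_d - penalty(db)

# Hypothetical binned spectra and an assumed runner-up score
lib = normalize([5.0, 1.0, 0.0, 3.0])
query = normalize([4.0, 1.5, 0.5, 3.5])
d1 = dot(lib, query)   # dot score of the top hit
d2 = 0.40              # assumed dot score of the second-best hit
print(round(discriminant(d1, d2, dot_bias(lib, query)), 3))
```

A single shared peak gives DB = 1 exactly, as derived in the text, and triggers the largest penalty.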

2.4 Pattern recognition

Pattern recognition is a machine learning approach that is used to find distinctive patterns in data in order to create a model that describes the significance of the features, or attributes, in the data.

A specific algorithm is used to train the model, and this algorithm can be either supervised or unsupervised. Supervised learning means that the data is labeled, which implies that it is possible to evaluate the model's performance in terms of correctly or incorrectly classified examples. Unsupervised learning means that the data is unlabeled and there are no predetermined classification assumptions. [2]

Both methods are useful in different contexts. This project mainly used supervised learning, or classification, since the data sets it was based on were categorized into tissue types or treatment methods.

2.4.1 Classification

Classification is a form of supervised learning where the data have discrete outputs or labels - class labels. In the context of this project, the labels are discrete since they describe a tissue type or treatment method. A classifier algorithm is trained by providing it with a training data set, from which it forms a set of rules based on the features in that data. An unknown data example can then be given to the algorithm, which tries to determine which class it belongs to based on the rule set from the training. This is exemplified in figure 9.

There exist many algorithms for classification, some of which are: Decision Tree, Random Forest, Artificial Neural Networks (ANN) and Support Vector Machine (SVM). Each method has its strengths and weaknesses. Most tree-based methods produce a model that can be interpreted visually. This is a big advantage when it comes to understanding the biological meaning of the model, as the rules are shown as decision points in a tree structure.

Figure 9 – Schematic view of a classification procedure. A classifier, C, is trained by providing it with a set of training data on which it forms rules that are used to predict the class, Y, of an unknown example, X.

Although tree models have a visual advantage, they may not be suitable for data sets that have a very large number of features, or attributes. Tree models easily become overly adapted, or overfitted, to the data if every attribute is part of the decision algorithm. The resulting model becomes too complex and small changes in the data become amplified, leading to poor predictive performance. This project deals with data that e.g. consists of around 30 mass spectrometry runs and about 43 000 detected peptides; the resulting data set then has 30 examples and 43 000 attributes. Even though tree methods may not be suitable for this project, they are useful in order to illustrate how a classifier works (fig. 10).

Random Forest was used as the classifier algorithm in this project. Although based on decision trees, it is an ensemble method - it grows many tree models instead of just one, which compensates for the downsides of decision trees mentioned above. Each tree is constructed using a subset of the attributes in the data set to create each split, i.e. node, in the decision tree (fig. 10) - creating a forest of decision trees. A new example is classified by sending it to each tree model in the forest. Every model gives a classification, which counts as a vote for that class, and the final classification becomes the one that got the most votes from the forest. Random Forests are also fast to train on large data sets and can handle a lot of attributes, making them a suitable choice for this project [18].

2.4.2 Feature extraction

A small number of examples compared to the number of attributes can be problematic; it means that there are few cases available to train the classifier on and a large number of possible combinations of the features of the data. This leads to challenges when it comes to producing a robust model. Unfortunately, mass spectrometry analysis is both expensive and time-consuming, which means that the data sets will often have few examples compared to the number of attributes. Feature extraction is a way to reduce the number of attributes. Clustering is an example of this, in which similar attributes are grouped and together act as one single attribute rather than several individual attributes. Clustering can also be used on the examples in a dataset in order to reveal groupings in the data, but is in this context used to find similar features instead. There are two main types of clustering: partitional and hierarchical.

Figure 10 – Simplified example of a tree classifier model where peptide abundances are used to determine which tissue type the peptides belong to. The green nodes are decision points that represent attributes in the data set, peptides in this case. The numbers on the arrows equal the mass value of the peptide in the adjacent node, which in turn leads to the next decision point or to the classification of the peptide to a tissue type.

The principle of partitional clustering is to divide the objects in the data set into non-overlapping subsets, where each object exists in exactly one subset (fig. 11). Hierarchical clustering works by nesting several clusters into a hierarchy, organized in a dendrogram. More similar data points form their own small clusters, while less similar data points exist together in a larger cluster (fig. 12). [51]

K-means clustering is a clustering method that aims to place n observations into k clusters. A given set of observations X = \{x_1, x_2, ..., x_n\} is divided into k sets, S = \{S_1, S_2, ..., S_k\}, where k \le n, so that the within-cluster sum of squares is minimized:

\min \sum_{i=1}^{k} \sum_{x_j \in S_i} ||x_j - \mu_i||^2 \quad (9)

where \mu_i is the mean of the elements in S_i and is also called the centroid. Each data point is assigned to the cluster with the closest centroid - the center point of the cluster. The centroids for each cluster are then re-computed and the process is repeated until the centroids do not change. [50, 51]
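The procedure can be sketched in a few lines; this is a plain Lloyd's-algorithm illustration on synthetic data, not the R implementation used in the project:

```python
import numpy as np

def kmeans(X, k, iters=100, seed=0):
    """Plain Lloyd's algorithm: assign each point to its nearest centroid,
    recompute the centroids, and repeat until they stop moving (eq. 9)."""
    rng = np.random.default_rng(seed)
    centroids = X[rng.choice(len(X), size=k, replace=False)]
    for _ in range(iters):
        # distances of every point to every centroid
        d = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
        labels = d.argmin(axis=1)
        new = np.array([X[labels == j].mean(axis=0) for j in range(k)])
        if np.allclose(new, centroids):
            break
        centroids = new
    return labels, centroids

# Two well-separated blobs: k = 2 should recover them
rng = np.random.default_rng(1)
X = np.vstack([rng.normal(0, 0.3, (20, 2)), rng.normal(5, 0.3, (20, 2))])
labels, _ = kmeans(X, 2)
print(len(set(labels[:20])), len(set(labels[20:])))
```

With well-separated groups, each blob ends up in its own cluster regardless of the random initialization.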

Larger values of k mean that the data set becomes less compressed - the number of attributes gets larger, and the objective of feature extraction is to reduce that number. On the other hand, small values of k can result in large numbers of observations in the same clusters, observations that should be separated. In essence, this is a matter of choosing compactness versus retaining information in the data. An optimal value for k can be determined by using the so-called Elbow method [26].

Figure 11 – Example of partitional clustering: the data points are divided into non-overlapping subsets where each data point exists in only one subset.

Figure 12 – Example of hierarchical clustering: the data points (D1-D4) exist in nested clusters represented in a dendrogram. With no criteria for similarity, all data points exist in the same cluster. As more stringent criteria for likeness are introduced, the clusters are split into smaller clusters containing only the data points that fulfill the criteria.

In short, the Elbow method uses the ratio of the sum of squares distance (SSD) between the k-means clusters and the total SSD in order to find a suitable value for k. This ratio is equivalent to the fraction of total variance explained for a given value of k. The goal is to retain as much information in the data as possible, thus explaining as much of the variance as possible, while reducing the number of attributes as much as possible. This means a high percentage of total variance explained and a low value for k. The total sum of square distances, T, is described by:

T = \sum_i ||x_i - \mu_{c(i)}||^2 + \sum_i ||\mu_{c(i)} - \mu||^2 \quad (10)

where the first term is the SSD within the clusters and the second term the SSD between the clusters. \mu is the overall centroid, c(i) refers to the cluster of example i, and \mu_{c(i)} = \frac{1}{N_c} \sum_{i' \in I(c)} x_{i'}, where N_c is the size of cluster c and I(c) indexes the examples in cluster c. The sum of variances, V_{tot}, can be expressed as:

V_{tot} = \frac{1}{N} T \quad (11)

where N is the total number of data points. Furthermore, the variance between the clusters, V_{between}, is described as:

V_{between} = \frac{1}{N} \sum_i ||\mu_{c(i)} - \mu||^2 \quad (12)


Figure 13 – An example illustrating the Elbow method, where the total variance explained by the data is plotted versus the number of k-means clusters used. The blue circle marks the optimal choice for the value of k, since it finds the best trade-off between explaining as much of the variance as possible while keeping the number of clusters as low as possible.

Thus, the ratio of V_{between} and V_{tot} is

V_{between}/V_{tot} = \frac{\frac{1}{N} \sum_i ||\mu_{c(i)} - \mu||^2}{\frac{1}{N} T}

which is equal to the ratio of the SSD between clusters and the total SSD. The Elbow method is further illustrated in figure 13, where the blue circle marks the “elbow” and thus the optimal value for k.
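The quantity that the Elbow method plots can be computed directly from eqs. 10-12; the sketch below uses synthetic data and a given cluster assignment:

```python
import numpy as np

def explained_variance_ratio(X, labels):
    """V_between / V_tot (eqs. 10-12): the fraction of the total variance
    explained by a clustering. Plotted against k, the 'elbow' marks a good k."""
    mu = X.mean(axis=0)
    total = ((X - mu) ** 2).sum()                      # proportional to V_tot
    between = 0.0
    for c in np.unique(labels):
        members = X[labels == c]
        between += len(members) * ((members.mean(axis=0) - mu) ** 2).sum()
    return between / total

# Two tight blobs: splitting them (k = 2) explains almost all of the variance,
# so the curve jumps from 0 at k = 1 to ~1 at k = 2 -- the elbow sits at k = 2
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0, 0.2, (15, 2)), rng.normal(4, 0.2, (15, 2))])
labels_k2 = np.array([0] * 15 + [1] * 15)
print(round(explained_variance_ratio(X, labels_k2), 3))
```

Increasing k beyond the elbow only adds marginal explained variance, which is why the curve flattens out.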

Principal component analysis (PCA) is another method for reducing the number of attributes. The data set is transformed to reduce the number of dimensions, and the method can also be used as an unsupervised, exploratory way of finding patterns in the data. The transformed data is described by the principal components, a set of linearly uncorrelated vectors. The goal of PCA is to find a basis that re-expresses the data in the best way. This can be simply described by the following equation:

PX = Y \quad (13)

where X is an m \times n matrix representing the original data set and Y is a representation of X after a linear transformation, P. The rows of Y are called the principal components and the rows of P are a set of new basis vectors for representing the columns of X:


PX = \begin{pmatrix} p_1 \\ \vdots \\ p_m \end{pmatrix} \begin{pmatrix} x_1 & \cdots & x_n \end{pmatrix} = \begin{pmatrix} p_1 \cdot x_1 & \cdots & p_1 \cdot x_n \\ \vdots & \ddots & \vdots \\ p_m \cdot x_1 & \cdots & p_m \cdot x_n \end{pmatrix} = Y

The elements p_i \cdot x_j represent the transformed values of a particular data point and are also called scores.

A covariance matrix expresses how much the dimensions in a dataset vary with respect to each other, in contrast to the variance measure, which expresses the variation in one dimension independently. For a given data set, the aim of PCA includes minimizing the redundancy, given by the covariance, and maximizing the signal, given by the variance. A covariance matrix C_x for the m \times n matrix X is computed as described in the following equation:

C_x = \frac{1}{n-1} X X^T \quad (14)

where the values in X are in mean deviation form, meaning that the mean value \bar{x} of all data points in X has been subtracted from each element x_i, or is zero. The diagonal elements in C_x describe the variance in the data set and the off-diagonal elements describe the covariance. C_x is diagonalized in order to minimize the covariance, forming the manipulated covariance matrix C_y. This means that the off-diagonal elements are zero and thus the redundancy in the data is minimized. PCA assumes that the directions in the m-dimensional data set X that have the largest variance are the most important, and that the basis vectors in P are orthonormal. The direction giving the largest variance in X is saved as the vector p_1, the first principal direction. Subsequently, the direction giving the next largest variance is saved in p_2, and so on. The search for each new principal component is restricted to directions that are perpendicular to all previously selected principal components, because of the assumed orthonormality constraint. The rows of P, \{p_1, ..., p_m\}, are in fact the eigenvectors of the matrix C_x; they are also called loadings and represent weight factors that are multiplied with the original data to calculate a principal component y_i = p_i x_j.

In short, the aim of PCA can be summarized as follows: find an orthonormal matrix P where Y = PX such that the matrix C_y = \frac{1}{n-1} Y Y^T is diagonal [42, 12, 44].

PCA provides a way of visualizing high-dimensional data. Since the peptide data sets in this project contain several thousands of peptides, this is useful for gaining an overview of the data and hopefully observing any clear patterns or trends. This is shown in the result section 4.3.3, figures 27, 28 and 29.
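A compact sketch of PCA via the covariance matrix (eq. 14) is given below. Note the orientation: here the rows of X are data points, i.e. the transpose of the text's convention, as is usual in code:

```python
import numpy as np

def pca(X):
    """PCA via eigendecomposition of the covariance matrix (eq. 14).
    The rows of P are the eigenvectors of C_x, ordered by decreasing variance."""
    Xc = X - X.mean(axis=0)               # mean-deviation form
    C = Xc.T @ Xc / (len(X) - 1)          # covariance matrix C_x
    eigvals, eigvecs = np.linalg.eigh(C)  # symmetric matrix -> eigh
    order = np.argsort(eigvals)[::-1]     # largest variance first
    P = eigvecs[:, order].T               # new basis vectors as rows (loadings)
    scores = Xc @ P.T                     # the principal components (scores)
    return scores, P, eigvals[order]

# Hypothetical 2-D data stretched along the first axis
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 2)) @ np.array([[3.0, 0.0], [0.0, 0.5]])
scores, P, var = pca(X)
print(var[0] > var[1])   # the first principal component captures most variance
```

The covariance matrix of the scores is diagonal, which is exactly the C_y condition stated above.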

2.4.3 Attribute importance

Figure 14 – An example of 5-fold cross-validation showing the partitioning into training set and test set. The gray boxes represent elements in the training set and the white boxes represent elements in the test set.

Determining the importance of the attributes in the data set is key in order to gain an understanding of the most prominent features in the data. In the context of this project, this means finding the peptides that are typical for a certain tissue type or treatment effect. Attribute importance can be measured in different ways. A straightforward way is to measure the loss in classifier accuracy when an attribute is removed; consider, for example, a classifier trained on three attributes a, b and c. The classifier's accuracy is 90% when using all three attributes, i.e. it can classify a given example correctly 9 out of 10 times. If attribute c is removed and the classifier's performance drops to 80%, then attribute c has an importance of 10% to the overall accuracy of the classifier. The attribute that contributes the largest drop in accuracy when removed is considered the most important one. Attribute importance can also be measured by looking at the decrease in Gini Index. The Gini Index describes inequality in the distribution of the data; it can be thought of as impurity in the data with respect to the classes, so the Gini Index should be as low as possible [3]. A high decrease in Gini Index when an attribute is used is therefore desirable [5]. Another method for measuring the importance of attributes is to randomly permute the attributes and evaluate the differences in the classifier's accuracy.
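The permutation approach can be sketched as follows; the nearest-centroid classifier and the data are hypothetical stand-ins for Random Forest and the peptide data:

```python
import numpy as np

def nearest_centroid_predict(Xtr, ytr, Xte):
    # Minimal stand-in classifier: pick the class with the closest centroid
    classes = np.unique(ytr)
    centroids = np.array([Xtr[ytr == c].mean(axis=0) for c in classes])
    d = np.linalg.norm(Xte[:, None, :] - centroids[None, :, :], axis=2)
    return classes[d.argmin(axis=1)]

def permutation_importance(X, y, attribute, seed=0):
    """Importance of an attribute = drop in accuracy after randomly permuting
    that attribute's column, which breaks its relation to the class."""
    base = (nearest_centroid_predict(X, y, X) == y).mean()
    Xp = X.copy()
    rng = np.random.default_rng(seed)
    Xp[:, attribute] = rng.permutation(Xp[:, attribute])
    permuted = (nearest_centroid_predict(X, y, Xp) == y).mean()
    return base - permuted

# Hypothetical data: attribute 0 separates the classes, attribute 1 is noise
rng = np.random.default_rng(1)
X = np.vstack([np.column_stack([rng.normal(0, 1, 30), rng.normal(0, 5, 30)]),
               np.column_stack([rng.normal(8, 1, 30), rng.normal(0, 5, 30)])])
y = np.array([0] * 30 + [1] * 30)
print(permutation_importance(X, y, 0) > permutation_importance(X, y, 1))
```

Permuting the informative attribute collapses the accuracy, while permuting the noise attribute leaves it essentially unchanged.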

2.4.4 Cross-validation

The model produced by a classifier needs to be tested in some fashion to ensure its accuracy and robustness. A useful and common way to do this is cross-validation (CV). In k-fold CV, the data is split into k equally-sized parts. All parts except one are used to train the model and the remaining part is used to test the model - how well it can classify a given example based on the rules set up during training. The training and testing procedure is repeated using different parts as training and test sets until all k parts have been used [36]. An example of 5-fold CV is illustrated in figure 14.

A special case of CV is leave-one-out CV (LOOCV), in which the data set is partitioned into a number of parts equal to the number of examples in the data set. Consequently, one example is used to test the model each time, which is suitable for data sets where few examples are available. The motivation is that the model can be trained using as many examples as possible, while still being tested on “new” data.
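LOOCV can be sketched generically; the nearest-neighbour rule below is a hypothetical stand-in classifier:

```python
def loocv_accuracy(X, y, train_and_predict):
    """Leave-one-out CV: hold out one example, train on the rest, test on the
    held-out example; repeat for every example and average the outcomes."""
    n = len(X)
    hits = 0
    for i in range(n):
        Xtr = [X[j] for j in range(n) if j != i]
        ytr = [y[j] for j in range(n) if j != i]
        hits += train_and_predict(Xtr, ytr, X[i]) == y[i]
    return hits / n

# Hypothetical 1-D classifier: predict the class of the nearest training point
def nearest_neighbour(Xtr, ytr, x):
    return min(zip(Xtr, ytr), key=lambda pair: abs(pair[0] - x))[1]

X = [0.1, 0.3, 0.2, 5.1, 5.3, 5.2]
y = ['a', 'a', 'a', 'b', 'b', 'b']
print(loocv_accuracy(X, y, nearest_neighbour))  # 1.0: the classes are well separated
```

With n examples, this is simply n-fold CV, which is why it suits the small sample sizes typical of MS data sets.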


2.4.5 Permutation test

In order to draw any meaningful conclusions from the performance of a classifier, the statistical significance of that performance must be established. A method of doing this is to conduct a permutation test, where the idea is to establish whether the performance is reliable or random. In the context of classification, this can be done by randomly shuffling the class labels in the data set, effectively destroying the relation between an example and the class it belongs to. The classifier is then trained on the data with scrambled class labels and its performance is measured.

This is repeated many times to create a distribution of resulting performances. The null hypothesis states that there is no relation between an example and its class label. A p-value is used to assess the significance: it describes the probability, under the null hypothesis, of obtaining a performance at least as high as the one observed. The p-value is computed as the fraction of times the performance with scrambled class labels is at least as high as the performance obtained with non-scrambled class labels. Hence, a low p-value signifies a reliable performance from the classifier. [21]
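A sketch of the permutation test; the threshold rule plays the role of the classifier and all values are synthetic:

```python
import random

def permutation_p_value(observed, X, y, evaluate, n_perm=200, seed=0):
    """p-value: fraction of label-shuffled runs whose performance is at least
    as high as the performance obtained with the true labels."""
    rng = random.Random(seed)
    count = 0
    for _ in range(n_perm):
        shuffled = y[:]
        rng.shuffle(shuffled)            # destroy the example/label relation
        if evaluate(X, shuffled) >= observed:
            count += 1
    return count / n_perm

# Hypothetical "classifier": a fixed threshold rule, x > 2 -> class 1
def threshold_accuracy(X, y):
    return sum((x > 2) == bool(c) for x, c in zip(X, y)) / len(X)

X = [0, 1, 1, 0, 3, 4, 3, 4]
y = [0, 0, 0, 0, 1, 1, 1, 1]
p = permutation_p_value(threshold_accuracy(X, y), X, y, threshold_accuracy)
print(p < 0.05)
```

Because the labels are informative here, almost no shuffled labeling reaches the observed accuracy, so the p-value is small.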

3 Experimental setup and algorithmic solutions

This section explains the software, methods and solutions used in the project to construct the platform. It also covers the description of the data set used in the pattern recognition and the overall layout of the platform.

3.1 Platform layout

An overview of the platform’s layout is shown in figure 15.

Several software packages and scripts together make up the platform. The spectral library construction and spectral matching functions were provided by a modified version of SpectraST [40]. Annotation data are stored in an SQLite [45] database. File conversions are partly done by ProteoWizard's MsConvert [13] and quantification is done with Progenesis LC-MS [20]. SpectraST is part of the software suite Trans-Proteomic Pipeline (TPP) but has been used as a standalone version in this project. All except Progenesis LC-MS are open-source software.

The scripting language Perl was used to integrate the different software components of the peptide library. Most of the integration has to do with file format conversions to enable communication between the programs. Perl is both suitable for this task and familiar to the author, and was therefore chosen as the scripting language.

SpectraST was originally developed to enable the construction of a local spectral library where annotated MS/MS spectra are stored. The intention was to reduce search times and avoid re-discovering the same peptides. The software stores MS/MS spectra in a database and compares them directly against query spectra. This approach reduces search times since there is no in silico peptide cleavage, in contrast to e.g. X!Tandem. However, SpectraST does not support continuous addition of annotations to the stored MS/MS spectra. For this reason, the annotations are stored separately in an SQLite database, as mentioned above. SpectraST was nevertheless chosen as the method for creating the spectral library and performing the spectra matching because of its speed and the possibility to make customizations.

Some modifications and fixes were made to SpectraST to make it more suitable for the project's needs. In its original state, there seemed to be a bug in the SpectraST software that made it import one spectrum fewer than the number of spectra present in a file. This meant that zero spectra were imported when the file contained only one spectrum. The output format of a search result was slightly modified to also include information under “Remarks” for a search entry. Tissue type, or other categorical data, is stored in the comments section of the database entry and is key in order to identify the sample downstream; therefore it needed to be included in the search output.

A full list of changes made in order to build SpectraST is available in appendix A.

SpectraST supports importing spectra in several different formats. The mzXML format was chosen because of technical limitations in SpectraST, which resulted in import errors when trying to import raw spectra in any format other than mzXML. However, the spectra imported into the database are meant to be exported from Progenesis LC-MS, which does not support mzXML. The spectra are therefore exported from Progenesis LC-MS in MGF format and then converted to mzXML using ProteoWizard MsConvert.

3.2 Data set

The data set used for the pattern recognition in this project consisted of quantified mass spectrometry data from endogenous peptides. This data set contains the amount of a certain peptide - the peptide abundance - for each sample in the MS run. The quantification was done in the software Progenesis LC-MS.

This data set will be referred to as the “taxotere” data set and contains MS data from mouse brain tissue from 13 different mice. A sample from the left and right striatum was taken from each mouse. The samples in the data set were stabilized either by snap-freezing [35] or by heat inactivation using Denator's ST1 instrument. Furthermore, the samples were either treated or untreated with Taxotere. The resulting data set thus has four classes describing the sample's inactivation technique and whether or not it was treated with Taxotere:

• “SnapFroz Yes” - the sample was inactivated by snap freezing and treated with Taxotere.

• “SnapFroz No” - the sample was inactivated by snap freezing and not treated with Taxotere.

• “Stab Yes” - the sample was inactivated by heat stabilization and treated with Taxotere.

• “Stab No” - the sample was inactivated by heat stabilization and not treated with Taxotere.

The entire data set contains 26 examples (MS runs) and 43 450 attributes (peptide signals). All classes contained six examples, except “SnapFroz Yes”, which contained eight examples. A short excerpt of the data set is shown in figure 16.


Figure 15 – The platform's layout can be thought of as two branches. One has to do with the peptide library, “searches and annotations”. The other branch includes the machine learning part, “find interesting peptides”, where the objective is to find which peptides are specific for a tissue or a treatment effect. The results from the machine learning branch can then be used to query the peptide library, as shown in the figure by an asterisk.

Figure 16 – An excerpt of the data set used for the pattern recognition. Each attribute represents a peptide's m/z ratio, retention time, mass and charge. For example, 709.65_27.16_4960.50_7 means an m/z ratio of 709.65, a retention time of 27.16 minutes, a mass of 4960.50 Da and a charge of +7. The class labels representing the different sample preparations and treatments are shown in the right-most column.


3.3 Peptide library

The peptide library constructed in this project comprises two databases. One database holds the raw spectral data and one holds the annotation information: in which MS runs a certain peptide has occurred, and whether or not two spectra are similar enough to be assumed to represent the same peptide. SpectraST is designed to be used with annotated peptide data that is directly imported from e.g. Mascot or Sequest. In this project, the idea is to first store the spectra without annotations and mark them up in later stages with annotation data as the spectra are identified. To meet this need, a separate database is used to handle annotations. For the end-user, this distinction will not be noticeable. The search results are presented in the form of a Hypertext Markup Language (HTML) table. Using HTML for this purpose provides an easy way to create a result view that is compatible with any web browser, reducing extra software dependencies.

Matched peptides get a ranking that depends on how likely they are to be a good hit, given by the F-value of the discriminant function F described above (eq. 7). A high F-value means a highly ranked and therefore likely match. A matching test was performed to try out the performance of the matching procedure. A data set consisting of neuropeptides from rat (the “rat” data set) was imported into the spectral library (see 3.3.1) and annotations for most of these spectra were imported into the annotation library (see 3.3.2). The annotations were obtained by searching all MS/MS spectra from the “rat” data set in X!Tandem against a database of confirmed or highly likely endogenous peptides from rat. All spectra from the “taxotere” data set described above were used as the query, where none were previously annotated. The query spectra for the top eight hits from the test matching were sent to X!Tandem for annotation in order to compare with the annotations obtained from the “rat” data set. The following parameter settings were used in X!Tandem:

Modifications: C-terminal amidation and oxidation@M

This setting describes the allowed modifications of the peptide: in this case amidation of the C-terminal amino acid and a possible oxidation of methionine.

Refinement: Acetyl N-term, Deamidated@N, Deamidated@Q, Acetyl@K, Oxidation@M

The refinement parameter allows for further relaxation of the search constraints by allowing more possible modifications in a second-stage search, which expands the search space and thereby increases the chance of finding an annotation.

Spectrum: parent monoisotopic mass error: 10 ppm, fragment monoisotopic mass error: 0.5 Da

The spectrum setting describes the tolerance in mass difference between the query spectra and the database spectra. The parent mass error reflects the mass of the peptide that was fragmented into the MS/MS spectrum and the fragment mass error refers to the peak masses of the MS/MS spectrum.

3.3.1 Spectral library

The spectral library, or database, was constructed using a modified version of SpectraST. The library is stored in a binary file format specific to SpectraST, called splib, which is fast in terms of search speed; the downside is that there is no way to extract data directly from the database - it should be thought of as a reference catalog only. A text version of the library is created alongside the binary one, but it is not required for the library to function, as the search process only uses the binary library file. Once peptide spectra are matched, they need to be extracted from the raw spectra data files for further analysis, such as sending them to Mascot or Sequest for annotation.

3.3.2 Annotation library

The annotation library was constructed using SQLite, a relational database engine that provides a light-weight database service and takes up little storage space. An Entity Relationship (ER) diagram describing the database design is shown in figure 17. This type of diagram shows the tables in the database, the entity types (attributes) each table holds and the data type of each entity. The entities in different tables are related to each other as described by cardinalities. The cardinalities express the number of instances of one entity that can, or must, be associated with the instances of another entity. There are three types of cardinalities: one-to-one, one-to-many and many-to-many relationships. [41]

Information from the annotation database is automatically fetched by a Perl script during a search in the peptide library. An SQL query is sent to the database asking for any available annotations for a given spectrum. For example, in order to find whether there are any peptides in the database similar to a given peptide named “Peptide_123”, the following SQL query would be sent:

select * from occurrence where id = (select similar_peptide from similar where id = 'Peptide_123');

If there are any peptides similar to “Peptide_123”, the names of these peptides and the MS runs they occur in will be returned.
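The lookup can be reproduced with Python's built-in sqlite3 module; the two-table schema below is a simplified illustration, not the actual schema of figure 17:

```python
import sqlite3

# Simplified two-table schema; table and column names are illustrative only
con = sqlite3.connect(":memory:")
con.executescript("""
    create table similar    (id text, similar_peptide text);
    create table occurrence (id text, ms_run text);
    insert into similar    values ('Peptide_123', 'Peptide_456');
    insert into occurrence values ('Peptide_456', 'run_07');
""")
# The subquery resolves the similar peptide, the outer query finds its runs
rows = con.execute(
    "select * from occurrence where id = "
    "(select similar_peptide from similar where id = ?)",
    ("Peptide_123",),
).fetchall()
print(rows)  # [('Peptide_456', 'run_07')]
```

The parameterized `?` placeholder avoids building the query by string concatenation, which is good practice when the peptide name comes from user input.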

Peptide information is imported into the annotation library at the same time as the peptide spectra are imported into the spectral library. At this stage there is no annotation data available in the sense of amino acid sequence or protein information. The information instead consists of the peptide identity, whether a spectrum is similar to another spectrum, and in which run the peptide occurs. This means that there is enough information to track the occurrence of spectra to specific experiments, even if the peptide itself is not annotated.

3.4 Pattern recognition

This subsection will explain the methods behind the process of pattern recognition. The aim is to find patterns in the peptide data in an attempt to correlate peptide abundance to a specific tissue or treatment.


Figure 17 – Entity Relationship diagram that describes the design of the annotation database. Each box represents a table in the database, where the name of the table is listed at the top and the attribute names and corresponding data types are listed below. The labels on the arrows describe the cardinalities between entries in the tables. For example, the cardinality 1 to N between “Peptide” and “Annotation” means that one peptide can have several annotations.


3.4.1 Data sub-setting and filtering

The peptide abundance levels in the data set vary greatly in magnitude, from 0 up to around 65 000 000. The abundance values are dimensionless and reflect the relative abundance between all peptides in the data set. An abundance value of 0 means that no peptide signal was detected in that particular sample. However, the vast majority of the values are in the lower ranges, under 1000. The data set was filtered using a cutoff on the abundance value to create a subset of the original data set. The purpose was partly to investigate whether there is a difference between low- and high-abundance peptides when it comes to prediction performance. Filtering and sub-setting were also performed to determine whether peptides in low abundance are significant and not just noise in the data.

Abundance cutoff values were chosen arbitrarily from a visual inspection of the data distribution (fig. 25). Several values were tested: 150, 300, 500 and no cutoff. The filter works as follows:

• Go through all peptides in the data set.

• Check whether the peptide occurs at a level above the given threshold in any MS run (any observation).

• If it does, save the peptide to the subset.

A more stringent cutoff was also used to explore any differences between high- and low-abundance peptides when it comes to discerning between sample groups. The stricter criterion required every observation of a peptide to exceed the threshold; a cutoff value of 500 was used.
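Both filters are straightforward to express in code. The thesis applies them in R; the sketch below is an illustrative Python version under the assumption that the data are laid out as one abundance vector per peptide (one value per MS run).

```python
def filter_any(abundances, cutoff):
    """Soft filter: keep peptides that exceed `cutoff` in at least one MS run."""
    return {pep: vals for pep, vals in abundances.items()
            if any(v > cutoff for v in vals)}

def filter_all(abundances, cutoff):
    """Stringent filter: keep peptides whose every observation exceeds `cutoff`."""
    return {pep: vals for pep, vals in abundances.items()
            if all(v > cutoff for v in vals)}

# Toy data: peptide name -> abundance per MS run (values are made up)
data = {"pep1": [0, 40, 90],        # never exceeds 150
        "pep2": [10, 800, 20],      # exceeds 150 in one run
        "pep3": [600, 900, 700]}    # exceeds 500 in every run
subset = filter_any(data, 150)      # pep2 and pep3 survive
strict = filter_all(data, 500)      # only pep3 survives
```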

3.4.2 Feature extraction - clustering

Feature extraction was done using the k-means method implemented in R (“kmeans” in package stats, version 2.14.1) [33]. Since the optimal number of clusters for a given data set cannot be known beforehand, several values of k were tested: 5, 10, 20, 30, 40, 50, 100, 150, 200, 250, 500, 1000, 3000, 5000 and 10000.

The optimal number of clusters, k, was investigated using the Elbow method: the total within-cluster sum of squares (WSS) is computed for each candidate k, and the k at which the decrease in WSS levels off is chosen.
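The thesis runs k-means in R on the peptide data; the self-contained sketch below illustrates the same WSS-versus-k reasoning on a toy one-dimensional data set, using a plain k-means with quantile-based initialisation (an assumption made for determinism, not the R implementation).

```python
def kmeans_wss(points, k, iters=100):
    """Plain 1-D k-means; returns the total within-cluster sum of squares."""
    srt = sorted(points)
    # Initialise centres at evenly spaced quantiles of the sorted data
    centers = [srt[i * (len(srt) - 1) // max(k - 1, 1)] for i in range(k)]
    for _ in range(iters):
        clusters = [[] for _ in range(k)]
        for p in points:
            nearest = min(range(k), key=lambda c: (p - centers[c]) ** 2)
            clusters[nearest].append(p)
        # Move each centre to the mean of its cluster (keep it if empty)
        centers = [sum(cl) / len(cl) if cl else centers[j]
                   for j, cl in enumerate(clusters)]
    return sum(min((p - c) ** 2 for c in centers) for p in points)

# Elbow method: WSS drops sharply until k reaches the "true" cluster count
# (three groups here), then flattens out.
points = [1.0, 1.1, 0.9, 10.0, 10.2, 9.8, 20.0, 19.9, 20.1]
wss = {k: kmeans_wss(points, k) for k in (1, 2, 3, 4)}
```

Plotting `wss` against k would show the characteristic elbow at k = 3 for this toy data.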

3.4.3 Pattern recognition

Principal Component Analysis (PCA) was used as an exploratory way of finding distinctive patterns in the data. The scores of the two major principal components were plotted against each other in order to reveal any clear separation between groups in the data set.
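To make the idea of "scores of the two major principal components" concrete, the sketch below computes PC scores numerically via power iteration with deflation. This is not the thesis's implementation (which would typically use R's built-in PCA); it is a minimal, self-contained illustration of how the plotted scores arise.

```python
import random

def pca_scores(X, n_components=2, iters=300):
    """Minimal PCA: power iteration with deflation on the covariance matrix.
    X is a list of samples, each a list of p features."""
    random.seed(1)  # deterministic start vectors for this sketch
    n, p = len(X), len(X[0])
    means = [sum(row[j] for row in X) / n for j in range(p)]
    Xc = [[row[j] - means[j] for j in range(p)] for row in X]   # centre features
    C = [[sum(Xc[i][a] * Xc[i][b] for i in range(n)) / (n - 1)
          for b in range(p)] for a in range(p)]                 # covariance
    comps = []
    for _ in range(n_components):
        v = [random.random() for _ in range(p)]
        for _ in range(iters):  # power iteration -> dominant eigenvector
            w = [sum(C[a][b] * v[b] for b in range(p)) for a in range(p)]
            norm = sum(x * x for x in w) ** 0.5
            v = [x / norm for x in w]
        lam = sum(v[a] * C[a][b] * v[b] for a in range(p) for b in range(p))
        comps.append(v)
        # Deflate: remove the found component so the next one can emerge
        C = [[C[a][b] - lam * v[a] * v[b] for b in range(p)] for a in range(p)]
    # Scores: projections of the centred samples onto the components
    return [[sum(Xc[i][j] * c[j] for j in range(p)) for c in comps]
            for i in range(n)]

# Toy data: most variance along the first feature
X = [[0, 0.2], [1, -0.1], [2, 0.1], [3, -0.2], [4, 0.0]]
scores = pca_scores(X)  # one (PC1, PC2) pair per sample, ready to plot
```

By construction the PC1 scores spread far more than the PC2 scores, which is exactly why the two-component score plot is a useful first look at group structure.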

Classification was done using Random Forest (“randomForest” in package randomForest, version 4.6-6) as the classifier in R. A model was trained and tested by means of LOOCV.

The Random Forest model was trained and tested using 2000 trees, with √p attributes considered at each split, where p is the number of attributes. The accuracy was calculated as the mean classification result over all LOOCV iterations.
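The LOOCV accounting itself is independent of the classifier. The sketch below shows the hold-one-out loop and the mean-accuracy calculation; a trivial 1-nearest-neighbour rule stands in for the Random Forest model the thesis actually uses in R.

```python
def loocv_accuracy(X, y, train_and_predict):
    """Leave-one-out CV: hold out each sample once, train on the rest,
    and report the mean of the per-sample correctness."""
    hits = 0
    for i in range(len(X)):
        X_train = X[:i] + X[i + 1:]
        y_train = y[:i] + y[i + 1:]
        hits += int(train_and_predict(X_train, y_train, X[i]) == y[i])
    return hits / len(X)

def one_nn(X_train, y_train, x):
    """Stand-in classifier (the thesis uses Random Forest instead)."""
    dists = [sum((a - b) ** 2 for a, b in zip(row, x)) for row in X_train]
    return y_train[dists.index(min(dists))]

# Toy data: two well-separated sample groups
X = [[0.0], [0.2], [5.0], [5.2]]
y = ["ctrl", "ctrl", "treated", "treated"]
acc = loocv_accuracy(X, y, one_nn)  # 1.0 on this toy data
```

Swapping `one_nn` for any other `train_and_predict` function (e.g. a Random Forest wrapper) leaves the LOOCV loop unchanged, which is the point of passing the classifier in as a parameter.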
