

Theoretical Fundamentals of Computational Proteomics and Deep Learning-Based Identification of Chimeric Mass Spectrometry Data

by

Jens Settelmeier

A Thesis

Submitted to the School of Electrical Engineering and Computer Science

KTH Royal Institute of Technology

In Partial Fulfilment of the Requirements for the Degree of Master of Science

KTH Royal Institute of Technology

School of Electrical Engineering and Computer Science

Affiliation

The Noble Research Lab at Department of Genome Sciences, University of Washington, Seattle (USA)

Examiner

Mr. Prof. Johan Håstad, Ph.D.

Stockholm, Sweden

School of Engineering Sciences: Mathematics

Supervisors

Mr. Prof. Kevin Smith, Ph.D.

Stockholm, Sweden

Division of Computational Science and Technology

Mr. Prof. William Stafford Noble, Ph.D.

Seattle, USA

University of Washington

Department of Genome Sciences

Department of Computer Science and Engineering

Mr. Wout Bittremieux, Ph.D.

San Diego, USA

University of California San Diego

Skaggs School of Pharmacy and Pharmaceutical Sciences


A complicating factor for peptide identification by MS/MS experiments is the presence of “chimeric” spectra, where at least two precursor ions with similar retention time and mass co-elute in the mass spectrometer. This results in a spectrum that is a superposition of the spectra of the individual peptides.

These chimeric spectra make peptide identification more difficult, so chimera detection tools are needed to improve peptide identification rates. GLEAMS is a learned embedding algorithm for efficient joint analysis of millions of mass spectra. In this work, we first simulate chimeric spectra. Then we present a deep neural network-based classifier that learns to distinguish between chimeras and pure spectra. The results show that GLEAMS embeddings capture a spectrum’s chimericness. Detecting chimeras in this way can lead to a higher protein identification rate in samples and can support applications such as biomarker development.

Keywords

Computational proteomics, mass spectrometry, chimera identification, machine learning, deep learning, big data, bioinformatics


A complicating factor for peptide identification by MS/MS experiments is the presence of “chimeric” spectra, or “chimeras”, in which at least two precursors with similar retention time and mass co-elute into the mass spectrometer and result in a spectrum that is a superposition of individual spectra. Because these chimeric spectra make peptide identification more challenging, a detection tool is needed to improve peptide identification rates. In this work, we focused on GLEAMS, a learned embedding for efficient joint analysis of millions of mass spectra. First, we simulated chimeric spectra. Then we present an ensemble classifier based on different machine learning and deep learning methods that learns to distinguish simulated chimeras from pure spectra. The result shows that GLEAMS captures the “chimericness” of a spectrum, which can lead to a higher protein identification rate and support development processes for biomarkers.

Keywords

Computational proteomics, mass spectrometry, chimera identification, machine learning, deep learning, big data, bioinformatics


First of all, I want to thank Johan Håstad and Kevin Smith for supporting my venture to do the degree project abroad. Special thanks go to William Noble and Wout Bittremieux for their remarkable supervision during the whole project. Further, I want to thank Jeff Bilmes and all Noble Research Lab members for their critical and helpful comments during discussions about methods and results.

aaa amino acid alphabet
ASGD Averaged Stochastic Gradient Descent method
AUC Area Under the Curve
CID Collision-Induced Dissociation
CPU Central Processing Unit
FDR False Discovery Rate
GPU Graphics Processing Unit
L-cont. Lipschitz-continuous
LDC Linear Discriminant Analysis Classifier
NBC Naive Bayesian Classifier
PSM Peptide-Spectrum Match
QDC Quadratic Discriminant Analysis Classifier
RBF Radial Basis Function
ROC Receiver Operating Characteristic
SVC Support Vector Classifier
SVD Singular Value Decomposition

1.1.1 A Fourier-transform ion cyclotron resonance instrument.
1.1.2 The basic chemical structure of an amino acid.
1.1.3 The 20 amino acids in their structural formulas that we consider in the following work.
1.1.4 An example spectrum of a peptide represented as a peaks diagram.
1.1.5 Ion annotation of a peptide in the Roepstorff-Fohlman-Biemann nomenclature.
1.1.6 An example of a chimeric spectrum and its ion annotation corresponding to the generating peptides.
2.1.1 MS/MS experiment workflow from sample preparation to data analysis.
3.2.1 Dropout: Sampling thin networks (b) out of (a).
3.2.2 A residual connection in an MLP.
3.2.3 Learning rate circling over iterations.
3.2.4 Learning rate circling with decay over iterations.
3.2.5 Confusion matrix and commonly used metrics.
3.2.6 Example of ROC curves for a binary classifier.
4.1.1 Euclidean distance distribution of GLEAMS embeddings of positive and negative pairs.
4.2.1 Euclidean distance distribution of the embeddings of real chimeric spectra to their generating partners.
4.2.2 Euclidean distance distribution of the embeddings of simulated chimeric spectra to their generating partners.
4.4.1 Learning performance of the MLP Classifier.
4.4.2 Learning performance of the SVM Classifier.
4.4.3 Learning performance of the Gradient Boosted Classifier.
4.4.4 Learning performance of the Quadratic Discriminant Analysis Classifier.
4.4.5 Learning performance of the Linear Discriminant Analysis Classifier.
4.4.6 Learning performance of the Naive Bayes Classifier.
4.6.1 Final model’s performance on a test set with simulated chimera and pure spectra.
4.6.2 Final model’s performance on a test set with real chimera and pure spectra.
4.6.3 Modified Multilayer Perceptron, and best performing final model.
A.0.1 An overfitting classifier that is trained via circle learning and without dropout and weight penalization.
A.0.2 Learning curves of a classifier without an overfitting problem.
B.1.1 A simulated chimeric spectrum.
B.1.2 A simulated chimeric spectrum with more shared peaks than the first one.
B.1.3 A simulated chimeric spectrum with a lot of shared peaks, even between different ion fragments.
B.2.1 A real chimeric spectrum, with a perfect chimericness score of 0.5.
B.2.2 A real chimera found by Crux.
B.2.3 A real chimera found by Crux.

3.2.1 Multilayer Perceptron grid for the sklearn GridSearchCV function.
3.2.2 Support Vector Classifier grid for the sklearn GridSearchCV function.
3.2.3 Gradient Boosted Decision Tree Classifier grid for the sklearn GridSearchCV function.
3.2.4 Linear Discriminant Classifier grid for the sklearn GridSearchCV function.
4.4.1 Train and Validation AUCs for each binary classifier over different data portions.
4.5.1 Performance scores of the tested classifiers.

1 Introduction
1.1 Background
1.1.1 Mass Spectrometer: Technical Basics
1.1.2 Mass Spectrometry and Amino Acids
1.1.3 Ion Annotation
1.1.4 Chimeric data
1.1.5 Crux
1.1.6 GLEAMS
1.2 Problem
1.3 Approach
1.4 Limitations
1.5 Implementation details
1.6 Outline

2 Background
2.1 Mass Spectrometry, Notation and Terminology
2.2 Identifying peptide sequences with database search tools
2.3 The MassIVE-KB dataset
2.4 GLEAMS
2.4.1 Encoding mass spectra for GLEAMS input
2.4.2 Neural Network Structure of GLEAMS
2.4.3 Training of GLEAMS

3 Methods
3.1 Proteomics Methods
3.1.1 Simulation of Chimeric spectra
3.1.2 Finding real chimera
3.1.3 GLEAMS’ embeddings
3.2 Machine Learning Methods
3.2.1 Support Vector Machines
3.2.2 Gradient Boosted Decision Trees
3.2.3 Naive Bayes, Linear and Quadratic Discriminant Analysis classifiers
3.2.4 Multilayer Perceptron
3.2.5 Evaluation Metric: Area Under the Curve (AUC)
3.2.6 Model Search
3.2.7 Model Scalability

4 Experiments and Results
4.1 Experimental Verification of GLEAMS Contrastive Loss
4.2 Experimental investigation of chimeric spectrum embeddings
4.3 Model Search Results
4.4 Scalability Investigation Results
4.5 Model Performances
4.6 Final binary classifier

5 Discussion
5.1 Discussion of exploratory GLEAMS’ embeddings’ investigation
5.1.1 Embedding of GLEAMS pairs
5.1.2 Embedding of chimeric spectra
5.2 Model Search and Final Model Development
5.2.1 Model Scaling results
5.2.2 Result of the Final Model and its development

6 Conclusion
6.1 Summary
6.2 Discussion and Future Work
6.3 Ethical considerations and social aspects
6.4 Sustainability

References

Introduction

Mass spectrometry is a technique for measuring the mass-to-charge ratio of ions. In the field of proteomics, the systematic study of the complete set of proteins expressed in a given organism, this technique is typically used to measure and compare changes in biological samples such as peptides. A peptide is a short linear sequence of 2 to 50 amino acid residues. Peptides are primarily distinguished by their amino acid sequence and build proteins that confer different properties and functions to tissues. Because of their function-giving properties, scientists are very interested in understanding and identifying peptides. This can be critical, for example, in the treatment of cancer patients. A physician would like to know how the protein surface of a tumor changes so that medical treatment using cancer drugs can be specifically tailored to the patient’s tumor. Such personalized medicine approaches can significantly improve patient survival rates, as discussed in Koomen et al. [37].

To find out which amino acid sequence corresponds to a given peptide sample from a patient, a measurement is performed with a mass spectrometer, which provides a mass spectrum that depends on the physicochemical properties of the amino acid sequence and can thus be used to identify the sequence. This measurement procedure is subject to many uncertainties, which are described, for example, in [66], [57]. One particular uncertainty is the existence of chimeras. Chimeras can occur when two or more peptides co-elute during the mass spectrometric measurement process. The resulting spectrum is a superposition of the individual spectra of the corresponding co-eluted peptides.

Chimeras lead to misidentification of the peptides’ amino acid sequences, as described in [30]. Dorfer et al. [12] discuss how identifying chimeric spectra can reduce the false discovery rate of peptide identifications.

The dominant approach for peptide identification is to associate a tandem mass spectrum with a peptide sequence based on database searches [2], [16], [46]. Alternatives are clustering approaches like PRIDE [22] or embedding approaches like GLEAMS [43]. Only the latter has the ability to efficiently add emerging peptide sequences as discussed in [43], which is a significant advantage as the amount of data increases.

This work investigates, using machine learning methods, how well the tandem mass spectrometry embedding algorithm GLEAMS captures the chimericness of a spectrum. The work is aimed at researchers in machine learning, computational biology, and biochemistry. In chapter 1, we give a basic introduction to mass spectrometric measurements. In chapter 2, we formalize proteomics terminology and cover the basics of database-driven tools for identifying peptide spectra, using Crux as an example. In chapter 3, we cover GLEAMS and how it works, our methods for simulating chimeric spectra and finding real ones, and the basics of the machine learning methods that we use and modify to create a strong binary classifier that can distinguish pure from chimeric spectra. In chapter 4, we perform experiments that provide evidence that GLEAMS captures the chimericness of simulated mass spectrometry data and of real data from the MassIVE-KB dataset [69]. We further investigate the presented machine learning approaches in terms of their suitability for solving the “chimeric or pure” classification problem, and develop a final model based on the previous investigations that performs better than all discussed state-of-the-art approaches. We discuss the results in chapter 5. In chapter 6, we draw some conclusions, explore future work, and consider sustainability and ethical aspects of the work.

1.1 Background

This section is divided into six parts. We start with the technical basics of mass spectrometers in subsection 1.1.1. Next, we introduce mass spectrometry and amino acids in subsection 1.1.2. This is followed by a brief explanation of how the amino acids of peptides and spectra are related and a description of ion annotations in subsection 1.1.3. We then describe chimeric spectra in subsection 1.1.4 and show an example plot in Figure 1.1.6. Finally, we give some background on Crux, the database search tool used in this work, in subsection 1.1.5, and on its competitor, the neural network-based identification tool GLEAMS, in subsection 1.1.6. To the best of the author’s knowledge, this is the first time that an algorithm for embedding peptide mass spectra has been studied in terms of its ability to capture the chimericness information of embedded spectra.

1.1.1 Mass Spectrometer: Technical Basics

A typical mass spectrometer is a Fourier transform ion cyclotron resonance instrument, shown in Figure 1.1.1, or an orbitrap instrument, as described in [47]. Simply put, a mass spectrometer is a particle accelerator and detector. First, the amino acid chain of a peptide is broken into fragments by evaporation, with some breaks in the chain more likely than others due to physicochemical properties. Then the fragments are protonated. The resulting ions move through controlled electric and magnetic fields. Their dynamics are governed by Newton’s second law $\vec{F} = m\vec{a}$ and the Lorentz force law $\vec{F} = Q(\vec{E} + \vec{v} \times \vec{B})$, where $\vec{F}$ is the force acting on the ion, $m$ is its mass, $\vec{a}$ is its acceleration, $Q$ is the charge of the ion, $\vec{E}$ is the electric field, and $\vec{v} \times \vec{B}$ is the cross product between the velocity of the accelerated ion and the magnetic field. The detector measures the circular current induced by the ion motion, whose frequency depends on the mass-to-charge ratio. These frequencies are converted into a spectrum via a Fourier transform. For a further introduction to spectra, see subsection 1.1.2.

Figure 1.1.1: A Fourier-transform ion cyclotron resonance instrument. It is a device for measuring the mass-to-charge ratio of ionized molecules.¹
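To make the mass-to-charge dependence of the detected frequency concrete, the following sketch combines the two laws above for an idealized case: an ion moving perpendicular to a uniform magnetic field with no electric field. Setting $QvB = mv^2/r$ gives the cyclotron frequency $f = QB/(2\pi m)$. The function name, the 7 T field strength, and the example masses are illustrative assumptions and are not taken from this work.

```python
import math

ELEMENTARY_CHARGE_C = 1.602176634e-19   # coulomb (exact SI value)
DALTON_KG = 1.66053906660e-27           # kilogram per dalton (CODATA 2018)


def cyclotron_frequency_hz(mass_da: float, charge: int, b_tesla: float) -> float:
    """Idealized cyclotron frequency f = Q*B / (2*pi*m) of an ion.

    Assumes a uniform magnetic field perpendicular to the ion velocity
    and no electric field, which simplifies a real instrument.
    """
    q = charge * ELEMENTARY_CHARGE_C
    m = mass_da * DALTON_KG
    return q * b_tesla / (2.0 * math.pi * m)


if __name__ == "__main__":
    # Hypothetical example: two doubly charged ions in an assumed 7 T magnet.
    f_light = cyclotron_frequency_hz(1000.0, 2, 7.0)  # m/z = 500
    f_heavy = cyclotron_frequency_hz(2000.0, 2, 7.0)  # m/z = 1000
    print(f"m/z 500:  {f_light / 1e3:.1f} kHz")
    print(f"m/z 1000: {f_heavy / 1e3:.1f} kHz")
```

In this idealization, doubling m/z halves the frequency, which is why a Fourier transform of the induced current can separate ions by their mass-to-charge ratio.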

1.1.2 Mass Spectrometry and Amino Acids

As mentioned above, peptides differ from each other by their amino acid sequence. An amino acid is an organic compound that contains a carboxyl (COOH) and an amine (NH₂) group, as well as an amino-acid-specific R group, also called a side chain. Its structure is shown in Figure 1.1.2. Twenty different amino acids occur in the genetic code. Their structural formulas are shown in Figure 1.1.3.

Figure 1.1.2: The basic chemical structure of an amino acid. It has four main groups: the carboxyl group COOH, the amino group NH₃, the central carbon C, and the variable region called the side chain. The carbon atoms are black, the oxygen is dark gray, the nitrogen is light gray, and the hydrogen is white [31].

Figure 1.1.3: The 20 amino acids in their structural formulas that we consider in the following work.²

¹ https://www.flickr.com/photos/beigephotos/5651556067/

² https://www.purefoodcompany.com/aminoacidchart/

Figure 1.1.4: An example spectrum of a peptide represented as a peaks diagram.

From top left to bottom right, we use the following abbreviations for amino acids: G, A, V, I, L, F, Y, W, K, R, H, D, E, N, Q, C, M, S, T, P. An example of a peptide is a sequence of amino acids such as GAMER. The spectrum might look like the peak diagram shown in Figure 1.1.4 and is given by an intensity array $I \in \mathbb{R}^n$ and a corresponding mass-to-charge ratio array of the same length $m/z \in \mathbb{R}^n$, with $n \in \mathbb{N}$. The peaks in the spectrum correspond to the tuples $(I_i, m/z_i)$ for $i \in \{1, \ldots, n\}$. It turns out that the spectrum of a peptide is not unique. The peptide’s physicochemical properties determine the spectrum, but there are a lot of uncertainties that affect the exact values. More details on this are given in section 1.4. A formal definition of spectra and more details are given in chapter 2.
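As a minimal sketch of this representation, the two arrays can be paired into a list of $(I_i, m/z_i)$ peaks. The type alias, the function names, the sorting by m/z, and the numeric values are assumptions made for illustration; they are not part of this work's code.

```python
from typing import List, Tuple

Peak = Tuple[float, float]  # (intensity, m/z), in the order used in the text


def to_peaks(intensity: List[float], mz: List[float]) -> List[Peak]:
    """Pair two equally long arrays into (I_i, m/z_i) peaks, sorted by m/z."""
    if len(intensity) != len(mz):
        raise ValueError("intensity and m/z arrays must have the same length")
    return sorted(zip(intensity, mz), key=lambda peak: peak[1])


def base_peak(peaks: List[Peak]) -> Peak:
    """Return the most intense peak of the spectrum."""
    return max(peaks, key=lambda peak: peak[0])


if __name__ == "__main__":
    # Made-up values, only to show the data layout.
    peaks = to_peaks([10.0, 50.0, 5.0], [300.2, 147.1, 512.3])
    print(peaks)
    print(base_peak(peaks))
```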

1.1.3 Ion Annotation

The fragmentation of the peptide in the collision-induced dissociation (CID) process of the mass spectrometer is a stochastic process determined by the peptide’s physicochemical properties and the energy of the collision. For more technical details of CID, we refer to [70] and [60]. Typically, the CID method involves cleavage of the peptide at the peptide bonds, and therefore the resulting spectrum contains information about the constituent amino acids of the peptide. The fragments that are charged after the CID process can be deduced based on the position of the broken bond and the side that retains the charge. A fully annotated peptide example is given in Figure 1.1.5 using the established Roepstorff-Fohlman-Biemann nomenclature. Depending on the broken peptide bonds, we distinguish a-, b-, x-, y-, and z-ions. In this work, we only consider b- and y-ions because the CID method generates b- and y-ions almost exclusively. For more details on peptide ion annotation, we refer to [53], [3]³, and chapter 2. For now, it is important to understand that fragments cause the peaks in the spectra and that there is a systematic way to name them.

Figure 1.1.5: Ion annotation of a peptide in the Roepstorff-Fohlman-Biemann nomenclature.⁴ The dashed lines mark the peptide bonds that can break. In the following work, we only consider b and y ions. b_i is the fragment to the left of the dashed line labeled b_i and has an N terminus. y_j is the fragment to the right of the dashed line labeled y_j and has an OH terminus. For more details, see chapter 2.

³ https://www.ionsource.com/tutorial/DeNovo/nomenclature.htm

⁴ https://www.ionsource.com/tutorial/DeNovo/full_anno.htm
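To illustrate how b- and y-ions relate a peptide sequence to peak positions, the sketch below computes singly charged b- and y-ion m/z values under a common convention: a b-ion is the sum of its residue masses plus one proton, and a y-ion additionally carries one water. The residue table holds standard monoisotopic masses; the function name `by_ions` and the use of 1.0073 Da for the proton are assumptions for this illustration. The formal fragment definitions of this work are given in chapter 2.

```python
from typing import List, Tuple

# Standard monoisotopic residue masses in Da, rounded to five decimals.
RESIDUE_MASS = {
    "G": 57.02146, "A": 71.03711, "S": 87.03203, "P": 97.05276,
    "V": 99.06841, "T": 101.04768, "C": 103.00919, "L": 113.08406,
    "I": 113.08406, "N": 114.04293, "D": 115.02694, "Q": 128.05858,
    "K": 128.09496, "E": 129.04259, "M": 131.04049, "H": 137.05891,
    "F": 147.06841, "R": 156.10111, "Y": 163.06333, "W": 186.07931,
}
WATER = 18.01056   # Da, monoisotopic mass of H2O
PROTON = 1.0073    # Da, rounded proton mass


def by_ions(peptide: str) -> Tuple[List[float], List[float]]:
    """Return singly charged b- and y-ion m/z values (b_1.., y_1..)."""
    masses = [RESIDUE_MASS[aa] for aa in peptide]
    n = len(masses)
    # b_i: the first i residues plus one proton.
    b = [sum(masses[:i]) + PROTON for i in range(1, n)]
    # y_j: the last j residues plus water plus one proton.
    y = [sum(masses[n - j:]) + WATER + PROTON for j in range(1, n)]
    return b, y


if __name__ == "__main__":
    b_ions, y_ions = by_ions("GAMER")
    for i, mz in enumerate(b_ions, start=1):
        print(f"b{i}: {mz:.4f}")
    for j, mz in enumerate(y_ions, start=1):
        print(f"y{j}: {mz:.4f}")
```

Each pair b_i and y_(n-i) comes from the same broken bond, so under this convention their m/z values add up to the total residue mass plus one water and two protons.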


1.1.4 Chimeric data

We are interested in identifying chimeras. Chimeras, or chimeric spectra, are spectra generated by a heterogeneous population of two or more co-eluting peptides (chimeric peptides). The more peptides present in the mass spectrometer, the more likely it is that several peptides with similar mass are in the instrument at the same time, making separation and identification difficult. If chimeric peptides are present, we expect to see an overlap of the spectra of the individual peptides, caused by the fragments of these peptides in the mass spectrometer. An example of a spectrum classified as chimeric by the Crux database search tool (see below) can be found in Figure 1.1.6. Annotations were performed using a modified version of pepfrag⁵. Green peaks belong to fragments of one peptide, blue peaks to fragments of the other peptide. Red peaks belong to fragments of both peptides.
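The overlap described above can be sketched on toy data: two peak lists are overlaid, and peaks whose m/z values agree within a tolerance are treated as shared. This is only an illustration of the superposition idea and of the three peak categories (first peptide, second peptide, both). It is not the simulation method of this work, which is described in chapter 3. The function name, the tolerance, and all numbers are assumptions.

```python
from typing import List, Tuple

Peak = Tuple[float, float]             # (intensity, m/z), as in the text
LabeledPeak = Tuple[float, float, str]  # (intensity, m/z, origin)


def superimpose(spec_a: List[Peak], spec_b: List[Peak],
                tol: float = 0.01) -> List[LabeledPeak]:
    """Overlay two peak lists and label each resulting peak by its origin.

    Peaks from both spectra whose m/z values differ by at most `tol`
    are merged into one shared peak with summed intensity.
    """
    merged: List[LabeledPeak] = []
    unmatched_b = list(spec_b)
    for int_a, mz_a in spec_a:
        match = next((p for p in unmatched_b if abs(p[1] - mz_a) <= tol), None)
        if match is None:
            merged.append((int_a, mz_a, "A"))
        else:
            unmatched_b.remove(match)
            merged.append((int_a + match[0], mz_a, "both"))
    merged.extend((inten, mz, "B") for inten, mz in unmatched_b)
    return sorted(merged, key=lambda peak: peak[1])


if __name__ == "__main__":
    # Made-up peak lists for two hypothetical co-eluting peptides.
    spectrum_a = [(100.0, 175.119), (40.0, 300.5)]
    spectrum_b = [(60.0, 175.120), (20.0, 420.7)]
    for peak in superimpose(spectrum_a, spectrum_b):
        print(peak)
```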

1.1.5 Crux

Crux is a peptide identification toolkit that takes a spectrum as input and outputs the most likely corresponding peptide sequence. It is an extension of the widely used database search program SEQUEST, developed at the University of Washington. Both Crux and SEQUEST rely primarily on a cross-correlation score to estimate the similarity between the input spectrum and the spectra of candidate peptides from a database whose precursor mass-to-charge ratio is similar to the one estimated in the first stage of the mass spectrometer during the MS/MS experiment. Further details on Crux, and in particular on how the cross-correlations are calculated, can be found in [46] and in section 2.2. Additional details on how to perform a two-step mass spectrometry (MS/MS) experiment can be found in chapter 3, in particular in subsection 3.1.3, and in [47, 56, 63]. In addition to database-based search programs such as Crux, there are alternative approaches such as the mass spectrometry embedding algorithm GLEAMS, which we examine in more detail as part of this work.

⁵ https://pypi.org/project/pepfrag/
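To give a rough idea of what a cross-correlation score measures, the sketch below implements one commonly described SEQUEST-style formulation on binned intensity vectors: the dot product of the observed and theoretical spectrum at zero offset, minus the mean dot product over nearby offsets. This is a simplified illustration under stated assumptions (equal-length binned vectors, an offset window of ±75 bins, no preprocessing). The exact computation used by Crux is the one described in [46] and section 2.2, and the function name `xcorr` is hypothetical.

```python
from typing import List


def xcorr(observed: List[float], theoretical: List[float],
          max_shift: int = 75) -> float:
    """Simplified cross-correlation score between two binned spectra.

    Score = dot product at offset 0 minus the mean dot product over
    all offsets in [-max_shift, max_shift] except 0.
    """
    if len(observed) != len(theoretical):
        raise ValueError("spectra must be binned to the same length")
    n = len(observed)

    def dot_at(shift: int) -> float:
        # Dot product with the theoretical spectrum shifted by `shift` bins.
        return sum(observed[i] * theoretical[i + shift]
                   for i in range(n) if 0 <= i + shift < n)

    background = sum(dot_at(s) for s in range(-max_shift, max_shift + 1)
                     if s != 0) / (2 * max_shift)
    return dot_at(0) - background


if __name__ == "__main__":
    # Toy binned spectra: three unit peaks in 200 bins.
    observed = [0.0] * 200
    for index in (10, 50, 120):
        observed[index] = 1.0
    matching = list(observed)
    print(xcorr(observed, matching))
```

Subtracting the background term penalizes candidates that match only by chance at many offsets, so a well-aligned candidate scores higher than a misaligned one.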

1.1.6 GLEAMS

GLEAMS is a deep neural network that learns to embed spectra in a low-dimensional space. GLEAMS embeds spectra from the same peptide with the same precursor charge and a similar precursor mass-to-charge ratio close to each other, resulting in clusters. Crux matches peptide sequences to tandem mass spectra by deriving a list of candidates from a database, generating a theoretical spectrum for each candidate, and then scoring each putative peptide-spectrum match based on the similarity between the observed and theoretical spectral fragments. In contrast, GLEAMS uses a neural network approach with a contrastive loss function that leads to spectra of identical peptides being embedded together. Once the network is trained, inference yields an embedding in a low-dimensional space from which the peptide sequence can be inferred by checking which peptide cluster it is closest to. A useful property of this approach is that unknown data is embedded separately from known data, so class expansion is possible without having to retrain the entire network, as is often required in conventional clustering approaches such as the PRIDE program mentioned earlier. For more details on GLEAMS, please refer to [43] and chapter 2.
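The two ideas in this paragraph, a contrastive loss on pairs of embeddings and inference by nearest cluster, can be sketched as follows. The loss shown is one widely used margin-based form of contrastive loss and is an assumption here; the exact loss and network used by GLEAMS are described in [43]. All function names, the margin, the centroids, and the peptide labels are hypothetical.

```python
import math
from typing import Dict, Sequence


def euclidean(u: Sequence[float], v: Sequence[float]) -> float:
    """Euclidean distance between two embedding vectors."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))


def contrastive_loss(u: Sequence[float], v: Sequence[float],
                     same_peptide: bool, margin: float = 1.0) -> float:
    """Margin-based contrastive loss for one pair of embeddings.

    Pairs from the same peptide are pulled together (d^2); other pairs
    are pushed apart until they are at least `margin` away.
    """
    d = euclidean(u, v)
    if same_peptide:
        return d ** 2
    return max(0.0, margin - d) ** 2


def nearest_cluster(embedding: Sequence[float],
                    centroids: Dict[str, Sequence[float]]) -> str:
    """Return the label of the cluster centroid closest to the embedding."""
    return min(centroids, key=lambda label: euclidean(embedding, centroids[label]))


if __name__ == "__main__":
    # Hypothetical 2-D embeddings and cluster centroids.
    centroids = {"GAMER": [0.0, 0.0], "PEPTIDE": [5.0, 5.0]}
    print(nearest_cluster([0.1, 0.0], centroids))
    print(contrastive_loss([0.0, 0.0], [3.0, 4.0], same_peptide=True))
```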

1.2 Problem

The studies of [44, 68] show that up to 50% of MS/MS spectra contain more than one peptide during the first stage of the MS/MS experiment. In [30], it is shown that chimeras impair peptide identification by increasing the false-negative rate while also causing a small increase in the false-positive rate. By identifying chimeric spectra, we can improve the identification rate of pure spectra, which is essential for drug development and patient treatment. Thus we investigate: To what extent can GLEAMS capture the chimericness of peptide mass spectrometry data?

Our hypothesis is that GLEAMS embeddings are able to capture information about the chimericness of a spectrum.

1.3 Approach

Our approach is to train different classifiers on GLEAMS embeddings and evaluate their ability to distinguish pure from chimeric spectra. To do this, it is necessary to first construct a dataset with simulated chimeric spectra to train and evaluate the classifiers. We also check how the resulting classifier performs on a dataset containing pure spectra and real chimeric spectra, the latter derived using the established database search tool Crux, both from the MassIVE-KB dataset.

To identify real chimeric spectra, we apply the Crux database search tool to the MassIVE-KB dataset, chosen because it contains the largest collection of identified MS/MS spectra. Because this process yields only a very small set of chimeric spectra, we develop a method to simulate chimeras, described in subsection 3.1.1.

If the trained classifiers are able to discriminate between pure spectra and real or simulated chimeras, the GLEAMS embeddings must contain information about the chimericness of the spectrum. We measure classifier performance by the Area Under the Curve (AUC) of the Receiver Operating Characteristic (ROC). If the AUC is substantially greater than 0.5, the classifier performs better than random guessing and is able to discriminate chimeras from pure spectra. A negative result from our classifier experiment, however, cannot reject the hypothesis that GLEAMS captures the chimericness of a spectrum, since the classifiers we tested might simply fail to extract information that is present in the embeddings.
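The AUC criterion can be made concrete with a small dependency-free sketch. It uses the rank interpretation of the ROC AUC: the probability that a randomly chosen positive (here: chimeric) example receives a higher score than a randomly chosen negative (pure) one, with ties counted as one half. The function name and the toy labels and scores are assumptions for illustration; in practice a library routine such as `sklearn.metrics.roc_auc_score` computes the same quantity.

```python
from typing import Sequence


def roc_auc(labels: Sequence[int], scores: Sequence[float]) -> float:
    """ROC AUC via pairwise comparison of positive and negative scores.

    labels: 1 for chimeric, 0 for pure. scores: classifier outputs,
    where a higher score means "more likely chimeric".
    """
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("need at least one positive and one negative example")
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos) * len(neg))


if __name__ == "__main__":
    # Toy example: two pure (0) and two chimeric (1) spectra.
    print(roc_auc([0, 0, 1, 1], [0.1, 0.4, 0.35, 0.8]))
```

A classifier that assigns the same score to every spectrum obtains an AUC of 0.5 under this definition, which is the random-guessing baseline referred to above.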


1.4 Limitations

Here, we address some limitations to the scope of our work. We do not explain or discuss how exactly the biological samples are collected. Typically, the samples are proteins from yeast, mouse, wheat, or human. If a sample is from a human, it can come from a patient’s blood or stool. Then trypsin⁶, an active and stable protease, is used to digest the proteins into peptides. Even though the ultimate goal is to identify proteins, we only cover the identification of peptides, mainly because peptides are chemically easier to handle than proteins.

We only consider chimeric spectra with two generating peptide sequences and do not consider so-called peptide modifications as discussed in [35, 41]. Our dataset likely contains some spectra where more than two peptides of similar mass are co-eluted.

In addition, we do not investigate feature selection or engineering via PCA, encoder-decoder networks, or other methods, since the GLEAMS embeddings have already gone through a feature selection and encoding step.

This work uses tandem mass spectrometry data from the MassIVE-KB dataset, a collection of mass spectrometry experiments from different laboratories around the world. The data were acquired with different mass spectrometer instruments, which introduces uncertainties into the data. Besides, the exact experimental workflow in the wet lab differs from scientist to scientist and from lab to lab, which also introduces some uncertainty into the data collection. Last but not least, the physical process of CID is not fully understood and is an open research topic that will not be further explored in this work. We therefore treat the data from the MassIVE-KB dataset as tuples of observed spectra and associated peptides.

Due to the COVID-19 pandemic and unstable internet connections during the thesis, the experiments were performed locally on one machine. We therefore had to limit the amount of data that could be processed. It was not feasible to process the entire MassIVE-KB dataset, so we restrict the dataset to a subset of about 10% (Aug 01, 2020). The experiments were performed on a computer with a Ryzen 9 3900X Central Processing Unit (CPU), an NVIDIA GeForce RTX 2080 Ti Graphics Processing Unit (GPU), and 64 GB of DDR4 RAM with a swap partition of 135 GB on an M.2 SSD.

⁶ https://enzyme.expasy.org/EC/3.4.21.4

1.5 Implementation details

We use scikit-learn version 0.23.1 and PyTorch version 1.4.0 with GPU support, CUDA version 11.0, and NVIDIA driver version 450.102.04 for the experiments with the binary classifiers. GLEAMS uses TensorFlow version 1.14.0 with GPU support.

1.6 Outline

In the following, chapter 2 provides the more detailed background information necessary for understanding the data and the problem. Then, in chapter 3, we discuss the methods used to solve the problem, and in chapter 4, we explain the experiments. The results are discussed in chapter 5. In chapter 6, we summarize the results of the work and outline future work.


Background

This chapter consists of four sections. We introduce the terminology in section 2.1. In particular, we introduce amino acids, peptides, fragments, spectra, GLEAMS pairs, and chimeras. Furthermore, we introduce the basic operation of a database search tool for peptide identification, using Crux as an example in section 2.2, followed by section 2.3, where details of the MassIVE-KB dataset used are given. In section 2.4 we describe details of GLEAMS, the embedding algorithm we are investigating in terms of its ability to capture the chimericness of a spectrum.

2.1 Mass Spectrometry, Notation and Terminology

Let’s start with some definitions. We define the following terms:

Definition 2.1.1. (amino acids)

An amino acid alphabet (aaa) is represented by a set $A = \{G, A, V, I, L, F, Y, W, K, R, H, D, E, N, Q, C, M, S, T, P\}$, where we call $x \in A$ an amino acid.

Remark 2.1.2. We assume that the elements of the amino acid alphabet (aaa) are readily distinguishable by chemical and physical properties.


The definition of amino acids leads directly to the peptide definition, where amino acids are strung together.

Definition 2.1.3. (Peptides)

A peptide $P$ is a finite concatenation of amino acids $p_i \in A$. That is, it is a finite family $P = (p_i)_{i \in I} \in A^n$ (or equivalently an element of a finite Cartesian product) of the amino acid alphabet with a finite index set $I = \{i : 1 \le i \le n\} \subsetneq \mathbb{N}$, $n \in \mathbb{N}$.

The peptide has length $|P| = n \in \mathbb{N}$ if $P \in A^n$. The universe of peptides is defined as $U_P = \bigcup_{n \in \mathbb{N}} A^n$, where $A$ is the aaa and $n \in \mathbb{N}$. It contains all hypothetically possible peptides of arbitrary length.

Remark 2.1.4. Note that for $|P| = n$ in the above definition, we should require $2 \le n \le 50$. Otherwise, $P$ is not a peptide according to the biological definition of peptides. In the following, we use the terms peptide, peptide sequence, and amino (acid) sequence interchangeably, as is common in the literature.

Since peptides are chained amino acids, their mass is the sum of the masses of the chained amino acids. It plays a crucial role in the peptide identification process, and we define it below:

Definition 2.1.5. (Peptide mass)

The mass $m_P \in \mathbb{R}$ of a peptide $P \in U_P$ is given by the sum of the masses of its amino acids, that is, $m_P = \sum_{k=1}^{n} \mathrm{mass}(p_k)$, where $|P| = n$ and $p_k \in P$. The unit is the dalton [Da], the unified atomic mass unit.

How do peptides get into the mass spectrometer?

Peptides generated by protein digestion (usually with the protease trypsin), for example, are eluted into the mass spectrometer via a microcapillary high-performance liquid chromatography (HPLC) column. At the end of the HPLC column, they pass through a very thin needle of 50-150 µm in which the solution is protonated and evaporated. A voltage of 2-6 kV between the electrospray needle and the mass spectrometer accelerates the ions into the mass spectrometer. For more details, please refer to [7, 27]. Tryptic peptides are usually doubly protonated, leading to the following definition.

Definition 2.1.6. (Peptide ion)

A peptide ion $P^+$ is a positively charged peptide $P \in U_P$. It is given by the tuple $(P, z) = P^+ \in U_P \times \mathbb{N}$, where $z \in \mathbb{N}$ is called the precursor charge (state).

In the following, let $(P, z)$ be a peptide ion for $P \in U_P$, $z \in \mathbb{N}$. For the next section, we also introduce the following definitions:

Definition 2.1.7. (Precursor ion mass)

The mass of the precursor ion is given by: $m_{P^+} = m_P + z \cdot 1.0073 \in \mathbb{R}$. The factor 1.0073 corresponds to the proton mass in Da.

Definition 2.1.8. (Precursor m/z)

The precursor m/z is given by: $m_{P^+}/z = \frac{m_{P^+}}{z} \in \mathbb{R}$.
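Definitions 2.1.5, 2.1.7, and 2.1.8 can be traced with a short numerical sketch. The definition leaves the lookup $\mathrm{mass}(p_k)$ abstract; the table below assumes standard monoisotopic residue masses. With that assumption, a terminal water of about 18.011 Da would have to be added to obtain the mass of the intact neutral peptide; the sketch omits it so that it follows Definition 2.1.5 literally. The function names are hypothetical.

```python
# Assumed lookup for mass(p_k): standard monoisotopic residue masses in Da.
RESIDUE_MASS = {
    "G": 57.02146, "A": 71.03711, "S": 87.03203, "P": 97.05276,
    "V": 99.06841, "T": 101.04768, "C": 103.00919, "L": 113.08406,
    "I": 113.08406, "N": 114.04293, "D": 115.02694, "Q": 128.05858,
    "K": 128.09496, "E": 129.04259, "M": 131.04049, "H": 137.05891,
    "F": 147.06841, "R": 156.10111, "Y": 163.06333, "W": 186.07931,
}
PROTON_MASS = 1.0073  # Da, as in Definition 2.1.7


def peptide_mass(peptide: str) -> float:
    """Definition 2.1.5: m_P as the sum of the amino acid masses."""
    return sum(RESIDUE_MASS[aa] for aa in peptide)


def precursor_ion_mass(peptide: str, z: int) -> float:
    """Definition 2.1.7: m_{P+} = m_P + z * 1.0073."""
    return peptide_mass(peptide) + z * PROTON_MASS


def precursor_mz(peptide: str, z: int) -> float:
    """Definition 2.1.8: precursor m/z = m_{P+} / z."""
    return precursor_ion_mass(peptide, z) / z


if __name__ == "__main__":
    for charge in (1, 2, 3):
        print(charge, round(precursor_mz("GAMER", charge), 4))
```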

Definition 2.1.9. (Precursor spectrum)

Let $\{(P_1, z_1), (P_2, z_2), \ldots, (P_m, z_m)\} \in (U_P \times \mathbb{N})^m$ be a set of peptide ions with peptides $P_i$, $i \in \{1, 2, \ldots, m\}$, $m \in \mathbb{N}$, generated by protein digestion. A precursor spectrum or $MS^1$ spectrum $S_1$ is given by:

$S_1 = \{(I_1, m_{P_1^+}/z_1), (I_2, m_{P_2^+}/z_2), \ldots, (I_m, m_{P_m^+}/z_m)\}$, where $I_i \in \mathbb{N}$ are the abundances of the ions $P_i^+ = (P_i, z_i)$ for all $i$. The elements $s_i \in S_1$ are called peaks and they are sorted in ascending order by their mass-to-charge ratio.

What happens inside the mass spectrometer?

The positive peptide ions $P^+$ enter the mass spectrometer. Depending on the instrument, the mass-to-charge ratio of the peptide ion is estimated differently. A Fourier transform ion cyclotron resonance instrument uses the alternating current induced by ions orbiting at frequencies inversely proportional to their m/z values. A time-of-flight (TOF) mass spectrometer measures the time it takes for ions to travel through an electric field-free tube. All ions are accelerated to the same kinetic energy. Since the kinetic energy $E_{kin} = \frac{1}{2} m_{P^+} v_{P^+}^2$ is a function of the peptide ion mass $m_{P^+}$, the heavier ions fly at a lower velocity $v_{P^+}$ than the lighter ones and therefore reach the detector later. Depending on the electric field, the ions can be filtered with respect to their m/z value, and the instruments measure an intensity signal that produces a peak on the mass spectrum.
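The time-of-flight argument follows directly from the kinetic energy formula: solving $E_{kin} = \frac{1}{2} m v^2$ for the velocity gives $v = \sqrt{2 E_{kin}/m}$, so an ion needs the time $t = L/v$ to cross a field-free tube of length $L$. The sketch below evaluates this in SI units. The function name and the numeric values are illustrative assumptions.

```python
import math


def flight_time(mass_kg: float, kinetic_energy_j: float, tube_length_m: float) -> float:
    """Time to cross a field-free tube: t = L / v with v = sqrt(2 * E / m)."""
    velocity = math.sqrt(2.0 * kinetic_energy_j / mass_kg)
    return tube_length_m / velocity


if __name__ == "__main__":
    # Two hypothetical ions with the same kinetic energy; the second is
    # four times heavier and therefore takes twice as long.
    dalton_kg = 1.66053906660e-27
    energy_j = 1.0e-15
    t_light = flight_time(500.0 * dalton_kg, energy_j, 1.0)
    t_heavy = flight_time(2000.0 * dalton_kg, energy_j, 1.0)
    print(t_light, t_heavy, t_heavy / t_light)
```

Because the flight time grows with the square root of the mass at fixed kinetic energy, arrival times can be converted back into masses.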

The existence of the stable ¹³C carbon isotope¹ ensures that each peptide signal consists of an isotopic cluster of peaks separated by 1 dalton (Da). Let $x_{P^+}$ [Da] $\in \mathbb{R}$ be the m/z value of a peptide ion $P^+$ measured by the instrument.

Then $\frac{m_P + z_{P^+} \times 1.0073}{z_{P^+}}$ [Da] $= x_{P^+}$ [Da], where 1.0073 Da is the mass of a proton (remember, the peptide P was protonated), $m_P$ [Da] $\in \mathbb{R}$ is the mass of the peptide, and $z_{P^+} \in \mathbb{N}$ is the charge of the peptide ion. Let us now consider a main peak at $x^{(m)}$ [Da], generated by the peptide ions $P^+$ containing only the usual ¹²C carbons. To the right of this peak (due to the higher mass of the ¹³C isotope), we observe, with lower intensity, a side peak at $x^{(S)}$ [Da] $\neq x^{(m)}$ [Da], which stems from all peptide ions $P^+$ containing an amino acid $p_k$ with one ¹³C carbon. The m/z difference of the two peaks is given by $\Delta x$ [Da] $= |x^{(S)} - x^{(m)}|$ [Da]. The ionic charge then follows from $\Delta x$ [Da] $= \frac{1\,[\text{Da}]}{z_{P^+}}$. Once we know the ionic charge, we can calculate the peptide mass $m_P$.
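The charge and mass determination described above can be sketched as follows. The sketch uses the isotope spacing of 1 Da and the proton mass of 1.0073 Da from the text; the function names and the two peak positions are hypothetical.

```python
PROTON_MASS = 1.0073  # Da, value used in the text
ISOTOPE_SPACING = 1.0  # Da, mass difference between neighbouring isotope peaks


def charge_from_isotope_peaks(main_mz: float, side_mz: float) -> int:
    """Solve delta_x = 1 Da / z for z, rounded to the nearest integer."""
    delta = abs(side_mz - main_mz)
    if delta == 0:
        raise ValueError("main peak and side peak must differ in m/z")
    return round(ISOTOPE_SPACING / delta)


def peptide_mass(main_mz: float, charge: int) -> float:
    """Invert x = (m_P + z * 1.0073) / z to recover the peptide mass m_P."""
    return charge * main_mz - charge * PROTON_MASS


# Hypothetical isotopic cluster: main peak at 501.0073, side peak 0.5 higher.
z = charge_from_isotope_peaks(501.0073, 501.5073)
print(z, peptide_mass(501.0073, z))
```

A spacing of 0.5 in m/z corresponds to a doubly charged ion, and inserting z = 2 into the m/z equation yields a peptide mass of about 1000 Da.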

This process is called the MS¹ stage, and since there is a second MS stage, the peptide mass is also called the precursor mass, the ion mass is called the precursor ion mass, the ion charge is called the precursor charge, and the m/z value of the ions is called the precursor m/z. In Figure 2.1.1, the complete MS/MS workflow is shown up to the data analysis part presented in section 2.2.

Remark 2.1.10. The terms precursor mass and precursor m/z are often used interchangeably in the literature when referring to the m/z of the peptide ion. This can be confusing because the precursor mass is actually the mass of the peptide. However, since the mass spectrometer outputs m/z values, one usually means the precursor m/z, whether it is correctly called precursor m/z or incorrectly called precursor mass.

¹ Approximately 1% of all carbons are ¹³C isotopes according to [63].

Figure 2.1.1: MS/MS experiment workflow from sample preparation to data analysis [63]. See text for details.

After all m/z values and intensity values in the precursor spectrum of the molecules coming from the HPLC have been determined, a second stage, MS², is run. A particular ion is isolated by its m/z value. Then energy is introduced by collisions with an inert gas, which causes the peptide to break apart at its peptide (amide) bonds. The process is called collision-induced dissociation (CID), and the technical basis can be found in Zhang [74]. The fragments produced by the CID process are called b- and y-fragments, depending on where the charge is retained after the breakage.

Doubly charged tryptic peptides yield mainly singly charged y- and b-ions. In the following, we introduce a definition of ion fragments.

Definition 2.1.11. (Fragment) A fragment $f_i(P^+) = \left( \times_{k \in \tilde{I} : |\tilde{I}| = i} \{p_k\}, \tilde{z} \right) = (\tilde{P}, \tilde{z})$ is a piece of a peptide ion $P^+ = (P, z)$ with ionic charge $z \in \mathbb{N}$ and fragment charge $\tilde{z} \in \mathbb{N}$, $\tilde{z} \leq z$, where $P = (p_k)_{k \in I} \in U_P$, $I \subset \mathbb{N}$ is a finite index set as defined in Definition 2.1.3, and $\tilde{I} \subset I$. We distinguish b- and y-fragments.

If the charge is retained at the N-terminus of the peptide, the piece is called a b-fragment. If the charge is retained at the OH-terminus (C-terminus), it is called a y-fragment.

Depending on which carbon-nitrogen bond (peptide bond) of the peptide ion has broken, we also designate the fragment with an index $i \in \mathbb{N}$. We call it a $b_i$-fragment if the piece includes the N-terminus and consists of $2i$ backbone carbons.

In this case, $\tilde{I} = \{k : k \in I \wedge k \leq i\}$.

We call it a $y_i$-fragment if the piece includes the OH-terminus and consists of $2i$ backbone carbons. In this case, $\tilde{I} = I \setminus \{k : k \in I \wedge k \leq |P| - i\}$.

Remark 2.1.12. In the literature, it is common to speak of the fragments of a peptide instead of correctly speaking of fragments of a peptide ion. When we say b- or y-fragment of the peptide P, we actually mean the b- or y-fragment of the peptide ion P⁺ = (P, z). Further, when we speak of the precursor charge of a peptide P, we actually mean the charge of the ionized peptide P, which is referred to as P⁺.

An illustration of a fragment annotation can be found in Figure 1.1.5 on page 7. Fragment annotation is often referred to as ion annotation because the fragments are protonated. In the following, we introduce the term fragmentation space, which will be crucial for assigning the amino acid sequence to an unidentified peptide.

Definition 2.1.13. (fragmentation space)

The fragment set of a peptide $P \in U_P$ is given by $\mathcal{F}(P) = \{b_i((P, z)), y_i((P, z)) : z \in \mathbb{N}, i \leq |P|\} \subset U_P \times \mathbb{N}$, and it contains all theoretically possible fragments of the peptide. Furthermore, we introduce the fragmentation space $\phi(P)$ of a peptide P, which is the set of all fragmentation patterns F of P. Thus, $\phi(P) = \{F : F \subseteq \mathcal{F}(P)\}$.
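The index sets of the b- and y-fragments and a finite slice of the fragment set can be sketched as follows. The sketch represents a peptide as a string of one-letter amino acid codes and, as a simplifying assumption, restricts the charge to the values 1 to max_charge; all function names and the example sequence are hypothetical.

```python
def b_fragment(peptide: str, i: int) -> str:
    """b_i: residues with index k <= i, i.e. the N-terminal prefix of length i."""
    if not 1 <= i <= len(peptide):
        raise ValueError("the index i must satisfy 1 <= i <= |P|")
    return peptide[:i]


def y_fragment(peptide: str, i: int) -> str:
    """y_i: residues with index k > |P| - i, i.e. the C-terminal suffix of length i."""
    if not 1 <= i <= len(peptide):
        raise ValueError("the index i must satisfy 1 <= i <= |P|")
    return peptide[len(peptide) - i:]


def fragment_set(peptide: str, max_charge: int) -> set:
    """All (series, index, sequence, charge) tuples for charges 1..max_charge."""
    fragments = set()
    for charge in range(1, max_charge + 1):
        for i in range(1, len(peptide) + 1):
            fragments.add(("b", i, b_fragment(peptide, i), charge))
            fragments.add(("y", i, y_fragment(peptide, i), charge))
    return fragments


print(b_fragment("PEPTIDE", 3), y_fragment("PEPTIDE", 3))
print(len(fragment_set("PEPTIDE", 2)))
```

A fragmentation pattern in the sense of the definition above is then any subset of the returned set.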

Notation 2.1.14. Given the inherently nondeterministic nature of a fragmentation pattern generated by the CID process, we assume that a fragmentation pattern is generated randomly at any given time. In particular, for the rest of the thesis, whenever we write F ∈ ϕ(P), it is understood as simplified notation for F(ω) ∈ ϕ(P), where F is a random variable over a suitable probability space (Ω, B(Ω), Pr).


The next step is crucial to understand what kind of data we are dealing with.

After the MS¹ stage, a second stage is processed, and the resulting spectra are used to assign an amino acid sequence to the biomolecules formed after protein digestion. However, before presenting the MS² spectrum, we state the following lemma.

Lemma 2.1.15. Let $P^+ = (P, z)$ be a precursor ion and $F \in \phi(P)$ be a corresponding fragmentation pattern after the CID process. For each $f_i \in F$, there are $\tilde{P} \in A^n$ with $n \leq |P|$ and $\tilde{z} \in \mathbb{N}$ such that $(\tilde{P}, \tilde{z}) = f_i$. In particular, the fragment mass $m_{f_i} = m_{\tilde{P}} + \tilde{z} \cdot 1.0073 = m_{\tilde{P}^+}$ is well defined.

Lemma 2.1.15 shows that the fragments start sequentially from the N-terminus or the OH-terminus and form shorter peptide ions generated from the precursor ion. Therefore, the MS² spectrum definition is very similar to Definition 2.1.9.

Definition 2.1.16. (MS² spectrum)

For any precursor ion $P^+ = (P, z)$ and a corresponding fragmentation pattern $F \in \phi(P)$ after the CID process, an MS/MS (or MS²) spectrum $S_2$ of $F$ is given by $S_2(F) = \{(I_i, \frac{m_{f_i}}{z_i}) : f_i \in F\}$, where $I_i = I(f_i, \omega') \in \mathbb{N}$ are the abundances, also called intensities, of the corresponding fragments $f_i$. Here, $\omega' \in \Omega'$ for some suitable probability space $(\Omega', B(\Omega'), Pr)$. The elements $s_i \in S_2(F)$ are called peaks, and they are sorted in ascending order by their mass-to-charge ratio, which corresponds to the second coordinate of a peak.

Remark 2.1.17.

1. A fragmentation pattern $F \in \phi(P)$ of a peptide P is also called a theoretical spectrum. It can be understood as a spectrum $\tilde{S}(F)$ without given intensity values for the peaks $\tilde{s}_i \in \tilde{S}$.

2. The randomness within Definition 2.1.16 reflects, in particular, the inherent inaccuracy and uncertainty of the measurement process.


In the following, we will use the term spectrum only when an MS² spectrum is meant, use S as notation, and refer to spectra as S_i with an index i when we want to distinguish different MS² spectra. Two examples of spectra are given in Figure 1.1.4 and Figure 1.1.6. We also introduce the terms positive pairs and negative pairs, which will be needed frequently throughout the thesis.

Definition 2.1.18. (GLEAMS pairs)

A (GLEAMS) pair is a tuple $((S_1, P_1^+), (S_2, P_2^+))$ of two spectra $S_1 = S(F_1)$, $S_2 = S(F_2)$ of $P_1^+ = (P_1, z_1)$, $P_2^+ = (P_2, z_2)$, respectively, for which the following holds:

1. The two generating peptide ions $P_1^+$, $P_2^+$ of the spectra $S_1$, $S_2$, respectively, have the same precursor charge $z_1 = z_2$.

2. $(m/z)_1 = \frac{m_{(P_1, z_1)}}{z_1} = \frac{m_{(P_2, z_2)}}{z_2} = (m/z)_2$. This means the two generating peptide ions of the spectra have the same precursor m/z.

We also write $(S_1, S_2)$ for a GLEAMS pair.

Definition 2.1.19. (positive pairs)

A positive pair is a tuple $(S_1, S_2)$ of two spectra $S_1 = S(F_1)$, $S_2 = S(F_2)$ of peptide ions $P_1^+$, $P_2^+$ for which the following holds:

1. The tuple is a GLEAMS pair.

2. For the peptides $P_1 \in P_1^+$, $P_2 \in P_2^+$ with $F_1 \in \phi(P_1)$ and $F_2 \in \phi(P_2)$, we have $P_1 = P_2$.

Definition 2.1.20. (negative pairs)

A negative pair is a tuple $(S_1, S_2)$ of two spectra $S_1 = S(F_1)$, $S_2 = S(F_2)$ of peptide ions $P_1^+$, $P_2^+$ for which the following holds:

1. The tuple is a GLEAMS pair.

2. For the peptides $P_1 \in P_1^+$, $P_2 \in P_2^+$ with $F_1 \in \phi(P_1)$ and $F_2 \in \phi(P_2)$, we have $P_1 \neq P_2$.

Lemma 2.1.21. A GLEAMS pair is either a positive pair or a negative pair.
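The case distinction behind Definitions 2.1.18 to 2.1.20 and Lemma 2.1.21 can be sketched as follows. The record type, the function names, and the relative tolerance used for comparing floating-point m/z values are assumptions made for illustration.

```python
import math
from typing import NamedTuple, Optional


class PeptideIon(NamedTuple):
    peptide: str  # amino acid sequence
    charge: int  # precursor charge z
    mz: float  # precursor m/z


def is_gleams_pair(a: PeptideIon, b: PeptideIon, rel_tol: float = 1e-9) -> bool:
    """Same precursor charge and same precursor m/z (up to rel_tol)."""
    return a.charge == b.charge and math.isclose(a.mz, b.mz, rel_tol=rel_tol)


def classify_pair(a: PeptideIon, b: PeptideIon, rel_tol: float = 1e-9) -> Optional[str]:
    """Return 'positive', 'negative', or None if (a, b) is not a GLEAMS pair."""
    if not is_gleams_pair(a, b, rel_tol):
        return None
    return "positive" if a.peptide == b.peptide else "negative"


print(classify_pair(PeptideIon("PEPTIDE", 2, 400.69), PeptideIon("PEPTIDE", 2, 400.69)))
print(classify_pair(PeptideIon("PEPTIDE", 2, 400.69), PeptideIon("PEPTIED", 2, 400.69)))
```

Every GLEAMS pair falls into exactly one of the two branches because the peptides are either equal or different.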


Finally, we can introduce and formalize chimeric spectra in the next definition.

Definition 2.1.22. (chimera)

Let $P_1^+ = (P_1, z_1)$, $P_2^+ = (P_2, z_2)$ be two different peptide ions for which the following conditions hold:

1. $P_1 \neq P_2$,

2. $z_1 = z_2$,

3. $m_{P_1} = m_{P_2}$.

Furthermore, let their fragmentation patterns $F_1 \in \phi(P_1)$, $F_2 \in \phi(P_2)$ be given, with corresponding spectra $S_1 = S(F_1)$, $S_2 = S(F_2)$. Then the chimeric spectrum or chimera $S_c$ is defined as the union of $S_1$ and $S_2$:

$S_c = S_1 \cup S_2$ (2.1)

We call $S_1$, $S_2$ generating partners, or generators for short, of the chimera $S_c$.

Remark 2.1.23.

1. Note that in practice, more than two peptide ions can coelute and fragment at the same time. For future work, the definition can therefore be generalized to a finite family of peptide ions by changing Equation 2.1 to $S_c = \bigcup_{i=1}^{q} S(F_i)$, where q is the number of peptide ions that coeluted and cofragmented during the CID process.

2. Another model improvement would be to require the following: if there are peaks $s_i = (I_i, \frac{m_{f_i}}{z_i}) \in S_b$ and $s_j = (I_j, \frac{m_{f_j}}{z_j}) \in S_a$ for $j \neq i$ and $S_a, S_b \subset S_c$ with $\frac{m_{f_j}}{z_j} = \frac{m_{f_i}}{z_i}$, then we remove the peaks $s_i$, $s_j$ from $S_c$ and add the peak $(I_i + I_j, \frac{m_{f_j}}{z_j})$ to $S_c$.

In this work, we only consider chimeras generated by two peptides, and we only join the peaks of the spectra.
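The union in Equation 2.1 and the optional peak merging from Remark 2.1.23 can be sketched as follows. Peaks are modelled as (intensity, m/z) tuples; the function name and the example peaks are hypothetical, and the merging step assumes exactly equal m/z values, whereas measured data would need a tolerance or binning.

```python
from collections import defaultdict
from typing import Iterable, List, Tuple

Peak = Tuple[float, float]  # (intensity, m/z)


def chimera(s1: Iterable[Peak], s2: Iterable[Peak],
            merge_equal_mz: bool = False) -> List[Peak]:
    """Union of two spectra, sorted by m/z.

    With merge_equal_mz=True, peaks sharing the same m/z are replaced by a
    single peak whose intensity is the sum of the individual intensities.
    """
    if not merge_equal_mz:
        peaks = set(s1) | set(s2)
    else:
        summed = defaultdict(float)
        for intensity, mz in list(s1) + list(s2):
            summed[mz] += intensity
        peaks = {(intensity, mz) for mz, intensity in summed.items()}
    return sorted(peaks, key=lambda peak: (peak[1], peak[0]))


s1 = [(10.0, 100.0), (5.0, 200.0)]
s2 = [(3.0, 200.0), (7.0, 300.0)]
print(chimera(s1, s2))
print(chimera(s1, s2, merge_equal_mz=True))
```

The plain union keeps both peaks at m/z 200, while the merged variant reports a single peak with the summed intensity.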


Lemma 2.1.24. The enumerated conditions in Definition 2.1.22 imply that (S₁, S₂) from Equation 2.1 is a negative pair.

In the next section, section 2.2, we present the data analysis part of a typical MS/MS workflow, which is the last step in the workflow diagram in Figure 2.1.1 on page 18, and explain how database search tools use spectra to identify the amino acid sequence of a corresponding peptide.

2.2 Identifying peptide sequences with database search tools

Consider a database D = {(S_i, P_i) : i ∈ M ⊂ N} containing peptide-spectrum matches (PSMs). Let us further assume that S is a spectrum for which we do not yet know the corresponding peptide sequence P. We make the general assumption that there is a peptide P ∈ UP and an F ∈ ϕ(P) such that S(F) = S.

Then it is useful to compare the unidentified spectrum S with the known spectra S_i from the database D to find the corresponding peptide P for S. The first algorithm that implemented this idea is called SEQUEST [16], which was later reimplemented in Crux [46].

After retrieving all possible matching peptide sequences $P_{MPS} = \{P_1, P_2, \ldots, P_m\} \subset U_P$, that is, all sequences that have the same precursor mass as the unidentified peptide P of the isolated peptide ion $(P, z) = P^+$ (mathematically, $\forall P_i \in P_{MPS} : m_{P_i} = m_P$), a preliminary similarity score $S_P$ is calculated. Then, the peptides with the 500 best scores are reordered using a cross-correlation score $X_{corr}$.

How to estimate the matching peptide for a given unidentified spectrum

The fragmentation pattern $F_S$ of S is given by the second component of the peaks $s_i \in S$. Furthermore, let $F_P = \{F : F \in \phi(P_i) \text{ for } P_i \in P_{MPS}\}$ be the set of all fragmentation patterns of peptides in $P_{MPS}$.

Remark 2.2.1. The fragmentation patterns can be derived from a database or constructed using the de novo sequencing method, which takes a peptide and a precursor charge as input and outputs a theoretical fragmentation pattern.

The annotation package pepfrag, which we use, computes the annotations with the de novo method. Details of the de novo method can be found in Matthiesen et al. [42]. The terms theoretical spectra and theoretical (de novo) fragmentation (patterns) are used interchangeably. It is important to note that a theoretical spectrum has arbitrary or, in particular, no intensities for the peaks.

The scoring values are calculated for each $F \in F_P$ by

$S_P(F_S, F) = \frac{\left(\sum_{i=1}^{|F_S|} \frac{m_{f_i}}{z_i} \delta_{f_i}(F)\right) \left(\sum_{i=1}^{|F_S|} \delta_{f_i}(F)\right) (1 + 0.075 \max(A))}{|F|}$, (2.2)

and then sorted in descending order, where $f_i \in F_S$,

$A = \{\mu(B_{j_0,j}) \mid B_{j_0,j} = \{f_i \mid \forall f_i \in F \text{ for } i \in \{j_0, \ldots, j\} : f_i \in F_S\} \text{ for } j_0, j \in \{1, \ldots, |F|\} \text{ and } j_0 \leq j\}$,

$\mu$ is the counting measure, and $\delta_{f_i}$ is the Dirac measure centred at $f_i$ on the measure space $(F, \phi(P))$. Here, $\max(A)$ represents the maximum number of consecutive b- or y-ions from the theoretical spectrum F that appear in the fragmentation pattern $F_S$ of the unidentified spectrum S. For more details, especially on the calculation of the cross-correlation score, please refer to Diament and Noble [11].
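Equation 2.2 can be sketched in a few lines. The sketch represents both fragmentation patterns as lists of fragment m/z values, assumes that the theoretical list is ordered by fragment index within one ion series, and matches fragments within an absolute tolerance. The function names, the tolerance of 0.5, and the example values are assumptions for illustration and are not taken from SEQUEST or Crux.

```python
from typing import Sequence


def _matches(mz: float, others: Sequence[float], tol: float) -> bool:
    """True if some value in `others` lies within `tol` of `mz`."""
    return any(abs(mz - other) <= tol for other in others)


def preliminary_score(observed: Sequence[float], theoretical: Sequence[float],
                      tol: float = 0.5) -> float:
    """Preliminary score S_P of Equation 2.2 for one theoretical pattern."""
    if not theoretical:
        raise ValueError("the theoretical fragmentation pattern must not be empty")
    # Observed fragments that also occur in the theoretical pattern.
    matched = [mz for mz in observed if _matches(mz, theoretical, tol)]
    # max(A): longest run of consecutive theoretical fragments that were observed.
    longest = run = 0
    for mz in theoretical:
        run = run + 1 if _matches(mz, observed, tol) else 0
        longest = max(longest, run)
    return sum(matched) * len(matched) * (1 + 0.075 * longest) / len(theoretical)


observed = [100.0, 200.0, 300.0, 999.0]
theoretical = [100.0, 200.0, 300.0, 400.0, 500.0]
print(preliminary_score(observed, theoretical))
```

In the example, three observed fragments match, their m/z values sum to 600, and the longest consecutive run has length 3, which gives 600 · 3 · 1.225 / 5.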

For the following work, it is important that the peptide P with the highest cross-correlation score among the preliminary top 500 is assigned as the corresponding sequence to the given spectrum S. This is the basic idea behind any database search-based peptide identification tool: the unidentified spectrum is compared to spectra from a database via at least one scoring function that measures similarity. The peptide sequence of the highest-scoring identified spectrum from the database is assigned as the matching peptide for the unidentified spectrum.

We describe the data we work with in the next section.

2.3 The MassIVE-KB dataset

The MassIVE-KB dataset is a community-wide, continuously updated knowledge base that aggregates proteomics mass spectrometry discoveries in an open, reusable format with complete provenance information for community review. In 2018, more than 31 TB were aggregated, containing more than 2.1 million precursors of 19,610 proteins. 55% of all human proteome amino acids are covered, representing 6 million amino acids. The data covers a wide variety of tissues, cell types, and experimental conditions. In addition to the usual controls, additional statistical controls are performed to avoid an uncontrolled accumulation of false discoveries. For each identified spectrum, metadata is provided with information on the origin, precursor attributes such as charge and m/z, and details on the search procedure used to derive the results. The dataset includes HCD and CID data. In total, it includes 658 million tandem mass spectrometry (MS²) spectra. The false discovery rate (FDR) in MassIVE-KB is initially controlled at the level of individual dataset searches (1% peptide-spectrum match (PSM)-level FDR estimated by the target-decoy approach (TDA) [15]). In our randomly chosen subset, charge states range from 2 to 8 and amino acid sequence lengths from 6 to 40. The m/z values of the precursors range from 300.8 to 1493.21.

For all technical information, please refer to Wang et al. [69].

In the following work, we will consider the data in the MassIVE-KB library as tuples (S, P⁺), where S = S(F) for an F ∈ ϕ(P) is the observed spectrum, P ∈ P⁺ is the corresponding peptide, and z ∈ P⁺ is the corresponding precursor charge.

In the next section, the mass spectrometry embedding algorithm GLEAMS is presented.

2.4 GLEAMS

GLEAMS is a mass spectrometry embedding algorithm and, like SEQUEST and Crux, belongs to the data analysis part of a typical mass spectrometry workflow, as shown in Figure 2.1.1 on page 18. GLEAMS computes an embedding of identified spectra with respect to their peptides in an iterative supervised manner, resulting in a clustering of spectra. An unidentified spectrum can then be identified by calculating the corresponding embedding and then checking to which cluster it is closest. The corresponding peptide is then that of the cluster.
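The identification step can be sketched as a nearest-cluster lookup. The sketch assumes that each cluster is summarized by a centroid in the embedding space and that closeness is measured by the Euclidean distance. The function name, the two-dimensional toy embeddings, and the peptide labels are hypothetical and do not reflect the GLEAMS implementation.

```python
import math
from typing import Dict, Sequence


def nearest_cluster(embedding: Sequence[float],
                    centroids: Dict[str, Sequence[float]]) -> str:
    """Return the peptide whose cluster centroid is closest to the embedding."""
    if not centroids:
        raise ValueError("at least one cluster centroid is required")
    return min(centroids, key=lambda peptide: math.dist(embedding, centroids[peptide]))


# Toy centroids of two clusters of identified spectra (2 dimensions for brevity).
centroids = {"PEPTIDE": [0.0, 0.0], "SEQUENCE": [10.0, 10.0]}
print(nearest_cluster([1.0, 1.5], centroids))
```

In practice, the embeddings have as many dimensions as the output layer of the encoder, and other notions of cluster distance are conceivable.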

Below we provide some details on the algorithm input, neural network structure, and training as background for further work. More about GLEAMS as a method can be found in chapter 3. For the three following subsections, we cite technically relevant details from the original work [43].

2.4.1 Encoding mass spectra for GLEAMS input

Each spectrum is encoded as a vector of 3056 features of three types: fragment intensities, dot-product similarities with a set of reference spectra, and precursor attributes. Precursor attributes are precursor charge, mass, and m/z.

Precursor mass and precursor m/z are encoded in a 27-bit binary encoding. The charge states are natural numbers between one and eight. This gives a total of 61 precursor features.

The fragment intensities are encoded as 2449 features. They are rescaled, and if they fall outside a certain range, the corresponding fragment is discarded.

Each spectrum’s similarities to an invariant (chosen once at random and then used for all spectra) set of reference spectra are encoded as 500 features.


For more technical details, please refer to May et al. [43].

2.4.2 Neural Network Structure of GLEAMS

The precursor features are processed by a two-layer fully connected network with layer dimensions 32 and 5. The fragment intensity and reference spectra features are each passed through a separate single-layer convolutional neural network with 30 filters, a kernel size of 3, and a stride length of 1, followed by max-pooling. The outputs are concatenated and passed to a final fully connected layer with 32 dimensions. A scaled exponential linear unit (SELU) activation precedes all layers. For more details, please refer to May et al. [43].

2.4.3 Training of GLEAMS

Two instances of the encoder with shared weights are constructed to form a “Siamese network”. Pairs of identified spectra $(S_a, S_b)$ with the same charge state and precursor m/z are fed into the encoder. The Euclidean distance between the resulting embeddings $E_a$, $E_b$ is calculated, and a contrastive loss function is minimized. The loss function is discussed in detail in chapter 3. It pushes apart pairs with different peptide sequences and pulls together pairs with the same peptide sequence.
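The push and pull behaviour of a contrastive loss can be sketched as follows. The sketch uses a commonly used form of the contrastive loss with a margin, which is an assumption here; the exact loss used for GLEAMS is the one discussed in chapter 3. The function name, the margin of 1.0, and the toy embeddings are hypothetical.

```python
import math
from typing import Sequence


def contrastive_loss(e_a: Sequence[float], e_b: Sequence[float],
                     same_peptide: bool, margin: float = 1.0) -> float:
    """Contrastive loss for one pair of embeddings (assumed common form).

    Positive pairs are penalized by their squared distance (pulled together);
    negative pairs are penalized only while closer than `margin` (pushed apart).
    """
    distance = math.dist(e_a, e_b)
    if same_peptide:
        return distance ** 2
    return max(0.0, margin - distance) ** 2


print(contrastive_loss([0.0, 0.0], [3.0, 4.0], same_peptide=True))
print(contrastive_loss([0.0, 0.0], [0.0, 0.5], same_peptide=False))
print(contrastive_loss([0.0, 0.0], [3.0, 4.0], same_peptide=False))
```

A negative pair that is already farther apart than the margin contributes no loss, so the optimization focuses on negative pairs that are still too close.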

GLEAMS was trained on 100 million positive and 100 million negative pairs found in the MassIVE-KB library. Note that the MassIVE-KB data has a high resolution with respect to m/z values; therefore, a precursor m/z tolerance of 10 ppm is used as default. If the absolute difference in precursor m/z values of two ions is within 10 ppm, the ions are considered to have the same precursor m/z. Throughout this work, we examine a pretrained version of GLEAMS and keep the model fixed, so no changes to the model are made hereinafter.
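The tolerance check can be sketched as follows. The sketch assumes that the ppm difference is taken relative to the second m/z value; which of the two values serves as the reference is an assumption, and the function names and example values are hypothetical.

```python
def ppm_difference(mz_a: float, mz_b: float) -> float:
    """Absolute m/z difference in parts per million, relative to mz_b."""
    return abs(mz_a - mz_b) / mz_b * 1e6


def same_precursor_mz(mz_a: float, mz_b: float, tol_ppm: float = 10.0) -> bool:
    """True if the two precursor m/z values agree within the ppm tolerance."""
    return ppm_difference(mz_a, mz_b) <= tol_ppm


print(same_precursor_mz(500.004, 500.0))  # difference of about 8 ppm
print(same_precursor_mz(500.006, 500.0))  # difference of about 12 ppm
```

At an m/z of 500, a tolerance of 10 ppm corresponds to an absolute window of 0.005.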


In this chapter, we introduced background information and terminology on amino acids, peptides, fragments, spectra, chimeras, and GLEAMS pairs. We covered a description of an MS/MS workflow and the basic principle of modern database search tools. Furthermore, we gave details about the MassIVE-KB dataset used. In the next chapter, we will present in more detail the GLEAMS loss function, our approach to data simulation, finding real chimeras, and the machine learning methods we use to study GLEAMS.


Methods

In this chapter, section 3.1 presents the proteomics-related methods needed to answer the question of whether GLEAMS captures the chimericity of an MS/MS spectrum. Then, section 3.2 presents the machine learning-related methods used to solve the binary classification task of distinguishing between GLEAMS embeddings generated by pure spectra and those generated by chimeric spectra.

3.1 Proteomics Methods

In subsection 3.1.1, we present the method used to simulate chimeric spectra, which is necessary due to the lack of real data. In subsection 3.1.2, we describe how we derived a small sample of real chimeras from the used subset of MassIVE-KB. Details of the GLEAMS embedding method, which we examine in terms of its ability to capture the chimericness of a spectrum, are presented in subsection 3.1.3.


3.1.1 Simulation of Chimeric spectra

In the following, let $D = \{(S, P^+) : (S, P^+) \text{ from the MassIVE-KB dataset}\}$ be the MassIVE-KB dataset with the tuples $(S, P^+)$. The set of simulated chimeric spectra is then given by

$CS = \{S_c = \tilde{S}_1 \cup S_2 : \tilde{S}_1 = \{\tilde{s}_i = (\tilde{I}_i, \frac{m_{f_i}}{z_i}) : \tilde{I}_i = \xi I_i \text{ for } \xi \sim U(0.2, 1) \text{ and } (I_i, \frac{m_{f_i}}{z_i}) \in S_1, i = 1, 2, \ldots, |S_1|\}, S_1, S_2 \text{ from } (S_1, P_1^+), (S_2, P_2^+) \in D \text{, respectively, and } (S_1, S_2) \text{ is a negative pair}\}$.

Random rescaling of the peaks’ intensities of S₁ is introduced to achieve a more realistic simulation. In practice, one peptide is more dominant in its occurrence than the other. Moreover, the lower bound of 0.2 ensures that both peptides will occur.

Before the two spectra are merged, we additionally normalize their peaks. This preprocessing is known in the mass spectrometry community as base peak normalization. The following calculation is performed: for each peak $s_i^{(q)}$ we update $I_i^{new} = \frac{I_i}{\max(\{I_i : i \in \{1, 2, \ldots, k\}\})}$ for $q \in \{1, 2\}$.
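The simulation of a single chimeric spectrum can be sketched as follows. The sketch makes two assumptions that the text leaves open: base peak normalization is applied before the random rescaling (otherwise the rescaling would be undone), and one factor ξ is drawn per spectrum rather than per peak. Peaks are (intensity, m/z) tuples; the function names and the example spectra are hypothetical.

```python
import random
from typing import List, Optional, Sequence, Tuple

Peak = Tuple[float, float]  # (intensity, m/z)


def base_peak_normalize(spectrum: Sequence[Peak]) -> List[Peak]:
    """Divide every intensity by the intensity of the most intense peak."""
    base = max(intensity for intensity, _ in spectrum)
    return [(intensity / base, mz) for intensity, mz in spectrum]


def simulate_chimera(s1: Sequence[Peak], s2: Sequence[Peak],
                     rng: Optional[random.Random] = None) -> List[Peak]:
    """Merge two normalized spectra after rescaling the first by xi ~ U(0.2, 1)."""
    rng = rng or random.Random()
    xi = rng.uniform(0.2, 1.0)  # assumption: one factor per spectrum
    scaled_s1 = [(xi * intensity, mz) for intensity, mz in base_peak_normalize(s1)]
    merged = scaled_s1 + base_peak_normalize(s2)
    return sorted(merged, key=lambda peak: peak[1])


s1 = [(50.0, 100.0), (100.0, 300.0)]
s2 = [(20.0, 200.0), (10.0, 400.0)]
print(simulate_chimera(s1, s2, random.Random(0)))
```

Whether a pair of spectra qualifies as a negative pair has to be checked before calling the function; the sketch only covers the normalization, rescaling, and merging steps.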

We postulate a theoretical chimeric peptide $P_C$ for each chimeric spectrum $S_c \in CS$. The corresponding theoretical chimeric peptide ion is then given by $P_C^+ = (P_C, z)$, where z is the generators' precursor charge (which is the same for both generators). Furthermore, the precursor m/z of the generators (which is also the same) is associated with $P_C^+$. Three examples with ion annotations are given in Figure B.1.1, Figure B.1.2, and Figure B.1.3 on page 93 et seq.

3.1.2 Finding real chimera

We used the Crux¹ function tide-search followed by the Crux function percolator with a relatively high false discovery rate (FDR) of 0.1 to find potential chimeric spectra in the MassIVE-KB dataset. Using an increased FDR of 0.1 instead of the default 0.01, we increased the number of possible PSMs. We found ∼22 million

¹ http://crux.ms/