
Linköping University | Department of Physics, Chemistry and Biology

Master thesis, 30 ECTS | Bioinformatics

2017 | LITH-IFM-A-EX--17/3310--SE

PePIP: a Pipeline for Peptide-Protein Interaction-site Prediction

(Swedish title: PePIP: en Pipeline för Förutsägelse av Peptid-Protein Bindnings-site)

Isak Johansson-Åkhe

Supervisor: Claudio Mirabello
Examiner: Björn Wallner

Date: 09/06-2017
Division, Department: Department of Physics, Chemistry and Biology, Linköping University
ISRN: LITH-IFM-A-EX--17/3310--SE
Language: English
Report category: Master thesis (Examensarbete)
Title: PePIP: a Pipeline for Peptide-Protein Interaction-site Prediction
Author: Isak Johansson-Åkhe
Keywords: PPI, Protein-protein interaction, Protein-peptide interaction, Random Forest, Bioinformatics



Abstract

Protein-peptide interactions play a major role in several biological processes, such as cell proliferation and cancer cell life-cycles. Accurate computational methods for predicting protein-protein interactions exist, but few of these methods can be extended to predicting interactions between a protein and a particularly small or intrinsically disordered peptide.

In this thesis, PePIP is presented. PePIP is a pipeline for predicting where on a given protein a given peptide will most probably bind. The pipeline utilizes structural alignment to peruse the Protein Data Bank for possible templates for the interaction to be predicted, using the larger chain as the query. The possible templates are then evaluated as to whether they can represent the query protein and peptide using a Random Forest classifier machine learning algorithm, and the best templates are found by combining the Random Forest's evaluation with hierarchical clustering. These final templates are then combined to give a prediction of the binding site.

PePIP is shown to be highly accurate when tested on a set of 502 experimentally determined protein-peptide structures, suggesting a binding site on the correct part of the protein surface roughly 4 out of 5 times.


Acknowledgments

I would like to extend a special thanks to the following individuals, who held great influence over this work:

Björn Wallner, for giving me the opportunity to produce this work under his guidance, and for many creative and enlightening discussions.

Claudio Mirabello, for his patient and insightful advice.

Johannes Salomonson, for his great opposition and feedback.

Claudia Bratu, for her invaluable support.


Contents

Abstract
Acknowledgments
Contents
List of Figures

1 Introduction
1.1 Aim
1.2 Research questions
1.3 Delimitations

2 Theory
2.1 Protein Structure and Function
2.2 Structural Alignment
2.3 Contact Order
2.4 Random Forest machine learning
2.5 Predictor Evaluation
2.6 Hierarchical Clustering

3 Method
3.1 Dataset
3.2 The Pipeline
3.3 Benchmarking and comparison to other methods
3.4 Work Process

4 Results
4.1 Random Forest
4.2 Result Clustering
4.3 Evaluation and Benchmarking
4.4 Example
4.5 Figures

5 Discussion
5.1 Feature Relevance
5.2 Structural Similarity Clustering
5.3 Final Result Clustering
5.4 Method
5.5 Process

6 Conclusion


List of Figures

2.1 Examples of possible active sites of serine proteases
3.1 Overview of the PePIP pipeline
3.2 Alignments per protein, histogram
4.1 Matthew's Correlation Coefficient with different feature groups
4.2 Effects of different feature groups on performance
4.3 Bar graph of feature importance
4.4 Frequencies of different amino acids in data-set interaction-sites
4.5 Protein length and negative rate
4.6 Structure similarity amongst target chains
4.7 Structural similarity effect on Matthew's Correlation Coefficient
4.8 Using structure representatives: effect on Matthew's Correlation Coefficient
4.9 Pipeline performance for different Random Forest cutoff values
4.10 Pipeline performance for different clustering versions
4.11 Matthew's Correlation Coefficient for different final versions
4.12 Difference in Matthew's Correlation Coefficient for different proteins
4.13 Effect of alignment quantity and quality on performance
4.14 Effect of target protein secondary structure disorder on performance
4.15 Effect of having structurally similar proteins in the data-set on final performance
4.16 Comparison between MCC-score and Pipeline Score
4.17 Comparison to Pepsite2


1 Introduction

Protein-protein interactions are heavily involved in all major cellular processes. As the 3-dimensional structure of a protein is closely related to its function, understanding the structural details of protein-protein interactions, and thus the effects they have on protein activity, is crucial when attempting to understand biological processes.

Several methods exist to experimentally determine the structure or nature of protein-protein interaction complexes. High-throughput methods such as yeast two-hybrid [1] or BioID [2] can quickly ascertain which proteins belong to the same complexes, but give no detailed information about the structure of the interaction and suffer from many false negatives and positives [3]. Alternatives for obtaining more detailed structural information include NMR, X-ray crystallography, and SNP activity assays. These, however, take up a lot of lab-time [4]. Thus, many computational methods have arisen to complement these experimental methods, predicting interaction-complex structure or probability.

When predicting new protein interactions, it has proven effective to use already solved models or interactions as templates for training and/or running new prediction methods [5, 6]. Such template-based methods search the existing data-set for complexes similar to the one currently being investigated and attempt to draw information from what is found. If, for instance, Cyclophilin A from Gorilla gorilla interacts with the HIV-1 capsid protein, then it is likely that Cyclophilin A from Homo sapiens interacts with the HIV-1 capsid protein in a similar fashion. Classically, the search for similarity has been performed with regard to protein sequence, but many newer methods have started utilizing structural searches for similarity instead, as structure is more conserved than sequence [7], and more structures are added to the PDB every day, increasing the number of structural templates available.

One effective method for predicting protein-protein interactions using structural template searches is InterPred [8], a pipeline for predicting the structure of protein-protein interactions given only the proteins' sequences, modeling the interaction complex after already solved complexes from the Protein Data Bank. This method, however, like most methods that rely on searching for structurally similar templates, encounters difficulties when one or both of the query proteins are small or disordered poly-peptides, as it is difficult to find a significant structurally similar match to something lacking properly defined structure. Performing structural match searches with such a peptide will either produce a vast number of insignificant matches, or close to no matches at all.


Small and/or disordered poly-peptide fragments do participate in many protein interactions, however, and are vital to some important biological processes, like the cell life-cycle and cancer prevention [9]. It is relatively common that an otherwise well-structured protein contains a disordered region, parts of which could take an ordered structure upon binding another protein (such segments are called Molecular Recognition Features, or MoRFs), or which could simply be small disordered regions which bind certain sites with low affinity [10, 11]. In recent years, there has been increased interest in peptides when it comes to designing drugs, both as targets for small molecules and for use themselves as agonists or antagonists against a larger protein interaction-partner [12, 13]. From this it is obvious that understanding protein-peptide interactions is just as important as understanding interactions between larger, well-folded proteins.

Protein-protein interfaces have been shown to be highly degenerate, and some studies have claimed that as much as 88 % of all native protein-protein interfaces can be structurally closely matched to at least one already existing similar interface in the PDB, even though far from all quaternary protein structures are structurally determined [14].

In this work, it is hypothesized that because of the degeneracy of protein-protein interfaces, some template structures describing an interaction similar to a given peptide-protein interaction complex should in most cases be found within the PDB, even when using only the well-structured protein participant as the search query. It is also theorized that by finding these templates and analyzing them with regards to their other similarities to the disordered peptide participant with the help of a Random Forest machine learning algorithm, using such information as amino acid composition or predicted secondary structure composition, it will be possible to further refine the selection of template matches, and with the help of this refined selection predict where on a protein's surface a particular peptide might bind.

1.1 Aim

In this project, a computational pipeline named PePIP (Peptide-Protein Interaction Predictor) is engineered and evaluated. The pipeline’s purpose is to predict where on a given protein’s surface a particular peptide fragment would most likely bind.

PePIP will exist as support for protein interaction or structural determination studies. It will hopefully be powerful enough to be used as an indication of how the peptide-protein interaction of specific couples might actually look, supplying a basis for further docking protocols or interaction studies.

1.2 Research questions

This project will attempt to answer the following research questions:

1. Can structural template searches and knowledge of the peptide fragment's sequence be enough to accurately predict where on a given protein a certain peptide will bind?

2. What information about the template files is relevant when evaluating whether they can model an interaction or not?

3. How well will the prediction-pipeline PePIP fare in a benchmark, and when tested against existing protocols with the same purpose?

1.3 Delimitations

The method presented herein was created to predict interactions between proteins and peptides which contain no special residues or atoms. That is, the method does not take into account glycosylation, phosphorylation, non-standard amino acids (such as selenocysteine), D-amino acids, cofactors, or other additions to the peptide chains which are not recognized as unmodified versions of the standard 20 amino acids.

2 Theory

2.1 Protein Structure and Function

It has long since been established that there is a strong connection between a protein's 3-dimensional structure and its function [15]. Most often, this holds especially true when discussing protein-protein interactions. Many protein-protein interactions rely on the interaction-surfaces being guided into a tightly interlocking position by steric and electrostatic forces: positively and negatively charged residues often interact, and hydrophobic residues favor interaction with other hydrophobic residues [15]. Generally speaking, it has also been shown that the shape-complementarity of two protein surfaces positively correlates with how well the two associate [16].

Because most protein functions and interactions rely on certain of the protein's amino acids being in certain positions in space relative to each other, protein structure can be seen to be generally more conserved than sequence [7]. For example, there is an abundance of serine proteases which all contain the same catalytic triad of serine, histidine, and aspartic acid, in the same relative positions in space. The specific spatial positioning of these residues relative to each other is vital to their catalytic function, hence this trio of residues can be seen in these relative spatial positions in many proteases, even if their sequential positions are not conserved relative to each other [17]. A further visualization of this example can be found in Figure 2.1. Because sites important to protein function or stability tend to be more conserved, just like the active site of serine proteases is highly structurally conserved, residues participating in protein-protein interaction are also generally more conserved than usual. If a participating residue from one protein in the pair mutates, it is likely that the interaction will suffer, and thus such mutations are generally disfavored, resulting in higher evolutionary conservation for such residues [18, 19].

When disordered peptide fragments participate in interactions, they often settle into a structure complementary to what they interact with, interacting through induced fit [10, 11]. A good example of induced fit is Cyclophilin A, which strongly associates with any free AlaAlaProPhe fragment by perfectly fitting it into the protein's binding pocket, changing the fragment's backbone conformation during the fitting to ensure optimal binding interactions [20, 21]. In this case, the structure of the ordered protein is important to the interaction and activity, but the fragment need only possess the right residues in the right sequence order, and have a relative degree of freedom of movement [20].


Figure 2.1: Examples of possible active sites of serine proteases. The catalytic properties of the serine depend on the hydrogen-bonding between the three residues via histidine, meaning these interactions must be present for the protease to function. Since this dependency is related to spatial position and residue type only, the three participating residues could exist at various positions in a protein's sequence, and as long as the fold puts them in this approximate spatial position relative to each other (and the site is accessible to the ligand), they should be able to perform their proteolytic function [17].

2.2 Structural Alignment

Throughout many applications in the research field of structural biology, structural alignment is extensively utilized [22, 23]. It is commonly used to compare different structures, judge structure similarity, or perform searches for similar structures. Many different methods exist for finding and scoring the optimal pairing of residues between structures; an example of a common score evaluation method is RMSD, which measures the root-mean-square distance between equivalent residues [22]. The task of optimally aligning equivalent residues, however, is NP-hard, and many different methods exist for producing an alignment. One such method, which will be used in this project, is TM-align. TM-align quickly and efficiently aligns two 3-dimensional protein structures by utilizing the coordinates of the backbone Cα atoms of the residues in the structures to be aligned, in five steps [23]:

1. First, three rough but quick alignments are performed:

   a) One alignment of secondary structure is performed using dynamic programming. The secondary structure of a residue is defined by the backbone angles of the 5 residues in the residue of interest's immediate sequential vicinity, and single-residue secondary-structure elements are absorbed into nearby structure to decrease error.

   b) A second, gap-less alignment is generated. Other structural alignment methods also utilize these kinds of starting structures, but tend to score by RMSD. TM-align instead uses the TM-score.

   c) The final initial alignment is done by dynamic programming much like the first, but the score-matrix is a combination of the score-matrices used for the first and second initial alignment.

2. The aligned residues from these initial alignments are then used with the TM-score rotation matrix to rotate the structures, and a score similarity matrix is obtained:

   $$S(i, j) = \frac{1}{1 + d_{ij}^2 / d_0(L_{min})^2} \qquad (2.1)$$

   where $d_{ij}$ is the distance between residue i of the first structure and residue j of the second structure, $d_0(L_{min}) = 1.24 \sqrt[3]{L_{min} - 15} - 1.8$, and $L_{min}$ is the length of the smaller protein.

3. A new alignment is then made using S(i, j) with heuristic dynamic programming.

4. Steps 2 and 3 are then repeated, with step 2 using the most recent alignment, until the alignments are stable.

5. The alignment with the highest TM-score is returned.
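As a minimal illustration of equation 2.1, the similarity score can be computed from inter-residue Cα distances as in the sketch below (a transcription of the formula only, not the full TM-align procedure; NumPy and the toy distances are assumptions):

```python
import numpy as np

def d0(l_min):
    # Distance scale from equation 2.1; l_min is the length of the smaller protein
    return 1.24 * np.cbrt(l_min - 15) - 1.8

def similarity(distances, l_min):
    # S(i, j) = 1 / (1 + d_ij^2 / d0(L_min)^2), element-wise over residue-pair distances
    return 1.0 / (1.0 + (distances / d0(l_min)) ** 2)

# Toy example: residue pairs at 0, 5, and 10 Angstrom in a match where L_min = 150
print(similarity(np.array([0.0, 5.0, 10.0]), 150))
```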

The TM-score of an alignment is an evaluation score much like RMSD. Contrary to most such scores, though, the TM-score uses all of the residues of the aligned structures, rather than only select fragments, as other advanced scoring methods do. To ensure that including poor-accuracy regions and local structural deviations does not ruin the precision of the score, residue pairs with smaller distances are weighted relatively higher than those with longer distances, through a modified version of the Levitt-Gerstein weight factor [24]. Because of this, however, the method can be more lenient when matching larger proteins. TM-score is also normalized by protein size, which means the unmodified score can be used to quickly assess whether a protein structure pair is a random match, which would on average yield a TM-score of 0.17, whereas a TM-score of 0.5 is defined as a significant match between one protein and at least part of the other [24].

Using TM-align offers one major advantage over other structural alignment methods: thanks to the method's efficient initial alignments and the consistency of the generated rotational matrices, convergence happens quickly, and TM-align is therefore a fast alignment method, outperforming other popular structural alignment methods such as CE or DALI when it comes to speed, while still generating alignments of similar quality [23]. As stated in the article by Zhang and Skolnick (2005), "Because of its advantage in both speed and accuracy, TM-align is conveniently exploited in large-scale, sequence-independent structure comparisons." [23]. Hence it is used in this project to peruse the PDB for structural templates.

Aligning structures to each other this way means that structure pairs which share structural motifs, and smaller proteins which resemble a sub-domain of a larger protein, will generate high TM-scores. As a result, the alignments will score matches of large and complicated structural motifs highly, but since no account is taken of contact order or fold complexity, only of relative position in space and residue identity, extremely simple and common motifs such as long α-helices will also yield high scores when matched. This means that the alignments might produce high-scoring matches based on the paired proteins sharing nothing more in common than each having a long α-helical segment with roughly similar residues.

2.3 Contact Order

Contact order is a way of measuring the topological complexity of a protein. For every residue, it measures the average sequential distance from that residue to all other residues in the protein within contact distance, commonly defined as 6 Ångström. When divided by protein length, this gives the relative contact order, a number describing the average sequential distance of the average residue-residue contact in the protein. Relative contact order can be described by the formula below:

$$CO = \frac{1}{L \cdot N} \sum^{N} \Delta S_{i,j} \qquad (2.2)$$

where L = protein length, N = total number of contacts within the protein, and $\Delta S_{i,j}$ = the sequential distance between residues i and j [25].

In simpler terms, contact order measures how complex the folds of the protein are. A mainly α-helical protein would generate a low contact order, since most contacts are between residues separated by only a couple of sequential positions. A protein with many β-sheets, however, would probably give a high contact order, since the strands of a β-sheet need not be sequentially close at all to form the sheet [25, 15].
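The following sketch computes relative contact order per equation 2.2, assuming contacts are defined between Cα atoms within the 6 Ångström cutoff mentioned above (the exact atom set used for contacts may differ in practice):

```python
import numpy as np

def relative_contact_order(ca_coords, cutoff=6.0):
    # ca_coords: (L, 3) array of C-alpha coordinates for a protein of length L
    L = len(ca_coords)
    dist = np.linalg.norm(ca_coords[:, None, :] - ca_coords[None, :, :], axis=-1)
    i, j = np.triu_indices(L, k=1)        # every residue pair counted once
    in_contact = dist[i, j] < cutoff
    n_contacts = in_contact.sum()         # N in equation 2.2
    seq_separation = (j - i)[in_contact]  # Delta S_ij for each contact
    return seq_separation.sum() / (L * n_contacts)
```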

2.4 Random Forest machine learning

Machine learning encompasses methods for having a computer look at training sets of data where the correct answers are known, and then getting it to extract connections and dependencies among the training set data values that it can then use to predict information regarding test-sets of similar data with unknown answers [26]. The point of machine learning is that such methods should be able to make predictions in situations more complex than those that can be described by simple linear regression. Machine learning algorithms can be tricky to apply, however, for a multitude of reasons.

Firstly, an algorithm can be "over-fit", which means that when the training data is noisy and the algorithm is too detailed in what data it considers, it might produce a predictor which explains the noise and errors perfectly, resulting in very good predictions on training data but poor performance when applied to new data with different noise. Secondly, machine learning is often very sensitive to bias, and if some part of the training data is overrepresented, then the predictor will probably be biased. Thirdly, because of how sensitive machine learning algorithms are to how the input data is represented, it is necessary to have large training sets to acquire accurate predictors, and it is also often necessary to carefully choose how the data is represented to the algorithm, i.e. what "features" it should be shown [26, 27].

Random Forest is a machine learning algorithm constructed to quickly yield high prediction accuracy while remaining more robust to over-fitting and data noise than other machine learning algorithms [27]. Random Forests are constructed by making ensembles of randomly grown decision tree predictors which then consider input data in parallel and "vote" for the correct answer. It is then possible to look at what each individual tree voted, to understand how "sure" the Random Forest is of its answer: if 6 out of 10 trees agree the answer is True, then this can be interpreted as the Random Forest predicting that the answer is probably True, but it is close to being uncertain [27].

A decision tree is simply a set of paths with questions at the "branching" points. An outgoing "branch" is chosen depending on the answer to the question at a branching, and this outgoing branch leads to a new question further down that branch of the tree, and so on, until an end is reached, called a "leaf", containing the answer provided by the tree.

The "random" in Random Forest comes from how these trees are constructed. The trees are "grown" by randomly selecting what features of the input-data to consider at a branching, and then randomly selecting new sets of features to be considered for the next branchings, and so on. While doing this, the training-data is run through the trees and the numeric details of the queries at the branchings and answers at the leaves are selected through what the training-data indicates would be the optimal split or answer at that point [27].


Combining several randomly generated decision trees into a Random Forest results in smaller error rates and a more robust method, compared to many other machine learning algorithms (such as using only one random decision tree). Because of the large number of randomly generated branchings included in Random Forests, the method is especially resistant to over-fitting, as even if one tree were to suffer from it, it is unlikely all trees would, thanks to the law of large numbers [27].

When adding features to a Random Forest, the method suffers from diminishing returns, and adding too many mediocre features might drown out the few good ones, resulting in a decrease in performance even though the features on their own might seem relevant. It is also possible to use a completed forest to extract how important the different features of the training data are to the predictions, which can be useful for choosing what features to discard when there are too many. One way of doing this is by measuring, for each node (branching) which splits on a given feature, the error or "impurity" reduction of that node weighted by how many training samples reached it. This in turn depends on how error or impurity is calculated. In this project, gini impurity will be used, which represents how often randomly labeling elements in the training set in accordance with the distribution of labels in that set would be incorrect, and it is calculated by:

$$I_G(f) = 1 - \sum_{i=1}^{J} f_i^2 \qquad (2.3)$$

where $f_i$ is the fraction of elements labeled i and J is the total number of different labels (so $i \in \{1, 2, ..., J\}$) [28]. This way of measuring feature importance, however, does not take into account that features might be dependent on each other. For instance, feature #2 may score high in "importance", but only because actually important features #1 and #3 depend on it, and without #2 they would contribute only slightly less. This leads to #2 being involved in many important branchings (nodes), even though its impact is minor, and it receives an artificially high importance. Additionally, importance is measured relative to the other features, and high importance simply means that a feature is good compared to the others used. As such, calculating feature importance this way can be used for getting a general idea of what features are used often and might be important, but the data should be analyzed carefully and not be the sole basis of conclusions regarding feature usefulness [27].
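As a small concrete example, equation 2.3 translates directly into code:

```python
from collections import Counter

def gini_impurity(labels):
    # I_G(f) = 1 - sum_i f_i^2, where f_i is the fraction of samples with label i
    total = len(labels)
    return 1.0 - sum((n / total) ** 2 for n in Counter(labels).values())

print(gini_impurity([True] * 10))               # 0.0 -> a perfectly pure node
print(gini_impurity([True] * 5 + [False] * 5))  # 0.5 -> a maximally mixed binary node
```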

2.5 Predictor Evaluation

Two main methods will be used in this project to evaluate predictions: Matthew's Correlation Coefficient and ROC-curves, both described below.

2.5.1 Matthew's Correlation Coefficient

Matthew’s Correlation Coefficient, sometimes called the phi coefficient, is a way of measuring how well two sets of binary data correlate. It is a version of Pearson’s Correlation Coefficient adapted for use with binary data, and is more robust to unbalanced data-sets [29]. Matthew’s Correlation Coefficient can be calculated as follows:

$$N = TN + TP + FN + FP \qquad S = \frac{TP + FN}{N} \qquad P = \frac{TP + FP}{N}$$

$$MCC = \frac{TP/N - S \cdot P}{\sqrt{P S (1 - S)(1 - P)}} \qquad (2.4)$$

(17)

2.6. Hierarchical Clustering

where TN = number of True Negatives, TP = number of True Positives, FN = number of False Negatives, and FP = number of False Positives [29].

This will output a value ranging from -1.0 to 1.0, where 0.0 means absolutely no correlation, 1.0 means perfect correlation, and -1.0 means perfect negative correlation.
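A direct transcription of equation 2.4 into code (a sketch; returning 0.0 when the denominator vanishes, i.e. when a row or column of the confusion matrix is empty, is a common convention and an assumption here):

```python
import math

def matthews_cc(tp, tn, fp, fn):
    # Equation 2.4: MCC from the four confusion-matrix counts
    n = tn + tp + fn + fp
    s = (tp + fn) / n  # fraction of actual positives
    p = (tp + fp) / n  # fraction of predicted positives
    denominator = math.sqrt(p * s * (1 - s) * (1 - p))
    if denominator == 0:
        return 0.0
    return (tp / n - s * p) / denominator

print(matthews_cc(tp=40, tn=45, fp=5, fn=10))  # roughly 0.70, a strong correlation
```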

2.5.2 ROC-curves

A ROC-curve is a plot of True Positive Rate versus False Positive Rate for the different cutoff values used to turn the predicted data into a binary prediction. This curve shows the relationship between sensitivity and specificity, and will preferably display a steep slope near the origin, indicating many True Positives for every False Positive at the optimal cutoff [30]. As a high True Positive Rate relative to the False Positive Rate is preferred, the area under the curve (AUC) can be used as a measure of how well the predicted data describes the true data. True Positive Rate and False Positive Rate are:

$$TPR = \frac{TP}{P} \qquad FPR = \frac{FP}{N}$$

where TPR = True Positive Rate, FPR = False Positive Rate, TP = True Positives = the number of correct positive predictions, P = the total number of positives in the data, FP = False Positives = the number of incorrect positive predictions, and N = the total number of negatives in the data [30].
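For instance, a ROC-curve and its AUC can be computed with scikit-learn; the labels and scores below are toy stand-ins for per-residue predictions:

```python
from sklearn.metrics import auc, roc_curve

y_true = [1, 1, 0, 1, 0, 0, 1, 0]                   # 1 = residue truly interacts
y_score = [0.9, 0.8, 0.7, 0.6, 0.4, 0.3, 0.2, 0.1]  # predicted probabilities
fpr, tpr, thresholds = roc_curve(y_true, y_score)   # one point per cutoff value
print("AUC:", auc(fpr, tpr))
```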

2.6 Hierarchical Clustering

There exist many ways of dividing sets of data into subgroups or clusters, where one cluster describes a collection of similar data-points; one of these is hierarchical clustering. Hierarchical clustering is a greedy clustering method, in that it does not look ahead but only merges clusters which are currently deemed most similar, rather than analyzing which clusters could become most similar if merges happened another way [31].

The method bases the clustering on a dissimilarity matrix with measures of how much each data-point differs from each other data-point. The two that differ least are merged into a cluster. This is then repeated, always merging the two samples most similar to each other, with dissimilarity now having to be calculated not only between individual samples, but between groups of merged ones as well. Different methods exist for calculating this, but a common method, also used within this work, is to take the mean of the highest similarity to any sample within the merge and the lowest similarity to any sample within the merge. This is called the "average" method [31].

As previously stated, the hierarchical clustering algorithm needs a dissimilarity matrix to work from. Many different methods exist for calculating such a "distance" between data-sets, but when analyzing binary data, "hamming distance" is both simple and efficient to use. It is simply the fraction of the data which is in disagreement between the two sets [32].

Hierarchical clustering continues until only one cluster remains, but a record is kept of the distance between all sets merged. Thus, it is possible to go back and look at these merges in a dendrogram or similar plot to manually pick how many clusters are wanted. A popular method of automatically picking how many clusters to optimally have is the "elbow method". This method works by measuring the sum of squares of variation within the clusters as a function of how many clusters there are, and then picks the number of clusters after which the rate of change of this number slows down the most [33].
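The procedure described above can be sketched with SciPy (which the Method chapter states was used for the implementation); the binary vectors and the fixed cluster count are toy assumptions, and the elbow-based choice of cluster count is left out:

```python
import numpy as np
from scipy.cluster.hierarchy import fcluster, linkage
from scipy.spatial.distance import pdist

# One binary vector per data-point, e.g. 1 = position belongs to a suggested surface
data = np.array([[1, 1, 0, 0, 0],
                 [1, 1, 1, 0, 0],
                 [0, 0, 0, 1, 1],
                 [0, 0, 1, 1, 1]])

distances = pdist(data, metric="hamming")      # fraction of disagreeing positions
merges = linkage(distances, method="average")  # average-linkage hierarchical clustering
labels = fcluster(merges, t=2, criterion="maxclust")  # cut the tree into 2 clusters
print(labels)                                  # e.g. [1 1 2 2]
```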

3 Method

The basis of the pipeline developed is that it will use structural alignment to find possible templates for interaction modeling, use a Random Forest classifier to identify which of these templates could represent a protein-peptide interaction, and lastly summarize this data into a prediction of what residues in the protein will participate in interaction with the given peptide. This prediction should be easy to use as a starting point for further docking protocols, if the user wishes.

3.1 Dataset

As previously stated, a machine learning algorithm needs to be trained using data similar to what it should predict. Preferably, the training will be performed on as many data-points as possible, to ensure an accurate predictor. Thus, a large training set would be needed. Additionally, a test-set is required, independent from the training set, to be able to benchmark the pipeline and judge it in comparison to other methods.

To construct these sets, the PDB (fetched on 19/05-2016) was scoured for structure-files of at least 3 Ångström resolution containing interactions between a larger globular chain of at least 50 residues and a smaller chain of 5 to 25 residues. The length requirement for the smaller peptide was based on the work of Amrita Mohan et al. [10], where it is stated that this length of peptide is often representative of peptides disordered when unbound. Furthermore, candidates where the interaction surface between the peptide and protein was less than 400 Å² were discarded, to make sure that only biologically relevant interactions were included. This was done by comparing the exposed surface area, calculated with NACCESS [34], of the peptide with and without the interacting protein present. Once again, this number was taken from the work of Amrita Mohan et al. [10]. Additionally, since this pipeline focuses on surface interaction between protein and peptide, and not deep binding-pockets, co-factor binding, or DNA-binding complexes, candidate files where more than 60 % of the peptide was buried within its interaction-partner, or where nucleotide chains or other special atoms were bound to the protein or peptide, were discarded. Lastly, the larger chains of interest in the remaining files were clustered at 30 % sequence identity using BLASTCLUST [35] to ensure no closely related proteins were represented more than once within the data-set.

These steps resulted in 502 files containing surface interactions between a larger protein chain and a smaller peptide-fragment.


Since machine learning algorithms are dependent on large data-sets to produce accurate predictors, it was judged that this set of 502 complexes was too small to be divided into a test-set and a training set. As such, all 502 complexes were used for both training and testing by utilizing jack-knifing, meaning that when the Random Forest was to be used to predict data regarding one complex, it was trained on data from the remaining 501. This means that even though the training took up much more time, since the forest had to be re-trained before analyzing every complex, a larger set of data-points was available for every prediction.
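A minimal sketch of this jack-knifing scheme (the per-complex feature rows and labels are hypothetical stand-ins for the real alignment data):

```python
from sklearn.ensemble import RandomForestClassifier

# Hypothetical data: complex ID -> (feature rows, labels), one row per alignment
data = {
    "complex_a": ([[0.9, 120.0], [0.4, 80.0]], [1, 0]),
    "complex_b": ([[0.7, 95.0], [0.3, 60.0]], [1, 0]),
    "complex_c": ([[0.8, 110.0], [0.5, 70.0]], [1, 0]),
}

for held_out, (x_test, _) in data.items():
    # Re-train on all complexes except the one currently being predicted
    x_train = [row for name, (x, _) in data.items() if name != held_out for row in x]
    y_train = [lab for name, (_, y) in data.items() if name != held_out for lab in y]
    forest = RandomForestClassifier(n_estimators=100, random_state=0).fit(x_train, y_train)
    print(held_out, forest.predict_proba(x_test)[:, 1])
```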

3.1.1 Nomenclature

When discussing structures, sequences, and alignments in later steps, some nomenclature is needed: The target peptide stands for the peptide which the pipeline is currently attempting to make predictions for. The target protein or chain stands for the protein structure which the target peptide binds to; it is where on this protein's surface the target peptide binds that PePIP will predict. The partner protein or chain stands for the structure the target chain was aligned to in the alignment currently being discussed.

3.2 The Pipeline

The purpose of the PePIP pipeline is to predict where on a protein’s surface a smaller peptide-fragment might bind. The pipeline can be summarized in these steps:

1. Structural alignments: The protein is aligned against all multi-chain structures in the PDB to find possible templates for interaction. Only alignments with relevant structural matches are saved.

2. Feature Extraction: The features that will be judged by the Random Forest are extracted from all alignments.

3. Random Forest Classifier: The Random Forest judges what templates it believes to be representative of interactions which could include the peptide.

4. Result Clustering: The results are clustered by what interaction surface they represent and the highest scoring surface is chosen. This generates the final prediction of where on the protein the peptide should bind.

Figure 3.1: Overview of the PePIP pipeline. The target chain structure and target peptide sequence are fed through (1) structural alignments, (2) feature extraction, (3) the Random Forest classifier, and (4) result clustering, producing a prediction of the interaction-surface.

3.2.1 Structural Alignments

The first step in the PePIP pipeline is to find possible templates for modeling the protein-peptide interaction. This was done by structurally aligning the target chain's structure against every chain of at least 25 residues of every protein structure in the PDB, using TM-align [23]. Since the interest lies in finding structures with protein-protein or protein-peptide interactions, files with only one protein chain could be ignored in the interest of time.

All alignments which displayed a score indicating a relevant match, meaning a TM-score of 0.5 or more [24], were deemed possible templates and sent to the next step in the pipeline. Since TM-score is normalized by chain length, every alignment yields 2 separate TM-scores, one normalized by query A (in this case the target chain) and one by query B (in this case the partner chain). If one of these scores is high while the other is low, then a smaller query protein might have been closely matched to a sub-section of the larger query protein. The alignment is not a relevant match for the larger protein, since only a smaller fragment of it was properly aligned, but it is relevant for the smaller protein. Since this pipeline simply looks for close structural matches, it looks at both TM-scores generated by the alignments, and if either of them indicates a relevant match, the alignment will pass as a possible template.

The output from the structural alignment step is a number of alignments between the target chain and many different partner chains from the PDB. All of these partner chains have some form of inter-chain interaction, which together with the alignment make up a possible template for interaction.
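The filtering rule amounts to keeping an alignment if either of its two TM-score normalizations reaches the significance threshold, as in this small sketch (function name and threshold default are illustrative):

```python
def is_possible_template(tm_by_target, tm_by_partner, threshold=0.5):
    # An alignment passes if either normalization indicates a relevant match [24]
    return max(tm_by_target, tm_by_partner) >= threshold

print(is_possible_template(0.62, 0.31))  # True: one normalization is significant
print(is_possible_template(0.41, 0.33))  # False: neither score reaches 0.5
```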

3.2.2 Feature Extraction

In this step, a number of features representing the aligned complexes are calculated for each alignment, for the Random Forest to use in judging whether a template might be used to represent a protein-peptide interaction between the target protein and target peptide. These features were chosen on the basis that they could probably be relevant from a structural biology point of view, but it is unclear how and to what degree, hence the need for machine learning. A large number of features were included, on the basis that their individual usefulness could be analyzed and poor features could be cut later. All features used are described below:

• Length: 5 features relate to different lengths of peptide sequences. The length of the target peptide is given as one feature, and each of the aligned protein chains’ lengths are given as their own features. The length of the alignment between the two chains is given as well. Finally, the length of the stretch of the alignment which represent residues that are involved in inter-chain interactions in the partner protein’s structure (ergo the size of the suggested template interaction surface) is given as the last "length"-feature. Some length features are included for comparison among one another: as the suggested interaction surface should represent interaction to the target peptide, the size of this surface and the length of the peptide should be related in some way in a good match. The other length features are included for their relation to other features: target and partner chain length could be used to de-normalize TM-score for instance.

• Alignment Quality: 3 features represent the quality of alignment: the two TM-scores acquired by normalizing on the different chains’ lengths, and the RMSD between the aligned parts of the aligned chains. Even though TM-score is supposedly a superior description of alignment quality compared to RMSD, TM-score uses protein length not only as a final normalization, but also in a more complex step in its calculation, which results in larger proteins being granted higher TM-scores for semi-poor alignments, as can be seen in equation 2.1. As such, RMSD of the alignment is also included in cases where the aligned chains are particularly large and TM-score might be too lenient.


Hypothetically, a more closely aligned, and therefore more structurally similar, partner chain should be more likely to represent the target chain than a poorly aligned one.

• Alignment Complexity: 2 features represent the complexity of the aligned region. The contact order of the aligned region of the target chain is given as one feature, and the length of the longest unbroken aligned α-helix is given as the other. These features are included to give the Random Forest the tools necessary to identify alignments that consist almost exclusively of long α-helices, a very common structural motif which also generates high TM-scores.

• Amino Acid Composition: 3 vectors of 20 values each represent relative amino acid compositions (see the sketch after this list). The relative amino acid composition for a sequence of residues is simply a vector with a percentage value for each of the 20 standard amino acids, representing how much of the sequence consists of the given residue. For example, a vector of [0.2 0 0 0 0.3 0.2 0 0 0 0 0 0 0 0 0 0 0 0 0.1 0.2] would indicate that the sequence consists of 20 % Alanine, 30 % Phenylalanine, 20 % Glycine, 10 % Tryptophan, and 20 % Tyrosine. The three sequences represented this way are:

– The target peptide sequence

– The residues of the target chain which have aligned counterparts in the partner chain which are involved in inter-chain interactions (ergo the interaction-surface suggested by this alignment)

– The residues that the aligned segments of the partner chain are involved in interactions with

The reason for giving the information in this format is that the Random Forest might then group residues depending on type (hydrophobic, hydrophilic, charged, etc.) and realize that hydrophobic residues prefer interaction with other hydrophobic residues, while at the same time being given the option of directly comparing residue compositions. Another version of these features was also attempted, in which the scalar products between the different vectors were included.

• Secondary Structure Composition: 2 vectors of 3 values each represent relative secondary structure composition. These vectors function just like the Amino Acid Composition ones, except that they give what relative amount of the sequence is composed of α-helix, β-sheet, or random coil, rather than amino acids. The first vector is generated from the predicted secondary structure of the target peptide, and the second vector describes the secondary structure of the residues the aligned segment of the partner chain interacts with. The secondary structure of the target peptide was predicted using PSIPRED [36]. Many of the target peptides are probably disordered when unbound, but as some may take on a defined structure upon binding, this feature was kept, using the predicted most probable secondary structure of the peptide. These features could be used to check whether the predicted structure the peptide might take matches the structure of the aligned partner's binding ligand.

• Surface Conservation: 1 feature represents how conserved the proposed interaction-surface's residues are. The relative conservation of all residues in the target chain was calculated with PSI-BLAST [37]; the mean of the conservation of all residues aligned to counterparts which participate in inter-chain interactions then becomes this feature. As mentioned previously in the theory section, interaction sites are generally more conserved than the rest of the protein.

• Surface Exposure: 1 feature represents the relative exposure of the suggested interaction surface, meaning the relative exposure of the residues of the target chain which have aligned counterparts in the partner chain which are involved in inter-chain interactions. The relative exposed surface of each residue was calculated with NACCESS [34]. In this case, the relative exposure of a residue X is the absolute exposed surface of the residue divided by how much of that residue is normally exposed in a straight Ala-X-Ala chain. This feature is included as deeply buried sites probably have a lower chance of being involved in interactions than surface-exposed sites. As protein structure can change upon binding, however, it is possible that the inside of a protein would be the correct binding site.
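As referenced in the Amino Acid Composition item above, a sketch of the 20-value composition vector is given below; listing the residues in alphabetical one-letter order reproduces the example vector in that item:

```python
from collections import Counter

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"  # the 20 standard residues, one-letter codes

def aa_composition(sequence):
    # Fraction of the sequence made up of each of the 20 standard amino acids
    counts = Counter(sequence)
    return [counts.get(aa, 0) / len(sequence) for aa in AMINO_ACIDS]

# 20 % A, 30 % F, 20 % G, 10 % W, 20 % Y -- matching the example vector in the text
print(aa_composition("AAFFFGGWYY"))
```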

To validate that all features were relevant, they were divided into 8 groups, and each combination of these groups of features was used in the later steps, to investigate whether all features positively affect the prediction accuracy, and, if that is not the case, to find the best combination of features. These smaller groups were: TM-scores and target peptide length with surface exposure, target chain length and partner chain length, aligned length, RMSD, matching interaction length and mean conservation, relative amino acid composition, relative secondary structure composition, and contact order and longest aligned helix. Of these groups, the first one was used as a "base" together with all other combinations of the other 7 groups.

3.2.3 Random Forest Classifier

The purpose of the Random Forest in the PePIP pipeline is to predict whether a template protein-protein interaction-surface acquired from structural alignment could represent a protein-peptide interaction-surface between the target chain and target peptide. A template interaction-surface is defined as correct if at least a certain percentage of it overlaps with the actual protein-peptide interaction-surface observed in the experimental data. Different cutoff values for when a surface is "correct" were tested for their impact on the final predictions of the pipeline, but during testing of the Random Forest, 20 % was used.

The Random Forest was set up, trained, and used with the scikit-learn package for Python [38]. For all tests, 100 trees were used, with the standard settings for scikit-learn's Random Forest Classifier: at least $\sqrt{\text{total number of features}}$ features are considered at every branching, and only one data-point from the training set must make it to a leaf for it to be grown.

To prevent bias, a limit was set on how many data-points (alignments) every protein in the data-set could contribute when used for training. No more than 2000 data-points were used from every protein; if more were available, 2000 were chosen at random. This limit was set after consulting the data shown in figure 3.2 below.
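A sketch of this setup with scikit-learn (the data layout in `cap_alignments` is hypothetical; the classifier's standard settings already give the behavior described above):

```python
import random
from sklearn.ensemble import RandomForestClassifier

def cap_alignments(alignments_by_protein, limit=2000):
    # Keep at most `limit` randomly chosen alignments per protein, so that proteins
    # with very common folds do not dominate the training data
    capped = []
    for rows in alignments_by_protein.values():
        capped.extend(rows if len(rows) <= limit else random.sample(rows, limit))
    return capped

# 100 trees; by default, sqrt(number of features) candidate features are considered
# at each branching, and a leaf may hold a single training data-point
forest = RandomForestClassifier(n_estimators=100)
```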

The protein chains had previously been clustered with regards to sequence to mitigate bias from using many data-points from related proteins. However, since structure is more conserved than sequence, it is very possible that some of the target chains structurally match each other, which could possibly lead to a Random Forest biased towards certain structural motifs. To investigate this, the target chains were clustered with regards to structure similarity based on TM-score, and in addition to training a Random Forest on all data, another separate Random Forest was trained on an absolute clustered subset with no two structurally similar target chains in different clusters, using one representative structure from each cluster. The representative for each cluster was chosen as the structure within that cluster which could contribute the most alignments.

In this step, the relevance of the features was analyzed as well. Two methods were used: Firstly, as previously described, many different forests were trained using different combinations of 8 groups of features. The Matthew's Correlation Coefficients of these results can then be compared to see if there is an increase in performance when adding different feature groups. Secondly, using gini impurity, the importance of each feature was extracted from a forest trained on all data.


Figure 3.2: Histogram over how many alignments with a relevant TM-score were produced by the alignment step for each protein in the data-set. Because of the significant drop at circa 2000 alignments, that number was chosen as the limit for how many alignments any one protein could contribute when used for training, so that a subsection of proteins with common structural motifs (and therefore many relevant alignments) would not stand for a disproportionate amount of training data.

3.2.4 Result Clustering

At this point in the pipeline, there exists a number of structural alignments of the target chain to different partners which are involved in protein-protein interactions. By looking at which matched residues in the partner chain are involved in interaction, a template is obtained for which equivalent residues in the target chain would be involved in a similar interaction. All these templates have then been ranked by the Random Forest Classifier according to how probable it is that such a template site overlaps with an actual binding site for the target peptide. These results must then be summarized into one prediction of the binding site. A simple weighted sum or average is insufficient, however, as that would mean numerous low-scoring matches could drown out a few high-scoring matches.

As such, the alignment results were clustered depending on what part of the protein they described as the active surface, utilizing hierarchical clustering with hamming-distance scoring (implemented with the SciPy package [39]), and automatically choosing the number of clusters for each target protein using the elbow method. The scores of the top 10 % highest-scoring alignments for each cluster are then averaged to represent the cluster as a whole. These averages are then compared to each other, and the highest-scoring cluster is used for producing the final prediction. This prediction is made by, for every residue, looking through all alignments in the best cluster to see whether they predict that the residue interacts or not (0 or 1), and then taking the weighted average over all these predictions, where the weight is the score awarded by the Random Forest.
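The final scoring can be sketched as a weighted average over the binary per-residue calls of the templates in the winning cluster, weighted by their Random Forest scores (toy data; array names are illustrative):

```python
import numpy as np

def residue_scores(template_calls, forest_scores):
    # template_calls: (n_templates, n_residues) matrix of 0/1 interaction calls
    # forest_scores: one Random Forest score per template, used as the weights
    calls = np.asarray(template_calls, dtype=float)
    weights = np.asarray(forest_scores, dtype=float)
    return weights @ calls / weights.sum()

# Two templates agree on residue 0; only the higher-scoring one flags residue 2
print(residue_scores([[1, 0, 0], [1, 0, 1]], [0.9, 0.6]))  # [1.0, 0.0, 0.4]
```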


Different versions of the result clustering step were tested to find the best method. First of all, it was tested which conditions for determining what counts as a "True" representative template interaction-surface gave the best end-results. When testing the Random Forest, the criterion was that at least 20 % of the template interaction-surface must be included in the true interaction-surface. Here, 40 %, 60 %, 80 %, and 100 % were tested as well.

Secondly, versions with different approaches to scoring the final surface were tried. The different versions were:

• No Buried: A version where buried residues were ignored in both clustering and final scoring. The Random Forest should hopefully weed buried surfaces out by itself, as it has the fraction of exposure of the predicted interaction surfaces as a feature, but if it did not, this might help.

• Last Check Buried: This version includes buried residues in the clustering step, but removes them from the final answer.

• Top Scorers Only: When calculating the final scores for the residues in the surface, only the top 10 % of the templates in the best cluster are used. This could help prevent a few strong signals from being drowned out by a mass of weak ones, but it could also make the method more sensitive to false positives.

• Cluster Cutoff X: Four versions where different numbers of alignments are used to calculate each cluster's score. Hypothetically, a lower number should let more statistical outliers through, but will also increase the method's sensitivity. The numbers are the 5 % highest-scoring alignments in the cluster, all of the alignments in the cluster, or the static numbers of the 1 or 10 highest-scoring alignments in the cluster.

• No Clusters: Two last versions skipped the clustering altogether and simply made weighted averages over the predicted templates, just as if the method had used only one cluster containing all alignments, to test if clustering is necessary at all. One version of this used all alignments, and the other used only the 10 highest scoring.

As each cluster is awarded its own score, the score of the best cluster can be treated as an overall PePIP score, which could potentially be used to indicate how certain the pipeline is that it has found the correct answer. To investigate whether the PePIP score can be used reliably in such a way, it was tested whether it correlates with the quality of the pipeline's predictions. If the templates in the cluster used for the final prediction disagree, however, the result can be a poor prediction even with a high cluster score, so another scoring method was also tested: the mean of the cluster score and the highest residue scores in the prediction.

During this step, not only is the probability of binding predicted, but the residues that were included in the alignment partner's interacting ligand's interaction surface were processed as well, to give a probability of what kind of residue binds to what part of the final predicted interaction-site. While not directly relevant to the prediction of where on the target chain the target peptide binds, this information pertains to how it binds, and will be valuable if protein-peptide docking is attempted later.

3.3 Benchmarking and comparison to other methods

Utilizing jack-knifing, each one of the 502 protein-peptide pairs of the data-set could be used for benchmarking. In the end, the goal of the PePIP pipeline is to predict what residues on a protein’s surface are most likely to interact with a given short peptide-sequence. Every residue on the target chains’ surfaces can be classified as either interacting with the peptide (within 6 Ångström), or not. Because of this binary classification, it is easy to evaluate the PePIP pipeline with ROC-curves, measuring true positive rate versus false positive rate.


The data-set was also analyzed with the Pepsite2 web server [40] as a means of comparison. Pepsite2 is a template-less method which uses 3-dimensional position-specific scoring matrices to analyze where on a protein certain amino acids would most likely be found in actual protein-protein interactions. Pepsite2 then attempts to find a string of such sites on the protein's surface which fits the query peptide [40]. This produces a more detailed prediction than PePIP, since it also predicts the specific binding locations of the residues in the peptide sequence. However, Pepsite2 only supports peptides 10 residues or shorter, so when the data-set contained longer peptides, Pepsite2 was run several times for different fragments, and the results were summarized into one prediction. Curiously, there are two different ways of submitting jobs to the Pepsite2 web server, which give slightly differing results. Either the query protein structure can be submitted as a PDB-file, or a chain-id can be given for the server to fetch the chain from the PDB itself. Since the results of these two methods differ, both are used in the comparison as two separate sets of answers.

3.4 Work Process

To establish a solid base of understanding of protein-peptide interactions, the data-set was constructed before all other work. After this, it was clear that the pipeline could easily be divided into discrete steps. Thus, the flow of work would be that one step would be completed at a time, revisiting details and analyzing the previous step as deemed necessary during the project. This would enable constantly keeping focus on only one major problem, while still being flexible enough to revisit and improve already solved parts.

If any major or time-consuming changes to past parts would be deemed necessary by the student, a discussion would be held with the supervisor or examiner about the importance of such a change and its ramifications on the final time-schedule.

Throughout the project, the supervisor or examiner would check the current state of the work bi-weekly, to ensure its quality and discuss further improvements or how to proceed.


4 Results

4.1 Random Forest

4.1.1 Feature Relevance

Of the 8 sets of features, the one containing TM-scores was used as a base, and all combinations of the other 7 were attempted. For every combination of features, Matthew's Correlation Coefficient between the predictions and the true answers of the data-set was calculated as a measurement of predictor performance. These results can be found in figures 4.1 and 4.2. These figures exclude the alternate version of the amino acid composition vectors, where scalar products were included, as these additions did not improve performance and decreased the final MCC-value of the complete forest with all features by circa 0.0051.
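The exhaustive combination search can be outlined as below (a sketch assuming scikit-learn; X, y, base_cols, and group_cols are hypothetical placeholders for the feature matrix, labels, and column indices of the feature groups, and 5-fold cross-validation stands in for the jack-knifing actually used):

```python
from itertools import combinations

from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import matthews_corrcoef
from sklearn.model_selection import cross_val_predict

def evaluate_feature_combinations(X, y, base_cols, group_cols):
    """Train one forest per combination of feature groups (always
    including the base features) and score it with MCC."""
    results = {}
    for r in range(len(group_cols) + 1):
        for combo in combinations(range(len(group_cols)), r):
            cols = list(base_cols)
            for g in combo:
                cols += group_cols[g]
            clf = RandomForestClassifier(n_estimators=100, n_jobs=-1)
            pred = cross_val_predict(clf, X[:, cols], y, cv=5)
            results[combo] = matthews_corrcoef(y, pred)
    return results
```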

Feature importances relative to each other were also extracted from a Random Forest trained with all features, shown in figure 4.3. As target chain length seemed to stand out above the rest of the features, it was investigated whether this feature correlates with the frequency of Positives or Negatives in a protein's pool of alignments, shown in figure 4.5.
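With scikit-learn, such relative importances are available directly from a trained forest via the feature_importances_ attribute (sketch with dummy data; the 78 features match the ordering in figure 4.3):

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier

rng = np.random.default_rng(0)
X_train = rng.random((200, 78))        # dummy stand-in for the real features
y_train = rng.integers(0, 2, 200)      # dummy True/False labels

clf = RandomForestClassifier(n_estimators=100, random_state=0)
clf.fit(X_train, y_train)

# Mean-decrease-in-impurity importances, normalized to sum to 1.
for idx, imp in enumerate(clf.feature_importances_):
    print(f"feature {idx}: {imp:.4f}")
```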

4.1.2 Structural Clustering

A heat-map of the structural similarities between target protein chains in the data-set was constructed for easier visualization of how the different protein chains are structurally similar. Bi-sectional clustering produced 70 clusters in which no protein chain was structurally similar to any chain outside its cluster, as can be seen in figure 4.6.
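The defining property, that no chain is similar (TM-score ≥ 0.5) to any chain outside its own cluster, can be realized with single-linkage clustering at that threshold; the sketch below uses SciPy for this, as a stand-in for the bi-sectional clustering described in the text:

```python
import numpy as np
from scipy.cluster.hierarchy import fcluster, linkage
from scipy.spatial.distance import squareform

def cluster_by_tm_score(tm, cutoff=0.5):
    """Assign cluster labels so that any pair of chains with
    TM-score >= cutoff always ends up in the same cluster.

    tm: symmetric (n, n) matrix of pairwise TM-scores."""
    dist = 1.0 - np.asarray(tm, dtype=float)
    np.fill_diagonal(dist, 0.0)
    condensed = squareform(dist, checks=False)
    tree = linkage(condensed, method="single")
    return fcluster(tree, t=1.0 - cutoff, criterion="distance")
```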

When the Random Forest was trained on only one representative target protein from each of these 70 clusters, no significant increase in performance regarding low-scoring points was observed in comparison to training on all data, as shown below in figure 4.7; rather, there was an overall decrease in performance, as seen in figure 4.8.

4.2 Result Clustering

Different versions of the result clustering step were evaluated in comparison to each other. Firstly, different cutoff criteria for what constitutes a 'True' template interaction surface for the Random Forest were tested, shown in figures 4.9 and 4.11.



Secondly, different parameters for the extraction of the final prediction by clustering were investigated, shown in figures 4.10 and 4.11.

4.3 Evaluation and Benchmarking

The pipeline's final accuracy with four different versions of the clustering step was investigated by plotting the differences in MCC-value between different proteins, figure 4.12, and for the best version it was also investigated whether there is some relation between having structurally similar partners in the training set and the quality of prediction, figure 4.15. The correlation between the MCC-value of predictions and their PePIP score was also investigated for these four versions, figure 4.16.

In 99 of the 502 cases studied, the method scored an MCC value of 0.0 or below, even when using the best version. For these cases, it was investigated whether the prediction correlated not with the binding area of the peptide, but rather with that of another ligand, by calculating MCC values for this criterion instead. The mean of the MCCs calculated this way was 0.0334, indicating almost no correlation in most cases. However, 4 targets achieved an MCC score of 0.4 or above.

When comparing the PePIP pipeline to Pepsite2, since Pepsite2 is designed to find the key interacting residues of the target chain rather than all interacting residues, the ROC-curve constructed for the comparison does not count one residue as one data-point; instead, one binding-site is one data-point. If more than half of the predicted residues for a target chain are within the correct interaction-site, it counts as a correct prediction (true positive). Making a prediction with residues that do not fulfill this criterion results in an incorrect prediction (false positive).
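The binding-site-level criterion can be expressed compactly (a sketch; the residue identifiers are hypothetical):

```python
def classify_site_prediction(predicted_residues, true_site):
    """A predicted site is a true positive if more than half of its
    residues lie within the correct interaction-site."""
    if not predicted_residues:
        return None  # no prediction was made
    n_correct = sum(1 for r in predicted_residues if r in true_site)
    return "TP" if n_correct > len(predicted_residues) / 2 else "FP"

# Example: 3 of 4 predicted residues fall inside the true site -> "TP".
print(classify_site_prediction([5, 6, 7, 42], {4, 5, 6, 7, 8}))
```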

The methods' precision at the optimal cutoff value was also compared, as a measure of how many of the predictions of interacting residues actually are correct, while simultaneously not penalizing for making few predictions. Precision is the number of true, correctly predicted positives divided by the total number of predicted positives.
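In formula form, with TP denoting correctly predicted interacting residues and FP residues predicted as interacting that do not actually interact:

```latex
\mathrm{Precision} = \frac{TP}{TP + FP}
```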

4.4 Example

The interaction between 5izuA and 5izuB is presented as an example of how the pipeline works.

4.5 Figures



Figure 4.1: Matthew's Correlation Coefficient for all different combinations of the 7 groups of features. [Plot: MCC (0.10-0.50) versus number of feature groups (0-8).] Note that the best combination of features for every number of feature groups is always higher than the highest score for any combination of fewer feature groups.

[Figure 4.2 plots: ROC-curves, TPR versus FPR; legend: Base, AUC = 0.533; Base + group 0, AUC = 0.720; Base + group 1, AUC = 0.712; Base + group 2, AUC = 0.681; Base + group 3, AUC = 0.653; Base + group 4, AUC = 0.678; Base + group 5, AUC = 0.610; Base + group 6, AUC = 0.689.]

Figure 4.2: ROC-curves for the different feature groups used only in combination with the base features. The right figure shows the same thing as the left, but zoomed in on the region close to the origin. Base features = Target peptide length, TM-scores, and fraction of exposure of template interaction surface. Group 0 = length of partner chain and aligned chain. Group 1 = aligned length. Group 2 = RMSD for alignment. Group 3 = Template interaction surface size and mean conservation. Group 4 = Relative amino acid compositions. Group 5 = Relative secondary structure compositions. Group 6 = Contact order of template interaction surface and length of longest aligned helix.



Table 4.1: Matthew's Correlation Coefficient for results from Random Forests trained when excluding one feature group at a time. Base features = Target peptide length, TM-scores, and fraction of exposure of template interaction surface. Group 0 = length of partner chain and aligned chain. Group 1 = aligned length. Group 2 = RMSD for alignment. Group 3 = Template interaction surface size and mean conservation. Group 4 = Relative amino acid compositions. Group 5 = Relative secondary structure compositions. Group 6 = Contact order of template interaction surface and length of longest aligned helix.

Excluded Feature Group    Matthew's Correlation Coefficient
No excluded groups        0.4623
Group 0                   0.4213
Group 1                   0.4492
Group 2                   0.4544
Group 3                   0.4503
Group 4                   0.4425
Group 5                   0.4489
Group 6                   0.4496

Figure 4.3: Bar graph showing relative feature importance. Features are displayed in the following order: 0) Target peptide length, 1) TM-score normalized by target chain, 2) TM-score normalized by partner chain, 3) Length of target chain, 4) Length of partner chain, 5) Number of aligned residues, 6) RMSD of alignment, 7) Number of aligned residues of the partner chain participating in inter-chain interactions (ergo the size of the template interaction surface), 8) Mean conservation of residues in the target chain which were aligned to residues participating in inter-chain interaction, 9) Contact order of aligned residues of the target chain, 10) Longest unbroken aligned α-helix, 11-30) Relative amino acid composition of residues of the target chain aligned to residues participating in inter-chain interactions (sorted alphabetically by 1-letter name), 31-50) Relative amino acid composition of the target peptide (sorted alphabetically by 1-letter name), 51-70) Relative amino acid composition of residues in the interacting part of the chain the partner chain interacts with (sorted alphabetically by 1-letter name), 71-73) Predicted secondary structure composition of the target peptide (random coil, β-sheet, and α-helix), 74-76) Secondary structure composition of interacting residues in the chain the partner chain interacts with (random coil, β-sheet, and α-helix), and 77) Fraction of exposed residues among the residues in the target chain aligned to residues participating in inter-chain interactions.


Figure 4.4: Frequencies of how often the different amino acids occur in the interaction-interfaces of the data-set, relative to the other amino acids. The bars are colored by the kind of residue they represent: blue = hydrophobic, red = charged, purple = uncharged polar, and green = special.

[Figure 4.5 plots: "Protein size and correct answer distribution"; negative fraction versus protein size (left) and log(protein size) (right), with peptide size as a color gradient.]

Figure 4.5: Protein size plotted against fraction of alignments which represent an "incorrect" interaction surface. Peptide size included as color gradient. For easier analysis, in the right plot the logarithm of target protein size has been used.


[Figure 4.6 panels: Original (left) and Clustered (right) heat-maps, axes 0-500 (chain index).]

Figure 4.6: Heat-maps of structural similarity between target chains in the data-set. To the left, all protein chains are sorted alphabetically by their entry names in the PDB. To the right, they have been sorted according to which cluster they belong to. The color of a pixel represents the highest TM-score from the alignment of protein X and protein Y to each other; blue signifies a TM-score of less than 0.5 and red signifies a TM-score of 1.0. The largest cluster might appear unsorted, but all protein structures within it are inter-related in a network of structural motifs, so it cannot be further divided into clean clusters.

Figure 4.7: Difference in Random Forest performance between targets which have structurally similar partners in the training set and those that do not, when using all available structures as the training set or only one representative from each structural cluster. Violin-plots presenting MCC values for every target protein in the data-set. Note that the violin-plots reaching above an MCC value of 1.0 for some proteins is an artifact of the curve-smoothing performed by the plotting program; it is not possible to score an MCC value of more than 1.0.



Figure 4.8: Effects of clustering on Matthew's Correlation Coefficient for every protein. Violin-plot presenting the MCC value for every target protein in the data-set, and box-plots of the same data.

[Figure 4.9 plots: "Final predictions, different Random Forest True criteria"; ROC-curves, TPR versus FPR; legend: 20%, AUC = 0.750; 40%, AUC = 0.773; 60%, AUC = 0.787; 80%, AUC = 0.800; 100%, AUC = 0.760.]

Figure 4.9: ROC-curve showing the final performance of the method when using different limits for what constitutes a ’True’ interaction surface when training the Random Forest. The right plot focuses on a particularly interesting region, where the False Positive Rate is very low. The "No Buried" version of the clustering step was used when creating these plots.

[Figure 4.10 plots: "Final predictions, different clustering versions"; ROC-curves, TPR versus FPR; legend: Base, AUC = 0.824; No Buried, AUC = 0.800; Last Check Buried, AUC = 0.804; Top only Residue Scoring, AUC = 0.815; Top 5 % Cluster-Scoring, AUC = 0.820; All for Cluster-Scoring, AUC = 0.798; Top 1 Cluster-Scoring, AUC = 0.826; Top 10 Cluster-Scoring, AUC = 0.824; No Clustering, AUC = 0.707; Top 10 No Clustering, AUC = 0.832.]

Figure 4.10: ROC-curves showing the final performance of the method with different versions of the clustering step. The right plot shows the same as the left, but zoomed in on the interesting region of low False Positive Rate.


[Figure 4.11 plots: Matthew's Correlation Coefficient versus cutoff value (0.0-1.0). Left panel, "Matthew's Correlation Coefficient for different Random Forest True criteria", legend: 20%, 40%, 60%, 80%, 100%. Right panel, "Matthew's Correlation Coefficient for different clustering versions", legend: Base; No Buried; Last Check Buried; Top only Residue Scoring; Top 5 % Cluster-Scoring; All for Cluster-Scoring; Top 1 Cluster-Scoring; Top 10 Cluster-Scoring; No Clustering; Top 10 No Clustering.]

Figure 4.11: Matthew's Correlation Coefficient for different cutoffs, using different 'True' criteria for the Random Forest (left) and different versions of the clustering step (right).

Figure 4.12: Violin-plots over the different MCC-values for the target protein-peptide pairs, with different clustering methods. Note how the pipeline often either scores relatively high or low, but seldom around 0.1 to 0.4.

[Figure 4.13 plot: "Alignment statistics versus performance"; MCC versus number of alignments, colored by the Random Forest score for the best prediction.]

Figure 4.13: Number of relevant alignments found for the different targets in the alignment step, plotted against the MCC of their final predictions and colored by the highest certainty score awarded by the Random Forest to an alignment of that target. Two targets with significantly larger numbers of alignments were omitted: 4iga with 24178 alignments and an MCC of 0.35, and 2fmk with 27074 alignments and an MCC of 0.62.



[Figure 4.14 plot: "Structure and Performance"; MCC versus fraction random coil (0.0-1.0).]

Figure 4.14: The fraction of random coil secondary structure of the target chains plotted against the MCCs of their final predictions.

Figure 4.15: Violin-plot over the MCC values of the final predictions for different proteins. In this plot, the predictions have been divided into two sets based on whether their target chains are structurally similar to other target chains in the data-set or not.

References
