
Protein contact prediction based on the Tiramisu deep learning architecture

NIKOS TSARDAKAS RENHULDT

KTH ROYAL INSTITUTE OF TECHNOLOGY


Master in Machine Learning
Date: June 27, 2018

Supervisors: Arne Elofsson, Johan Gustavsson
Examiner: Danica Kragic

Swedish title: Prediktion av proteinkontakter med djupinlärningsarkitekturen Tiramisu


Abstract

Experimentally determining protein structure is a hard problem, with applications in both medicine and industry. Predicting protein structure is also difficult. Predicted contacts between residues within a protein are helpful during protein structure prediction. Recent state-of-the-art models have used deep learning to improve protein contact prediction.

This thesis presents a new deep learning model for protein contact prediction, TiramiProt. It is based on the Tiramisu deep learning architecture, and trained and evaluated on the same data as the PconsC4 protein contact prediction model. In total, 228 models using different combinations of hyperparameters were trained until convergence.

The final TiramiProt model performs on par with two current state-of-the-art protein contact prediction models, PconsC4 and RaptorX-Contact, across a range of different metrics. A Python package and a Singularity container for running TiramiProt are available at gitlab.com/nikos.t.renhuldt/TiramiProt.


Sammanfattning

Determining the structure of proteins has applications in both medicine and industry. Both experimental determination of protein structure and prediction of it are difficult. Predicted contacts between different parts of a protein facilitate protein structure prediction. In recent years, deep learning has been used to build better models for contact prediction.

This thesis describes a new deep learning model for prediction of protein contacts, TiramiProt. The model builds on the Tiramisu deep learning architecture. TiramiProt is trained and evaluated on the same data as the contact prediction model PconsC4. In total, models with 228 different hyperparameter combinations were trained to convergence.

Measured across a number of different metrics, the final TiramiProt model performs on par with the state-of-the-art models PconsC4 and RaptorX-Contact. TiramiProt is available as a Python package and a Singularity container via gitlab.com/nikos.t.renhuldt/TiramiProt.


Acknowledgements

I would like to thank my supervisor, professor Arne Elofsson, as well as David Menéndez Hurtado, Mirco Michel, and the rest of the Elofsson lab for their help during this project. Their expertise and patience have been invaluable. This project would not have happened without them. I would also like to extend thanks to Johan Gustavsson for his supervision and valuable feedback. His input has done a lot to make this thesis more accessible. Matilda Berkell has done a terrific job of proofreading this thesis, and any remaining errors are wholly my own. A special thanks to my family. Their support has made these past years of study possible.


Contents

1 Introduction

2 Background
   2.1 Convolutional neural networks
      2.1.1 ResNet
      2.1.2 DenseNet
      2.1.3 U-net
      2.1.4 Tiramisu, the fully convolutional DenseNet
      2.1.5 Hyperparameter optimization using hyperopt
   2.2 Protein contact prediction
      2.2.1 Direct-coupling analysis
      2.2.2 PconsC4
      2.2.3 RaptorX-Contact
      2.2.4 Performance measures
      2.2.5 Ethics and sustainability
   2.3 Summary of previous work

3 Method
   3.1 Datasets
   3.2 The TiramiProt model
   3.3 Evaluation of results

4 Results
   4.1 Hyperparameter optimization
   4.2 Model performance

5 Discussion
   5.1 Hyperparameters and architecture
   5.2 Model performance
   5.3 Capturing model quality
   5.4 Summary

6 Conclusion

Bibliography

A Sequences
   A.1 CASP12
   A.2 PconsC3 benchmark dataset

B Model hyperparameters


1 Introduction

Proteins are molecules fundamental to life. The three-dimensional (3D) structure of the protein molecule determines its function, making determination of this structure an important field of research. This research has applications ranging from drug design to the engineering of enzymes for use in industry. Experimentally determining protein structures is a challenging and time-consuming problem. Being able to predict protein structure in silico – itself also a difficult problem – holds the promise of reducing the time necessary to elucidate protein structure from months or years to minutes. Protein contact prediction, i.e. predicting which parts of the protein are in contact, helps in protein structure prediction by limiting the possible shapes of the protein. Direct-coupling analysis (DCA) has recently contributed to large improvements in contact predictions, but much work remains [see e.g. 1]–[3].

As in many other research fields, recent state-of-the-art results within protein contact prediction have been produced using deep learning [4]–[6]. Two such state-of-the-art models, PconsC4 [5] and RaptorX-Contact [6], have drawn on deep learning results from computer vision to produce improved contact predictions by treating contact prediction as a pixel-wise classification problem. As the field of deep learning is rapidly moving forward, new techniques have been released since these results were published. The Tiramisu architecture has previously shown impressive results in image segmentation [7]. The objective of this thesis is to explore the Tiramisu deep learning architecture as a tool for improving current models for contact prediction. To this end, we pose the following research question:


Does the Tiramisu architecture improve contact predictions when compared to the results produced by the current state-of-the-art models PconsC4 and RaptorX-Contact?


2 Background

Protein structure is often described as having four levels, primary up to quaternary. An overview of the levels of protein structure is shown in figure 2.1. The building blocks of proteins are 20 amino acids. Each of these amino acids has a common structure that is capable of binding to other amino acids, forming the linear backbone of the protein, and a unique side chain (or lack of side chain) giving the amino acid distinct chemical properties. The linear sequence of amino acids in a protein is called the primary structure. The primary structure is generally known by the time one is interested in determining protein structure, e.g. based on available DNA or mRNA sequence data. Protein secondary structure relates to how nearby amino acids interact to form α-helices or β-sheets. Tertiary structure deals with how the protein folds to get an overall shape, while quaternary structure deals with how different protein chains come together to form a functional protein. Protein structure is experimentally elucidated using methods such as X-ray crystallography, cryo-electron microscopy, or nuclear magnetic resonance spectroscopy, but this remains challenging, in particular for large or membrane-bound proteins. The contact prediction dealt with in this thesis is used for predicting tertiary structure. The produced contact maps serve as input to tools producing full tertiary structures in 3D. The contact maps help tertiary structure prediction by providing restraints on possible structures.

The remainder of this chapter first deals with convolutional neural networks (CNNs), as an understanding of these is necessary to grasp the recent state-of-the-art models described in the latter part of this chapter. The section on CNNs goes through some architectures with roots in computer vision, two of which have been used in previous contact prediction models (ResNet [9], u-net [10]), and two architectures related to the model built in this thesis (DenseNet [11], Tiramisu [7]). The section on contact prediction gives further background on the problem, and goes through DCA, which has been fundamental to recent advances in contact prediction [3], [12], [13], as well as two current state-of-the-art models (PconsC4 [5], RaptorX-Contact [6]).

2.1 Convolutional neural networks

CNNs have been used to produce state-of-the-art results in various fields, including tasks relating to images, audio, and text translation, as well as protein contact prediction [5], [6], [11], [14], [15]. Characteristic for these models is the use of convolutions. The following paragraphs briefly touch on some architectures relevant to this thesis: ResNet, used in RaptorX-Contact [6], [9]; DenseNet, used as part of the Tiramisu architecture [7], [11]; and u-net, used in PconsC4 [5], [10].

2.1.1 ResNet

The ResNet architecture enables the training of very deep models through the introduction of skip connections (shortcut connections) [9]. The network consists of convolutions followed by batch normalization (BN) [16] and ReLU activations [17]. What sets ResNet apart from previous architectures is the idea of learning a residual mapping: instead of learning a new meaningful mapping from scratch, only the difference from the ingoing mapping is learnt at each layer, making learning easier. This is achieved by adding the input of a layer to the output of the following layer, as shown in figure 2.2a. The network architecture also helps to avoid vanishing gradient issues [18]. ResNets and derived architectures continue to produce impressive results on image classification tasks [19].
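As a concrete illustration of the residual idea, the sketch below shows a minimal residual block in PyTorch. The layer sizes and the exact ordering of operations are illustrative only and are not taken from RaptorX-Contact or any other model discussed in this thesis.

```python
import torch.nn as nn

class ResidualBlock(nn.Module):
    """A minimal residual block: two conv-BN layers whose output is added to
    the block input (the skip connection), as illustrated in figure 2.2a."""

    def __init__(self, channels):
        super().__init__()
        self.conv1 = nn.Conv2d(channels, channels, kernel_size=3, padding=1)
        self.bn1 = nn.BatchNorm2d(channels)
        self.conv2 = nn.Conv2d(channels, channels, kernel_size=3, padding=1)
        self.bn2 = nn.BatchNorm2d(channels)
        self.relu = nn.ReLU(inplace=True)

    def forward(self, x):
        out = self.relu(self.bn1(self.conv1(x)))
        out = self.bn2(self.conv2(out))
        out = out + x  # only the difference from the input needs to be learnt
        return self.relu(out)
```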

2.1.2 DenseNet


Within a dense block, the output of each layer is concatenated to the input of all subsequent layers [11]. Each layer consists of BN, followed by ReLU and convolutions. Between dense blocks of BN-ReLU-conv, convolutions and average pooling are used for downsampling. DenseNets have been shown to promote feature reuse within dense blocks [11]. The architecture outperforms plain ResNets, and produces state-of-the-art results on some image classification tasks [20].
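A minimal sketch of such a dense block, with BN-ReLU-conv layers whose outputs are concatenated to the inputs of all subsequent layers, might look as follows in PyTorch; the growth rate and number of layers here are placeholders.

```python
import torch
import torch.nn as nn

class DenseBlock(nn.Module):
    """A minimal dense block: each BN-ReLU-conv layer sees the concatenation
    of the block input and all previous layer outputs (figure 2.2b)."""

    def __init__(self, in_channels, growth_rate=12, n_layers=4):
        super().__init__()
        self.layers = nn.ModuleList()
        channels = in_channels
        for _ in range(n_layers):
            self.layers.append(nn.Sequential(
                nn.BatchNorm2d(channels),
                nn.ReLU(inplace=True),
                nn.Conv2d(channels, growth_rate, kernel_size=3, padding=1),
            ))
            channels += growth_rate  # each layer adds growth_rate feature maps

    def forward(self, x):
        features = [x]
        for layer in self.layers:
            features.append(layer(torch.cat(features, dim=1)))
        return torch.cat(features, dim=1)
```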


Figure 2.2: a) Example of a residual block. Note the skip connection, adding the input of the first layer to the output of the second [9]. b) Example of a dense block. The output of one layer is concatenated to the input of all subsequent layers. Image taken from [11] and used with permission from the author.

2.1.3 U-net


The u-net architecture first downsamples its input and then upsamples it again, concatenating downsampled feature maps to upsampled feature maps of the same shape [10]. It was developed for biomedical image segmentation tasks, providing good results with very small amounts of labeled data, and without pre-training [10].

Figure 2.3: The u-net architecture. Note how the input is first downsampled and then upsampled, and the concatenation of downsampled feature maps to upsampled feature maps of the same shape. Figure taken from [10] with permission from the author.

2.1.4 Tiramisu, the fully convolutional DenseNet


The Tiramisu architecture combines the u-net structure with dense blocks, replacing the plain convolutions of u-net with DenseNet-style dense blocks [7]. The authors report state-of-the-art results on the semantic segmentation task they use as benchmark [7].

Figure 2.4: The Tiramisu architecture. Note the similarity to the u-net architecture, and the use of dense blocks rather than plain convolutions. ‘Concat’ corresponds to concatenation of feature maps [7].

2.1.5 Hyperparameter optimization using hyperopt


Hyperopt is a Python library for hyperparameter optimization [22]. It implements random search as well as the Tree-structured Parzen Estimator (TPE) algorithm [23]. Given a user-defined objective function and search space, hyperopt evaluates the objective function, and suggests new hyperparameters as the results of previous evaluations are made available. In the case of TPE, new hyperparameter values are proposed based on how previous hyperparameters have performed. Hyperparameters are initially generated from probability distributions specified by the user. This setup allows optimization with little user interaction, making it easy to interface with the SLURM job scheduling system used by the HPC2N Kebnekaise cluster. The Kebnekaise cluster was used during training of the model presented in this thesis.
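A minimal sketch of this workflow is shown below. The search space and the objective function are placeholders (train_and_validate is an assumed user-defined function), not the actual TiramiProt setup, which is described in chapter 3.

```python
from hyperopt import STATUS_OK, fmin, hp, tpe

# The user specifies the initial probability distributions for each hyperparameter.
space = {
    "learning_rate": hp.loguniform("learning_rate", -9, -5),  # log-uniform
    "dropout_rate": hp.uniform("dropout_rate", 0.0, 0.3),     # uniform
    "batch_size": hp.quniform("batch_size", 1, 5, 1),         # discrete uniform
}

def objective(params):
    # train_and_validate is assumed to train a model with these
    # hyperparameters and return its validation loss.
    val_loss = train_and_validate(params)
    return {"loss": val_loss, "status": STATUS_OK}

# TPE proposes new hyperparameter values based on how previous ones performed.
best = fmin(fn=objective, space=space, algo=tpe.suggest, max_evals=100)
print(best)
```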

2.2 Protein contact prediction

Protein contact prediction aims at predicting which residues of a protein are in contact. This information is helpful in determining protein structure. The Critical Assessment of protein Structure Prediction (CASP) experiments assess different models for protein contact prediction, and produce rankings of state-of-the-art contact prediction models [3]. By CASP rules, contact is defined as a pair of residues with Cβ* within 8 Å of each other. Figure 2.5 shows a common way of representing protein structure – the ribbon diagram – as well as a contact map for the same protein. The contact map in figure 2.5b shows which residues are within 8 Å of each other, where one (black) represents contact and zero (white) represents no contact.

* The Cα atom is used for glycine, which has no Cβ atom.
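Computing a contact map under this definition is straightforward once per-residue coordinates are available; a minimal sketch, assuming one representative Cβ coordinate per residue, is shown below.

```python
import numpy as np

def contact_map(cb_coords, cutoff=8.0):
    """Binary contact map from Cβ coordinates.

    cb_coords: (L, 3) array with one Cβ coordinate per residue (Cα for glycine).
    Returns an (L, L) boolean matrix that is True where two residues are
    within `cutoff` Å of each other, as in figure 2.5b.
    """
    diff = cb_coords[:, None, :] - cb_coords[None, :, :]  # pairwise differences
    distances = np.sqrt((diff ** 2).sum(axis=-1))         # (L, L) distance matrix
    return distances < cutoff
```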

In general, the input to contact prediction models tends to consist of features related to the protein sequence (primary structure), typically the sequence itself, as well as features based on related protein sequences, e.g. features capturing information about which residues tend to change together. Table 2.1 provides an overview of some different types of input.

Related protein sequences are often represented as a multiple sequence alignment (MSA), aligning these related sequences such that conserved (similar) parts are aligned with each other. A visualization of a truncated MSA is shown in figure 2.6. Features based on MSAs provide information not available in the sequence of the protein alone. Since the sequences making up the MSA often have a similar function in the cell, their overall structure often needs to be similar to preserve interactions with other molecules. A change in one part of the protein may in such a case require a change in adjacent parts of the protein to accommodate the new amino acid while retaining the overall function of the protein. High-quality MSAs make a big difference in producing quality contact predictions [3].


Figure 2.5: Two representations of the protein with PDB ID 1A1X [24]. a) Ribbon diagram representation. b) Contact map representation of one of the 1A1X chains. Black corresponds to contact, white to no contact. Note e.g. how some of the first residues in a) seem to be located close to residues much later in the sequence, and compare this to the contacts shown in the top-left corner of the contact map in b).

Table 2.1: An overview of different types of inputs to contact prediction models. For amino acid i, p_i corresponds to the frequency of i at a position, and p̄_i corresponds to the average frequency of i in the MSA.

Input                Description
Protein sequence     Protein primary structure as a one-hot encoding.
Sequence profile     Probability distribution over amino acids for each position of the protein sequence.
Self-information     I_i = −log(p_i/p̄_i) [25]
Partial entropy      S_i = −p_i log(p_i/p̄_i) [25]
DCA                  See section 2.2.1.
Mutual information   Measure of how much knowledge of one residue tells us about another residue.
Cross entropy        Measure of entropy when modelling one residue based on another.
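To make the two MSA-derived features in table 2.1 concrete, the sketch below computes per-position self-information and partial entropy from a toy alignment. The gap handling and the small pseudocount are simplifications, not the exact procedure used by the sequence network in [25].

```python
import numpy as np

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"

def msa_position_features(msa, eps=1e-9):
    """Per-position self-information and partial entropy from an MSA.

    msa: list of equal-length strings, one per aligned sequence.
    Returns two (L, 20) arrays following the definitions in table 2.1:
    I_i = -log(p_i / p_bar_i) and S_i = -p_i * log(p_i / p_bar_i).
    """
    columns = np.array([list(row) for row in msa])   # (N, L) characters
    length = columns.shape[1]
    p = np.zeros((length, len(AMINO_ACIDS)))
    for j, aa in enumerate(AMINO_ACIDS):
        p[:, j] = (columns == aa).mean(axis=0)       # per-position frequency
    p_bar = p.mean(axis=0, keepdims=True)            # average frequency in the MSA
    log_ratio = np.log((p + eps) / (p_bar + eps))    # eps acts as a small pseudocount
    return -log_ratio, -p * log_ratio

self_info, partial_entropy = msa_position_features(["MKV-A", "MRVLA", "MKVLA"])
```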



Figure 2.6: A visualization of a part of an MSA. Each row corresponds to a protein sequence. Amino acids are represented by letters. Gaps in the alignment – indicating insertions or deletions – are represented by a dash. In this particular visualization the amino acids are color coded by side chain properties, with e.g. hydrophobic side chains indicated with blue, negatively charged side chains in purple, and so on.

The output of the contact prediction models is a square matrix with element i, j corresponding to the probability of residues i, j being in contact. This means that the output is a matrix of identical shape to the matrix shown in figure 2.5b, with probability of contact instead of binary values in each position of the matrix.


Two current state-of-the-art models are described in the following sections: RaptorX-Contact, which took part in CASP12 [6], and PconsC4, a model that did not compete in CASP12 [5]. As PconsC4 is developed by the Elofsson lab, i.e. the lab hosting this thesis project, the author has access to training, validation, and test sets suitable for a closer comparison of the performance of the model architectures. Finally, common performance measures are introduced.

2.2.1 Direct-coupling analysis

DCA has provided a significant improvement in protein contact predictions [3]. As a residue in a protein mutates, the residues that are in contact with it also tend to change. Sometimes this is a matter of residue A, in contact with residue B, changing into A', encouraging B to change into B'. At other times A may be in contact with both residue B and C, and a change to A' encourages changes of B and C into B' and C' respectively. In the latter case, the relation between B and C may seem stronger, when the residues are in fact only related due to their contact with residue A. DCA attempts to disentangle these relationships, and tries to elucidate which residues are coupled directly and which are only indirectly coupled [12]. Different approaches have been taken to solve this problem [1], [13], but the fundamental idea of solving a global optimization problem – i.e. taking all correlations into account instead of looking only at pairwise correlations – has helped improve contact predictions.

2.2.2 PconsC4


PconsC4 is built on the u-net architecture [5], [10]. It takes the features produced by the pre-trained sequence network shown in figure 2.7, APC-corrected and normalized APC-corrected mutual information, cross entropy, and output from GaussDCA [13] as inputs – see table 2.1 for an overview of what these inputs capture. The network is trained on a combination of binary distance cut-offs, and regression of the S-score [26]. The S-score is a distance measure defined as

S_{ij} = \frac{1}{1 + (d_{ij}/d_0)^2}    (2.1)

where d_{ij} is the distance between residue i and residue j, and d_0 is a reference distance determining how fast the S-score decreases with increasing d_{ij}. The S-score is inversely related to d_{ij}, taking the value 1 when there is no distance between the residues, the value 0.5 when d_{ij} = d_0, and tending towards zero as the distance between the residues tends towards infinity. Approximately 2 800 proteins were used in training PconsC4, with an additional 100 proteins used as a validation set during training. PconsC4 is made available as a Python package, and makes predictions based on a provided MSA.
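A small sketch of equation 2.1 in Python is given below, mainly to show how the S-score behaves at the distances mentioned above.

```python
import numpy as np

def s_score(distances, d0=8.0):
    """S-score from equation 2.1: 1 / (1 + (d_ij / d0)^2).

    distances: residue-residue distances in Ångström.
    d0: reference distance; TiramiProt and PconsC4 use d0 = 8 Å.
    """
    distances = np.asarray(distances, dtype=float)
    return 1.0 / (1.0 + (distances / d0) ** 2)

# 1.0 at zero distance, 0.5 at d0, and close to 0 far away.
print(s_score([0.0, 8.0, 80.0]))  # [1.0, 0.5, 0.0099...]
```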

Figure 2.7: A visualization of the sequence network. PconsC4 takes the final ResNet module with 128 filters as one part of its input. Figure adapted from [25] with the author's permission.

2.2.3 RaptorX-Contact


RaptorX-Contact [6] takes a sequence profile, predicted secondary structure, and predicted solvent accessibility as 1D input, and runs this through a ResNet. These sequential features are converted to pairwise features and concatenated to other, inherently pairwise, features (DCA, mutual information, pairwise potential). Table 2.1 explains what these input features are meant to capture. The concatenated inputs are put through another ResNet, yielding a contact map prediction. Training is done end-to-end. The network is illustrated in figure 2.8. The full RaptorX-Contact model is an ensemble of several models of this architecture. Approximately 6 000 proteins were used to train these models, with an additional 700 proteins used for validation. RaptorX-Contact is publicly available as a web server, and makes predictions based on a provided sequence or MSA.
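The step that converts sequential features to pairwise features can be done by an outer concatenation over all residue pairs. The sketch below shows one common variant of this operation; it is illustrative and not necessarily the exact scheme used by RaptorX-Contact.

```python
import torch

def sequential_to_pairwise(seq_features):
    """Turn per-residue features of shape (L, C) into pairwise features of
    shape (L, L, 3C) by concatenating, for every pair (i, j), the features of
    residue i, the features of residue j, and their element-wise mean."""
    L, C = seq_features.shape
    fi = seq_features.unsqueeze(1).expand(L, L, C)  # features of residue i
    fj = seq_features.unsqueeze(0).expand(L, L, C)  # features of residue j
    return torch.cat([fi, fj, (fi + fj) / 2], dim=-1)
```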

Figure 2.8: A visualization of the RaptorX-Contact model. Figure taken from [6], licensed as CC BY 4.0.

2.2.4 Performance measures

Table 2.2 provides an overview of the performance measures presented here. This section provides some background and intuition regarding these measures.


The positive predictive value (PPV), or precision, of the top-ranked predicted contacts is a commonly reported measure of performance. It is reported in numerous papers, including CASP12 [see e.g. 3], [6], [27]. In CASP12, the F1 score, the area under the precision-recall curve, and the entropy score – a measure of how close to each other the predicted contacts are – are reported together with some additional measures. The number of top contacts these measures are based on is commonly related to the length of the protein (L), such that e.g. the top L/10, L/5, L/2 and L contacts are evaluated. Relating the number of evaluated contacts to the protein length makes sense as the number of actual contacts is expected to grow with the protein length. PPV may also be presented separately for short-, medium-, and long-range contacts.

Table 2.2: Some common performance measures for contact prediction models, what they measure, and how they are calculated. AUC PR: area under precision-recall curve; TP: true positives; FP: false positives; TN: true negatives; FN: false negatives.

Name             Measures                                                       Formula
PPV, precision   Proportion of predicted positives that are true positives      TP / (TP + FP)
Recall           Proportion of total positives predicted as positives           TP / (TP + FN)
F1 score         Harmonic average of recall and precision                       2 · (precision · recall) / (precision + recall)
AUC PR           Classifier performance accounting for contact probabilities;   –
                 possible to estimate with the average precision score
Entropy score    Measure of separation between predicted contacts               –

2.2.5 Ethics and sustainability

All protein structures used for training are taken from the publicly available Protein Data Bank (PDB) [28]. No data traceable to any individual is used. Other ethical aspects are shared with the rest of biotechnology and science in general: any progress made may pose a problem if used with malicious intent. Malicious use of this technology would require significant resources and expertise.


Improved protein structure prediction may open things up in different areas of biotechnology and medicine, and may help speed up biological research in general, giving us e.g. improved modeling of the multitude of biological systems in which proteins play a key role. Two of the likely applications are modeling of potential drug targets, and design of enzymes to replace hazardous chemicals currently used in industrial processes. Both of these may contribute to sustainable development, with better drugs corresponding to the goal "Good Health and Wellbeing" (goal three) of the Sustainable Development Goals (SDGs). Replacing hazardous chemicals with enzymes is one aspect of what is sometimes referred to as "green chemistry", and corresponds to the goal "Responsible Consumption and Production" (goal 12) of the SDGs.

These results may benefit society as a whole, but a lot of work remains for protein contact prediction to become reliable. This thesis in particular may be of interest to others within the field of protein contact prediction, as well as machine learning researchers interested in how deep learning architectures perform across different application domains.

2.3 Summary of previous work


3 Method

3.1 Datasets

The datasets used were the same as those used for training, validation, and testing of PconsC4 [5]. In total, 2 891 proteins were used during training, with 100 of these used for validation. The protein structures had a minimum resolution of 2.0 Å, and a maximum R-factor* of 0.3. The sequences had no more than 20% overlap. To avoid overlap between the training and PconsC3 benchmark datasets, the training set was filtered to exclude the ECOD H-groups† [29] present in the benchmark set. PISCES [30] was used for culling the protein sequences from the PDB [28] on 2017-09-14, but all sequences dating after 2016-05-01 were excluded to avoid overlap with the CASP12 test set.

The test sets consisted of the 210 protein sequences originally used as the test set for PconsC3 [2], as well as 40 sequences with published structures from the CASP12 dataset. A full list of sequences is available in appendix A. The 40 structures from the CASP12 dataset are those that were released by CASP in December 2016, and include no sequences whose structures have been published to the Protein Data Bank since. The choice to exclude CASP12 sequences that have been released by parties other than CASP is based mainly on reproducibility. The protein structures released since may contain minor differences in sequence, as well as multiple models and chains per structure.

* The R-factor is a measure of the discrepancy between experimental structure data and the proposed structure model. R-factors closer to zero imply smaller discrepancy.

† ECOD H-groups consist of proteins with similar sequences. Excluding H-groups present in the benchmark set from the training set ensures that the training and test sets are independent.


Using structures released by CASP provides some measure of reproducibility, and avoids the issue of producing contact maps from this more complex data. The test datasets are kept separate to enable comparison to other published results using these datasets. Since RaptorX-Contact is trained on data released after the release of the PconsC3 benchmark dataset, the separation of datasets also ensures that at least one test set has data that is not present in the RaptorX-Contact training set.

The MSAs were constructed as for PconsC3 and PconsC4, i.e. using HHblits [31]. The uniprot20 database dated 2016-02-26 was used. HHblits was run with an e-value cutoff of 1, the parameter -all, and the parameters -realign_max and -maxfilt set to 999 999.
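For reference, the corresponding HHblits call would look roughly like the sketch below; the query, database, and output file names are placeholders, and only the parameter values are taken from the description above.

```python
import subprocess

subprocess.run([
    "hhblits",
    "-i", "query.fasta",        # query sequence (placeholder name)
    "-d", "uniprot20_2016_02",  # uniprot20 database dated 2016-02-26 (placeholder path)
    "-oa3m", "query.a3m",       # output alignment (placeholder name)
    "-e", "1",                  # e-value cutoff of 1
    "-all",
    "-realign_max", "999999",
    "-maxfilt", "999999",
], check=True)
```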

3.2 The TiramiProt model

The model used was a PyTorch [32] implementation of Tiramisu. The network was modified to take the same inputs as PconsC4 [5] – figure 3.1 gives an overview of the model inputs and how they are generated, and table 2.1 provides some explanation of what these features are intended to capture. The inputs to the model consist of the features generated by the pre-trained sequence network shown in figure 2.7 (originally presented in [25]), as well as cross entropy, the GaussDCA score [13], APC-corrected mutual information [33], and normalized APC-corrected mutual information [34]. The pre-trained sequence network takes the protein sequence, self-information, and partial entropy as input [25]. The TiramiProt model outputs were S-scores with d0 set to 8 Å, and contact probabilities for the distance cutoffs 6 Å, 8 Å and 10 Å, identical to the outputs of PconsC4 [5].

An L1 loss was used for the S-score output, and a binary cross-entropy loss was used for the binary outputs. The loss from the binary outputs was averaged across them, and added to the S-score loss. This is identical to what was done in PconsC4.
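A minimal sketch of this combined loss in PyTorch is shown below; the tensor names and shapes are assumptions for illustration, not the actual TiramiProt code.

```python
import torch.nn.functional as F

def tiramiprot_loss(pred_s, pred_contacts, target_s, target_contacts):
    """L1 loss on the S-score map plus the averaged binary cross-entropy over
    the 6, 8, and 10 Å contact maps, added together as described above.

    pred_contacts / target_contacts: tensors of shape (batch, 3, L, L), one
    channel per distance cutoff, with predictions given as probabilities.
    """
    s_score_loss = F.l1_loss(pred_s, target_s)
    # With equally sized maps, averaging over all elements equals the mean of
    # the three per-cutoff binary cross-entropy losses.
    bce_loss = F.binary_cross_entropy(pred_contacts, target_contacts.float())
    return s_score_loss + bce_loss
```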


Figure 3.1: Overview of the TiramiProt inputs and how they are generated. The protein sequence and a multiple sequence alignment, built from a database of protein sequences, give a sequence one-hot encoding, self-information, and partial entropy. These are fed through the pre-trained sequence network, producing 128 learned L × L feature maps. Together with the pairwise features – GaussDCA, APC-corrected mutual information, normalized APC-corrected mutual information, and cross entropy – they make up the 132 L × L matrices used as TiramiProt input.


Figure 3.2: Overview of model training and hyperparameter optimization. Hyperopt generates hyperparameters, each model is trained for 24 hours or until convergence, and the model is evaluated based on its S-score loss on the validation set. From the 132 L × L input matrices, the Tiramisu network produces the TiramiProt output: an S-score prediction and binary distance predictions at 6 Å, 8 Å, and 10 Å.


Table 3.1: Hyperparameter intervals used during the initial random search. LR: learning rate; LRP: learning rate patience; WD: weight decay; DR: dropout rate; BS: batch size; FFC: filters in the first convolution layer; NP: number of pooling layers; LPB: layers per dense block; GR: growth rate; LU: log-uniform; DU: discrete uniform; U: uniform.

              LR      LRP   WD       DR    BS   FFC   NP   LPB   GR
Distribution  LU      DU    LU       U     DU   DU    DU   DU    –
Lower limit   10^−4   3     10^−12   0     1    10    1    2     1
Upper limit   10^−2   17    10^−4    0.3   5    132   4    22    799/(NP · FFC)

The learning rate, learning rate patience, weight decay, batch size, number of filters in the first convolution layer, number of pooling layers, layers per dense block, growth rate of the dense blocks, and the dropout rate were all tuned using hyperopt. The distributions used for each hyperparameter are shown in table 3.1. Training of 600 models was attempted using hyperopt, with 120 of these being initial exploration using random search and the remaining 480 relying on TPE to generate new hyperparameters. S-score loss on the validation set was used to determine model quality. Each model was trained for 24 hours on a K80 GPU, or until the S-score validation loss had not improved for 100 epochs. The weights corresponding to the lowest S-score validation loss were saved for each model.
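A minimal sketch of the stopping rule described above is given below; train_one_epoch and validate are assumed user-supplied functions, and the model is assumed to be a PyTorch module.

```python
import copy
import time

def train_until_converged(model, train_one_epoch, validate,
                          patience=100, max_seconds=24 * 3600):
    """Train for at most 24 hours, stop once the validation S-score loss has
    not improved for `patience` epochs, and keep the best weights."""
    best_loss = float("inf")
    best_state = copy.deepcopy(model.state_dict())
    epochs_without_improvement = 0
    start = time.time()
    while time.time() - start < max_seconds:
        train_one_epoch(model)
        val_loss = validate(model)              # validation S-score loss
        if val_loss < best_loss:
            best_loss = val_loss
            best_state = copy.deepcopy(model.state_dict())
            epochs_without_improvement = 0
        else:
            epochs_without_improvement += 1
        if epochs_without_improvement >= patience:
            break
    model.load_state_dict(best_state)           # restore the best weights
    return model, best_loss
```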

3.3 Evaluation of results

The TiramiProt output for 8 Å contact probability was used as the contact map during prediction. Results of the finished model were evaluated using precision-recall curves, average precision score, and F1 score across all predicted contacts. The commonly reported measure of PPV for the top L (L: protein length) contacts was also calculated, with the reported measure being the PPV averaged across all proteins in each dataset. Long-range contacts, i.e. contacts with an index difference of ≥ 24, were evaluated using precision-recall curves and average precision score. Residue pairs with an index difference of 5 or less were excluded from evaluation.
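As an illustration of the top-L PPV measure and the exclusion of short-range pairs, a minimal sketch is given below; it is not the evaluation code used for the results in chapter 4.

```python
import numpy as np

def top_l_ppv(pred_probs, true_contacts, min_separation=6):
    """PPV of the top-L predicted contacts (L = protein length), excluding
    residue pairs with an index difference of five or less.

    pred_probs:    (L, L) array of predicted contact probabilities.
    true_contacts: (L, L) boolean array of true contacts.
    """
    L = pred_probs.shape[0]
    i, j = np.triu_indices(L, k=min_separation)  # pairs with |i - j| >= 6
    order = np.argsort(pred_probs[i, j])[::-1]   # rank pairs by predicted probability
    top = order[:L]                              # keep the L highest-ranked pairs
    return float(true_contacts[i[top], j[top]].mean())
```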


4 Results

A Python package and a Singularity container for running the final TiramiProt model are available at gitlab.com/nikos.t.renhuldt/TiramiProt. This chapter first goes through the model resulting from the hyperparameter optimization outlined in the previous chapter, followed by the results generated by this model.

4.1 Hyperparameter optimization

In total, 228 out of 600 attempted hyperparameter combinations resulted in models fitting on the GPU. The approximate hyperparameters of the model with the smallest loss are shown in table 4.1, with the exact hyperparameters included in appendix B.

Table 4.1: Approximate hyperparameters of the model with the lowest validation loss. LR: learning rate; LRP: learning rate patience; WD: weight decay; DR: dropout rate; BS: batch size; FFC: filters in the first convolution layer; NP: number of pooling layers; LPB: layers per dense block; GR: growth rate. Precise hyperparameters are shown in appendix B.

LR            LRP   WD            DR      BS   FFC   NP   LPB   GR
9.9 × 10^−4   11    7.1 × 10^−9   0.047   4    29    4    8     10


4.2 Model performance

F1 scores and PPV for each model are shown in tables 4.2 and 4.3 respectively. TiramiProt refers to the model developed as a part of this thesis – see section 4.1 for the hyperparameters used. Note the large difference in performance between RaptorX-Contact and the other two models on the PconsC3 benchmark dataset.

Table 4.2: F1 score of all contacts. Residue pairs with an index difference of five or less are excluded from evaluation.

Dataset TiramiProt PconsC4 RaptorX-Contact

CASP12 0.427 0.382 0.477

PconsC3 benchmark 0.444 0.406 0.602

Table 4.3: PPV of the top L contacts. Residue pairs with an index difference of five or less are excluded from evaluation.

Dataset TiramiProt PconsC4 RaptorX-Contact

CASP12 0.561 0.538 0.573

PconsC3 benchmark 0.624 0.639 0.764

Precision-recall curves for each model are shown in figures 4.1 and 4.2 for CASP12 and the PconsC3 benchmark dataset respectively. Average precision (AP) scores are displayed in the figure legends. Figures 4.1a and 4.2a show the precision-recall curves across all contacts for CASP12 and the PconsC3 benchmark dataset respectively, while figures 4.1b and 4.2b show the curves for long-range contacts for the respective datasets.

F1 score histograms for each model, as well as scatter plots of F1 scores across the models, are shown in figures 4.3 and 4.4 for the CASP12 dataset and PconsC3 benchmark dataset respectively. Note the high correlation between TiramiProt and PconsC4.

Figures 4.5 and 4.6 compare contact maps of the PconsC3 benchmark set proteins with the largest difference in F1 score between TiramiProt and RaptorX-Contact, and between TiramiProt and PconsC4, respectively. Figures 4.7 and 4.8 show the corresponding comparisons for the CASP12 dataset.


(a) All contacts. Residue pairs with an index difference of five or less are excluded. Proportion of actual contacts: 0.0128.

(b) Long range contacts. Residue pairs with an index difference of 23 or less are excluded. Proportion of actual contacts: 0.0096.

Figure 4.1: Precision-recall curves for each model on the CASP12 dataset.


(a) All contacts. Residue pairs with an index difference of five or less are excluded. Proportion of actual contacts: 0.0193.

(b) Long range contacts. Residue pairs with an index difference of 23 or less are excluded. Proportion of actual contacts: 0.0151.

Figure 4.2: Precision-recall curves for each model on the PconsC3 benchmark dataset.


Figure 4.3: F1 score histogram (frequency of F1 scores) for each model, and scatter plots (F1 scores against F1 scores) of F1 scores for each pair of models on the CASP12 dataset.


Figure 4.4: F1 score histogram (frequency of F1 scores) for each model, and scatter plots (F1 scores against F1 scores) of F1 scores for each pair of models on the PconsC3 benchmark dataset.



(a) Target 1MB6A. TiramiProt has higher F1 score than RaptorX-Contact.


(b) Target 2OJ5C. TiramiProt has lower F1 score than RaptorX-Contact.

Figure 4.5: Contact maps for PconsC3 benchmark dataset targets 1MB6A and 2OJ5C, showing both TiramiProt (upper left) and RaptorX-Contact (lower right). 1MB6A and 2OJ5C are the PconsC3 benchmark dataset targets with the best and worst TiramiProt F1 score respectively, relative to RaptorX-Contact. Blue star: true positives.



(a) Target 3P45J. TiramiProt has higher F1 score than PconsC4. The missing data is likely caused by incomplete data in the experimental structure.


(b) Target 3VHHB. TiramiProt has lower F1 score than PconsC4.

Figure 4.6: Contact maps for PconsC3 benchmark dataset targets 3P45J and 3VHHB, showing both TiramiProt (upper left) and PconsC4 (lower right). 3P45J and 3VHHB are the PconsC3 benchmark dataset targets with the best and worst TiramiProt F1 score respectively, relative to PconsC4. Blue star: true positives.



(a) Target T0904. TiramiProt has higher F1 score than RaptorX-Contact.


(b) Target T0922. TiramiProt has lower F1 score than RaptorX-Contact.

Figure 4.7: Contact maps for CASP12 targets T0904 and T0922, showing both TiramiProt (upper left) and RaptorX-Contact (lower right). T0904 and T0922 are the CASP12 targets with the best and worst TiramiProt F1 score respectively, relative to RaptorX-Contact.



(a) Target T0900. TiramiProt has higher F1 score than PconsC4.


(b) Target T0864. TiramiProt has lower F1 score than PconsC4.

Figure 4.8: Contact maps for CASP12 targets T0900 and T0864, showing both TiramiProt (upper left) and PconsC4 (lower right). T0900 and T0864 are the CASP12 targets with the best and worst TiramiProt F1 score respectively, relative to PconsC4.


5 Discussion

This thesis has used the Tiramisu deep learning architecture to predict protein contacts. The resulting model, TiramiProt, has been compared to two current state-of-the-art models, PconsC4 and RaptorX-Contact. This comparison consisted of evaluating their performance on the CASP12 and PconsC3 benchmark datasets using a number of different metrics. The PconsC3 benchmark dataset is five times bigger than the CASP12 dataset, but it is likely that there is an overlap between the training set of RaptorX-Contact and the PconsC3 benchmark dataset. None of the models have any training set overlap with the CASP12 dataset. The TiramiProt and PconsC4 models are similar, both in that they have used the exact same data for training and validation, and in that their respective architectures are fairly similar. RaptorX-Contact differs from the two other models by using a different type of architecture, and twice the amount of training data.

This chapter discusses the model architecture and the hyperparameters of the final model, the performance of TiramiProt compared to PconsC4 and RaptorX-Contact, and the measures used to evaluate the models.

5.1 Hyperparameters and architecture

Hyperparameter optimization was a part of the development of the TiramiProt model. The model with the lowest S-score loss following the hyperopt hyperparameter optimization differs in some ways from the original Tiramisu model. Tiramisu uses a weight decay of 10^−4 and a dropout rate of 0.2. The weight decay of TiramiProt is approximately 10^−8, with a dropout rate of 0.05.


This low level of regularization may be a side-effect of how hyperopt was allowed to evaluate the models, only looking at the S-score validation loss of a single epoch. Keeping a moving average of the S-score validation loss may promote different behavior in terms of the preferred regularization. The lower number of pooling layers – a maximum of four, lower than the five used by Tiramisu – is motivated by the minimum sequence length of the training dataset not allowing for five pooling layers. Perhaps also worth noting is the difference in the number of filters in the first convolution layer between TiramiProt and PconsC4. PconsC4 initially compresses these 132 input maps into 64 filters, whereas TiramiProt compresses them to 29 filters. It is not immediately clear why this would differ across the models, but it may give an indication that the representation produced by the sequence network is quite sparse.

5.2 Model performance

Looking at model performance for the two datasets, we note that RaptorX-Contact performs much better than both TiramiProt and PconsC4 on the PconsC3 benchmark dataset, independent of the measure used. Tables 4.2 and 4.3 show a large difference between RaptorX-Contact and the two other models on the PconsC3 benchmark dataset. The same pattern is also seen in the precision-recall curves in figure 4.2 and their corresponding average precision scores. Comparing the histograms in figure 4.4 also shows this very clear difference in performance, with the F1 score distribution of RaptorX-Contact showing fewer low F1 scores than either of the other models. On the CASP12 dataset, the results are less clear. The F1 score of RaptorX-Contact is slightly better than that of TiramiProt, but the PPV of the top L contacts, as well as the precision-recall curves and their average precision scores (figure 4.1), are very similar to TiramiProt. In figures 4.1 and 4.2, the rapid drop in precision as recall approaches 1 for RaptorX-Contact is caused by RaptorX-Contact not reporting contact probabilities for all residue pairs. PconsC4 precision starting at 0 in figure 4.1 is likely caused by a few false positives reported as high-probability contacts.


Worth remembering here is that RaptorX-Contact is an ensemble of several models based on the same architecture, averaging the output of these models. Additionally, it is also probable that the RaptorX-Contact training dataset contains parts of the PconsC3 benchmark dataset, likely explaining at least a part of the very large performance difference on this dataset. The choice of input features is quite similar between RaptorX-Contact and the two other models, and they are intended to capture more or less the same things, though the methods used for feature extraction differ somewhat. The effect of feature choice may be interesting future work, but it is outside the scope of this thesis.

Looking closer at the performance of TiramiProt and PconsC4, table 4.2 shows similar performance, with TiramiProt having a slightly higher F1 score than PconsC4. Table 4.3 also shows very small differences, with TiramiProt having a slightly higher PPV on CASP12, and PconsC4 having a slightly higher PPV on the PconsC3 benchmark dataset. Figure 4.1 shows some difference, with TiramiProt showing better recall for long-range contacts in particular, but this relationship is a lot less pronounced for the PconsC3 benchmark dataset, as shown in figure 4.2.

Figures 4.3 and 4.4 show a high correlation between the F1 values of PconsC4 and TiramiProt, with some correlation with the F1 values of RaptorX-Contact. Given the identical training and validation sets of TiramiProt and PconsC4, this is expected. The correlation between RaptorX-Contact and the other models is likely in part caused by overlapping training sets, as well as by varying intrinsic difficulty of the targets. The higher prevalence of difficult targets in the CASP12 dataset, evidenced by the higher proportion of sequences on which all models receive low F1 scores, is notable. Different criteria during dataset construction is one likely cause.

Figures 4.5, 4.6, 4.7, and 4.8 illustrate the targets with the largest differences in F1 score between the models. In figure 4.8 we note the contacts in different parts of the protein, without too many false positives. In the opposite case – when RaptorX-Contact performs better than TiramiProt – the number of false positives seems high. This is in line with what is expected based on the F1 score correlations for these four points. Note that this does not necessarily hold for the dataset as a whole.

5.3 Capturing model quality

Figure 4.7a gives an example of what the F1 score does not capture: the spread of contacts. Though the F1 score of TiramiProt is better than that of RaptorX-Contact, it is worth noting that the contacts of RaptorX-Contact may in fact be more useful in this case. The contacts captured by RaptorX-Contact are spread across different parts of the protein, making them more useful for restricting the possible shapes the protein may take. The entropy score used in CASP might have been useful in quantifying this, and it or some similar score should ideally be used in future model evaluation.

Using contact maps generated by the model to build full protein structures and evaluating these may be an additional interesting measure. This may be useful for model evaluation overall. As building these structures is quite resource intensive, it is not feasible during model training, but for future work relying on hyperparameter optimization, this may be a better heuristic than validation loss for model evaluation.

5.4 Summary


Retraining these models with more data should provide an answer to this.


6 Conclusion

The focus of this thesis has been the construction of a deep learning model for protein contact prediction, TiramiProt. The TiramiProt model is based on the Tiramisu deep learning architecture. It takes a number of input features related to protein sequence and protein homology, and returns a protein contact map, i.e. a matrix of contact probabilities for each residue pair in the protein. The input features, as well as the training and validation datasets, are identical to those of the state-of-the-art PconsC4 model. TiramiProt performs similarly to PconsC4 as well as RaptorX-Contact, another current state-of-the-art model for protein contact prediction. The performance is comparable to these models across a range of different measures used for evaluation of contact prediction models as well as in other machine learning contexts. Given the identical inputs and similar performance of TiramiProt and PconsC4, it is not possible to conclusively say that the Tiramisu architecture gives better results than the u-net architecture of PconsC4 for protein contact prediction. It is however noteworthy that TiramiProt performs comparably to RaptorX-Contact, an ensemble of deep learning models based on the ResNet architecture, trained on twice the amount of data. Though not a clear improvement on the current state-of-the-art, TiramiProt still warrants continued experimentation. A Python package and a Singularity container for the TiramiProt model are available at gitlab.com/nikos.t.renhuldt/TiramiProt.

Immediate future research includes retraining the model on a bigger dataset. Many new structures have been released since CASP12, with the new dataset consisting of more than 16 000 structures. There are also plans to attempt data augmentation techniques during training, in the hope of producing more robust models. Current plans for this include using several alignments of different quality for each protein, and randomizing which one is used for each sequence during each epoch, similar to the data augmentation applied during training of models on image data. As of writing this, the new alignments have recently been produced, with only small changes to the current code necessary for retraining the network with data augmentation. To the author's knowledge, this kind of data augmentation has never been done for deep learning models for protein contact prediction before. Traditional ensembling may also be worth attempting. During the writing of this thesis, new methods in deep learning have been released, including training allowing super-convergence [36] and stochastic weight averaging [37], both of which seem like possible avenues for future experimentation. Training allowing super-convergence is particularly interesting as it holds the promise of reducing training times significantly, making it possible to quickly evaluate different models.

Bibliography

[1] M. Ekeberg, C. Lövkvist, Y. Lan, M. Weigt, and E. Aurell, "Improved contact prediction in proteins: Using pseudolikelihoods to infer Potts models", Physical Review E, vol. 87, no. 1, p. 012 707, Jan. 11, 2013.

[2] M. J. Skwark, M. Michel, D. M. Hurtado, M. Ekeberg, and A. Elofsson, “Accurate contact predictions for thousands of protein families using PconsC3”, BioRxiv, p. 079 673, Oct. 7, 2016.

[3] J. Schaarschmidt, B. Monastyrskyy, A. Kryshtafovych, and A. M. Bonvin, “Assessment of contact predictions in CASP12: Co-evolution and deep learning coming of age”, Proteins: Structure, Function, and Bioinformatics, vol. 86, pp. 51–66, 2018.

[4] Y. LeCun, Y. Bengio, and G. Hinton, “Deep learning”, Nature, vol. 521, no. 7553, p. 436, May 27, 2015.

[5] M. Michel, D. Menéndez Hurtado, and A. Elofsson, “PconsC4: Fast, accurate, and hassle-free contact prediction”, In preparation, 2018.

[6] S. Wang, S. Sun, Z. Li, R. Zhang, and J. Xu, "Accurate De Novo Prediction of Protein Contact Map by Ultra-Deep Learning Model", PLOS Computational Biology, vol. 13, no. 1, e1005324, Jan. 5, 2017.

[7] S. Jégou, M. Drozdzal, D. Vazquez, A. Romero, and Y. Bengio, "The One Hundred Layers Tiramisu: Fully Convolutional DenseNets for Semantic Segmentation", Nov. 28, 2016. arXiv: 1611.09326 [cs].

[8] T. Shafee, Protein structure illustration, in Wikipedia.

[9] K. He, X. Zhang, S. Ren, and J. Sun, “Deep Residual Learning for Image Recognition”, Dec. 10, 2015. arXiv: 1512.03385 [cs].


[10] O. Ronneberger, P. Fischer, and T. Brox, “U-Net: Convolutional Networks for Biomedical Image Segmentation”, May 18, 2015. arXiv: 1505.04597 [cs].

[11] G. Huang, Z. Liu, K. Q. Weinberger, and L. van der Maaten, “Densely Connected Convolutional Networks”, Aug. 24, 2016. arXiv: 1608.06993 [cs].

[12] M. Weigt, R. A. White, H. Szurmant, J. A. Hoch, and T. Hwa, "Identification of direct residue contacts in protein–protein interaction by message passing", Proceedings of the National Academy of Sciences, vol. 106, no. 1, pp. 67–72, Jan. 6, 2009. pmid: 19116270.

[13] C. Baldassi, M. Zamparo, C. Feinauer, A. Procaccini, R. Zecchina, M. Weigt, and A. Pagnani, "Fast and Accurate Multivariate Gaussian Modeling of Protein Families: Predicting Residue Contacts and Protein-Interaction Partners", PLoS ONE, vol. 9, no. 3, K. Hamacher, Ed., e92721, Mar. 24, 2014.

[14] A. van den Oord, S. Dieleman, H. Zen, K. Simonyan, O. Vinyals, A. Graves, N. Kalchbrenner, A. Senior, and K. Kavukcuoglu, "WaveNet: A Generative Model for Raw Audio", Sep. 12, 2016. arXiv: 1609.03499 [cs].

[15] J. Gehring, M. Auli, D. Grangier, D. Yarats, and Y. N. Dauphin, “Convolutional Sequence to Sequence Learning”, May 8, 2017. arXiv: 1705.03122 [cs].

[16] S. Ioffe and C. Szegedy, “Batch Normalization: Accelerating Deep Network Training by Reducing Internal Covariate Shift”, Feb. 10, 2015. arXiv: 1502.03167 [cs].

[17] V. Nair and G. E. Hinton, "Rectified Linear Units Improve Restricted Boltzmann Machines", in Proceedings of the 27th International Conference on International Conference on Machine Learning, ser. ICML'10, USA: Omnipress, 2010, pp. 807–814.

[18] K. He, X. Zhang, S. Ren, and J. Sun, "Identity Mappings in Deep Residual Networks", Mar. 16, 2016. arXiv: 1603.05027 [cs].

[19] S. Xie, R. Girshick, P. Dollár, Z. Tu, and K. He, "Aggregated Residual Transformations for Deep Neural Networks", 2016. arXiv: 1611.05431 [cs].


[20] G. Pleiss, D. Chen, G. Huang, T. Li, L. van der Maaten, and K. Q. Weinberger, “Memory-Efficient Implementation of DenseNets”, Jul. 21, 2017. arXiv: 1707.06990 [cs].

[21] I. Goodfellow, Y. Bengio, and A. Courville, Deep Learning. MIT Press, 2016.

[22] J. Bergstra, B. Komer, C. Eliasmith, D. Yamins, and D. D. Cox, "Hyperopt: A Python library for model selection and hyperparameter optimization", Computational Science & Discovery, vol. 8, no. 1, p. 014 008, 2015.

[23] J. S. Bergstra, R. Bardenet, Y. Bengio, and B. Kégl, "Algorithms for hyper-parameter optimization", in Advances in Neural Information Processing Systems, 2011, pp. 2546–2554.

[24] Z. Q. Fu, G. C. Du Bois, S. P. Song, I. Kulikovskaya, L. Virgilio, J. L. Rothstein, C. M. Croce, I. T. Weber, and R. W. Harrison, “Crystal structure of MTCP-1: Implications for role of TCL-1 and MTCP-1 in T cell malignancies.”, Proc.Natl.Acad.Sci.USA, vol. 95, pp. 3413–3418, 1998.

[25] D. M. Hurtado, K. Uziela, and A. Elofsson, "Deep transfer learning in the assessment of the quality of protein models", Apr. 17, 2018. arXiv: 1804.06281 [q-bio].

[26] A. Ray, E. Lindahl, and B. Wallner, “Improved model quality assessment using ProQ2”, BMC Bioinformatics, vol. 13, p. 224, Sep. 10, 2012. pmid: 22963006.

[27] J. Ma, S. Wang, Z. Wang, and J. Xu, “Protein contact prediction by integrating joint evolutionary coupling analysis and supervised learning”, Bioinformatics, vol. 31, no. 21, pp. 3506–3513, Nov. 1, 2015.

[28] H. M. Berman, J. Westbrook, Z. Feng, G. Gilliland, T. N. Bhat, H. Weissig, I. N. Shindyalov, and P. E. Bourne, “The Protein Data Bank”, Nucleic Acids Research, vol. 28, no. 1, pp. 235–242, Jan. 1, 2000.


[30] G. Wang and R. L. Dunbrack Jr, "PISCES: A protein sequence culling server", Bioinformatics, vol. 19, no. 12, pp. 1589–1591, 2003.

[31] M. Remmert, A. Biegert, A. Hauser, and J. Söding, "HHblits: Lightning-fast iterative protein sequence searching by HMM-HMM alignment", Nature Methods, vol. 9, pp. 173–5, Dec. 25, 2011.

[32] A. Paszke, S. Gross, S. Chintala, G. Chanan, E. Yang, Z. DeVito, Z. Lin, A. Desmaison, L. Antiga, and A. Lerer, "Automatic differentiation in PyTorch", Oct. 28, 2017.

[33] S. D. Dunn, L. M. Wahl, and G. B. Gloor, "Mutual information without the influence of phylogeny or entropy dramatically improves residue contact prediction", Bioinformatics, vol. 24, no. 3, pp. 333–340, Feb. 1, 2008.

[34] T. Lopez, K. Dalton, A. Tomlinson, V. Pande, and J. Frydman, “An information theoretic framework reveals a tunable allosteric network in group II chaperonins”, Nature Structural & Molecular Biology, vol. 24, no. 9, pp. 726–733, Sep. 2017.

[35] D. P. Kingma and J. Ba, “Adam: A Method for Stochastic Opti-mization”, Dec. 22, 2014. arXiv: 1412.6980 [cs].

[36] L. N. Smith and N. Topin, "Super-Convergence: Very Fast Training of Neural Networks Using Large Learning Rates", Aug. 23, 2017. arXiv: 1708.07120 [cs, stat].


A Sequences

A.1 CASP12

CASP12 target IDs T0859, T0860, T0861, T0862, T0863, T0864, T0865, T0866, T0868, T0869, T0870, T0871, T0872, T0873, T0879, T0886, T0889, T0891, T0892, T0893, T0896, T0897, T0898, T0900, T0902, T0903, T0904, T0911, T0912, T0918, T0920, T0921, T0922, T0928, T0941, T0942, T0943, T0944, T0945, T0947

A.2 PconsC3 benchmark dataset

PDB accessions 1AHSC, 1C2YD, 1C9YA, 1CCTA, 1COZA, 1DBRA, 1DCHF, 1EDIA, 1EFDN, 1F46B, 1F68A, 1FHIA, 1FJRB, 1FS0G, 1G61A, 1GJJA, 1GLGA, 1GPSA, 1H68A, 1I95E, 1I97T, 1IMBB, 1IMXA, 1IR1S, 1IS9A, 1JGPR, 1JH0L, 1K6LH, 1KNVB, 1KNYB, 1KQPA, 1LDIA, 1LQKB, 1M12A, 1MB6A, 1MFRP, 1MR7A, 1N2ZB, 1N5BA, 1N60C, 1NQLB, 1OAGA, 1OTFF, 1P3HE, 1PCFA, 1PDFE, 1PS1A, 1RD9D, 1RH7C, 1RL9A, 1S3FB, 1S68A, 1SUDA, 1SWXA, 1SYHA, 1TD4A, 1TFKB, 1TJLD, 1UWZB, 1VCRA, 1VJNA, 1VQZA, 1W8AA, 1W9GB, 1WD5A, 1WIGA, 1WPVB, 1X0PJ, 1X48A, 1X8HA, 1X91A, 1XBAA, 1XQFA, 1XS6A, 1Y4HD, 1Y60C, 1YG6F, 1YHQO, 1YQFF, 1YWSA, 1Z7ME, 1ZD7B, 1ZJ0A, 1ZWYC, 2A84A, 2A9KB, 2AMCA, 2AV5D, 2B9NX, 2BWEL, 2CB6A, 2CCCA, 2CDMC, 2CJRB, 2CSMA, 2D0PB, 2D2CN, 2DIOC, 2E2AB, 2EJNA, 2F0RA, 2FEEB, 2FJCO, 2GVIA, 2H44A, 2HGHA, 2HI7B, 2HJJA, 2HL0A, 2I9LI, 2IA9E, 2II9B, 2J1KQ, 2J3WA, 2J8WB, 2JOVA, 2JYNA, 2KYSA, 2KZSA, 2M0MB, 2NQ2A, 2NR9A, 2OF5H, 2OGFD, 2OHCA, 2OJ5C, 2ONKC, 2OPIA, 2PAVP, 2PLSF, 2Q7RA, 2QYFD, 2RDOL, 2RMRA, 2RTBB, 2VGRA, 2VT8A,


B Model hyperparameters

Table B.1: Exact hyperparameters of the model with the lowest validation loss.

Hyperparameter Value

Learning rate 0.0009872639887752842

Learning rate patience 11

Weight decay 7.091143176348923e-09

Dropout rate 0.04656566466213424

Batch size 4

Number of filters, first convolution layer 29

Number of pooling layers 4

Layers per block 8

Growth rate 10
