
Linköpings universitet, SE-581 83 Linköping, +46 13 28 10 00, www.liu.se

Linköping University | Department of Physics, Chemistry and Biology

Master thesis, 30 hp | Bioinformatics

Spring term 2018 | LITH-IFM-A-EX--18/3471--SE

Study of Protein Interfaces with Clustering

Jonathan Bergqvist

Supervisor: Claudio Mirabello
Examiner: Björn Wallner


Date (Datum): 2018-06-19
Division, Department (Avdelning, institution): Department of Physics, Chemistry and Biology, Linköping University
ISRN: LITH-IFM-A-EX--18/3471--SE
Language (Språk): English
Report category (Rapporttyp): Examensarbete (Master thesis)
Title (Titel): Study of Protein Interfaces with Clustering
Author (Författare): Jonathan Bergqvist


Copyright

The publishers will keep this document online on the Internet – or its possible replacement – for a period of 25 years starting from the date of publication barring exceptional circumstances. The online availability of the document implies permanent permission for anyone to read, to download, or to print out single copies for his/hers own use and to use it unchanged for non-commercial research and educational purpose. Subsequent transfers of copyright cannot revoke this permission. All other uses of the document are conditional upon the consent of the copyright owner. The publisher has taken technical and administrative measures to assure authenticity, security and accessibility. According to intellectual property law the author has the right to be mentioned when his/her work is accessed as described above and to be protected against infringement. For additional information about the Linköping University Electronic Press and its procedures for publication and for assurance of document integrity, please refer to its www home page: http://www.ep.liu.se/.

© Jonathan Bergqvist

Abstract

Protein-protein interactions occur in nature and have different functions. The interacting surface between two interacting proteins contains the respective protein's interface residues.

In this thesis, a series of Python scripts is presented that can perform interface-interface comparisons with the method InterComp, to obtain a distance matrix of different protein interfaces. The distance matrix can be studied with the use of clustering algorithms such as DBSCAN.

The result from clustering using DBSCAN shows that, for the 77,017 protein interfaces studied, a majority of the protein interfaces are part of a single cluster, while most of the remaining interfaces are noise for the tested parameters Eps and MinPts.

The conclusion of this thesis concerns the effect of the tested parameters Eps and MinPts on the number of clusters obtained when performing DBSCAN.

Acknowledgments

I would like to thank the following individuals:

Björn Wallner, for giving me the opportunity to perform this thesis work, and for all the discussions and help he has provided.

Claudio Mirabello, for providing advice and help.

Lovisa Sandell, for her great opposition and feedback.

Contents

Abstract
Acknowledgments
Contents
List of Figures
List of Tables

1 Introduction
  1.1 Motivation
  1.2 Aim
  1.3 Research questions
  1.4 Delimitations

2 Process
  2.1 Time plan
  2.2 Plan for systematic follow-up

3 Theory
  3.1 Protein-protein interactions
  3.2 InterComp
  3.3 Cluster analysis
  3.4 The clustering algorithm DBSCAN

4 Method
  4.1 Overview information
  4.2 Preparations
  4.3 Overview of the created scripts
  4.4 Running InterComp to obtain p-values
  4.5 Checking the results from InterComp
  4.6 Creating a matrix of p-values
  4.7 Scanning the matrix
  4.8 Changing the data type of the matrix
  4.9 Running the clustering algorithm DBSCAN
  4.10 Analyzing the clusters
  4.11 Studying the pseudo-clusters
  4.12 Summary of scripts

5 Results
  5.1 Clustering with DBSCAN
  5.2 Distribution of the pseudo-clusters
  5.3 The process

6 Discussion
  6.1 Results
  6.2 Process
  6.3 Method
  6.4 The work in a wider context

7 Conclusion
  7.1 Future prospects

Bibliography

List of Figures

2.1 The initial GANTT schedule. Activities A are shown in yellow and milestones M are shown in green. Dependencies D are shown between activities and milestones.
3.1 Illustration of a protein. The yellow part is the protein core, the red part is the surface of the protein and the black part is the interface of the protein surface. Made with draw.io [7].
3.2 Illustration of DBSCAN, where the dots are points and the circles are range Eps with MinPts = 3. The points with solid circles are core points, the points with dashed lines are border points and the point with a dotted line is noise. Made with draw.io [7].
4.1 Flow chart of general method steps (top row) and the corresponding scripts (bottom row) used in this thesis. Made with draw.io [7].
4.2 Flow chart of the script largeInterCompRun.py. Made with draw.io [7].
4.3 Flow chart of the script ownMatrixMaker.py. Made with draw.io [7].
5.1 The distribution of the −log10(p-values) with 50 bins. Note that the bar to the right (~90) is equal to the number of data points that are infinite in size (see negLog10Matrix() in section 4.7 for an explanation).
5.2 The final GANTT schedule. Activities A are shown in yellow and milestones M are shown in green. Dependencies D are shown between activities and milestones.

List of Tables

4.1 Summary of the scripts used in this thesis, including their respective input and output.
5.1 The number of clusters estimated by DBSCAN for different Eps when MinPts = 5, in decreasing size of Eps.
5.2 The number of clusters estimated by DBSCAN for different Eps when MinPts = 10, in decreasing size of Eps.
5.3 The top three biggest clusters and corresponding cluster-id (-1 equals noise) estimated by DBSCAN for a selection of Eps when MinPts = 5, in decreasing size of Eps. Note that the 3rd biggest cluster size (and smaller clusters) may occur multiple times.
5.4 The top three biggest clusters and corresponding cluster-id (-1 equals noise) estimated by DBSCAN for a selection of Eps when MinPts = 10, in decreasing size of Eps. Note that the 3rd biggest cluster size (and smaller clusters) may occur multiple times.

1 Introduction

1.1 Motivation

Proteins can interact with each other through binding, which occurs on the interacting surface containing the interface residues of each respective protein. Many interfaces have been determined. [1] The same interface can be created regardless of the amino acid sequence of the protein, since the same interface geometry can be created from different amino acid sequences. [2]

By using known interfaces from the Protein Data Bank (PDB), the software InterComp can be used to perform interface-interface comparisons between different interfaces. By calculating the p-values with InterComp for the available interface-interface comparisons, a matrix of p-values can be created, where each p-value tests the null hypothesis that a random hit would have a higher score. [3]

The p-value matrix created can be used with the clustering algorithm DBSCAN to sort the matrix rows (PDB id’s) into different groups/clusters. [4] Then the clusters and differences between clusters can be studied.

1.2 Aim

The aim of this thesis is to compare interfaces between proteins, followed by clustering, to determine if there are any relationships within the clusters and differences between the clusters. The clustering will also be used to try to answer whether the Protein Data Bank (PDB) is complete in regard to protein interfaces, as in 2010 it was claimed that the library of interfaces was almost complete [2].

1.3 Research questions

1. How many clusters are obtained for different parameters in DBSCAN?
2. Do the protein interfaces within each cluster from DBSCAN have anything in common?
3. Is it possible to determine if all possible protein interfaces can be found in the Protein Data Bank (PDB)?

1.4 Delimitations

The total number of protein interfaces available for use in InterComp was 578,884, which would give a p-value matrix of size 578,884², requiring a large amount of RAM and time to create. Therefore, a smaller dataset of 77,017 interfaces, which required less RAM and time to create, was produced in advance of the thesis by the supervisor. See section 4.2 for more details.

2 Process

2.1 Time plan

When the project started, a time plan was constructed as a GANTT schedule, which can be seen in figure 2.1. The figure contains the planned activities A and milestones M together with their dependencies D.

Figure 2.1: The initial GANTT schedule. Activities A are shown in yellow and milestones M are shown in green. Dependencies D are shown between activities and milestones.

The aim of the project is stated in section 1.2 and the activities to reach that goal are shown in the GANTT schedule. The main programming starts during week 6 of 2018 with the activity Designing program and continues with the activities Implementing and running program and Analyzing results from the program. The milestones are different deadlines for the project.

2.2 Plan for systematic follow-up

The plan for follow-up can generally be seen in the dependencies of the GANTT schedule in figure 2.1. As can be seen in the figure, activities starting week 6 are presented in a workflow.


When one activity ends, another usually starts as a continuation; for example, Analyzing results from the program is a continuation of Implementing and running program, since a program that generates results is needed before the obtained results can be analyzed. If it turns out that some activities take more time than planned, this is not considered a major problem, as there is some flexibility in the design of the time plan, and more time can be allocated to the parts of the programming that need it.

3 Theory

3.1 Protein-protein interactions

Protein-protein interactions (PPIs) have different functions in nature depending on, for example, affinity, and PPIs can be very important for the cell. [5]

According to Nooren and Thornton [5], there are three different types of PPIs:

1. Homooligomeric and heterooligomeric complexes.
2. Non-obligate and obligate complexes.
3. Transient and permanent complexes.

However, PPIs do not need to be of a single type. Examples mentioned by Nooren and Thornton [5] include the Arc repressor as an obligate homodimer and Thrombin together with the Rodniin inhibitor as a non-obligate permanent heterodimer. [5]

The surface residues that are binding in the respective proteins in the PPI are called the protein's interface, and many protein interfaces have been determined. [1] Interfaces are not dependent on the order of the amino acid sequence of the protein, since the same interface geometry can be obtained from different amino acid sequences. [2] Interfaces contain more hydrophobic residues and fewer hydrophilic residues according to Yan et al. [6]. For interfaces involved in an interaction, called protein-protein interfaces (called interacting interfaces below), the more common contacts are salt bridges, hydrophobic interactions and Cys-Cys contacts as disulfide bridges. It has also been found that aromatic residues are more common in interacting interfaces. [6] An illustration of a protein can be seen in figure 3.1.

3.2 InterComp

InterComp is a method for structural comparison of proteins, including interface-surface and interface-interface comparisons. The method is independent of topology and sequence order, since it treats the compared molecules as independent points in 3D space. [3]

The objective function for InterComp is calculated by comparing the Cα distance maps $D_a$ and $D_b$ for molecules $a$ and $b$ with lengths $L_a = N$ and $L_b = M$ respectively, where the molecules can be of different sequence lengths ($N \leq M$). [3]


Figure 3.1: Illustration of a protein. The yellow part is the protein core, the red part is the surface of the protein and the black part is the interface of the protein surface. Made with draw.io [7].

The similarity for a trial alignment between the molecules is calculated by the function $\mathrm{strdist}(D_a, D_b)$ (equation 3.1), which calculates the difference $\delta_{xy} = |D_a - D_b|_{xy}$, where $(x, y)$ is the element being calculated, between the positions in the common size of the distance matrices (if molecule $b$ is longer than $a$, the excess residues $M - N$ are excluded from the alignment and are not calculated). A trial alignment is obtained by switching two random rows or columns in the distance matrix $D_b$ to obtain a new distance matrix $D_b^p$. The parameter $d_0$ is a parameter of the Levitt-Gerstein score [8] and is set to 0.5 Å through optimization. [3]

$$\mathrm{strdist}(D_a, D_b) = \frac{1}{N^2}\sum_{x=1}^{N}\sum_{y=1}^{N}\frac{1}{1 + (\delta_{xy}/d_0)^2} \qquad (3.1)$$

The similarity of the aligned residues is calculated by the function $\mathrm{seqdist}(S_a, S_b)$ (equation 3.2), where $S_a$ and $S_b$ are the aligned residues in $a$ and $b$ respectively, and $\mathrm{BLOSUM}$ is the BLOSUM62 substitution matrix by Henikoff and Henikoff [9]. [3]

$$\mathrm{seqdist}(S_a, S_b) = \frac{1}{N}\sum_{z=1}^{N}\mathrm{BLOSUM}(s_z^a, s_z^b) \qquad (3.2)$$

The complete objective function is calculated by weighting $\mathrm{strdist}(D_a, D_b^p)$ and $\mathrm{seqdist}(S_a, S_b^p)$ to get the optimal maximum $\mathrm{opt}(p)$ (equation 3.3).

$$\mathrm{opt}(p) = \underset{p \in P(M,N)}{\arg\max}\; W_{str}\,\mathrm{strdist}(D_a, D_b^p) + (1 - W_{str})\,\mathrm{seqdist}(S_a, S_b^p) \qquad (3.3)$$

In equation 3.3, $W_{str} \in [0, 1]$ is the weight, whose default value is optimized to 0.5, $D_b^p$ is the new distance matrix used in the trial alignment and $S_b^p$ are the amino acids from the distance matrix $D_b^p$. [3]

The optimization is made by trying different $D_b^p$ and calculating the score $\mathrm{opt}(p)$ (equation 3.3). The tested $D_b^p$ is accepted if its score is better than the earlier top score. The score can also be accepted with the probability $P = \exp(-\Delta \mathrm{score}/T)$, where $\Delta \mathrm{score}$ is the difference in score between the tested $D_b^p$ and the last top score, and $T$ is the annealing temperature, which is lowered as the number of test iterations increases. [3]
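As a concrete illustration of equations 3.1-3.3 and the acceptance rule, the following minimal sketch re-implements them for toy inputs. It is not the InterComp code: the BLOSUM matrix is assumed to be a dict mapping residue pairs to BLOSUM62 scores, and the trial-alignment bookkeeping is omitted.

    import numpy as np

    D0 = 0.5     # d0 of the Levitt-Gerstein score, set to 0.5 Å [3]
    W_STR = 0.5  # default weight W_str [3]

    def strdist(Da, Db):
        # Equation 3.1 over the common N x N part of the two Calpha distance
        # maps (excess residues of the longer molecule are excluded).
        N = Da.shape[0]
        delta = np.abs(Da - Db[:N, :N])
        return np.sum(1.0 / (1.0 + (delta / D0) ** 2)) / N ** 2

    def seqdist(Sa, Sb, blosum):
        # Equation 3.2: mean substitution score of the aligned residues.
        return sum(blosum[a, b] for a, b in zip(Sa, Sb)) / len(Sa)

    def opt_score(Da, Dbp, Sa, Sbp, blosum):
        # Equation 3.3 for one trial alignment p, already applied to the
        # permuted distance matrix Dbp and its residues Sbp.
        return W_STR * strdist(Da, Dbp) + (1 - W_STR) * seqdist(Sa, Sbp, blosum)

    def accept(trial, best, T, rng):
        # Annealing acceptance: better scores are always accepted, worse
        # ones with probability P = exp(-delta_score / T).
        return trial > best or rng.random() < np.exp(-(best - trial) / T)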

3.3 Cluster analysis

Cluster analysis, or clustering, is a set of methods used to split a set of data into smaller groups of data called clusters, where each cluster contains data points with more similarity within the cluster, and less similarity when comparing between different clusters. [10]

Methods for cluster analysis can be split into two broad classes: hierarchical methods (which give results that resemble a phylogenetic tree) and non-hierarchical methods (which group the data into exclusive sections). [10]

3.4 The clustering algorithm DBSCAN

The clustering algorithm used in this report is called DBSCAN, Density Based Spatial Clustering of Applications with Noise, which discovers clusters based on the density of points. There are two types of points in a cluster: core points within the cluster and border points on the border of the cluster. [4]

A core point q is defined as a point with at least MinPts points within range Eps of it; this condition does not apply to border points, which normally have fewer points near them. Core points can reach other points that are within range Eps of the core point in question. When a core point q reaches a border point p, the border point is called a density-reachable border point. A core point o reaching multiple border points p and q at once makes the border points p and q density-connected border points. [4]

A cluster is defined as all points p and q, where q is density-reachable from p and p is density-connected to q. Noise is defined as all points that do not belong to any cluster. [4]

An illustration of DBSCAN can be seen in figure 3.2.

Figure 3.2: Illustration of DBSCAN, where the dots are points and the circles are range Eps with MinPts = 3. The points with solid circles are core points, the points with dashed lines are border points and the point with a dotted line is noise. Made with draw.io [7].
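To make the definitions above concrete, here is a minimal, self-contained example of DBSCAN on toy 2D points using scikit-learn (the data is invented purely for illustration):

    import numpy as np
    from sklearn.cluster import DBSCAN

    # Four points in one dense clump and one isolated point.
    points = np.array([[0.0, 0.0], [0.1, 0.0], [0.0, 0.1], [0.1, 0.1], [5.0, 5.0]])

    db = DBSCAN(eps=0.2, min_samples=3).fit(points)  # Eps = 0.2, MinPts = 3
    print(db.labels_)               # [0 0 0 0 -1]; the label -1 marks noise
    print(db.core_sample_indices_)  # core points; other cluster members are border points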

4 Method

4.1 Overview information

This thesis was made with the programming language Python, versions 3.5.2 and 3.6.2; some terminology used when programming will therefore appear in the sections below. This section gives an explanation of the terminology used.

Terminology

This font will be used to present the scripts (names ending with .py), or programs. The same holds for functions (names ending with parentheses, ()) within scripts and for printouts of certain pieces of code.

• Float or floating point number - (usually) an approximation of a decimal number. [11]
• Pickle - pickle.dump() is used to store a representation of an object to a file for later use; the object can be retrieved with pickle.load(). [12]
• Package - contains pre-written functions for use. The external packages numpy (including .npy files and obtaining 32-bit floats) [13], scikit-learn (for DBSCAN) [14], Seaborn [15], matplotlib [16, 17], pandas [18] (for plotting) and scipy (for calculating euclidean distances) [19] are used in the thesis.
• Data type - here written as dtype, which can be for example a 32-bit float, written as dtype = float32.

4.2 Preparations

Before the start of this thesis, preparations were made by the supervisor, who created smaller datasets from the initial large dataset of 578,884 interfaces.¹ This was done by picking a subset of 2349 representative interfaces and running the InterComp program for the 578,884 interfaces versus the 2349 interfaces, yielding a 578,884 x 2349 matrix of p-values, where the p-values were to test the null hypothesis that a random hit would have a higher score [3].

This was followed by calculating the Pearson correlation coefficient [20] between the couples of the 2349-element vectors, to obtain an estimated similarity measure for the couples of the 578,884 interfaces. When the correlation [20] was over 0.8, the two points were assumed to be almost identical and were grouped together into a pseudo-cluster, where the largest representative of each pseudo-cluster was to be used (see the next paragraph).

By calculating the Pearson correlation coefficient [20], 433,388 interfaces could be grouped into 77,017 pseudo-clusters, where the largest representative of each cluster was used in the thesis work.² The remaining 145,496 interfaces were not similar to anything, yielding a total of 222,513 interfaces available after the Pearson correlation calculations [20].

¹ A p-value matrix for these interfaces as 32-bit floats would require approximately 1.2 TB ≈ (578,884² × (32/8)) / 1024⁴ TB, where 32 is the number of bits in a float, 8 is the number of bits in a byte and 1024⁴ gives the size in TB.
² A p-value matrix for these interfaces as 32-bit floats would require approximately 22 GB ≈ (77,017² × (32/8)) / 1024³ GB.
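A minimal sketch of the grouping idea follows, under stated assumptions: profiles is an in-memory (n, 2349) array of p-value vectors, the grouping is a simple greedy pass, and the chunking that would be needed for all 578,884 interfaces (whose correlation matrix would not fit in RAM) is omitted. The actual preparation scripts may have worked differently.

    import numpy as np

    def pseudo_cluster(profiles, names, cutoff=0.8):
        # Group interfaces whose p-value profiles have a Pearson correlation
        # above the cutoff; each group becomes one pseudo-cluster.
        corr = np.corrcoef(profiles)
        unassigned = set(range(len(names)))
        clusters = []
        while unassigned:
            seed = unassigned.pop()
            members = [i for i in unassigned if corr[seed, i] > cutoff]
            unassigned -= set(members)
            clusters.append([names[seed]] + [names[i] for i in members])
        return clusters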

4.3 Overview of the created scripts

In the different scripts below, the resulting data was saved as 64-bit floats from section 4.6 until after the use of the script ownTo32bit.py in section 4.8, if nothing else is stated. The interface dataset which contained 77,017 interfaces, called 77k-data below, was used in sections 4.4-4.10. The dataset of the pseudo-clusters was also used in section 4.11.

A flow chart of the general steps and their corresponding scripts can be seen in figure 4.1.

Figure 4.1: Flow chart of general method steps (top row) and the corresponding scripts (bottom row) used in this thesis. Made with draw.io [7].

4.4 Running InterComp to obtain p-values

InterComp, or InterfaceComparison, was run for the 77k-data through the script largeInterCompRun.py, whose flow chart can be seen in figure 4.2. The script looked at each new interface from a listfile containing the names of the interfaces. By then checking whether a .started file existed for the interface in question, the script determined if it should continue with that interface or move on to the next interface in the list, which made it possible to run the script multiple times in parallel.

If a .started file did not exist for the interface in question, a .started file was created and the script then checked if there existed a .out file for the interface. If there was no .out file for the interface, InterComp was called in the terminal by the script. InterComp was run with the arguments:

• The PDB data file for the interface in question.

• A listfile .list containing the PDB data for all the comparisons (remaining) to be made.


Figure 4.2: Flow chart of the script largeInterCompRun.py. Made with draw.io [7].

A .out file was created and the results from InterComp were added to the .out file.

If a .out file already existed for the interface, the comparisons already made were added to a list of completed comparisons for that interface. The remaining comparisons were then performed by calling InterComp in the terminal and appending the output of InterComp to the existing .out file.

All created files were stored in a folder created through InterComp. The default values of the available options for alignment and annealing in InterComp were used.
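The .started marker files are what make parallel runs safe; a minimal sketch of that pattern follows. The InterComp command line and file names are placeholders, since only the argument types are described above.

    import os
    import subprocess

    def process_interfaces(listfile, workdir):
        # Claim each interface with a .started marker so that several copies
        # of the script can run in parallel without duplicating work.
        with open(listfile) as fh:
            interfaces = [line.strip() for line in fh if line.strip()]
        for name in interfaces:
            started = os.path.join(workdir, name + ".started")
            if os.path.exists(started):
                continue                # already claimed by another process
            open(started, "w").close()  # claim it (a coarse-grained lock)
            outfile = os.path.join(workdir, name + ".out")
            if not os.path.exists(outfile):
                with open(outfile, "w") as out:
                    # Placeholder for the real call: the interface's PDB file
                    # and a listfile of the remaining comparisons.
                    subprocess.run(["InterComp", name + ".pdb", "remaining.list"],
                                   stdout=out, check=False)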

4.5 Checking the results from InterComp

The script check_outfile.py took as input the listfile containing the interfaces (used in section 4.4) and the folder containing the .out files created by InterComp in section 4.4.

The interfaces from the listfile were stored in a dictionary with the interface as the key and the row index as the value. The script then looked at each .out file and checked if it contained any of the defined errors.

The defined errors were:

• The first column (column 0) had a name that was not found in the dictionary of interfaces. The first column was the interface being compared to all other interfaces.
• The second column (column 1) had a name that was not found in the dictionary of interfaces. The second column was an interface being compared to the interface in the first column.
• The number of columns on a row in the .out file was not correct (15 columns was correct).

If an error was found in the first and/or second column, the names in the affected columns were incorrect. If an error was found in the number of columns, the line had been written incorrectly.

If errors were found in a .out file, a new file ending in .tmp was created, in which only the lines of the .out file containing no errors were saved.

The .tmp files then had to be manually copied to another directory for backup, after which the .tmp files were renamed to their respective .out filenames in the original folder containing all .out files. Finally, the .started files were removed for the new .out files, followed by running InterComp again according to section 4.4, to make sure the new .out files contained all data.
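A sketch of the row validation described above (only the 15-column count and the two name checks are taken from the text; the file handling is simplified):

    def valid_lines(outfile, interface_index):
        # Keep only rows with exactly 15 columns whose two interface names
        # both appear in the dictionary of interfaces.
        good = []
        with open(outfile) as fh:
            for line in fh:
                cols = line.split()
                if (len(cols) == 15 and cols[0] in interface_index
                        and cols[1] in interface_index):
                    good.append(line)
        return good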

4.6 Creating a matrix of p-values

The matrix for the data created from InterComp in section 4.4 was created according to figure 4.3 by the script ownMatrixMaker.py. The matrix was made by first creating keys to a dictionary based on the listfile of interfaces used in section 4.4, where the index (line number starting at zero) in the listfile would correspond to the dictionary value and the interface name would correspond to the key.

Figure 4.3: Flow chart of the script ownMatrixMaker.py. Made with draw.io [7].

The matrix dimensions were then determined by the number of keys in the dictionary, since the matrix would be of size (number of keys)². The matrix was then created as a numpy array of size (number of keys)², with the value -1.0 in each element and dtype = float, which made all elements float numbers. The p-values from the InterComp comparisons for the dataset were then inserted into their respective elements (i, j) and (j, i) based on the keys created before.

The matrix was finally saved as a .npy file.
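A minimal sketch of this matrix construction, assuming the p-value triples have already been parsed from the .out files (the parsing itself and the exact file names are not specified above):

    import numpy as np

    def build_pvalue_matrix(listfile, pairs, out="matrix.npy"):
        # Map each interface name to its row index from the listfile.
        with open(listfile) as fh:
            index = {line.strip(): i for i, line in enumerate(fh)}
        n = len(index)
        matrix = np.full((n, n), -1.0, dtype=float)  # -1.0 marks "not filled"
        for name_a, name_b, pval in pairs:
            i, j = index[name_a], index[name_b]
            matrix[i, j] = matrix[j, i] = pval       # symmetric (i, j) and (j, i)
        np.save(out, matrix)
        return matrix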

4.7 Scanning the matrix

The script ownMatrixScan.py contained three functions that were used: findMinusOne(), matrixchange() and negLog10Matrix().

• findMinusOne() would look through all elements in the matrix to count any remaining -1.0 that could be left from ownMatrixMaker.py in section 4.6.
• matrixchange() would change the values in the matrix main diagonal, which were -1.0 before, to 0, and save the new matrix as a new file ending with _d.npy, where _d indicated that the diagonal had been changed.
• negLog10Matrix() would load a matrix (or list) and calculate −log10(p-value) for each element. If the biggest element became infinite, a list was created containing all elements, where the infinite elements were changed to twice the second-largest element. The created data was then saved as 32-bit floats in a file ending with _log.npy, which indicated the change to logarithm.
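A sketch of the −log10 transformation with the infinity handling described for negLog10Matrix():

    import numpy as np

    def neg_log10(pvalues):
        # -log10(p); p = 0 gives +inf, which is replaced by twice the
        # second-largest element (i.e. the largest finite value).
        with np.errstate(divide="ignore"):
            logs = -np.log10(np.asarray(pvalues, dtype=float))
        finite = logs[np.isfinite(logs)]
        if finite.size < logs.size:
            logs[~np.isfinite(logs)] = 2 * finite.max()
        return logs.astype(np.float32)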

4.8 Changing the data type of the matrix

The _d.npy matrix created in section 4.7 originally had a 64-bit float data type (the dtype = float used in section 4.6 equals dtype = float64). The matrix data type was changed from 64-bit to 32-bit float in the script ownTo32bit.py, which would load the matrix, change its data type to numpy 32-bit (dtype = float32) and save the new matrix in a file ending with _32bit.npy.
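The conversion itself is small; a sketch, with file names assumed from the naming convention above:

    import numpy as np

    matrix = np.load("matrix_d.npy")                          # 64-bit, from section 4.7
    np.save("matrix_d_32bit.npy", matrix.astype(np.float32))  # halves RAM and file size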

4.9 Running the clustering algorithm DBSCAN

The script ownDBSCAN.py was based on scripts provided by the supervisor and uses DBSCAN from the scikit-learn package. ownDBSCAN.py would load the 32-bit matrix from the _32bit.npy file finalized in section 4.8. The p-values from the matrix would be used as scores and were reshaped according to the number of elements n_elements in the matrix (i.e. the number of rows in the matrix before the reshape) if needed (if not given as an n × n matrix).

The script checked if a pickle file already existed in the directory for the parameters Eps and MinPts stated in the script. If the pickle file existed it was loaded; otherwise the DBSCAN algorithm was run through the scikit-learn package, followed by dumping the pickle file.

The number of clusters was printed for the used parameters Eps and MinPts.
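A sketch of this caching pattern follows. Passing the score matrix with metric="precomputed" is an assumption here, since the text only says the p-values were used as scores; the noise label -1 is not counted as a cluster in the printout.

    import os
    import pickle
    import numpy as np
    from sklearn.cluster import DBSCAN

    def run_dbscan(matrix_file, eps, min_pts):
        cache = "dbscan_eps{}_minpts{}.pkl".format(eps, min_pts)
        if os.path.exists(cache):            # reuse an earlier run
            with open(cache, "rb") as fh:
                labels = pickle.load(fh)
        else:
            scores = np.load(matrix_file)
            n = int(np.sqrt(scores.size))
            scores = scores.reshape(n, n)    # reshape if not already n x n
            labels = DBSCAN(eps=eps, min_samples=min_pts,
                            metric="precomputed").fit(scores).labels_
            with open(cache, "wb") as fh:
                pickle.dump(labels, fh)
        n_clusters = len(set(labels)) - (1 if -1 in labels else 0)
        print("Eps={}, MinPts={}: {} clusters".format(eps, min_pts, n_clusters))
        return labels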

4.10 Analyzing the clusters

The clusters created from the DBSCAN algorithm in section 4.9 were analyzed by the script ownAnalysis.py. The script would import the 32-bit score matrix finalized in section 4.8 and the pickled results from performing DBSCAN in section 4.9.

The script would then look at each cluster to gather the correct scores and calculate the cluster size for each cluster. The cluster’s id (called cluster-id below) and cluster size were added as a tuple to a list (here called list 1). If the checked cluster was not noise, the script would then calculate the centroid mean position for the cluster from the gathered scores and add the cluster-id, cluster centroid and cluster size as a tuple to another list (here called list 2).

List 1 was then sorted in decreasing order of cluster size and saved as a .npy file as 64-bit integers. The script then calculated the euclidean distance between the cluster centroids in list 2 and saved the distances as a 32-bit numpy array as a .npy file.
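A sketch of the two lists and the centroid distances (the function and output file names are invented for this illustration):

    import numpy as np
    from scipy.spatial.distance import pdist, squareform

    def analyze_clusters(scores, labels):
        # List 1: (cluster-id, size) tuples, sorted by decreasing size.
        sizes = sorted(((cid, int(np.sum(labels == cid))) for cid in set(labels)),
                       key=lambda t: t[1], reverse=True)
        np.save("cluster_sizes.npy", np.array(sizes, dtype=np.int64))
        # List 2: centroids of all non-noise clusters, then their pairwise
        # euclidean distances as a 32-bit matrix.
        centroids = np.array([scores[labels == cid].mean(axis=0)
                              for cid in sorted(set(labels)) if cid != -1])
        distances = squareform(pdist(centroids)).astype(np.float32)
        np.save("centroid_distances.npy", distances)
        return sizes, distances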

4.11 Studying the pseudo-clusters

Running InterComp on pseudo-clusters

The script ownListChange.py contained the two functions changeList() and runInterComp().

• changeList() was used to obtain the PDB id's within each pseudo-cluster from a listfile containing all pseudo-clusters, which was done through the use of regular expressions. The PDB id's for each pseudo-cluster were stored in a list. The list of PDB id's was then stored through pickle.dump().
• runInterComp() used the list of PDB id's stored by changeList() and ran InterComp for each list element/pseudo-cluster of PDB id's in the list. For each list element, its PDB id's were compared by InterComp after checking if a .started file and a .out file existed (which allowed the program to be run in parallel, a similar approach to section 4.4). When looking at the PDB id's in each list element, the first PDB id was used to name a .done file containing all PDB id's that had already been compared against. All created files were stored in a folder created through InterComp. The default values of the available options for alignment and annealing in InterComp were used.

Obtaining the p-values

The script ownScoreSearch.py was used to obtain the p-values from the .out files created by InterComp, by looking at each line in the .out files and storing the column containing the p-value in a list. The list of all p-values was then converted to a numpy array storing the p-values as 32-bit floats, which was then saved as a .npy file.
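A sketch of the p-value extraction; the p-value column index is an assumption, since the .out format is only described as being 15 columns wide:

    import glob
    import numpy as np

    def collect_pvalues(out_dir, pvalue_col=14):
        pvalues = []
        for outfile in glob.glob(out_dir + "/*.out"):
            with open(outfile) as fh:
                for line in fh:
                    cols = line.split()
                    if len(cols) == 15:  # skip malformed rows, cf. section 4.5
                        pvalues.append(float(cols[pvalue_col]))
        arr = np.asarray(pvalues, dtype=np.float32)
        np.save("pseudo_cluster_pvalues.npy", arr)  # file name assumed
        return arr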

Getting −log10(p-values)

This was made with the function negLog10matrix() described in section 4.7.

Plotting the distribution of the data

A histogram of the −log10(p-values) was created by the script plotData.py, which used the packages Seaborn, matplotlib and pandas for plotting. The number of bins in the histogram was determined in the script by the user, and the figure was stored as a .png file.
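A minimal plotting sketch; plain matplotlib is used here for portability (the thesis also used Seaborn and pandas), and the file names are assumptions:

    import matplotlib
    matplotlib.use("Agg")  # render without a display, e.g. on a cluster node
    import matplotlib.pyplot as plt
    import numpy as np

    values = np.load("pvalues_log.npy")  # -log10(p-values) from section 4.7
    plt.hist(values, bins=50)            # 50 bins, as in figure 5.1
    plt.xlabel("-log10(p-value)")
    plt.ylabel("count")
    plt.savefig("pvalue_distribution.png")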

4.12 Summary of scripts

Section 4.4, largeInterCompRun.py
  Input: file with PDB id's and folder with files containing PDB information.
  Output: folder with files from InterComp.

Section 4.5, check_outfile.py
  Input: file with PDB id's and folder with files from InterComp (section 4.4).
  Output: .tmp files for the .out files that contained errors.

Section 4.6, ownMatrixMaker.py
  Input: file with PDB id's and folder created in section 4.4.
  Output: file with score matrix as numpy array.

Section 4.7, ownMatrixScan.py
  Input: matrixchange(): file with score matrix from section 4.6. findMinusOne(): file with score matrix from section 4.6. negLog10Matrix(): file with matrix or list of p-values.
  Output: matrixchange(): file with new score matrix. findMinusOne(): number of -1.0 in score matrix. negLog10Matrix(): file with 32-bit numpy array of −log10(p-values).

Section 4.8, ownTo32bit.py
  Input: file with 64-bit score matrix from matrixchange() in section 4.7.
  Output: file with 32-bit score matrix.

Section 4.9, ownDBSCAN.py
  Input: file with 32-bit score matrix.
  Output: pickled results-file.

Section 4.10, ownAnalysis.py
  Input: file with 32-bit score matrix from section 4.8 and pickled results-file from section 4.9.
  Output: file with distances between clusters as 32-bit numpy array and file with cluster sizes as 64-bit integers.

Section 4.11, ownListChange.py
  Input: changeList(): file with pseudo-clusters. runInterComp(): pickled list-file of pseudo-clusters.
  Output: changeList(): pickled list-file of pseudo-clusters. runInterComp(): folder with files from InterComp.

Section 4.11, ownScoreSearch.py
  Input: .out files from InterComp in runInterComp().
  Output: file with 32-bit numpy array of p-values.

Section 4.11, plotData.py
  Input: file with numpy array of −log10(p-values).
  Output: .png image file of the histogram.

Table 4.1: Summary of the scripts used in this thesis, including their respective input and output.

5 Results

5.1 Clustering with DBSCAN

Number of clusters

The number of clusters found by DBSCAN for the 77k-dataset in section 4.10 can be found in tables 5.1 and 5.2.

Eps       Number of clusters
0.001     2
0.0005    10
5e-5      15
1e-5      18
5e-6      23
1e-6      20
5e-7      19
1e-7      5
5e-8      5
1e-8      2
5e-9      2
1e-9      1
5e-10     1
1e-10     1

Table 5.1: The number of clusters estimated by DBSCAN for different Eps when MinPts = 5, in decreasing size of Eps.

Eps       Number of clusters
0.001     1
0.0005    3
5e-5      3
1e-5      4
5e-6      3
1e-6      3
5e-7      1
1e-7      1
1e-8      1

Table 5.2: The number of clusters estimated by DBSCAN for different Eps when MinPts = 10, in decreasing size of Eps.

Size of selected clusters

A selection of the top three cluster sizes when MinPts = 5 and MinPts = 10 can be seen in tables 5.3 and 5.4 respectively.

Eps     Number of clusters   Biggest cluster (id)   2nd biggest cluster (id)   3rd biggest cluster (id)
5e-5    15                   73964 (0)              2976 (-1)                  9 (3)
5e-6    23                   71666 (0)              5221 (-1)                  14 (6)
1e-7    5                    66847 (0)              10149 (-1)                 6 (1)
5e-8    5                    66208 (0)              10786 (-1)                 8 (3)

Table 5.3: The top three biggest clusters and corresponding cluster-id (-1 equals noise) estimated by DBSCAN for a selection of Eps when MinPts = 5, in decreasing size of Eps. Note that the 3rd biggest cluster size (and smaller clusters) may occur multiple times.

Eps     Number of clusters   Biggest cluster (id)   2nd biggest cluster (id)   3rd biggest cluster (id)
0.0005  3                    75721 (0)              1289 (-1)                  5 (1)
5e-5    3                    73711 (0)              3293 (-1)                  9 (2)
1e-5    4                    72080 (0)              4912 (-1)                  10 (1)
5e-6    3                    71353 (0)              5649 (-1)                  10 (2)
1e-6    3                    69513 (0)              7489 (-1)                  10 (2)

Table 5.4: The top three biggest clusters and corresponding cluster-id (-1 equals noise) estimated by DBSCAN for a selection of Eps when MinPts = 10, in decreasing size of Eps. Note that the 3rd biggest cluster size (and smaller clusters) may occur multiple times.

5.2 Distribution of the pseudo-clusters

The distribution of the −log10(p-values) from section 4.11 can be seen in figure 5.1, where the number of bins in the histogram is set to 50. The number of data points is 15,354,597, with 13,298,824 (86.6 %) data points with p-value = 0 and 67,765 (0.4 %) data points with p-value = 1 (method not shown in the Method chapter).

5.3 The process

The final version of the GANTT schedule can be seen in figure 5.2. A few changes were made compared to the initial GANTT schedule in figure 2.1. The most noticeable change is that the three activities containing programming in the initial schedule were merged into a single activity. Another noticeable change is that the writing of the report started earlier than initially planned. The other changes were that the activity for searching literature was extended, the activities from week 19 and onward were moved forward in time, the final edits of the report were extended by one week and the milestone half-time meeting was held one week earlier.

Figure 5.1: The distribution of the −log10(p-values) with 50 bins. Note that the bar to the right (~90) is equal to the number of data points that are infinite in size (as the p-value → 0, −log10(p-value) → ∞; see negLog10Matrix() in section 4.7 for an explanation).

Figure 5.2: The final GANTT schedule. Activities A are shown in yellow and milestones M are shown in green. Dependencies D are shown between activities and milestones.

6 Discussion

6.1 Results

Clustering with DBSCAN

Based on tables 5.1 and 5.2 in section 5.1, it can be seen that for a selected value of MinPts, the number of clusters increases as Eps decreases until a peak in the number of clusters is reached, followed by a decrease in the number of clusters.

It can also be seen that for a bigger MinPts, the number of clusters decreases when compared at the same values of Eps. However, there may be a value of MinPts that yields a maximum number of clusters; whether that is the case has to be studied further.

The results from tables 5.3 and 5.4 in section 5.1 show that both when MinPts = 5 and when MinPts = 10, the first cluster (cluster-id 0) is the biggest of the found clusters, while the noise (cluster-id -1) is the second biggest.

It can also be seen that the third biggest cluster can vary in size and cluster-id when Eps decreases. As Eps decreases, there may be points at the border of or within the biggest cluster that are able to become their own cluster, as they are far enough from the biggest cluster while still containing enough points to form a cluster of their own. The change in the third biggest cluster-id may be caused by the previous third biggest cluster turning into noise as Eps becomes too small to obtain a core point.

It can be seen that as Eps decreases, the size of the biggest cluster decreases and the amount of noise increases, while the size of the third biggest cluster does not change much. Due to the big size difference between the regular clusters (not including noise), the explanation may be that a high number of points are located too close to each other, making it impossible to draw any conclusions about the data points (interfaces) in the different clusters.

Distribution of the pseudo-clusters

As can be seen in figure 5.1, the majority of the −log10 data points are of size ~90, which is equal to infinity (or p-value = 0). It can also be seen that the other visible data points are within the range zero to ten, which corresponds to 1 ≥ p-value ≥ 10⁻¹⁰. The number of p-values = 1 is 0.4 %, which shows that there are PDB id's that perhaps should not be in pseudo-clusters, and removing them could decrease the number of pseudo-clusters. However, these PDB id's are not likely to affect the overall clustering with DBSCAN, due to the low percentage of p-value = 1. Other high p-values are also not considered a major problem, since the data points in the left bin of the figure are still a small fraction of all data points.

To be able to tell if a pseudo-cluster is bad based on p-values, it has to be studied which pseudo-clusters contain high p-values.

6.2 Process

The initial GANTT schedule in figure 2.1 was made when the project started, which made it a rough estimation. However, when looking at the changes made in the final GANTT schedule in figure 5.2, it can be seen that the overall planning was a good estimation, as the only activities that were extended past their initial time frames were the literature search (which also continued later, but not to the same extent) and the final edits of the report. The earlier start of the report writing was made since time allowed it, as it was possible to start writing, for example, pieces of the theory. The activities and milestones of week 19 and onward were adjusted based on the known date of the presentation, while the half-time meeting was held one week earlier as that was the date that worked out best.

6.3 Method

The methods used in this work were dependent on the use of the available shared supercomputer resources, which makes it important that the scripts and files used are as efficient as possible, to maximize the use of the available resources.

The main change that could have been made is the use of 64-bit floats (dtype = float, which is equal to dtype = float64) when creating the original matrix in the script ownMatrixMaker.py in section 4.6, where the use of 32-bit floats (dtype = float32) would have led to the created file requiring half the amount of RAM while retaining good decimal accuracy. This was changed later during the thesis work with the script ownTo32bit.py in section 4.8, which made it possible to perform the DBSCAN algorithm with less RAM. If 32-bit floats had been used initially when the matrix was created, it would likely have led to more accessible computer resources, faster results and thereby the possibility to acquire more results from other tests. The dtype in the script ownMatrixMaker.py can be changed to dtype = float32 to get a 32-bit matrix from the start, making the script ownTo32bit.py unnecessary in that case.

Another change that could have been made is to not perform InterComp on the remaining 145,496 interfaces to obtain the total of 222,513 interfaces, as this was very costly in terms of computer resources, making it less likely to receive computer resources for other work within that resource period. It also took time trying to create a script that could handle the creation of a matrix of size 222,513². The work of creating a matrix for this data was later discontinued and priority was given to the 77,017 interfaces used initially.

6.4 The work in a wider context

The use of InterComp offers the possibility to compare interfaces with each other [3], which makes it possible to compare interfaces that have not been tested in the wet lab before. If an interface-interface interaction gives a good p-value from InterComp, it is then possible to test the interaction of the two interfaces in the wet lab, which can confirm or deny that the interaction exists.

If new interface-interface interactions are found in vitro and/or in vivo with, for example, proteomics, more knowledge about the interacting proteins and the host carrying the proteins can be gained, which in turn can lead to, for example, new medicines if the proteins are involved in a disease.

7 Conclusion

The aim of this thesis was to compare protein interfaces followed by clustering, to determine relationships and differences for the obtained clusters, as well as to try to answer whether the Protein Data Bank (PDB) is complete.

The final result is a set of Python scripts that allows the creation of distance matrices containing p-values, which can then be studied with the clustering algorithm DBSCAN. Unfortunately, the clustering did not provide any usable results, as there was one major cluster followed by noise and small clusters for all the tested parameters Eps and MinPts. The results for the pseudo-clusters show that a large fraction of the p-values equal 0 (so that −log10(p-value) is infinite), while a small fraction are p-values = 1.

Based on the obtained results, it is possible to see the effects of the tested parameters Eps and MinPts on the number of clusters; however, it is not possible to tell if there is anything in common within the clusters, if there are differences between the clusters, or if the PDB contains all interfaces, since these questions could not be studied.

7.1 Future prospects

Based on the work in this thesis, the following are suggestions for future work:

1. Study the 77k-data with other clustering algorithms.
2. Further study the p-values of the pseudo-clusters used to create the 77k-data, to determine if the used correlation [20] cutoff was a good choice (was it set too low?).
3. Include the 145,496 remaining interfaces with the initial 77,017 interfaces and perform clustering. This was partly done during the thesis, but is not mentioned in the Method chapter.
4. Study the full dataset of 578,884 interfaces with, for example, clustering.
5. Perform DBSCAN with other values of the parameters Eps and MinPts for the 77k-data.


Bibliography

[1] Fred P. Davis and Andrej Sali. "PIBASE: a comprehensive database of structurally defined protein interfaces". In: Bioinformatics 21.9 (May 1, 2005), pp. 1901–1907. ISSN: 1367-4803. DOI: 10.1093/bioinformatics/bti277. URL: http://dx.doi.org/10.1093/bioinformatics/bti277.

[2] M. Gao and J. Skolnick. "Structural space of protein-protein interfaces is degenerate, close to complete, and highly connected". In: Proceedings of the National Academy of Sciences 107.52 (Dec. 28, 2010), pp. 22517–22522. ISSN: 0027-8424, 1091-6490. DOI: 10.1073/pnas.1012820107. URL: http://www.pnas.org/cgi/doi/10.1073/pnas.1012820107 (visited on 01/25/2018).

[3] Claudio Mirabello and Björn Wallner. "Topology independent structural matching discovers novel templates for protein interfaces". In: bioRxiv (Jan. 1, 2017). DOI: 10.1101/235812. URL: http://biorxiv.org/content/early/2017/12/19/235812.abstract (visited on 02/28/2018).

[4] Martin Ester, Hans-Peter Kriegel, Jörg Sander, and Xiaowei Xu. "A density-based algorithm for discovering clusters in large spatial databases with noise". In: AAAI Press, 1996, pp. 226–231.

[5] Irene M.A. Nooren and Janet M. Thornton. "Diversity of protein–protein interactions". In: The EMBO Journal 22.14 (2003), pp. 3486–3492. ISSN: 0261-4189. DOI: 10.1093/emboj/cdg359. URL: http://emboj.embopress.org/content/22/14/3486.

[6] Changhui Yan, Feihong Wu, Robert L. Jernigan, Drena Dobbs, and Vasant Honavar. "Characterization of Protein–Protein Interfaces". In: The Protein Journal 27.1 (Jan. 2008), pp. 59–70. ISSN: 1572-3887, 1573-4943. DOI: 10.1007/s10930-007-9108-x. URL: http://link.springer.com/10.1007/s10930-007-9108-x (visited on 01/23/2018).

[7] Flowchart Maker & Online Diagram Software. Draw.io. 2018. URL: https://www.draw.io/ (visited on 06/18/2018).

[8] Michael Levitt and Mark Gerstein. "A unified statistical framework for sequence comparison and structure comparison". In: Proceedings of the National Academy of Sciences 95.11 (1998), pp. 5913–5920. ISSN: 0027-8424. DOI: 10.1073/pnas.95.11.5913. URL: http://www.pnas.org/content/95/11/5913.

[9] S. Henikoff and J.G. Henikoff. "Amino acid substitution matrices from protein blocks". In: Proceedings of the National Academy of Sciences of the United States of America 89.22 (Nov. 15, 1992), pp. 10915–10919. ISSN: 0027-8424. URL: http://www.ncbi.nlm.nih.gov/pmc/articles/PMC50453/.

[10] Britannica Academic. Cluster analysis. 2018. URL: http://academic.eb.com/levels/collegiate/article/cluster-analysis/605385# (visited on 02/20/2018).

[11] 15. Floating Point Arithmetic: Issues and Limitations — Python 3.6.5 documentation. Docs.python.org. URL: https://docs.python.org/3/tutorial/floatingpoint.html (visited on 05/23/2018).

[12] 12.1. pickle — Python object serialization. Docs.python.org. URL: https://docs.python.org/3/library/pickle.html (visited on 05/26/2018).

[13] Stéfan van der Walt, S. Chris Colbert, and Gaël Varoquaux. "The NumPy Array: A Structure for Efficient Numerical Computation". In: Computing in Science & Engineering 13.2 (2011), pp. 22–30. DOI: 10.1109/mcse.2011.37.

[14] F. Pedregosa, G. Varoquaux, A. Gramfort, V. Michel, B. Thirion, O. Grisel, M. Blondel, P. Prettenhofer, R. Weiss, V. Dubourg, J. Vanderplas, A. Passos, D. Cournapeau, M. Brucher, M. Perrot, and E. Duchesnay. "Scikit-learn: Machine Learning in Python". In: Journal of Machine Learning Research 12 (2011), pp. 2825–2830.

[15] Michael Waskom, Olga Botvinnik, drewokane, Paul Hobson, David, Yaroslav Halchenko, Saulius Lukauskas, John B. Cole, Jordi Warmenhoven, Julian de Ruiter, Stephan Hoyer, Jake Vanderplas, Santi Villalba, Gero Kunter, Eric Quintero, Marcel Martin, Alistair Miles, Kyle Meyer, Tom Augspurger, Tal Yarkoni, Pete Bachant, Mike Williams, Constantine Evans, Clark Fitzgerald, Brian, Daniel Wehner, Gregory Hitz, Erik Ziegler, Adel Qalieh, and Antony Lee. seaborn: v0.7.1 (June 2016). June 2016. DOI: 10.5281/zenodo.54844. URL: https://doi.org/10.5281/zenodo.54844.

[16] J.D. Hunter. "Matplotlib: A 2D graphics environment". In: Computing in Science & Engineering 9.3 (2007), pp. 90–95. DOI: 10.1109/MCSE.2007.55.

[17] Matplotlib Developers. matplotlib: v1.5.3. Sept. 2016. DOI: 10.5281/zenodo.61948. URL: https://doi.org/10.5281/zenodo.61948.

[18] Wes McKinney. "Data Structures for Statistical Computing in Python". In: Proceedings of the 9th Python in Science Conference. Ed. by Stéfan van der Walt and Jarrod Millman. 2010, pp. 51–56.

[19] Eric Jones, Travis Oliphant, Pearu Peterson, et al. SciPy: Open source scientific tools for Python. 2001–. URL: http://www.scipy.org/ (visited on 06/12/2018).

[20] Britannica Academic. Measure of association. 2018. URL: https://academic.eb.com/levels/collegiate/article/measure-of-association/627729#335772.toc (visited on 06/19/2018).
