The Department of Physics, Chemistry and Biology

Master of Science in Engineering Biology – Bioinformatics

LiTH‐IFM‐A‐EX‐08/2010‐SE

Structural characterization of overrepresented pentapeptide sequences

Fredrik Lysholm

Supervisor: Jonas Carlsson

Examiner: Bengt Persson

Copyright

The publishers will keep this document online on the Internet – or its possible replacement – for a period of 25 years starting from the date of publication barring exceptional circumstances. The online availability of the document implies permanent permission for anyone to read, to download, or to print out single copies for his/her own use and to use it unchanged for non‐commercial research and educational purpose. Subsequent transfers of copyright cannot revoke this permission. All other uses of the document are conditional upon the consent of the copyright owner. The publisher has taken technical and administrative measures to assure authenticity, security and accessibility. According to intellectual property law the author has the right to be mentioned when his/her work is accessed as described above and to be protected against infringement. For additional information about Linköping University Electronic Press and its procedures for publication and for assurance of document integrity, please refer to its www home page: http://www.ep.liu.se/.

© Fredrik Lysholm

Abstract

Background: Over the last decades, vast amounts of sequence information have been produced by various protein sequencing projects, which enables studies of sequential patterns. One of the best‐known efforts to chart short peptide sequences is the Prosite pattern data bank. While sequential patterns like those of Prosite have proved very useful for classifying protein families, functions etc., structural analysis may provide more information and possibly crucial clues linked to protein folding. Today the PDB, which is the main repository for protein structures, contains more than 50’000 entries, which enables structural protein studies.

Result: Strongly folded pentapeptides, defined as pentapeptides which retained a specific conformation in several significantly structurally different proteins, were extracted from PDB and studied. Among these, several groups were found. Possibly the most well defined is the “double Cys” pentapeptide group, with two amino acids between the cysteines (CXXCX|XCXXC), which was found to form backbone loops where the two cysteine residues formed a possible Cys‐Cys bridge. Other structural motifs were found in both helices and sheets, such as “ECSAM” and “TIKIW”, respectively.

Conclusion: There is much information to be extracted by structural analysis of pentapeptides and other oligopeptides. Some pentapeptides are clearly more likely to adopt a specific fold than others, and there are many strongly folded pentapeptides. By using such patterns in a protein folding model, such as the hydrophobic‐polar model, improvements in speed and accuracy can be obtained. Comparing structural conformations of important overrepresented pentapeptides can also help to identify and refine both structural information data banks such as SCOP and sequential pattern data banks such as Prosite.

Acknowledgements

I would like to thank my supervisor Jonas Carlsson for his support throughout the course of this project. He has provided me with valuable information on protein folding mechanisms and helped me choose what to investigate further and more thoroughly. I would also like to thank Bengt Persson for serving as my examiner.

1. Introduction
2. Background
2.1 Protein composition and structure
2.1.1 Amino acids
2.1.2 Proteins
2.1.3 Secondary structure
2.1.4 Tertiary structure and quaternary structure
2.2 Protein data banks
2.2.1 Background
2.2.2 UniProtKB
2.2.3 The Protein Data Bank (PDB)
2.2.4 SCOP (Structural Classification of Proteins)
2.3 Protein modelling
2.3.1 Molsoft ICM
2.3.2 Structural comparison in Molsoft ICM
2.4 Clustering and redundancy reduction
2.4.1 Identity
2.4.2 CD‐HIT
2.4.3 BLASTClust
3. Material and methods
3.1 Data set
3.2 Redundancy reduction
3.3 Calculation of pentapeptide frequency quotients
3.3.1 Calculation of pentapeptide frequencies
3.3.2 Determine the statistically expected frequencies
3.3.3 Definition of overrepresented
3.4 Structural investigation of the overrepresented pentapeptides
3.4.1 Downloading PDB
3.4.2 Preparing PDB and extracting sequence tags
3.4.3 Calculating surface ratio using Molsoft ICM
3.4.4 Finding PDB occurrences for pentapeptides
3.4.5 Redundancy reduction using BLASTClust clusters
3.5 Structural similarity characterization of pentapeptides
3.5.1 Clustering algorithm
3.5.2 Structural similarity characterization of pentapeptides
3.5.3 Strongly folded pentapeptides filtering
3.6 SCOP
3.6.1 Filtering
4. Results
4.1 Redundancy reduction
4.2 Calculation of positively selected pentapeptides
4.2.1 Calculation of oligopeptide frequencies
4.2.2 Choice of oligopeptide
4.2.3 Determination of the statistically expected frequencies
4.3 Structural investigation of the overrepresented pentapeptides
4.4 Structural clustering
4.4.1 Statistical overview
4.4.2 Linking SCOP to the PDB entries and further filtering
4.4.3 Grouping the PDB entries
4.5 Evaluation of strongly folded pentapeptides
4.5.1 ECGKA
4.5.2 ECSAM
4.5.3 TIKIW
4.6 Mismatching SCOP clusters
4.6.1 Classification example
5. Discussion
5.1 Relative frequency calculations of pentapeptides
5.2 Structural clustering
5.3 The existence of short oligopeptide structural motifs
5.4 SCOP by common oligopeptides
6. References

1. Introduction

Over the last decades, vast amounts of sequence information have been produced by various protein sequencing projects. For example, the Swiss‐Prot data bank, a thoroughly verified subset of the larger UniProtKB data bank, is updated biweekly and grows at an ever increasing rate. As of the 3rd of October 2007, Swiss‐Prot contained 287’050 entries comprising more than 100 million amino acid residues in total. This huge amount of sequence information makes it possible to investigate the frequencies of short protein sequences.

One of the best‐known efforts to chart short peptide sequences is the Prosite pattern data bank, which is closely linked to the Swiss‐Prot data bank (1). There have also been successful attempts at linking pentapeptides with secondary protein structure features (2). Accordingly, there is no doubt that short peptide sequence patterns carry information of both functional and structural importance.

In another study (3), the frequencies of pentapeptides in Swiss‐Prot and in proteins from 386 genome sets were studied. The frequencies were calculated by sliding a window of length five over the protein set and counting each occurrence presented in the window. In order to weight these pentapeptide occurrences, the expected frequency was approximated by a permutation of the dataset for each protein. The relative over‐ and under‐representation for three different kingdoms (archaea, bacteria and eukaryotes) was studied. One of the major conclusions of the study was that there are uncharted oligopeptide patterns of functional or structural importance which are worth investigating.

Because structural information is more complex to analyse than plain sequential information, good candidates for further analysis were needed. As the overrepresented pentapeptides are more biologically interesting, these were the most relevant to study structurally, i.e. they served as good candidates. Consequently, overrepresented pentapeptides were calculated for Swiss‐Prot but not divided by kingdom, as general structural motifs were sought, which should not depend on kingdom. These candidates were then investigated structurally, where it was proposed that a specific conformation of a pentapeptide found in different proteins, i.e. in different close surroundings, would be of extra interest. In order to find such cases, an algorithm for structural clustering was devised, see Figure 1.

Figure 1. An algorithm to present structurally interesting overrepresented pentapeptides.
1. Cluster each occurrence on structural similarity of the oligopeptide.
2. For each of these clusters, re‐cluster on structural similarity of the oligopeptide and the close surrounding.
3. Filter the entries to keep only those clusters which present an oligopeptide with similar fold in different surroundings.

To provide more information about the proteins/domains in which the occurrences were found, a SCOP (Structural Classification of Proteins) (see 2.2.4) class was linked to each of the occurrences. This provided even more information, making it possible to filter for occurrences from different proteins.

The resulting subset of pentapeptides was then studied more thoroughly. The pentapeptides were divided into different groups and a few pentapeptides were examined further.

2. Background

2.1 Protein composition and structure

2.1.1 Amino acids

Proteins are built from a repertoire of 20 different amino acids, and the same set of amino acids is used to form proteins in all species. An amino acid consists of an amino group, a carboxyl group, a hydrogen atom and a distinctive side chain, all bonded to a central carbon atom. The side chain is different for each of the 20 amino acids and gives each amino acid its particular properties. The amino acids are often grouped by these properties in various ways, for example by polarity, charge, aromatic/non‐aromatic character and size (4). Related to all of these properties is the hydropathy index, a number representing the hydrophilic or hydrophobic properties of an amino acid. This is very important in protein structure: hydrophobic amino acids tend to be internal, while hydrophilic amino acids are more commonly found towards the protein surface (5).

2.1.2 Proteins

Proteins are responsible for many functions within an organism. Nearly all chemical reactions in biological systems are catalyzed by proteins. Proteins are also responsible for transporting many small molecules, such as oxygen, which is transported by the protein hemoglobin. Proteins are the major component of muscles, and collagen, a fibrous protein, gives strength to skin and bone (4).

In proteins, the carboxyl group of one amino acid is joined to the amino group of another amino acid by a “peptide bond”, a type of amide bond. By joining many amino acids by peptide bonds, a polypeptide chain is formed. A polypeptide chain has direction because its building blocks have different ends, an amino group and a carboxyl group. Following the order of synthesis, the amino end is taken to be the beginning of a polypeptide chain. Consequently, a polypeptide chain consists of a regularly repeating part, called the main chain or backbone, and a variable part, comprising the distinctive side chains.

Most natural polypeptide chains contain between 50 and 2000 amino acid residues (4). Even though shorter fragments of polypeptide chains do not form biologically functional proteins, they carry information and are often referred to as oligopeptides. Oligopeptides of a specific length are named by Greek numerical prefixes instead of the oligo‐ prefix, i.e. mono‐ (one), di‐ (two), tri‐ (three), tetra‐ (four), penta‐ (five), hexa‐ (six) etc.

2.1.3 Secondary structure

Polypeptide chains can fold into two major regular structures, the alpha helix and the beta‐pleated sheet, which are referred to as secondary structural elements. The alpha helix is a rod‐like structure (see Figure 2, to the left) where the backbone forms the inner part of the coil and the side chains extend outwards. The coil is stabilized by hydrogen bonds between the NH and CO groups of the backbone. The alpha‐helix content of proteins ranges widely, from nearly none to almost 100% (4).

The beta‐sheet (see Figure 2, to the right) differs markedly from the rod‐like alpha helix. A polypeptide chain in a beta‐sheet is almost fully extended rather than being tightly coiled as in the alpha helix. Furthermore, the stabilizing hydrogen bonds between NH and CO groups are formed between different polypeptide strands rather than within the same strand. Adjacent chains in a beta‐sheet can run in the same direction (parallel beta‐sheet) or in opposite directions (anti‐parallel beta‐sheet) (4).

Figure 2. An alpha‐helix (α‐helix) to the left and an anti‐parallel three strand beta‐sheet (β‐sheet) to the right.

2.1.4 Tertiary structure and quaternary structure

A striking characteristic of proteins is that they have well‐defined three‐dimensional structures. Each sequence of amino acid residues has a specific conformation, and the function of that sequence (protein) arises from that conformation. This conformation or fold is referred to as the tertiary structure. Although a single folded chain is in many cases enough to give a protein its specific three‐dimensional conformation and function, in other cases several chains are needed to form a protein with biological function. This association of several equal or different chains is referred to as the quaternary structure of a protein (4).

2.2 Protein data banks

2.2.1 Background

Over the last decades, vast amounts of sequence information have been produced by various protein sequencing projects. For example, the Swiss‐Prot data bank, a thoroughly verified subset of the larger UniProtKB data bank, is updated biweekly and grows at an ever increasing rate. As of the 3rd of October 2007, Swiss‐Prot contained 287’050 entries comprising more than 100 million amino acids in total.

2.2.2 UniProtKB

UniProtKB (Universal Protein Resource Knowledgebase) is the world's most comprehensive catalogue of information on proteins. The UniProt Knowledgebase consists of two sections: a section containing manually annotated records with information extracted from literature and curator‐evaluated computational analysis, and a section with computationally analyzed records that await full manual annotation (6) (7). The two sections are referred to as “UniProtKB/Swiss‐Prot” and “UniProtKB/TrEMBL”, respectively.

2.2.2.1 UniProtKB/Swiss‐Prot

UniProtKB/Swiss‐Prot is a manually annotated protein knowledgebase established in 1986 and maintained since 2003 by the UniProt Consortium, a collaboration between the Swiss Institute of Bioinformatics (SIB) and the Department of Bioinformatics and Structural Biology of the Geneva University, the European Bioinformatics Institute (EBI) and the Georgetown University Medical Centre’s Protein Information Resource (PIR) (8). In this master’s thesis, Swiss‐Prot is the dataset from which the occurrences of short peptide sequences have been calculated.

2.2.2.2 UniProtKB/TrEMBL

UniProtKB/TrEMBL consists of computer‐annotated entries derived from the translation of all coding sequences (CDS) in the nucleotide sequence databases, except for CDS already included in Swiss‐Prot. It also contains protein sequences extracted from the literature and protein sequences submitted directly by the user community (8).

2.2.3 The Protein Data Bank (PDB)

The Protein Data Bank was founded in 1971 to provide a repository for three‐dimensional (3D) structure data of experimentally determined biological macromolecules. When the PDB was in its infancy, the archive contained seven structures composed of loosely structured free text. Today, the PDB archive has just passed 50’000 structures and relies upon strict ontologies that define the content of these entries (9) (10).

The PDB archive contains more than just 3D coordinate data; it also contains information about the chemical content such as polymer sequence and ligand chemistry, information about the experiment used to derive the structure and some qualitative descriptions of the structure (9). The structural information in PDB has been the most crucial resource in this master’s thesis, as it made it possible to study the structural conformations of different oligopeptides, and specifically of different pentapeptides.

2.2.4 SCOP (Structural Classification of Proteins)

The Structural Classification of Proteins (SCOP) database is a comprehensive ordering of all proteins of known structure, according to their evolutionary and structural relationships. A fundamental unit of classification in the SCOP database is the protein domain. A domain is defined as an evolutionary unit observed in nature either in isolation or in more than one context in multi‐domain proteins. The protein domains are classified hierarchically into families, superfamilies, folds and classes (11). The SCOP database is defined by human experts rather than by empirical rules implemented in a variety of bioinformatics algorithms and tools. Computational support in SCOP is then used to extend the human ability to analyse and interpret the data (11) (12).

2.3 Protein modelling

2.3.1 Molsoft ICM

Molsoft ICM (Internal Coordinate Mechanics) is a protein modelling software package. ICM can read several different 3D formats, including the PDB format, and interpret them into an internal 3D object, correcting bond types and charges for thousands of ligands. Furthermore, ICM can perform a large set of operations on a loaded structure, such as alignment, superposition of structures, protein surface calculations, pocket prediction and much more (13).

2.3.2 Structural comparison in Molsoft ICM

Structural comparison of two structures is a rather complex problem. The problem is usually twofold: first the structures are aligned and then, given the aligned parts, superimposed. The similarity or distance of the superimposed structures is then measured with some mathematical definition of structural similarity. The most popular measure of structural similarity is the root mean squared deviation (RMSD) (14).

Molsoft ICM natively defines a function called Rmsd(). With the “align” option, the function first performs an alignment of the two objects and then optimally superimposes them using McLachlan's algorithm (15). The root mean squared deviation (RMSD) is then calculated over the equivalent atom pairs of the two superimposed objects (16).
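As an illustration of the final step only, the RMSD over already paired and already superimposed atoms can be sketched in a few lines of Python. This is not the ICM Rmsd() function and it performs neither the alignment nor the superposition; the function name and the coordinate representation (lists of (x, y, z) tuples) are assumptions made for the example.

```python
import math


def rmsd(coords_a, coords_b):
    """RMSD between two equally long lists of (x, y, z) atom coordinates.

    Assumes atom i in coords_a is paired with atom i in coords_b and that
    the two structures have already been superimposed.
    """
    if len(coords_a) != len(coords_b) or not coords_a:
        raise ValueError("need two non-empty coordinate lists of equal length")
    total = 0.0
    for (xa, ya, za), (xb, yb, zb) in zip(coords_a, coords_b):
        # Squared Euclidean distance between the paired atoms.
        total += (xa - xb) ** 2 + (ya - yb) ** 2 + (za - zb) ** 2
    return math.sqrt(total / len(coords_a))
```

Identical structures give an RMSD of zero, and the value grows with the average displacement between paired atoms, which is why a tolerance on the RMSD can serve as a similarity threshold.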

2.4 Clustering and redundancy reduction

As each new protein added to the protein sequence data banks is not selected randomly but based on the interest of researchers, it can be assumed that the data banks do not contain a representative subset of all proteins on Earth. Consequently, in order to reduce redundancy and produce a subset of mostly unique proteins, it is necessary to cluster similar or identical sequences.

2.4.1 Identity

A commonly used measurement of similarity between two sequences is their identity. The identity is calculated as the percentage of equal residues between two optimally aligned sequences, where the percentage is calculated from the length of the shorter of the two.

2.4.2 CD‐HIT

The CD‐HIT algorithm was mainly developed to speed up redundancy reduction by clustering. The algorithm starts by sorting the database in order of decreasing length. The longest sequence becomes the representative of the first cluster, and then each sequence is tested against the representative of each cluster. If the similarity to any representative is above a given threshold, the sequence is added to that cluster; if not, a new cluster is created with that sequence as representative.

Generally the most time‐consuming part of sequence clustering is sequence alignment. By using short word filters, CD‐HIT avoids a large part of this, which significantly increases the speed at which sequences are compared. For example, two sequences having 85% identical residues over a 100‐residue window will have at least 70 identical dipeptides, 55 tripeptides and 25 pentapeptides. Therefore, by counting oligopeptides one can directly reject sequences that do not satisfy these conditions (17) (18).

2.4.2.1 UniRef

In an effort to produce a redundancy reduced UniProt, of which Swiss‐Prot is a subset, the CD‐HIT algorithm is applied to UniProt. The redundancy reduced database is called UniRef and comes in three versions, UniRef100, UniRef90 and UniRef50, which are clustered at 100, 90 and 50% identity, respectively. UniRef100, UniRef90 and UniRef50 yield a database size reduction of approximately 10, 40 and 70%, respectively, from the source sequence set. UniRef has already been applied to broad research areas ranging from genome annotation to proteomics data analysis (19).

2.4.3 BLASTClust

BLASTClust is a program within the standalone BLAST package used to cluster either protein or nucleotide sequences. The program begins with pair‐wise matches and places a sequence in a cluster if the sequence matches at least one sequence already in the cluster. In the case of proteins, the blastp algorithm is used to compute the pair‐wise matches using default options. This includes the usage of the BLOSUM62 matrix, a gap opening cost of 11, a gap extension cost of 1 and no low‐complexity filtering (20).
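The short word filter described in 2.4.2 can be illustrated with a small Python sketch. It counts the words of length k that two sequences have in common and compares the count with a required minimum, so that an expensive alignment is only attempted for pairs that pass. This is a simplified illustration under stated assumptions, not the CD‐HIT implementation; the function names are invented for the example, and the minimum is passed in by the caller rather than derived from the identity threshold.

```python
from collections import Counter


def word_counts(seq, k):
    """Count every overlapping word of length k in seq."""
    return Counter(seq[i:i + k] for i in range(len(seq) - k + 1))


def shared_words(seq_a, seq_b, k):
    """Number of length-k words common to both sequences (multiset overlap)."""
    counts_a = word_counts(seq_a, k)
    counts_b = word_counts(seq_b, k)
    return sum(min(n, counts_b[word]) for word, n in counts_a.items())


def passes_word_filter(seq_a, seq_b, k, min_shared):
    """True if the pair shares enough words to be worth aligning."""
    return shared_words(seq_a, seq_b, k) >= min_shared
```

With the figures quoted above, a pair compared over a 100‐residue window at an 85% threshold would be tested with `min_shared` set to 70 for k = 2, 55 for k = 3 or 25 for k = 5. Because the count ignores word positions, it can only overestimate the number of words that are identical in an alignment, so the filter does not reject a pair that would meet the threshold.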

3. Material and methods

3.1 Data set

For this master’s thesis mainly two data sets have been utilized. For initial calculations of overrepresented pentapeptide patterns the Swiss‐Prot data bank has been used. Secondly, a database which holds structural information for proteins was needed in order to structurally characterize these pentapeptides. For this purpose the PDB data bank has been used.

3.2 Redundancy reduction

In initial trials it was soon discovered that many overrepresented pentapeptides came from large groups of well studied proteins which were found to be abundant in Swiss‐Prot. As the major purpose was to find and characterize structurally important pentapeptides existing in widely different proteins, a method of reducing the influence of these overrepresented proteins was needed. The UniRef (UniProt Reference Clusters) database is a UniProt database clustered by identity using the CD‐HIT algorithm. Since UniProt (and therefore UniRef) contains both Swiss‐Prot and TrEMBL sequences, the Swiss‐Prot sequences had to be extracted. This was achieved using a hash table as lookup for Swiss‐Prot identifiers, which in turn were extracted from the Swiss‐Prot FASTA file.

3.3 Calculation of pentapeptide frequency quotients

3.3.1 Calculation of pentapeptide frequencies

There are several approaches to counting the occurrences of each of the 3.2 million (20^5) possible pentapeptides. The least memory intensive and possibly the simplest method is to search the data for each pentapeptide in turn and write the result to a file. Although this method hardly consumes any memory at all, it is extremely slow. A modern computer processes approximately 15 MB/s, so one scan of a 150 MB dataset takes about 10 seconds, and 3.2 million scans would take about 32 million seconds (more than a year).

A better method is to scan the dataset fewer times, counting more than one pentapeptide per scan. The number of scans needed is determined by the available memory and by how many pentapeptide counters fit in that memory. The usual word counting algorithms use a list of counters where each entry is found quickly through some data structure, usually a b‐tree or a hash table. Although b‐trees and hash tables are very well suited to keeping track of words from an unknown set, there are better methods for a known set of words. As each pentapeptide consists of a chain of five amino acids and there are 20 different amino acids, each pentapeptide can be interpreted as a five‐digit number in base 20. This way each pentapeptide is given an index in an integer array. Allocating a 32‐bit integer array for all 3.2 million (20^5) pentapeptides requires 12.2 MB of memory, which is a fairly small amount for a modern computer. With this method and a data set of 150 MB, the time needed to calculate frequencies for all pentapeptide combinations is approximately 15 seconds.

3.3.2 Determine the statistically expected frequencies

In a limited and incomplete dataset such as the Swiss‐Prot data bank, which does not contain all proteins in the world, the observed frequency of pentapeptides needs to be put into perspective. Additionally, each amino acid residue has a different background frequency. For example, the sequence “QQQQQ” should statistically occur approximately 131 times and the sequence “WWWWW” 0.063 times in Swiss‐Prot. In order to determine the expected frequency of each pentapeptide, given a limited dataset with a fixed number of sequences and a fixed number of amino acids, the dataset was randomized 1000 times and the means were calculated. This served as a good approximation of the expected occurrence. It was then possible to calculate the quotient between the observed and the expected frequency for each pentapeptide. This quotient makes it possible to establish the most over‐ and under‐represented pentapeptides in the Swiss‐Prot data bank.

3.3.3 Definition of overrepresented

Instead of just choosing the most overrepresented pentapeptides by the quotient between observed and expected frequency, a requirement of an expected frequency of at least 2 was added to the definition (which excluded 3.9% of the pentapeptides). Without this extra requirement, a few pentapeptides acquired an unreasonably high quotient due to very low calculated expected frequencies. The requirement also guards against sequence errors, which could otherwise cause biologically non‐existent pentapeptides to appear overrepresented. Furthermore, as these pentapeptides are not abundant in Swiss‐Prot, it is unlikely that any structural information for them would be found in PDB.
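The counting and weighting described in 3.3.1 to 3.3.3 can be sketched in Python as follows. This is a minimal illustration under stated assumptions, not the program used in this thesis: the amino acid ordering, all function names, the skipping of windows containing non‐standard residue codes, and the choice to randomize by shuffling the residues within each sequence are assumptions made for the example.

```python
import random
from array import array

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"  # assumed ordering of the 20 residues
CODE = {aa: i for i, aa in enumerate(AMINO_ACIDS)}
K = 5
TABLE_SIZE = 20 ** K  # 3,200,000 possible pentapeptides


def pentapeptide_index(peptide):
    """Read a pentapeptide as a five-digit base-20 number."""
    index = 0
    for aa in peptide:
        index = index * 20 + CODE[aa]
    return index


def count_pentapeptides(sequences, counts=None):
    """Slide a window of length five over each sequence and count into an array."""
    if counts is None:
        counts = array("I", [0]) * TABLE_SIZE  # one unsigned counter per pentapeptide
    for seq in sequences:
        for start in range(len(seq) - K + 1):
            window = seq[start:start + K]
            # Assumption: windows with non-standard codes (e.g. X) are skipped.
            if all(aa in CODE for aa in window):
                counts[pentapeptide_index(window)] += 1
    return counts


def shuffled_totals(sequences, n_shuffles=1000, seed=0):
    """Sum the pentapeptide counts over n_shuffles randomized copies of the data."""
    rng = random.Random(seed)
    totals = array("I", [0]) * TABLE_SIZE
    for _ in range(n_shuffles):
        # Assumption: residues are permuted within each sequence.
        shuffled = ["".join(rng.sample(seq, len(seq))) for seq in sequences]
        count_pentapeptides(shuffled, totals)
    return totals


def frequency_quotient(peptide, observed, totals, n_shuffles, min_expected=2.0):
    """Observed / expected count, or None if the expected count is below the cutoff."""
    i = pentapeptide_index(peptide)
    expected = totals[i] / n_shuffles
    if expected < min_expected:
        return None
    return observed[i] / expected
```

Because the index is computed directly from the residues, no tree or hash lookup is needed, and a single pass over the data updates all counters.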

3.4 Structural investigation of the overrepresented pentapeptides

To further investigate the overrepresented pentapeptides found in Swiss‐Prot, structural information was needed. The PDB data bank is the major structural protein database and was therefore chosen in order to extract additional information about the overrepresented pentapeptides.

3.4.1 Downloading PDB

RCSB (Research Collaboratory for Structural Bioinformatics) provides PDB in several formats (for example in FASTA format). However, if the full structural information for each PDB entry is needed, the entries have to be downloaded as separate PDB data files. These data files are available via a web interface where a single data file can be downloaded at a time. In order to proceed efficiently with the structural investigation, the whole PDB database was needed. A Perl script was therefore created for this repetitive task, using the PDB FASTA file as a dictionary of PDB identifiers. The script was executed and the data files for each of the 47'771 entries were downloaded in approximately 12 hours.

3.4.2 Preparing PDB and extracting sequence tags

As the PDB database consists of about 30 GB of data, mainly occupied by atom coordinates, it was clear that it would be inefficient to re‐extract the needed structural information for each overrepresented pentapeptide occurrence. Consequently, the PDB database was scanned, and for each entry the sequence was extracted along with secondary structural information such as α‐helix, β‐sheet, SS‐bonds and associations to other molecules. This proved to be harder than expected, as the PDB database contains inconsistent sequence indexing and in some cases even inaccurate data. The problem was solved by normalizing the sequence indices and by disregarding the few cases of inaccurate structural information.

3.4.3 Calculating surface ratio using Molsoft ICM

Additional structural information of interest was the location of the pentapeptide, i.e. whether it is located in the core or towards the surface of the protein. With protein modelling software such as Molsoft ICM it is possible to calculate the protein surface and therefore the surface exposed ratio of each amino acid. Furthermore, the sequence and secondary structure were verified against the interpretation of the PDB entry made by ICM.

3.4.4 Finding PDB occurrences for pentapeptides

Although the Swiss‐Prot databank does link to PDB entries, using these links for mapping Swiss‐Prot occurrences to PDB entries proved to be inefficient. This was mainly because only a few Swiss‐Prot entries had links to PDB entries, and many of these linked to PDB entries which did not contain the pentapeptide of interest, as often only parts of the proteins are crystallized. Since the pentapeptides of interest were already found (see 3.3), it was just a matter of extracting structural information about each of these. The most efficient way to find this information was to scan the PDB sequences for these pentapeptides.

The algorithm for extracting and linking this PDB information for each possible pentapeptide is fairly straightforward. The first step was to load the previously prepared PDB information into memory. Then the sequences of the loaded PDB entries were scanned for occurrences of all pentapeptides, and for each pentapeptide a list of PDB entries was created, each referred to by an array index. Finally, each sequence was processed and the resulting information was written into a new PDB linked pentapeptide file.

3.4.5 Redundancy reduction using BLASTClust clusters

While the pentapeptides of the PDB linked pentapeptide database were calculated on redundancy reduced data, the PDB entries linked to each pentapeptide were not. This was a problem when statistics were calculated on these links, as for example the helix or sheet fraction would be incorrectly influenced by large groups of equal or very similar structures. Accordingly, redundancy reduction was needed among the PDB entries to provide grounds for statistical analysis of the structural information. The RCSB website provides two different clusterings of PDB: one set of files, at 50‐95% identity, where the CD‐HIT algorithm has been used, and one set, at 30‐95% identity, where BLASTClust has been used. Because the provided BLASTClust cluster files covered a broader spectrum than the CD‐HIT clusters, the BLASTClust clusters were used.
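The scanning and redundancy reduction steps of 3.4.4 and 3.4.5 can be sketched as follows. This is an illustrative Python sketch, not the program used in this thesis. The data layout (a list of (PDB identifier, sequence) pairs and a dictionary from PDB identifier to cluster identifier), the function names, and the rule of keeping the first occurrence seen in each cluster are all assumptions made for the example.

```python
from collections import defaultdict


def index_occurrences(entries, targets, k=5):
    """Map each pentapeptide of interest to its occurrences.

    entries: list of (pdb_id, sequence) pairs already loaded into memory.
    targets: set of pentapeptides to look for.
    Returns a dict: pentapeptide -> list of (entry_index, position).
    """
    occurrences = defaultdict(list)
    for entry_index, (_pdb_id, sequence) in enumerate(entries):
        for pos in range(len(sequence) - k + 1):
            window = sequence[pos:pos + k]
            if window in targets:
                occurrences[window].append((entry_index, pos))
    return occurrences


def one_per_cluster(hits, entries, cluster_of):
    """Keep the first occurrence from each sequence cluster.

    cluster_of: dict pdb_id -> cluster id, e.g. read from a cluster file.
    Entries without a cluster are treated as their own cluster.
    """
    seen = set()
    kept = []
    for entry_index, pos in hits:
        pdb_id = entries[entry_index][0]
        cluster = cluster_of.get(pdb_id, pdb_id)
        if cluster not in seen:
            seen.add(cluster)
            kept.append((entry_index, pos))
    return kept
```

Storing occurrences as array indices rather than copies of the entries keeps the index small, and statistics such as the helix or sheet fraction can then be computed on the reduced list returned by `one_per_cluster`.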

3.5 Structural similarity characterization of pentapeptides

The same three‐dimensional structure does not necessarily imply a common ancestor; likewise, a common ancestor does not always imply a common function, but probably a shared fold. The relation of sequence similarity and structural or functional properties is rather loosely defined, but a widely accepted rule‐of‐thumb is that 30% identity and over is sufficient for assuming a common ancestor (21). Therefore it was soon discovered that in order to efficiently analyze the pentapeptides some type of structural similarity clustering was needed. Furthermore, structurally similar pentapeptides in different structural context were of special interest in order to structurally classify pentapeptides. 3.5.1 Clustering algorithm The Molsoft ICM provides functionality for comparing different structures by calculating the RMSD (Root Mean Square Distance) between the two structures. Because Molsoft ICM was used to perform the RMSD and it is a somewhat time‐consuming task, an algorithm that involved as few comparisons as possible was needed. Additionally, as the structural composition of a sequence is both of high dimension and rather complex no cluster centroids could easily be formed and thus no centroid

(18)

forming clustering algorithm could be used. Both complete linkage and a modification of the CD‐HIT algorithm were explored. Ο 2 , Ο , Equation 1. Describing the maximum amount of comparisons needed for complete linkage (left) and the CD‐HIT algorithm (right) where n is the number of entries clustered and c is the number of resulting clusters. Because the number of clusters (c) is generally much less than the amount of entries (n) the CD‐HIT algorithm was extensively faster and was used to perform the structural clustering. The following modified CD‐HIT algorithm was used. 1. Sort entries in descending size order (using sequence length as size). 2. Place the first (largest) entry as the representative of the first cluster. 3. For each remaining entry compare to each cluster representative, if RMSD between the entry and the cluster representative is less than a given tolerance add that entry to that cluster. If no cluster was found add a new cluster and place the entry as cluster representative. 3.5.2 Structural similarity characterization of pentapeptides With the modified CD‐HIT algorithm a way of structurally comparing two entries and by that structurally clustering a set of entries, making it possible to present the different unique structural conformations of each pentapeptide. To present these unique structural conformations and in addition present for each of these conformations unique structural conformations of the surroundings, the following algorithm was proposed. 1. Cluster PDB entries on structural similarity of the pentapeptide region alone to form clusters of unique pentapeptide shapes. 2. For each of these clusters, cluster the entries on the pentapeptide surroundings to form clusters of unique surroundings. The similarity of the surroundings was defined as the maximum similarity of the pentapeptide and 10 amino acids upchain and downchain, respectively. 3. Write the cluster partitions for each pentapeptide to a file. 
3.5.3 Strongly folded pentapeptides filtering

Pentapeptides that proved structurally similar across proteins with different folds were of special interest. First, however, a definition of these was required. A strongly folded pentapeptide was defined as a pentapeptide having no more than 3 structural conformations (clusters) of the pentapeptide itself and at least 50% more subclusters (clusters of surroundings) than clusters. This produces a subset of pentapeptides with a single or a few structural conformations independent of structural context. The resulting set could then be further studied and investigated manually.
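The filtering criterion can be written as a small predicate. The sketch below assumes one particular reading of the definition, namely that "at least 50% more subclusters than clusters" means the number of subclusters is at least 1.5 times the number of clusters; the function name and parameter names are hypothetical.

```python
def is_strongly_folded(n_clusters: int, n_subclusters: int,
                       max_clusters: int = 3, min_excess: float = 0.5) -> bool:
    """True if there are few pentapeptide conformations but many distinct surroundings.

    Assumed reading: n_clusters <= 3 and n_subclusters >= 1.5 * n_clusters.
    """
    if n_clusters < 1:
        return False  # no structural data for this pentapeptide
    return (n_clusters <= max_clusters
            and n_subclusters >= (1.0 + min_excess) * n_clusters)


# One conformation found in two different surroundings qualifies;
# one conformation in a single surrounding does not.
print(is_strongly_folded(1, 2), is_strongly_folded(1, 1))
```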


3.6 SCOP

To aid in properly analysing and further explaining the selected subset of pentapeptides (see 4.5), more information about the PDB entries was needed. SCOP (Structural Classification of Proteins) arranges a large part of PDB into a hierarchical structure with levels such as fold, superfamily and family. For each cluster representative (clusters of different folds and surroundings) SCOP classifications were added, including class, superfamily and family. For some clusters only one fold was found according to SCOP, indicating a known relation even though the surrounding (upchain and downchain) amino acids of the pentapeptide of interest were significantly different and produced several secondary clusters.

3.6.1 Filtering

SCOP introduced more information about the pentapeptides, which allowed more extensive filtering. All PDB entries from the “designed” base class were removed, because similar copies of designed proteins caused their pentapeptides to be deemed strongly folded. Furthermore, as the dataset was still large enough, all pentapeptides with fewer than two superfamily references were removed (among them all pentapeptides without a SCOP reference).


4. Results

4.1 Redundancy reduction

UniRef comes in three different levels of identity clustering, and all three were used. The UniRef50 database, parsed for Swiss‐Prot entries, was given the name SwissRef50; correspondingly, UniRef90 became SwissRef90 and UniRef100 became SwissRef100.

Database     Identity clustering  Mean length  Proteins  Size
SwissRef50   50%                  399.42 AA    95034     44.3 MB
SwissRef90   90%                  380.23 AA    193845    87.2 MB
SwissRef100  100%                 380.20 AA    259159    116.6 MB
SwissProt    ‐                    367.19 AA    285335    125.9 MB

Table 1. Databases formed from UniRef and Swiss‐Prot and their respective size.

The table above describes how the databases are affected by the identity clustering algorithm, CD‐HIT. Notably, the mean protein length increases with reduced identity threshold, as an effect of the CD‐HIT algorithm keeping the longest sequence as cluster representative.

4.2 Calculation of positively selected pentapeptides

4.2.1 Calculation of oligopeptide frequencies

The first task at hand was to calculate the number of occurrences for each oligopeptide sequence within each dataset.
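A minimal sketch of such an occurrence count is given here, assuming each oligopeptide is mapped to a base‐20 index into one flat table of unsigned integers. The alphabet order, the function names, and the choice to skip windows containing non‐standard residues are assumptions made for the example, not details taken from the thesis.

```python
from array import array
from typing import Iterable

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"  # the 20 standard residues (order is arbitrary)
INDEX = {aa: i for i, aa in enumerate(AMINO_ACIDS)}


def encode(peptide: str) -> int:
    """Map a peptide to its base-20 table index; raises KeyError on other letters."""
    code = 0
    for aa in peptide:
        code = code * 20 + INDEX[aa]
    return code


def count_oligopeptides(sequences: Iterable[str], k: int) -> array:
    """Count every (overlapping) k-residue window in a single pass over the data."""
    counts = array("I", [0]) * (20 ** k)
    for seq in sequences:
        for start in range(len(seq) - k + 1):
            try:
                counts[encode(seq[start:start + k])] += 1
            except KeyError:
                continue  # window contains a non-standard residue such as X
    return counts


def table_bytes(k: int, bytes_per_entry: int = 4) -> int:
    """Size of the occurrence table with one 32-bit counter per combination."""
    return (20 ** k) * bytes_per_entry


counts = count_oligopeptides(["ACACA", "XAC"], k=2)
print(counts[encode("AC")], table_bytes(6))
```

With 4 bytes per counter the hexapeptide table needs 20^6 × 4 = 256,000,000 bytes, so all tables up to length 6 can be held in memory at once.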

Peptides   Combinations        Size of result file  Time to calculate  Time to write
3 (tri)    8'000 (20^3)        81.8 KB              10s                0s
4 (tetra)  160'000 (20^4)      1.53 MB              11s                1s
5 (penta)  3'200'000 (20^5)    29.6 MB              15s                3s
6 (hexa)   64'000'000 (20^6)   612 MB               18s                73s

Table 2. Times to calculate and write occurrences of oligopeptides in the Swiss‐Prot data bank.

The time needed to calculate the occurrence table scales very well, as expected. This is related to the size of these occurrence tables, which is the number of combinations times the size of a 32‐bit integer, 4 bytes. Because only 256 MB of RAM is needed for the largest of these tables, they all fit in RAM and a single pass is sufficient for calculating all pentapeptides. Consequently, the time needed to calculate each table is more or less limited by the amount of data processed, in this case 126 MB.

4.2.2 Choice of oligopeptide

Peptides   Combinations      Not found  Coverage
3 (tri)    8'000 (20^3)      0          100%
4 (tetra)  160'000 (20^4)    1          100%


When analysing the resulting sets of oligopeptides for 3‐, 4‐, 5‐ and 6‐peptide sequences, the coverage was complete for tripeptides, one sequence was not found for tetrapeptides, 5% were not found for pentapeptides and 59% were not found for hexapeptides. The pentapeptides (5‐peptides) thus provide almost complete coverage (5% not found) while being more specific than 3‐ and 4‐peptides, i.e. being longer. Furthermore, pentapeptides were the main oligopeptides investigated in the previous study (3) of oligopeptide patterns that laid the foundation for this thesis. Consequently, the pentapeptides were chosen for further study.

4.2.3 Determination of the statistically expected frequencies

A more demanding part of the calculation of the positively selected pentapeptides was to calculate the mean occurrence of each pentapeptide over a given number of permutations of the data. Expected frequencies for all pentapeptides were calculated using different numbers of permutations, and the results were then compared.
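The permutation procedure can be sketched as follows, assuming (as described in section 5.1) that residues are shuffled only within each protein so that protein lengths and compositions are preserved. The function names and the fixed random seed are illustrative choices, not part of the original implementation.

```python
import random
from collections import Counter
from typing import Dict, List


def count_windows(sequences: List[str], k: int = 5) -> Counter:
    """Count all overlapping k-residue windows in the given sequences."""
    counts: Counter = Counter()
    for seq in sequences:
        for start in range(len(seq) - k + 1):
            counts[seq[start:start + k]] += 1
    return counts


def expected_counts(sequences: List[str], permutations: int,
                    k: int = 5, seed: int = 0) -> Dict[str, float]:
    """Mean count of each k-peptide over shuffled copies of the data set."""
    rng = random.Random(seed)
    totals: Counter = Counter()
    for _ in range(permutations):
        shuffled = []
        for seq in sequences:
            residues = list(seq)
            rng.shuffle(residues)  # permute within one protein only
            shuffled.append("".join(residues))
        totals.update(count_windows(shuffled, k))
    return {pep: total / permutations for pep, total in totals.items()}


# A homopolymer is unchanged by shuffling, so its expected count is exact.
print(expected_counts(["AAAAAA"], permutations=10)["AAAAA"])
```

Peptides that never appear in any shuffled copy are simply absent from the returned dictionary, which corresponds to an estimated expected count of zero at that number of permutations.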

Permutations  Standard deviation  Mean deviation
1             5.691               22.73%
5             2.548               10.14%
10            1.802               7.18%
50            0.809               3.23%
100           0.575               2.29%
500           0.267               1.06%
1000          0.197               0.79%

Table 4. Statistics for different numbers of permutations, where 5000 permutations have been used as the true mean.

The accuracy aimed for was a compromise between what was practical in terms of calculation time and what would be sufficiently accurate for calculating the over/underrepresentation ratio of each pentapeptide. It is clear that with more than a few hundred permutations one obtains a mean deviation of approximately one per cent, which is sufficient. As 1000 permutations are computed in about 2.5 hours, and this is more than sufficient, this became the value used for the different databases.

When the statistically expected frequencies had been calculated, the remaining task was to form quotients for each pentapeptide and rank them in descending order.
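Forming and ranking the quotients amounts to one division per pentapeptide followed by a sort. In the sketch below the function name is hypothetical, the counts are made‐up toy values, and skipping pentapeptides whose expected count is zero is an assumption made to avoid division by zero.

```python
from typing import Dict, List, Tuple


def rank_by_quotient(observed: Dict[str, int],
                     expected: Dict[str, float]) -> List[Tuple[str, float]]:
    """Return (pentapeptide, observed / expected) pairs, most overrepresented first."""
    quotients = {
        pep: observed[pep] / expected[pep]
        for pep in observed
        if expected.get(pep, 0.0) > 0.0  # assumption: skip undefined quotients
    }
    return sorted(quotients.items(), key=lambda item: item[1], reverse=True)


# Toy counts, not taken from the thesis data.
observed = {"AAAAA": 30, "CCCCC": 5, "DDDDD": 8}
expected = {"AAAAA": 10.0, "CCCCC": 0.5, "DDDDD": 0.0}
print(rank_by_quotient(observed, expected))
```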

Pentapeptide SwissProt SwissRef90 SwissRef50

WWNFG (SwissProt) 694 (Rank 1) 181 (Rank 5) 9.23 (Rank 2364)

WFQNR (SwissRef90) 184 (Rank 28) 195 (Rank 1) 170 (Rank 2)

HQRIH (SwissRef50) 147 (Rank 49) 171 (Rank 7) 190 (Rank 1)

Table 5. Rank leaders with quotient for that database and corresponding values for the other databases.

The above table shows the differences in quotients produced by the identity clustering which are rather extensive for some sequences. For example the rank leader for Swiss‐Prot, WWNFG is 694 times overrepresented in Swiss‐Prot, found 1590 times, while only 9.23 times overrepresented in SwissRef50, found 7 times.


4.3 Structural investigation of the overrepresented pentapeptides

While the resulting quotients and overrepresented pentapeptides were of interest, it was also considered important to obtain structural information on these pentapeptides. To this end, PDB (the Protein Data Bank) was used to extract as much structural information as possible for each pentapeptide. The first task was to confirm the previous preference for using SwissRef50 representatives for further structural characterization. The top 1000 representatives of each database were extracted and compared, firstly, with each other and, secondly, with the overall mean for all pentapeptides represented in PDB.

Pentapeptides    Count    Core dominated  Hydropathy index  Mean size  Percentage unique PDB references
Top SwissRef50   1000     44.0%           ‐0.85             271.12     49.03%
Top SwissRef90   1000     61.8%           ‐0.72             333.58     40.30%
Top SwissRef100  1000     68.4%           ‐0.61             376.07     26.30%
Top SwissProt    1000     67.7%           ‐0.61             369.86     26.12%
All              1557537  48.4%           ‐0.50             332.04     0.47%

Table 6. PDB information extracted from different sets of pentapeptides.

What is perhaps most striking is that the overrepresented pentapeptides of SwissRef50 are a lot more surface orientated than those of the other databases. This might be because, without redundancy reduction by identity clustering, large groups of similar molecules produce disproportionately many representatives of these molecules. This is somewhat confirmed by the percentage of unique PDB references, which shows that with higher redundancy reduction the representatives are found in a wider range of PDB entries. The corresponding hydropathy index is also lower, i.e. more hydrophilic, for SwissRef50 compared to the other sets of pentapeptides. Consequently, the SwissRef50 set seems to be more surface orientated and more widely distributed, while the pentapeptides of Swiss‐Prot seem to be concentrated in fewer and larger PDB molecules. Although the pentapeptides of SwissRef50 are more surface orientated, the fact that they are more widely distributed among different PDB entries makes them more appealing to study further.

To further characterize the pentapeptides, secondary structural information was assessed for each database, i.e. whether or not any part of the peptide chain is ligand‐bonded, SS‐bonded or part of a helix or sheet.

Pentapeptides    Count    α‐helix  β‐sheet  SS‐bond  Ligand  Without structure
Top SwissRef50   1000     52.4%    37.4%    9.5%     11.8%   15.1%
Top SwissRef90   1000     60.6%    37.4%    7.1%     7.2%    12.5%
Top SwissRef100  1000     71.6%    32.8%    5.6%     6.6%    9.3%
Top SwissProt    1000     71.1%    32.0%    5.3%     6.6%    10.2%
All              1557537  65.13%   47.57%   3.6%     0.76%   9.5%

Table 7. Secondary structural information of overrepresented pentapeptides based on PDB database tags.


Identity clustering appears to reduce the proportion of overrepresented pentapeptides in helices. This might be explained by the previous conclusion that less identity clustering produces a set of pentapeptides drawn more generally from many similar large molecules, which in turn, because of their size, statistically contain more helix formations. It is also worth noticing that each row sums to more than a hundred per cent, which indicates that some pentapeptides are part of more than one secondary structural motif.

The proportion of sheet structure in overrepresented pentapeptides is fairly indifferent to identity clustering, but is slightly lower than for all pentapeptides. This suggests that sheets are more seldom part of the active site and that no specific pentapeptides generally form sheets, as they would then be overrepresented.

Perhaps the most important observation from this table is that the proportions of disulfide‐bonded (SS‐bond) and ligand‐bonded pentapeptides rise among overrepresented pentapeptides compared to all pentapeptides, and also with increasing identity clustering. This is most likely because both SS‐bonds and parts bonded to a ligand are generally functionally or structurally important. Consequently, the identity clustering favours structurally or functionally important pentapeptides.

4.4 Structural clustering

4.4.1 Statistical overview

The time required to perform structural clustering using the described method (see 3.5) was approximately 1 day per 1000 pentapeptides, which made clustering of all pentapeptides infeasible within the given time frame. However, the overrepresented pentapeptides had already been calculated and served as a natural sub‐selection to cluster.

Data            Count  Strongly folded (see 3.5.3)
Top SwissRef50  100    17 (17.0%)
Top SwissRef50  500    57 (11.4%)
Top SwissRef50  1000   108 (10.8%)
Top SwissRef50  2000   185 (9.25%)
Top SwissRef50  3000   251 (8.37%)
Top SwissRef50  4000   331 (8.28%)
Top SwissRef50  5000   406 (8.12%)

Table 8. The number of strongly folded pentapeptides in different subsets of SwissRef50.

The proportion of strongly folded pentapeptides is slightly greater among the more overrepresented pentapeptides. In order to validate that the SwissRef50 pentapeptides were a good choice for the search for strongly folded pentapeptides, the first 500 pentapeptides of SwissRef90 were clustered as well. The SwissRef90 pentapeptides produced 6.2% strongly folded pentapeptides, compared to 11.4% for SwissRef50.

Even though it is hard to make a fair estimate of how many pentapeptides are strongly folded, there are 3.2 million possible pentapeptides and only 5000 were investigated, which suggests that there are many more to find. Many of these may however not be widely present in biologically functional proteins, which makes them hard to find using this method.


4.4.2 Linking SCOP to the PDB entries and further filtering

By linking SCOP to the PDB entries, more information about the pentapeptide PDB occurrences was gained. For each PDB occurrence a lookup in SCOP was made and the information was linked. SCOP classes originating from the “designed” base class were disregarded, while all pentapeptides whose PDB occurrences were linked to at least two SCOP superfamilies were kept, in order to ensure that the actual occurrences were present in several different protein superfamilies. Out of the 406 strongly folded pentapeptides from the top 5000 (see Table 8), 238 pentapeptides remained when SCOP classes had been linked and the set had been filtered.

4.4.3 Grouping the PDB entries

The resulting set of pentapeptides was ordered into four simple groups. 58 of the pentapeptides were double Cys pentapeptides with two amino acids in between (CXXCX|XCXXC), 25 were repeats of 4 or 5 amino acids (AAAAX|XAAAA|AAAAA), 5 were tandem repeats (ADADA) and 150 were none of the above.
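The four groups can be told apart with simple positional tests on the five residues. The sketch below is an illustration only: the thesis does not state how a pentapeptide matching more than one pattern was assigned, so the order of the checks (repeats, then tandem repeats, then double Cys) is an assumption, as are the function name and group labels.

```python
def classify(pentapeptide: str) -> str:
    """Assign a pentapeptide to one of the four groups of section 4.4.3."""
    p = pentapeptide
    if len(p) != 5:
        raise ValueError("expected exactly five residues")
    # AAAAX, XAAAA or AAAAA: four identical residues at the start or at the end.
    if len(set(p[:4])) == 1 or len(set(p[1:])) == 1:
        return "repeat"
    # ADADA: positions 1, 3, 5 identical and positions 2, 4 identical.
    if p[0] == p[2] == p[4] and p[1] == p[3]:
        return "tandem repeat"
    # CXXCX or XCXXC: two cysteines separated by two residues.
    if (p[0] == "C" and p[3] == "C") or (p[1] == "C" and p[4] == "C"):
        return "double Cys"
    return "other"


for peptide in ("ACPHC", "AAAAC", "ADADA", "TIKIW"):
    print(peptide, classify(peptide))
```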

Group            Count  Helix  Sheet  None  Ambiguous
Double Cysteine  58     1      0      38    19
Repeats          25     7      0      13    5
Tandem repeats   5      2      0      3     0
Others           150    43     15     49    43
All              238    53     15     103   67

Table 9. Pentapeptide groups and related secondary structure of each group.

Among the double Cysteine pentapeptides it is noticeable that most are located in neither helices nor sheets; when further investigated, these were found to form backbone loops where the two Cysteine amino acids formed a possible Cys‐Cys bridge. As for the repeats, many of them were the results of manufactured repeats, which made them less interesting to study. Furthermore, as overlapping occurrences were not checked for when calculating the initial pentapeptide frequencies, these may be falsely overrepresented.

4.5 Evaluation of strongly folded pentapeptides

In order to evaluate the quality of these strongly folded pentapeptide candidates, a few of them were examined more thoroughly.

4.5.1 ECGKA

Among the PDB occurrences of the pentapeptide “ECGKA” one fold was found, and within it two clearly different surroundings. Representatives of these two clusters are the PDB entries “1ubd.C” and “1vqm.3”. The first protein is a Zinc finger protein where the pentapeptide takes part in binding a Zinc ligand, which in turn is part of a pocket. The second protein is an RNA protein where the occurrence is not part of any pocket, and therefore has no apparent Zinc affiliation. The two proteins have 17% identity and are part of two different SCOP superfamilies. The two occurrences have similar upchain sequences, where “1ubd.C” has “CAECGKA” and “1vqm.3” has “CGECGKA”; the upchain C might form a Cysteine‐Cysteine bridge with the Cysteine in the pentapeptide, which would explain the stability of the backbone loop. There might also be


Figure 3. Showing the occurrence in 1ubd.C to the left and in 1vqm.3 to the right.

4.5.2 ECSAM

The occurrences of the “ECSAM” pentapeptide also form two clusters of different surroundings for one fold. Representatives of these two clusters were the PDB entries “1u0b.B” and “1cpy.A”. The occurrence is located two amino acids downstream of a helix start; both occurrences are thus part of the beginning of a helix. The first protein is a Cysteinyl‐tRNA synthetase where the pentapeptide binds a Zinc atom, which in turn is part of a possible main pocket. The second protein is a Serine carboxypeptidase II protein where the helix is surface orientated and not part of any pocket. The two proteins have 14% identity and are part of two different SCOP superfamilies.

Figure 4. Showing the occurrence in 1u0b.B to the left and in 1cpy.A to the right, where the start of the helix is marked in yellow.

4.5.3 TIKIW

The occurrences of the “TIKIW” pentapeptide form, as in the previous examples, two clusters of different surroundings for one fold. Representatives of these two clusters were the PDB entries “1nr0.A” and “2al7.A”. The occurrence is located in the middle strand of a three‐strand sheet. The first protein is an Actin interacting protein where the occurrence is located in the outer strands of a beta‐barrel which holds a manganese ion; the occurrence is therefore not directly involved in any pocket. The second protein is an ADP‐ribosylation factor‐like 10C protein where the occurrence is not part of any pocket, although the formed sheet is close enough to give supporting structure to a possible main pocket. The two proteins have 14% identity and are part of two different SCOP superfamilies.

Figure 5. Showing the occurrence in 1nr0.A to the left and in 2al7.A to the right.


4.6 Mismatching SCOP clusters

During the more thorough investigation of the strong fold sub‐clusters (clusters with one fold of the occurrence and similar or equal surroundings), many clusters with two or more SCOP superfamily references (for different PDB entries) were found. This suggests that the sub‐clusters in these cases are aggregations of several “real” clusters, which in turn implies that the sub‐clustering algorithm could be improved. However, this is somewhat expected and intended, as the threshold values for the clusters were chosen so that they would generate few false positives. The other possibility is that SCOP has classified very similar proteins into different superfamilies, i.e. that there are inconsistencies within SCOP, which was further investigated.

4.6.1 Classification example

During the investigation of the “ECGKA” pentapeptide, the two PDB entries 1vqm.3 and 1w2b.2 fell in the same subcluster, i.e. had equal surroundings, yet were part of two different SCOP classes. The two respective peptide chains were actually identical and shared a close to identical fold. They were both recognized by SCOP as belonging to “Archaeon Haloarcula marismortui” but were part of two different base classes. The 1vqm.3 protein belonged to “Small proteins” (g) whereas 1w2b.2 belonged to “Low resolution protein structures” (i), which led to different sub‐classifications as well. This may or may not be a known issue with the SCOP classifications, but even after excluding the “Low resolution protein structures” there are many subclusters with several SCOP classes, although it is possible that some of those are the product of too narrow clustering thresholds and are not biologically related structures.


5. Discussion

5.1 Relative frequency calculations of pentapeptides

The relative frequency calculated in this master’s thesis is an approximation of the “real” relative frequency. The approximation is achieved by permuting the data several times: the frequency is calculated for each permutation, and the mean is used as the expected frequency from which the relative frequency is calculated. This approach ensures that the expected frequency is calculated for a dataset of equal size, with equally sized proteins of equal amino acid composition, but where the order of the amino acids has been randomized (permuted). The reason for only allowing the amino acids to move within the same protein is twofold. Mainly, moving amino acids between proteins may create new protein amino acid compositions which may not be biologically possible, much in the same way as we do not expect Tryptophan (W) as often as Glycine (G) or Alanine (A). Secondly, it is faster to move data around within a small memory area (one protein) than within a large memory area (the entire database). This may or may not be an optimal way of approximating the expected frequency and therefore the relative frequency. Nevertheless, it is safe to say that the limited dataset is most likely a much greater source of error than the approximation itself for that set, given a sufficiently large number of permutations.

5.2 Structural clustering

In order to produce the candidates for strongly folded pentapeptides, some form of structural clustering or similarity measurement was needed. In this master’s thesis the protein modelling software Molsoft ICM was used, which proved cumbersome compared to the other calculations. For calculating the few thousand pentapeptides needed to obtain a fair number of candidates, the time needed for this method was manageable. However, if all potentially strongly folded pentapeptides were to be extracted, a better method would be preferred. For general structural similarity problems, alignment is a big part of the problem, but in this study sequence parts of equal length were most commonly compared. Furthermore, as the similarity was not required to be absolute, some approximation of the structural representation is acceptable. Consequently, a much simplified method for approximating similarity for this exact purpose could be developed. Another reason for the structural clustering being cumbersome is that all PDB entries found for a specific pentapeptide were explored in order to make sure that fewer false positives were presented. By using an identity‐reduced version of PDB, or a PDB reduced by SCOP, the number of structural similarity calculations could be decreased.

5.3 The existence of short oligopeptide structural motifs

Although some of the found pentapeptides may be the product of pure chance, at least a few of them most likely conform to a specific fold. One example is the Cys‐Cys pentapeptides with a Pro in between, which bends the backbone and more or less forces the loop to form, supported by the covalent S‐S bond. Furthermore, the pentapeptide structural motifs found in this study are derived from overrepresented pentapeptides and are in many cases also well represented in both Swiss‐Prot and PDB. Therefore it is likely that the pentapeptides shown to have a preserved fold will retain their fold in a new protein. While these pentapeptides may very well be folded differently in other surroundings, the fact remains that they are not found in those surroundings in any yet charted PDB protein. Consequently, the observed fold has been evolutionarily preserved, whether it is beneficial from a molecular energy point of view or not. This may be interesting for protein folding algorithms, such as the HP‐model, where randomized search algorithms are often used to tackle the folding problem (22)(23). By trying the most common conformations in biologically functional proteins first, the search algorithms may be improved. Furthermore, by assuming some specific conformations for certain patterns, the folding algorithms may also overcome barriers that in nature are crossed with the help of chaperones, i.e. proteins that assist protein folding.

5.4 SCOP by common oligopeptides

The data generated in this master’s thesis are overrepresented oligopeptide patterns which have proved to have structural similarities across protein families. In the resulting dataset the data is filtered against similarities of whole proteins, where similar occurrences in different proteins are sought. By reversed filtering this method could be used as an alternative method for validating and possibly deriving new relations between proteins.


6. References

1. Sigrist CJA, Cerutti L, Hulo N, Gattiker A, Falquet L, Pagni M, Bairoch A, Bucher P. PROSITE: a documented database using patterns and profiles as motif descriptors. Oxford Journals. Briefings in Bioinformatics, 2002, Vol. 3, 3, pp. 265‐274. http://bib.oxfordjournals.org/cgi/reprint/3/3/265.
2. Figureau A, Soto MA, Toha J. A pentapeptide‐based method for protein secondary structure prediction. Oxford Journals. Protein Engineering, February 2003, Vol. 16, 2, pp. 103‐107. http://peds.oxfordjournals.org/cgi/content/full/16/2/103#APWEILER‐2001.
3. Bresell A, Persson B. Characterization of oligopeptide patterns in large protein sets. BMC Genomics. 1 Oct 2007, Vol. 8, 346. http://www.biomedcentral.com/1471‐2164/8/346.
4. Stryer, Lubert. Biochemistry. 4th Edition. New York : W. H. Freeman and Company, 1995. pp. 17‐41. ISBN 0‐7167‐2009‐4.
5. Kyte J, Doolittle RF. A simple method for displaying the hydropathic character of a protein. Journal of Molecular Biology. 05 05 1982, Vol. 157, 1, pp. 105‐132.
6. EBI. UniProt Database. [Online] 11 12 2007. [Cited: 17 01 2008.] http://www.ebi.ac.uk/uniprot/database/DBDescription.html.
7. Bairoch A., Apweiler R., Wu C.H., Barker W.C., Boeckmann B., Ferro S., Gasteiger E., Huang H., Lopez R., Magrane M., Martin M.J., Natale D.A., O'Donovan C., Redaschi N., Yeh L.S. The Universal Protein Resource (UniProt). Oxford Journals. Nucleic Acids Research, 2005, Vol. 33, pp. D154‐D159. http://nar.oxfordjournals.org/cgi/reprint/33/suppl_1/D154.
8. Boeckmann B., Bairoch A., Apweiler R., Blatter M.‐C., Estreicher A., Gasteiger E., Martin M.J., Michoud K., O'Donovan C., Phan I., Pilbout S., Schneider M. The SWISS‐PROT protein knowledgebase and its supplement TrEMBL in 2003. Oxford Journals. Nucleic Acids Research, 2003, Vol. 31, pp. 365‐370. http://nar.oxfordjournals.org/cgi/content/full/31/1/365.
9. Berman H, Henrick K, Nakamura H, Markley JL. The worldwide Protein Data Bank (wwPDB): ensuring a single, uniform archive of PDB data. Oxford Journals. Nucleic Acids Research, 2007, Vol. 35, pp. D301‐D303. http://nar.oxfordjournals.org/cgi/content/full/35/suppl_1/D301.
10. RCSB Protein Data Bank. [Online] Research Collaboratory for Structural Bioinformatics, 08 04 2008. [Cited: 12 04 2008.] http://www.rcsb.org/pdb/static.do?p=general_information/news_publications/news/news_2008.html#20080408.
11. Andreeva A, Howorth D, Brenner SE, Hubbard TJP, Chothia C, Murzin AG. SCOP database in 2004: refinements integrate structure and sequence family data. Oxford Journals. Nucleic Acids Research, 2004, Vol. 32, pp. D226‐D229.
12. Murzin A. G., Brenner S. E., Hubbard T., Chothia C. SCOP: a structural classification of proteins database for the investigation of sequences and structures. J. Mol. Biol. 1995, 247, pp. 536‐540.
13. Molsoft L.L.C. ICM‐Pro. [Online] 2008. [Cited: 12 04 2008.] http://www.molsoft.com/icm_pro.html.
14. Zhi D, Krishna SS, Cao H, Pevzner P, Godzik A. Representing and comparing protein structures as paths in three‐dimensional space. BMC Bioinformatics. 11 20, 2006, Vol. 7, 460. http://www.biomedcentral.com/1471‐2105/7/460.
15. McLachlan, A. D. Gene duplications in the structural evolution of chymotrypsin. Journal of Molecular Biology. 1979, Vol. 128, 1, pp. 49‐79.
16. ICM Language Reference : Functions. Molsoft L.L.C. [Online] 21 04 2008. [Cited: 26 04 2008.] http://www.molsoft.com/icm/icm‐functions.html#Rmsd.
17. Li W, Jaroszewski L, Godzik A. Clustering of highly homologous sequences to reduce the size of large protein databases. Oxford Journals. Bioinformatics, March 2001, Vol. 17, 3, pp. 282‐283. http://bioinformatics.oupjournals.org/cgi/reprint/17/3/282.pdf.
18. Li W, Godzik A. Cd‐hit: a fast program for clustering and comparing large sets of protein or nucleotide sequences. Oxford Journals. Bioinformatics, July 2006, Vol. 22, 13, pp. 1658‐1659. http://bioinformatics.oxfordjournals.org/cgi/reprint/22/13/1658.
19. Suzek BE, Huang H, McGarvey P, Mazumder R, Wu CH. UniRef: comprehensive and non‐redundant UniProt reference clusters. Oxford Journals. Bioinformatics, 22 March 2007, Vol. 23, 10, pp. 1282‐1288. http://bioinformatics.oxfordjournals.org/cgi/content/full/23/10/1282.
20. BLASTCLUST ‐ BLAST score‐based single‐linkage clustering. [Online] NCBI, October 2002. [Cited: 10 December 2007.] ftp://ftp.ncbi.nih.gov/blast/documents/blastclust.html.
21. Bolten E, Schliep A, Schneckener S, Schomburg D, Schrader R. Clustering protein sequences – structure prediction by transitive homology. Oxford Journals. Bioinformatics, 2001, Vol. 17, 10, pp. 935‐941. http://bioinformatics.oxfordjournals.org/cgi/content/abstract/17/10/935.
22. Bui T.N., Sundarraj G. An efficient genetic algorithm for predicting protein tertiary structures in the 2D HP model. s.l. : GECCO'05, 2005.
23. Shmygelska A., Hoos H.H. An improved ant colony optimisation algorithm for the 2D HP protein folding problem. s.l. : Proc. of the 16th Canadian Conference on Artificial Intelligence, 2003. pp. 400‐417. 2671.
