Protein Interactions from the Molecular to the Domain Level


©Patrik Björkholm, Stockholm University 2014 ISBN 978-91-7447-854-9

Printed in Sweden by US-AB, Stockholm 2014


List of Publications

This thesis is based on the following papers, which are referred to in the text by their Roman numerals.

i. Björkholm P, Daniluk P, Krystofovich A, Fidelis K, Andersson R and Hvidsten TR (2009) Using multi-data hidden Markov models trained on local neighborhoods of protein structure to predict residue-residue contacts. Bioinformatics. 25(10): 1264-1270

ii. Contreras FX*, Ernst AM*, Haberkant P, Björkholm P, Lindahl E, Gönen B, Tischer C, Elofsson A, von Heijne G, Thiele C, Pepperkok R, Wieland F, Brügger B (2012) Molecular recognition of a single sphingolipid species by a protein's transmembrane domain. Nature. 481 (7382): 525-529

iii. Björkholm P*, Ernst AM*, von Heijne G, Wieland F, Brügger B (2014) Identification of novel sphingolipid–binding motifs in mammalian membrane proteins. Manuscript.

iv. Björkholm P and Sonnhammer E (2009) Comparative analysis and unification of domain–domain interaction networks. Bioinformatics. 25(22): 3020-3025


Additional Publications

v. Maddalo G, Stenberg-Bruzell F, Götzke H, Toddo S, Björkholm P, Eriksson H, Chovanec P, Genevaux P, Lehtiö J, Ilag LL, Daley DO (2011) Systematic analysis of native membrane protein complexes in Escherichia coli. Journal of proteome research. 10 (4): 1848-1859

vi. Botelho SC, Österberg M, Reichert AS, Yamano K, Björkholm P, Endo T, von Heijne G (2011) TIM23-mediated insertion of transmembrane α-helices into the mitochondrial inner membrane. The EMBO journal. 30 (6): 1003-1011

vii. Galian C, Björkholm P, Bulleid N, von Heijne G (2012) Efficient glycosylphosphatidylinositol (GPI) modification of membrane proteins requires a C-terminal anchoring signal of marginal hydrophobicity. Journal of Biological Chemistry. 287 (20): 16399-16409

viii. Reithinger JH, Yim C, Park K, Björkholm P, von Heijne G, Kim H (2013) A short C-terminal tail prevents mis-targeting of hydrophobic mitochondrial membrane proteins to the ER.


Chapter 1. Introduction

The most fundamental unit of life is the cell, from single-celled bacteria on the ocean floor to the largest creatures on the planet. Organisms can broadly be divided into three separate kingdoms: the Eukaryotes, the Bacteria and the Archaea. These kingdoms have many features in common, but each also has quite a few features of its own.

Let us start with their common denominators. All cells have DNA, which is the genetic blueprint; this information is carried in the form of RNA from the genome to the ribosome, where proteins are manufactured. Proteins can be seen as the functional parts of the cell; they usually have one or several functions and are the main actors in almost all essential biological processes. Proteins are what make the cell alive. They are found both as solitary units and as parts of large complexes, and they occur in all parts of the cell: floating freely in the cytoplasm, the central cavity of all cells, but also integrated into or attached to membranes. Membranes separate the cell from the rest of the universe and define its shape. Proteins integrated into the membrane have a wide range of responsibilities. They are not only the gatekeepers of the cell but are also partly in charge of waste disposal and, perhaps most importantly, a number of them are receptors and enzymes. Receptors are the means by which cells communicate with the rest of the world. If they start misbehaving, this may not only be detrimental on a cellular level but may also affect the organism as a whole. This set of features is what makes membrane proteins so interesting for the pharmaceutical industry; more than half of the targets in the pharmaceutical industry are membrane proteins. Enzymes are important contributors to the metabolic processes in the cell. Most enzymes are proteins, and their main task is to speed up processes in the cell, most commonly by lowering the activation energy necessary for the process.

Deoxyribonucleic acid (DNA) is the molecule in which organisms store information and pass it on to their progeny. DNA resembles a twisted rope ladder, where each step is a base pair between two specific bases. Each base always pairs with the same partner: cytosine (C) always pairs with guanine (G) and adenine (A) always pairs with thymine (T). DNA molecules can be several decimeters long yet so efficiently packed that they fit into cells [43].
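The pairing rule above can be written down in a few lines. The following is a minimal sketch, not part of the thesis work; the function names are chosen here purely for illustration.

```python
# Watson-Crick pairing as stated in the text: C pairs with G, A pairs with T.
PAIRING = {"A": "T", "T": "A", "C": "G", "G": "C"}


def complement(strand: str) -> str:
    """Return the base-by-base partner of a DNA strand."""
    return "".join(PAIRING[base] for base in strand.upper())


def reverse_complement(strand: str) -> str:
    """Return the partner strand read in its own 5'-to-3' direction."""
    return complement(strand)[::-1]


print(complement("ATGC"))          # TACG
print(reverse_complement("ATGC"))  # GCAT
```

Because every base has exactly one partner, applying `complement` twice returns the original strand, which is what allows either strand to serve as a template for the other.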


In the genome, certain segments correspond to particular proteins; these segments are usually referred to as genes. To express these proteins, the genes have to be transcribed into messenger RNA (mRNA). This mRNA template is then transported to a ribosome, where it is translated into a polypeptide chain. A triplet of bases (a codon) codes for a single amino acid. Each codon always codes for the same amino acid, but each amino acid can be coded for by more than one codon [76]. This redundancy makes cells less sensitive to point mutations and more stable at the protein level. This flow of information is referred to as the Central Dogma [15]. The sum of all genes in an organism is referred to as the genome, and the sum of all proteins in an organism as the proteome.
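To make the codon redundancy concrete, here is a small sketch of transcription and translation. It uses only a hand-picked subset of the standard genetic code, and it treats transcription as a simple T-to-U substitution on the coding strand; both are simplifying assumptions made for illustration.

```python
# A deliberately small subset of the standard genetic code (mRNA codons).
CODON_TABLE = {
    "AUG": "M",                                      # methionine (start)
    "UUU": "F", "UUC": "F",                          # phenylalanine
    "GGU": "G", "GGC": "G", "GGA": "G", "GGG": "G",  # glycine
    "AAA": "K", "AAG": "K",                          # lysine
    "UAA": "*", "UAG": "*", "UGA": "*",              # stop codons
}


def transcribe(coding_dna: str) -> str:
    """Simplified transcription: the mRNA matches the coding strand with U for T."""
    return coding_dna.upper().replace("T", "U")


def translate(mrna: str) -> str:
    """Read codons from the first base until a stop codon or the end of the mRNA."""
    peptide = []
    for i in range(0, len(mrna) - 2, 3):
        residue = CODON_TABLE[mrna[i:i + 3]]
        if residue == "*":
            break
        peptide.append(residue)
    return "".join(peptide)


wild_type = "ATGTTTGGAAAATAA"
silent_mutant = "ATGTTTGGCAAATAA"  # GGA -> GGC, both code for glycine

print(translate(transcribe(wild_type)))      # MFGK
print(translate(transcribe(silent_mutant)))  # MFGK
```

The two sequences differ by one base yet yield the same peptide, which is the robustness to point mutations described above.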

Figure 2.1 The blue arrows represent the normal flow of information in cells. The flow shown by the red arrow occurs in nature, but only in a specific type of virus, the retroviruses, which use their RNA to infect the host cell's genome. In the standard flow of information, (a) DNA is copied both to new (a) DNA and to (b) RNA. Information then continues to flow from (b) RNA into (c) proteins.

The organization of cells differs somewhat between the kingdoms. Bacteria, Archaea and some Eukarya can all survive as single cells; the majority of Eukarya, though, are multicellular organisms in which most of the cells are specialized and have specific functions for the benefit of the organism as a whole.

Eukaryotic cells have an elaborate internal organization in which parts of the cell are separated off by membranes, forming compartments with specific functions. These compartments are called organelles. Most organelles have the same function in different types of cells: peroxisomes break down hydrogen peroxide and synthesize lipids; mitochondria specialize in respiration, supplying the cell with high-energy compounds such as ATP; the nucleus stores the genome and regulates replication and transcription. The rough endoplasmic reticulum is involved in the synthesis of secretory proteins and the smooth endoplasmic reticulum is involved in a variety of metabolic reactions such as synthesizing lipids and steroids. The Golgi specializes in protein sorting, protein transport and protein post-translational modifications.

Bacterial cells do not have the same level of cellular organization as found in eukaryotes but have one or two membranes that separate them from their surroundings. Their genome is usually circular and located in the cytoplasm, as there is no nucleus.

This thesis is focused on protein interactions. It starts by describing interactions within proteins and protein folding, continues by looking at how proteins interact with membranes, and finishes by looking at how proteins form intracellular networks. These networks of proteins are what allow cells to metabolize, function and reproduce.


Chapter 2. Proteins

Proteins are the machines of the cell and therefore have a wide range of functions, tasks and responsibilities. Proteins are macromolecules made up of linear chains of amino acids. The start of a protein chain is referred to as the N-terminus and the end as the C-terminus. There are 20 different amino acids that are normally found in proteins. Amino acids are often categorized by their properties, for example as hydrophobic or polar.

2.1 Overview of protein structure

When looking at protein structure it is common to study it on four separate levels. The first level is the primary structure, which is the actual sequence of linked amino acids. The next level is the secondary structure. Local forces generate short-range interactions between amino acids, inducing the formation of α-helices and β-strands. β-strands can pair up to form β-sheets; by necessity, this involves long-range interactions between different parts of the chain. These local structures are usually the first to form after the protein is translated and has left the ribosome tunnel. The tertiary structure represents the native 3-dimensional structure of the chain. Finally, the quaternary structure results when a polypeptide forms a complex with one or more identical or different polypeptides [50]. Figure 2.2 shows the different structure levels.


Figure 2.2 This figure shows the different structural levels of a protein. a) is the primary structure, a simple chain of amino acids. b) shows the secondary structure level with a set of α-helices on top and a β-sheet at the bottom. In c) is the tertiary structure of the folded protein, where the protein has assumed its native state. The last part d) shows the quaternary structure of two folded polypeptide chains that have formed a functional protein.

2.2 Protein folding

Proteins are made up of one or more chains that form a 3-dimensional structure. How this structure forms depends on interactions between the amino acids of the protein. This process is affected by interactions with the environment in which the protein is folded. The first to show experimentally that a protein can fold spontaneously when in the right environment was Christian Anfinsen; in 1972 his contributions were recognized with a Nobel Prize in Chemistry.

2.2.1 Anfinsen’s Dogma

Anfinsen postulated a set of rules that describe the folding process. These rules have been called either the “Thermodynamic theory” or “Anfinsen’s Dogma” [1]. What Anfinsen proposed was that the lowest possible energy a small protein can have corresponds to its native state, and that the structure of this state is dependent on the amino-acid sequence. Three criteria have to be fulfilled for this to be true. The first is that a protein can only have one lowest free energy state. The second criterion is that the native state is stable; once this state is reached it should not be conformationally sensitive to changes in its surroundings. The last criterion is that the path to the native state cannot be overly complex, or the protein will have problems reaching its native state in a reasonable time. The last criterion has large biological implications, since it can be interpreted to mean that the biologically functional state is not actually the native state of the protein, because the path to this state may be unattainable for some amino-acid sequences. This means that some proteins will be trapped in a local energy minimum that has a higher energy than the sequence’s native state.

2.2.2 The hydrophobic effect

The dominant driving force for protein folding is the hydrophobic effect. Most proteins fold in aqueous surroundings such as the cytoplasm. The hydrophobic effect is a consequence of hydrogen bonding between water molecules. Water forms hydrogen bonds because the oxygen atom is electronegative and carries a small negative charge, while each hydrogen atom carries a small positive charge. A hydrogen bond forms when a hydrogen atom bound to one oxygen atom is attracted to the oxygen atom of a neighboring molecule. Hydrogen bonding happens naturally between water molecules, forming a network of hydrogen-bonded molecules.

If a non-polar moiety, such as a non-polar amino-acid side chain, is added to water, parts of this network will break up to accommodate the non-polar residue. Since the residue is non-polar, the nearby water molecules will tend to interact more strongly amongst themselves, forming an ordered “cage” of hydrogen bonded waters around the residue. This lowers the entropy of the system and thus makes hydrophobic molecules poorly soluble in water [47]. For proteins, this drives hydrophobic residues to cluster in the core of the polypeptide, thereby minimizing the exposure of non-polar surface area to water.

There are other factors that affect folding as well, such as long-range electrostatic interactions and hydrogen bonds formed between polar side chains. Long-range electrostatic interactions can be divided into two types, specific and non-specific [3, 16]. Non-specific interactions are interactions between charged residues that are reasonably close to each other in space but do not interact directly. Specific electrostatic interactions include, for example, salt bridges. These form stronger bonds but have been shown to be too few to be the main pillars of protein stability, although they can nonetheless contribute to it [66].

2.3 Protein Structures

Information about protein structures is necessary to study and understand protein function. High-quality structures are in general expensive and difficult to produce, even if this has improved in recent years. All resolved protein structures are deposited in the Protein Data Bank (PDB) [5].

There are four techniques available to solve a protein structure down to the atomic level: X-ray crystallography, nuclear magnetic resonance (NMR), electron crystallography and single-particle cryo-electron microscopy [42, 74, 83, 84, 85]. X-ray crystallography locates the positions of all atoms of the protein in a crystal. The downside of this technique is that the protein must be crystallized and must form well-defined crystals. If this is the case, the crystal can diffract X-rays to a high resolution, and from the resulting pattern the position of all atoms in the protein can be calculated [60].

NMR has the advantage that it does not require a crystal but works with proteins in solution. Unfortunately, NMR structures in general have a lower resolution than X-ray structures, and the technique cannot always be used on large proteins or protein complexes [48]. In NMR the protein sample is exposed to a high-intensity magnetic field, in which the nuclei of the protein absorb and then emit electromagnetic energy. This emitted energy can then be used to calculate the protein's structure.

Electron crystallography is built upon principles similar to those of X-ray crystallography but bombards the crystals with electrons instead of X-rays; it can be seen as a useful complement to the more common X-ray crystallography. Electron crystallography can be used on proteins that cannot form large 3-dimensional crystals. Because electrons interact more strongly with the crystals, it can be used to capture 2-dimensional structures or occasionally dispersed individual proteins. The downside is that this technique cannot be used on large or thick crystals and has a low resolution [83].

Single-particle cryo-electron microscopy is a transmission electron microscopy technique in which samples are studied at cryogenic temperatures. It can study samples in their native states rather than as crystals; the downside is that the resolution of cryo-electron microscopy maps is often not sufficient to resolve protein structures in atomic detail. Some of these microscopes have CCD detectors paired with a phosphorescent layer. This extra information, combined with a lot of computational power, can create three-dimensional pictures of the studied sample [84, 85].

2.4 Protein Structure Prediction

Protein structure prediction attempts to predict the native state of a protein in silico. These predictions can be made on several levels. The first is secondary structure prediction, which can now be made with quite high accuracy. The most common program for this is PSIPRED [77]. One can also try to predict which residues are close to each other in space (contact prediction), to predict the backbone structure, or to predict the whole protein structure with full residue packing.
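What a contact predictor is trying to reproduce can be illustrated by deriving contacts from known coordinates. The sketch below is only an illustration: the 8 Å cutoff, the minimum sequence separation of three residues and the toy hairpin coordinates are all assumptions made here, not values taken from this thesis.

```python
import math


def contact_map(coords, cutoff=8.0, min_separation=3):
    """Return residue index pairs (i, j) whose representative atoms lie within `cutoff`.

    `coords` holds one (x, y, z) point per residue, e.g. a C-alpha or C-beta position.
    Pairs closer than `min_separation` along the chain are skipped because they are
    near each other in space for trivial reasons.
    """
    contacts = set()
    for i in range(len(coords)):
        for j in range(i + min_separation, len(coords)):
            if math.dist(coords[i], coords[j]) <= cutoff:
                contacts.add((i, j))
    return contacts


# Hypothetical hairpin: four residues out along x, four residues back at y = 5.
toy_hairpin = [
    (0.0, 0.0, 0.0), (3.8, 0.0, 0.0), (7.6, 0.0, 0.0), (11.4, 0.0, 0.0),
    (11.4, 5.0, 0.0), (7.6, 5.0, 0.0), (3.8, 5.0, 0.0), (0.0, 5.0, 0.0),
]

for pair in sorted(contact_map(toy_hairpin)):
    print(pair)
```

A contact predictor has to produce such a set of pairs from sequence information alone; a sufficiently accurate set can then be used as distance restraints when building a three-dimensional model.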

In general there are two broad approaches to predicting protein structure. The most common is called template modeling. Template modeling uses a known structure whose sequence is closely related to the sequence of the protein being modeled. These methods take the target sequence and map it onto the template structure. The downside is that not all proteins have a relative of known structure. The quality of the model also decreases when more distant relatives are used as templates. Frequently used servers and methods based on template modeling are Pcons, I-TASSER and Modeller [20, 61, 71, 75].

The other type of structure prediction is called ab initio or de novo structure prediction. These are methods that use only sequence information to predict protein structures. The most common approach to de novo structure prediction is to use physics to mimic the folding process, often with the aid of algorithms that speed up the process of finding local minima on the protein’s energy landscape. The most common algorithm used for this is the Monte Carlo algorithm. These methods are in general less reliable, especially for large proteins, but remain the only option for predicting structures of proteins that have no relatives with a known structure. The most reliable and well-known program for predicting protein structure de novo is the Rosetta platform [6, 58].
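The idea behind a Monte Carlo search of an energy landscape can be sketched with the generic Metropolis acceptance rule. This is a toy example on a hypothetical one-dimensional energy function, written here for illustration only; it is not the protocol used by Rosetta or any other structure prediction program, and all parameter values are arbitrary assumptions.

```python
import math
import random


def toy_energy(x: float) -> float:
    """Hypothetical landscape: a shallow minimum near x = 1, a deeper one near x = -1."""
    return (x * x - 1.0) ** 2 + 0.3 * x


def metropolis(energy, x0, steps=5000, step_size=0.5, temperature=0.5, seed=0):
    """Random-walk search that keeps track of the lowest-energy state visited."""
    rng = random.Random(seed)
    x, e = x0, energy(x0)
    best_x, best_e = x, e
    for _ in range(steps):
        candidate = x + rng.uniform(-step_size, step_size)
        e_new = energy(candidate)
        # Metropolis criterion: always accept downhill moves; accept uphill moves
        # with probability exp(-dE / T) so the walk can escape local minima.
        if e_new <= e or rng.random() < math.exp(-(e_new - e) / temperature):
            x, e = candidate, e_new
            if e < best_e:
                best_x, best_e = x, e
    return best_x, best_e


start = 1.0  # begin in the shallow minimum
best_x, best_e = metropolis(toy_energy, start)
print(f"start energy {toy_energy(start):.3f}, best energy found {best_e:.3f} at x = {best_x:.3f}")
```

Occasionally accepting uphill moves is what distinguishes this from plain downhill minimization. In a real folding simulation the single coordinate `x` would be replaced by the conformation of the whole chain and `toy_energy` by a physical or knowledge-based energy function.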

2.5 Protein motifs

Protein motifs are short series of specific amino acids that contribute to or generate a biological function or interaction. Motifs that are common to different proteins may indicate a common functionality, at least on a local level. Two examples of motifs are the N-linked glycosylation acceptor motif (NxS and NxT) and the helix-packing motif (GxxxG) [78, 79]. Motifs can be found by protein alignments or experimental work. There are two major sources of motif data, PROSITE and PRINTS; these databases contain many known motifs and allow the user to look for established motifs in protein or DNA sequences [2, 62].
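Short motifs such as the two mentioned above map naturally onto regular expressions. The sketch below treats x as any residue, exactly as the motifs are written in the text; in practice the glycosylation motif is commonly restricted so that x is not proline, a refinement left out here. The example sequence is made up for illustration.

```python
import re

# Lookahead groups are used so that overlapping occurrences are all reported.
MOTIFS = {
    "N-glycosylation (NxS/NxT)": re.compile(r"(?=(N[A-Z][ST]))"),
    "helix packing (GxxxG)": re.compile(r"(?=(G[A-Z]{3}G))"),
}


def find_motifs(sequence: str) -> dict:
    """Return {motif name: [(1-based start position, matched residues), ...]}."""
    sequence = sequence.upper()
    return {
        name: [(m.start() + 1, m.group(1)) for m in pattern.finditer(sequence)]
        for name, pattern in MOTIFS.items()
    }


toy_sequence = "MANGSGLLLGAVGNKT"  # hypothetical sequence
for name, hits in find_motifs(toy_sequence).items():
    print(name, hits)
```

A match of this kind only shows that the pattern is present; whether the site is actually used, for example whether an NxS site is glycosylated, depends on its structural and cellular context.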

2.6 Protein domains

Proteins are modular entities, something that became very apparent after the first protein structures had been solved. What could be seen was that the same structure would recur as identifiable domains in otherwise structurally disparate proteins [17, 64]. Domains have been shown to fold more or less independently [27, 35]. It was quickly realized that domain sequences could be aligned to identify key residues involved in stabilizing the domain and defining its function. This simple discovery of alignments, or profiles as they have also been called, can be seen as one of the founding pillars of bioinformatics. Sequence profiles can be used to find other proteins of unknown structure and function and to assign putative structure and function to them. In protein structure bioinformatics the most common and accurate prediction methods, template-based protein modeling, use the same basic principle to transfer structural information from a protein of known structure to a protein of unknown structure [72, 75]. Profiles of protein domain families can be seen as repositories of sequence-level information on related proteins. These profiles have proven to be incredibly powerful tools for searching for and identifying new members of the same family. In the present age of genomics, this allows us to assign protein domains to newly assembled genomes and to map prior knowledge onto unknown sequences. By looking at sequence similarities and dissimilarities, general rules could be created that define which proteins can be seen as homologous. This has allowed scientists to cluster the unmapped protein sequence space into novel protein domain families [9, 28, 65]. The understanding of protein domains has allowed us to classify relationships between proteins that would otherwise have been missed, especially since protein-domain interactions are conserved across species, something that is not always true at the protein level [31].

2.7 Protein domain interactions

Recent technologies at both the genomic and the proteomic level have shown that proteins form large interaction networks in which they bind to each other for both short (transient) and long (stable) periods of time. These interactions are essential for proteins to be able to perform their functions. Specific recognition occurs when, for example, one protein domain binds to a domain-binding motif in a target protein [40, 55]. An example of a widespread binding motif is the PxxP motif, which attracts proteins containing Src-homology 3 (SH3) domains [40]. What makes motifs and domains so special and biologically important is that they are not specific to single organisms but are present across species [56, 67]. This is why studying protein interactions on a protein domain level makes sense. It is also why protein interactions are so difficult to predict. Identifying a single domain-motif interaction is usually a lot of work, and the task is not made easier by the fact that most eukaryotic proteins consist of several protein domains [11, 14, 39]. As will be seen, a number of approaches have been developed that use available large-scale protein interaction data to both determine and predict protein domain interactions. Another good source of data for studying protein interactions is protein structures. With the ever-increasing number of known structures, quite a number of reliable protein domain interactions can be extracted from structures [19]. Although these interactions contain a lot of information, too much focus on them is dangerous, as it heavily skews the picture towards stable protein domain interactions relative to transient protein interactions [19].

2.8 Prediction and Extraction of Protein Domain Interactions

Several techniques for predicting domain-domain interactions have been developed over the years. The first approach was the association method, in which domains were mapped to interacting proteins. In this method, domain interactions were selected if their frequency was higher than expected given the domains’ abundance in the proteome [37, 50]. Extensions to this approach include domain pair exclusion analysis and random forest optimization [13, 69]. Domain pair exclusion analysis introduces a new score that compares the ratio of two domains interacting over the two domains not interacting. Random forest optimization methods calculate all possible protein domain interactions over all available protein domains using sets of predicted protein-protein interactions. They are good at estimating the effect of multi-domain combinations on interactions. Evolutionary data (phylogenetic profiling) has also been used to study interactions, by inferring interactions from the co-occurrence of protein domains in multiple species [52, 59].
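The core idea of the association method, comparing how often a domain pair is seen in interacting proteins with what would be expected by chance, can be sketched as follows. This is a simplified enrichment ratio written for illustration, not the exact score of the cited methods; the protein and domain names are hypothetical, and self-interactions are ignored.

```python
from itertools import combinations, product


def domain_pair_enrichment(protein_domains, interactions):
    """Score domain pairs by their interaction frequency relative to the background.

    protein_domains: {protein: set of domain names}
    interactions:    iterable of (protein, protein) pairs reported to interact
    Returns {(domain, domain): enrichment}; values above 1 mean the pair is seen
    in interacting proteins more often than the overall interaction density predicts.
    """
    proteins = sorted(protein_domains)
    interacting = {frozenset(pair) for pair in interactions}
    protein_pairs = list(combinations(proteins, 2))
    background = len(interacting) / len(protein_pairs)

    possible = {}  # number of protein pairs that carry a given domain pair
    observed = {}  # number of those protein pairs that interact
    for a, b in protein_pairs:
        domain_pairs = {
            tuple(sorted((x, y)))
            for x, y in product(protein_domains[a], protein_domains[b])
        }
        for dp in domain_pairs:
            possible[dp] = possible.get(dp, 0) + 1
            if frozenset((a, b)) in interacting:
                observed[dp] = observed.get(dp, 0) + 1

    return {dp: (observed.get(dp, 0) / n) / background for dp, n in possible.items()}


toy_domains = {
    "P1": {"DomA"}, "P2": {"DomB"}, "P3": {"DomA", "DomC"},
    "P4": {"DomB"}, "P5": {"DomC"},
}
toy_interactions = [("P1", "P2"), ("P3", "P4"), ("P1", "P4")]

scores = domain_pair_enrichment(toy_domains, toy_interactions)
for pair, score in sorted(scores.items(), key=lambda item: -item[1]):
    print(pair, round(score, 2))
```

In this toy data set the DomA-DomB pair is enriched, while DomB-DomC scores below 1 even though P3 and P4 interact, because that interaction is better explained by the DomA domain that P3 also carries. Untangling such multi-domain cases is what the extensions mentioned above address.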

Protein domain interactions have also been extracted by integrating different data types, for example by combining gene ontology functional annotation with protein interaction data. This is done using expectation-maximization or Bayesian methods [38]. The last approach for predicting protein domain interactions is co-evolutionary analysis. This approach looks at both structure and sequence to estimate whether potentially interacting domains may have co-evolved [34]. As can be seen, a diverse set of methods has been developed to predict protein domain interactions, with varying reliability [51, 69].


Chapter 3. Membranes

3.1 Biological membranes

Membranes define the outer boundaries and internal organelles of all cells. Membranes are composed of two layers of lipids, each of which is referred to as a leaflet. Although the lipids are what define the membrane, they are not its only constituents. Biological membranes have proteins both integrated into them and attached to their surface. The surface of some membranes is also covered with carbohydrates that are attached either directly to lipids or indirectly via proteins. The membrane should not be seen as a homogeneous layer of lipids dotted with proteins or carbohydrates [63], but rather as an active, ever-changing entity [32]. Both protein and lipid composition are asymmetrical. If one were able to see the surface, one would see local patches of distinct lipid phases flicker in and out of existence, with many proteins in oligomeric states integrated into the membrane and with constant protein activity on both the inside and the outside of the membrane affecting both its composition and its thickness [49].

3.2 Lipids

Lipids make up the bulk of the membrane. They form membranes because of their amphipathic nature and the hydrophobic effect. Lipids commonly have a polar headgroup and one or two non-polar, hydrophobic tails. These molecules form membranes in water because the headgroups can interact favorably with water while the tails cannot, which makes the apolar parts self-associate. By forming a bilayer, the lipids' non-polar parts can interact favorably with each other while the polar headgroups interact with the water.

Lipids come in many sizes and shapes. In the bacterium Escherichia coli alone one can find over 100 different lipids, and in Eukaryotes there are over 1000 different types of lipids. This is one of the reasons why the different compartments in Eukaryotic cells can be so specialized [44]. Viewed as geometric shapes, lipids can be categorized into three standard shapes. The most common lipids are often seen as cylindrical, which is considered to be a requirement for lipids to form bilayers (see Figure 3.1 below). The other two types are capable of organizing themselves into micellar or higher-order structures in an aqueous environment but do not occur as commonly in biology as the cylindrically shaped lipids. Quite a number of lipids that are not cylindrical reside in the membrane by mixing with lipids that are capable of creating a bilayer; this changes properties of the membrane such as its thickness and curvature elasticity [44, 26].

Figure 3.1 The standard lipid architectures and shapes and their effects on lipid organization. The most common biological shape is the cylinder, which leads to the lamellar bilayer organization. The figure is adapted from [82].

3.3 Membrane proteins

Membrane proteins can broadly be classified as either peripheral or integral membrane proteins. Peripheral membrane proteins are attached to the surface of the membrane either by electrostatic and hydrophobic forces or through enzyme-catalyzed attachment to a lipid, as with GPI-anchors [22, 33]. One very important difference between the two types of peripheral membrane protein is that the non-covalent interactions are more easily reversible than the covalent ones. Integral membrane proteins have parts that are embedded in the bilayer. These proteins are retained in the membrane via interactions between the hydrophobic transmembrane domain of the protein and the hydrophobic body of the lipid bilayer. A typical membrane protein residing in a membrane can be seen in Figure 3.2.

Figure 3.2 The membrane protein complex Sensory rhodopsin II. The space between the blue and red lines represents the lipid bilayer.

3.3.1 Membrane Protein Topology

The topology of a membrane protein describes which parts of the protein traverse the membrane and which parts are on one or the other side of it. On the sequence level this is often denoted as in (i), out (o) and membrane (M) [4]. Membrane protein topology generally follows the “positive inside rule”. This rule simply states that the non-membrane regions facing the cytoplasm should contain a higher number of positively charged residues [29]. Since the number of membrane proteins with known structures is quite limited, topology predictors are generally needed for large-scale analyses of membrane proteins. These predictors have become both faster and more accurate with time, and some recent methods have reached a prediction accuracy of up to 90% on certain datasets [68].
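Given a sequence and its i/o/M topology string, the positive inside rule can be checked by counting positively charged residues on each side. The sketch below counts lysine and arginine only, which is an assumption made here, and the two-helix sequence is invented for illustration.

```python
POSITIVE = set("KR")  # assumption: only Lys and Arg are counted as positive


def positive_charge_by_side(sequence: str, topology: str) -> dict:
    """Count positively charged residues in inside (i) and outside (o) regions."""
    if len(sequence) != len(topology):
        raise ValueError("sequence and topology must have the same length")
    counts = {"i": 0, "o": 0}
    for residue, label in zip(sequence.upper(), topology):
        if label in counts and residue in POSITIVE:
            counts[label] += 1
    return counts


# Hypothetical protein with two transmembrane helices, built segment by segment
# so that the topology string always matches the sequence length.
segments = [
    ("MKRKS", "i"),
    ("LLVVAALLGAVLLAALLVVA", "M"),
    ("SGDEQTN", "o"),
    ("LLAVVGALLAVVLLAAVLLG", "M"),
    ("RKKRS", "i"),
]
toy_sequence = "".join(seg for seg, _ in segments)
toy_topology = "".join(label * len(seg) for seg, label in segments)

print(positive_charge_by_side(toy_sequence, toy_topology))  # {'i': 7, 'o': 0}
```

Topology predictors use the same bias in the other direction: among the candidate orientations of a set of predicted transmembrane segments, the one that places more positive charge in the cytoplasmic loops is favored.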


Figure 3.3 Membrane protein topology. In a) each transmembrane domain has been given a unique color to make the domains easier to distinguish. b) shows a typical topology model; each colored rectangle represents a transmembrane segment and the lines connecting them represent soluble domains. In c) the amino acid sequence is shown in FASTA format with the corresponding topology below (i = inside the cell, o = outside the cell, M = in the membrane).

3.4 Lipid rafts

In some cells, subdomains (also known as lipid rafts) tend to form in the membrane. These are laterally segregated domains enriched in certain lipids that give them features different from the surrounding membrane. The size, lifetime and even existence of these subdomains are a source of great debate in the field of cell biology. The idea that membrane subdomains can form is based on the observation that sphingolipids do not mix favorably with the other plasma-membrane lipids and may form a more ordered lipid domain on their own. Rafts are generally considered to be composed mainly of sphingolipids, and the surfaces of these subdomains are enriched with cholesterol.

3.5 Protein-Lipid interactions

It is known that protein-lipid interactions affect membrane protein folding [46]. As noted above, some lipids introduce stress into the lipid bilayer because of non-compatible shapes such as the inverted cone shape [7, 23, 24]. This stress affects membrane protein insertion and capacity. Other factors that might affect protein insertion are the size and charge of the lipid headgroups. The effects of protein-lipid interactions go both ways, as a mismatch in size between a hydrophobic protein segment and the surrounding membrane will also put stress on the lipid bilayer. A hydrophobic protein segment that is too long may force the protein to tilt but will also attract lipids with longer tails to its proximity, making the membrane thicker. The interplay works in the opposite way for a hydrophobic protein segment that is too short: it attracts lipids with shorter tails, making the bilayer thinner around the transmembrane domain. These situations stress both the membrane and the protein. This can affect the folding of the protein and might compromise its function or its ability to interact with other proteins.


Chapter 4. Protein Sorting

Proteins are functional units that need to be at the right place at the right time. Many proteins function in the cytoplasm. These proteins are the easiest to localize, as they can be translated and released on site. In contrast, integral membrane proteins and secreted proteins need to be identified and transported to their proper destinations. This is done with the help of a signal sequence, a hydrophobic N-terminal recognition sequence for the Signal Recognition Particle (SRP). If SRP recognizes the signal sequence, the ribosome and its nascent chain are transported to a so-called translocon, i.e. a protein-conducting channel in the ER or inner bacterial membrane [53].

What tells the translocon if the protein is to be secreted or inserted into the membrane is the presence of transmembrane domains (TMDs). TMDs are usually α-helical and are hydrophobic [54]. If a TMD is recognized by the translocon, membrane translocation of the nascent chain is aborted and the TMD is released laterally into the membrane.

In eukaryotic cells, once an integral membrane protein has been properly inserted and folded in the ER membrane it is ready to progress along the secretory pathway to reach its proper subcellular location, unless this happens to be the ER in which case it is retained there. The protein leaves the ER via a so-called ER exit site, moves on into the ER-to-Golgi intermediate compartment and onwards to the Golgi complex [74]. From here most proteins travel on to the plasma membrane while “escaped” ER proteins may be re-routed back to the ER.

This intracellular protein trafficking in the secretory pathway is based on a vesicular transport mechanism. A vesicle can be described as a small lipid bubble. The formation of transport vesicles is a tightly controlled process. Proteins travelling between the ER and the Golgi are transported in so-called COP vesicles: COP-II vesicles are used for transport from the ER to the Golgi, while COP-I vesicles are used for the return transport from the Golgi to the ER [36]. So how does the ER know which proteins should be retained in the ER and which should be sent onwards? It is thought that transmembrane proteins are sent to their target membranes depending on short sequence motifs located in domains exposed to the cytoplasmic side of the ER. These motifs interact with cytoplasmic factors, thereby enriching the proteins in the transport vesicles [25].

Subcellular trafficking is also required for proteins to be imported into mitochondria and peroxisomes (plus the chloroplast in plant cells). Mitochondrial proteins have a pre-sequence that is recognized by the so-called translocon of the outer membrane (TOM) complex in mitochondria. Proteins destined for the matrix are then translocated through the TIM (translocon of the inner membrane) complex, while inner- and outer-membrane proteins have internal sequence features that target them to their correct destination.

Peroxisomal proteins are translated in the cytoplasm before the peroxisomal targeting signal is recognized by a soluble receptor. The targeting sequence can be located at either the N- or the C-terminus of the protein. The mechanism of import is not well understood, but apparently involves “piggy-backing” of the imported protein on the receptor. The receptor is then recycled back into the cytoplasm.


Chapter 5. Machine Learning

Machine learning is the collective name for a set of methods used to train software to recognize patterns and features in new data by creating representative models for these patterns and features. This is done by giving the model input training data and iteratively optimizing its internal parameters to maximize its ability to correctly categorize the data. Machine learning methods can be either supervised or non-supervised. In supervised training, a model is trained with the correct answers available, and the parameters are changed so that the model better identifies the correct answers. In non-supervised training the model is trained without knowing which answers are correct and has to detect the structure in the data by itself [8]. There are several types of commonly used machine learning techniques in bioinformatics, the most common ones being Artificial Neural Networks, Support Vector Machines and Hidden Markov Models.

Figure 5.1 A model of a double layer neural network. The network has three input nodes, two output nodes and two layers of five hidden nodes each.

5.1 Artificial Neural Networks (ANN)

ANN methods have been inspired by how neurons operate in our brain. Each neuron is represented as a node [46]. Each node receives multiple input signals and uses a set of weights to convert these into one outgoing signal [18]. By using a number of “hidden” layers this can be used to create automatic processing and decision making in very complex situations. ANNs are created using supervised training: they are fed data with known correct answers, and the weights are iteratively updated to improve the ability of the ANN to produce the correct answers.
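The way a node converts several weighted inputs into one outgoing signal can be sketched in a few lines of Python. This is only an illustration and not code from any of the papers: the sigmoid activation, the function names and the arbitrary weights of 0.1 are assumptions, and the layer sizes simply mirror Figure 5.1 (three inputs, two hidden layers of five nodes, two outputs). Training, i.e. the iterative weight update, is not shown.

```python
import math


def sigmoid(value):
    """Squash any real number into the interval (0, 1)."""
    return 1.0 / (1.0 + math.exp(-value))


def node_output(inputs, weights, bias=0.0):
    """One node: weighted sum of the inputs, then an activation function."""
    weighted_sum = sum(x * w for x, w in zip(inputs, weights)) + bias
    return sigmoid(weighted_sum)


def layer_output(inputs, weight_rows, biases):
    """One layer: every node sees the same inputs but has its own weights."""
    return [node_output(inputs, row, b) for row, b in zip(weight_rows, biases)]


def forward(inputs, layers):
    """Pass a signal through all layers, from input to output."""
    signal = inputs
    for weight_rows, biases in layers:
        signal = layer_output(signal, weight_rows, biases)
    return signal


# Hypothetical network shaped like Figure 5.1; all weights are arbitrary.
layers = [
    ([[0.1] * 3] * 5, [0.0] * 5),  # 3 inputs -> 5 hidden nodes
    ([[0.1] * 5] * 5, [0.0] * 5),  # 5 hidden -> 5 hidden nodes
    ([[0.1] * 5] * 2, [0.0] * 2),  # 5 hidden -> 2 output nodes
]
outputs = forward([1.0, 0.0, 1.0], layers)
```

Supervised training would repeatedly compare `outputs` with the known correct answers and adjust the weights to reduce the difference.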

Figure 5.2 An example of a simple type of SVM that uses a linear kernel to classify a two-class dataset. The linear decision boundary is the purple line and the two classes are represented by dots of two different colors, blue and red.

5.2 Support Vector Machines (SVMs)

SVMs are models that use kernels to separate data into two classes. The basic idea is easily visualized with a linear kernel: in a two-dimensional plot the data are separated by a line (the decision boundary obtained with the linear kernel), and the classes are the two regions on either side of the line. SVMs are trained by fitting this boundary so that the two classes in the training data are separated by as large a margin as possible. Most SVMs use more complex kernels than linear kernels and use rigorous mathematics to optimize the separation of classes [10].
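The two-dimensional picture can be made concrete with a short Python sketch. It only shows how an already fitted line classifies points and how the margin of that line is measured; the optimization that finds the best line is not shown. The weights, bias and data points are hypothetical values chosen for illustration.

```python
import math


def decision_value(point, weights, bias):
    """Signed score of a point; the separating line is where this equals 0."""
    return sum(w * x for w, x in zip(weights, point)) + bias


def classify(point, weights, bias):
    """Assign class +1 or -1 depending on the side of the line."""
    return 1 if decision_value(point, weights, bias) >= 0 else -1


def margin(points, labels, weights, bias):
    """Smallest distance from any training point to the line.

    SVM training looks for the weights and bias that make this as large
    as possible while keeping every point on its correct side.
    """
    norm = math.sqrt(sum(w * w for w in weights))
    return min(
        label * decision_value(point, weights, bias) / norm
        for point, label in zip(points, labels)
    )


# Hypothetical line x - y = 0 and two training points, one per class.
weights, bias = [1.0, -1.0], 0.0
points = [(2.0, 0.0), (0.0, 2.0)]
labels = [1, -1]
predicted = [classify(p, weights, bias) for p in points]
smallest_distance = margin(points, labels, weights, bias)
```

A negative value from `margin` would mean that at least one training point lies on the wrong side of the line.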


Figure 5.3 A typical sequence profile HMM, the most common type of HMM in bioinformatics. The purple squares represent match states (M), the orange diamonds represent insert states and the red circles represent delete states. The arrows in the model represent transitions, each with a transition probability, and all numbered states have emission probabilities. The delete states are not numbered because in a standard HMM they do not emit anything.

5.3 Hidden Markov Models (HMMs)

HMMs were initially used for speech recognition but soon found use in bioinformatics. HMMs are statistical models with a linear dependency structure that makes them excellent for modelling biological sequences. An HMM can be described as a set of connected states, each with preset probabilities of producing each of a set of possible observations. Each state has its own set of observation probabilities; these are called emission probabilities. Each state is also connected to a set of other states located after it in the model, and each connection has a transition probability of moving from one state to the next. This means that the probability for each state depends only on the states that connect to it and is independent of all other states. Using an HMM to evaluate a certain series of observations is a rather straightforward multiplication of probabilities. If you have a series of connected states where each state is connected to several others, the same series of observations can be generated by several possible series of states. In most cases the only interesting series of states is the most probable one. To find this probability and the series of states that generated it one can use the Viterbi algorithm. This is a dynamic programming algorithm that finds the most probable series of states for a certain set of input data [57, 70].
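A minimal Python sketch of the Viterbi algorithm is given here to illustrate the idea. The two-state model (states `membrane` and `loop`, emitting hydrophobic `H` or polar `P` residues) and all its probabilities are hypothetical and not taken from any of the papers. Plain probabilities are multiplied to match the description above; a practical implementation would normally work with logarithms to avoid numerical underflow on long sequences.

```python
def viterbi(observations, states, start_p, trans_p, emit_p):
    """Return (probability, path) of the most probable series of states."""
    # best[t][s] = (probability of the best path ending in s at time t,
    #               state visited just before s on that path)
    best = [{s: (start_p[s] * emit_p[s][observations[0]], None) for s in states}]
    for obs in observations[1:]:
        column = {}
        for s in states:
            # Choose the predecessor that maximizes the path probability.
            prob, prev = max((best[-1][r][0] * trans_p[r][s], r) for r in states)
            column[s] = (prob * emit_p[s][obs], prev)
        best.append(column)

    # Trace back from the most probable final state.
    last = max(states, key=lambda s: best[-1][s][0])
    probability = best[-1][last][0]
    path = [last]
    for column in reversed(best[1:]):
        path.append(column[path[-1]][1])
    path.reverse()
    return probability, path


# Hypothetical toy model: hydrophobic (H) residues are more likely in the
# membrane state, polar (P) residues in the loop state.
states = ("membrane", "loop")
start_p = {"membrane": 0.5, "loop": 0.5}
trans_p = {
    "membrane": {"membrane": 0.9, "loop": 0.1},
    "loop": {"membrane": 0.1, "loop": 0.9},
}
emit_p = {
    "membrane": {"H": 0.8, "P": 0.2},
    "loop": {"H": 0.3, "P": 0.7},
}
probability, path = viterbi("HHHPPP", states, start_p, trans_p, emit_p)
```

Because each column only stores the best path into each state, the running time grows linearly with the length of the sequence rather than with the number of possible state series.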


Chapter 6. Summary of Papers

6.1 Using multi-data hidden Markov models trained on local neighborhoods of protein structure to predict residue-residue contacts (Paper I)

The ability to predict protein structure from physical principles is one of the main goals in protein structure prediction since it would allow scientists to predict and study unknown structures. Accurate protein models can be created when proteins with known structure can be found that are closely related to the target protein you want to fold, or, if the protein sequence is very short, the protein can be folded by ab initio methods. For the remaining “unfoldable” proteins, the inability to fold them properly is considered to be due to the lack of correct long-range interactions [21]. It was shown that contact predictions with accuracy above 22% could improve ab initio predictions of protein structures [80]. In this paper we show that this level of contact prediction is achievable by looking at recurring structural neighborhoods that are in theory fold-independent. These neighborhoods could be aligned to their proper location using hidden Markov models that were dual-layered, using both sequence and secondary structure data. The accuracy of the predictor FragHMMent that we released in the paper was shown to be 22.8% when tested on proteins with unknown fold, at the time outperforming all previously published results. FragHMMent was then tested in CASP8 where it was ranked as one of the top five methods in contact prediction (group number RR158) [81]. We also studied the actual distribution of different types of interactions in proteins of various lengths. It was generally considered that the number of actual interactions in a protein is linearly related to the length of the protein. Looking at the interactions, it can be seen that this is true for short- and medium-range interactions but not for long-range interactions. This helps to explain why the structure of longer proteins is so much more difficult to predict accurately using physical principles.


6.2 Molecular recognition of a single sphingolipid species by a protein’s transmembrane domain (Paper II)

In this paper we studied the specific interaction between transmembrane domains and sphingolipids. The starting point was the finding that the protein p24 interacts with a specific species of sphingolipid, sphingomyelin18 (SM18). A specific binding motif, VxxTLxxIY, could be characterized that facilitated this interaction between the transmembrane domain and SM18. The motif forms a crevice on the transmembrane helix that the lipid fits into. The tyrosine is what allows the transmembrane domain to interact with the headgroup of SM18, while the remaining residues in the motif interact with the tails of the lipid. These interactions were also found to help regulate COP-I transport by promoting the transition of p24 from an inactive monomeric state to an active dimeric state. A question arose whether this motif in p24 was a unique biological feature or if more examples could be found. Therefore a loose motif was generated where the residues in the motif were replaced by amino acids with the same side-chain properties:

(V/T/L/I)xx(V/T/L/I)(V/T/L/I)xx(V/T/L/I)(F/W/Y).

The above loose motif actually represents 768 possible motifs (four alternatives at each of four positions and three at the last, 4 × 4 × 4 × 4 × 3 = 768), and searching a large protein database with this many motifs would generate an unmanageable number of candidate proteins. Therefore a transmembrane motif evaluation method was developed to find the most significant motifs. This method was used on a mammalian protein dataset composed only of single-spanning transmembrane domains. After analyzing these motifs we found that 13 motifs were overrepresented, and these motifs were used to mine for candidate single-spanning proteins. Four of the 48 candidates found were tested experimentally and found to bind sphingolipids.
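The number of concrete motifs covered by the loose motif can be checked by enumerating them. The Python sketch below does only that and converts a motif into a regular expression for scanning a sequence; it is not the motif evaluation method developed in the paper. The function names and the example sequence are hypothetical.

```python
import re
from itertools import product

# Residue alternatives taken from the loose motif in the text.
CORE_RESIDUES = "VTLI"      # positions 1, 4, 5 and 8
AROMATIC_RESIDUES = "FWY"   # position 9


def expand_loose_motif():
    """List every concrete motif covered by the loose motif."""
    return [
        f"{a}xx{b}{c}xx{d}{e}"
        for a, b, c, d, e in product(
            CORE_RESIDUES, CORE_RESIDUES, CORE_RESIDUES,
            CORE_RESIDUES, AROMATIC_RESIDUES,
        )
    ]


def motif_to_regex(motif):
    """Turn a motif such as VxxTLxxIY into a regex; x matches any residue."""
    return re.compile(motif.replace("x", "."))


motifs = expand_loose_motif()

# Hypothetical sequence fragment containing the p24 motif VxxTLxxIY.
example_sequence = "AAVLLTLGGIYAA"
match = motif_to_regex("VxxTLxxIY").search(example_sequence)
```

`len(motifs)` reproduces the count of 768 given in the text, and the original p24 motif is one of the enumerated entries.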


6.3 Bioinformatics–based identification of novel sphingolipid–binding motifs in mammalian membrane proteins (Paper III)

This is a follow-up study of Paper II. The method developed in Paper II was expanded into a downloadable application. The new application, MOPRO, was used to find possible interactions between sphingolipids and multi-spanning transmembrane proteins by recognizing sphingolipid-binding motifs. We used the same loose starting motif as in Paper II, but now analyzing mammalian multi-spanning membrane proteins. 22 motifs were found to be overrepresented. Of these, 21 were novel and one was overrepresented already in the earlier analysis. We then used these motifs to find both single- and multi-spanning candidates for sphingolipid binding. Four candidates were tested for binding of sphingolipids and all four tested positive. We noted a larger than expected number of GPCRs in our candidate list. Most of the sphingolipid-binding motifs were located in TM6, while TM3 was underrepresented. The fact that TM3 is underrepresented makes sense since it is located in the functional core of GPCRs and is surrounded by other transmembrane domains, so that it has little contact with the lipid bilayer. That TM6 is overrepresented is interesting and hints at a regulatory function of lipid binding in some GPCRs. It is known that substrates interacting with TM6 can instigate large structural rearrangements on the cytoplasmic side.


6.4 Comparative analysis and unification of domain-domain interaction networks (Paper IV)

In this paper we analyze a large number of published and publicly available protein-protein and protein domain interaction networks. We compare the contents on a protein domain level and translate all networks to the same protein domain identifier, Pfam. After this a new measure to compare networks is developed. This measure is called weighted overlap and is an extension of the old overlap measure. The difference between the old measure and the new is that with weighted overlap you only compare the common domain space. This has the advantage not only that prediction algorithms can compete on a fairer level, but also that large networks are not disfavored as much when compared against a gold standard of known protein domain interactions. The downloaded networks are all evaluated against a gold standard that is composed of protein domain interactions extracted from structures. The comparison is made using both the overlap and weighted overlap measures. In the comparison we can see that our new measure is successful in preventing large networks from being discriminated against in comparison with smaller ones. It also shows that using structures as a gold standard heavily favors networks that predict homotypic domain interactions. One could also suspect that it pushes networks to predict stable rather than transient interactions, but in order to predict these a gold standard of transient interactions must be created (I suspect it is no small feat). Having evaluated a large number of networks and translated them into a common domain identifier, we merge them to create a large composite domain interaction network called Unidomint, which combines the weights of all networks and produces a unique score for each interaction. We then compare this network against the only other known composite network, Domine, and find that our network outperforms it when compared against the gold standard.
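The effect of restricting a comparison to the common domain space can be illustrated with a small Python sketch. The exact definition of weighted overlap is given in Paper IV and is not reproduced in this summary, so both functions below are assumptions made purely for illustration: `plain_overlap` is taken to be the fraction of a network's interactions that occur in the gold standard, and `overlap_in_common_domain_space` computes the same fraction after discarding interactions that involve domains the gold standard does not cover. The domain identifiers are hypothetical.

```python
def normalise(pairs):
    """Store each interaction as a sorted tuple so that A-B equals B-A."""
    return {tuple(sorted(pair)) for pair in pairs}


def domains_of(network):
    """All domain identifiers that occur in a network."""
    return {domain for pair in network for domain in pair}


def plain_overlap(network, gold):
    """Assumed definition: share of the network found in the gold standard."""
    return len(network & gold) / len(network) if network else 0.0


def overlap_in_common_domain_space(network, gold):
    """Same share, but only over domains present in both networks."""
    common = domains_of(network) & domains_of(gold)
    restricted = {p for p in network if p[0] in common and p[1] in common}
    return len(restricted & gold) / len(restricted) if restricted else 0.0


# Hypothetical gold standard, including one homotypic interaction.
gold = normalise([("PF_A", "PF_A"), ("PF_A", "PF_B")])

# Hypothetical large predicted network covering domains the gold standard lacks.
network = normalise([
    ("PF_B", "PF_A"),
    ("PF_C", "PF_D"),
    ("PF_C", "PF_E"),
    ("PF_A", "PF_C"),
])

unrestricted_score = plain_overlap(network, gold)
restricted_score = overlap_in_common_domain_space(network, gold)
```

In this toy case the large network is penalized by the unrestricted score for predicting interactions among domains that the gold standard cannot confirm or refute, while the restricted score judges it only where a comparison is possible.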


Chapter 7. Final thoughts

During my brief but fun time as a Ph.D. student I have seen an amazing transformation in the field of bioinformatics. This transformation has, alas, probably made two of my publications, Papers I and IV, slightly redundant. The transformation is centered on the availability of data. When I started my work, sequence data was in short supply and it was a lot of work to find sufficient data to develop methods, and to find other data to compare and validate your findings against. Then, with the advent of next-generation sequencing, all that changed and the field was transformed in a short period of time. Suddenly finding data was not an issue; rather, finding the correct data among all the data became the central core of applied bioinformatics.

When Paper I was published another paper was most likely under review that used correlated mutations to predict intra-molecular contacts. Until this point the use of correlated mutations had been a theoretically appealing idea for predicting contacts, but the method failed because it could not separate indirect mutational pressure from direct mutational pressure. With the availability of more sequences and some clever statistics these shortcomings were overcome using sparse inverse covariance estimation, which can separate direct and indirect coupling [86]. These methods now perform at a level that purely fragment-based prediction methods will have a hard time competing with, even on a theoretical level.

Paper IV was also made a bit redundant by the possibility of generating such large quantities of data with such ease. That being said, questions about protein interactions are probably more important now than ever. In order to fully understand the cell, protein interactions need to be properly understood, but it does not end there. We need to find and understand the non-protein interaction partners that proteins have if we are ever to fully understand the roles that proteins have in cells. This can be seen in Papers II and III, where we discuss how lipids are used to regulate protein activity and protein transport. These papers also show that lipid-protein interactions are not only specific and functionally important, but might be very common. I still believe there are many discoveries to be made looking only at the proteins themselves, but in the long run the folding process and its outcomes, for example, will be impossible to fully understand without considering the interactions between proteins and their molecular chaperones. I do believe this way of thinking is far off into the future but I really hope to see it in my lifetime.


Chapter 8. Swedish summary

The cell can be described as the smallest living unit; organisms come in different forms, both as single-celled and as multicellular organisms. Cells have many properties in common but also display a surprising diversity. These properties have led to organisms often being divided into different categories. The largest of these categories are called kingdoms or domains; there are three kingdoms in nature.

Cells consist for the most part of membranes, proteins, DNA and RNA. DNA is the way in which cells store their information, and RNA is used to convert that information into proteins. Proteins can be described as biological machines that carry out most of the processes that make life possible. The membrane is the wall that separates and defines the cell.

This thesis is about proteins and their interactions, both internal and external. It contains four studies in which we have studied these interactions. In the first study we could see that nature uses recurring structural regions that can be described and applied to unknown proteins to see which parts interact with each other. In the second and third studies we saw how parts of proteins interact with lipids, the components that membranes are made of. We could also show that this was used to regulate protein transport. In the last study we studied how different methods can predict domain interactions between proteins and used these networks to create a large composite network.


Chapter 9. Acknowledgements

I have a lot of people to thank and will start with the people I have spent the most time with, starting with Gunnar, whose passion for science seems to know no boundaries and infects the people around him! I would also like to thank Arne Elofsson and Erik Lindahl for the time in the C4 corridor; I always had a fun time discussing both science and life in general with you. I have had a lot of fun during my time at DBB, but the best part was definitely the period when I shared an office with Christoph and Sikander; I always enjoyed meeting you guys in the morning and having a ton of both serious and not so serious discussions. I would like to thank all the people I shared the C4 corridor with, a place where something always happened. I also want to thank all the people in Gunnar's group who have always taken the time to talk to me and made me feel welcome there; I would especially like to thank Johannes, Nurzian, Salome, Florian and Nina! I should also mention the people I taught basic chemistry with; three people come to mind, Loppan, Gabriell and Johannes Björnerås, with whom I spent the most time teaching! To be honest, teaching was never really my thing, but all the people I spent my time tutoring with made it a lot of fun, so thank you all! Some other people I should mention are Viktor Granholm, Kristoffer Forslund, Bengt Persson and Dan Daley.

Other people I should thank are the Heidelberg crew, Andreas, Britta and Felix who showed me that scientific collaborations across countries are not only possible but extremely fun! Should also thank Torgeir who sparked my interest in science!

I should also mention all the wonderful people who grace my life with their presence but are not people I work with. Daniel Larsson, I always enjoy discussing world problems with you; I apologize that you occasionally have to listen to us, Tova! Fredrik Almroth, geography and time make us talk too far in between, but it feels as if we always pick up the ball where we left it the last time; I plan to get better at calling though. Gustaf and Hanna, long-time friends from student times, it is always a source of comfort that you live close by! Martin and Tess, I always enjoy spending time with you and your families; don't ever feel like strangers. Fredrik Lysholm, a surprisingly recent friendship but a much appreciated one, you are a living testament to all the advantages of being an open and warm personality... also on my list of people I should call more often! I would also like to extend my gratitude to the old TBI gang from my student times (Gustav, Erik, Christian and Per); you always make me feel younger again, at least until the day after...

Would also like to thank my family for being there, pretty sure that my parents would not expect to see their son write a thesis in my adolescent years, my older sister Jenny would have been a safer bet... Thank you for teaching me the strengths of individualism and making your own path in life and for being there in good and bad times!

Last but most definitely not least I would like to thank my better half Hanna for standing by me and occasionally teaching me that when you hit a dead end, in life or professionally, don't try to push through walls too long; just turn around, go back and start over =). I would also like to thank her parents and her sister with family for always making me feel welcome. I also want to thank my children, Elias and his as yet unnamed little brother, for making my life so much richer, if a bit more unpredictable!

I am sorry to all people I should have thanked, but forgot... I am a bit sleep-deprived writing this so in case you feel left out or slighted send me an email and I will try to make it up to you!


References

[1] Anfinsen CB (1973) Principles that govern the folding of protein chains. Science. 181: 223–230

[2] Attwood TK, Coletta A, Muirhead G, Pavlopoulou A, Philippou PB, Popov I, Roma-Mateo C, Theodosiou A, Mitchell A (2012) The PRINTS database: a fine-grained protein sequence annotation and analysis resource - its status in 2012. Database. 10.1093/database/base019

[3] Barlow DJ, Thornton JM (1983) Ion-pairs in proteins. Mol Biol. 168: 867-885

[4] Bernsel A, Viklund H, Falk J, Lindahl E, von Heijne G, Elofsson A (2008) Prediction of membrane-protein topology from first principles. Proc Natl Acad Sci USA. 105: 7177-7181

[5] Bernstein FC, Koetzle TF, Williams GJB, Meyer Jr. EF, Brice MD, Rodgers JR, Kennard O, Shimanouchi T, Tasumi M (1977) The Protein Data Bank: a computer-based archival file for macromolecular structures. J Mol Biol. 112: 535-542

[6] Bonneau R, Baker D (2001) Ab initio protein structure prediction: progress and prospects. Annu Rev Biophys Biomol Struct. 30: 173-189

[7] Booth PJ, Rachael AC (1999) Membrane Protein Folding. Current Opinion in Structural Biology. 9: 115–121

[8] Bradley AP (1997) The use of the area under the ROC curve in the evaluation of machine learning algorithms. Pattern Recognition. 30: 1145–1159

[9] Bru C, Courcelle E, Carrère S, Beausse Y, Dalmar S, Kahn D (2005) The ProDom database of protein domain families: more emphasis on 3D. Nucleic Acids Research. 33: 212-215

[10] Burges CJ (1998) A Tutorial on Support Vector Machines for Pattern Recognition. Data Mining and Knowledge Discovery. 2: 121-167

[11] Butland G, Peregrin-Alvarez JM, Li J, Yang W, Yang X, Canadien V, Starostine A, Richards D, Beattie B, Krogan N, et al. (2005) Interaction network containing conserved and essential protein complexes in Escherichia coli. Nature. 433: 531-537

[12] Chandler D (2005) Interfaces and the driving force of hydrophobic assembly. Nature. 437: 640-647

[13] Chen XW, Liu M (2005) Prediction of protein–protein interactions using random decision forest framework. Bioinformatics. 21(24): 4394-4400

[14] Chothia C, Gough J, Vogel C, Teichmann SA (2003) Evolution of the protein repertoire. Science. 300: 1701-1703

[15] Crick F (1970) Central dogma of molecular biology. Nature. 227: 561–563

[16] Dill KA (1990) Dominant forces in protein folding. Biochemistry. 31: 7133-55


[17] Doolittle RF (1995) The multiplicity of domains in proteins. Annual Reviews in Biochemistry. 64: 287-314

[18] Farley B, Clark WA (1954) Simulation of Self-Organizing Systems by Digital Computer. IRE Transactions on Information Theory. 4: 76–84

[19] Finn RD, Miller BL, Clements J, Bateman A (2013) iPfam: a database of protein family and domain interactions found in the Protein Data Bank. Nucleic Acids Research. doi: 10.1093/nar/gkt1210

[20] Fiser A, Do RK, Sali A (2000) Modeling of loops in protein structures. Protein Science. 9: 1753-1773

[21] Floudas CA, Fung HK, McAllister SR, Mönnigmann M, Rajgaria R (2006) Advances in protein structure prediction and de novo protein design: A review. Chemical Engineering Science. 61: 966-988

[22] Galian C, Björkholm P, Bulleid N, von Heijne G (2012) Efficient glycosylphosphatidylinositol (GPI) modification of membrane proteins requires a C-terminal anchoring signal of marginal hydrophobicity. Journal of Biological Chemistry. 287: 16399-16409

[23] Gavin AC, Bosche M, Krause R, Grandi P, Marzioch M, Bauer A, Schultz J, Rick JM, Michon AM, Cruciat CM, et al. (2002) Functional organization of the yeast proteome by systematic analysis of protein complexes. Nature. 415: 141-147

[24] Giot L, Bader JS, Brouwer C, Chaudhuri A, Kuang B, Li Y, Hao YL, Ooi CE, Godwin B, Vitols E, et al. (2003) A protein interaction map of Drosophila melanogaster. Science. 302: 1727-1736

[25] Glick BS, Nakano A (2009) Membrane traffic within the Golgi apparatus. Annu Rev Cell Dev Biol. 25: 113-32

[26] Gruner SM (1985) Intrinsic curvature hypothesis for biomembrane lipid composition: a role for nonbilayer lipids. Proc Natl Acad Sci USA. 82: 3665-3669

[27] Han J-H, Batey S, Nickson A, Teichmann S, Clarke J (2007) The folding and evolution of multidomain proteins. Nature Reviews Molecular Cell Biology. 8: 319-330

[28] Heger A, Wilton CA, Sivakumar A, Holm L (2005) ADDA: a domain database with global coverage of the protein universe. Nucleic Acids Research. 33: 188-191

[29] von Heijne G (1992) Membrane protein structure prediction. Hydrophobicity analysis and the positive-inside rule. J Mol Biol. 225: 487-494

[30] Hessa T, Kim H, Bihlmaier K, Lundin C, Boekel J, Andersson H, Nilsson I, White SH, von Heijne G (2005) Recognition of transmembrane helices by the endoplasmic reticulum translocon. Nature. 433: 377-381

[31] Itzhaki Z, Akiva E, Altuvia Y, Margalit H (2006) Evolutionary conservation of domain-domain interactions. Genome Biol. 7: R125

[32] Jacobson K, Mouritsen OG, Anderson RG (2007) Lipid rafts: at a


[33] Johnson JE, Cornell RB (1999) Amphitropic proteins: regulation by reversible membrane interactions. Molecular Membrane Biology. 16: 217-235

[34] Jothi R, Cherukuri PF, Tasneem A, Przytycka TM (2006) Co-evolutionary Analysis of Domains in Interacting Proteins Reveals Insights into Domain–Domain Interactions Mediating Protein–Protein Interactions. J Mol Biol. 362: 861–875

[35] Kaczanowski S, Zielenkiewicz P (2010) Why similar protein sequences encode similar three-dimensional structures? Theoretical Chemistry Accounts. 125: 543–550

[36] Kaiser CA, Schekman R (1990) Distinct sets of SEC genes govern transport vesicle formation and fusion early in the secretory pathway. Cell. 61: 723-733

[37] Kim WK, Park J, Suh JK (2002) Large scale statistical prediction of protein-protein interaction by potentially interacting domain (PID) pair. Genome Inform. 13: 42-50

[38] Lee H, Deng M, Sun F, Chen T (2006) An integrated approach to the prediction of domain-domain interactions. BMC Bioinformatics. 7: 269

[39] Li S, Armstrong CM, Bertin N, Ge H, Milstein S, Boxem M, Vidalain PO, Han JD, Chesneau A, Hao T, et al. (2004) A map of the interactome network of the metazoan C. elegans. Science. 303: 540-543

[40] Lim WA, Richards FM, Fox RO (1994) Structural determinants of peptide-binding orientation and of sequence specificity in SH3 domains. Nature. 372: 375-379

[41] Lomize MA, Lomize AL, Pogozheva ID, Mosberg HI (2006) OPM: Orientations of Proteins in Membranes database. Bioinformatics. 22: 623-625

[42] Luca S, Heise H, Baldus M (2003) High-resolution solid-state NMR applied to polypeptides and membrane proteins. Acc Chem Res. 36: 858-865

[43] Mandelkern M, Elias JG, Eden D, Crothers DM (1981) The Dimensions of DNA in Solution. J Mol Biol. 152: 153-161

[44] van Meer G, Voelker DR, Feigenson GW (2008) Membrane lipids: where they are and how they behave. Nat Rev Mol Cell Biol. 9: 112–124

[45] Marsh D (2006) Elastic curvature constants of lipid monolayers and bilayers. Chem Phys Lipids. 144: 146-159

[46] McCulloch W, Pitts W (1943) A Logical Calculus of Ideas Immanent in Nervous Activity. Bulletin of Mathematical Biophysics. 5: 115–133

[47] Mirsky AE, Pauling L (1936) On the Structure of Native, Denatured, and Coagulated Proteins. Proc Natl Acad Sci USA. 22: 439–447

[48] Miles AJ, Whitmore L, Wallace BA (2005) Spectral magnitude effects on the analyses of secondary structure from circular dichroism spectroscopic data. Protein Sci. 14: 368–374


[49] Nagle JF, Tristram-Nagle S (2000) Structure of lipid bilayers. Biochimica et Biophysica Acta. 1469: 159-195

[50] Ng SK, Wong M (1999) Toward Routine Automatic Pathway Discovery from On-line Scientific Text Abstracts. Genome Inform Ser Workshop Genome Inform. 10: 104-112

[51] Nooren IMA, Thornton JM (2003) Diversity of protein–protein interactions. The EMBO Journal. 22: 3486–3492

[52] Pagel P, Wong P, Frishman D (2004) A domain interaction map based on phylogenetic profiling. J Mol Biol. 344: 1331-1346

[53] Park E, Rapoport T (2011) Mechanisms of Sec61/SecY-mediated protein translocation across membranes. Ann Rev Biophys. 41: 21-40

[54] Pasquier C, Palaios GA, Hamodrakas JS, Hamodrakas SJ (1999) A novel method for predicting transmembrane segments in proteins based on a statistical analysis of the SwissProt database: the PRED-TMR algorithm. Protein Eng. 12: 381–385

[55] Pawson T, Nash P (2003) Assembly of cell regulatory systems through protein interaction domains. Science. 300: 445-452

[56] Pereira-Leal JB, Teichmann SA (2005) Novel specificities emerge by stepwise duplication of functional modules. Genome Res. 15: 552-559

[57] Rabiner D (1989) A tutorial on hidden Markov models and selected applications in speech recognition. Proc IEEE. 77: 257-286

[58] Raman S, Vernon R, Thompson J, Tyka M, Sadreyev R, Pei J, Kim D, Kellogg E, DiMaio F, Lange O, Kinch L, Sheffler W, Kim BH, Das R, Grishin NV, Baker D (2009) Structure prediction for CASP8 with all-atom refinement using Rosetta. Proteins. 9: 89-99

[59] Riley R, Lee C, Sabatti C, Eisenberg D (2005) Inferring protein domain interactions from databases of interacting proteins. Genome Biol. 6: R89

[60] Rupp B, Wang J (2004) Predictive models for protein crystallization. Methods. 34: 390-407

[61] Sali A, Blundell TL (1993) Comparative protein modelling by satisfaction of spatial restraints. J Mol Biol. 234: 779-815

[62] Sigrist CJA, de Castro E, Cerutti L, Cuche BA, Hulo N, Bridge A, Bougueleret L, Xenarios I (2012) New and continuing developments at PROSITE. Nucleic Acids Res. doi: 10.1093/nar/gks1067

[63] Singer SJ, Nicolson GL (1972) The fluid mosaic model of the structure of cell membranes. Science. 175: 720-731

[64] Sonnhammer EL, Kahn D (1994) Modular arrangement of proteins as inferred from analysis of homology. Protein Science. 3: 482-492

[65] Sonnhammer EL (1998) Protein family databases for automated protein domain identification. Database. 9: 68-78

[66] Strop P, Mayo SL (2000) Contribution of Surface Salt Bridges to
