Automated annotation of protein families


Department of Physics, Chemistry and Biology

Master’s Thesis

Automated annotation of protein families

Eric Elfving

LiTH-IFM-EX--11/2551--SE

Department of Physics, Chemistry and Biology Linköpings universitet


Supervisor: Joel Hedlund

ifm, Linköpings universitet

Examiner: Bengt Persson

ifm, Linköpings universitet


Division, Department: Bioinformatics, Department of Physics and Measurement Technology, Linköpings universitet, SE-581 83 Linköping, Sweden

Date: 2011-06-17

Language: Swedish / English

Report category: Licentiate thesis / Degree project / C-level essay / D-level essay / Other report

URL for electronic version: http://www.ifm.liu.se/bioinfo/ and http://urn.kb.se/resolve?urn=urn:nbn:se:liu:diva-69393

ISBN: -

ISRN: LiTH-IFM-EX--11/2551--SE

Title of series, numbering: -

ISSN: -

Title: Automatiserad annotering av proteinfamiljer (Swedish); Automated annotation of protein families (English)

Author: Eric Elfving


Abstract

Introduction: The great challenge in bioinformatics is data integration. The amount of available data is always increasing and there are no common unified standards of where, or how, the data should be stored. The aim of this work is to build an automated tool to annotate the different member families within the protein superfamily of medium-chain dehydrogenases/reductases (MDR), by finding common properties among the member proteins. The goal is to increase the understanding of the MDR superfamily as well as the different member families. This will add to the amount of knowledge gained for free when a new, unannotated protein is matched as a member of a specific MDR member family.

Method: The different types of data available all needed different handling. Textual data was mainly compared as strings, while numeric data needed special handling such as statistical calculations. Ontological data was handled as tree nodes where ancestry between terms had to be considered. This was implemented as a plugin-based system to make the tool easy to extend with additional data sources of different types.

Results: The biggest challenge was data incompleteness, which yielded few (or no) results for some families and thus decreased the statistical significance of the results. Results show that all the human and mouse MDR members have a Pfam ADH domain (ADH_N and/or ADH_zinc_N) and take part in an oxidation-reduction process, often with NAD or NADP as cofactor. Many of the proteins contain zinc and are expressed in liver tissue.

Conclusions: A Python-based tool for automatic annotation has been created to annotate the different MDR member families. The tool is easily extendable to be used with new databases, and many of the results agree with information found in the literature. The utility and necessity of this system, as well as the quality of its produced results, are expected to only increase over time, even if no additional extensions are produced, as the system itself is able to make further and more detailed inferences as more and more data become available.


Summary (Sammanfattning)

Introduction: The great challenge in bioinformatics is data integration. The amount of available data is constantly increasing and there are no common standards for how, or where, data should be stored. The aim of this work is to create an automated tool for annotating protein families within the MDR superfamily by finding common properties among the member proteins. The purpose is to increase the understanding of both the superfamily as a whole and the individual protein families, and thereby to increase the amount of knowledge gained for free when newly found proteins match one of the member families.

Method: The different databases contain several kinds of data that required different handling. Textual data was handled with ordinary string processing, while numeric data required more advanced treatment such as statistical calculations. Ontology data was handled as tree nodes in order to take the built-in relationships between terms into account. The tool was implemented as a plugin-based system to simplify extension and the inclusion of more data sources.

Results: The biggest challenge was dealing with missing data, above all in less studied families, which reduced the statistical confidence of the results. MDR proteins in human and mouse were found to have at least one of the Pfam domains ADH_N and ADH_zinc_N. They also take part in oxidation-reduction processes and often use NAD or NADP as cofactor. Many of the proteins bind zinc and are expressed in liver.

Conclusions: A Python-based tool for automatic annotation of protein families belonging to the MDR superfamily has been produced. The tool is easy to extend for use with new data sources, and its results agree well with the literature. The usefulness of and need for the system, as well as the quality of its results, will increase continuously over time, even if no new extensions are created, since the system can make more and more detailed inferences as new data become available.


Acknowledgments

Firstly, I would like to thank my supervisor Joel Hedlund for all the good comments and constant positive encouragement, and my examiner Bengt Persson. I would also like to thank Torbjörn Jonsson, my boss and mentor at IDA, who gave me a lot of administrative work instead of the usual scheduled education so that I had time to focus on this thesis. I thank Christopher and Erik at home for good comments, discussions and support, my family for all their support and encouragement and, finally, all my friends at [hg] (and especially Petter) for all the good times and things to do in my spare time.


1 Introduction

The great challenge in bioinformatics is data integration. Each group focuses on “their” topic and publishes their own data. There are no common unified standards of where, or how, the data should be stored but rather multiple, independently evolving, de facto standards. The amount of available data, as well as the number of databases, is growing steadily.

Hedlund et al.[1] have assembled a database of proteins in the medium-chain dehydrogenases/reductases (MDR) protein superfamily by creating hidden Markov models (HMMs) describing the member families. The aim of this project is to build an automated tool to annotate the different member families in this database by finding common properties of member proteins. The data will be gathered from several publicly available databases. The MDRs are widespread and exist in all kingdoms of life; however, the main focus of this project will be on human and mouse proteins, since these two species are the richest in reliable annotation sources. The goal is to find common properties of each member family to increase the understanding of both the member families and the entire MDR superfamily. This knowledge will then be helpful when a new, unannotated protein is matched as a member of a specific MDR member family. The tool will be created in a general way, making it easy to extend to include other databases and to annotate other protein families.

The utility and necessity of this system, as well as the projected quality of its produced results, are expected to only ever increase over time, even if no additional extensions are produced, as the system itself is able to make further and more detailed inferences as more and more data become available.

1.1 Bioinformatics

The Human Genome Project was launched in 1990 and presented its draft sequence of the human genome around the year 2000. The goal of the project was to sequence the human genome, and this generated a lot of data. Handling this vast amount of data required advanced computational algorithms, which drove the rapid growth of the field of bioinformatics. The main goal of bioinformatics is to increase the knowledge of all biological processes by collecting, organising, interpreting and visualising the available data. Easy access to good, annotated data makes it easier for researchers to find new fields of study.

At present, there is a great deal of data on proteins, but not as much on families and superfamilies, and we have found no other software that annotates protein families.

1.2 The MDR families

Hedlund et al.[1] estimate the number of MDR families to be around 500. In their work, 86 of those have been classified. In this thesis the focus will be on the eight families with at least two protein members expressed in human or mouse tissue, described below (adapted from [1,2]).

1.2.1 ADH - Alcohol Dehydrogenase

ADH exists in five different forms (classes I-V) in vertebrates. Class I is typically found in liver tissue and has high activity against several alcohols, including ethanol. Class II has been found in human liver. It is much less studied but metabolises peroxidic aldehydes and norepinephrine[3]. Class III metabolises formaldehyde and fatty acids. Class IV acts in the stomach and other mucosa-producing tissues as a first step in the metabolism of gastric alcohols. As an interesting side note, ADH class IV has an increased expression in esophageal cancer[4], possibly due to the carcinogenicity of acetaldehyde, the first metabolite of ethanol. Class V ADH has been found in human foetal liver.

1.2.2 VAT1 - Vesicle Amine Transport Protein 1

VAT1 shows expression during the development of the nervous system but also in hypothalamus and cerebellum. Members show ATPase activity and bind ATP and calcium. The family is thought to be involved in vesicle transport of neurotransmitters.

1.2.3 FAS - Fatty Acid Synthase

The main function of the proteins is to synthesise fatty acids by elongation of acetyl-CoA with malonyl-CoA. The reaction is NADPH dependent. In mammals, FAS is a homodimer in which each subunit consists of five domains. In bacteria and plants, on the other hand, the reaction is carried out by several monofunctional enzymes in a system[5].

1.2.4 MECR - Mitochondrial trans-2-enoyl-CoA Reductase

MECR seems to be a regulating transcription factor for mitochondrial respiratory proteins. Members are mainly expressed in muscle tissue in human, but are present in plants, insects, worms, fish and mammals. Members have been shown to act in fatty acid synthesis in yeast mitochondria. They catalyse the reduction of trans-2-enoyl-CoA to acyl-CoA and depend on NADPH.


1.2.5 PDH - Pyruvate Dehydrogenase

Members transform pyruvate into acetyl-CoA in an NAD-dependent way. They exist in mitochondria, providing a link between glycolysis and the tricarboxylic acid (TCA) cycle. The general structure is a heterotetramer of two alpha and two beta subunits.

1.2.6 PTGR - Prostaglandin Reductase

This family contains both human prostaglandin reductase (PGR) 1 and 2, but most of the family consists of prokaryotic members. Members are commonly expressed in liver, kidney, intestine, spleen and stomach, but also in bronchial epithelial cells and heart.

1.2.7 vertQOR - Vertebrate Quinone Oxidoreductase

This family consists of both zinc-containing members and members lacking zinc. Generally, the zinc-containing members have been studied to a greater extent, making the available data for the zinc-lacking members less detailed. The enzymatic function is described as quinone reduction with NADPH as cofactor, but a catalytic mechanism has yet to be proposed[6].

1.2.8 ZADH2 - Zinc-binding Alcohol Dehydrogenase domain containing protein 2

ZADH2 is a very small family with only four members in SwissProt, of which only two exist in human or mouse. This family has not yet been studied to any great extent, and very little information can currently be found.


2 Methods

The tool was implemented in Python using the algorithm described below. The MDR families were read from a local version of the mdr-enzymes.org database. Each protein was then annotated with data from UniProt to get synonyms for the protein names in different databases and some general information such as Gene Ontology[7] terms and InterPro[8], Pfam[9] and PROSITE[10] sites found in the protein sequence. The UniProt database also gives primary and secondary accession numbers for each protein, which were used to remove any duplicates in the data. A duplicate entry was defined as one having a primary accession number that is a secondary accession number for another protein. Manual sequence comparison of the matched proteins confirmed the validity of this method. Duplicate removal was followed by synonym gathering. Many databases use the same name for a specific protein, but some use other names. Therefore, mapping between different databases had to be done. Several plugins were developed to fetch synonyms for proteins.
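The duplicate rule above can be illustrated with a minimal Python sketch. It is not the code from appendix B: the record layout (a dict with `primary` and `secondary` keys), the function name `remove_duplicates` and the accession strings are all assumptions made for the example.

```python
from typing import Dict, List, Set


def remove_duplicates(entries: List[dict]) -> List[dict]:
    """Drop entries whose primary accession is a secondary accession of another entry."""
    # Map each secondary accession to the primary accessions of the entries listing it.
    secondary_owner: Dict[str, Set[str]] = {}
    for entry in entries:
        for acc in entry["secondary"]:
            secondary_owner.setdefault(acc, set()).add(entry["primary"])

    kept = []
    for entry in entries:
        owners = secondary_owner.get(entry["primary"], set()) - {entry["primary"]}
        if not owners:  # no *other* entry lists this accession as secondary
            kept.append(entry)
    return kept


# Hypothetical records; the accession strings are placeholders, not real UniProt data.
records = [
    {"primary": "ACC_A", "secondary": ["ACC_B", "ACC_X"]},
    {"primary": "ACC_B", "secondary": []},  # superseded by ACC_A
    {"primary": "ACC_C", "secondary": []},
]
unique = remove_duplicates(records)
print([r["primary"] for r in unique])
```

In this toy input, `ACC_B` is dropped because `ACC_A` lists it as a secondary accession, while the other two records are kept.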

When synonyms had been found, information from chosen databases was gathered for each protein. Common values from all proteins in each protein family were then collected. A value found in at least half of all proteins in a family was declared common for the entire family. A flowchart of the method can be seen in figure 2.1.
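As a rough illustration of the "at least half" rule, the sketch below counts each value once per protein and keeps those reaching the limit. The function name `common_values`, the `limit` parameter and the toy family are assumptions for the example; the real tool reads its limits from configuration files.

```python
from collections import Counter
from typing import Dict, Iterable, List


def common_values(family: Dict[str, Iterable[str]], limit: float = 0.5) -> List[str]:
    """Return the values present in at least `limit` of the proteins in a family."""
    counts: Counter = Counter()
    for values in family.values():
        counts.update(set(values))  # count each value at most once per protein
    needed = limit * len(family)
    return sorted(value for value, n in counts.items() if n >= needed)


# Hypothetical family with four proteins and their annotated values.
toy_family = {
    "PROT_1": ["liver", "zinc"],
    "PROT_2": ["liver"],
    "PROT_3": ["kidney", "zinc"],
    "PROT_4": ["liver", "heart"],
}
shared = common_values(toy_family)
print(shared)
```

Here "liver" (three of four proteins) and "zinc" (two of four) reach the 50% limit, while "kidney" and "heart" do not.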

2.1 Implementation

The final tool is plugin based, with four types of plugins: source plugins, in which protein families are read from a source database; synonym plugins, which find synonyms for proteins; database plugins, which query a data source for information on proteins and find common values; and output plugins, which write the results to file. All plugins get automatic access to configuration files to ease the alteration of limits and paths to data files and results. There are both global settings and local settings for each plugin that can override the global configuration. For further information see the source code in appendix B and a short manual in appendix C.
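One way to picture the four plugin types and the global/local configuration override is the sketch below. All class, method and setting names here are hypothetical and chosen for illustration only; the actual interfaces are those in appendix B.

```python
from typing import Dict, List, Optional


class Plugin:
    """Base class: global settings are merged with plugin-local overrides."""
    kind = "generic"

    def __init__(self, global_cfg: Dict[str, object],
                 local_cfg: Optional[Dict[str, object]] = None):
        self.cfg = {**global_cfg, **(local_cfg or {})}  # local values win


class SourcePlugin(Plugin):
    kind = "source"

    def families(self) -> Dict[str, List[str]]:
        raise NotImplementedError


class SynonymPlugin(Plugin):
    kind = "synonym"

    def synonyms(self, protein: str) -> List[str]:
        raise NotImplementedError


class DatabasePlugin(Plugin):
    kind = "database"

    def annotate(self, protein: str) -> List[str]:
        raise NotImplementedError


class OutputPlugin(Plugin):
    kind = "output"

    def write(self, results: Dict[str, List[str]]) -> None:
        raise NotImplementedError


class StaticSource(SourcePlugin):
    """Toy source plugin returning a fixed, made-up family."""

    def families(self) -> Dict[str, List[str]]:
        return {"ExampleFamily": ["PROT_1", "PROT_2"]}


GLOBAL_CFG = {"inclusion_limit": 0.5, "result_dir": "results"}
source = StaticSource(GLOBAL_CFG, {"inclusion_limit": 0.75})
print(source.kind, source.cfg["inclusion_limit"], source.cfg["result_dir"])
```

Merging the two dictionaries at construction time keeps the override rule in one place, so a new plugin only has to subclass the right base class.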

1. Gather proteins from source databases
2. Extract information from UniProt
3. Remove duplicates
4. Find synonyms
5. Gather information from selected databases
6. Find common values
7. Save results to file

Figure 2.1: Summarised workflow of the tool.

2.2 Databases and services

The many different databases and services used during this project are described briefly in the following sections.

2.2.1 UniProt

The widely used Universal Protein Resource (UniProt) is a protein sequence and annotation data resource that consists of several databases. The UniProt Knowledgebase provides both manually curated information within the SwissProt subsection, and automatically added annotation within the much larger TrEMBL subsection. No distinction has been made between the two sources in this project.

2.2.2 PICR

The Protein Identifier Cross-Referencing service[11] (PICR) is a web application that maps protein names between different databases. It bases the results on the UniProt Archive (UniParc), which is a data warehousing service updated daily by UniProt to account for new releases of source databases. Many of the results should also be found with the UniProt polling, but the UniParc database stores data from many sources other than TrEMBL and SwissProt.

2.2.3 BioGPS

BioGPS[12] is a gene annotation portal that lets the user customise the layout of the page. It also has a gene expression atlas based on human and mouse protein-encoding transcriptomes[13]. The data set is based on both a combined Human U133/GNF1H array and the mouse GNF1M and MOE430 arrays.

The fact that the tissues studied in different experiments are different increases the number of tissues studied but makes it harder to find common expression profiles. All the calculations are species based to include as many tissues as possible. Three steps had to be taken to gather data from this data set. Since there were several expression levels for each tissue and the difference between a positive and negative value was quite large, the data was normalised by dividing each data point by the harmonic mean value

$$m = \frac{n}{\sum_{i=1}^{n} \frac{1}{x_i}} \tag{2.1}$$

for that gene and then calculating mean values for each tissue. The BioGPS dataset is provided in two parts: annotations and data. The data part is organised based on probe id, so in order to be applicable, all proteins needed to be annotated with probe id synonyms using the annotations part. Several different synonyms for the protein could be used to extract the probe id, but SwissProt accession numbers and Ensembl gene ids were used primarily. In order to emphasise large overexpressions, the tissues were sorted according to their t-statistic

$$t = \frac{x - \bar{X}}{s} \tag{2.2}$$

where $x$ is an expression value for a specific tissue, $\bar{X}$ is the mean value for all tissues and $s$ is the sample standard deviation for the data set. The highest-scoring tissues are shown in tables A.4 and A.5.
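The normalisation and ranking can be sketched in Python as follows. Two assumptions are made for the example and are not stated in the text: the harmonic mean is taken over all expression values of one gene, and the standard deviation in the t-statistic is the sample standard deviation of the per-tissue means. The function names and the expression values are made up, and the values must be positive for the harmonic mean to be defined.

```python
import math
from typing import Dict, List


def harmonic_mean(values: List[float]) -> float:
    """Equation 2.1: n divided by the sum of reciprocals (values must be positive)."""
    return len(values) / sum(1.0 / x for x in values)


def tissue_t_scores(samples: Dict[str, List[float]]) -> Dict[str, float]:
    """Normalise one gene's data, average per tissue and score each tissue (eq. 2.2)."""
    all_values = [x for xs in samples.values() for x in xs]
    m = harmonic_mean(all_values)
    # Mean of the normalised values for each tissue.
    tissue_means = {t: sum(xs) / len(xs) / m for t, xs in samples.items()}
    means = list(tissue_means.values())
    grand_mean = sum(means) / len(means)
    # Sample standard deviation (n - 1 in the denominator); needs at least two tissues.
    s = math.sqrt(sum((x - grand_mean) ** 2 for x in means) / (len(means) - 1))
    return {t: (x - grand_mean) / s for t, x in tissue_means.items()}


# Made-up expression levels for one gene, two measurements per tissue.
expression = {"liver": [8.0, 8.0], "heart": [2.0, 2.0], "brain": [2.0, 2.0]}
scores = tissue_t_scores(expression)
ranked = sorted(scores, key=scores.get, reverse=True)
print(ranked[0], round(scores[ranked[0]], 3))
```

Sorting by the score puts the tissue with the largest overexpression relative to the other tissues first, which is the ordering used for the result tables.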

2.2.4 Human Protein Atlas

The Human Protein Atlas[14,15] (HPA) provides immunohistochemically stained images of both diseased and normal human tissues, effectively providing specific maps of protein presence in the human body in various conditions. It has a coverage of about 25% of the human genome and since all the conclusions are drawn by trained pathologists, the database is quite reliable.

Sadly, very few of the MDR proteins are available in the HPA database but it is an active project and hopefully data can be gathered from HPA in future releases of the database.


2.2.5 ArrayExpress

ArrayExpress[16] is a database of functional genomics experiments, including the gene expression atlas with a subset of curated data. The ArrayExpress server was queried for every protein in each family, and a positive tissue was defined as a tissue having more up-regulated than down-regulated experiments. Tissues that were positive in at least half of the proteins were deemed common for the family.
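The two-stage rule (positive tissue per protein, then common tissue per family) can be sketched as below. The data layout, function names and experiment counts are assumptions for illustration; no ArrayExpress query is performed.

```python
from typing import Dict, List, Set, Tuple

# protein -> tissue -> (number of up-regulated, number of down-regulated experiments)
Counts = Dict[str, Dict[str, Tuple[int, int]]]


def positive_tissues(per_tissue: Dict[str, Tuple[int, int]]) -> Set[str]:
    """A tissue is positive when up-regulated experiments outnumber down-regulated ones."""
    return {tissue for tissue, (up, down) in per_tissue.items() if up > down}


def common_tissues(family: Counts, limit: float = 0.5) -> List[str]:
    """Tissues that are positive in at least `limit` of the proteins in the family."""
    tally: Dict[str, int] = {}
    for per_tissue in family.values():
        for tissue in positive_tissues(per_tissue):
            tally[tissue] = tally.get(tissue, 0) + 1
    needed = limit * len(family)
    return sorted(t for t, n in tally.items() if n >= needed)


# Made-up experiment counts for a three-protein family.
toy_counts = {
    "PROT_1": {"liver": (5, 1), "heart": (1, 3)},
    "PROT_2": {"liver": (2, 0), "heart": (4, 1)},
    "PROT_3": {"liver": (0, 2), "kidney": (1, 1)},
}
family_tissues = common_tissues(toy_counts)
print(family_tissues)
```

A tie between up- and down-regulated experiments (kidney for `PROT_3`) does not count as positive under this reading of the definition.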

2.2.6 KEGG

KEGG, or Kyoto Encyclopedia of Genes and Genomes[17], is a database resource containing data from 16 different sources. In this thesis the focus was on the KEGG Pathway database, containing maps for metabolism and other cellular processes. Data was extracted from UniProt and then processed as in the method description.

2.2.7 Reactome

Reactome[18] is a manually curated database of human cell reactions and pathways with at least 5000 distinct proteins. Reactome data was collected from UniProt and processed according to the method description.

2.2.8 Gene Ontology

The aim of the Gene Ontology[7] (GO) is to standardise the representation of attributes of genes and gene products. It is used as a sort of thesaurus for biological terms where all terms have relationships with other terms (see example 2.1). The relationships can be shown in a tree format for easy overview.

Example 2.1: Gene Ontology terms

GO:0032496 (response to LPS) is a GO:0009617 (response to bacterium)

The GO terms were extracted from UniProt for each protein, and the number of positive terms was then increased by using the relations between the terms. The number of occurrences of each term was counted, after which the number of occurrences of a more specific term was added to each more general term, as shown in figure 2.2. The tree structure was generated using AmiGO[19].
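A small sketch of this cumulative counting is given below. It uses only the relation from example 2.1 as its toy ontology, and it counts per protein (each protein contributes once to a term and to each of its ancestors), which is one possible reading of the description chosen here to avoid double counting; the function names and protein identifiers are hypothetical.

```python
from collections import Counter
from typing import Dict, Iterable, Set

# child term -> direct parents ("is a" relations); toy excerpt, not the full ontology
PARENTS: Dict[str, Set[str]] = {
    "GO:0032496": {"GO:0009617"},  # response to LPS is a response to bacterium
}


def with_ancestors(terms: Iterable[str], parents: Dict[str, Set[str]]) -> Set[str]:
    """Return the given terms together with all of their ancestors."""
    seen: Set[str] = set()
    stack = list(terms)
    while stack:
        term = stack.pop()
        if term not in seen:
            seen.add(term)
            stack.extend(parents.get(term, ()))
    return seen


def term_counts(annotations: Dict[str, Iterable[str]],
                parents: Dict[str, Set[str]]) -> Counter:
    """Count, for every term, the proteins annotated with it or with a descendant."""
    counts: Counter = Counter()
    for terms in annotations.values():
        counts.update(with_ancestors(terms, parents))
    return counts


# Hypothetical annotations for three proteins.
toy_annotations = {
    "PROT_1": ["GO:0032496"],
    "PROT_2": ["GO:0009617"],
    "PROT_3": ["GO:0032496"],
}
go_counts = term_counts(toy_annotations, PARENTS)
print(go_counts["GO:0032496"], go_counts["GO:0009617"])
```

The general term ends up with a count of three even though only one protein is annotated with it directly, which is the effect that lets general terms pass the inclusion limit.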

2.2.9 InterPro

InterPro[8] combines several member data sources to derive protein signatures. Each member database annotates protein sequences with a different focus, giving a result with a good overview of the sequence. Integration of the source databases is done manually and the resulting entry is annotated with crosslinks to other databases.

GO:0003674 molecular_function
GO:0005488 binding
GO:0043169 cation binding
GO:0043167 ion binding
GO:0046914 transition metal ion binding
GO:0046872 metal ion binding
GO:0032440 2-alkenal reductase activity
GO:0003824 catalytic activity
GO:0016627 oxidoreductase activity, acting on the CH-CH group of donors
GO:0016628 oxidoreductase activity, acting on the CH-CH group of donors, NAD or NADP as acceptor
GO:0016614 oxidoreductase activity, acting on CH-OH group of donors
GO:0047522 15-oxoprostaglandin 13-oxidase activity
GO:0004022 alcohol dehydrogenase (NAD) activity
GO:0016491 oxidoreductase activity
GO:0008270 zinc ion binding
GO:0016616 oxidoreductase activity, acting on the CH-OH group of donors, NAD or NADP as acceptor

Figure 2.2: Part of the PTGR Gene Ontology tree. Terms in thin boxes were found among the proteins, but not with sufficient frequency for inclusion in the results. Terms in boldface boxes had enough occurrences to be included in the results, either in their own right or after cumulative addition of descendant terms. Terms in dashed boxes were not found in any of the proteins, but are included in this figure to show ancestry.

2.2.10 Pfam

Pfam[9] is a collection of protein domain models represented by HMMs. The database contains two kinds of entries, Pfam-A and Pfam-B. The Pfam-A entries are based on manually constructed multiple sequence alignments (MSAs). Pfam-B entries are generated automatically and have lower quality.

Since the MDR family HMMs were bootstrapped from the Pfam HMMs ADH_N and ADH_zinc_N, one could expect to find at least one of these signatures in the majority of the results.

2.2.11 PROSITE

PROSITE annotates protein domains using profiles, which are similar to the profile HMMs used by Pfam-A, but it also annotates smaller sequence features, for example functional sites or posttranslationally modified sites, using a simple pattern matching technique similar to the regular expressions used in many programming languages such as Perl.
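To make the similarity to regular expressions concrete, the sketch below translates a simplified subset of the PROSITE pattern syntax (elements separated by `-`, `x` for any residue, `[...]` for allowed residues, `{...}` for forbidden residues, `(n)` or `(n,m)` for repetition, `<` and `>` as terminal anchors) into a Python regular expression. The function name, the pattern and the sequence are made up for illustration and do not correspond to a real PROSITE entry.

```python
import re


def prosite_to_regex(pattern: str) -> str:
    """Translate a PROSITE-style pattern (simplified subset) into a Python regex."""
    parts = []
    for element in pattern.rstrip(".").split("-"):
        prefix = suffix = ""
        if element.startswith("<"):    # pattern anchored at the N-terminus
            prefix, element = "^", element[1:]
        if element.endswith(">"):      # pattern anchored at the C-terminus
            suffix, element = "$", element[:-1]
        repeat = ""
        if element.endswith(")"):      # repetition such as (2) or (2,4)
            element, _, rep = element.partition("(")
            repeat = "{" + rep[:-1] + "}"
        if element == "x":             # any residue
            core = "."
        elif element.startswith("{"):  # any residue except those listed
            core = "[^" + element[1:-1] + "]"
        else:                          # a single residue or [allowed residues]
            core = element
        parts.append(prefix + core + repeat + suffix)
    return "".join(parts)


# Made-up pattern and sequence, for illustration only.
regex = prosite_to_regex("C-x(2,4)-C-{P}-[LIVM]")
match = re.search(regex, "AACGGCAVK")
print(regex, match.group(), match.start())
```

Because each PROSITE element maps onto one regex fragment, scanning a sequence reduces to a single `re.search` call.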


2.3 Dendrogram generation

To generate the cladogram seen in Fig. 3.2, three programs were used: ClustalW, MrBayes and Dendroscope. All protein sequences were downloaded in FASTA format from UniProt and aligned using ClustalW; trees were then generated with MrBayes and displayed using Dendroscope.


3 Results and discussion

The results sorted by database can be seen in appendix A. In the following chapter the results for each family will be presented and analysed.

3.1 ADH

Both ArrayExpress and BioGPS show that ADH is expressed in liver, as expected according to section 1.2.1, but both sources also show some expression in heart and adipocytes. Only BioGPS shows expression in mucosa such as eye and epidermis, and then only for mouse proteins. The reason for this is probably the inclusion limit of 50%. As stated earlier, only ADH class IV is commonly expressed outside of the liver, so it is not unexpected that a majority of the proteins are found only in liver tissue.

The GO terms show that most members have oxidoreductase activity (GO:0016491) according to Fig. 3.1 and also that they bind zinc. About 60% are part of an ethanol metabolic process.

InterPro shows that the members have a NAD(P) binding domain and confirms the zinc binding shown with GO terms.

Only a fraction of the total number of members were found in Reactome, but all six that were found took part in biological oxidations, which confirms the GO results.

KEGG annotated only 16 of the members. However, since these are distributed more or less evenly throughout the evolutionary tree of the family (Fig. 3.2), there is a high probability that the results presented here are indeed representative of the family as a whole, especially since the KEGG annotations are in complete consensus. Most of the pathways reported are included because of the alcohol → aldehyde conversion function of most members, but they also take part in cytochrome P450 pathways such as chloral hydrate → trichloroethanol conversion.

3.2 VAT1

VAT1 seems to be expressed mainly in tissues within the central nervous system such as adrenal gland, hypothalamus and the amygdala. Both BioGPS and ArrayExpress also show raised levels in adipose tissue. No satisfying reason for this has been found in the literature, but it is a very interesting avenue to explore in further laboratory experiments. The proteins have the ADH domain and bind zinc according to InterPro and Pfam. They also have a NAD binding domain and match the quinone oxidoreductase / ζ-crystallin Pfam pattern.

GO:0003674 molecular_function
GO:0003824 catalytic activity
GO:0016614 oxidoreductase activity, acting on CH-OH group of donors
GO:0016616 oxidoreductase activity, acting on the CH-OH group of donors, NAD or NADP as acceptor
GO:0004022 alcohol dehydrogenase (NAD) activity
GO:0004024 alcohol dehydrogenase activity, zinc-dependent
GO:0004032 alditol:NADP+ 1-oxidoreductase activity
GO:0004745 retinol dehydrogenase activity
GO:0003960 NADPH:quinone reductase activity
GO:0004031 aldehyde oxidase activity
GO:0018467 formaldehyde dehydrogenase activity
GO:0019115 benzaldehyde dehydrogenase activity
GO:0051903 S-(hydroxymethyl)glutathione dehydrogenase activity
GO:0008106 alcohol dehydrogenase (NADP+) activity
GO:0004033 aldo-keto reductase (NADP) activity
GO:0016620 oxidoreductase activity, acting on the aldehyde or oxo group of donors, NAD or NADP as acceptor
GO:0016623 oxidoreductase activity, acting on the aldehyde or oxo group of donors, oxygen as acceptor
GO:0016903 oxidoreductase activity, acting on the aldehyde or oxo group of donors
GO:0016651 oxidoreductase activity, acting on NADH or NADPH
GO:0016655 oxidoreductase activity, acting on NADH or NADPH, quinone or similar compound as acceptor
GO:0016491 oxidoreductase activity

Figure 3.1: Part of the ADH GO tree, using the same graphical representation as figure 2.2.

3.3 vertQOR

vertQOR is highly expressed in liver and kidney. This is expected, since the main function is to deactivate toxic quinones and it seems reasonable that the proteins are found in the renal system. Members have the NAD(P) binding domain and match the ζ-crystallin pattern, a result that is expected since the ζ-crystallin pattern defines the NADP-dependent quinone oxidoreductases. The four members found were all zinc-binding.

3.4 PTGR

Both expression databases show that PTGRs are expressed in small intestine and stomach. Members bind zinc and NAD(P). Many of the proteins exist in cytoplasm according to Gene Ontology terms and have oxidoreductase activity, mainly 15-oxoprostaglandin 13-oxidase (PGR) activity but also 2-alkenal reductase and alcohol dehydrogenase activity, as seen in Fig. 2.2.

Leaf labels: B4E1R1_HUMAN, P00325_HUMAN, B4E2R9_HUMAN, B4DVC3_HUMAN, P07327_HUMAN, P00329_MOUSE, Q3UKA4_MOUSE, A8MVN9_HUMAN, B4DWS1_HUMAN, P40394_HUMAN, Q3UMM7_MOUSE, Q9D748_MOUSE, Q548K2_MOUSE, Q64437_MOUSE, Q496S1_MOUSE, Q9QYY9_MOUSE, Q3V0P5_MOUSE, P08319_HUMAN, Q6FI45_HUMAN, Q6IRT1_HUMAN, P11766_HUMAN, Q5U043_HUMAN, Q2VIM7_HUMAN, P28474_MOUSE, Q6P5I3_MOUSE, P28332_HUMAN, B4DPD8_HUMAN, Q8IUN7_HUMAN, A1L3C0_MOUSE, Q9D932_MOUSE, Q3UQ40_MOUSE, P00326_HUMAN, A8MYN5_HUMAN

Figure 3.2: Cladogram of the ADH family with members annotated in KEGG in bold face. This is the consensus tree from 10000 Monte Carlo reconstructions using the MrBayes program, wherein no other credible topologies could be detected.

3.5 FAS

FAS was found to be expressed at high levels in adipocyte tissue, as expected, but also in muscle (skeletal and cardiac). ArrayExpress indicated high expression in many tissues, but the results were not confirmed with BioGPS.

In Fig. 3.3, one can see that many of the found GO terms are linked with fatty acid synthase activity, which would have been a very expected result. However, since no member was annotated with that GO term, it was not included in the result table. Potentially, this sort of non-detection could be remedied by stepwise generalisation up through the ontology tree until an ancestral term has been found that can describe sufficiently many of the member proteins. However, as this traversal could prove computationally costly, and could potentially also yield uselessly over-general terms, it has not been implemented in this version of the tool.
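Although this generalisation is not part of the tool, the idea can be sketched as follows. This is only one possible reading of the paragraph above: every protein's term set is widened by one level of parents at a time until some term covers the required share of the family. The function name, the `limit` parameter and the term identifiers are hypothetical placeholders, not real GO terms.

```python
from typing import Dict, Iterable, Set


def generalise(annotations: Dict[str, Iterable[str]],
               parents: Dict[str, Set[str]],
               limit: float = 0.5) -> Set[str]:
    """Climb the ontology level by level until some term covers enough proteins."""
    current = {protein: set(terms) for protein, terms in annotations.items()}
    needed = limit * len(current)
    while True:
        # Which proteins does each term currently describe?
        coverage: Dict[str, Set[str]] = {}
        for protein, terms in current.items():
            for term in terms:
                coverage.setdefault(term, set()).add(protein)
        hits = {t for t, prots in coverage.items() if len(prots) >= needed}
        if hits:
            return hits
        # One step up: add the direct parents of every term seen so far.
        stepped = {protein: terms | {q for t in terms for q in parents.get(t, ())}
                   for protein, terms in current.items()}
        if stepped == current:  # the roots have been reached without success
            return set()
        current = stepped


# Hypothetical mini-ontology and annotations.
toy_parents = {"T:a": {"T:ab"}, "T:b": {"T:ab"}, "T:c": {"T:root"}, "T:ab": {"T:root"}}
toy_terms = {"PROT_1": ["T:a"], "PROT_2": ["T:b"], "PROT_3": ["T:c"]}
general = generalise(toy_terms, toy_parents)
print(general)
```

Stopping at the first level that yields a hit keeps the result as specific as possible, which addresses the concern about over-general terms, but each step still recomputes the coverage of the whole family.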

All members bind zinc and NAD(P) and most have an acyl carrier and transferase domain.

GO:0004312 fatty acid synthase activity
GO:0016297 acyl-[acyl-carrier-protein] hydrolase activity
GO:0004315 3-oxoacyl-[acyl-carrier-protein] synthase activity
GO:0004317 3-hydroxypalmitoyl-[acyl-carrier-protein] dehydrogenase activity
GO:0016295 myristoyl-[acyl-carrier-protein] hydrolase activity
GO:0004314 [acyl-carrier-protein] S-malonyltransferase activity
GO:0004315 palmitoyl-[acyl-carrier-protein] hydrolase activity
GO:0004320 oleoyl-[acyl-carrier-protein] hydrolase activity
GO:0004313 [acyl-carrier-protein] S-acetyltransferase activity

Figure 3.3: Reduced GO tree for FAS to highlight the terms associated with fatty acid synthase, with the same graphical representation as figure 2.2.

3.6 MECR

No common expression profiles for the two data sources could be found, but BioGPS reported a high value for heart tissue. It was quite unexpected that none of the sources confirmed high expression in muscle tissue. According to GO, the proteins exist in mitochondria, have oxidoreductase activity and bind zinc. The latter is confirmed by Pfam and InterPro.

3.7 PDH

The PDH family has members in many different species, but only two proteins were found, one copy in each species studied. Gene Ontology terms show that the members are located in the mitochondrial membrane, cilium and flagellum. Members are expressed in liver, kidney and testis. The proteins bind zinc and NAD(P).

3.8 ZADH2

Both Pfam and GO show that the members bind zinc. The data gathered from BioGPS and ArrayExpress give no clear indication of where the members are expressed. InterPro and PROSITE both show that all members match the ζ-crystallin pattern.


4 Conclusions

The main goal in this project was to create a tool to annotate protein families by finding common properties in the member proteins. The focus was on the member families of the medium chain dehydrogenase/reductase superfamily with human and/or mouse protein members. The tool was supposed to be extendable and easy to use for other protein families.

During this project, a Python-based tool that annotates protein families with data from several data sources has been created. The tool is plugin-based, which makes it easy to extend and adapt to each user’s preferences. Regardless of this, the utility and necessity of the system, as well as the quality of its results, are expected to increase over time even if no additional extensions are produced, since the system can make further and more detailed inferences as more data become available.
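The plugin idea can be illustrated with a minimal sketch. It is not the actual PFAS code: the class names, the function names and the 60 % threshold are assumptions made for illustration only. Each plugin wraps one data source and reports the terms it knows for a protein; a shared routine then keeps the terms that are common to a sufficiently large fraction of the family.

```python
from collections import Counter


class AnnotationPlugin(object):
    """Hypothetical base class: one subclass per data source."""

    name = "base"

    def annotations(self, protein_id):
        """Return the set of terms this source reports for a protein."""
        raise NotImplementedError


class DictPlugin(AnnotationPlugin):
    """Toy plugin backed by an in-memory dict instead of a real database."""

    def __init__(self, name, data):
        self.name = name
        self.data = data

    def annotations(self, protein_id):
        return set(self.data.get(protein_id, ()))


def common_terms(plugin, members, min_fraction=0.6):
    """Count each term once per protein and keep the terms reported for
    at least min_fraction of the members (the threshold is an assumption)."""
    counts = Counter()
    for protein_id in members:
        counts.update(plugin.annotations(protein_id))
    needed = min_fraction * len(members)
    return {term: n for term, n in counts.items() if n >= needed}


def annotate_family(plugins, members, min_fraction=0.6):
    """Run every plugin on the family and collect the common terms per source."""
    return {p.name: common_terms(p, members, min_fraction) for p in plugins}
```

Adding a new data source then only requires a new subclass; the family-level logic stays unchanged.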

ADH members show expression in liver, heart and adipocytes. BioGPS also shows expression in mucosa, such as eye and epidermis, but only for mouse proteins. Members bind zinc and most take part in ethanol metabolic processes. KEGG annotated 16 proteins distributed evenly throughout the evolutionary tree of the family. Most pathways were linked to the alcohol → aldehyde conversion seen in members, but some members also take part in cytochrome P450 pathways such as the chloral hydrate → trichloroethanol conversion.

VAT1 seems to be expressed mainly in tissues within the central nervous system, such as adrenal gland, hypothalamus and amygdala, but also in adipose tissue. No satisfying explanation for this has been found in the literature, but it is a very interesting avenue to explore in further laboratory experiments. Members match the quinone oxidoreductase / ζ-crystallin Prosite pattern.

vertQOR is highly expressed in liver and kidney. Members also match the quinone oxidoreductase / ζ-crystallin Prosite pattern.

Both expression databases show that PTGRs are expressed in small intestine and stomach. Many of the proteins exist in cytoplasm and have oxidoreductase activity, mainly 15-oxoprostaglandin 13-oxidase (PGR) activity but also 2-alkenal reductase and alcohol dehydrogenase activity.

FAS was found to be expressed at high levels in adipocytes and in muscle tissue (skeletal and cardiac). Many of the GO terms found are linked to fatty acid synthase activity. All members bind zinc and NAD(P), and most have an acyl carrier domain and a transferase domain.

No common expression profiles for MECR have been found between the two data sources but BioGPS reported a high value for heart tissue. The proteins exist in mitochondria, have oxidoreductase activity and bind zinc.

PDH members have been shown to be located in the mitochondrial membrane, cilium and flagellum. Members are expressed in liver, kidney and testis. Proteins bind zinc and NAD(P).

Data sources have not given any clear indications of where ZADH2 members are expressed but they seem to bind zinc and match the ζ-crystallin pattern.

The results agree quite well with information gathered from the literature and show that all the human and mouse MDR members have a Pfam ADH domain (ADH_N and/or ADH_zinc_N). This is encouraging since the hidden Markov models used to generate the MDR families were bootstrapped from these signatures. The proteins take part in oxidation-reduction processes, often with NAD or NADP as cofactor, and many of the proteins contain zinc and are expressed in liver tissue. Some also have the quinone oxidoreductase / ζ-crystallin domain.


5 Future Work

During this project I have identified several interesting avenues for further exploration that may lead to increased sensitivity and reliability for the system. Firstly, running the tool on all proteins in each family, instead of only on families with proteins expressed in human and mouse tissue, would increase the statistical value of the results. A problem would of course be that not all proteins have been annotated and that some of the databases do not support several of the available organisms. Secondly, the results would be more reliable if a distinction had been made between automatically annotated and expert-reviewed data. During development, I chose not to make this distinction because of the small sample size. If the full MDR database had been used, I believe that this would be crucial for obtaining reliable data. A possible improvement of the Gene Ontology plugin would be to include new terms by stepwise generalisation up through the ontology tree until an ancestral term is found that describes sufficiently many of the member proteins, as discussed in section 3.5, possibly with some heuristic to avoid generating uselessly general results.
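The proposed generalisation could be sketched as follows. This is only an illustration under stated assumptions, not part of the existing tool: the ontology is given as a hypothetical dict mapping each term to its parent terms, and instead of stepping upwards one level at a time, the sketch propagates every annotation to all ancestors and then keeps the most specific terms that cover enough proteins. The `min_depth` parameter is one possible heuristic for discarding uselessly general terms near the root.

```python
def ancestors(term, parents):
    """Return the term itself plus all its ancestors in the ontology."""
    seen = set()
    stack = [term]
    while stack:
        current = stack.pop()
        if current not in seen:
            seen.add(current)
            stack.extend(parents.get(current, ()))
    return seen


def depth(term, parents):
    """Longest path from the term up to a root; roots have depth 0."""
    term_parents = parents.get(term, ())
    if not term_parents:
        return 0
    return 1 + max(depth(p, parents) for p in term_parents)


def generalised_terms(annotations, parents, min_fraction=0.6, min_depth=1):
    """annotations maps each protein to the set of terms annotated directly.

    Returns the most specific terms that describe at least min_fraction
    of the proteins and lie at least min_depth levels below a root.
    """
    covered = {}
    for protein, terms in annotations.items():
        expanded = set()
        for term in terms:
            expanded |= ancestors(term, parents)
        for term in expanded:
            covered.setdefault(term, set()).add(protein)

    needed = min_fraction * len(annotations)
    hits = set()
    for term, proteins in covered.items():
        if len(proteins) >= needed and depth(term, parents) >= min_depth:
            hits.add(term)

    # Drop every hit that is a proper ancestor of another hit.
    redundant = set()
    for term in hits:
        redundant |= ancestors(term, parents) - {term}
    return hits - redundant
```

With three proteins annotated with three different sibling terms, none of the siblings reaches the threshold, but their shared parent does, so the parent is reported instead of the root.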



A Tables

Family     Number of proteins
ADH        33
FAS        5
PDH        2
MECR       4
VAT1       14
PTGR       8
vertQOR    4
ZADH2      5

Table A.1: Number of proteins found in each family

Family     Term                        Number of Proteins   Percentage
ADH        PF08240 ADH_N               29                   88
           PF00107 ADH_zinc_N          33                   100
FAS        PF00975 Thioesterase        3                    60
           PF00698 Acyl_transf_1       4                    80
           PF08659 KR                  4                    80
           PF00550 PP-binding          4                    80
           PF02801 Ketoacyl-synt_C     4                    80
           PF00109 ketoacyl-synt       4                    80
           PF08242 Methyltransf_12     5                    100
           PF00107 ADH_zinc_N          5                    100
PDH        PF08240 ADH_N               2                    100
           PF00107 ADH_zinc_N          2                    100
MECR       PF08240 ADH_N               4                    100
           PF00107 ADH_zinc_N          4                    100
VAT1       PF08240 ADH_N               12                   86
           PF00107 ADH_zinc_N          14                   100
PTGR       PF00107 ADH_zinc_N          8                    100
vertQOR    PF08240 ADH_N               4                    100
           PF00107 ADH_zinc_N          4                    100
ZADH2      PF08240 ADH_N               4                    80
           PF00107 ADH_zinc_N          5                    100

Table A.2: PFAM annotations

Family     Term                                                 Number of Proteins   Percentage
ADH        GO:0016491 oxidoreductase activity                   31                   94
           GO:0008270 zinc ion binding                          25                   76
           GO:0006067 ethanol metabolic process                 20                   61
FAS        GO:0031177 phosphopantetheine binding                4                    80
           GO:0055114 oxidation-reduction process               4                    80
           GO:0000036 acyl carrier activity                     4                    80
           GO:0042587 glycogen granule                          3                    60
           GO:0048037 cofactor binding                          4                    80
           GO:0009058 biosynthetic process                      3                    60
           GO:0016740 transferase activity                      4                    80
           GO:0005739 mitochondrion                             3                    60
PDH        GO:0005929 cilium                                    2                    100
           GO:0031966 mitochondrial membrane                    2                    100
           GO:0019861 flagellum                                 2                    100
           GO:0005625 soluble fraction                          2                    100
           GO:0003939 L-iditol 2-dehydrogenase activity         2                    100
MECR       GO:0005739 mitochondrion                             3                    75
VAT1
PTGR       GO:0032440 2-alkenal reductase activity              5                    63
           GO:0047522 15-oxoprostaglandin 13-oxidase activity   7                    88
           GO:0005737 cytoplasm                                 5                    63
vertQOR
ZADH2

Family     Organism   Tissue                             t-score
ADH        human      Liver                              6.51
                      Fetalliver                         3.52
                      Adipocyte                          3.46
                      colon                              2.55
                      small intestine                    2.07
           mouse      liver                              7.03
                      cornea                             4.15
                      adrenal gland                      2.65
                      bladder                            2.10
                      kidney                             1.65
                      intestine large                    1.32
FAS        human      SkeletalMuscle                     2.88
                      CardiacMyocytes                    2.88
                      Lymphoma burkitts(Raji)            2.27
                      AdrenalCortex                      2.27
                      TrigeminalGanglion                 2.27
                      Fetallung                          1.97
                      SuperiorCervicalGanglion           1.67
                      Bonemarrow                         1.67
                      Tongue                             1.67
                      Liver                              1.67
                      Heart                              1.67
           mouse      adipose brown                      7.37
                      adrenal gland                      4.01
                      mammary gland lact                 2.67
                      mammary gland non-lactating        2.21
                      ovary                              1.74
                      adipose white                      1.58
MECR       human      Heart                              6.30
                      Lymphoma burkitts(Raji)            4.22
                      Lymphoma burkitts(Daudi)           1.76
                      721 B lymphoblasts                 1.50
                      Wholebrain                         1.37
                      OccipitalLobe                      1.11
           mouse      B-cells marginal zone              4.12
                      pancreas                           3.73
                      lens                               3.27
                      thymocyte SP CD4+                  2.48
                      B-cells GL7negative Alum           1.52
                      T-cells CD8+                       1.41
                      adipose brown                      1.39
                      macrophage bone marrow 2hr LPS     1.34
                      retina                             1.30

Family     Organism   Tissue                             t-score
PDH        human      Thyroid                            6.10
                      Prostate                           4.71
                      FetalThyroid                       2.96
                      Liver                              2.58
                      Kidney                             1.64
           mouse      kidney                             6.03
                      liver                              4.88
                      testis                             3.84
                      mast cells IgE                     3.11
                      mast cells IgE+antigen 1hr         1.25
PTGR       human      BronchialEpithelialCells           7.61
                      small intestine                    3.07
                      SmoothMuscle                       2.13
                      CardiacMyocytes                    2.03
           mouse      stomach                            7.25
                      cornea                             4.86
                      bladder                            1.95
VAT1       human      Adrenalgland                       4.60
                      SmoothMuscle                       3.56
                      Adipocyte                          2.47
                      Lung                               2.10
                      BronchialEpithelialCells           1.59
                      Leukemialymphoblastic(MOLT-4)      1.58
                      CardiacMyocytes                    1.38
                      AdrenalCortex                      1.31
                      retina                             1.28
                      Hypothalamus                       1.23
           mouse      macrophage peri LPS thio 1hrs      5.14
                      neuro2a                            1.36
                      mast cells                         1.26
                      osteoclasts                        1.21
vertQOR    human      721 B lymphoblasts                 6.99
                      Fetalliver                         2.64
                      Leukemia promyelocytic-HL-60       2.00
                      Kidney                             1.75
                      pineal night                       1.44
                      Thyroid                            1.34
                      PancreaticIslet                    1.33
           mouse      kidney                             9.52
                      liver                              1.33
ZADH2      mouse      adipose brown                      4.32
                      Baf3                               3.39
                      adrenal gland                      2.73
                      salivary gland                     2.53
                      ovary                              2.33
                      liver                              1.83
                      heart                              1.62
                      stomach                            1.45
                      intestine large                    1.18

Family     Term                                      Number of Proteins   Percentage
ADH        IPR011032 GroES-like                      33                   100
           IPR013149 ADH_C                           33                   100
           IPR002328 ADH_Zn_CS                       29                   88
           IPR013154 ADH_GroES-like                  29                   88
           IPR002085 ADH_SF_Zn                       33                   100
           IPR016040 NAD(P)-bd_dom                   32                   97
FAS        IPR016038 Thiolase-like_subgr             4                    80
           IPR023102 Fatty_acid_synthase_dom_2       3                    60
           IPR009081 Acyl_carrier_prot-like          4                    80
           IPR016035 Acyl_Trfase/lysoPlipase         4                    80
           IPR020843 PKS_ER                          5                    100
           IPR006163 Phsphopanteth-bd                4                    80
           IPR001227 Ac_transferase_dom              4                    80
           IPR020842 PKS/FAS_KR                      4                    80
           IPR001031 Thioesterase                    3                    60
           IPR006162 PPantetheine_attach_site        4                    80
           IPR016036 Malonyl_transacylase_ACP-bd     4                    80
           IPR013149 ADH_C                           5                    100
           IPR013968 PKS_KR                          4                    80
           IPR014030 Ketoacyl_synth_N                4                    80
           IPR013217 Methyltransf_12                 5                    100
           IPR011032 GroES-like                      5                    100
           IPR014043 Acyl_transferase                4                    80
           IPR018201 Ketoacyl_synth_AS               4                    80
           IPR014031 Ketoacyl_synth_C                4                    80
           IPR016039 Thiolase-like                   4                    80
           IPR016040 NAD(P)-bd_dom                   5                    100
           IPR000794 Beta-ketoacyl_synthase          5                    100
PDH        IPR011032 GroES-like                      2                    100
           IPR013149 ADH_C                           2                    100
           IPR002328 ADH_Zn_CS                       2                    100
           IPR013154 ADH_GroES-like                  2                    100
           IPR002085 ADH_SF_Zn                       2                    100
           IPR016040 NAD(P)-bd_dom                   2                    100
MECR       IPR002085 ADH_SF_Zn                       4                    100
           IPR013154 ADH_GroES-like                  4                    100
           IPR011032 GroES-like                      4                    100
           IPR013149 ADH_C                           4                    100
           IPR016040 NAD(P)-bd_dom                   4                    100

Family     Term                                      Number of Proteins   Percentage
VAT1       IPR011032 GroES-like                      13                   93
           IPR013149 ADH_C                           14                   100
           IPR013154 ADH_GroES-like                  12                   86
           IPR002085 ADH_SF_Zn                       14                   100
           IPR016040 NAD(P)-bd_dom                   14                   100
           IPR002364 Quin_OxRdtase/zeta-crystal_CS   14                   100
PTGR       IPR002085 ADH_SF_Zn                       8                    100
           IPR011032 GroES-like                      8                    100
           IPR013149 ADH_C                           8                    100
           IPR014190 B4_12hDH                        5                    63
           IPR016040 NAD(P)-bd_dom                   8                    100
vertQOR    IPR011032 GroES-like                      4                    100
           IPR013149 ADH_C                           4                    100
           IPR013154 ADH_GroES-like                  4                    100
           IPR002085 ADH_SF_Zn                       4                    100
           IPR016040 NAD(P)-bd_dom                   4                    100
           IPR002364 Quin_OxRdtase/zeta-crystal_CS   4                    100
ZADH2      IPR011032 GroES-like                      4                    80
           IPR013149 ADH_C                           5                    100
           IPR013154 ADH_GroES-like                  4                    80
           IPR002085 ADH_SF_Zn                       5                    100
           IPR016040 NAD(P)-bd_dom                   5                    100
           IPR002364 Quin_OxRdtase/zeta-crystal_CS   5                    100

Table A.7: InterPro annotations, part 2

Family     Term                             Matching Proteins
ADH        PS00059: ADH_ZINC                29
FAS        PS00012: PHOSPHOPANTETHEINE      4
           PS00606: B_KETOACYL_SYNTHASE     4
           PS50075: ACP_DOMAIN              4
PDH        PS00059: ADH_ZINC                2
VAT1       PS01162: QOR_ZETA_CRYSTAL        14
vertQOR    PS01162: QOR_ZETA_CRYSTAL        4
ZADH2      PS01162: QOR_ZETA_CRYSTAL        5

Family     Tissue                                            Number of Proteins   Percentage
ADH        Subcutaneous adipose tissue                       18                   55
           Liver                                             30                   91
           Superior cervical ganglion                        19                   58
           Kidney                                            19                   58
           Heart                                             17                   53
FAS        Pancreas                                          3                    60
           Brain                                             5                    100
           Trachea                                           5                    100
           The region between the LAD artery and the apex    3                    60
           Brown fat from interscapular depression           3                    60
           Preoptic area                                     3                    60
           Embryo                                            3                    60
           Colon                                             5                    100
           Ovary                                             3                    60
           Preputial gland                                   3                    60
           Brainstem                                         3                    60
           Hypothalamus                                      5                    100
           Intestine                                         3                    60
           Epithelium                                        3                    60
           Adipose                                           3                    60
           Gonadal white adipose tissue                      3                    60
           Lateral geniculate nucleus (thalamus)             3                    60
           Periaqueductal gray                               3                    60
           Epidermis                                         5                    100
           Dorsal skin without fur                           3                    60
           Embryonic stem cell                               3                    60
           Extraocular muscle                                3                    60
           Liver                                             5                    100
           Brown fat                                         3                    60
           Adipose tissue                                    5                    100
           Mesenteric lymph node                             3                    60
           Lung                                              3                    60
           Epididymal white adipose tissue                   3                    60
           Dorsal root ganglion                              5                    100
           Snout epidermis                                   3                    60
           Mammary gland                                     5                    100
           Nucleus accumbens                                 3                    60
           Corpus                                            3                    60
           Brown adipose tissue                              3                    60
           Bed nucleus of the stria terminalis               3                    60
           Skeletal muscle (M. vastus lateralis)             3                    60
           Collecting duct                                   3                    60
           Adrenal gland                                     3                    60
           Whole brain                                       3                    60
           Forebrain                                         3                    60
           White adipose tissue                              3                    60

Family     Tissue                                            Number of Proteins   Percentage
PDH        Kidney                                            2                    100
           Muscle                                            2                    100
           Liver                                             2                    100
           Small intestine                                   2                    100
           Testis                                            2                    100
MECR       Colon                                             4                    100
           Trachea                                           4                    100
           Uterus                                            4                    100
VAT1       Pituitary gland                                   8                    57
           Lung                                              8                    57
           Hypothalamus                                      13                   93
           Amygdala                                          10                   71
           Heart                                             8                    57
PTGR       Adipose                                           5                    63
           White adipose tissue                              5                    63
           Small intestine                                   5                    63
           Nodose ganglion visceral sensory neurons          5                    63
           Stomach                                           5                    63
           Epithelium                                        5                    63
           Kidney                                            7                    88
           Brainstem                                         5                    63
           Oviduct                                           5                    63
           Cingulate cortex homogenate                       5                    63
           Anterior tibialis                                 5                    63
           Whole organism                                    5                    63
           Intestine                                         5                    63
           Preputial gland                                   5                    63

Family     Tissue                                            Number of Proteins   Percentage
vertQOR    Kidney                                            3                    75
           Uterus                                            3                    75
           Pancreas                                          3                    75
           Liver                                             3                    75
ZADH2      Urethra                                           3                    60
           Paravertebral muscle                              3                    60
           Skin                                              3                    60
           Kidney cortex                                     3                    60
           Primary clear-cell renal cell carcinoma           3                    60
           Lung                                              3                    60
           Brain                                             3                    60
           Endometrium                                       3                    60
           Seminiferous tubule                               3                    60
           Ventricular myocardium                            3                    60
           Hematopoietic and lymphatic system                3                    60
           Cord blood                                        3                    60
           Frontal cortex, superior motor cortex             3                    60
           Umbilical cord                                    5                    100
           Occipital lobe                                    3                    60
           Spleen                                            3                    60
           Cerebrospinal fluid                               3                    60
           Stomach pyloric                                   3                    60
           Cerebellum                                        5                    100
           Prostate                                          3                    60
           Colon                                             5                    100
           Penis                                             3                    60
           Blood                                             3                    60
           Jejunum                                           3                    60
           Endometrium/ovary                                 3                    60
           Primary visual cortex                             3                    60
           Omental adipose                                   3                    60
           Mammary gland                                     3                    60
           Stomach fundus                                    3                    60
           Cerebral cortex                                   5                    100
           Deltoid muscle                                    3                    60
           Kidney medulla                                    3                    60
           Thyroid                                           3                    60
           Cancer, LCM                                       3                    60

Family     Pathway                                           Number of Proteins   Percentage
ADH        Drug metabolism - cytochrome P450                 16                   49
           Fatty acid metabolism                             16                   49
           Retinol metabolism                                16                   49
           Glycolysis / Gluconeogenesis                      16                   49
           Metabolism of xenobiotics by cytochrome P450      16                   49
           Metabolic pathways                                16                   49
           Tyrosine metabolism                               16                   49
FAS        Insulin signaling pathway                         2                    40
           Metabolic pathways                                2                    40
           Fatty acid biosynthesis                           2                    40
MECR       Metabolic pathways                                3                    75
           Fatty acid elongation in mitochondria             3                    75
PDH        Metabolic pathways                                2                    100
           Fructose and mannose metabolism                   2                    100

Table A.12: KEGG annotations. Note that the number of proteins in KEGG for the

B List of Supplementary data

At the web address http://www.ifm.liu.se/bioinfo/supplements/elfving-2011-thesis, the following supplementary data can be accessed for free:

• Full source code of the Protein Family Annotation Software (PFAS) and various scripts written during the project

• Gene Ontology trees for all the families

• Cladograms for all studied families

• Raw data for the MDR and SDR families

C Manual for using the Protein Family Annotation Software (PFAS)

The full source code is included in appendix B free of charge and without restrictions. The following text describes how to use the software.

The tool is written for and tested under Python 2.7. The software itself requires no extra packages, but some of the plugins do. Please refer to table C.1 for details.

Package          Required in     Version
PyMySQL          mdrDB           0.4
suds             PicrSynonyms    0.4
BeautifulSoup    ArrayExpress    3.2.0

Table C.1: Packages needed for different plugins

Some of the plugins require separate data files available at each data provider’s site. The paths to these files must then be entered in each plugin’s configuration file.

Installation is quite straightforward: untar the file in a folder of your choice, then add that path, together with the paths to the lib and plugins folders, to your PYTHONPATH variable. The system should then be up and running.
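Before running the tool, it can be useful to check that the optional packages from table C.1 are available. The following is a minimal sketch and not part of PFAS itself; the function name is hypothetical, and the import names (`pymysql`, `suds`, `BeautifulSoup`) are assumptions based on the module names these packages usually install.

```python
import importlib

# Plugin name -> module it needs. The module names are assumptions based
# on how these packages are usually imported.
PLUGIN_REQUIREMENTS = {
    "mdrDB": "pymysql",
    "PicrSynonyms": "suds",
    "ArrayExpress": "BeautifulSoup",
}


def missing_requirements(requirements):
    """Return {plugin: module} for every module that cannot be imported."""
    missing = {}
    for plugin, module in requirements.items():
        try:
            importlib.import_module(module)
        except ImportError:
            missing[plugin] = module
    return missing


if __name__ == "__main__":
    for plugin, module in sorted(missing_requirements(PLUGIN_REQUIREMENTS).items()):
        print("Plugin %s needs the missing module %s" % (plugin, module))
```

Plugins whose module is missing can simply be left out of a run, since the core software does not depend on them.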



Copyright

The publishers will keep this document online on the Internet (or its possible replacement) for a period of 25 years from the date of publication barring exceptional circumstances.

The online availability of the document implies a permanent permission for anyone to read, to download, to print out single copies for his/her own use and to use it unchanged for any non-commercial research and educational purpose. Subsequent transfers of copyright cannot revoke this permission. All other uses of the document are conditional on the consent of the copyright owner. The publisher has taken technical and administrative measures to assure authenticity, security and accessibility.

According to intellectual property law the author has the right to be mentioned when his/her work is accessed as described above and to be protected against infringement.

For additional information about the Linköping University Electronic Press and its procedures for publication and for assurance of document integrity, please refer to its www home page: http://www.ep.liu.se/

