Department of Physics, Chemistry and Biology
Master’s Thesis
Automated annotation of protein families
Eric Elfving
LiTH-IFM-EX--11/2551--SE
Supervisor: Joel Hedlund
ifm, Linköpings universitet
Examiner: Bengt Persson
ifm, Linköpings universitet
Division, Department: Bioinformatics, Department of Physics and Measurement Technology, Linköpings universitet, SE-581 83 Linköping, Sweden
Date: 2011-06-17
URL for electronic version: http://www.ifm.liu.se/bioinfo/ http://urn.kb.se/resolve?urn=urn:nbn:se:liu:diva-69393
ISRN: LiTH-IFM-EX--11/2551--SE
Title: Automatiserad annotering av proteinfamiljer (Automated annotation of protein families)
Author: Eric Elfving
Abstract
Introduction: The great challenge in bioinformatics is data integration. The amount of available data is always increasing and there are no common unified standards of where, or how, the data should be stored. The aim of this work is to build an automated tool to annotate the different member families within the protein superfamily of medium-chain dehydrogenases/reductases (MDR), by finding common properties among the member proteins. The goal is to increase the understanding of the MDR superfamily as well as the different member families. This will add to the amount of knowledge gained for free when a new, unannotated, protein is matched as a member to a specific MDR member family.
Method: The different types of data available all needed different handling. Textual data was mainly compared as strings while numeric data needed some special handling such as statistical calculations. Ontological data was handled as tree nodes where ancestry between terms had to be considered. This was implemented as a plugin-based system to make the tool easy to extend with additional data sources of different types.
Results: The biggest challenge was data incompleteness, yielding few (or no) results for some families and thus decreasing the statistical significance of the results. Results show that all the human and mouse MDR members have a Pfam ADH domain (ADH_N and/or ADH_zinc_N) and take part in an oxidation-reduction process, often with NAD or NADP as cofactor. Many of the proteins contain zinc and are expressed in liver tissue.
Conclusions: A Python-based tool for automatic annotation has been created to annotate the different MDR member families. The tool is easily extendable to new databases, and many of the results agree with information found in the literature. The utility and necessity of this system, as well as the quality of its produced results, are expected to only increase over time, even if no additional extensions are produced, as the system itself is able to make further and more detailed inferences as more and more data become available.
Summary
Introduction: The great challenge in bioinformatics is data integration. The amount of available data is constantly increasing and there are no common standards for how, or where, data should be stored. The aim of this work is to create an automated tool for annotating protein families within the MDR superfamily by finding common properties among the member proteins. The purpose is to increase the understanding of both the superfamily as a whole and the individual protein families, and thereby to increase the amount of knowledge gained for free when newly found proteins match one of the member families.
Method: The databases contain several different types of data that required different handling. Textual data was handled with ordinary string processing, while numeric data required more advanced treatment such as statistical calculations. Ontology data was handled as tree nodes in order to take the inherent relationships between terms into account. The tool was implemented as a plugin-based system to simplify extension and the inclusion of further data sources.
Results: The biggest challenge was dealing with missing data, above all in less studied families, which lowered the statistical reliability of the results. MDR proteins in human and mouse were found to have at least one of the Pfam domains ADH_N and ADH_zinc_N. They also take part in oxidation-reduction processes and often use NAD or NADP as cofactor. Many of the proteins bind zinc and are expressed in liver.
Conclusions: A Python-based tool for automatic annotation of protein families belonging to the MDR superfamily has been created. The tool is easy to extend for use with new data sources, and its results agree well with the literature. The usefulness of and need for the system, as well as the quality of its results, will continuously increase over time, even if no new extensions are created, since the system can make more, and more detailed, inferences as new data become available.
Acknowledgments
Firstly, I would like to thank my supervisor Joel Hedlund for all the good comments and constant positive encouragements and my examiner Bengt Persson. I would also like to thank Torbjörn Jonsson, my boss and mentor at IDA who gave me a lot of administrative work instead of the usual scheduled education to get time to focus on this thesis. Christopher and Erik at home for good comments, discussions and support. My family for all their support and encouragements and finally, all my friends at [hg] (and especially Petter) for all the good times and things to do in my spare time.
Contents
1 Introduction
1.1 Bioinformatics
1.2 The MDR families
1.2.1 ADH - Alcohol Dehydrogenase
1.2.2 VAT1 - Vesicle Amine Transport Protein 1
1.2.3 FAS - Fatty Acid Synthase
1.2.4 MECR - Mitochondrial trans-2-enoyl-CoA Reductase
1.2.5 PDH - Pyruvate Dehydrogenase
1.2.6 PTGR - Prostaglandin Reductase
1.2.7 vertQOR - Vertebrate Quinone Oxidoreductase
1.2.8 ZADH2 - Zinc-binding Alcohol Dehydrogenase domain containing protein 2
2 Methods
2.1 Implementation
2.2 Databases and services
2.2.1 Uniprot
2.2.2 PICR
2.2.3 BioGPS
2.2.4 Human Protein Atlas
2.2.5 ArrayExpress
2.2.6 KEGG
2.2.7 Reactome
2.2.8 Gene Ontology
2.2.9 InterPro
2.2.10 PFAM
2.2.11 PROSITE
2.3 Dendrogram generation
3 Results and discussion
3.1 ADH
3.2 VAT1
3.3 vertQOR
3.4 PTGR
3.5 FAS
3.6 MECR
3.7 PDH
3.8 ZADH2
4 Conclusions
5 Future Work
A Tables
B List of Supplementary data
1
Introduction
The great challenge in bioinformatics is data integration. Each group focuses on “their” topic and publishes their own data. There are no common unified standards of where, or how, the data should be stored but rather multiple, independently evolving, de facto standards. The amount of available data, as well as the number of databases, is growing steadily.
Hedlund et al.[1] have assembled a database of proteins in the medium-chain dehydrogenases/reductases (MDR) protein superfamily by creating hidden Markov models (HMMs) describing the member families. The aim of this project is to build an automated tool to annotate the different member families in this database by finding common properties of member proteins. The data will be gathered from several publicly available databases. The MDRs are widespread and exist in all kingdoms of life; however, the main focus of this project will be on human and mouse proteins, since these two species are the richest in reliable annotation sources. The goal is to find common properties of each member family to increase the understanding of both the member families and the entire MDR superfamily. This knowledge will then be helpful when a new, unannotated protein is matched as a member of a specific MDR member family. The tool will be created in a general way, making it easy to extend to include other databases and to annotate other protein families.
The utility and necessity of this system, as well as the projected quality of its produced results, are expected to only ever increase over time, even if no additional extensions are produced, as the system itself is able to make further and more detailed inferences as more and more data become available.
1.1
Bioinformatics
The Human Genome Project (HGP) was launched in 1990 and presented its first draft sequence around the year 2000. The goal of the HGP was to sequence the human genome, and this generated a lot of data. Handling this vast amount of data required advanced computational algorithms and thus the field of bioinformatics was born. The main goal of bioinformatics is to increase the knowledge of all biological processes by collecting, organising, interpreting and visualising the available data. Easy access to good, annotated data makes it easier for researchers to find new fields of study.
At present, there is a lot of data on proteins but not as much on families and superfamilies, and we have found no other software that annotates protein families.
1.2
The MDR families
Hedlund et al.[1] estimate the number of MDR families to be around 500. In their work, 86 of those have been classified. In this thesis the focus will be on the eight families, described below (adapted from [1,2]), that have at least two protein members expressed in human or mouse tissue.
1.2.1
ADH - Alcohol Dehydrogenase
ADH exists in five different forms (class I-V) in vertebrates. Class I is typically found in liver tissue and has high activity against several alcohols, including ethanol. Class II has been found in human liver. It is much less studied but metabolises peroxidic aldehyde and norepinephrine[3]. Class III metabolises formaldehyde and fatty acids. Class IV acts in stomach and other mucosa-producing tissues as a first step in metabolism of gastric alcohols. As an interesting side note, ADH class IV has an increased expression in esophageal cancer[4], possibly due to the carcinogenicity of acetaldehyde, the first metabolite of ethanol. Class V ADH has been found in human foetal liver.
1.2.2
VAT1 - Vesicle Amine Transport Protein 1
VAT1 shows expression during the development of the nervous system but also in hypothalamus and cerebellum. Members show ATPase activity and bind ATP and calcium. The family is thought to be involved in vesicle transport of neurotransmitters.
1.2.3
FAS - Fatty Acid Synthase
The main function of the proteins is to synthesise fatty acids by elongation of acetyl-CoA with malonyl-CoA. The reaction is NADPH dependent. In mammals, FAS is a homodimer in which each monomer consists of five domains. In bacteria and plants, on the other hand, the reaction is carried out by a system of several monofunctional enzymes[5].
1.2.4
MECR - Mitochondrial trans-2-enoyl-CoA Reductase
MECR seems to be a regulating transcription factor for mitochondrial respiratory proteins. In human, members are mainly expressed in muscle tissue, but members are present in plants, insects, worms, fish and mammals. Members have been shown to act in fatty acid synthesis in yeast mitochondria. They catalyse the reduction of trans-2-enoyl-CoA to acyl-CoA and depend on NADPH.
1.2.5
PDH - Pyruvate Dehydrogenase
PDH transforms pyruvate into acetyl-CoA in an NAD-dependent way. It exists in mitochondria, providing a link between glycolysis and the tricarboxylic acid (TCA) cycle. The general structure is a heterotetramer of two alpha and two beta subunits.
1.2.6
PTGR - Prostaglandin Reductase
This family contains both the human prostaglandin reductase (PGR) 1 and 2, but most of the family consists of prokaryotic members. Members are commonly expressed in liver, kidney, intestine, spleen and stomach, but also in bronchial epithelial cells and heart.
1.2.7
vertQOR - Vertebrate Quinone Oxidoreductase
This family consists of both zinc-containing members and members lacking zinc. Generally, the zinc-containing members have been studied to a greater extent, making the available data for the zinc-lacking members less detailed. The enzymatic function is described as quinone reduction with NADPH as cofactor, but a catalytic mechanism has yet to be proposed[6].
1.2.8
ZADH2 - Zinc-binding Alcohol Dehydrogenase domain containing protein 2
ZADH2 is a very small family with only four members in SwissProt, of which only two exist in human or mouse. This family has not yet been studied to any great extent, and very little information can currently be found.
2
Methods
The tool was implemented in Python using the algorithm described below. The MDR families were read from a local version of the mdr-enzymes.org database. Each protein was then annotated with data from UniProt to get synonyms for the protein names in different databases and some general information such as Gene Ontology[7] terms and InterPro[8], Pfam[9] and PROSITE[10] sites found in the protein sequence. The UniProt database also gives primary and secondary accession numbers for each protein, which were used to remove any duplicates in the data. A duplicate entry was defined as having a primary accession number that is a secondary accession number for another protein. Manual sequence comparison of the matched proteins confirmed the validity of this method.
Duplicate removal was followed by synonym gathering. Many databases use the same name for a specific protein, but some use other names. Therefore, mapping between different databases had to be done. Several plugins were developed to fetch synonyms for proteins.
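The duplicate definition above can be illustrated with a short Python sketch. It is not the tool's actual code: the `Protein` record and the `remove_duplicates` function are hypothetical names introduced here, and the accession numbers in the example are made up.

```python
from dataclasses import dataclass, field


@dataclass
class Protein:
    """Hypothetical record holding the accession numbers of one entry."""
    primary: str
    secondary: set = field(default_factory=set)


def remove_duplicates(proteins):
    """Drop every entry whose primary accession number is a secondary
    accession number of another entry in the list."""
    kept = []
    for protein in proteins:
        is_duplicate = any(
            protein.primary in other.secondary
            for other in proteins
            if other is not protein
        )
        if not is_duplicate:
            kept.append(protein)
    return kept


# Made-up accession numbers, for illustration only.
entries = [
    Protein("A00001", {"A00002"}),
    Protein("A00002"),
    Protein("A00003"),
]
unique = remove_duplicates(entries)
print([p.primary for p in unique])
```

Here the second entry is dropped because its primary accession number appears among the secondary accession numbers of the first.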
When synonyms had been found, information from chosen databases was gathered for each protein. Common values from all proteins in each protein family were then collected. A value found in at least half of all proteins in a family was declared common for the entire family. A flowchart of the method can be seen in figure 2.1.
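The "at least half" rule can be sketched as follows. This is a minimal illustration under stated assumptions: the function name `common_values`, the `min_fraction` parameter and the input layout (one list of values per protein) are hypothetical, and proteins without any data are assumed to still count towards the family size.

```python
from collections import Counter


def common_values(values_per_protein, min_fraction=0.5):
    """Return the values present in at least `min_fraction` of the proteins."""
    n_proteins = len(values_per_protein)
    counts = Counter()
    for values in values_per_protein.values():
        counts.update(set(values))  # count each value once per protein
    return {v for v, c in counts.items() if c >= min_fraction * n_proteins}


# Made-up annotations for a family of four proteins.
family = {
    "protein_1": ["liver", "kidney"],
    "protein_2": ["liver"],
    "protein_3": ["heart", "liver", "kidney"],
    "protein_4": ["brain"],
}
print(sorted(common_values(family)))
```

In this toy family, "liver" (three of four proteins) and "kidney" (two of four) reach the limit, while "heart" and "brain" do not.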
2.1
Implementation
The final tool is plugin based with four types of plugins: source plugins, which read protein families from a source database; synonym plugins, which find synonyms for proteins; database plugins, which query a data source for information on proteins and find common values; and output plugins, which write the results to file. All plugins get automatic access to configuration files to ease the alteration of limits and paths to data files and results. There are both global settings and local settings for each plugin; the local settings can override the global configuration. For further information see the source code in appendix B and a short manual in appendix C.
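One way such a plugin layout could look in Python is sketched below. All class, method and setting names are assumptions made for this illustration; the actual interfaces are defined in the source code referred to above.

```python
from abc import ABC, abstractmethod

# Hypothetical global configuration shared by all plugins.
GLOBAL_SETTINGS = {"min_fraction": 0.5, "result_dir": "results"}


class Plugin(ABC):
    """Base class: every plugin sees the global settings, optionally
    overridden by its own local settings."""

    local_settings = {}

    @property
    def settings(self):
        return {**GLOBAL_SETTINGS, **self.local_settings}


class SourcePlugin(Plugin):
    @abstractmethod
    def read_families(self):
        """Return a mapping of family name to protein identifiers."""


class SynonymPlugin(Plugin):
    @abstractmethod
    def find_synonyms(self, protein_id):
        """Return alternative identifiers for one protein."""


class DatabasePlugin(Plugin):
    @abstractmethod
    def common_values(self, protein_ids):
        """Query the data source and return values common to the family."""


class OutputPlugin(Plugin):
    @abstractmethod
    def write(self, results):
        """Write the collected results to file."""


class StrictExample(DatabasePlugin):
    """Toy database plugin that overrides one global setting locally."""

    local_settings = {"min_fraction": 0.75}

    def common_values(self, protein_ids):
        return set()  # a real plugin would query its data source here


plugin = StrictExample()
print(plugin.settings)
```

Merging the two dictionaries with the local one last is a simple way to let a plugin override a global limit while inheriting everything else.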
Figure 2.1: Summarised work flow of the tool: (1) gather proteins from source databases, (2) extract information from UniProt, (3) remove duplicates, (4) find synonyms, (5) gather information from selected databases, (6) find common values, (7) save results to file.
2.2
Databases and services
The many different databases and services used during this project are described briefly in the following sections.
2.2.1
Uniprot
The widely used Universal Protein Resource (UniProt) is a protein sequence and annotation data resource that consists of several databases. The UniProt Knowledgebase provides both manually curated information within the SwissProt subsection, and automatically added annotation within the much larger TrEMBL subsection. No distinction has been made between the two sources in this project.
2.2.2
PICR
The Protein Identifier Cross-Referencing service[11] (PICR) is a web application that maps protein names between different databases. The results are based on the UniProt Archive (UniParc), a data warehousing service that UniProt updates daily to account for new releases of source databases. Many of the results should also be found through the UniProt queries, but the UniParc database stores data from many sources other than TrEMBL and SwissProt.
2.2.3
BioGPS
BioGPS[12] is a gene annotation portal that lets the user customise the layout
of the page. They also have a gene expression atlas based on human and mouse
protein-encoding transcriptomes[13]. They have created a data set based on both
a combined Human U133/GNF1H array and the mouse GNF1M and MOE430 arrays.
The fact that the tissues studied in different experiments are different increases the number of tissues studied but makes it harder to find common expression profiles. All the calculations are species based to include as many tissues as possible. Three steps had to be taken to gather data from this data set. Since there were several expression levels for each tissue and the difference between a positive and negative value was quite large, the data was normalised by dividing each data point by the harmonic mean value

\[ m = \frac{n}{\sum_{i=1}^{n} \frac{1}{x_i}} \tag{2.1} \]

for that gene and then calculating mean values for each tissue. The BioGPS dataset is provided in two parts: annotations and data. The data part is organised by probe id, so in order to be applicable, all proteins needed to be annotated with probe id synonyms using the annotations part. Several different synonyms for the protein could be used to extract the probe id, but SwissProt accession numbers and Ensembl gene ids were used primarily. In order to emphasise large overexpressions, the tissues were sorted according to their t-statistic

\[ t = \frac{x - \bar{X}}{s} \tag{2.2} \]

where $x$ is an expression value for a specific tissue, $\bar{X}$ is the mean value for all tissues and $s$ is the sample standard deviation for the data set. The highest scoring tissues are shown in tables A.4 and A.5.
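Equations 2.1 and 2.2 translate directly into a few lines of Python. The sketch below is an illustration only: the function names and the input layout are assumptions, the expression values are made up, and the per-tissue averaging step is assumed to have been done before `rank_tissues` is called.

```python
from statistics import mean, stdev


def harmonic_mean(values):
    """Equation 2.1: n divided by the sum of the reciprocals.
    Raises ZeroDivisionError if a value is zero."""
    return len(values) / sum(1.0 / x for x in values)


def normalise(values):
    """Divide every data point of one gene by the harmonic mean of its values."""
    m = harmonic_mean(values)
    return [x / m for x in values]


def rank_tissues(tissue_means):
    """Equation 2.2: sort tissues by t = (x - mean) / s, highest first."""
    values = list(tissue_means.values())
    x_bar = mean(values)
    s = stdev(values)  # sample standard deviation
    scores = {tissue: (x - x_bar) / s for tissue, x in tissue_means.items()}
    return sorted(scores.items(), key=lambda item: item[1], reverse=True)


# Made-up mean expression values per tissue for one gene.
ranked = rank_tissues({"liver": 10.0, "heart": 2.0, "brain": 3.0})
print([tissue for tissue, _ in ranked])
```

Sorting by the t-statistic puts the tissue that deviates most above the overall mean first, which is what emphasises large overexpressions.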
2.2.4
Human Protein Atlas
The Human Protein Atlas[14,15] (HPA) provides immunohistochemically stained
images of both diseased and normal human tissues, effectively providing specific
maps of protein presence in the human body in various conditions. It has a
coverage of about 25% of the human genome and since all the conclusions are drawn by trained pathologists, the database is quite reliable.
Sadly, very few of the MDR proteins are available in the HPA database but it is an active project and hopefully data can be gathered from HPA in future releases of the database.
2.2.5
ArrayExpress
ArrayExpress[16] is a database of functional genomics experiments, including the Gene Expression Atlas with a subset of curated data. The ArrayExpress server was queried for every protein in each family, and a positive tissue was defined as a tissue having more up-regulated than down-regulated experiments. Tissues that were positive in at least half of the proteins were deemed common for the family.
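The two rules above (positive tissue, common tissue) can be sketched as follows. The sketch does not query the ArrayExpress server; the input layout, a mapping from tissue to counts of up- and down-regulated experiments, and all names are assumptions made for this illustration.

```python
def positive_tissues(experiments):
    """Tissues with more up-regulated than down-regulated experiments.

    `experiments` maps tissue -> (n_up, n_down).
    """
    return {t for t, (up, down) in experiments.items() if up > down}


def common_tissues(family, min_fraction=0.5):
    """Tissues that are positive in at least `min_fraction` of the proteins."""
    counts = {}
    for experiments in family.values():
        for tissue in positive_tissues(experiments):
            counts[tissue] = counts.get(tissue, 0) + 1
    return {t for t, c in counts.items() if c >= min_fraction * len(family)}


# Made-up counts of (up-regulated, down-regulated) experiments.
family = {
    "protein_1": {"liver": (5, 1), "heart": (1, 3)},
    "protein_2": {"liver": (2, 0), "heart": (4, 1)},
    "protein_3": {"liver": (1, 1), "heart": (0, 2)},
}
print(common_tissues(family))
```

A tie between up- and down-regulated experiments does not count as positive, following the "more ... than" wording of the definition.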
2.2.6
KEGG
KEGG, or Kyoto Encyclopedia of Genes and Genomes[17], is a database resource
containing data from 16 different sources. In this thesis the focus was on the KEGG Pathway database, containing maps for metabolism and other cellular processes. Data was extracted from UniProt and then processed as in the method description.
2.2.7
Reactome
Reactome[18] is a manually curated database of human cell reactions and pathways with at least 5000 distinct proteins. Reactome data was collected from UniProt and processed according to the method description.
2.2.8
Gene Ontology
The aim of the Gene Ontology[7] (GO) is to standardise the representation of
attributes of genes and gene products. It is used as a sort of thesaurus for biological terms where all terms have relationships with other terms (see example 2.1). The relationships can be shown in a tree format for easy overview.
Example 2.1: Gene Ontology terms
GO:0032496 (response to LPS) is a GO:0009617 (response to bacterium)
The GO terms were extracted from UniProt for each protein, and the number of positive terms was then increased by using the relations between the terms. The number of occurrences of each term was counted, after which the number of occurrences of a more specific term was added to that of each more general term, as shown in figure 2.2. The tree structure was generated using AmiGO[19].
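The cumulative counting can be illustrated with a small Python sketch. It rests on several assumptions: the `PARENTS` mapping is a hand-written, simplified set of is-a links rather than data read from the Gene Ontology, all function names are hypothetical, and each protein is counted at most once per term.

```python
from collections import Counter

# Hand-written is_a links (child -> parents), for illustration only.
PARENTS = {
    "GO:0008270": ["GO:0046914"],  # zinc ion binding -> transition metal ion binding
    "GO:0046914": ["GO:0046872"],  # transition metal ion binding -> metal ion binding
    "GO:0046872": [],              # metal ion binding (top of this toy tree)
}


def with_ancestors(terms, parents):
    """Return the given terms together with all their ancestors."""
    seen = set()
    stack = list(terms)
    while stack:
        term = stack.pop()
        if term not in seen:
            seen.add(term)
            stack.extend(parents.get(term, []))
    return seen


def cumulative_counts(terms_per_protein, parents):
    """Count, for every term, the proteins annotated with it or a descendant."""
    counts = Counter()
    for terms in terms_per_protein.values():
        counts.update(with_ancestors(terms, parents))
    return counts


# Made-up annotations: each protein carries one term of different specificity.
annotations = {
    "protein_1": {"GO:0008270"},
    "protein_2": {"GO:0046914"},
    "protein_3": {"GO:0046872"},
}
counts = cumulative_counts(annotations, PARENTS)
print(counts["GO:0046872"])
```

In this toy example the most general term ends up with a count of three even though only one protein is annotated with it directly, which is how a general term can reach the inclusion limit through its descendants.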
2.2.9
InterPro
InterPro[8] combines several member data sources to derive protein signatures. The member databases annotate protein sequences with different focuses, which gives a good overview of the sequence. Integration of the source databases is done manually and the resulting entry is annotated with crosslinks to other databases.
Figure 2.2: Part of the PTGR Gene Ontology Tree. Terms in thin boxes were found among the proteins, but not with sufficient frequency for inclusion in the results. Terms in boldface boxes had enough occurrences to be included in the results, either in their own right or after cumulative addition of descendant terms. Terms in dashed boxes were not found in any of the proteins, but are included in this figure to show ancestry.
2.2.10
PFAM
PFAM[9] is a collection of protein domain models represented by HMMs. The database contains two kinds of entries, Pfam-A and Pfam-B. The Pfam-A entries are based on manually constructed multiple sequence alignments (MSAs). Pfam-B entries are generated automatically and have lower quality.
Since the MDR family HMMs were bootstrapped from the Pfam HMMs ADH_N and ADH_zinc_N, one could expect to find at least one of these signatures in the majority of the results.
2.2.11
PROSITE
PROSITE annotates protein domains using profiles, which are similar to the profile HMMs used by Pfam-A. It also annotates smaller sequence features, for example functional sites or post-translationally modified sites, using a simple pattern matching technique similar to the regular expressions found in many programming languages such as Perl.
2.3
Dendrogram generation
To generate the cladogram seen in Fig. 3.2, three software packages were used: ClustalW, MrBayes and Dendroscope. All protein sequences were downloaded in FASTA format from UniProt and aligned using ClustalW; trees were then generated with MrBayes and displayed using Dendroscope.
3
Results and discussion
The results sorted by database can be seen in appendix A. In the following chapter the results for each family will be presented and analysed.
3.1
ADH
Both ArrayExpress and BioGPS show that ADH is expressed in liver, as expected according to 1.2.1, but both sources also show some expression in heart and adipocytes. Only BioGPS shows expression in mucosa such as eye and epidermis, and then only for mouse proteins. This is probably due to the inclusion limit of 50%. As stated earlier, only ADH class IV is commonly expressed outside of the liver, so it is not that unexpected to find a majority of the proteins only in liver tissue.
The GO terms show that most members have oxidoreductase activity (GO:0016491) according to Fig. 3.1 and also that they bind zinc. About 60% are part of an ethanol metabolic process.
InterPro shows that the members have a NAD(P) binding domain and confirms the zinc binding shown with GO terms.
Only a fraction of the total number of members were found in Reactome, but all six that were found took part in biological oxidations, which confirms the GO results.
KEGG only annotated 16 of the members. However, since these are distributed more or less evenly throughout the evolutionary tree of the family (Fig. 3.2), there is a high probability that the results presented here are indeed representative of the family as a whole, especially since the KEGG annotations are in complete consensus. Most of the pathways reported are included because of the alcohol → aldehyde conversion function of most members, but they also take part in cytochrome P450 pathways such as chloral hydrate → trichloroethanol conversion.
3.2
VAT1
VAT1 seems to be expressed mainly in tissues within the central nervous system such as adrenal gland, hypothalamus and the amygdala. Both BioGPS and ArrayExpress also show raised levels in adipose tissue. No satisfying reason for this has been found in the literature, but it is a very interesting avenue to explore in further laboratory experiments. The proteins have the ADH domain and bind zinc according to InterPro and PFAM. They also have a NAD binding domain and match the quinone oxidoreductase / ζ-crystallin PFAM pattern.
Figure 3.1: Part of the ADH GO tree, using the same graphical representation as figure 2.2.
3.3
vertQOR
vertQOR is highly expressed in liver and kidney. This is expected since the main function is to deactivate toxic quinones, and it seems reasonable that the proteins are found in the renal system. Members have the NAD(P) binding domain and match the ζ-crystallin pattern, a result that is quite expected since the ζ-crystallin pattern defines the NADP-dependent quinone oxidoreductases. The four members found were all zinc-binding.
3.4
PTGR
Both expression databases show that PTGRs are expressed in small intestine and stomach. Members bind zinc and NAD(P). Many of the proteins exist in cytoplasm according to gene ontology terms and have oxidoreductase activity, mainly 15-oxoprostaglandin 13-oxidase (PGR) activity but also 2-alkenal reductase and alcohol dehydrogenase activity as seen in Fig. 2.2.
Figure 3.2: Cladogram of the ADH family with members annotated in KEGG in boldface. This is the consensus tree from 10000 Monte Carlo reconstructions using the MrBayes program, wherein no other credible topologies could be detected.
3.5
FAS
FAS was found to be expressed at high levels in adipocyte tissue as expected, but also in muscle (skeletal and cardiac). ArrayExpress indicated high expression in many tissues, but the results were not confirmed by BioGPS.
In Fig. 3.3, one can see that many of the GO terms found are linked with fatty acid synthase activity, which would have been a very expected result. However, since no member was annotated with that GO term, it was not included in the result table. Potentially, this sort of non-detection could be remedied by stepwise generalisation up through the ontology tree until an ancestral term has been found that can describe sufficiently many of the member proteins. However, as this traversal could prove computationally costly, and could potentially also yield uselessly over-general terms, it has not been implemented in this version of the tool.
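As a rough illustration of what such a stepwise generalisation might look like, the Python sketch below propagates every annotation to all ancestors and then keeps only the most specific terms that reach the inclusion limit. It is not part of the tool: the term identifiers are placeholders, and the function names and the `min_fraction` parameter are hypothetical.

```python
from collections import Counter

# Hypothetical child -> parents links; the identifiers are placeholders.
PARENTS = {
    "child_a": ["parent"],
    "child_b": ["parent"],
    "child_c": ["parent"],
    "parent": ["root"],
    "root": [],
}


def ancestors_and_self(term, parents):
    """Return the term together with all of its ancestors."""
    found, stack = set(), [term]
    while stack:
        current = stack.pop()
        if current not in found:
            found.add(current)
            stack.extend(parents.get(current, []))
    return found


def most_specific_common_terms(terms_per_protein, parents, min_fraction=0.5):
    """Return the terms covering at least `min_fraction` of the proteins that
    have no descendant which also reaches that coverage."""
    counts = Counter()
    for terms in terms_per_protein.values():
        covered = set()
        for term in terms:
            covered |= ancestors_and_self(term, parents)
        counts.update(covered)
    limit = min_fraction * len(terms_per_protein)
    frequent = {t for t, c in counts.items() if c >= limit}
    # Discard every term that is a strict ancestor of another frequent term.
    too_general = set()
    for term in frequent:
        too_general |= ancestors_and_self(term, parents) - {term}
    return frequent - too_general


# Three members, each annotated with a different child term.
members = {
    "protein_1": {"child_a"},
    "protein_2": {"child_b"},
    "protein_3": {"child_c"},
}
print(most_specific_common_terms(members, PARENTS))
```

In this toy case none of the child terms reaches the limit on its own, but their shared parent does. Discarding the strict ancestors of other frequent terms is one possible guard against the over-general terms mentioned above.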
All members bind zinc and NAD(P) and most have an acyl carrier domain and a transferase domain.
Figure 3.3: Reduced GO tree for FAS to highlight the terms associated with fatty acid synthase, with the same graphical representation as figure 2.2.
3.6
MECR
No common expression profiles for the two data sources could be found, but BioGPS reported a high value for heart tissue. It was quite unexpected that none of the sources confirmed high expression in muscle tissue. According to GO, the proteins exist in mitochondria, have oxidoreductase activity and bind zinc, the latter being confirmed by Pfam and InterPro.
3.7 PDH
The PDH family has members in many different species, but only two proteins were found here, one in each of the species studied. Gene Ontology terms show that the members are located in the mitochondrial membrane, cilium and flagellum. Members are expressed in liver, kidney and testis. The proteins bind zinc and NAD(P).
3.8 ZADH2
Both Pfam and GO show that the members bind zinc. The data gathered from BioGPS and ArrayExpress give no clear indication of where the members are expressed. InterPro and Prosite both show that all members match the ζ-crystallin pattern.
4 Conclusions
The main goal in this project was to create a tool to annotate protein families by finding common properties in the member proteins. The focus was on the member families of the medium chain dehydrogenase/reductase superfamily with human and/or mouse protein members. The tool was supposed to be extendable and easy to use for other protein families.
During this project, a Python-based tool that annotates protein families with data from several data sources has been created. The tool is plugin-based, which makes it easy to extend and adapt to each user's preferences. Regardless of this, the utility and necessity of the system, as well as the quality of its results, are expected to increase over time even if no additional extensions are produced, since the system itself can make further and more detailed inferences as more data become available.
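As an illustration of what a plugin-based design of this kind can look like, the following is a minimal sketch. It does not reproduce the PFAS plugin interface; the class names, the annotate method and the toy data are all hypothetical.

```python
class AnnotationPlugin(object):
    """Hypothetical base class; one subclass per data source."""
    name = "base"

    def annotate(self, protein_ids):
        """Return a dict mapping protein id -> set of terms."""
        raise NotImplementedError


class StaticPlugin(AnnotationPlugin):
    """Toy data source backed by an in-memory dict (invented data)."""
    name = "static"

    def __init__(self, table):
        self.table = table

    def annotate(self, protein_ids):
        # Proteins unknown to this source get an empty set of terms.
        return dict((pid, set(self.table.get(pid, ())))
                    for pid in protein_ids)


def run_plugins(plugins, protein_ids):
    """Collect the annotations from every plugin, keyed by plugin name."""
    return dict((plugin.name, plugin.annotate(protein_ids))
                for plugin in plugins)


toy = StaticPlugin({"P1": ["zinc ion binding"], "P2": ["zinc ion binding"]})
print(run_plugins([toy], ["P1", "P2", "P3"]))
```

With this shape, supporting a new data source only requires another subclass; the code that compares the collected annotations does not need to change.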
ADH members show expression in liver, heart and adipocytes. BioGPS also shows expression in mucosa such as eye and epidermis, but only for mouse proteins. Members bind zinc and most take part in ethanol metabolic processes. KEGG annotated 16 proteins distributed evenly throughout the evolutionary tree of the family. Most pathways were linked to the alcohol → aldehyde conversion seen in members, but some members also take part in cytochrome P450 pathways such as the chloral hydrate → trichloroethanol conversion.
VAT1 seems to be expressed mainly in tissues within the central nervous system such as adrenal gland, hypothalamus and the amygdala, but also in adipose tissue. No satisfying reason for this has been found in the literature, but it is a very interesting avenue to explore in further laboratory experiments. Members match the quinone oxidoreductase / ζ-crystallin Pfam pattern.
vertQOR is highly expressed in liver and kidney. Members also match the quinone oxidoreductase / ζ-crystallin Pfam pattern.
Both expression databases show that PTGRs are expressed in small intestine and stomach. Many of the proteins exist in cytoplasm and have oxidoreductase activity, mainly 15-oxoprostaglandin 13-oxidase (PGR) activity but also 2-alkenal reductase and alcohol dehydrogenase activity.
FAS was found to be expressed in high levels in adipocyte tissue and in muscle tissue (skeletal and cardiac). Many of the found GO-terms are linked with fatty acid synthase activity. All members bind zinc and NAD(P) and most have an acyl carrier and transferase domain.
No common expression profiles for MECR have been found between the two data sources, but BioGPS reported a high value for heart tissue. The proteins exist in mitochondria, have oxidoreductase activity and bind zinc.
PDH members have been shown to be located in the mitochondrial membrane, cilium and flagellum. Members are expressed in liver, kidney and testis. Proteins bind zinc and NAD(P).
Data sources have not given any clear indications of where ZADH2 members are expressed but they seem to bind zinc and match the ζ-crystallin pattern.
The results agree quite well with information gathered from the literature and show that all the human and mouse MDR members have a Pfam ADH domain (ADH_N and/or ADH_zinc_N). This is encouraging, since the hidden Markov models used to generate the MDR families were bootstrapped from these signatures. The proteins take part in oxidation-reduction processes, often with NAD or NADP as cofactor, and many of the proteins contain zinc and are expressed in liver tissue. Some also have the quinone oxidoreductase / ζ-crystallin domain.
5 Future Work
During this project I have identified several interesting avenues for further exploration that may lead to increased sensitivity and reliability of the system. Firstly, running the tool on all proteins in each family, instead of only on families with proteins expressed in human and mouse tissue, would increase the statistical value of the results. A problem would of course be that not all proteins have been annotated, and some of the databases do not support several of the available organisms. Secondly, the results would be more reliable if a distinction were made between automatically annotated and expert-reviewed data. During development, I chose not to make this distinction because of the small sample size. If the full MDR database were used, I believe that this would be crucial for obtaining reliable data. A possible improvement of the Gene Ontology plugin would be to include new terms by stepwise generalisation up through the ontology tree until an ancestral term has been found that can describe sufficiently many of the member proteins, as discussed in section 3.5, possibly with some heuristic to avoid generating uselessly general results.
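For Gene Ontology data, the distinction between automatically assigned and reviewed annotations could be based on the evidence code attached to each annotation: the code IEA (Inferred from Electronic Annotation) marks annotations that have not been reviewed by a curator. The following is a minimal sketch of such a filter and is not part of the current tool; the tuple layout and the example records are invented for illustration.

```python
# GO evidence code for annotations assigned without curator review.
ELECTRONIC_CODES = frozenset(["IEA"])


def split_by_review(annotations):
    """Split (protein, term, evidence_code) tuples into reviewed/automatic."""
    reviewed, automatic = [], []
    for record in annotations:
        evidence = record[2]
        if evidence in ELECTRONIC_CODES:
            automatic.append(record)
        else:
            reviewed.append(record)
    return reviewed, automatic


# Invented example records (protein id, GO term, evidence code).
records = [
    ("P1", "GO:0008270", "IEA"),
    ("P1", "GO:0016491", "IDA"),
    ("P2", "GO:0008270", "TAS"),
]
reviewed, automatic = split_by_review(records)
print("%d reviewed, %d automatic" % (len(reviewed), len(automatic)))
```

The two lists could then be counted separately, so that a term supported only by electronic annotation can be reported with lower confidence instead of being discarded.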
A Tables
Family name         ADH  FAS  PDH  MECR  VAT1  PTGR  vertQOR  ZADH2
Number of proteins  33   5    2    4     14    8     4        5
Table A.1: Number of proteins found in each family
Family Term Number of proteins Percentage
ADH
PF08240 ADH_N 29 88
PF00107 ADH_zinc_N 33 100
FAS
PF00975 Thioesterase 3 60
PF00698 Acyl_transf_1 4 80
PF08659 KR 4 80
PF00550 PP-binding 4 80
PF02801 Ketoacyl-synt_C 4 80
PF00109 ketoacyl-synt 4 80
PF08242 Methyltransf_12 5 100
PF00107 ADH_zinc_N 5 100
PDH
PF08240 ADH_N 2 100
PF00107 ADH_zinc_N 2 100
MECR
PF08240 ADH_N 4 100
PF00107 ADH_zinc_N 4 100
VAT1
PF08240 ADH_N 12 86
PF00107 ADH_zinc_N 14 100
PTGR
PF00107 ADH_zinc_N 8 100
vertQOR
PF08240 ADH_N 4 100
PF00107 ADH_zinc_N 4 100
ZADH2
PF08240 ADH_N 4 80
PF00107 ADH_zinc_N 5 100
Table A.2: PFAM annotations
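As a worked illustration of how the Number and Percentage columns relate, the sketch below counts the members carrying each term and converts the count to a whole-number percentage. It is an assumption, not taken from the PFAS source, that halves are rounded up; with that rule 29 of 33 proteins gives 88 and 12 of 14 gives 86, as in Table A.2. The function names and the toy family are hypothetical.

```python
def percentage(count, total):
    """Whole-number percentage with halves rounded up (integer arithmetic)."""
    return (200 * count + total) // (2 * total)


def term_table(annotations):
    """Map each term to (number of proteins, percentage of the family).

    annotations maps protein id -> set of terms found for that protein.
    """
    total = len(annotations)
    counts = {}
    for terms in annotations.values():
        for term in terms:
            counts[term] = counts.get(term, 0) + 1
    return dict((term, (n, percentage(n, total)))
                for term, n in counts.items())


# Invented four-member family.
family = {
    "P1": {"PF08240", "PF00107"},
    "P2": {"PF08240", "PF00107"},
    "P3": {"PF00107"},
    "P4": {"PF08240", "PF00107"},
}
print(term_table(family))
```

Integer arithmetic is used instead of Python's built-in round, which rounds exact halves to the nearest even number and would turn 62.5 into 62.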
Family Term Number of proteins Percentage
ADH
GO:0016491 oxidoreductase activity 31 94
GO:0008270 zinc ion binding 25 76
GO:0006067 ethanol metabolic process 20 61
FAS
GO:0031177 phosphopantetheine binding 4 80
GO:0055114 oxidation-reduction process 4 80
GO:0000036 acyl carrier activity 4 80
GO:0042587 glycogen granule 3 60
GO:0048037 cofactor binding 4 80
GO:0009058 biosynthetic process 3 60
GO:0016740 transferase activity 4 80
GO:0005739 mitochondrion 3 60
GO:0016491 oxidoreductase activity 3 60
GO:0008270 zinc ion binding 5 100
PDH
GO:0055114 oxidation-reduction process 2 100
GO:0005929 cilium 2 100
GO:0031966 mitochondrial membrane 2 100
GO:0019861 flagellum 2 100
GO:0005625 soluble fraction 2 100
GO:0003939 L-iditol 2-dehydrogenase activity 2 100
MECR
GO:0016491 oxidoreductase activity 4 100
GO:0055114 oxidation-reduction process 4 100
GO:0008270 zinc ion binding 4 100
GO:0005739 mitochondrion 3 75
VAT1
GO:0016491 oxidoreductase activity 14 100
GO:0055114 oxidation-reduction process 14 100
GO:0008270 zinc ion binding 14 100
PTGR
GO:0016491 oxidoreductase activity 8 100
GO:0055114 oxidation-reduction process 8 100
GO:0032440 2-alkenal reductase activity 5 63
GO:0047522 15-oxoprostaglandin 13-oxidase activity 7 88
GO:0008270 zinc ion binding 8 100
GO:0005737 cytoplasm 5 63
vertQOR
GO:0055114 oxidation-reduction process 4 100
GO:0016491 oxidoreductase activity 4 100
GO:0008270 zinc ion binding 4 100
ZADH2
GO:0016491 oxidoreductase activity 5 100
GO:0055114 oxidation-reduction process 5 100
GO:0008270 zinc ion binding 5 100
Family Organism Tissue t-score
ADH
human
Liver 6.51
Fetalliver 3.52
Adipocyte 3.46
colon 2.55
small intestine 2.07
mouse
liver 7.03
cornea 4.15
adrenal gland 2.65
bladder 2.10
kidney 1.65
intestine large 1.32
FAS
human
SkeletalMuscle 2.88
CardiacMyocytes 2.88
Lymphoma burkitts(Raji) 2.27
AdrenalCortex 2.27
TrigeminalGanglion 2.27
Fetallung 1.97
SuperiorCervicalGanglion 1.67
Bonemarrow 1.67
Tongue 1.67
Liver 1.67
Heart 1.67
mouse
adipose brown 7.37
adrenal gland 4.01
mammary gland lact 2.67
mammary gland non-lactating 2.21
ovary 1.74
adipose white 1.58
MECR
human
Heart 6.30
Lymphoma burkitts(Raji) 4.22
Lymphoma burkitts(Daudi) 1.76
721 B lymphoblasts 1.50
Wholebrain 1.37
OccipitalLobe 1.11
mouse
B-cells marginal zone 4.12
pancreas 3.73
lens 3.27
thymocyte SP CD4+ 2.48
B-cells GL7negative Alum 1.52
T-cells CD8+ 1.41
adipose brown 1.39
macrophage bone marrow 2hr LPS 1.34
retina 1.30
Family Organism Tissue t-score
PDH
human
Thyroid 6.10
Prostate 4.71
FetalThyroid 2.96
Liver 2.58
Kidney 1.64
mouse
kidney 6.03
liver 4.88
testis 3.84
mast cells IgE 3.11
mast cells IgE+antigen 1hr 1.25
PTGR
human
BronchialEpithelialCells 7.61
small intestine 3.07
SmoothMuscle 2.13
CardiacMyocytes 2.03
mouse
stomach 7.25
cornea 4.86
bladder 1.95
VAT1
human
Adrenalgland 4.60
SmoothMuscle 3.56
Adipocyte 2.47
Lung 2.10
BronchialEpithelialCells 1.59
Leukemialymphoblastic(MOLT-4) 1.58
CardiacMyocytes 1.38
AdrenalCortex 1.31
retina 1.28
Hypothalamus 1.23
mouse
macrophage peri LPS thio 1hrs 5.14
macrophage peri LPS thio 0hrs 5.06
macrophage peri LPS thio 7hrs 4.37
neuro2a 1.36
mast cells 1.26
osteoclasts 1.21
vertQOR
human
721 B lymphoblasts 6.99
Fetalliver 2.64
Leukemia promyelocytic-HL-60 2.00
Kidney 1.75
pineal night 1.44
Thyroid 1.34
PancreaticIslet 1.33
mouse
kidney 9.52
liver 1.33
ZADH2
mouse
adipose brown 4.32
Baf3 3.39
adrenal gland 2.73
salivary gland 2.53
ovary 2.33
liver 1.83
heart 1.62
stomach 1.45
intestine large 1.18
Family Term Number of proteins Percentage
ADH
IPR011032 GroES-like 33 100
IPR013149 ADH_C 33 100
IPR002328 ADH_Zn_CS 29 88
IPR013154 ADH_GroES-like 29 88
IPR002085 ADH_SF_Zn 33 100
IPR016040 NAD(P)-bd_dom 32 97
FAS
IPR016038 Thiolase-like_subgr 4 80
IPR023102 Fatty_acid_synthase_dom_2 3 60
IPR009081 Acyl_carrier_prot-like 4 80
IPR016035 Acyl_Trfase/lysoPlipase 4 80
IPR020843 PKS_ER 5 100
IPR006163 Phsphopanteth-bd 4 80
IPR001227 Ac_transferase_dom 4 80
IPR020842 PKS/FAS_KR 4 80
IPR001031 Thioesterase 3 60
IPR006162 PPantetheine_attach_site 4 80
IPR016036 Malonyl_transacylase_ACP-bd 4 80
IPR013149 ADH_C 5 100
IPR013968 PKS_KR 4 80
IPR014030 Ketoacyl_synth_N 4 80
IPR013217 Methyltransf_12 5 100
IPR011032 GroES-like 5 100
IPR014043 Acyl_transferase 4 80
IPR018201 Ketoacyl_synth_AS 4 80
IPR014031 Ketoacyl_synth_C 4 80
IPR016039 Thiolase-like 4 80
IPR016040 NAD(P)-bd_dom 5 100
IPR000794 Beta-ketoacyl_synthase 5 100
PDH
IPR011032 GroES-like 2 100
IPR013149 ADH_C 2 100
IPR002328 ADH_Zn_CS 2 100
IPR013154 ADH_GroES-like 2 100
IPR002085 ADH_SF_Zn 2 100
IPR016040 NAD(P)-bd_dom 2 100
MECR
IPR002085 ADH_SF_Zn 4 100
IPR013154 ADH_GroES-like 4 100
IPR011032 GroES-like 4 100
IPR013149 ADH_C 4 100
IPR016040 NAD(P)-bd_dom 4 100
Family Term Number of proteins Percentage
VAT1
IPR011032 GroES-like 13 93
IPR013149 ADH_C 14 100
IPR013154 ADH_GroES-like 12 86
IPR002085 ADH_SF_Zn 14 100
IPR016040 NAD(P)-bd_dom 14 100
IPR002364 Quin_OxRdtase/zeta-crystal_CS 14 100
PTGR
IPR002085 ADH_SF_Zn 8 100
IPR011032 GroES-like 8 100
IPR013149 ADH_C 8 100
IPR014190 B4_12hDH 5 63
IPR016040 NAD(P)-bd_dom 8 100
vertQOR
IPR011032 GroES-like 4 100
IPR013149 ADH_C 4 100
IPR013154 ADH_GroES-like 4 100
IPR002085 ADH_SF_Zn 4 100
IPR016040 NAD(P)-bd_dom 4 100
IPR002364 Quin_OxRdtase/zeta-crystal_CS 4 100
ZADH2
IPR011032 GroES-like 4 80
IPR013149 ADH_C 5 100
IPR013154 ADH_GroES-like 4 80
IPR002085 ADH_SF_Zn 5 100
IPR016040 NAD(P)-bd_dom 5 100
IPR002364 Quin_OxRdtase/zeta-crystal_CS 5 100
Table A.7: InterPro annotations, part 2
Family Term Matching proteins
ADH
PS00059: ADH_ZINC 29
FAS
PS00012: PHOSPHOPANTETHEINE 4
PS00606: B_KETOACYL_SYNTHASE 4
PS50075: ACP_DOMAIN 4
PDH
PS00059: ADH_ZINC 2
VAT1
PS01162: QOR_ZETA_CRYSTAL 14
vertQOR
PS01162: QOR_ZETA_CRYSTAL 4
ZADH2
PS01162: QOR_ZETA_CRYSTAL 5
Family Term Number of proteins Percentage
ADH
Subcutaneous adipose tissue 18 55
Liver 30 91
Superior cervical ganglion 19 58
Kidney 19 58
Heart 17 53
FAS
Pancreas 3 60
Brain 5 100
Trachea 5 100
The region between the LAD artery and the apex 3 60
Brown fat from interscapular depression 3 60
Preoptic area 3 60
Embryo 3 60
Colon 5 100
Ovary 3 60
Preputial gland 3 60
Brainstem 3 60
Hypothalamus 5 100
Subcutaneous adipose tissue 5 100
Intestine 3 60
Epithelium 3 60
Adipose 3 60
Gonadal white adipose tissue 3 60
Lateral geniculate nucleus (thalamus) 3 60
Periaqueductal gray 3 60
Epidermis 5 100
Dorsal skin without fur 3 60
Embryonic stem cell 3 60
Extraocular muscle 3 60
Liver 5 100
Brown fat 3 60
Adipose tissue 5 100
Mesenteric lymph node 3 60
Lung 3 60
Epidydimal white adipose tissue 3 60
Dorsal root ganglion 5 100
Snout epidermis 3 60
Mammary gland 5 100
Nucleus accumbens 3 60
Corpus 3 60
Brown adipose tissue 3 60
Bed nucleus of the stria terminalis 3 60
Skeletal muscle (M. vastus lateralis) 3 60
Collecting duct 3 60
Adrenal gland 3 60
Whole brain 3 60
Forebrain 3 60
White adipose tissue 3 60
Family Term Number of proteins Percentage
PDH
Kidney 2 100
Dorsal root ganglion 2 100
Muscle 2 100
Liver 2 100
Small intestine 2 100
Testis 2 100
MECR
Colon 4 100
Trachea 4 100
Uterus 4 100
Dorsal root ganglion 4 100
VAT1
Pituitary gland 8 57
Lung 8 57
Hypothalamus 13 93
Subcutaneous adipose tissue 10 71
Dorsal root ganglion 8 57
Amygdala 10 71
Heart 8 57
PTGR
Adipose 5 63
White adipose tissue 5 63
Small intestine 5 63
Nodose ganglion visceral sensory neurons 5 63
Stomach 5 63
Epithelium 5 63
Kidney 7 88
Brainstem 5 63
Oviduct 5 63
Cingulate cortex homogenate 5 63
Anterior tibialis 5 63
Whole organism 5 63
Intestine 5 63
Preputial gland 5 63
Family Term Number of proteins Percentage
vertQOR
Kidney 3 75
Uterus 3 75
Pancreas 3 75
Liver 3 75
ZADH2
Urethra 3 60
Paravertebral muscle 3 60
Skin 3 60
Kidney cortex 3 60
Primary clear-cell renal cell carcinoma 3 60
Lung 3 60
Brain 3 60
Endometrium 3 60
Seminiferous tubule 3 60
Ventricular myocardium 3 60
Hematopoietic and lymphatic system 3 60
Cord blood 3 60
Frontal cortex, superior motor cortex 3 60
Umbilical cord 5 100
Occipital lobe 3 60
Spleen 3 60
Cerebrospinal fluid 3 60
Stomach pyloric 3 60
Cerebellum 5 100
Prostate 3 60
Colon 5 100
Penis 3 60
Blood 3 60
Jejunum 3 60
Endometrium/ovary 3 60
Primary visual cortex 3 60
Omental adipose 3 60
Mammary gland 3 60
Stomach fundus 3 60
Cerebral cortex 5 100
Deltoid muscle 3 60
Kidney medulla 3 60
Thyroid 3 60
Cancer, LCM 3 60
Family Term Number of proteins Percentage
ADH
Drug metabolism - cytochrome P450 16 49
Fatty acid metabolism 16 49
Retinol metabolism 16 49
Glycolysis / Gluconeogenesis 16 49
Metabolism of xenobiotics by cytochrome P450 16 49
Metabolic pathways 16 49
Tyrosine metabolism 16 49
FAS
Insulin signaling pathway 2 40
Metabolic pathways 2 40
Fatty acid biosynthesis 2 40
MECR
Metabolic pathways 3 75
Fatty acid elongation in mitochondria 3 75
PDH
Metabolic pathways 2 100
Fructose and mannose metabolism 2 100
Table A.12: KEGG annotations. Note that the number of proteins in KEGG for the
B List of Supplementary data
At the web address http://www.ifm.liu.se/bioinfo/supplements/elfving-2011-thesis, the following supplementary data can be accessed for free:
• Full source code of the Protein Family Annotation Software (PFAS) and various scripts written during the project
• Gene Ontology trees for all the families
• Cladograms for all studied families
• Raw data for the MDR and SDR families
C Manual for using the Protein Family Annotation Software (PFAS)
The full source code is included in appendix B free of charge and without restrictions. The following text describes how to use the software.
The tool is written and tested under Python 2.7. The software itself requires no extra packages, but some of the plugins do. Please refer to table C.1 for details.
Package Required in Version
PyMySQL mdrDB 0.4
suds PicrSynonyms 0.4
BeautifulSoup ArrayExpress 3.2.0
Table C.1: Packages needed for different plugins
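Before enabling a plugin it can be convenient to check that its package is importable. The sketch below is not part of PFAS; the mapping mirrors Table C.1, and the import names are assumptions based on how these packages are normally imported (pymysql for PyMySQL, suds for suds, and BeautifulSoup for BeautifulSoup 3.x).

```python
import importlib

# Plugin name -> assumed import name of the package it needs (Table C.1).
PLUGIN_REQUIREMENTS = {
    "mdrDB": "pymysql",
    "PicrSynonyms": "suds",
    "ArrayExpress": "BeautifulSoup",
}


def missing_packages(requirements):
    """Return {plugin: module} for every module that cannot be imported."""
    missing = {}
    for plugin, module in requirements.items():
        try:
            importlib.import_module(module)
        except ImportError:
            missing[plugin] = module
    return missing


for plugin, module in sorted(missing_packages(PLUGIN_REQUIREMENTS).items()):
    print("Plugin %s needs the package providing '%s'" % (plugin, module))
```

The check only tests whether a package can be imported; it does not verify the version numbers listed in the table.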
Some of the plugins require separate data files available at each data provider’s site. The paths to these files must then be entered in each plugin’s configuration file.
Installation is quite straightforward: untar the file in a folder of your choice and add that path, together with the paths to the lib and plugins folders, to your PYTHONPATH variable. The system should then be up and running.
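The three PYTHONPATH entries described above can be assembled as in the following sketch. The installation directory /opt/pfas is a placeholder for wherever the archive was unpacked; the lib and plugins folder names are taken from the text, and the helper functions are hypothetical.

```python
import os


def pythonpath_entries(install_dir):
    """Return the install folder plus its lib and plugins subfolders."""
    return [
        install_dir,
        os.path.join(install_dir, "lib"),
        os.path.join(install_dir, "plugins"),
    ]


def pythonpath_value(install_dir, existing=""):
    """Build a PYTHONPATH string, keeping any existing entries at the end."""
    entries = pythonpath_entries(install_dir)
    if existing:
        entries.append(existing)
    return os.pathsep.join(entries)


# /opt/pfas is a placeholder path, not a location required by the software.
print(pythonpath_value("/opt/pfas", os.environ.get("PYTHONPATH", "")))
```

The printed string can be assigned to PYTHONPATH in the shell's start-up file; os.pathsep is used so that the separator matches the operating system.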
Copyright
The publishers will keep this document online on the Internet, or its possible replacement, for a period of 25 years from the date of publication barring exceptional circumstances.
The online availability of the document implies a permanent permission for anyone to read, to download, to print out single copies for his/her own use and to use it unchanged for any non-commercial research and educational purpose. Subsequent transfers of copyright cannot revoke this permission. All other uses of the document are conditional on the consent of the copyright owner. The publisher has taken technical and administrative measures to assure authenticity, security and accessibility.
According to intellectual property law the author has the right to be mentioned when his/her work is accessed as described above and to be protected against infringement.
For additional information about the Linköping University Electronic Press and its procedures for publication and for assurance of document integrity, please refer to its www home page: http://www.ep.liu.se/