Stockholm University
This is an accepted version of a paper published in Journal of Molecular Biology. This paper has been peer-reviewed but does not include the final publisher proof-corrections or journal pagination.
Citation for the published paper:
Ekman, D., Elofsson, A. (2010)
"Identifying and Quantifying Orphan Protein Sequences in Fungi"
Journal of Molecular Biology, 396(2): 396-405 URL: http://dx.doi.org/10.1016/j.jmb.2009.11.053
Access to the published version may require subscription.
Permanent link to this version:
http://urn.kb.se/resolve?urn=urn:nbn:se:su:diva-49277
http://su.diva-portal.org
Identifying and Quantifying Orphan Protein Sequences in Fungi
Diana Ekman and Arne Elofsson⁎
Stockholm Bioinformatics Center/Center for Biomembrane Research, Department of Biochemistry and Biophysics, Stockholm University, SE-10691 Stockholm, Sweden
Received 16 March 2009; received in revised form 17 November 2009; accepted 20 November 2009
For large regions of many proteins, and even entire proteins, no homology to known domains or proteins can be detected. These sequences are often referred to as orphans. Surprisingly, it has been reported that the large number of orphans is sustained in spite of a rapid increase of available genomic sequences. However, it is believed that de novo creation of coding sequences is rare in comparison to mechanisms such as domain shuffling and gene duplication; hence, most sequences should have homologs in other genomes.
To investigate this, the sequences of 19 complete fungal genomes were compared. By using the phylogenetic relationship between these genomes, we could identify potentially de novo created orphans in Saccharomyces cerevisiae. We found that only a small fraction, <2%, of the S. cerevisiae proteome is orphan, which confirms that de novo creation of coding sequences is indeed rare. Furthermore, we found it necessary to compare the most closely related species to distinguish between de novo created sequences and rapidly evolving sequences where homologs are present but cannot be detected.
Next, the orphan proteins (OPs) and orphan domains (ODs) were characterized. First, it was observed that both OPs and ODs are short. In addition, at least some of the OPs have been shown to be functional in experimental assays, showing that they are not pseudogenes. Furthermore, in contrast to what has been reported before and what is seen for older orphans, S. cerevisiae specific ODs and proteins are not more disordered than other proteins. This might indicate that many of the older, and earlier classified, orphans are indeed fast-evolving sequences. Finally, >90% of the detected ODs are located at the protein termini, which suggests that these orphans could have been created by mutations that have affected the start or stop codons.
© 2009 Published by Elsevier Ltd.
Edited by M. Sternberg
Keywords: evolution; protein domain; orphan protein; fungi
Introduction
Proteins that have no detectable homologs have been termed orphan proteins (OPs).1 In analogy, parts of proteins that lack homologs may be referred to as orphan domains (ODs).2 Novel genes and domains are believed to be created through gene duplication, shuffling of gene segments, modifications of existing sequences, combinations of short functional peptides,3 and de novo creation from noncoding material. Proteins/domains created de novo could be defined as the true OPs/ODs. However, it is believed that de novo creation is a comparatively rare mechanism,4 implying that most gene segments should be homologous to other genes. Nevertheless, less than half of the residues in a typical proteome can be matched to a domain in the Pfam-A database.2 Even when the less well-characterized domains in Pfam-B are included, about one quarter of the residues within the proteome remain uncharacterized. Hence, for large parts of the genome, the origin cannot be identified. Thus, either de novo creation is more frequent than generally believed, or these regions are evolving so fast that current methods cannot detect the homology. It can indeed be difficult to identify homologs of apparent OPs, since they are often both short and rapidly evolving.5–7
*Corresponding author. E-mail address: arne@bioinfo.se.
Abbreviations used: OP, orphan protein; OD, orphan domain; SCE, Saccharomyces cerevisiae level; SAC, Saccharomyces level.
doi:10.1016/j.jmb.2009.11.053 J. Mol. Biol. (2009) xx, xxx–xxx
Further, orphanicity can be regarded as a relative concept (like lineage specificity5) in evolution, since it depends on the relation between the compared species. Hence, homologs of an orphan sequence may be found once more closely related species are sequenced. However, Fischer and colleagues noted early that the fraction of orphans decreased only slowly when new genome sequences became available.8 Further, although the large increase in sequences from environmental sequencing projects (metagenomics) revealed homologs to several orphans, it also resulted in the detection of many novel orphans.9 Finally, gene deletion and pseudogenization might explain why no homologs can be detected. However, the effect of these uncertainties should decrease when additional genomes of closely related organisms have been sequenced.
Recent studies have provided evidence of de novo creation both of whole proteins10 and parts of proteins11 in the Saccharomyces cerevisiae genome. Here, we characterize the OPs and ODs in S. cerevisiae and quantify the contribution of the aforementioned mechanisms of novel gene formation in the S. cerevisiae genome.
In earlier studies, we have defined orphans as sequence regions that remain after domain assignment by Pfam or other domain databases.2,12 In S. cerevisiae, 1372 unassigned proteins and almost 5000 unassigned domains (≥50 aa) remained after assignment of Pfam-A and Pfam-B domains (Supplementary Material). However, in this study, we found that only 179 (13%) of these unassigned proteins and 23 (<1%) of the unassigned domains lacked detectable homologs when compared to the most closely related genomes, and another 123 (9%) proteins and almost 500 (9%) domains have homologs only within the Saccharomyces family. Further, since about one third of the unassigned sequences have homologs in non-fungi species, we conclude that only a fraction of these unassigned sequences can be considered true orphans.
Although domain assignment is a powerful method for homology detection, this approach is not ideal for detection of orphans. First, many domain families are represented by the conserved core, leaving the borders unassigned. Further, the detection of orphan sequences will be dependent on the minimum length chosen for a region to be considered. In addition, different organisms can be more or less well represented in the domain database. Therefore, we have in this study used a strategy based on pairwise alignments between completely sequenced genomes. Comparison of sequences from closely related species should, to some degree, compensate for the fact that Blast is less sensitive than the HMMs used for Pfam assignments.
With this method, the amount of orphans was estimated at a number of reference points in fungal evolution. We predicted that about 1% of the residues in the proteome are specific to S. cerevisiae, <5% are specific to the Saccharomyces family, and about 70% are ancient (i.e., have a pre-fungi origin). A characteristic of the detected orphans is that they are short; in addition, the domains are often disordered sequences located at the termini. However, we found significant differences between the species-specific orphans and the domains/proteins specific to the Saccharomyces family. First, species-specific proteins are unusually short, and second, species-specific domains are, in contrast to family-specific domains, not particularly disordered. These differences indicate that many of the orphans detected at the S. cerevisiae level are true de novo created orphans, whereas most of the older ones are instead sequences evolving too rapidly for the homology relationship to be found.
Results and Discussion
We have used a strategy based on Blast13 pairwise sequence alignments to identify and quantify OP sequences in S. cerevisiae (i.e., protein sequences that lack homologs in other species). The S. cerevisiae genome was selected because it has gene annotations of high quality, and in addition, many closely related genome sequences are available. In this study, we searched for homologs of the S. cerevisiae proteins in 18 fungi species at different evolutionary distances from S. cerevisiae. To simplify the estimation of protein age, these species were divided into five phylogenetic levels, from the least specific level (fungi) to the species-specific level [i.e., the S. cerevisiae specific level (SCE)] (Fig. 1). We defined the Saccharomyces level (SAC) to comprise all species that evolved after a whole genome duplication in a common ancestor about 100 million years ago,14 which distinguishes them from the rest of the Hemiascomycota. Next, we estimated the amount of novel, or orphan, genetic material (i.e., sequences that lack homologs in more distantly related species) at these different points in evolution (Fig. 2). Finally, the orphan regions were divided into OPs (i.e., entire proteins that lack homologs) and ODs (i.e., orphan regions that are adjacent to non-orphan regions). The distinction between ODs and OPs was introduced in part due to the expected differences in the mechanisms involved in their creation and to the problems associated with their detection. For example, ODs are expected to be found in proteins known to be functional, whereas OPs may be results of gene annotation errors.
Fig. 1. A simplified species tree for the fungi included in this study. Each internal node represents a phylogenetic level. The three-letter abbreviations used for the levels are in boldface. A star marks the whole genome duplication event that distinguishes the species at the Saccharomyces level from the rest of the Hemiascomycota. The ancient level includes all non-fungi eukaryotes and prokaryotes. Fungi: AGO: Ashbya gossypii, AFU: Aspergillus fumigatus, CAL: Candida albicans, CGL: Candida glabrata, CNE: Cryptococcus neoformans, DHA: Debaryomyces hansenii, ECU: Encephalitozoon cuniculi, GZE: Gibberella zeae, KLA: Kluyveromyces lactis, MGR: Magnaporthe grisea, NCR: Neurospora crassa, SBA: Saccharomyces bayanus, SCA: Saccharomyces castellii, SCE: Saccharomyces cerevisiae, SKL: Saccharomyces kluyveri, SMI: Saccharomyces mikatae, SPO: Schizosaccharomyces pombe, UMA: Ustilago maydis, YLI: Yarrowia lipolytica. Branch lengths have no relevance; however, the species have been ordered within the groups so that species that are most closely related to S. cerevisiae are listed first.
Detection of orphans
Innovation at the residue level
We used Blast to align each S. cerevisiae protein with sequences from several complete fungi genomes as well as non-fungi eukaryotes and prokaryotes. Each residue was thereby assigned to one of the six phylogenetic levels in Fig. 1, depending on how distant the homologs it aligned with were (Fig. 2). To reduce the problem associated with alignments of low-complexity sequences, we ran Blast twice, first with the low-complexity filter to identify hits and then without the filter to realign low-complexity regions.
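The residue-level assignment described above can be sketched as follows. This is a minimal illustration and not the authors' implementation: the function name, the hit representation (start, end, level of the aligned species) and the use of the five level abbreviations from Table 1 are assumptions made for the example, and the Blast runs themselves are taken as given.

```python
from typing import Iterable, List, Tuple

# Levels ordered from youngest to oldest (abbreviations as in Table 1; assumed ordering).
LEVELS = ["SCE", "SAC", "HEM", "FUN", "ANC"]
RANK = {name: i for i, name in enumerate(LEVELS)}

# Hypothetical hit record: aligned query range (1-based, inclusive) and the
# phylogenetic level of the species in which the hit was found.
Hit = Tuple[int, int, str]


def assign_residue_levels(protein_length: int, hits: Iterable[Hit]) -> List[str]:
    """Give each residue the level of the most distantly related sequence it aligns to.

    Residues that are not covered by any hit keep the species-specific level "SCE".
    """
    levels = ["SCE"] * protein_length
    for start, end, level in hits:
        for pos in range(max(start, 1), min(end, protein_length) + 1):
            if RANK[level] > RANK[levels[pos - 1]]:
                levels[pos - 1] = level
    return levels
```

For a 10-residue toy protein with one Saccharomyces-level hit over residues 1 to 5 and one ancient hit over residues 3 to 8, residues 1 and 2 would be assigned SAC, residues 3 to 8 ANC, and residues 9 and 10 would remain species specific.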
According to our alignments, about 70% of the residues in the proteome of S. cerevisiae originate from a pre-fungi ancestor (Table 1), while approximately 1% of the residues are defined as species-specific orphans because they did not align to any protein. Another 3% to 5% aligned only to proteins in the most closely related Saccharomyces species and were hence assigned to the Saccharomyces level, while the remaining residues aligned to proteins in more distantly related fungi. This result was robust at different length cutoffs (between 25 and 100 aa) for inclusion of regions (Fig. 3) (i.e., although the cutoff has a major effect on the number of detected orphans, it has only a moderate impact on the fraction of residues belonging to orphan regions). It can be noted that the orphan residues were quite evenly divided between ODs and OPs. Finally, these numbers were similar when we instead used a domain-based approach (Supplementary Material).
The results in other fungi were similar to those in S. cerevisiae (Fig. 4). As expected, longer evolutionary distances between a species and its closest relative result in a larger proportion of orphans. Worth noting is the comparatively large fraction of species/lineage-specific residues in the genome of the filamentous fungus Neurospora crassa. N. crassa also contains almost twice as many genes as S. cerevisiae, suggesting that many orphans appear when genomes increase in size.
Defining orphan domains and orphan proteins
To identify ODs and OPs, we extracted continuous regions that lack homologs at each level (Fig. 2).
Fig. 2. Orphan assignment using a pairwise sequence alignment method. (a) Each protein is aligned to proteins from other species groups. (b) Each residue is assigned an age according to the most distantly related sequence it is aligned to. Continuous non-ancient regions (≥cutoff) are defined as orphan regions at the SCE (S. cerevisiae), SAC (Saccharomyces), FUN (fungi), and ANC (ancient) levels. If the orphan region covers the entire protein, it is defined as an orphan protein; if it covers only parts of the protein, it is defined as an orphan domain.
Table 1. Orphans

Level                  Residues (%)   OP     OPcorr   OPnr   OD     ODcorr   ODnr
S. cerevisiae (SCE)    1.3            188    158      149    22     22       20
Saccharomyces (SAC)    4.9            158    125      116    255    239      209
Hemiascomycota (HEM)   13             379    -        293    1040   -        841
Fungi (FUN)            13             203    -        179    889    -        767
Ancient (ANC)          68             2466   -        1882   2150   -        1715

Residues assigned to each orphan level with the pairwise alignment method. Corrections were made at the SCE and SAC levels after detection of Pfam domains, hits to non-annotated open reading frames in other species as well as with HHpred for OPs. OP, number of orphan proteins; OPnr, nonredundant OPs; OPcorr, corrected number of OPs; OD, orphan domains (length ≥50 aa); ODnr, nonredundant OD sets; ODcorr, corrected number of ODs.
This resulted in 188 proteins defined as OPs and 22 domains (longer than 50 aa) defined as ODs specific to S. cerevisiae (SCE level), or 346 OPs and 277 ODs at the Saccharomyces level (Table 1). A minimum length was required for ODs, since we only wanted to include sequences that were long enough to be likely to contain a domain.
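The extraction of continuous orphan regions and their division into OPs and ODs can be illustrated with a short sketch. It assumes that each residue has already been labelled with a level abbreviation (a hypothetical list of strings), and the function name, the region tuple layout and the default cutoff of 50 aa applied only to ODs are choices made for this example rather than details confirmed by the text.

```python
from itertools import groupby
from typing import List, Set, Tuple

Region = Tuple[str, int, int]  # (kind, start, end), 1-based inclusive coordinates


def orphan_regions(levels: List[str], orphan_levels: Set[str],
                   min_od_length: int = 50) -> List[Region]:
    """Extract continuous runs of residues whose level is in `orphan_levels`.

    A run covering the whole protein is reported as an orphan protein ("OP");
    a partial run is reported as an orphan domain ("OD") only if it reaches
    the minimum length.
    """
    regions: List[Region] = []
    start = 1
    for is_orphan, run in groupby(levels, key=lambda lv: lv in orphan_levels):
        length = sum(1 for _ in run)
        end = start + length - 1
        if is_orphan:
            if length == len(levels):
                regions.append(("OP", start, end))
            elif length >= min_od_length:
                regions.append(("OD", start, end))
        start = end + 1
    return regions
```

Passing {"SCE"} as the orphan levels yields species-specific regions, whereas {"SCE", "SAC"} yields the cumulative Saccharomyces-level set, which is one way to read the cumulative counts quoted above.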
However, there are a number of caveats that can make it difficult to detect homologs and therefore result in false positives in the orphan assignments. The problem of aligning sequences of low complexity was mentioned above. Furthermore, the results are dependent on the sensitivity of homology detection tools, the quality of the gene annotations, pseudogenization, and the evolutionary rates of individual genes. In the following sections, we have therefore used more sensitive alignment methods and genomic sequences to limit the number of false positives.
More sensitive methods reduce the number of orphans
To increase the sensitivity of the homology detection and detect more distantly related matches, we used HMMER15 to find homologs in the Pfam domain database. In addition, the HHpred16 web server was used to find even more distantly related homologs in a large number of protein and domain databases.
We found that Pfam domains overlapped 6 SCE OPs, 27 SAC OPs, and 16 SAC ODs. With HHpred, we detected 2 additional SCE OP homologs and 7 new SAC OP homologs (HHpred was not applied to ODs). These ODs and OPs were removed from the orphans; see the “corrected values” in Table 1. Overlapping domains were more common at the fungi level, indicating that the sensitivity of Blast can be insufficient for detection of distantly related homologs, while it can be used to align the closely related sequences in the Saccharomyces family.
Homologs detected in genomic sequences
Further, incomplete gene annotations and pseudogenization in closely related species might result in overestimations of the number of orphans. Therefore, we searched for homologs of the species-specific OPs in the genome sequences of Saccharomyces paradoxus, Saccharomyces mikatae, and Saccharomyces bayanus. Hits were detected for 98 (54%) of the SCE OPs, and we found that 29 of these matched a non-annotated open reading frame of a similar length. This suggests that these sequences are also present in closely related species. Hence, the number of species-specific OPs is probably smaller than first estimated (corrected values in Table 1).
Fig. 3. Quantification of orphans. The proportion of the proteome that is orphan at different phylogenetic levels for different length cutoffs (25, 50, and 100 aa) using a pairwise sequence alignment method. Residues in OPs are shown separately.
Fig. 4. Protein innovation. Residue age distribution in seven species at increasing evolutionary distance from S. cerevisiae (SCE). CGL: Candida glabrata; KLA: Kluyveromyces lactis; DHA: Debaryomyces hansenii; NCR: Neurospora crassa; UMA: Ustilago maydis. Legend: Residue age from the youngest (species specific) to the oldest (ancient) level. SAC/EQ: Saccharomyces level or equivalent families; HEM/PEZ: Hemiascomycota/Pezizomycotina; ASC/BAS: Ascomycota/Basidiomycota. The interpretation differs depending on the species; however, all species have the fungi level in common. Blast was run with low-complexity filtering and residues in low-complexity regions were ignored in the calculations.
The orphans detected with our method are listed in Supplementary Material. However, it should be noted that homologs are likely to be found for many of these orphans; hence, this is not intended to be the final list of orphans present in S. cerevisiae.
Most orphans are singletons
Most of the detected orphans are singletons (i.e., most OPs have no paralogs within the genome and the domains occur in only one protein in S. cerevisiae). For example, 17 (78%) and 147 (93%) of the SCE ODs and OPs, respectively, lack homologs within the genome. Corresponding numbers at the SAC level are 206 (86%) and 94 (75%). Orphan domains that have more than one copy are almost exclusively found in homologous proteins, and have probably multiplied through gene duplication rather than through domain duplication and recombination. Finally, by removing homologs in the orphan sets, the number of unique orphans was further reduced (nonredundant values in Table 1).
Orphan domains
Most ODs are short sequences, in particular the most recent ones, and more than 90% are located at the protein termini (Fig. 5). In addition, we predicted the amount of low complexity and disorder in these sequences using Seg17 and IUPred,18 which showed that ODs have larger fractions of residues in low-complexity segments and more predicted disorder than ancient regions (χ² test, P < 10⁻⁵). However, this might reflect the terminal location of orphans, since the amount of disorder is elevated at the N terminus in general and not only in sequences identified as orphan. On the other hand, we did not see any such bias for low-complexity regions.
Furthermore, the few S. cerevisiae specific ODs have a comparatively small amount of low complexity and disorder (Fig. 5). It is well known that both low-complexity regions and disordered sequences can have biased residue compositions, which complicates sequence alignments.19 Hence, this difference between the SCE ODs and other ODs could reflect that many of the older ODs have low complexity or contain disordered sequences that cannot be aligned to their homologs, whereas many SCE ODs are de novo created orphans.
Distinguishing between true orphan domains and unaligned sequences
Due to the limitations of sequence alignments, it is difficult to distinguish between novel ODs and rapidly evolving sequences that are homologous but where homology cannot be detected. However, if we assume that de novo creation makes the protein longer, whereas rapid evolution does not, we may compare the lengths of the S. cerevisiae protein and its orthologs, thereby identifying true ODs. For example, a potential de novo created sequence under this assumption is the 102-aa-long N-terminal OD that was predicted in the alanine transaminase YLR089C (Fig. 6).
Fig. 5. Orphan domain characteristics. Saccharomyces cerevisiae protein sequence regions that lack homologs in other species were termed orphan domains. Orphans were defined at four different levels with increasing phylogenetic distance from S. cerevisiae [fungi (FUN), Hemiascomycota (HEM), Saccharomyces (SAC), and S. cerevisiae (SCE)]. The orphans were characterized and compared to ancient (ANC) domains (i.e., domains with homologs in non-fungi species). (a) Length distributions. (b) Relative positions within the protein chain, from the N terminus to the C terminus. (c) Proportion of residues in low-complexity regions. (d) Proportion of residues that are predicted to be disordered.
We found that 67 (28%) of the proteins with SAC ODs were longer than their orthologs. Of the remaining proteins, 88 (37%) had not increased in length, 31 (13%) belonged to protein families with complex length variations, and for 49 proteins (21%) no orthologs were detected. On the other hand, as many as 16 of the 17 proteins with SCE ODs and detected orthologs were longer than these orthologs. Hence, a much larger fraction of the ODs at the S. cerevisiae level are classified as de novo created orphan sequences than at the Saccharomyces level. Again, this indicates that novel ODs were detected when studying the most closely related species, while many ODs at the Saccharomyces level are probably results of a failure to align homologs correctly. This implies that we need to compare very closely related species to identify ODs reliably. In any case, our results agree with those of Giacomelli et al.11 that short sequences can be added to proteins, leading to gradual extensions, preferably at the termini.
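The length comparison used above to separate likely de novo extensions from unaligned homologs could be expressed roughly as below. The decision rule, the category names and the `min_extension` parameter are hypothetical; the text does not state how large a length difference was required or how families with complex length variation were recognized, so this is only one plausible reading.

```python
from typing import Sequence


def classify_by_length(query_length: int, ortholog_lengths: Sequence[int],
                       min_extension: int = 1) -> str:
    """Compare an S. cerevisiae protein with its orthologs by length.

    Returns one of four illustrative categories:
    "no_orthologs"  - nothing to compare with
    "longer"        - the query exceeds every ortholog by at least min_extension
    "not_increased" - the query exceeds no ortholog by min_extension
    "complex"       - mixed picture across the orthologs
    """
    if not ortholog_lengths:
        return "no_orthologs"
    longer_than = [query_length - n >= min_extension for n in ortholog_lengths]
    if all(longer_than):
        return "longer"
    if not any(longer_than):
        return "not_increased"
    return "complex"
```

With invented lengths, a 600-aa protein whose orthologs are 500, 495 and 510 aa long would fall in the "longer" category, which under the stated assumption marks its orphan domain as a candidate de novo extension.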
How can the existence of orphan domains be explained?
As described above, ODs are typically short sequences located at the protein termini. Thus, innovation might be explained by mutations that result in new translation start and stop codons20,11 and thereby incorporation of previously noncoding sequences in the coding region. This was studied by aligning the genomic sequences of the ODs with the untranslated regions of their orthologous genes in closely related genomes (S. paradoxus, S. mikatae, and S. bayanus). This confirmed that the open reading frames of the detected homologs were shorter in all sequences except two, and shifts of start or stop codons could potentially explain 16 of the S. cerevisiae ODs (Supplementary Material). These observations could be due to errors in the genome sequences, but in a recent study, Giacomelli et al. found such sequencing errors to be rather rare.11 Finally, it can be noted that an indel close to the C terminus can cause a frameshift that creates an OD with comparatively small changes in protein length (which might explain the OD in YKR028W). Our discrimination of true ODs based on the length of orthologous proteins might fail to detect such examples.
Fig. 6. Orphan domain. (a) Species tree with the lengths (aa) of the orthologs of YLR089C in each branch. Abbreviations are shown for the species where orthologs were found. (b) N-terminal part of a multiple sequence alignment of YLR089C and a selection of orthologs. The predicted orphan domain is in boldface.
Further, many ODs are disordered. Since short nucleotide motifs have been observed in disordered sequences,21 a mechanism for nucleotide duplication could have been used to increase the amount of coding sequences in some of these genes, as suggested by Kellis et al.20 Thus, disordered sequences could appear to be orphan because they often evolve more rapidly than other sequences,19,22 or they might be true orphans if disorder expansion causes protein elongation.
Orphan proteins
Similar to observations in other species,6,7 OPs in S. cerevisiae are short (Table 2). However, the length distribution is shifted towards longer proteins at the less specific levels (i.e., SAC OPs are longer than SCE OPs) (Fig. 7). Further, the OPs contain somewhat more low-complexity sequences than ancient proteins, whereas the predicted amount of disorder and secondary structure content, as well as the average codon adaptation index, differ very little from older proteins (not shown). Hence, OPs are in many respects similar to older proteins, with the exception of S. cerevisiae specific OPs, which are exceptionally short, sometimes <50 residues. However, homology detection is less sensitive for shorter sequences; hence, some of these apparent OPs could be fast-evolving proteins where no homology could be found.
Are orphan proteins functional?
A problem with OPs at the species-specific level is that it may be questioned whether they are expressed and functional. However, this problem should be comparatively small in S. cerevisiae, since its genome is well studied and many spurious genes have been removed after a comparative genomic study of closely related species.20 Nonetheless, we examined the evidence for the functionality of these proteins using data from large-scale experimental studies.
As expected, OPs are rarely essential and most of them are uncharacterized (Table 2). However, a recent study demonstrated that deletion of most genes results in a fitness effect under at least some conditions.23 A majority (95, or 76%) of the Saccharomyces OPs showed a fitness effect in this study,23 and so did 28 (18%) of the S. cerevisiae OPs. However, it should be noted that most of the remaining OPs were not included in the deletion study and, hence, might also prove to be functional.
Table 2. Orphan protein characteristics

Proteins           Number   Length (aa)   Uncharacterized   Essential   Fitness   PPI
All                5879     496           0.18              0.18        0.83      0.86
Saccharomyces OP   125      215           0.59              0.01        0.76      0.54
S. cerevisiae OP   158      71            0.88              0.01        0.18      0.08

Length: average length of these proteins. Fitness: fraction of proteins with a fitness effect in a deletion study. PPI: fraction of proteins detected as prey in protein–protein interactions.
Fig. 7. Orphan protein characteristics. Proteins defined as orphan (i.e., lacking homologs) at different phylogenetic distances (levels) from Saccharomyces cerevisiae were characterized. Ancient proteins with homologs in non-fungi species were included for comparison. (a) Length distribution. (b) Disorder and low complexity. Proportion of residues that are predicted to be disordered or in low-complexity regions. Levels: ancient (ANC), fungi (FUN), Hemiascomycota (HEM), Saccharomyces (SAC), and S. cerevisiae (SCE).
Another line of experimental evidence for the functionality of proteins is their identification in protein complexes. Here, 67 (54%) of the Saccharomyces OPs were detected (as prey) in interactions,24,25 while only 13 (8%) of the S. cerevisiae OPs were detected (Table 2). In summary, a majority of the Saccharomyces-level OPs have been experimentally verified to be functional, whereas (in these studies) only some S. cerevisiae OPs have been verified. It has been shown that the species-specific protein YNL269W evolved through a number of point mutations.10 Whether random mutations in noncoding sequences (i.e., de novo creation) are a general mechanism for protein innovation remains to be discovered. Perhaps these short open reading frames can then combine with other genes or increase in length through de novo creation of domains.
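The counts quoted in the text can be cross-checked against the fractions printed in Table 2. The following is a minimal sketch in Python, assuming only the values given above (group sizes 125 and 158; counts 28, 67 and 13); the names `fraction` and `check_table` are illustrative and not part of the original analysis.

```python
# Group sizes taken from the "Number" column of Table 2.
GROUP_SIZE = {"Saccharomyces OP": 125, "S. cerevisiae OP": 158}

# (group, Table 2 column) -> count quoted in the running text.
QUOTED_COUNTS = {
    ("S. cerevisiae OP", "Fitness"): 28,
    ("Saccharomyces OP", "PPI"): 67,
    ("S. cerevisiae OP", "PPI"): 13,
}

# Fractions as printed in Table 2 for the same cells.
TABLE2_FRACTION = {
    ("S. cerevisiae OP", "Fitness"): 0.18,
    ("Saccharomyces OP", "PPI"): 0.54,
    ("S. cerevisiae OP", "PPI"): 0.08,
}


def fraction(count: int, total: int) -> float:
    """Return count/total rounded to two decimals, the precision used in Table 2."""
    return round(count / total, 2)


def check_table() -> dict:
    """Map each (group, column) to True if the quoted count reproduces the table value."""
    return {
        key: fraction(count, GROUP_SIZE[key[0]]) == TABLE2_FRACTION[key]
        for key, count in QUOTED_COUNTS.items()
    }


if __name__ == "__main__":
    for key, ok in check_table().items():
        print(key, "consistent" if ok else "mismatch")
```

Under these inputs, each quoted count divided by its group size rounds to the fraction in the corresponding table cell.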
Conclusions
In contrast to earlier work using domain-based methods, we have here used a pairwise sequence alignment approach to detect ODs and OPs and to estimate the rate of protein innovation in S. cerevisiae. In agreement with the assumption that de novo creation is rare, less than 2% of the residues in S. cerevisiae were found to be species specific and up to 5% of the residues appear to be specific to the Saccharomyces lineage.
All methods used to identify orphan regions are dependent on sequence alignments, and short sequences as well as rapidly evolving sequences and sequences with biased amino acid compositions can be difficult to align correctly. Hence, a major problem when detecting orphans is to distinguish between these unalignable sequences and new genetic material created by some de novo mechanism. By studying the predicted ODs, we conclude that most orphans at the S. cerevisiae level are longer than their orthologs, and thus appear to have been created from noncoding DNA by a de novo mechanism. However, already at the Saccharomyces level, this length difference is not detectable, indicating that many unalignable sequences are included among the potential orphans. Hence, we conclude that to detect de novo created orphan sequences, it is necessary to compare proteins from very closely related species.
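The length argument above can be illustrated with a small sketch: if a protein carrying a predicted OD is longer than its orthologs, the extra residues are more plausibly new material than an unalignable stretch. The code below is only an illustration under stated assumptions. The function name, the use of the median ortholog length as the reference, and the toy lengths are all hypothetical and are not taken from the study.

```python
from statistics import median


def fraction_longer_than_orthologs(pairs):
    """Return the fraction of proteins that are longer than their orthologs.

    pairs: iterable of (protein_length, [ortholog_lengths]).
    Assumption: a protein counts as "longer" when its length exceeds the
    median length of its orthologs. Proteins without orthologs are skipped.
    """
    longer = 0
    total = 0
    for protein_len, ortholog_lens in pairs:
        if not ortholog_lens:
            continue  # nothing to compare against
        total += 1
        if protein_len > median(ortholog_lens):
            longer += 1
    return longer / total if total else 0.0


if __name__ == "__main__":
    # Toy data (hypothetical lengths in residues).
    toy = [(300, [250, 260]), (200, [210]), (150, [100, 120, 130])]
    print(fraction_longer_than_orthologs(toy))
```

A fraction well above one half at a given level would be in line with de novo addition of sequence, whereas a fraction near one half would give no detectable length difference.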
At the S. cerevisiae level, the OPs outnumber the ODs, and only about 20 ODs were found. Most of these ODs exist in one single copy and are mainly short domains located at the protein termini. The terminal location suggests a possible mechanism for their creation, involving changes of start and stop codons. In contrast, the Saccharomyces-specific ODs contain a high amount of low-complexity regions and disorder, which suggests that nucleotide repetitions could have been involved in their creation. However, increased evolutionary rates are associated with disorder, and therefore, we believe that many of these are not true orphans but just rapidly evolving sequences.
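Whether an OD sits at a protein terminus can be decided from its coordinates alone. The sketch below is one possible way to do this and is not the procedure used in the study. The function name and the tolerance of 10 residues are assumptions made for illustration.

```python
def domain_location(start: int, end: int, protein_length: int, tolerance: int = 10) -> str:
    """Classify a domain by its position in the protein.

    Coordinates are 1-based and inclusive. A domain is called terminal when it
    begins (or ends) within `tolerance` residues of the N (or C) terminus.
    The default tolerance is a hypothetical value, not one from the paper.
    """
    if not (1 <= start <= end <= protein_length):
        raise ValueError("expected 1 <= start <= end <= protein_length")
    at_n_terminus = (start - 1) <= tolerance
    at_c_terminus = (protein_length - end) <= tolerance
    if at_n_terminus and at_c_terminus:
        return "whole protein"
    if at_n_terminus:
        return "N-terminal"
    if at_c_terminus:
        return "C-terminal"
    return "internal"


if __name__ == "__main__":
    print(domain_location(1, 60, 400))     # domain starting at the first residue
    print(domain_location(150, 220, 400))  # domain in the middle of the protein
```

An N-terminal OD is compatible with a shifted start codon and a C-terminal OD with a shifted stop codon, which is the mechanism suggested above.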
Orphan proteins appear to be structurally similar to older proteins, but shorter. In particular, S. cerevisiae-specific orphans are extremely short. Further, the functions of most orphans are uncharacterized. However, experimental studies have shown that at least some of these proteins are functional, since their deletion affects the fitness of S. cerevisiae or, alternatively, they are found in protein complexes. In addition, we found that the structures of two SAC-specific OPs have been determined. One of these, YNR034W-A (2grg), is a yeast protein of unknown function, whose structure was solved by a structural genomics project, and the other, YMR174C (chain B of 1dp5), is a proteinase inhibitor. Hence, at least some of the OPs appear to be de novo created functional proteins.
Finally, it is clear from this study that it is difficult to predict the exact number of true ODs and OPs. However, by comparing the S. cerevisiae proteins with proteins from several closely related species, we could identify some domains and proteins that appear to be truly orphan. Although many of the older orphans most likely are rapidly evolving sequences, it seems likely that the process of de novo protein creation is responsible also for some of the orphans detected at other levels.
Methods
Data
Fungal protein sequences were retrieved from the National Center for Biotechnology Information† and from the Saccharomyces Genome Database (SGD)‡. The following genomes were included in the study: Ashbya gossypii, Aspergillus fumigatus, Candida albicans, Candida glabrata, Cryptococcus neoformans, Debaryomyces hansenii, Encephalitozoon cuniculi, Gibberella zeae, Kluyveromyces lactis, Magnaporthe grisea, Neurospora crassa, Saccharomyces bayanus, Saccharomyces castellii, Saccharomyces cerevisiae, Saccharomyces kluyveri, Saccharomyces mikatae, Schizosaccharomyces pombe, Ustilago maydis, and Yarrowia lipolytica (for species tree, see Supplementary Material). Genomic sequences were also downloaded from SGD for S. paradoxus, S. mikatae and S. bayanus. The UniRef50 data set from UniProt was used for non-fungal proteins§.
Detection of orphans
Six different phylogenetic levels were defined in accordance with a published phylogenetic tree.26 These levels ranged from ancient (non-fungi eukaryotes and prokaryotes) to species specific (i.e., S. cerevisiae specific) (Fig. 1). Each species belongs to the level of the last ancestral node shared with S. cerevisiae.
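The rule that a species belongs to the level of the last ancestral node it shares with S. cerevisiae can be sketched as a walk down two lineages from the root. The example below is a minimal illustration. The lineages are simplified assumptions that reuse the level names from Fig. 7, and they do not reproduce the published tree or all six levels of Fig. 1. The entry "non-fungal outgroup" and the function name `level_of` are hypothetical.

```python
# Simplified, hypothetical lineages written from the root to the species.
# Clade names reuse the level names of Fig. 7. This is not the tree of ref. 26.
LINEAGES = {
    "Saccharomyces cerevisiae": ("ancient", "fungi", "Hemiascomycota", "Saccharomyces", "S. cerevisiae"),
    "Saccharomyces bayanus": ("ancient", "fungi", "Hemiascomycota", "Saccharomyces", "S. bayanus"),
    "Candida albicans": ("ancient", "fungi", "Hemiascomycota", "C. albicans"),
    "Ustilago maydis": ("ancient", "fungi", "U. maydis"),
    "non-fungal outgroup": ("ancient", "outgroup"),
}


def level_of(species: str, reference: str = "Saccharomyces cerevisiae") -> str:
    """Return the deepest clade shared by `species` and `reference`."""
    shared = None
    for ref_node, node in zip(LINEAGES[reference], LINEAGES[species]):
        if ref_node != node:
            break  # the lineages diverge here
        shared = ref_node
    if shared is None:
        raise ValueError("lineages share no common root")
    return shared


if __name__ == "__main__":
    for name in LINEAGES:
        print(f"{name}: {level_of(name)}")
```

With this representation, a homolog found only in the reference species itself maps to the species-specific level, and a homolog in any non-fungal sequence maps to the ancient level.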
Each residue was assigned to one of the phylogenetic levels based on pairwise sequence alignments between the protein and its homologs, created with Blast (Fig. 2b). The low-complexity filter was used and high-scoring segment
†ftp://ftp.ncbi.nih.gov/genomes/Fungi/, June 2007.
‡http://www.yeastgenome.org/
§http://www.uniprot.org/, June 2007.