Identifying and Quantifying Orphan Protein Sequences in Fungi


Stockholm University

This is an accepted version of a paper published in Journal of Molecular Biology. This paper has been peer-reviewed but does not include the final publisher proof-corrections or journal pagination.

Citation for the published paper:

Ekman, D., Elofsson, A. (2010)

"Identifying and Quantifying Orphan Protein Sequences in Fungi"

Journal of Molecular Biology, 396(2): 396-405 URL: http://dx.doi.org/10.1016/j.jmb.2009.11.053

Access to the published version may require subscription.

Permanent link to this version:

http://urn.kb.se/resolve?urn=urn:nbn:se:su:diva-49277

http://su.diva-portal.org


Identifying and Quantifying Orphan Protein Sequences in Fungi

Diana Ekman and Arne Elofsson*

Stockholm Bioinformatics Center/Center for Biomembrane Research, Department of Biochemistry and Biophysics, Stockholm University, SE-10691 Stockholm, Sweden

Received 16 March 2009; received in revised form 17 November 2009; accepted 20 November 2009

For large regions of many proteins, and even entire proteins, no homology to known domains or proteins can be detected. These sequences are often referred to as orphans. Surprisingly, it has been reported that the large number of orphans is sustained in spite of a rapid increase of available genomic sequences. However, it is believed that de novo creation of coding sequences is rare in comparison to mechanisms such as domain shuffling and gene duplication; hence, most sequences should have homologs in other genomes.

To investigate this, the sequences of 19 complete fungi genomes were compared. By using the phylogenetic relationship between these genomes, we could identify potentially de novo created orphans in Saccharomyces cerevisiae. We found that only a small fraction, <2%, of the S. cerevisiae proteome is orphan, which confirms that de novo creation of coding sequences is indeed rare. Furthermore, we found it necessary to compare the most closely related species to distinguish between de novo created sequences and rapidly evolving sequences where homologs are present but cannot be detected.

Next, the orphan proteins (OPs) and orphan domains (ODs) were characterized. First, it was observed that both OPs and ODs are short. In addition, at least some of the OPs have been shown to be functional in experimental assays, showing that they are not pseudogenes. Furthermore, in contrast to what has been reported before and what is seen for older orphans, S. cerevisiae specific ODs and proteins are not more disordered than other proteins. This might indicate that many of the older, and earlier classified, orphans indeed are fast-evolving sequences. Finally, >90% of the detected ODs are located at the protein termini, which suggests that these orphans could have been created by mutations that have affected the start or stop codons.

Edited by M. Sternberg

Keywords: evolution; protein domain; orphan protein; fungi

Introduction

Proteins that have no detectable homologs have been termed orphan proteins (OPs).¹ In analogy, parts of proteins that lack homologs may be referred to as orphan domains (ODs).² Novel genes and domains are believed to be created through gene duplication, shuffling of gene segments, modifications of existing sequences, combinations of short functional peptides,³ and de novo creation from noncoding material. Proteins/domains created de novo could be defined as the true OPs/ODs. However, it is believed that de novo creation is a comparatively rare mechanism,⁴ implying that most gene segments should be homologous to other genes. Nevertheless, less than half of the residues in a typical proteome can be matched to a domain in the Pfam-A database.² Even when the less well-characterized domains in Pfam-B are included, about one quarter of the residues within the proteome remain uncharacterized. Hence, for large parts of the genome, the origin cannot be identified. Thus, either de novo creation is more frequent than generally believed, or these regions are evolving so fast that current methods cannot detect the homology. It can indeed be difficult to identify homologs of apparent OPs, since they are often both short and rapidly evolving.⁵⁻⁷

*Corresponding author. E-mail address: arne@bioinfo.se.

Abbreviations used: OP, orphan protein; OD, orphan domain; SCE, Saccharomyces cerevisiae level; SAC, Saccharomyces level.

Further, orphanicity can be regarded as a relative concept (like lineage specificity⁵) in evolution, since it depends on the relation between the compared species. Hence, an orphan sequence may have homologs if more closely related species are sequenced. However, Fischer and colleagues noted early that the fraction of orphans decreased only slowly when new genome sequences became available.⁸ Further, although the large increase in sequences from environmental sequencing projects (metagenomics) revealed homologs to several orphans, it also resulted in the detection of many novel orphans.⁹ Finally, gene deletion and pseudogenization might explain why no homologs can be detected. However, the effect of these uncertainties should decrease when additional genomes of closely related organisms have been sequenced.

Recent studies have provided evidence of de novo creation both of whole proteins¹⁰ and parts of proteins¹¹ in the Saccharomyces cerevisiae genome. Here, we characterize the OPs and ODs in S. cerevisiae and quantify the contribution of the aforementioned mechanisms of novel gene formation in the S. cerevisiae genome.

In earlier studies, we have defined orphans as sequence regions that remain after domain assignment by Pfam or other domain databases.²,¹² In S. cerevisiae, 1372 unassigned proteins and almost 5000 unassigned domains (≥50 aa) remained after assignment of Pfam-A and Pfam-B domains (Supplementary Material). However, in this study, we found that only 179 (13%) of these unassigned proteins and 23 (<1%) of the unassigned domains lacked detectable homologs when compared to the most closely related genomes, and another 123 (9%) proteins and almost 500 (9%) domains have homologs only within the Saccharomyces family. Further, since about one third of the unassigned sequences have homologs in non-fungi species, we conclude that only a fraction of these unassigned sequences can be considered true orphans.

Although domain assignment is a powerful method for homology detection, this approach is not ideal for detection of orphans. First, many domain families are represented by the conserved core, leaving the borders unassigned. Further, the detection of orphan sequences will be dependent on the minimum length chosen for a region to be considered. In addition, different organisms can be more or less well represented in the domain database. Therefore, we have in this study used a strategy based on pairwise alignments between completely sequenced genomes. Comparison of sequences from closely related species should, to some degree, compensate for the fact that Blast is less sensitive than the HMMs used for Pfam assignments.

With this method, the amount of orphans was estimated at a number of reference points in the fungal evolution. We predicted that about 1% of the residues in the proteome are specific to S. cerevisiae, <5% are specific to the Saccharomyces family, and about 70% are ancient (i.e., have a pre-fungi origin). A characteristic of the detected orphans is that they are short; in addition, the domains are often disordered sequences located at the termini. However, we found significant differences between the species-specific orphans and the domains/proteins specific to the Saccharomyces family. First, species-specific proteins are unusually short, and second, species-specific domains are, in contrast to family-specific domains, not particularly disordered. These differences indicate that many of the orphans detected at the S. cerevisiae level are true de novo created orphans, whereas most of the older ones are instead sequences evolving too rapidly for the homology relationship to be found.

Results and Discussion

We have used a strategy based on Blast¹³ pairwise sequence alignments to identify and quantify OP sequences in S. cerevisiae (i.e., protein sequences that lack homologs in other species). The S. cerevisiae genome was selected because it has gene annotations of high quality, and in addition, many closely related genome sequences are available. In this study, we searched for homologs of the S. cerevisiae proteins in 18 fungi species at different evolutionary distances from S. cerevisiae. To simplify the estimation of protein age, these species were divided into five phylogenetic levels, from the least specific level (fungi) to the species specific [i.e., S. cerevisiae specific level (SCE)] (Fig. 1). We defined the Saccharomyces level (SAC) to include all species that evolved after a whole genome duplication in a common ancestor about 100 million years ago,¹⁴ which distinguishes them from the rest of the Hemiascomycota. Next, we estimated the amount of novel, or orphan, genetic material (i.e., sequences that lack homologs in more distantly related species) at these different points in evolution (Fig. 2). Finally, the orphan regions were divided into OPs (i.e., entire proteins that lack homologs) and ODs (i.e., orphan regions), which are adjacent to non-ODs. The distinction between ODs and OPs was introduced in part due to the expected differences in mechanisms involved in their creation and to the problems associated with their detection. For example, ODs are expected to be found in proteins known to be functional, whereas OPs may be results of gene annotation errors.

Fig. 1. A simplified species tree for the fungi included in this study. Each internal node represents a phylogenetic level. The three-letter abbreviations used for the levels are in boldface. A star marks the whole genome duplication event that distinguishes the species at the Saccharomyces level from the rest of the Hemiascomycota. The ancient level includes all non-fungi eukaryotes and prokaryotes. Fungi: AGO: Ashbya gossypii, AFU: Aspergillus fumigatus, CAL: Candida albicans, CGL: Candida glabrata, CNE: Cryptococcus neoformans, DHA: Debaryomyces hansenii, ECU: Encephalitozoon cuniculi, GZE: Gibberella zeae, KLA: Kluyveromyces lactis, MGR: Magnaporthe grisea, NCR: Neurospora crassa, SBA: Saccharomyces bayanus, SCA: Saccharomyces castellii, SCE: Saccharomyces cerevisiae, SKL: Saccharomyces kluyveri, SMI: Saccharomyces mikatae, SPO: Schizosaccharomyces pombe, UMA: Ustilago maydis, YLI: Yarrowia lipolytica. Branch lengths have no relevance; however, the species have been ordered within the groups so that species that are most closely related to S. cerevisiae are listed first.

Detection of orphans

Innovation at the residue level

We used Blast to align each S. cerevisiae protein with sequences from several complete fungi genomes as well as non-fungi eukaryotes and prokaryotes. Each residue was thereby assigned to one of the six phylogenetic levels in Fig. 1, depending on how distant the homologs it aligned with were (Fig. 2). To reduce the problem associated with alignments of low-complexity sequences, we ran Blast twice, first with the low-complexity filter to identify hits and then without the filter to realign low-complexity regions.

According to our alignments, about 70% of the residues in the proteome of S. cerevisiae originate from a pre-fungi ancestor (Table 1), while approximately 1% of the residues are defined as species-specific orphans because they did not align to any protein. Another 3% to 5% aligned only to proteins in the most closely related Saccharomyces species and were hence assigned to the Saccharomyces level, while the remaining proteins are related to proteins in more distantly related fungi. This result was robust at different length cutoffs (between 25 and 100 aa) for inclusion of regions (Fig. 3) (i.e., although the cutoff has a major effect on the number of detected orphans, it has only a moderate impact on the fraction of residues belonging to orphan regions). It can be noted that the orphan residues were quite evenly divided between ODs and OPs. Finally, these numbers were similar when we instead used a domain-based approach (Supplementary Material).

The results in other fungi were similar to those in S. cerevisiae (Fig. 4). As expected, longer evolutionary distances between a species and its closest relative result in a larger proportion of orphans. Worth noting is the comparatively large fraction of species/lineage-specific residues in the genome of the filamentous fungus Neurospora crassa. N. crassa also contains almost twice as many genes as S. cerevisiae, suggesting that many orphans appear when genomes increase in size.

Defining orphan domains and orphan proteins

To identify ODs and OPs, we extracted continuous regions that lack homologs at each level (Fig. 2).

Fig. 2. Orphan assignment using a pairwise sequence alignment method. (a) Each protein is aligned to proteins from other species groups. (b) Each residue is assigned an age according to the most distantly related sequence it is aligned to. Continuous non-ancient regions (≥cutoff) are defined as orphan regions at the SCE (S. cerevisiae), SAC (Saccharomyces), FUN (fungi), and ANC (ancient) levels. If the orphan region covers the entire protein, it is defined as an orphan protein; if it covers only parts of the protein, it is defined as an orphan domain.
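The assignment outlined in Fig. 2 can be illustrated with a minimal Python sketch. It is not the authors' implementation: the hit format (0-based, end-exclusive spans on the query, tagged with the level of the aligned species group), the function names, and the choice to apply the length cutoff only to partial regions (ODs) are assumptions made for illustration.

```python
from typing import List, Tuple

# Phylogenetic levels ordered from youngest to oldest (abbreviations as in Table 1).
LEVELS = ["SCE", "SAC", "HEM", "FUN", "ANC"]

# Hypothetical hit format: (start, end, level), 0-based and end-exclusive on the query.
Hit = Tuple[int, int, str]


def residue_ages(length: int, hits: List[Hit]) -> List[str]:
    """Give each residue the level of the most distantly related sequence it aligns to.

    Residues covered by no hit keep the youngest level (SCE).
    """
    ranks = [0] * length
    for start, end, level in hits:
        rank = LEVELS.index(level)
        for i in range(max(start, 0), min(end, length)):
            ranks[i] = max(ranks[i], rank)
    return [LEVELS[r] for r in ranks]


def orphan_regions(ages: List[str], level: str, cutoff: int = 50) -> List[Tuple[int, int, str]]:
    """Return continuous regions whose residues are no older than `level`.

    A region spanning the whole protein is labeled "OP" (assumption: no length
    requirement); a partial region is labeled "OD" if it has at least `cutoff` residues.
    """
    max_rank = LEVELS.index(level)
    regions = []
    run_start = None
    for i in range(len(ages) + 1):
        # Past the last residue, `young` is False so that an open run gets closed.
        young = i < len(ages) and LEVELS.index(ages[i]) <= max_rank
        if young and run_start is None:
            run_start = i
        elif not young and run_start is not None:
            if run_start == 0 and i == len(ages):
                regions.append((run_start, i, "OP"))
            elif i - run_start >= cutoff:
                regions.append((run_start, i, "OD"))
            run_start = None
    return regions
```

Calling `orphan_regions` with `level="SAC"` returns regions specific to the Saccharomyces level or younger, which corresponds to the cumulative way orphans are counted per level in this sketch.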

Table 1. Orphans

Level                   Residues (%)    OP   OPcorr   OPnr     OD   ODcorr   ODnr
S. cerevisiae (SCE)              1.3   188      158    149     22       22     20
Saccharomyces (SAC)              4.9   158      125    116    255      239    209
Hemiascomycota (HEM)              13   379        -    293   1040        -    841
Fungi (FUN)                       13   203        -    179    889        -    767
Ancient (ANC)                     68  2466        -   1882   2150        -   1715

Residues assigned to each orphan level with the pairwise alignment method. Corrections were made at the SCE and SAC levels after detection of Pfam domains, hits to non-annotated open reading frames in other species as well as with HHpred for OPs. OP, number of orphan proteins; OPnr, nonredundant OPs; OPcorr, corrected number of OPs; OD, orphan domains (length ≥50 aa); ODnr, nonredundant OD sets; ODcorr, corrected number of ODs.


This resulted in 188 proteins defined as OPs and 22 domains (longer than 50 aa) defined as ODs specific to S. cerevisiae (SCE level), or, counting the SCE and SAC levels together, 346 OPs and 277 ODs at the Saccharomyces level (Table 1). A minimum length was required for ODs, since we only wanted to include sequences that were long enough to be likely to contain a domain.

However, there are a number of caveats that can make it difficult to detect homologs and therefore result in false positives in the orphan assignments. The problem of aligning sequences of low complexity was mentioned above. Furthermore, the results are dependent on the sensitivity of homology detection tools, the quality of the gene annotations, pseudogenization, and the evolutionary rates of individual genes. In the following sections, we have therefore used more sensitive alignment methods and genomic sequences to limit the number of false positives.

More sensitive methods reduce the number of orphans

To increase the sensitivity of the homology detection and detect more distantly related matches, we used HMMER¹⁵ to find homologs in the Pfam domain database. In addition, the HHpred¹⁶ web server was used to find even more distantly related homologs in a large number of protein and domain databases.

We found that Pfam domains overlapped 6 SCE OPs, 27 SAC OPs, and 16 SAC ODs. With HHpred, we detected 2 additional SCE OP homologs and 7 new SAC OP homologs (HHpred was not applied to ODs). These ODs and OPs were removed from the orphans; see the “corrected values” in Table 1. Overlapping domains were more common at the fungi level, indicating that the sensitivity of Blast can be insufficient for detection of distantly related homologs, while it can be used to align the closely related sequences in the Saccharomyces family.

Homologs detected in genomic sequences

Further, incomplete gene annotations and pseudogenization in closely related species might result in overestimations of the number of orphans. Therefore, we searched for homologs of the species-specific OPs in the genome sequences of Saccharomyces paradoxus, Saccharomyces mikatae, and Saccharomyces bayanus. Hits were detected for 98 (54%) of the SCE OPs, and we found that 29 of these matched a non-annotated open reading frame of a similar length. This suggests that these sequences are also present in closely related species. Hence, the number of species-specific OPs is probably smaller than first estimated (corrected values in Table 1).

Fig. 3. Quantification of orphans. The proportion of the proteome that is orphan at different phylogenetic levels for different length cutoffs (25, 50, and 100 aa) using a pairwise sequence alignment method. Residues in OPs are shown separately.

Fig. 4. Protein innovation. Residue age distribution in seven species at increasing evolutionary distance from S. cerevisiae (SCE). CGL: Candida glabrata; KLA: Kluyveromyces lactis; DHA: Debaryomyces hansenii; NCR: Neurospora crassa; UMA: Ustilago maydis. Legend: Residue age from the youngest (species specific) to the oldest (ancient) level. SAC/EQ: Saccharomyces level or equivalent families; HEM/PEZ: Hemiascomycota/Pezizomycotina; ASC/BAS: Ascomycota/Basidiomycota. The interpretation differs depending on the species; however, all species have the fungi level in common. Blast was run with low-complexity filtering and residues in low-complexity regions were ignored in the calculations.

The orphans detected with our method are listed in Supplementary Material. However, it should be noted that homologs are likely to be found for many of these orphans; hence, this is not intended to be the final list of orphans present in S. cerevisiae.

Most orphans are singletons

Most of the detected orphans are singletons (i.e., most OPs have no paralogs within the genome and the domains occur in only one protein in S. cerevisiae). For example, 17 (78%) and 147 (93%) of the SCE ODs and OPs, respectively, lack homologs within the genome. Corresponding numbers at the SAC level are 206 (86%) and 94 (75%). Orphan domains that have more than one copy are almost exclusively found in homologous proteins, and have probably multiplied through gene duplication rather than through domain duplication and recombination. Finally, by removing homologs in the orphan sets, the number of unique orphans was further reduced (nonredundant values in Table 1).
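One way to picture the reduction to nonredundant sets is to group orphans that are linked by within-genome homology and count the groups. The sketch below is an illustration only; the paper does not state how redundancy was removed, and the function name, the pair-list input, and the single-linkage grouping are assumptions. Pairs that involve a sequence outside the orphan set are ignored here.

```python
from typing import Dict, Iterable, List, Set, Tuple


def homolog_clusters(orphans: Iterable[str],
                     homolog_pairs: Iterable[Tuple[str, str]]) -> List[Set[str]]:
    """Group orphans connected by homology links (single linkage, union-find)."""
    parent: Dict[str, str] = {name: name for name in orphans}

    def find(x: str) -> str:
        # Follow parent links to the cluster representative, compressing the path.
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    for a, b in homolog_pairs:
        if a in parent and b in parent:  # skip links that leave the orphan set
            parent[find(a)] = find(b)

    clusters: Dict[str, Set[str]] = {}
    for name in parent:
        clusters.setdefault(find(name), set()).add(name)
    return list(clusters.values())
```

In this picture, the number of clusters plays the role of a nonredundant count, and clusters of size one correspond to singletons within the orphan set.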

Orphan domains

Most ODs are short sequences, in particular the most recent ones, and more than 90% are located at the protein termini (Fig. 5). In addition, we predicted the amount of low complexity and disorder in these sequences using Seg¹⁷ and IUPred,¹⁸ which showed that ODs have larger fractions of residues in low-complexity segments and more predicted disorder than ancient regions (χ² test, P < 10⁻⁵). However, this might reflect the terminal location of orphans, since the amount of disorder is elevated at the N terminus in general and not only in sequences identified as orphan. On the other hand, we did not see any such bias for low-complexity regions.

Furthermore, the few S. cerevisiae specific ODs have a comparatively small amount of low complexity and disorder (Fig. 5). It is well known that both low-complexity regions and disordered sequences can have biased residue compositions, which complicates sequence alignments.¹⁹ Hence, this difference between the SCE ODs and other ODs could reflect that many of the older ODs have low complexity or contain disordered sequences that cannot be aligned to their homologs, whereas many SCE ODs are de novo created orphans.

Distinguishing between true orphan domains and unaligned sequences

Due to the limitation of sequence alignments, it is difficult to distinguish between novel ODs and rapidly evolving sequences that are homologous but where homology cannot be detected. However, if we assume that creation makes the protein longer, whereas rapid evolution does not, we may compare the lengths of the S. cerevisiae protein and its orthologs, thereby identifying true ODs. For example, a potential de novo created sequence under this assumption is the 102-aa-long N-terminal OD that was predicted in the alanine transaminase YLR089C (Fig. 6).

Fig. 5. Orphan domain characteristics. Saccharomyces cerevisiae protein sequence regions that lack homologs in other species were termed orphan domains. Orphans were defined at four different levels with increasing phylogenetic distance from S. cerevisiae [fungi (FUN), Hemiascomycota (HEM), Saccharomyces (SAC), and S. cerevisiae (SCE)]. The orphans were characterized and compared to ancient (ANC) domains (i.e., domains with homologs in non-fungi species). (a) Length distributions. (b) Relative positions within the protein chain, from the N terminus to the C terminus. (c) Proportion of residues in low-complexity regions. (d) Proportion of residues that are predicted to be disordered.

We found that 67 (28%) of the proteins with SAC ODs were longer than their orthologs. Of the remaining proteins, 88 (37%) had not increased in length, 31 (13%) belonged to protein families with complex length variations, and for 49 proteins (21%) no orthologs were detected. On the other hand, as many as 16 of the 17 proteins with SCE ODs and detected orthologs were longer than these orthologs. Hence, a much larger fraction of the ODs at the S. cerevisiae level are classified as de novo created orphan sequences than at the Saccharomyces level. Again, this indicates that novel ODs were detected when studying the most closely related species, while many ODs at the Saccharomyces level are probably results of a failure to align homologs correctly. This implies that we need to compare very closely related species to identify ODs reliably. In any case, our results agree with those of Giacomelli et al.¹¹ that short sequences can be added to proteins, leading to gradual extensions, preferably at the termini.
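The four outcomes of the length comparison can be sketched as a small decision function. The criteria actually used to call a protein "longer" or a family "complex" are not given in the text, so the thresholds `min_gain` and `max_spread` below, like the function name, are hypothetical placeholders chosen only to make the example concrete.

```python
from typing import List


def length_class(query_len: int, ortholog_lens: List[int],
                 min_gain: int = 50, max_spread: int = 200) -> str:
    """Classify a protein carrying an OD by comparing its length to its orthologs.

    min_gain and max_spread are illustrative thresholds, not values from the paper.
    """
    if not ortholog_lens:
        return "no orthologs"
    # Widely varying ortholog lengths make a simple comparison unreliable.
    if max(ortholog_lens) - min(ortholog_lens) > max_spread:
        return "complex length variation"
    # Under the stated assumption, de novo creation lengthens the protein.
    if query_len - max(ortholog_lens) >= min_gain:
        return "longer"
    return "not increased"
```

Only proteins in the "longer" class would count as candidates for de novo created ODs under the assumption that creation, unlike rapid evolution, extends the protein.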

How can the existence of orphan domains be explained?

As described above, ODs are typically short sequences located at the protein termini. Thus, innovation might be explained by mutations that result in new translation start and stop codons²⁰,¹¹ and thereby incorporation of previously noncoding sequences in the coding region. This was studied by aligning the genomic sequences of the ODs with the untranslated regions of their orthologous genes in closely related genomes (S. paradoxus, S. mikatae, and S. bayanus). This confirmed that the open reading frames of the detected homologs were shorter in all sequences except two, and shifts of start or stop codons could potentially explain 16 of the S. cerevisiae ODs (Supplementary Material). These observations could be due to errors in the genome sequences, but in a recent study, Giacomelli et al. found such sequencing errors to be rather rare.¹¹ Finally, it can be noted that an indel close to the C terminus can cause a frameshift that creates an OD with comparatively small changes in protein length (which might explain the OD in YKR028W). Our discrimination of true ODs based on the length of orthologous proteins might fail to detect such examples.

Fig. 6. Orphan domain. (a) Species tree with the lengths (aa) of the orthologs of YLR089C in each branch. Abbreviations are shown for the species where orthologs were found. (b) N-terminal part of a multiple sequence alignment of YLR089C and a selection of orthologs. The predicted orphan domain is in boldface.
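The start-codon scenario discussed in this section can be made concrete with a toy example: if a mutation creates an in-frame ATG upstream of the annotated start, with no stop codon in between, the open reading frame gains N-terminal residues from previously noncoding sequence. The sketch below only illustrates that idea on a plain DNA string; it is not the genomic alignment procedure used in the study, and the function name is hypothetical.

```python
STOP_CODONS = {"TAA", "TAG", "TGA"}


def extended_start(dna: str, start: int) -> int:
    """Return the most upstream in-frame ATG reachable from `start` without a stop codon.

    `dna` is the coding strand and `start` the 0-based index of the annotated
    start codon. If no upstream ATG is found, `start` is returned unchanged.
    """
    best = start
    pos = start - 3
    while pos >= 0:
        codon = dna[pos:pos + 3].upper()
        if codon in STOP_CODONS:
            break  # an in-frame stop ends any possible N-terminal extension
        if codon == "ATG":
            best = pos
        pos -= 3
    return best
```

The number of residues gained is `(start - extended_start(dna, start)) // 3`; an analogous scan downstream of a lost stop codon would model C-terminal extensions.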

Further, many ODs are disordered. Since short nucleotide motifs have been observed in disordered sequences,²¹ a mechanism for nucleotide duplication could have been used to increase the amount of coding sequences in some of these genes, as suggested by Kellis et al.²⁰ Thus, disordered sequences could appear to be orphan because they often evolve more rapidly than other sequences,¹⁹,²² or they might be true orphans if disorder expansion causes protein elongation.

Orphan proteins

Similar to observations in other species,⁶,⁷ OPs in S. cerevisiae are short (Table 2). However, the length distribution is shifted towards longer proteins at the less specific levels (i.e., SAC OPs are longer than SCE OPs) (Fig. 7). Further, the OPs contain somewhat more low-complexity sequences than ancient proteins, whereas the predicted amount of disorder and secondary structure content, as well as the average codon adaptation index, differ very little from older proteins (not shown). Hence, OPs are in many respects similar to older proteins, with the exception of S. cerevisiae specific OPs, which are exceptionally short, sometimes <50 residues. However, homology detection is less sensitive for shorter sequences; hence, some of these apparent OPs could be fast-evolving proteins where no homology could be found.

Are orphan proteins functional?

A problem that appears for OPs at the species-specific level is that it may be questioned if they are expressed and functional. However, this problem should be comparatively small in S. cerevisiae, since its genome is well studied and many spurious genes have been removed after a comparative genomic study of closely related species.²⁰ Nonetheless, we examined the evidence for the functionality of these proteins using data from large-scale experimental studies.

As expected, OPs are rarely essential and most of them are uncharacterized (Table 2). However, a recent study demonstrated that deletion of most genes results in a fitness effect under at least some conditions.²³ A majority, 95 (76%), of the Saccharomyces OPs showed a fitness effect in this study,²³ and so did 28 (18%) of the S. cerevisiae OPs. However, it should be noted that most of the remaining OPs were not included in the deletion study and, hence, might also prove to be functional.

Table 2. Orphan protein characteristics

Proteins            Number   Length (aa)   Uncharacterized   Essential   Fitness    PPI
All                   5879           496              0.18        0.18      0.83   0.86
Saccharomyces OP       125           215              0.59        0.01      0.76   0.54
S. cerevisiae OP       158            71              0.88        0.01      0.18   0.08

Length: average length of these proteins. Fitness: fraction of proteins with a fitness effect in a deletion study. PPI: fraction of proteins detected as prey in protein–protein interactions.

Fig. 7. Orphan protein characteristics. Proteins defined as orphan (i.e., lacking homologs) at different phylogenetic distances (levels) from Saccharomyces cerevisiae were characterized. Ancient proteins with homologs in non-fungi species were included for comparison. (a) Length distribution. (b) Disorder and low complexity. Proportion of residues that are predicted to be disordered or in low-complexity regions. Levels: ancient (ANC), fungi (FUN), Hemiascomycota (HEM), Saccharomyces (SAC), and S. cerevisiae (SCE).


Another line of experimental evidence for the functionality of proteins is identification in protein complexes. Here, 67 (54%) of the Saccharomyces OPs were detected (as prey) in interactions,²⁴,²⁵ while only 13 (8%) of the S. cerevisiae OPs were detected (Table 2). In summary, a majority of the Saccharomyces level OPs have been experimentally verified to be functional, whereas (in these studies) only some S. cerevisiae OPs have been verified. It has been shown that the species-specific protein YNL269W evolved through a number of point mutations.¹⁰ Whether random mutations in noncoding sequences (i.e., de novo creation) are a general mechanism for protein innovation remains to be discovered. Perhaps these short open reading frames can then combine with other genes or increase in length through de novo creation of domains.

Conclusions

In contrast to earlier work using domain-based methods, we have here used a pairwise sequence alignment approach to detect ODs and OPs and to estimate the rate of protein innovation in S. cerevisiae. In agreement with the assumption that de novo creation is rare, less than 2% of the residues in S. cerevisiae were found to be species specific and up to 5% of the residues appear to be specific to the Saccharomyces lineage.

All methods used to identify orphan regions are dependent on sequence alignments, and short sequences as well as rapidly evolving sequences and sequences with biased amino acid compositions can be difficult to align correctly. Hence, a major problem when detecting orphans is to distinguish between these unalignable sequences and new genetic material created by some de novo mechanism. By studying the predicted ODs, we conclude that most orphans at the S. cerevisiae level are longer than their orthologs, and thus appear to have been created from noncoding DNA by a de novo mechanism. However, already at the Saccharomyces level, this length difference is not detectable, indicating that many unalignable sequences are included among the potential orphans. Hence, we conclude that to detect de novo created orphan sequences, it is necessary to compare proteins from very closely related species.

At the S. cerevisiae level, the OPs outnumber the ODs, and only about 20 ODs were found. Most of these ODs exist in a single copy and are mainly short domains located at the protein termini. The terminal location suggests a possible mechanism for their creation, involving changes of start and stop codons. In contrast, the Saccharomyces-specific ODs contain a high proportion of low-complexity regions and disorder, which suggests that nucleotide repetitions could have been involved in their creation. However, increased evolutionary rates are associated with disorder, and therefore, we believe that many of these are not true orphans but just rapidly evolving sequences.

Orphan proteins appear to be structurally similar to older proteins, but shorter. In particular, S. cerevisiae-specific orphans are extremely short. Further, the functions of most orphans are uncharacterized. However, experimental studies have shown that at least some of these proteins are functional, since their deletion affects the fitness of S. cerevisiae or, alternatively, they are found in protein complexes. In addition, we found that the structures of two SAC-specific OPs have been determined. One of these, YNR034W-A (2grg), is a yeast protein of unknown function whose structure was solved by a structural genomics project, and the other, YMR174C (chain B of 1dp5), is a proteinase inhibitor. Hence, at least some of the OPs appear to be de novo created functional proteins.

Finally, it is clear from this study that it is difficult to predict the exact number of true ODs and OPs. However, by comparing the S. cerevisiae proteins with proteins from several closely related species, we could identify some domains and proteins that appear to be truly orphan. Although many of the older orphans most likely are rapidly evolving sequences, it seems likely that the process of de novo protein creation is responsible also for some of the orphans detected at other levels.

Methods

Data

Fungal protein sequences were retrieved from the National Center for Biotechnology Information† and from the Saccharomyces Genome Database (SGD)‡. The following genomes were included in the study: Ashbya gossypii, Aspergillus fumigatus, Candida albicans, Candida glabrata, Cryptococcus neoformans, Debaryomyces hansenii, Encephalitozoon cuniculi, Gibberella zeae, Kluyveromyces lactis, Magnaporthe grisea, Neurospora crassa, Saccharomyces bayanus, Saccharomyces castellii, Saccharomyces cerevisiae, Saccharomyces kluyveri, Saccharomyces mikatae, Schizosaccharomyces pombe, Ustilago maydis, and Yarrowia lipolytica (for species tree, see Supplementary Material). Genomic sequences were also downloaded from SGD for S. paradoxus, S. mikatae, and S. bayanus. The UniRef50 data set from UniProt was used for non-fungal proteins§.

Detection of orphans

Six different phylogenetic levels were defined in accordance with a published phylogenetic tree.²⁶ These levels ranged from ancient (non-fungi eukaryotes and prokaryotes) to species specific (i.e., S. cerevisiae specific) (Fig. 1). Each species belongs to the level of the last ancestral node shared with S. cerevisiae.
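The rule that a species belongs to the level of the last ancestral node it shares with S. cerevisiae can be sketched as a longest-common-prefix comparison of lineages. The example below is a minimal illustration and not the procedure used in this study: the lineages are simplified, they use only the five level labels from Fig. 7 rather than all six levels, the non-fungal entry is a hypothetical placeholder, and the function name level_of is an assumption.

```python
# Simplified lineages, root first. These are illustrative assumptions and do
# not reproduce the published tree used in the paper.
LINEAGES = {
    "Saccharomyces cerevisiae": ["ANC", "FUN", "HEM", "SAC", "SCE"],
    "Saccharomyces bayanus": ["ANC", "FUN", "HEM", "SAC", "S. bayanus"],
    "Candida albicans": ["ANC", "FUN", "HEM", "C. albicans"],
    "Ustilago maydis": ["ANC", "FUN", "U. maydis"],
    "non-fungal example": ["ANC", "hypothetical non-fungal clade"],
}

REFERENCE = "Saccharomyces cerevisiae"


def level_of(species, lineages=LINEAGES, reference=REFERENCE):
    """Return the deepest node shared by the species and the reference lineage."""
    shared = None
    for ref_node, other_node in zip(lineages[reference], lineages[species]):
        if ref_node != other_node:
            break  # the lineages diverge here
        shared = ref_node
    if shared is None:
        raise ValueError(f"{species} shares no node with {reference}")
    return shared


if __name__ == "__main__":
    for name in LINEAGES:
        print(f"{name}: {level_of(name)}")
```

In this toy setting a homolog found only in S. bayanus would place a residue at the SAC level, whereas a homolog in U. maydis would place it at the FUN level.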

550

Each residue was assigned to one of the phylogenetic

551

levels based on pairwise sequence alignments between the

552

protein and its homologs, created with Blast (Fig.^2b). The

553

low^-complexity filter was used and high-^scoring segment

†ftp://ftp.ncbi.nih.gov/genomes/Fungi/, June 2007.

‡http://www.yeastgenome.org/

§http://www.uniprot.org/, June 2007.