Analysis of natural and artificial yeastpopulations by next-generationsequencingAnders Bergström

(1)

Analysis of natural and artificial yeast populations by next-generation

sequencing

Anders Bergström

Degree project in bioinformatics, 2012

Examensarbete i bioinformatik 30 hp till masterexamen, 2012

Biology Education Centre, Uppsala University, and Institute for Research on Cancer and Aging in Nice (IRCAN)

Supervisor: Gianni Liti

MSc BIOINF 12 007

(2)

(3)

Abstract

Advances in DNA sequencing technology, particularly the advent of a new group of technologies referred to as

next-generation sequencing, is now making it feasible to sequence the genomes of multiple individuals from the

same population. Simultaneously, there is an increasing interest in exploiting natural variation to elucidate the

relationship between genotype and phenotype and to dissect the genetic basis of traits, particularly so called

complex traits which are influenced by multiple factors. The budding yeast Saccharomyces cerevisiae has

emerged as a powerful system for carrying out these kinds of studies because of its small genome, experimental

tractability and well-characterized biology. Described here is the analysis of whole-genome sequence data

from a number of natural S. cerevisiae strains selected to capture the majority of the genetic variation in

the species as well as data from a large number of highly recombinant segregant individuals derived from

crossing experiments between representative natural strains. Using both de-novo assembly and reference-

based approaches the data from the natural strains is used to elucidate patterns of population history and

structure and to identify genomic material not present in the reference genome of the species. The genomes

of the segregant strains from the crossing experiments are characterized to reveal cases of de-novo variation

not present in the parent strains as well as to track the segregation of genomic regions that are unique to one

of the parents. Genetic linkage information from these recombinant populations is also used to improve the

structure of parental strain de-novo genome assemblies. Overall these analyses provide an illustration of the

use of next-generation sequencing data within the context of yeast population genomics and demonstrate what

kind of insights can be gained from a biological perspective.

(4)

(5)

Analysis of natural and artificial yeast populations by next-generation sequencing

Popular science summary Anders Bergström Degree project in bioinformatics, 2012

Biological organisms within the same population tend to vary in many of their characteristics and traits.

Variation in some traits is governed by genetic variation in single genes, in the sense that individuals with one variant of a gene will all be the same with regards to this trait and differ from all individuals with another variant of the gene, in a qualitative fashion. Most traits however, including most human disease risks and other characteristics as well as economically important traits of organisms used in various industrial settings, tend to be under the influence of variation in more than one gene as well variation in the environment. The outcome in such traits for particular individuals, for example the height or the diabetes risk of humans or the growth rate or final size of crops, is then the result of the sum of contributions, some acting in one direction and some in the other direction, of many different genes and environmental factors. These kind of traits are often referred to as complex traits, and large efforts within the biological sciences are now being devoted to the study of these traits and the search for the particular genetic and environmental factors that shape the variation in them.

A basic element of studies aiming to find the genetic variants that influence the variation in traits is to survey numerous individuals from the population by measuring and quantifying the traits of interest and determining which genetic variants each individuals carry at multiple genetic markers. The fundamental idea is then to test if there are any trends or patterns that connect certain genetic variants to the traits - if all the individuals in the population that happen to carry one of the variants of a gene tend to systematically differ in a given trait from the individuals that happen to carry the other variant of the gene, this suggests that the gene has a role that influences this trait. The first prerequisite of these studies is to characterize the genetic as well as the trait variation in the population, and studies aiming to catalog and describe genetic variation is now underway in many species. The budding yeast Saccharomyces cerevisiae, which is used for brewing and baking but is also an important model organism in biology research, is very well-suited for these kinds of studies because it easy to carry out experiments on and has a small genome with relatively few genes.

Technological advances in DNA-sequencing - the determination of the physical structure of the genetic material - has made it possible to determine the sequence of and study the whole genomes of multiple individuals from a population, providing an excellent way to catalog genetic variation. This project describes the computational analysis of DNA sequence data from a number of individual strains of S. cerevisiae and demonstrates what can be learned about this population from the sequence data. While the genome of an organism such as yeast consists of DNA molecules that are hundreds of thousands to millions of DNA letters long, the data produced by DNA sequencing experiments only comes in the form of randomly sampled short fragments of only 100 or so DNA letters at the time. Substantial computational work is therefore needed to puzzle these tiny fragments together to reconstruct the full sequence and enable biological conclusions to be drawn. In addition to the whole-genome data from these natural S. cerevisiae strains, the same kind of data from a large number of strains which have been constructed by crossing natural strains for multiple generations is analyzed. These

“artificial” populations are very useful in the search for genetic variants underlying variation in traits since each individual will have a genome that is a unique mix of the genomes of the natural parent strains and they therefore enable a systematic search for trends connecting genetic variants to traits.

Degree project in bioinformatics, 2012 (Examensarbete i bioinformatik 30 hp till masterexamen, 2012) Biology Education Centre, Uppsala University, and Institute for Research on Cancer and Aging in Nice

Supervisor: Gianni Liti

(6)

(7)

Abstract 1

Popular science summary 3

Introduction 7

Next-generation sequencing data . . . . 7

Saccharomyces cerevisiae population genomics . . . . 9

Project overview . . . . 9

Methods 12 De-novo assembly . . . . 12

Read mapping . . . . 12

SNP calling and genotyping . . . . 12

Depth of coverage calculations . . . . 13

Assembly scaffolding by genetic linkage . . . . 13

Genome comparisons and chromosomal similarity plots . . . . 14

Identification of unique genomic regions . . . . 14

Gene annotation . . . . 14

Results 16 De-novo assemblies of natural strain genomes . . . . 16

Improved assembly scaffolding by genetic linkage . . . . 16

Gene content and non-reference genomic material . . . . 18

Population history and structure . . . . 20

Genome characterization of F12 segregants . . . . 20

Segregation of unique parental segments in the F12 . . . . 22

Discussion 24

References 26

Appendix 29

(8)

(9)

Introduction

Many fields of biology are currently undergoing dramatic changes as a consequence of rapid advances in DNA sequencing technology. Since the completion of the first human genome sequence draft in 2001 the per-base pair cost of DNA sequencing has decreased by five orders of magnitude (http://www.genome.gov/sequencingcosts/), and experiments that only a few years ago would have been highly expensive and time-consuming endeavors involving major multi-laboratory collaborations or consortia can now be performed by single research groups within reasonable time frames and budgets. These reductions in cost are partly driven by continuous, incre- mental improvements in sequencing chemistry and machinery, but the field also saw a major qualitative shift with the advent of a number of novel technologies that far surpassed the traditional so-called Sanger sequencing method in terms of throughput. These technologies, primarily those of 454 pyrosequencing, Illumina sequenc- ing (formerly Solexa) and SOLiD sequencing, are collectively known as next-generation sequencing and are already having a massive impact on many important areas of the life sciences [1]. While the higher throughput is enabled by novel approaches to sequencing DNA, the data produced by these approaches also comes with new and different inherent properties that give rise to new analytical and informatics challenges, and so an outburst of bioinformatic innovation and progress has followed in its wake.

One of the areas which have seen major advances during the on-going sequencing revolution is the study of variation between individuals within populations. During the early days of the genomics era many genomes of different organisms, primarily model organisms utilized in molecular biology research and later species of industrial and biotechnological importance, were sequenced. The genome sequences produced by these projects have provided enormous insights into genome architecture and evolution and established crucial resources for continued studies on the organisms in question, but they suffered from the fact that the sequenced genome represented only a single individual of a population and contained no information about the variation within that population. While the interest in the genetics of populations stretches back further in scientific history than most other fields of the life sciences, the ability to sequence whole genomes of multiple individuals from a population has renewed and broadened that interest with the establishment of the field of population genomics. Parallel to this development there has also been an increased interest in leveraging population level data in order to answer biological and medical questions, particularly those concerning the relationship between genotype and phenotype and the genetic architecture of complex traits - traits influenced by more than one factor. While so far genome-wide association (GWAS) and other population genetics studies have primarily relied on microarray technology for large-scale genotyping of individuals at specific marker loci, the shift to sequencing-based studies has started in this area too [2].

The budding yeast Saccharomyces cerevisiae, one of the major molecular biology model organisms, has emerged as a powerful system in which to carry out population genomics and genotype-phenotype mapping studies [3].

In this study I present the analysis of Illumina next-generation sequencing data from a large number of both natural and artificially constructed budding yeast strains. After a brief review of the relevant background I describe both the bioinformatics methodologies put to use and the biological results emerging from the analyses, and discuss this within the general frameworks of next-generation sequencing, population genomics and genotype-phenotype mapping.

Next-generation sequencing data

The data produced by next-generation sequencing machines consists of short sequence reads each sampled randomly from a redundant pool of DNA molecule fragments. Early Illumina sequencers produced reads not

7

(10)

longer than 36 base pairs (bp), but the latest machines are capable of read lengths of 100 and even 150 bp (illumina.com). Yet this is still much shorter than the traditional, capillary-based Sanger sequencing machines, which typically produce reads of lengths 700-1000 bp. The read length is a crucial factor in most sequencing experiments as it determines the ability to uniquely identify the genomic origin of a read. Both Sanger and next-generation sequencing technologies are capable of producing so-called paired end-reads - by sequencing both ends of a single DNA molecule one can obtain pairs of reads that are known to come from the same physical molecule fragment. Experimental techniques can be used to prepare a pool of fragments with a relatively narrow distribution of sizes so that the subsequent analyses can rely on a prior expectation for the approximate distance between two reads in a pair. Illumina as swell as the other next-generation sequencing technologies produces data that is generally of lower-quality than Sanger data, with higher sequencing error rates and a higher fraction of reads that can not be reliably used. The particular characteristics of the quality and error profiles also differ [4], for example Illumina data has higher rates of single-nucleotide miscall errors than Sanger data but lower rates of insertion-deletion miscall errors. These and other factors, of course most importantly the much higher throughput achievable with the new technologies, means that novel processing and analysis methods are required to be able to accurately draw biological conclusions from these data sets and to do so using realistic computational capacities.

For genome sequencing experiments, where the goal is the accurate reconstruction of the DNA sequence of an individual genome, there are two main conceptual approaches to the analysis of sequence data. The first is de- novo assembly, which is the attempt to infer the sequence using only the read data derived from the experiment (and possibly other auxiliary or follow-up experiments performed on the same individual). Utilizing the overlap between reads resulting from redundantly sequencing the genome to high coverage, various algorithms have been devised to puzzle together the complete sequence - popular software implementations of these algorithms include Velvet [5], ABySS [6], ALLPATHS [7] and SOAPdenovo [8]. These programs take short sequence reads as input and merge them into contigs, continuous segments of sequence, which can be further ordered and grouped into scaffolds using different sources of contig connectivity information such as the placement on different contigs of paired-end read pairs. The second approach is reference-based assembly which uses a previously sequenced genome of a closely related individual as a backbone to which the reads are aligned, or mapped, and polymorphisms then inferred. Traditional sequence search software such as BLAST are generally far too slow to be used to process the millions or sometimes billions of reads produced by next-generation sequencing experiments. This has lead to the emergence of a new software family of so-called short read mappers which typically work by efficient indexing of the reference genome enabling rapid search for the most likely location of a given read [9]. Popular programs of this family include Maq [10], BWA [11], Bowtie [12]

and BFAST [13].

While de-novo assembly has traditionally been the only option for sequencing projects, in population genomics studies of species for which a high-quality reference genome exists using that reference as a backbone is also a possibility. The two approaches come with different advantages and disadvantages. Obtaining good de-novo assemblies usually requires a fairly high depth of sequencing coverage (the average number of times a given nucleotide position in the genome is included in a read) and the short read lengths of next-generation sequencing technologies usually result in assemblies that are more fragmented, that is consist of contigs and scaffolds that are larger in number and smaller in size, than those derived from the longer Sanger reads. Reference-based analyses can yield good results even with very low sequencing coverage but comes with the risk of introducing a reference-bias into various analyses, and furthermore any genomic material which is present in a sequenced individual but absent from the reference genome will simply be missed out on. The best choice between the two strategies will depend on the particular features of the experiment and the biological questions that one wants to answer, and in many situations the two can be complementary to each other.

8

(11)

Saccharomyces cerevisiae population genomics

The budding yeast Saccharomyces cerevisiae has a long and successful history as a model organism for molecular biology and genetics research and has been in the forefront for many methodological advances in these fields.

In 1996 it became the first eukaryotic organism to have its complete genome sequence determined [14] and its experimental tractability, especially the ease by which it can be genetically modified, has made it the most well-characterized eukaryotic organism. This status as well as the small genome size (13 Mb) has made S.

cerevisiae one of the best-suited systems for population level sequencing, and in 2009 it became one of the first organisms to enter the population genomics era as over seventy isolates of S. cerevisiae and its closest relative species S. paradoxus were sequenced to low-coverage with Sanger technology [15]. The study surveyed strains from a variety of environments including research laboratories, brewing and baking industries as well as the natural environment, and provided insight into patterns of natural variation and the history and population structure of this important organism (figure 1). The structure of the S. cerevisiae population is complex, consisting of a handful of “clean” phylogenetic lineages that do not seem to exchange a lot of genetic material and a large number of “mosaic” strains with genomes made up of a mixture of segments of different origin.

S. paradoxus on the other hand, the close relative which has not been domesticated by humans for brewing, baking or other purposes, displays a simple population structure without any mosaic strains as well as much larger genetic distances between phylogenetic lineages. The same set of strains were also phenotyped for a large number of traits and generally the phenotypic variation within S. cerevisiae was found to be much higher than that within S. paradoxus. While much remains to be learned about the recent evolutionary history of S.

cerevisiae, it at least seems probable that human activity has had an impact on the population. A number of other strains have also been sequenced at various levels of coverage primarily using Sanger technology [16, 17, 18, 19, 20, 21].

Various efforts have been made to exploit the natural variation within S. cerevisiae for purposes of identifying genetic variants that underlie phenotypic variation and deciphering the genetic architecture of traits. Attempts to directly use the genetic and phenotypic data of natural isolates for genome-wide association studies have not proved very fruitful, primarily because of the high degree of population structure, or population stratification [22]. For any casual variant that influences a given trait there will be a large number of neutral variants elsewhere in the genome that have the same distribution among strains, simply because of shared phylogenetic history, so that attempts to associate the casual variant with phenotypic variation will tend to also produce the same statistical signals for these neutral variants. While population stratification is a problem for association studies in all organisms, it might be more severe in yeast than in for example humans that display a greater degree of admixture among geographical regions. Other studies have utilized crosses between genetically and phenotypically diverged strains to map quantitative trait loci (QTLs), loci that have a statistical effect on quantitative traits, either by traditional linkage mapping using collections of recombinant segregants [23] or by novel experimental approaches based on exposing large pools of millions of different segregants to selection pressures and measuring changes in allele frequencies by sequencing total DNA from the pool [24].

Project overview

Of the 72 strains of S. cerevisiae and S. paradoxus that were originally sequenced to low coverage by Sanger sequencing technology [15], 42 strains selected to represent the majority of the genetic variation within the species have been re-sequenced to high-coverage using Illumina technology (read length: 108 bp, paired-end with fragment size distribution mean around 300 bp). The assembly and analysis of the data from these strains forms the core of this project, with the focus on the S. cerevisiae strains and the various biological questions pertinent to the population genomics of this species, including those of population structure and history, gene

9

(12)

content variation and general patterns of sequence polymorphism. Both de-novo assembly and reference-based analyses are utilized for different purposes, though with the emphasis on de-novo assembly and what novel insights this approach can bring. The low-coverage Sanger data from the original sequencing project is also used to augment de-novo assemblies made from the Illumina data. In addition to the analysis of these natural strains, sequence data from two major crossing experiments between some of these S. cerevisiae strains is analyzed and put to use in various ways. The first experiment is a cross between representatives of two of the clean phylogenetic lineages of S. cerevisiae, the strain YPS128 from the North American lineage and the strain DBVPG6044 from the West African lineage (see figure 1 for their phylogenetic positions within the species), and has been previously used in QTL mapping studies [24]. 192 individual segregant strains from the twelfth generation (F12) of random mating from this cross have been subjected to medium coverage Illumina whole- genome sequencing (read length: 100 bp, paired-end). The second cross is similar but involves four parental strains, the above two plus the strain Y12 from the Japanese/Sake lineage and the strain DBVPG6765 from the Wine/European lineage. In this cross the first round of mating is restricted to pairs of two parents (North American - West African, and Japanese/Sake - Wine/European) and the remaining eleven generations are random mating. 192 F12 segregant strains from this cross were sequenced to medium coverage by Illumina sequencing (read length: 100 bp, paired-end). These so-called advanced intercross lines consisting of highly recombined individuals are a powerful tool in QTL mapping as they constitute artificial populations that are more or less free from the problem of population stratification. The cross involving two parental strains will be referred to as the 2-way cross and the cross involving four parental strains will be referred to as the 4-way cross. All strains, both natural and segregant, were sequenced in the haploid state.

The primary purpose for performing whole-genome sequencing on these segregants from advanced intercross lines is to genotype the individuals at segregating sites, which is a requirement for downstream analyses of recombination and QTL mapping. The use of whole-genome sequencing rather than microarray or other technologies that target only certain pre-specified marker loci however brings with it a number of interesting avenues for further exploration, in addition to providing a more dense set of markers. The segregant sequence data here is used for the investigation of certain types of de-novo variation that has arisen in the segregant offspring but is not present in any of the parents, as well as for tracking the segregation of genomic material which is present in one parental strain but completely absent in another, as identified using de-novo assemblies of the parental genomes. Additionally, since sequence data of a large number of highly recombined individuals should provide plenty of information about the frequency of recombination between different polymorphic loci, this can be exploited for inferring the physical proximity of loci on different contigs of a de-novo assembly.

The segregant data is therefore put to use for further scaffolding of de-novo assemblies of the parental strain genomes.

10

(13)

Figure 1: Phylogenetic structure of the population of Saccharomyces cerevisiae and its closest relative Saccharomyces paradoxus. a) S. cerevisiae and S. paradoxus within the context of the greater phylogenetic clade known as Saccharomyces sensu stricto, here with the species S. mikatae, S. bayanus and S. kudriavzevii. As is visible at this phylogenetic scale, the S. paradoxus population consists of four major lineages separated by much larger genetic distances than those found within the total S. cerevisiae population. b) Close-up of one of the S. paradoxus lineages, the European lineage. c) The total S. cerevisiae population, with the major clean lineages shared in dark. Strains falling outside the clean lineages have mosaic genomes consisting of parts of different origin. Colors of dots at the end of tree branches indicate the geographic origin and colors of names indicate the source context of each strain. All trees are neighbor-joining trees based on genome-wide single-nucleotide polymorphism (SNP) data from low-coverage Sanger sequencing. Scale bars indicate genetic distances in units of SNP differences per base pair. Reprinted from [15] .

11

(14)

Methods

De-novo assembly

De-novo assemblies of natural strain genomes were made using the program SGA [25], and the runs used for the final assemblies were performed by a collaborator who is also the author of the software. Various miscellaneous de-novo assembly tasks were also performed with ABySS [6]. The de-novo assembly of the strain DBVPG6044 was found to contain contaminant scaffolds/contigs which by sequence similarity searches to NCBI’s nr database could be identified as being from the prokaryotic Staphylococcus genus. All scaffolds that had a BLASTN hit to a Staphylococcus species in the top five matches were removed from the assembly.

Further scaffolding of the de-novo assemblies was performed by mapping paired-end Sanger reads to the scaffolds and inferring connections from the mapping results. To prevent incorrect scaffolding caused by mismapped reads or chimeric read pairs (deriving from molecule fragments incorrectly merged during library preparation), only pairs of scaffolds that were bridged by a minimum of two read pairs were considered.

Additionally, to prevent incorrect scaffolding caused by different regions of the genome having collapsed into the same contig during assembly because of high sequence similarity (which can happen for e.g. very recently duplicated genes and especially for transposable elements), the Illumina reads were mapped back to the assemblies in order to identify regions of higher coverage which would be indicative of assembly collapse. The Illumina coverage of the first and the last 5 kb of each scaffold was computed and any such edge having a log

2

mean coverage of 0.5 or higher was excluded from scaffolding. The filtered read mapping results were used as input for the SSPACE stand-alone scaffolder [26].

Read mapping

Prior to mapping, Illumina reads from the natural strains were cleaned from adapter contaminants using cutadapt 0.9 (http://code.google.com/p/cutadapt/). Quality control on the F12 segregant strain data did not find any substantial levels of adapter contamination so adapter cleaning was not performed for these strains.

Illumina reads were mapped to the S. cerevisiae S288c reference genome (Release R64-1-1, obtained from the Saccharomyces Genome Database [27]) and to de-novo assemblies using BWA (versions 0.5.9 and 0.6.1) [11]

and Stampy (versions 1.0.17 and 1.0.18) [28]. Stampy is thought to be more sensitive but is substantially slower than BWA. For downstream purposes requiring high nucleotide-level alignment accuracy such as SNP calling, Stampy was used, with the “–sensitive” parameter and with strain-specific genetic distances to the reference genome obtained from [15] (for mapping segregant reads to parental assemblies, the appropriate average genetic distances were used), in BWA hybrid mode with BWA 0.5.9 and the “-q 10” parameter. For purposes not requiring a high nucleotide-level alignment accuracy such as depth of coverage analyses, BWA was sometimes used instead of Stampy, with the “-q 10” parameter. For the mapping of Sanger reads, SSAHA2 (version 2.5.5) [29] was used. The results of read mapping runs were stored in the Sequence Alignment/Map (SAM) format [30] and its binary form (BAM) and various post-processing steps were performed using SAMtools (version 0.1.18) [30] and Picard (versions 1.62-1.69) (http://picard.sourceforge.net/).

SNP calling and genotyping

Primary calls of single-nucleotide polymorphisms (SNPs) for natural strains with reads mapped to the reference genome and for parental and segregant strains with reads mapped to parental de-novo assemblies were made

12

(15)

using FreeBayes 0.9.5 (http://bioinformatics.bc.edu/marthlab/FreeBayes) set for haploid samples and with the

“-i” parameter to suppress the calling of indels (insertions and deletions). To establish lists of segregating sites on the 2-way and 4-way parental de-novo assemblies, the parental reads were mapped to the assembly scaffolds and candidate SNPs were filtered only for sites where all corresponding parents (YPS128 and DBVPG6044 in the case of the 2-way cross and YPS128, DBVPG6044, Y12 and DBVPG6765 in the case of the 4-way cross) had at least ten reads covering the site and at most one read conflicting the consensus allele call. Multi-allelic sites were excluded. Segregants were genotyped at the resulting segregating sites by calculating genotype likelihoods using the “mpileup” program of the SAMtools package [30]. This program assumes diploid samples but was adapted for haploid genotyping by calling a given of the two possible alleles if the log

10

-likelihood ratio between the state of homozygosity for this allele and the state of homozygosity for the alternative allele was bigger than 10, the alternative allele if the log

10

-likelihood ratio was smaller than -10 and assigning missing data if the ratio was in-between 10 and -10. To prevent genotyping errors arising due to reads mapping to the incorrect location because of repetitive sequence or collapsed repeats, sites where more than three individuals in a segregant population showed signs of heterozygosity (diagnosed by a genotype likelihood ratio in-between 5 and -5) were excluded from analyses. SNP calls and genotype results were stored in the Variant Call Format (VCF) [31].

Depth of coverage calculations

Per-position read counts were calculated from BAM files of mapped reads using the “genomecov” utility of BEDTools (version 2.15.0) [32] or the “depth” utility of SAMtools [30]. Coverage values were averaged in non- overlapping windows of various sizes depending on the application, normalized by the genome-wide median level of coverage for the given strain and typically converted to a log

²

-scale. To prevent excessively negative logarithmic values for low-coverage regions the values were cut off at -3, that is all windows or regions with log

2

coverage values smaller than -3 had their values set to -3.

Assembly scaffolding by genetic linkage

For each of the de-novo assemblies of the parents used for the 2-way and the 4-way crosses, SNPs between the relevant parents were called on the assembly scaffolds and the set of SNPs where the genotype uniquely identifies the parent of origin as the given parent under consideration was identified and selected. For the the 2-way cross this is trivially the full set of SNP’s but in the 4-way cross this is only a subset of the SNPs, as sites with an allelic configuration of 1:3 (one parent having the first allele and the other three parents having the second allele) can only be used to uniquely identify one of the parents and sites with an allelic configuration of 2:2 can never be used to uniquely identify the parent. The relevant segregant strains were then genotyped at the resulting lists of segregating sites. Linkage disequilibrium (LD) in units of of r ² between all pairs of SNPs were then computed using PLINK [33]. The relative orientation and the distance between scaffolds with linked SNPs could then be then inferred from the patterns of LD. To deduce the relative orientation of two scaffolds, the four values corresponding to the LD of the two outer-most SNPs (the one closest to the first edge and the one closest to the other edge) of one scaffold to the two outer-most SNPs of the other scaffold was used. By finding the maximum of these values the pair of scaffold edges that are most strongly linked is identified and this gives the orientation and position of the two scaffolds relative to each other. To infer the approximate distance between scaffolds, an exponential function of the form r ² = e ⁻ ^λ ^· ^d where d is physical distance between two SNPs in units of base-pairs and λ is a constant was fit by non-linear least squares to the relationship between genetic distances and physical distances of pairs of SNPs within the same scaffold (only scaffolds of size 50 kb or larger were used for fitting) to obtain a model for the estimation of physical distances

13

(16)

from the strength of linkage disequilibrium. The fitted values of λ were 3.85 · 10 ^-5 for the 2-way cross and 6.54 · 10 ^-5 for the 4-way cross (the difference in values reflecting differences in the overall recombination rate between the two crossing experiments). The pairwise scaffold-scaffold relationships with inferred relative orientation and position and estimated physical distances then constitutes a directed graph in which each linkage group, i.e. chromosome, should be traceable as one simple path. The paths through this graph were traced partly using the scaffolding algorithm of the SGA assembly program [25] and partly by manual curation aided by visualizations of the graph in the Cytoscape software [34]. Scaffolds that could not be confidently oriented because of a lack of difference in LD profiles between the two SNPs closest to each edge or because they did not have at least two SNPs were not included into the resulting linkage group scaffolds but left as unplaced.

Genome comparisons and chromosomal similarity plots

For large-scale genome and assembly comparisons such as the one displayed in figure 2 MUMmer (version 3.1) [35] was used. For more fine-scale alignments BLAST [36] and BWA-SW [37] were used. Chromosomal similarity plots such as those displayed in figures 3 and 4 a-d) were constructed by first aligning the de-novo assemblies of the strains used for similarity comparison (representatives of the clean phylogenetic lineages in the case of figure 3 and the respective parental strains in the case of figure 4) as well as the assembly of the given input strain to the reference genome with BWA-SW. Alignments with a mapping quality lower than 250 and positions with more than one sequence aligned were excluded and SNPs were then inferred from the alignments. The inferred SNPs were used to compute the average pairwise sequence similarity between the given input genome and each of the comparison genomes in non-overlapping sliding windows across each chromosome, where the sequence similarity metric in each window is computed as _{N +1} ^S where S is the size of the window (not counting sites with missing data) and N is the number of sites where the two strains have a different allele (adapted from [15]). If alignment data was not present at a sufficiently large fraction of sites in a window, no similarity metric was computed and missing data was assigned for that window. The similarity values were plotted on a logarithmic scale. The vertical axis scales were not indicated in the plots as their absolute magnitudes vary with window size and they carry little interpretive value - a more intuitive scaling is rather provided by the relative similarity levels to the different comparison strains.

Identification of unique genomic regions

To identify regions in one given genome that is missing from another, alignments to the other genome were made using BLAST [36] without filtering for low complexity regions (“-F F”) and all alignments that had a length of at least 100 base pairs and a sequence identity of at least 75% or a length of at least 50 base-pairs and a sequence identity of at least 90% were identified. All continuous regions of at least 100 bp that were not part of this set of alignments and did not contain any stretches of “N” ’s (representing unknown sequence) longer than 20% of the region length were called as unique regions (though note that for the values in table 2 a more stringent length cutoff of 1000 bp was used).

Gene annotation

Ab initio gene prediction was performed on the natural strain de-novo assemblies using Glimmer (version 3.02) [38] trained on the set of open reading frames in the S. cerevisiae reference genome classified as “Verified”.

Gene sequences were compared to the reference genome gene sequences by means of reciprocal best BLAST,

14

(17)

an approach where a pair of genes are classed as orthologs if the first gene is the top BLAST hit when using the second gene as the query and reciprocally the second gene is the first BLAST hit when using the first gene as the query. Genes that did not seem to have a close ortholog in the reference genome were searched against NCBI’s nr database for possible orthologs in other strains or species.

15

(18)

Results

De-novo assemblies of natural strain genomes

De-novo assembly was performed on strains which had a depth of sequencing coverage of 20x or higher (which was the case for 14 out of 19 S. cerevisiae strains and 13 out of 23 S. paradoxus strains). The core assembly contigs were produced by the de-novo assembly software SGA [25], which also performs scaffolding using the paired-end information available in the Illumina read data. Low coverage paired-end Sanger reads available for the same strains were then used for further scaffolding by mapping them to the SGA assembly scaffolds and using the mapping of reads from the same pair to different assembly scaffolds to infer physical connections. The physical distance between reads in a pair is much bigger for the Sanger reads than for the Illumina reads (strain means around 4-5 kbp compared to around 300 bp, respectively) and this allowed many additional connections to be made, increasing assembly continuity quite substantially for many strains. Summary statistics on the S. cerevisiae assemblies are shown in table 1, displaying numbers for both before and after the additional scaffolding with Sanger reads (see Appendix table A.1 for the corresponding statistics for the S. paradoxus strains). The strain DBVPG6044 was found to contain some contaminant sequence of bacterial origin and scaffolds affected by this were removed from the assembly (no hybrid scaffolds containing both S. cerevisiae and bacterial sequence were found).

Overall the genomes assembled relatively well, especially for those strains with higher coverage. The assembly statistics are comparable and in some cases better than what has typically been obtained for yeast genomes using Sanger technology. The biggest scaffolds for each assembly tend have sizes on the same order as chro- mosome sizes and some chromosomes are completely contained in a single scaffold, with the exception of the subtelomeric regions. Generally the subtelomeric regions did not assembly well, which is expected given their highly repetitive and complex structure featuring many genes and other elements that exist in several homol- ogous and highly similar copies in the subtelomeres of different chromosomes. The mitochondrial genomes did not assemble well, though it is unclear why - perhaps its extremely high AT content makes the assembly process problematic. The depth of sequencing coverage had a large impact on assembly quality. The sequencing of two S. cerevisiae strains and four S. paradoxus strains was carried out in a sequencing run on a later Illumina machine model, giving dramatically higher coverage, and the continuity of these assemblies is generally higher than the others. Overall the Pearson correlation between coverage and the N50 value across all S. cerevisiae and S. paradoxus assemblies is 0.55 (p = 3.1 · 10 ^-3 ).

Improved assembly scaffolding by genetic linkage

Data from the two large-scale crossing experiments between S. cerevisiae strains was put to use to improve the structural continuity of the de-novo assemblies for the parental strains in question, exploiting the genetic linkage that manifests itself in these recombinant populations between loci that are physically close to each other on a chromosome. By first mapping the reads of the parent strains to the scaffolds of the parental de-novo assemblies, single-nucleotide polymorphism (SNPs) sites where the parents carry different alleles were identified. The segregants were then genotyped at these variant sites after mapping their reads to the assemblies in the same way. Linkage disequilibrium, which is a measure of the degree to which the genotypes at two different polymorphic loci tend to correlate across individuals, could then be computed between all pairs of SNPs to establish dense genetic linkage maps of the parental de-novo assemblies. Loci that are physically close on a chromosome should tend to display strong linkage disequilibrium regardless of whether they happen be

16

(19)

Table 1: De-novo assembly statistics for S. cerevisiae strains.

Strain Cov. Scaffold nr. Size (bp) Nr. of N’s Largest size Avg. size N50

UWOPS87-2421 834 559 / 536 11658429 / 11671772 3425 / 16768 546126 / 546126 20855 / 21775 160939 / 200122 UWOPS83-787.3 638 513 / 495 11676943 / 11678122 2989 / 4168 557771 / 885033 22762 / 23592 187881 / 244661 YPS128 56 855 / 791 11741720 / 11768379 1375 / 28086 451294 / 518960 13733 / 14877 109555 / 187623 SK1 40 1315 / 1119 11704669 / 11753398 2170 / 51157 321072 / 569610 8900 / 10503 67234 / 267272 DBVPG6765 37 967 / 779 11622418 / 11670953 2664 / 51404 280644 / 540506 12019 / 14981 65667 / 208326 DBVPG1106 34 1280 / 1163 11539734 / 11580105 4525 / 45149 172964 / 183080 9015 / 9957 36291 / 47371

L1528 34 951 / 808 11565290 / 11594025 5015 / 33985 246181 / 463986 12161 / 14349 55077 / 101460 W303 33 1391 / 1219 11630510 / 11666829 4825 / 41242 209223 / 284547 8361 / 9570 52228 / 122824 DBVPG6044 32 1137 / 983 11598603 / 11656308 2286 / 60053 236917 / 350970 10201 / 11857 52144 / 94331

Y12 29 1721 / 1582 11610064 / 11649747 4618 / 44451 113242 / 229213 6746 / 7363 36944 / 60148 Y55 28 1516 / 1219 11635423 / 11701139 7619 / 73860 263008 / 418695 7675 / 9598 42157 / 141898 DBVPG1373 26 1232 / 970 11559454 / 11626504 8790 / 76243 185406 / 368851 9382 / 11986 35542 / 98526 UWOPS03-461.4 24 3646 / 3213 11499146 / 11694268 10678 / 206837 40466 / 60006 3153 / 3639 6919 / 10747 YJM975 22 3069 / 2429 11406731 / 11688902 15503 / 298728 33464 / 59403 3716 / 4812 7414 / 15038

De-novo assembly was performed for the listed strains using high-coverage 108 bp paired-end Illumina reads and the resulting scaffolds were then further scaffolded using low coverage longer fragment size paired-end Sanger reads. Only scaffolds of size 200 bp or larger were kept in the assemblies. Numbers before and after the forward slash “/” denote values of the given statistic before and after scaffolding with Sanger reads, respectively. “Cov.” (coverage) is an estimate of the average number of times a position in the genome is included in a sequencing read and can be calculated as the number of base pairs sequenced divided by the expected genome size (the values here denote raw coverage before any filtering steps). “Scaffold nr.” is the total number of scaffolds in the assembly. “Nr. of N’s” is the total count in the sequence of the “N” character which is used to represent gaps of unknown sequence but known approximate length between contigs in a scaffold. “Size (bp)” is the total size of the assembly sequence in units of base pairs, including

“N” ’s. “Largest size” is the size of the largest scaffold in the assembly. “Avg. size” is the average size of scaffolds.

“N50” is a measure of assembly continuity and is defined as the size S of the smallest scaffold such that half of all the sequence in the assembly is contained in scaffolds of size larger than S.

contained in the same scaffold in the de-novo assembly or not. The genetic linkage information obtained from the segregant populations was thus used to infer the physical proximity of different scaffolds in the assembly and enabled an additional level of scaffolding information. This resulted in dramatically increased large-scale assembly connectivity and scaffold N50 values for all the four strains used as parents in the cross experiments.

The sequence data for the 192 segregant offspring from the 2-way cross was used to improve the assemblies of the two parental strains, the North American YPS128 and the West African DBVPG6044. 50503 SNPs were called and used as markers on the North American de-novo assembly, and 50974 on the West African assembly. The structure of the North American assembly before and after scaffolding with genetic linkage data as compared to the S. cerevisiae reference genome is displayed in figure 2 (the results are similar for the West African strain). Chromosome-sized scaffolds were obtained for all the 16 nuclear chromosomes. Similar results were obtained for the assemblies of the Sake/Japanese Y12 and the Wine/European DBVPG6765 using the segregant data from the 4-way cross in which these two strains were also included as parents. 9272 SNPs were called and used as markers on the Sake/Japanese assembly, and 36184 on the Wine/European assembly.

For these strains too, the genetic linkage data allowed the construction of one large scaffold per chromosome, each containing the majority of the sequence for that chromosome. While not all assembly material could be linked into the chromosomal scaffolds and the sub-scaffolds of each chromosomal scaffold are still separated by gaps of unknown sequence, this level of structural continuity is rarely if ever achieved in de-novo assemblies of

17

(20)

eukaryotic genomes using only short next-generation sequencing reads. The scaffold N50 value of the linkage- augmented assemblies increased to around 890 kb for the North American strain, 864 kb for the West African, 852 kb for the Sake/Japanese and 877 kb for the Wine/European (for comparison, the N50 value of the completed S. cerevisiae reference genome assembly, and thus the N50 of the actual physical genome, is 924 kb).

The improved assemblies confirms the more or less complete collinearity and synteny of the chromosomes of the four parental strains from the major phylogenetic lineages with those of the reference genome as well as with each other. A high degree of structural conservation is a prerequisite for the production of viable meiotic offspring in crossing experiments, though smaller intra-chromosomal variation might still be allowed and could disturb various downstream analyses. Recently the genome of another yeast strain from the Sake/Japanese lineage was sequenced and a relatively large inversion within chromosome 5 relative to the reference strain was identified [21]. The assembly of the Sake/Japanese strain Y12 however shows that the latter strain does not share this inversion with its close relative but instead has the same configuration of chromosome 5 as the reference strain.

Figure 2: The effect on assembly continuity of scaffolding by genetic linkage. De-novo assemblies are compared to the S. cerevisiae reference genome. The scaffolds of the assembly are represented along the vertical axis (unlabeled) and the chromosomes of the reference genome along the horizontal axis, with scaffolds/chromosomes separated by dotted lines. Colored dots or lines are drawn at points in the two-dimensional space if a sufficiently good alignment can be made between the two sequences at the corresponding positions. Long homologous regions will therefore appear as continuous diagonal lines. a) The de-novo assembly of the North American strain YPS128 before addition of genetic linkage data. b) The de-novo assembly of YPS128 after further scaffolding using genetic linkage data from the 192 segregants derived from a cross between YPS128 and the West African strain DBVPG6044. Each chromosome of the reference genome is now matched by one large scaffold in the YPS128 assembly.

Gene content and non-reference genomic material

One of the major advantages of de-novo assembly over reference-based analyses is that the former allows for the identification and analysis of genomic material which is absent from the individual whose genome is used as the reference. Table 2 displays results from the identification in the S. cerevisiae de-novo assemblies of

18

(21)

genomic regions that could not be found in the reference genome. A positive relationship was seen between the amount of non-reference genomic material identified in a strain and the genetic distance from the reference genome (r = 0.67, p = 3.1 · 10 ^-3 ), as would be expected under a model where this kind of variation in genome content is established gradually over time similar to the gradual divergence on the nucleotide level. However it’s also possible that insufficient levels of sequencing coverage would impose difficulties on the identification of non-reference material, as in a more fragmented assembly these regions are more likely to be split up into smaller contigs and scaffolds and might therefore not pass thresholds imposed the minimum length of a region. In a multiple linear regression model both genetic distance (p = 0.00781) and sequencing coverage (p = 0.03452) are significantly predictive of the amount of non-reference material identified. However the two strains sequenced to extremely high coverage on the HiSeq Illumina machine are also among the most diverged from the reference, potentially confounding the model - in a model excluding these two strains only genetic distance (p = 0.0145) and not sequencing coverage (p = 0.2736) remains significant. In either case genetic distance accounts for much more of the explained variance (R ² = 0.5549 for the model including only genetic distance versus R ² = 0.638 for the model including both genetic distance and sequencing coverage, all strains included). Considerable variance still remains unexplained however, indicating that this kind of genome content variation evolves in a more discontinuous and stochastic fashion than the highly gradual nucleotide level divergence.

Differences in gene content could potentially be important for phenotypic variation between strains and might in some cases reflect adaptations to different local environments and life styles. By ab initio gene prediction genes not present in the reference genome could be identified in the de-novo assemblies. Sequence similarity searches for genes predicted in the S. cerevisiae assemblies shows that most of these are also present in other previously sequenced strains of the species and some have been characterized with respect to their phenotypic effects. The MEL1 gene for example encodes an alpha-galactosidase enzyme that is required for the catabolism of the sugar melibiose and its presence and absence patterns have been found to associate with the ability to grow on this sugar [39]. Here the gene was found in the assemblies of the Malaysian strain (UWOPS03-461.4) and one mosaic strain (UWOPS87-2421) and with a slightly more diverged sequence in the West African lineage (DBVPG6044 and the predominantly West African mosaic strain Y55). The BIO6 gene encodes an enzyme that is part of the pathway for biotin synthesis and is missing in the reference S. cerevisiae which requires biotin for normal growth but present in Sake strains which are capable of synthesizing biotin [40].

Here the gene was found in the Sake strain Y12 as expected, as well as in a mosaic strain (UWOPS03-461.4) and in the West African lineage (DBVPG6044 and the predominantly West African mosaic strains SK1 and Y55).

Table 2: Amount of non-reference genomic material identified in S. cerevisiae de-novo assemblies.

Strain Non-ref Top 5 largest regions UWOPS87-2421 88588 20893 11617 8944 7739 4906 UWOPS83-787.3 79062 14112 7931 7693 4906 4507

YPS128 65902 12955 7692 6193 4912 4042 SK1 60451 13797 7692 5222 4900 4536 DBVPG6765 82670 20015 12252 8199 7553 5817 DBVPG1106 29888 11419 5380 4908 3711 2084

L1528 35216 8540 8257 4907 4031 3703

Strain Non-ref Top 5 largest regions W303 8318 3701 3475 1142 875 678 DBVPG6044 65100 13797 7692 5249 4536 4041

Y12 37185 6214 5052 3655 3452 2363 Y55 47082 13833 5368 5227 3735 3703 DBVPG1373 49894 13053 12670 5977 4906 4156 UWOPS03-461.4 73926 10545 4571 4141 3909 3369

YJM975 16974 3676 3411 3388 1627 1296 De-novo assemblies were compared to the reference genome using BLAST [36] and regions in the assemblies for which no sufficiently good alignments could be made were identified and inferred as being absent from the reference genome (see Methods for details). The “Non-ref ” column lists the amount of such non-reference sequence identified in each strain, in units of base-pairs. Only continuous regions of size 1 kb or bigger are included in this sum. The “Top 5 largest regions” lists for each assembly the sizes of the five largest continuous non-reference regions, in units of base-pairs.

19

(22)

Population history and structure

The genome sequences of natural strains allows for the investigation of their phylogenetic relationships and recent evolutionary history. Phylogenetic and similarity clustering trees (such as the neighbor-joining tree in figure 1) provide a good way to assess the overall phylogenetic structure of a group of sequenced individuals, but a tree that averages over the whole genome will inevitably collapse information and fail to illustrate more complex evolutionary relationships. As many S. cerevisiae strains are mosaics with different parts of their genome derived from different phylogenetic ancestors, trees constructed from different sub-segments of the genomes of these strains acquire different topologies depending on the origin of the particular segment chosen.

An alternative approach is to calculate and plot the sequence similarity between strains across the chromosomes, allowing each segment to undergo phylogenetic comparison independently of the rest of the genome. Figure 3 displays examples of genome-wide similarity plots for four S. cerevisiae strains, two sequenced here and two sequenced in other studies. By comparing a given strain against representatives of each of the major phylogenetic lineages of the species, these visualizations provide much more detailed views of the history and structure of the genomes.

The application of this methodology to all the strains sequenced here as well as to strains sequenced in other studies reveals that most strains of industrial and biotechnological importance come from the same phylogenetic lineage, the Wine/European lineage. The strains used in Sake production however do not seem to share genomic material with strains used in wine production, indicating separate domestication events.

Some laboratory strains such as SK1, Y55, W303 as well as Σ1278b [19] seem to be result of admixture events between strains from the Wine/European and the West African lineages. The lack of concordance between the genomic breakpoint profiles of these mosaic strains also indicates that they derive from separate admixture events, rather than sharing a common Wine/European-West African mosaic ancestor. Interestingly some of these strains also appear to carry some smaller Sake/Japanese segments. The North American and the Malaysian lineages seem to contribute less genomic material to strains of relevance to human activities. In the case of the Malaysian strains this might be related to a partial reproductive isolation to the rest of the strains in the species, likely caused by a number of chromosomal rearrangements present in this lineage which lead to very low rates of viable spores in out-crossing experiments and likely hinders the flow of genetic material between these strains and the rest of the species [39].

Genome characterization of F12 segregants

The genomes of the large number of F12 segregants derived from the 2-way cross and the 4-way cross experi- ments were explored using the whole-genome sequencing data from these strains. The 2-way segregants had a mean coverage of about 22x and the 4-way segregants a mean coverage of about 15x. The highly recombined structure of these genomes is illustrated in figure 4 a-d) showing the sequence similarity of example segregants to the parental genomes along one chromosome. With the exception of a few regions of the genome where selection for one of the parental alleles during the twelve generations of intercross lead to very high frequencies of that allele, overall high degrees of mixture and haplotype diversity were observed across the whole genome.

This is promising for ongoing and future QTL mapping studies utilizing these segregant populations, as the power to narrow in on and identify casual polymorphisms influencing phenotypic variation depends on neigh- boring regions of the genome not being too strongly linked and therefore correlated to each other [41]. The mitochondrial genome showed different inheritance patterns in the two crossing experiments. In the 2-way cross, all 192 segregants inherited the same parental mitochondrial genome, that of the West African parent, whereas the situation was more complex in the 4-way cross where all segregants seemed to possess more or less distinct recombinant mitochondrial genomes, dominated by West African and Sake/Japanese segments.

20

(23)

Figure 3: Chromosomal similarity plots for four S. cerevisiae strains. De-novo assemblies of representatives of the clean phylogenetic lineages of S. cerevisiae plus the given strain of interest are aligned to the reference genome and SNPs are inferred from the alignments. Pairwise sequence similarities between the strain of interest and each of the clean representatives are then computed in a sliding window along each chromosome and plotted on a logarithmic scale in the vertical dimension with one color for each clean lineage (see Methods for details). Here the size of the sliding window is 10 kb. The absence of values in some windows means that in these regions reliable alignments could not be made to the reference genome for both strains in the comparison. The colors are dark green for the North American lineage (strain YPS128), red for the West African (DBVPG6044), light green for the Sake/Japanese (Y12), yellow for the Wine/European (DBVPG6765) and blue for the Malaysian (UWOPS03-461.4). a) Chromosomal similarity profile for EC1118, an important strain in the wine industry sequenced in another study [18] b) Kyokai no. 7, a Sake yeast sequenced in another study [21] c) Y55, a laboratory strain d) W303, a laboratory strain.

Some de-novo variation, that is variation present in one or more of the segregant offspring but not in any of the parental genomes, was identified. Using the depth of sequencing coverage as a measure of the copy number of genomic regions, cases of aneuploidy, specifically the presence of an extra copy of a single chromosome, were discovered in some the segregants (figure 4 e). While aneuploidy is thought to be mostly detrimental in yeast, it seems to be more tolerated than in for example human cells and might even provide adaptive advantages in specific environmental conditions [42]. Interestingly, in both the 2-way and the 4-way cross the aneuploidy was limited to a particular chromosome for which several individuals were affected. In the 2-way cross nine segregants had an extra copy of chromosome 9 and in the 4-way cross 12 segregants had an extra copy of chromosome 1. The nature of the procedure by which the recombinant segregants are derived makes common ancestry of de-novo variants highly unlikely in the absence of positive selection, as the lineage derived from the common ancestor in which the variant arose will constitute a tiny fraction of an enormous population from which only a small number of segregants are sampled. Furthermore, both the 2-way and the 4-way experiments were carried out in two biological replicates each giving 96 segregants, and in both experiments the aneuploidies were observed in both replicates. This implies that the high frequency of the aneuploidic variants in the segregant population might be a product of positive selection during the intercross. However the biological details underlying these observed results remain to be elucidated.

A relatively large number (~100) of de-novo single-nucleotide polymorphisms were also identified in the 2-way segregants (the analysis was not performed for the 4-way segregants). Consistent with the population genetic properties of the experiment outlined above and in contrast to the aneuploidy variants, all of these SNPs were present only in a single individual. The biological properties and correlates of these mutations, such as whether they are enriched in recombinational hotspots as would be expected under a model where meiotic recombination is mutagenic, remains to be investigated.

21

(24)

Figure 4: Genome structure of F12 advanced intercross lines segregants. a-b) Sequence similarity along chromosome 4 for two example F12 segregants from the 2-way cross between the North American strain YPS128 and the West African strain DBVPG6044. The plot was made in the same way as in figure 3, with a sliding window of size 2.5 kb. Similarity to the North American parent is displayed in dark green, and the West African parent in red. c-d) Sequence similarity along chromosome 4 for two example segregants from the 4-way across involving the two parents from the above 2-way cross plus the Sake/Japanese strain Y12 and the Wine/European strain DBVPG6765. The North American and West African parents are dark green and red respectively as above, the Sake/Japanese parent is light green and the Wine/European parent is yellow. e) Example of an aneuploidy event identified by differences in the relative sequencing coverage. The depth of coverage of reads mapped to the reference genome was calculated and averaged over chromosomal windows here of size 10 kb. Horizontal lines denote the levels corresponding to 1, 1.5 and 2 times the genome-wide median coverage. For this strain, the double coverage of windows on chromosome 9 indicates the presence of an extra copy of this chromosome.

Segregation of unique parental segments in the F12

Using the de-novo assemblies of the advanced intercross line parental strains, genomic segments that are present in one of the parents but absent in the others could be identified. The fate of these segments in the F12 segregants was then determined by mapping the segregant reads to the parental assemblies and using the depth of coverage of these regions to infer presence or absence. Additionally, the chromosomal location of these segments could be determined by treating their presence or absence as genotypes of the segregants and computing linkage disequilibrium to all SNPs previously called on scaffolds of the parental de-novo assemblies.

Given the uniqueness of these regions to one parent, they should segregate as single units without much recombination. While most regions are fairly small, some are large enough to harbor one or multiple genes and might potentially have phenotypic effects.

Figure 5 illustrates the segregation of the largest of the identified unique parental segment using depth of

22

(25)

coverage on the Wine/European parent de-novo assembly. This Wine/European unique region is at least 63 kb in size and genetic linkage localizes it to the right chromosome 15 subtelomere. Size, chromosomal location and sequence similarity shows that this is a region that was previously identified and described in the commercial wine strain EC1118 [18]. This study found that the region has most likely been introgressed from another fungal species and appears to be present in many other strains used in wine-making, however its relevance if any to phenotypic variation in general and oenological phenotypes in particular has not yet been investigated. The tracking of this and other regions in this artificial population of segregants will allow for such phenotypic investigation.

Figure 5: Segregation in the 4-way F12 segregants of a genomic region unique to one parent. A region that was identified as being present only in the Wine/European parent DBVPG6765 and not in the other three parents of the 4-way cross was tracked in the F12 segregants by mapping the segregant sequence reads to the parental DBVPG6765 de-novo assembly and calculating the depth of coverage over this region. The figure includes about 50kb of the region which was contained in a single scaffold in the de-novo assembly. Displayed is the coverage averaged over windows of size 2.5 kb in the parental strains and in the segregant strains. The coverage in each window is normalized by the genome-wide median coverage of the strain and displayed on a log

²

-scale mapped onto a color gradient where log

²

values

≤ -3 are black, a value of 0 is yellow and values ≥ 3 are white. Vertical columns correspond to individual strains and horizontal rows to windows along the genomic region. The first four columns display the coverage of the four parental strain in the left-right order Wine/European (DBVPG6765), Sake/Japanese (Y12), West African (DBVPG6044) and North American (YPS128), followed by a gap and then the 4-way segregant columns. The difference between strains having zero copies (color going towards black) and those having one copy (color around yellow) of the region is clearly visible.

23

(26)

Discussion

The great promises for biological science offered by the accelerating pace of sequencing technology depends on the ability to handle and process the vast amounts of data being generated and to distill it into concrete biological results. Here I have presented analyses that demonstrate the use of next-generation sequencing data within the context of yeast population genomics and illustrate what kind of insights can be gained from a biological perspective. Saccharomyces cerevisiae has long served the biological sciences as a model system because of its excellent experimental properties and this has not changed with the advent of high-throughout sequencing - in fact the relevance of studies in yeast to other systems might be even higher than usual in the case of sequencing since the fundamental informatics principles underlying the analyses are the same regardless of what organism is being studied. Many of the issues touched upon here should therefore hopefully be relevant also within the broader contexts of next-generation sequencing and genomics.

Both natural yeast strains isolated from a variety of environments and geographical origins as well as “artificial”

populations of strains constructed from intercross experiments carried out in the laboratory were analyzed.

While the type of data is essentially the same in both cases the biological questions asked mostly differ and in general information and results obtained from the natural strains forms a necessary background framework for the analysis of the artificial populations.

Assembling the natural strain genomes de-novo enabled investigations that would not have been possible with only a reference-based sequence analysis. These assemblies were used to identify genomic material which is differentially present in different strains, in particular material absent from the reference strain which harbors the genome sequence typically forming the basis for genomics studies in budding yeast. An interesting and important question in this context is how much sequence is needed to obtain good quality de-novo assemblies that allow for these kind of analyses of genome and gene content variation. Out of the strains sequenced with Illumina technology only the ones with a coverage of 20 or higher were used for de-novo assembly, but even so coverage made a big difference within this group and strains with higher coverage generally had less fragmented assemblies. The Malaysian strain UWOPS03-461.4 was among the strains with the lowest coverage and the assembly quality for this strain was not as good as for the higher-coverage strains, probably having a negative impact on e.g. gene finding and structural variation analyses. The results from the non-reference material identification indicated that coverage might have had some effect on this analysis, though the effect was far from certain. In general the shape of the relationship between sequencing coverage and assembly quality is not well-understood. Given that a few strains here were sequenced to extremely high coverage (400- 800x) these data sets could in principle be down-sampled continuously along some interval and the qualities of the assemblies resulting from the different levels of coverage be studied, though this was not performed here. Another issue in yeast de-novo assembly is the structurally complex subtelomeric regions, which did not assembly well even in the high-coverage strains, indicating that alternative experimental approaches such as sub-cloning of individual subtelomeres prior to sequencing might be required for the accurate reconstruction of their sequence.

The use of genetic linkage data from the advanced intercross lines to improve the scaffolding of the parent strain de-novo assemblies proved quite successful and resulted in chromosome-sized scaffolds for each of the four parental genomes. This represents a relatively novel approach to the problem of genome assembly scaffolding and was enabled by the dense genotyping of the segregant strains provided by the whole-genome sequencing.

However the unconventionality of the approach also comes with a low availability of software suitable for the computational work involved in the task and so the process was quite labour-intensive. A number of methodological issues also exist (see Methods for more details) - for instance the high degree of spatial variation in recombination rates across the genome leads to quite a bit of uncertainty in the estimation of physical

24

Analysis of natural and artificial yeastpopulations by next-generationsequencingAnders Bergström

Analysis of natural and artificial yeast populations by next-generation

sequencing

Anders Bergström

Degree project in bioinformatics, 2012

Examensarbete i bioinformatik 30 hp till masterexamen, 2012

Biology Education Centre, Uppsala University, and Institute for Research on Cancer and Aging in Nice (IRCAN)

Supervisor: Gianni Liti

MSc BIOINF 12 007

Abstract

Advances in DNA sequencing technology, particularly the advent of a new group of technologies referred to as

next-generation sequencing, is now making it feasible to sequence the genomes of multiple individuals from the

same population. Simultaneously, there is an increasing interest in exploiting natural variation to elucidate the

relationship between genotype and phenotype and to dissect the genetic basis of traits, particularly so called

complex traits which are influenced by multiple factors. The budding yeast Saccharomyces cerevisiae has

emerged as a powerful system for carrying out these kinds of studies because of its small genome, experimental

tractability and well-characterized biology. Described here is the analysis of whole-genome sequence data

from a number of natural S. cerevisiae strains selected to capture the majority of the genetic variation in

the species as well as data from a large number of highly recombinant segregant individuals derived from

crossing experiments between representative natural strains. Using both de-novo assembly and reference-

based approaches the data from the natural strains is used to elucidate patterns of population history and

structure and to identify genomic material not present in the reference genome of the species. The genomes

of the segregant strains from the crossing experiments are characterized to reveal cases of de-novo variation

not present in the parent strains as well as to track the segregation of genomic regions that are unique to one

of the parents. Genetic linkage information from these recombinant populations is also used to improve the

structure of parental strain de-novo genome assemblies. Overall these analyses provide an illustration of the

use of next-generation sequencing data within the context of yeast population genomics and demonstrate what

kind of insights can be gained from a biological perspective.

Analysis of natural and artificial yeast populations by next-generation sequencing

Popular science summary Anders Bergström Degree project in bioinformatics, 2012

Biological organisms within the same population tend to vary in many of their characteristics and traits.

Degree project in bioinformatics, 2012 (Examensarbete i bioinformatik 30 hp till masterexamen, 2012) Biology Education Centre, Uppsala University, and Institute for Research on Cancer and Aging in Nice

Supervisor: Gianni Liti

Contents

Abstract 1

Popular science summary 3

Introduction 7

Next-generation sequencing data . . . . 7

Saccharomyces cerevisiae population genomics . . . . 9

Project overview . . . . 9

Methods 12 De-novo assembly . . . . 12

Read mapping . . . . 12

SNP calling and genotyping . . . . 12

Depth of coverage calculations . . . . 13

Assembly scaffolding by genetic linkage . . . . 13

Genome comparisons and chromosomal similarity plots . . . . 14

Identification of unique genomic regions . . . . 14

Gene annotation . . . . 14

Results 16 De-novo assemblies of natural strain genomes . . . . 16

Improved assembly scaffolding by genetic linkage . . . . 16

Gene content and non-reference genomic material . . . . 18

Population history and structure . . . . 20

Genome characterization of F12 segregants . . . . 20

Segregation of unique parental segments in the F12 . . . . 22

Discussion 24

References 26

Appendix 29

Introduction

The budding yeast Saccharomyces cerevisiae, one of the major molecular biology model organisms, has emerged as a powerful system in which to carry out population genomics and genotype-phenotype mapping studies [3].

Next-generation sequencing data

The data produced by next-generation sequencing machines consists of short sequence reads each sampled randomly from a redundant pool of DNA molecule fragments. Early Illumina sequencers produced reads not

7

and BFAST [13].

8

Saccharomyces cerevisiae population genomics

The budding yeast Saccharomyces cerevisiae has a long and successful history as a model organism for molecular biology and genetics research and has been in the forefront for many methodological advances in these fields.

cerevisiae, it at least seems probable that human activity has had an impact on the population. A number of other strains have also been sequenced at various levels of coverage primarily using Sanger technology [16, 17, 18, 19, 20, 21].

Project overview

9

The segregant data is therefore put to use for further scaffolding of de-novo assemblies of the parental strain genomes.

10

11

Methods

De-novo assembly

mean coverage of 0.5 or higher was excluded from scaffolding. The filtered read mapping results were used as input for the SSPACE stand-alone scaffolder [26].

Read mapping

Illumina reads were mapped to the S. cerevisiae S288c reference genome (Release R64-1-1, obtained from the Saccharomyces Genome Database [27]) and to de-novo assemblies using BWA (versions 0.5.9 and 0.6.1) [11]

SNP calling and genotyping

Primary calls of single-nucleotide polymorphisms (SNPs) for natural strains with reads mapped to the reference genome and for parental and segregant strains with reads mapped to parental de-novo assemblies were made

12