Genomic distribution and estimation of nucleotide diversity in natural populations: perspectives from the collared flycatcher (Ficedula albicollis) genome

(1)

F R O M T H E C O V E R

Genomic distribution and estimation of nucleotide diversity in natural populations: perspectives from the collared

flycatcher ( Ficedula albicollis) genome

LUDOVIC DUTOIT, RETO BURRI, ALEXANDER NATER, CARINA F. MUGAL and H A N S E L L E G R E N Department of Evolutionary Biology, Evolutionary Biology Centre, Uppsala University, Norbyv€agen 18D, SE-752 36 Uppsala, Sweden

Abstract

Properly estimating genetic diversity in populations of nonmodel species requires a basic understanding of how diversity is distributed across the genome and among individuals. To this end, we analysed whole-genome resequencing data from 20 collared flycatchers (genome size 1.1 Gb; 10.13 million single nucleotide polymorphisms detected). Genomewide nucleotide diversity was almost identical among individuals (mean= 0.00394, range= 0.00384–0.00401), but diversity levels varied extensively across the genome (95% confidence interval for 200- kb windows= 0.0013–0.0053). Diversity was related to selective constraint such that in comparison with intergenic DNA, diversity at fourfold degenerate sites was reduced to 85%, 3⁰UTRs to 82%, 5⁰UTRs to 70% and nondegenerate sites to 12%. There was a strong positive correlation between diversity and chromosome size, probably driven by a higher density of targets for selection on smaller chromosomes increasing the diversity-reducing effect of linked selection. Simulations exploring the ability of sequence data from a small number of genetic markers to capture the observed diversity clearly demonstrated that diversity estimation from finite sampling of such data is bound to be associated with large confidence intervals. Nevertheless, we show that precision in diversity estimation in large outbred population benefits from increasing the number of loci rather than the number of individuals. Simulations mimicking RAD sequencing showed that this approach gives accurate estimates of genomewide diversity. Based on the patterns of observed diversity and the performed simulations, we provide broad recommendations for how genetic diversity should be estimated in natural populations.

Keywords: genetic markers, nucleotide diversity, population genomics, recombination Received 18 February 2016; revision received 2 September 2016; accepted 19 September 2016

Introduction

Genetic diversity is a key parameter in evolutionary biology and population genetics. It relates to the evolvability of populations (Fisher 1930), is important in the contexts of adaptation (Barrett & Schluter 2008), inference of population structure (Charlesworth 2010) and speciation (Coyne & Orr 2004), and is also relevant to conservation and management (Frankham 1995; Reed & Frankham 2003). Moreover, explaining the genetic diversity under- lying phenotypic variation has long been a challenge to evolutionary biologists because directional as well as sta- bilizing selection should deplete this diversity (Barton &

Turelli 1989; Barton & Keightley 2002). Knowledge about the levels and character of genetic diversity is important to questions like these, and consequently, the ability to

accurately estimate genetic diversity is essential to the study of evolutionary phenomena.

In population genetic terms, genetic diversity reflects the interplay of mutation, genetic drift, selection, recombination and gene flow on DNA sequence variation.

Intuitively, we expect large populations to harbour more genetic diversity than small populations and, in princi- ple, this is defined by the population mutation rate (Θ), which equals 4Nel where Neis the effective population size andl is the rate of mutation. However, the determinants of genetic diversity are complex and it has long appeared mysterious that variation in levels of genetic diversity among species is relatively limited despite huge variation in population sizes (Leffler et al. 2012), an observation known as Lewontin’s paradox (Lewontin 1974). One possible explanation for this paradox is that the high genetic diversity expected for large populations is counteracted by genetic draft (Gillespie 2000, 2001), with selection being more efficient in large populations Correspondence: Hans Ellegren,

E-mail: Hans.Ellegren@ebc.uu.se

(2)

and thereby reinforcing the diversity-reducing effect of linked selection on neutral diversity (Corbett-Detig et al.

2015). Both pervasive positive selection (Maynard-Smith

& Haigh 1974) and extensive purifying selection (Char- lesworth 2012) will reduce neutral diversity at linked sites. Moreover, recent evidence suggests that life-history traits such as fecundity (Romiguier et al. 2014) rather than ecological disturbance and historical contingency (short-term variation in Ne; (Banks et al. 2013)) may explain variation in diversity levels among species.

Besides variation in diversity levels among populations and species, it is important to note that a singleΘ cannot describe genomic diversity because selection locally reduces Neto a different extent in different parts of the genome (Gossmann et al. 2011). Such effects are expected to be most pronounced in regions of low recombination where linked selection will affect the frequency of neutral variants over longer physical distances than in regions of high recombination. This prediction is supported by observations in several organisms of a positive correlation between recombination rate and nucleotide diversity (reviewed in (Cutter & Payseur 2013)).

Additional factors that may contribute to within-genome variation in diversity levels include introgression, which may not be uniformly distributed across the genome (Wu & Ting 2004), and local or regional variation in the rate of mutation (Hodgkinson & Eyre-Walker 2011).

Moreover, even under an idealized scenario of constant selection and mutation, stochastic variation in the coalescence process for individual genomic regions will render the amount of diversity variable across the genome (Wakeley 2009). On top of this, individuals within at least small populations can vary in their overall degree of genetic diversity due to different levels of inbreeding (Bensch et al. 2006) and this may also apply when there is selfing or frequent introgression. Yet, the ability to obtain a single value of the diversity parameter is important when broadly considering the effectiveness of selection and drift and in among-population comparisons.

For the reasons mentioned above, estimating genetic diversity is inherently sensitive to the number of sampled loci (because of heterogeneity in diversity levels across the genome) and individuals (because of potential variation in inbreeding among individuals), and also to the type of sequence analysed (e.g. neutrally evolving loci vs. sequences subject to selection). Theoretical work (Pluzhnikov & Donnelly 1996; Felsenstein 2006) and simulations (Carling & Brumfield 2007) have been used to describe the expected distribution of diversity levels across the genome and to determine how well finite sampling of loci and individuals is able to capture the variation in genetic diversity. However, it remained unclear to what extent these studies captured biologically relevant scenarios in terms of how genetic diversity in

distributed across the genome in natural populations. An alternative way of analysing the extent to which genomic diversity is reflected in finite sampling from a population is to first empirically determine the genomewide landscape of diversity and then simulate sampling from the observed distribution. As this requires vast amounts of genomic data, it has up until now not been a viable option. However, this is bound to change given current progress in genome sequencing and resequencing of diverse groups of organisms (Ellegren 2014).

At this point, the collared flycatcher (Ficedula albicollis) is one of the ‘ecological model organisms’ in which the genome has been mapped in most detail. Following the generation of a draft sequence assembly of the 1.1 Gb collared flycatcher genome (Ellegren et al. 2012), the con- struction of a high-density linkage map using a 50-k single nucleotide polymorphism (SNP) array (Kawakami et al. 2014a) allowed the development of a second-generation assembly version with unusually high assembly continuity and with scaffolds ordered and oriented along chromosomes (Kawakami et al. 2014b; Smeds et al. 2014, 2015). In addition, whole-genome resequencing data from multiple populations and from related species are available (Burri et al. 2015; Nater et al. 2015; Kardos et al.

2016). This system therefore serves as a most useful model and offers excellent opportunities for studies of the landscape of genetic diversity in a eukaryotic genome. Here, we use whole-genome resequencing data from a population of collared flycatchers to address the following questions: How is genetic diversity distributed across chromosomes and the genome? Is the distribution of diversity more heterogeneous than expected by chance? To what extent does genomewide diversity vary among individuals and among functional categories of sequences? Then, based on real data, we use simulations to examine how different sampling schemes would affect estimates of genetic diversity and how sampling schemes can be optimized to capture most of the variation in genetic diversity within populations.

Material and methods

Sequence analysis

We used whole-genome resequencing data from 20 (10 males, 10 females) collared flycatchers from a single population in the Apennine mountain range in central Italy (available on the European Nucleotide Archive (ENA) under Accession Number PRJEB7359). Detailed information about sampling, DNA analyses and bioinformatic methods used for generating these data is given in Burri et al. (2015). Briefly, birds were sequenced with paired- end technology on an Illumina HiSeq 2000 instrument.

Reads (29 100 bp, from libraries with insert sizes of

(3)

approximately 450 bp) were then mapped to a repeat- masked version of the collared flycatcher genome assembly FicAlb1.5 (GenBank Accession GCA_000247815.2) using BWA 0.7.5a (Li & Durbin 2009) and local realign- ment with GATK 3.2.2 (McKenna et al. 2010; DePristo et al. 2011). Variant calling was performed using a com- bination of UnifiedGenotyper in GATK, Samtools (Li et al. 2009), and FreeBayes (Garrison & Marth 2012) with default settings. Base quality score recalibration was then applied in two rounds. In addition to the procedures for variant calling described in (Burri et al. 2015), we applied a final filtering step by only including sites at which every individual was covered by at least five reads (with the exception of when comparing different individuals within windows).

Annotation

We downloaded ENSEMBL annotations for version FicAlb1.4 of the collared flycatcher genome assembly, translated them to the FicAlb1.5 assembly version and then categorized sequences as either representing intergenic regions, introns, coding sequences (CDS), or 5⁰ or 3⁰untranslated regions (UTR). Alternative transcripts of the same gene were concatenated. Within coding sequences, fourfold and 0-fold degenerate sites were dis- tinguished. In doing so, codons with coding sequence on both strands and where more than one site was variable were removed from the data set.

Estimation of nucleotide diversity

As an estimator of the population mutation rate (Θ), we estimated nucleotide diversity (p), the average pairwise number of differences per site among the chromosomes in a population (Nei & Li 1979).

p ¼X

ij

xixjpij ð1Þ

where i, j2 {1, . . ., n}, and n is the number of sequences in the sample. Further, xiand xjdenote the respective frequencies of the ith and jth sequences;pijdenotes the number of nucleotide differences per nucleotide site between the ith and jth sequences.

We estimated nucleotide diversity along the genome in 200-kb nonoverlapping windows for every individual and for the population as a whole using in-house scripts and the Python package ^PYVCF 0.4.0 (https://

pyvcf.readthedocs.org). Windows with <10 000 sites remaining after filtering were discarded. The chosen window size was a trade-off between capturing fine-scale variation in genetic diversity and reducing measurement error. In addition, separate estimates of genomewide

nucleotide diversity were made for concatenated sequences of each of the sequence categories described above. Reported estimates of nucleotide diversity are at the population level. Estimates of individual nucleotide diversities are specified as such. For estimating Z chromosome diversity, we only used data from the 10 males;

in birds, males are the homogametic sex.

To obtain a null expectation for the variation in genetic diversity along the autosomal genome, we randomly reassigned the location of the total number of observed SNPs. We performed 10 resamplings and extracted the distribution of nucleotide diversity within windows under this random distribution of segregating sites for every replicate.

To examine whether any of the individuals were closely related, we assessed relatedness among individuals by calculating of the unadjusted Ajk statistic (Yang et al.

2010) and inbreeding was assessed by FIS using the Hierfstat (Goudet 2005) package for R (R Development Core Team 2014).

Simulation of diversity estimation

To simulate the precision of nucleotide diversity estimates from a given number of markers and individuals, we subsampled the empirical data under different sampling schemes designed to correspond to amplicon sequencing and RAD sequencing, respectively.

In the schemes mimicking amplicon sequencing, we defined a marker as 500 sites for which all the sampled individuals had a coverage at least five reads in a physical region of maximum 1500 bp. Sampling schemes consisted of 1–20 individuals sampled for 1–50 ‘markers’ (in this case regions of the genome), done separately for coding sequences, intergenic regions, introns and 5⁰UTRs, and 1, 5, 10 and 20 individuals for 50–200 markers in intergenic regions. The schemes mimicking RAD sequencing consisted of 10 or 20 individuals sampled for 500–5000 100-bp markers of random sequence for which each bp had a coverage of at least five reads for all the sampled individuals.

For all sampling schemes, subsampling was repeated 1000 times and nucleotide diversity estimated for each sampled set of individuals and markers. The interval covering 95% of the distribution of nucleotide diversity estimates was recorded. Previous studies examining the variance in estimates of nucleotide diversity from the use of a finite number of markers have referred to this as accuracy of precision in diversity estimation (Pluzhnikov

& Donnelly 1996; Felsenstein 2006; Carling & Brumfield 2007). Here, we refer to the width of the 95% confidence intervals (CIs) as precision in diversity estimation as it is a measure of the consistency ofp across subsamples for a given sampling scheme.

(4)

Results and discussion

Overall levels of genetic diversity

Whole-genome resequencing of 20 collared flycatchers resulted in a mean per-site coverage of 13.99 across individuals, with a range of 7.6–21.49. 575.64 Mb out of the 1.047 Gb total autosomal assembly met the criteria of having a read coverage of at least 59 in all 20 individuals, and 10.13 million of these sites were variable in our sample. The large number of SNPs provides an excellent opportunity to assess genomewide diversity at the population as well as at the individual level. None of the individuals were closely related (maximum pairwise AJK value: 0.09 Fig. S1, Supporting information), and there was limited evidence for inbreeding (FIS= 0.0037). This suggests that the analysed individuals constitute a random sample of birds from the population. Estimates of genomewide, autosomal nucleotide diversity per individual were almost identical among the 20 individuals (mean p = 0.00386, range = 0.00376–0.00393; Fig. 1A, Table S1, Supporting information) and corresponded to a mean of one heterozygous position every 259 bp in an individual’s genome (Table 1). With all 20 individuals taken together, the mean density of segregating sites in the sample was one every 103 bp. A survey of diversity data from 167 species representing 14 phyla found that the majority of species havep in the range of 0.0005–0.05, with a median of0.0065 (Leffler et al. 2012). Collared flycatcher thus shows intermediate levels of genetic diversity seen in this broader context. As expected from

the lower effective population size, nucleotide diversity along the Z chromosome was lower than on autosomes (meanp = 0.00288). As the evolutionary forces affecting diversity levels of autosomal and sex-linked sequences differ, and addressing these differences was beyond the scope of the study, we will not further deal with sex- linked sequences.

Genetic diversity in different sequence categories Selection will generally reduce diversity levels. Overall, nucleotide diversity was strongly dependent on the annotated type of sequence and was directly related to the expected selective pressure (Table 2), with coding sequences being the least variable category (mean p = 0.0014). Using intergenic sequence as a putatively neutral reference (meanp = 0.0040), diversity at fourfold degenerate sites was reduced to 85% (0.0034), 3⁰UTRs to 82% (0.0033), 5⁰UTRs to 70% (0.0028) and nondegenerate sites to 12% (0.0005). Diversity in introns (mean p = 0.0039) was similar to that in intergenic regions. As these two categories of sequences are by far most common in avian genomes, they largely determined the genomewide average.

The lower diversity at fourfold degenerate (silent) sites compared to intergenic and intronic sites may seem surprising at first given that all three sequence categories traditionally have been considered to evolve neutrally.

However, it supports observations in molecular evolutionary studies of birds (K€unstner et al. 2011) and several other organisms (E}ory et al. 2010; Pollard et al. 2010;

(A) (B) (C)

Fig. 1 Representations of collared flycatcher genetic diversity along its main axes of variation. (A) Mean genomewide nucleotide diversity per individual. (B) Density distribution of genetic diversity in 200-kb windows with (solid line) or without (dashed line) coding sequences included. Grey lines represent 10 simulations of the estimated density distribution of genetic diversity assuming random distribution of SNPs across the genome. (C) Autocorrelation analysis of nucleotide diversity among neighbouring windows. Partitions refer to 200-kb windows, such that, for example, 10 partitions correspond to a distance of 2 Mb. The solid line shows Spearman’s rho, and the dotted line refers to P-values. Dashed lines show rho= 0 (lower) and the significance threshold of P < 0.05 (upper).

(5)

Clemente & Vogl 2012; Lawrie et al. 2013) indicating that silent sites are constrained by purifying selection. For example, they might be functionally involved with splic- ing or mRNA stability (Chamary et al. 2006). An alternative but not mutually exclusive explanation is that close linkage to nearby nonsynonymous sites under selection might reduce diversity at synonymous sites due to hitchhiking effects (Cutter & Payseur 2013; Corbett-Detig et al.

2015). However, if linked selection is pervasive, this should also to some extent affect introns.

It is noteworthy that regulatory sequences subject to purifying selection reside within introns and intergenic regions (Ponting 2008), illustrating the difficulty in defin- ing a truly neutral category of sequences. One way of handling this in attempts to estimate the neutral substitution rate (and thereby getting a proxy for the mutation rate) is to mask noncoding regions that are highly conserved in alignments to sequences from distantly related species (Siepel et al. 2005). The downside of this approach is that it is difficult to distinguish between weak selective constraints and locally reduced mutation rates as the cause of sequence conservation. Moreover, functional elements in noncoding DNA may be lineage- specific due to rapid turnover (Smith et al. 2004; Meader et al. 2010) and therefore not detectable as conserved regions in distant alignments.

Finally, we note that many molecular evolutionary studies, including in birds (Kunstner et al. 2011), have

revealed evolutionary constraints in UTRs. This is consistent with the reduced diversity in these regions that we observe. Among other functions, UTRs are likely to contain regulatory sequences affecting gene expression including binding sites for micro-RNAs (Bartel 2009).

Variation in genetic diversity across the genome A comparison of the simulated distribution of p estimates among nonoverlapping 200-kb windows assuming random location of segregating sites in the genome and the observed distribution ofp estimates clearly showed that genetic diversity is heterogeneously distributed across the genome (Fig. 1B). This was also evident from a significant autocorrelation in p estimates between neighbouring 200-kb windows (Fig. 1C), demonstrating regional similarity in diversity levels. On average, there was a statistically significant correlation in p estimates between windows located up to 1 Mb apart (Fig. 1C).

The similarity in diversity levels on a regional scale, but also the extensive variation on a larger scale, can readily be seen from the distribution ofp estimates along a chromosome (Fig. 2A). Diversity estimates differed by nearly two orders of magnitude between the least and the most variable window, with a 95% CI of 0.0013–0.0053.

The variation inp among windows was not merely a consequence of different proportions of less variable coding sequences as excluding coding sequences from the estimate did not change the density distribution of nucleotide diversity (Fig. 1B). Moreover, as coding sequences only constitute 2% of the flycatcher genome, and UTRs another 1%, the proportion of coding sequence per window has limited influence on the variation in diversity among 200-kb windows (while it would have had higher influence at smaller window sizes). However, as shown below, the density of coding sequence may still be important in determining diversity levels via linked selection. Although window-basedp estimates of different individuals were highly correlated (Fig. 2B; Spear- man’s rho, range= 0.70–0.86), there was far more variation among individuals at the level of windows than at the level of the whole genome (Fig. 2A). This may of course be a statistical effect of the much larger number of sites involved when the whole genome is considered but variation in coalescence times among haplo- types, for example due to selection, could add another layer of variation among individuals in different genomic regions. The mean difference between the highest and lowest individual p estimate per window was 0.0018 0.0012.

To address the causes of the broad-scale variation in genetic diversity across the genome, we analysed the relationship between chromosome size and diversity.

This analysis was motivated by the fact that chromosome Table 1 Average (SD) number of variable sites per individual

for each type of annotated sequence category extrapolated for the whole genome

Sequence category Number of variable sites

5⁰UTR 15 698 225

CDS 14 343 193

3⁰UTR 865 29

Introns 754 081 8020

Intergenic regions 1 482 708 17 170

Total 2 267 696 25 208

Table 2 Nucleotide diversity (p) in annotated regions of the collared flycatcher genome, and their fraction of the whole genome

Sequence category p

Proportion of genome (%)

Intergenic 0.0040 65.9

Intronic 0.0039 30.7

Fourfold degenerated sites 0.0034 1.3 Nondegenerated (0-fold) sites 0.0005 0.4

5⁰UTR 0.0028 1.0

3⁰UTR 0.0033 0.1

Total 0.0039

Data are for autosomes.

(6)

size and recombination rate are negatively correlated in collared flycatchers (Kawakami et al. 2014b) as well as in many other species of birds (e.g. (ICGSC 2004)) and other organisms (e.g. (Farre et al. 2012; Jensen-Seaman et al.

2004). In flycatchers, the variation in mean recombination rate per chromosome is extensive, from 2 cM/Mb for chromosomes>100 Mb to >10 cM/Mb for chromosomes

<10 Mb (Kawakami et al. 2014b). High rates of recombination in small chromosomes should be associated with high levels of genetic diversity (i.e. there should be a negative correlation between genetic diversity and chromosome size) because recombination uncouples selected and neutral loci, thereby reducing the effect of linked selection on neutral diversity. Studies in many species of

animals and plants (reviewed in (Cutter & Payseur 2013)), including flycatchers (Burri et al. 2015), have indeed demonstrated a positive correlation between rate of recombination and genetic diversity. However, we found a positive relationship between chromosome size and p, with diversity increasing from a chromosome average of p 0.0025 for the smallest chromosomes (<5 Mb) to reach a plateau at 0.0040–0.0043 for chromosomes in the size range of 50–150 Mb (Fig. 3A; Spear- man’s rho= 0.803, P < 0.0001). A positive correlation was seen for all sequence categories (coding sequences, fourfold degenerate sites, nondegenerate sites, 5⁰UTR, 3⁰UTR, introns, intergenic DNA) when analysed separately (Fig. S3, Supporting information).

(A) (B)

Fig. 2 (A) Example of variation in nucleotide diversity of a chromosomal segment shown for 200-kb windows within position 30–60 Mb of chromosome 1. The solid line represents the population median, and the grey lines delimit the maximum and minimum individual nucleotide diversity among 20 birds. (B) Example of the correlation between individual nucleotide diversity in 200-kb windows of two birds (Pearson’s r= 0.87).

(A) (B) (C)

Fig. 3 Correlation between nucleotide diversity and chromosome size (log10scale) in (A) collared flycatcher (this study), (B) brown creeper (Certhia americana) with data from (Manthey et al. 2015) and (C) Hawaii amakihi (Hemignathus virens) with data from Callicrate et al. (2014).

(7)

The positive relationship between chromosome size and nucleotide diversity indicates that factors other than recombination are related with chromosome size and affect diversity. One such factor is the density of targets for selection, which can be approximated by the density of coding sequence. There was a strong negative correlation between the density of coding sequence and chromosome size in flycatchers (Fig. 4), consistent with findings in other avian genomes [e.g. (ICGSC 2004)].

Everything else being equal, the higher the density of sequences subject to selection, the stronger the diversity- reducing effect of linked selection, which is observed in this system (Burri et al. 2015). The opposing effects of high rates of recombination and high density of targets of selection on levels of nucleotide diversity in small avian chromosomes suggest that the direction of the correlation between nucleotide diversity and chromosome size will be governed by the relative strength of the effects of these two causative factors. Without detailed investigation, it is hard to predict which of the effects would dominate and it might very well be that the domi- nating factor varies among avian species. For example, as the diversity-reducing role of linked selection should be more prevalent when Neis large (Corbett-Detig et al.

2015), it might be that the importance of the density of targets for selection increases with increasing Ne. In our study system, it seems that the effect from density of targets of selection prevails because we found a positive correlation between nucleotide diversity and chromosome size. Analyses using the pairwise sequentially Markovian coalescent (PSMC) indicate that the long-term Neof the investigated collared flycatcher population has been500 000 (Nadachowska-Brzyska et al. 2016).

To further investigate the relationship between chromosome size and diversity, we searched the literature for data on levels of nucleotide diversity per

chromosome in other bird species. Figure 3B (Certhia americana (Manthey et al. 2015)) and C (Hemignathus virens (Callicrate et al. 2014)) show the relationship between nucleotide diversity and chromosome size for two species of the order Passeriformes. In Certhia americana, in contrast to flycatchers, the correlation is negative, while in Hemignathus virens there is no significant relationship. In chicken, one study found no significant relationship between SNP rate and chromosome size [figure 2 in (ICGPC 2004)], but using data from dbSNP (http://

www.ncbi.nlm.nih.gov/snp/), we actually find a negative correlation between the number of SNPs per bp and chromosome size in chicken (Spearman’s rho= 0.723, P= 6 9 10⁵). Using the same statistic for flycatchers (the number of SNPs per bp), we still find a strong positive correlation between diversity and chromosome size (rho= 0.905; P < 0.001; Fig. S2, Supporting information).

Thus, the data available so far suggest that the relationship between nucleotide diversity and chromosome size in birds is inconsistent among species.

The allele frequency spectrum

With complete sequence data from 20 diploid individuals (40 chromosomes), we could obtain a detailed picture of the genomewide allele frequency spectrum, both overall and for specific sequence categories (Fig. 5). Aware- ness of the shape of the frequency spectrum is important when selecting informative SNP markers for inclusion in genotyping arrays, but can also inform about selection and demography. The spectrum was strongly shifted towards low frequency alleles with about 25% of all segregating sites in intergenic and intronic sequences being singletons (i.e. corresponding to a minor allele frequency of 0.025 in our sample of 40 chromosomes). As expected for sequences evolving under purifying selection, coding sequence alleles (and especially so for nondegenerate sites) segregated at lower frequencies than alleles in intergenic and intronic DNA (Fig. 5A). A slight shift towards low frequency alleles for fourfold degenerate sites (compared to intergenic and intronic DNA) suggests higher selective constraint on these sites, in line with the observation of reduced nucleotide diversity level at fourfold degenerate sites (Fig. 5B). Alternatively, close linkage to nonsynonymous sites under selection might affect the allele frequency spectrum also of silent sites.

Simulation ofp estimation with genetic markers Having characterized the landscape of genetic diversity in the flycatcher genome by whole-genome analysis, we set out to test how well analyses of a finite number of loci, such as often applied in ecological or evolutionary studies, would be able to capture the observed Fig. 4 Density of coding sequence (per bp) in relation to chro-

mosome size in the collared flycatcher genome.

(8)

genomewide diversity in the population. To mimic a situation of analysis of genetic diversity based on amplicons (PCR-amplified markers), we used different sampling schemes of 1–20 individuals sequenced for up to 200–500-bp regions randomly placed in the genome.

We refer to these regions as ‘markers’. We ran separate simulations for intergenic DNA, introns, coding sequences and UTRs, respectively.

An overall and indeed important observation from the simulation of sampling variance was that the 95%

CI of p estimation was relatively broad for many sampling schemes, in particular when the number of markers was limited. This reflects that there is significant variation in diversity levels across the genome (Fig. 6A,B). Moreover, our results show that the precision in diversity estimation increased more by adding markers than by adding individuals (Figs 6 and 7), consistent with the observation of higher variation in diversity levels along the genome as compared to variation among individuals. For example, with 10 intergenic markers, the 95% CI for p estimation was similar for sequence data from only one individual (0.0018–0.0068; a ratio of 3.7 between upper and lower 2.5% tail) and from 10 individuals (0.0027–0.0055; 2.9).

In contrast, with sequence data from one marker, the 95% CI for 10 individuals was very much wider (0.0004–0.009; 22.5) than that for 10 markers (0.0027–

0.0055; 2.9). The precision in the estimation of p showed a steep increase (narrower CI) in the range of 1–5 markers, while further addition of markers only slowly led to improved precision (Fig. 6A,B). How- ever, even with as much as 50 markers (i.e. a total of 25 kb) sequenced for 10 individuals, the ratio between the upper and lower 95% CI tails ofp estimation was 1.3, and for 200 markers 1.2, giving an indication of how many markers are needed to accurately estimate genomewide diversity. The same trends were seen for all sequence categories (Fig. 7), but it should be noted that the precision in the estimation of p was higher for selectively constrained markers (in particular for coding sequences) than for intergenic and intronic DNA (Fig. 7; compare the size of CIs towards the lower left corner of the different plots). This is probably an effect of the lower genetic diversity of coding sequences leading to lower variance.

Simulations meant to mimic RAD sequencing (500–

5000 markers) gave narrow confidence intervals ofp estimates throughout the whole marker range (Fig. 6C). The ratio between the upper and lower CI tails ofp estimation was only 1.2 for 500 markers (0.0035–0.0042) and 1.1 for 5000 markers (0.0037–0.00409). It is noteworthy that high precision was obtained despite only 100 bp were analysed per sampled marker, as is commonly the case in next-generation sequencing. Again this is consistent with the higher precision provided by the sampling of many markers. It could also be noted that as the RAD sequencing simulations sampled random sequences across the genome and thus included data from all sequence categories, p estimates converged towards somewhat lower values than was the case when amplicon sequencing of intergenic sequences was simulated (Fig. 6).

(A)

(B)

Fig. 5 Folded allelic frequency spectra for different sequence categories based on data from sites covered in all 20 individuals.

(A) Spectrum including coding sequences (orange), introns (brown), intergenic DNA (yellow), 5⁰ UTR (purple) and 3⁰UTR (blue). (B) Refined spectrum for coding sequences including 0- fold degenerate sites (red), fourfold degenerate sites (green), and, as a reference, introns (brown).

(9)

Recommendations for sampling design

Under the conditions of our study system (an outbred population with low levels of linkage disequilibrium;

(Backstr€om et al. 2006)) and simulation design, the recommendation for diversity estimation would be to put more emphasis on the number of loci analysed than on the number of individuals sequenced. This recommendation should be relatively insensitive to the precise length of individual markers used and we see no reason that it would not be broadly applicable across organisms and populations. Variation in Neacross the genome, beyond stochastic variation in the coalescence process, is likely to be a common feature of many organisms and generating

sequence data from a large number of individuals from the same loci is essentially adding nonindependent data because they share a common evolutionary history (Pluzhnikov & Donnelly 1996). Moreover, we argue that significant heterogeneity in diversity levels across the genome implies that, given a fixed total amount of sequence obtained, increasing the amount of analysed sequence per locus does not generally make up for reducing the number of loci. The simulations mimicking RAD sequencing clearly showed that even with only 100 bp sequenced per loci, precision in p estimation was high when a large number of loci were sampled. In practice, our results demonstrate that RAD sequencing data will accurately reflect genomewide diversity and that little is

(A) (B) (C)

Fig. 6 About 95% subsampling confidence intervals of simulated estimated nucleotide diversity in intergenic regions in relation to (A) the number of individuals and (B) the number of 500-bp markers. (C) shows intervals for simulations mimicking RAD sequencing of 100-bp markers (x-axis in log10scale).

0 0.001 0.002 0.003 0.004

Size of the 95% CI

Intergenic

10 20 30 40 5

10 15 20

CDS

10 20 30 40 5

10 15 20

Number of individuals

Intronic

10 20 30 40 10 20 30 40

5 10 15 20 5 10 15 20

Number of markers 5’ UTR

Fig. 7 Size of 95% subsampling confidence intervals from simulations of estimated nucleotide diversity for different sampling schemes, with varying number of 500-bp markers and individuals. Separate results from intergenic regions, introns, coding sequences and 5⁰ UTRs are shown. Colour coding for interval size is shown to the right.

(10)

to be gained in this respect by whole-genome sequencing as far as mean levels of diversity are concerned.

Our recommendations concord with the conclusions from theoretical work by (Pluzhnikov & Donnelly 1996) and (Felsenstein 2006). (Carling & Brumfield 2007) used simulations to test the influence of the number and length of markers, and the number of individuals, on accuracy of estimates of genetic diversity. Overall, their work also supported the use of more markers over more individuals, but in contrast to our study, their simulations were based on simulated sequence data, not empirical data. The strong and heterogeneous effects of linked selection across the genome mean that it can be difficult to simulate sequence evolution in a way that the extent and character of polymorphism mirror biologically meaningful scenarios.

One situation at which the recommendation could potentially be compromised is for populations with significant inbreeding and, in particular, variation in the extent of inbreeding among individuals. In such cases, the importance of analysing more individuals in order to get a representative estimate of the genetic diversity in the population increases. Another aspect of inbred populations is the increased proportion of the genome that is identical by descent (IBD), meaning that the distribution of levels of genetic diversity within the genome starts to become detectably bimodal, with a certain fraction devoid of variability due to runs of homozygosity in IBD tracts (Pemberton et al. 2012). We also note that for several other purposes related to population genetic (e.g. for high resolution estimation of the allele frequency spectrum, and statistical tests based on such spectra) and/or molecular evolutionary analyses (e.g. like in the case of the McDonald–Kreitman test), it is more critical to sample a large number of individuals than in the case of estimating nucleotide diversity (Robinson et al. 2014).

Our simulations were based on sequence data from markers randomly distributed across the genome. With considerable variation in diversity levels among genomic regions and chromosomes, a nonrandom selection of markers for characterization of genetic diversity in a population could easily bias diversity estimates, or at least make them less comparable to estimates based on different sets of markers in other species. A correlation between chromosome size and genetic diversity seems particularly relevant to consider in this respect because the direction of correlation is apparently not consistent among species. For this reason, the recommendation would be to sample loci from as wide a range of chromosomes as possible.

Comparisons of diversity levels among species could also be hampered if the markers used are heterogeneous with respect to the type of annotated sequence, for example, if they contain both coding sequence and UTRs, or

both coding and intronic sequence. If the goal is to estimate neutral genetic diversity, noncoding loci well distributed across the genome should therefore be targeted.

This speaks in favour of using RAD sequencing or whole-genome resequencing for this purpose. It may be particularly relevant to emphasize that our results demonstrate that synonymous sites are generally less variable than intergenic regions and introns, indicating that they are evolving under the influence of purifying selection. Thus, such sites do not represent an ideal category of sequences for estimating neutral diversity.

Finally, we stress that the analyses and simulations pre- sented here, and the recommendations based on them, are for autosomal sequences. Many of the evolutionary forces affecting sequence diversity differ between sex chromosomes and autosomes.

Conclusions

We have shown that there is extensive variation in the level of nucleotide diversity across the genome of an avian species. This variation is seen in autosomal sequence and is thus unrelated to the well-known effects of sex linkage on genetic diversity (Hedrick 2007; Frank- ham 2012). Linked selection is likely to play a strong role in governing within-genome heterogeneity in diversity levels, with (variation in) recombination rate and density of targets of selection being primary determinants of the extent of linked selection. As far as we are aware, this study is the first to characterize genomewide nucleotide diversity through whole-genome resequencing of a large population sample and then use these real data to simulate how well genetic diversity would be captured by the use of genetic markers. We find that diversity estimation by sequencing a small number of amplicons is bound to be associated with large confidence intervals. Given the heterogeneity in diversity levels across the genome, gath- ering sequence data from many loci will increase the precision in diversity estimation. Naturally, one could ask whether molecular ecological studies will continue to be based on sequence data from a limited number of loci when genotyping-by-sequencing and whole-genome resequencing become increasingly feasible in many pro- jects. However, even with the use of next-generation sequencing technologies, target capture approaches are cost-effective and can be used for a wide range of appli- cations (Jones & Good 2016).

The ability to reliably estimate genetic diversity of different populations is critical for making conclusions about evolutionary processes. To end with an example (Gohli et al. 2013), recently reported an association between genetic diversity and female promiscuity in 18 passerine bird species based on sequence data from five introns (mean length400 bp). One possible explanation

(11)

to this would be that species with strong sexual selection for compatible genes (i.e. negative frequency-dependent selection for rare or dissimilar alleles) have relatively high levels of genetic diversity. The validity of this result was criticized by (Spurgin 2013) on several methodologi- cal grounds, including the precision of diversity estimates and the inference of species level diversity from sampling of individual populations (see also response to the criticism by (Lifjeld et al. 2013). Based on experiences from the present study, we note that more firm conclusions should have been possible to reach with more extensive sampling of genomic data, either confirming or disproving the idea of a relationship between genetic diversity and female promiscuity.

Acknowledgements

This study was funded by the Swedish Research Council (grant numbers 2010-565 and 2013-8271) and the Knut and Alice Wal- lenberg Foundation (KAW Scholar grant). Sequencing was performed by the SNP&SEQ Technology Platform at Uppsala University, as part of the National Genomics Infrastructure (NGI) Sweden and Science for Life Laboratory. The SNP&SEQ Platform is also supported by the Swedish Research Council and the Knut and Alice Wallenberg Foundation. Computations were performed on resources provided by the Swedish National Infrastructure for Computing (SNIC) through Uppsala Multidis- ciplinary Center for Advanced Computational Science (UPPMAX). We thank Rory Craig for comments on the manu- script.

References

Backstr€om N, Qvarnstr€om A, Gustafsson L, Ellegren H (2006) Levels of linkage disequilibrium in a wild bird population. Biology Letters, 2, 435–438.

Banks SC, Cary GJ, Smith AL et al. (2013) How does ecological disturbance influence genetic diversity? Trends in Ecology & Evolution, 28, 670–679.

Barrett RDH, Schluter D (2008) Adaptation from standing genetic variation. Trends in Ecology & Evolution, 23, 38–44.

Bartel DP (2009) MicroRNAs: target recognition and regulatory functions.

Cell, 136, 215–233.

Barton NH, Keightley PD (2002) Multifactorial genetics: understanding quantitative genetic variation. Nature Reviews Genetics, 3, 11–21.

Barton NH, Turelli M (1989) Evolutionary quantitative genetics: How little do we know? Annual Review of Genetics, 23, 337–370.

Bensch S, Andren H, Hansson B, et al. (2006) Heterozygosity gives hope to a wild population of inbred wolves. PLoS ONE, 1, e72.

Burri R, Nater A, Kawakami T et al. (2015) Linked selection and recombination rate variation drive the evolution of the genomic landscape of differentiation across the speciation continuum of Ficedula flycatchers.

Genome Research, 25, 1656–1665.

Callicrate T, Dikow R, Thomas JW et al. (2014) Genomic resources for the endangered Hawaiian honeycreepers. BMC Genomics, 15, 1–13.

Carling MD, Brumfield RT (2007) Gene sampling strategies for multi- locus population estimates of genetic diversity (h). PLoS ONE, 2, e160.

Chamary JV, Parmley JL, Hurst LD (2006) Hearing silence: non-neutral evolution at synonymous sites in mammals. Nature Reviews Genetics, 7, 98–108.

Charlesworth B (2010) Elements of Evolutionary Genetics. Roberts Publish- ers, Englewood, Colorado.

Charlesworth B (2012) The effects of deleterious mutations on evolution at linked sites. Genetics, 190, 5–22.

Clemente F, Vogl C (2012) Evidence for complex selection on four-fold degenerate sites in Drosophila melanogaster. Journal of Evolutionary Biol- ogy, 25, 2582–2595.

Corbett-Detig RB, Hartl DL, Sackton TB (2015) Natural selection con- strains neutral diversity across a wide range of species. PLoS Biology, 13, e1002112.

Coyne JA, Orr HA (2004) Speciation. Sinauer Associates Incorporated, Sunderland, Massachusetts.

Cutter AD, Payseur BA (2013) Genomic signatures of selection at linked sites: unifying the disparity among species. Nature Reviews Genetics, 14, 262–274.

DePristo MA, Banks E, Poplin R et al. (2011) A framework for variation discovery and genotyping using next-generation DNA sequencing data. Nature Genetics, 43, 491–498.

Ellegren H (2014) Genome sequencing and population genomics in nonmodel organisms. Trends in Ecology & Evolution, 29, 51–63.

Ellegren H, Smeds L, Burri R et al. (2012) The genomic landscape of species divergence in Ficedula flycatchers. Nature, 491, 756–760.

E}ory L, Halligan DL, Keightley PD (2010) Distributions of selectively constrained sites and deleterious mutation rates in the hominid and murid genomes. Molecular Biology and Evolution, 27, 177–192.

Farre M, Micheletti D, Ruiz-Herrera A (2012) Recombination rates and genomic shuffling in human and chimpanzee—a new twist in the chromosomal speciation theory. Molecular Biology and Evolution, 30, 853–864.

Felsenstein J (2006) Accuracy of coalescent likelihood estimates: Do we need more sites, more sequences, or more loci? Molecular Biology and Evolution, 23, 691–700.

Fisher RA (1930) The Genetical Theory of Natural Selection. The Clarendon Press, Oxford.

Frankham R (1995) Conservation genetics. Annual Review of Genetics, 29, 305–327.

Frankham R (2012) How closely does genetic diversity in finite populations conform to predictions of neutral theory? Large deficits in regions of low recombination. Heredity, 108, 167–178.

Garrison E, Marth G (2012) Haplotype-Based Variant Detection from Short- Read Sequencing. ArXiv e-prints q-bio.GN.

Gillespie JH (2000) Genetic drift in an infinite population: the pseudo- hitchhiking model. Genetics, 155, 909–919.

Gillespie JH (2001) Is the population size of a species relevant to its evolution. Evolution, 55, 2161–2169.

Gohli J, Anmarkrud JA, Johnsen A et al. (2013) Female promiscuity is positively associated with neutral and selected genetic diversity in passerine birds. Evolution, 67, 1406–1419.

Gossmann TI, Woolfit M, Eyre-Walker A (2011) Quantifying the variation in the effective population size within a genome. Genetics, 189, 1389–

1402.

Goudet J (2005)HIERFSTAT, a package for R to compute and test hierarchi- cal F-statistics. Molecular Ecology Notes, 5, 184–186.

Hedrick PW (2007) Sex: differences in mutation, recombination, selection, gene flow, and genetic drift. Evolution, 61, 2750–2771.

Hodgkinson A, Eyre-Walker A (2011) Variation in the mutation rate across mammalian genomes. Nature Reviews Genetics, 12, 756–766.

ICGPC (2004) A genetic variation map for chicken with 2.8 million single-nucleotide polymorphisms. Nature, 432, 717–722.

ICGSC (2004) Sequence and comparative analysis of the chicken genome provide unique perspectives on vertebrate evolution. Nature, 432, 695–

716.

Jensen-Seaman MI, Furey TS, Payseur BA et al. (2004) Comparative recombination rates in the rat, mouse, and human genomes. Genome Research, 14, 528–538.

Jones MR, Good JM (2016) Targeted capture in evolutionary and ecological genomics. Molecular Ecology, 25, 185–202.

Kardos M, Husby A, McFarlane SE, Qvarnstr€om A, Ellegren H (2016) Whole-genome resequencing of extreme phenotypes in collared flycatchers highlights the difficulty of detecting quantitative trait loci in natural populations. Molecular Ecology Resources, 16, 727–741.

(12)

Kawakami T, Backstrom N, Burri R et al. (2014a) Estimation of linkage disequilibrium and interspecific gene flow in Ficedula flycatchers by a newly developed 50k single-nucleotide polymorphism array. Molecular Ecology Resources, 14, 1248–1260.

Kawakami T, Smeds L, Backstrom N et al. (2014b) A high-density linkage map enables a second-generation collared flycatcher genome assembly and reveals the patterns of avian recombination rate variation and chromosomal evolution. Molecular Ecology, 23, 4035–4058.

Kunstner A, Nabholz B, Ellegren H (2011) Evolutionary constraint in flanking regions of avian genes. Molecular Biology and Evolution, 28, 2481–2489.

K€unstner A, Nabholz B, Ellegren H (2011) Significant selective constraint at 4-fold degenerate sites in the avian genome and its consequence for detection of positive selection. Genome Biology and Evolution, 3, 1381–

1389.

Lawrie DS, Messer PW, Hershberg R, Petrov DA (2013) Strong purifying selection at synonymous sites in D. melanogaster. PLoS Genetics, 9, e1003527.

Leffler EM, Bullaughey K, Matute DR et al. (2012) Revisiting an old rid- dle: What determines genetic diversity levels within species? PLoS Biology, 10, e1001388.

Lewontin R (1974) The Genetic Basis of Evolutionary Change. Columbia University Press, New York and London.

Li H, Durbin R (2009) Fast and accurate short read alignment with Bur- rows-Wheeler Transform. Bioinformatics, 25, 1754–1760.

Li H, Handsaker B, Wysoker A et al. (2009) The sequence alignment/

map format and SAMtools. Bioinformatics, 25, 2078–2079.

Lifjeld JT, Gohli J, Johnsen A (2013) Promiscuity, sexual selection, and genetic diversity: a reply to Spurgin. Evolution, 67, 3073–3074.

Manthey JD, Klicka J, Spellman GM (2015) Chromosomal patterns of diversity and differentiation in creepers: a next-gen phylogeographic investigation of Certhia americana. Heredity, 115, 165–172.

Maynard-Smith J, Haigh J (1974) The hitch-hiking effect of a favourable gene. Genetical Research, 23, 23–35.

McKenna A, Hanna M, Banks E et al. (2010) The genome analysis toolkit:

a M^APR^EDUCEframework for analyzing next-generation DNA sequencing data. Genome Research, 20, 1297–1303.

Meader S, Ponting CP, Lunter G (2010) Massive turnover of functional sequence in human and other mammalian genomes. Genome Research, 20, 1335–1343.

Nadachowska-Brzyska K, Burri R, Smeds L, Ellegren H (2016) PSMC- analysis of effective population sizes in molecular ecology and its application to black-and-white Ficedula flycatchers. Molecular Ecology, 25, 1058–1072.

Nater A, Burri R, Kawakami T, Smeds L, Ellegren H (2015) Resolving evolutionary relationships in closely related species with whole-genome sequencing data. Systematic Biology, 64, 1000–1017.

Nei M, Li WH (1979) Mathematical model for studying genetic variation in terms of restriction endonucleases. Proceedings of the National Acad- emy of Sciences USA, 76, 5269–5273.

Pemberton TJ, Absher D, Feldman Marcus W et al. (2012) Genomic patterns of homozygosity in worldwide human populations. The American Journal of Human Genetics, 91, 275–292.

Pluzhnikov A, Donnelly P (1996) Optimal sequencing strategies for sur- veying molecular genetic diversity. Genetics, 144, 1247–1262.

Pollard KS, Hubisz MJ, Rosenbloom KR, Siepel A (2010) Detection of nonneutral substitution rates on mammalian phylogenies. Genome Research, 20, 110–121.

Ponting CP (2008) The functional repertoires of metazoan genomes. Nat- ure Reviews Genetics, 9, 689–698.

R Development Core Team (2014) R: A Language and Environment for Sta- tistical Computing. R Foundation for Statistical, Computing, Vienna, Austria.

Reed DH, Frankham R (2003) Correlation between fitness and genetic diversity. Conservation Biology, 17, 230–237.

Robinson JD, Coffman AJ, Hickerson MJ, Gutenkunst RN (2014) Sam- pling strategies for frequency spectrum-based population genomic inference. BMC Evolutionary Biology, 14, 1–16.

Romiguier J, Gayral P, Ballenghien M et al. (2014) Comparative population genomics in animals uncovers the determinants of genetic diversity. Nature, 515, 261–263.

Siepel A, Bejerano G, Pedersen JS et al. (2005) Evolutionarily conserved elements in vertebrate, insect, worm, and yeast genomes. Genome Research, 15, 1034–1050.

Smeds L, Kawakami T, Burri R et al. (2014) Genomic identification and characterization of the pseudoautosomal region in highly differenti- ated avian sex chromosomes. Nature Communications, 5, 5448.

Smeds L, Warmuth V, Bolivar P et al. (2015) Evolutionary analysis of the female-specific avian W chromosome. Nature Communications, 6, 7330.

Smith NGC, Brandstr€om M, Ellegren H (2004) Evidence for turnover of functional noncoding DNA in mammalian genome evolution. Geno- mics, 84, 806–813.

Spurgin LG (2013) Comment on Gohli et al. (2013): “Does promiscuity explain differences in levels of genetic diversity across passerine birds?” Evolution, 67, 3071–3072.

Wakeley J (2009) Coalescent Theory. Roberts Publishers, Englewood, Colorado.

Wu C-I, Ting C-T (2004) Genes and speciation. Nature Reviews Genetics, 5, 114–122.

Yang J, Benyamin B, McEvoy BP et al. (2010) Common SNPs explain a large proportion of the heritability for human height. Nature Genetics, 42, 565–569.

L.D. performed all population genomic analyses. R.B.

provided access to data, A.N. and C.F.M. provided input on study design, data interpretation and statistical analyses. H.E. conceived of the study and supervised the work. L.D. and H.E. initiated the study and wrote the paper.

Data accessibility

Sequences are available in the European Nucleotide Archive (ENA), Accession Number PRJEB7359. Scripts and details of the data are made available at Dryad:

doi:10.5061/dryad.1n84v.

Supporting Information

Additional Supporting Information may be found in the online version of this article:

Table S1Genome-wide nucleotide diversity per individual Fig. S1Pairwise relatedness among individuals in the studied collared flycatcher population

Fig. S2Number of SNPs per bp in relation to chromosome size in the collared flycatcher genome

Fig. S3 Relationship between chromosome size (log10) and nucleotide diversity for different sequence categories. All corre- lations were significant (P= 0.013, or less). The strength of cor- relations were: coding sequences (R²= 0.68), 0-fold degenerate sites (R² = 0.62), fourfold degenerate sites (R²= 0.69), 5⁰ UTR sites (R²= 0.44), 3⁰ UTR sites (R²= 0.80) intergenic DNA (R²= 0.18) and introns (R²= 0.29)