Genetic diversity and differentiation in natural and managed stands of Norway spruce (Picea abies)

Helena Eklöf

Department of Ecology and Environmental change

Umeå 2021

This work is protected by the Swedish Copyright Legislation (Act 1960:729)

Dissertation for PhD

ISBN: 978-91-7855-472-0 (print)

ISBN digital version: 978-91-7855-473-7 (pdf)

Information about cover photo: Helena Eklöf

Electronic version available at: http://umu.diva-portal.org/

To my family

List of chapters

This thesis is a summary based on the following chapters:

I. Bernhardsson, C., Wang, X., Eklöf, H., & Ingvarsson, P. K. (2020). Variant calling using whole genome resequencing and sequence capture for population and evolutionary genomic inferences in Norway spruce (Picea abies). In: The Spruce Genome (pp. 9-36). Springer, Cham.

II. Eklöf, H., Bernhardsson, C., & Ingvarsson, P. K. (2020). Comparing the Effectiveness of Exome Capture Probes, Genotyping by Sequencing and Whole-Genome Re-Sequencing for Assessing Genetic Diversity in Natural and Managed Stands of Picea abies. Forests, 11(11), 1185.

III. Eklöf, H., Bernhardsson, C., & Ingvarsson, P. K. Do modern forestry practices impact the genetic diversity of planted stands of Norway spruce (Picea abies). Manuscript.

IV. Ingvarsson, P. K., & Dahlberg, H. (2019). The effects of clonal forestry on genetic diversity in wild and domesticated stands of forest trees. Scandinavian Journal of Forest Research, 34(5), 370-379.

Author contributions

Chapter I

PKI designed the study. HE performed field sampling and lab experiments. XW performed variant calling for data sets with support from CB and PKI. CB performed variant filtering. XW, HE and CB analysed results. XW wrote the variant calling part of the manuscript. CB wrote the variant filtration part of the manuscript. All authors contributed with comments on the final version of the paper.

Chapter II

PKI and CB designed the study. HE performed experimental work. HE, CB and PKI performed bioinformatic analyses. HE wrote the manuscript with help from PKI. All authors contributed with comments on the final version of the paper.

Chapter III

PKI and HE designed the study. HE performed all lab work. CB, HE and PKI performed bioinformatic analyses. HE performed all data analysis with support from PKI. HE wrote the manuscript. All authors contributed with comments on the final version of the paper.

Chapter IV

PKI designed the study. PKI performed population genetic modelling. HE conducted the literature survey. PKI wrote the manuscript with assistance from HE.

Paper I is reprinted with kind permission from Springer International Publishing.

Abstract

Norway spruce (Picea abies (L.) Karst.) is one of Sweden’s most important tree species, both as a keystone species and for the forest industry, so it is important that we keep our stands as healthy as possible. Because the existing level of genetic diversity in natural forests is unclear, we need to evaluate both what levels of natural diversity we have to begin with and how modern forestry practices might affect them. Previous studies of this or similar situations have used relatively few markers. We showed that capture probes and genotyping by sequencing (GBS) give similar results for common diversity measurements and both offer many SNPs, although capture probes showed slightly more diversity. We chose to use PoolSeq and GBS together to examine a large number of planted and natural stands of Norway spruce in northern Sweden. In line with previous results on the subject, we did not find any large differences between our young, planted forests and our old forests, suggesting that today’s re-planting methods have not affected the general diversity in different stands. However, we did find a difference in the variance of our summary statistics at the stand level between planted and old stands, an indication that forestry may cause long-term effects. This becomes even more important in light of possible clonal deployment of Norway spruce. I believe that more research is needed, both over larger geographical areas and with a focus on within-stand variation, for example using mitochondrial and chloroplast DNA to discern finer details of spatial distribution within stands and looking more closely at the genotypic diversity within natural and planted stands. Effort should also be put into examining how these possible differences affect the rest of the ecosystem living with and among Norway spruce.

Introduction

Domestication

A complete domestication has already taken place in all of our highly productive crop species, such as wheat, rice and maize, on which human survival is highly dependent today (Doebley et al., 2006). The domestication of some crop species has gone so far that, for example, neither cauliflower nor maize is any longer capable of propagating itself naturally, and both are completely dependent on humans for their reproduction (Doebley et al., 2006). Forest trees, on the other hand, are still relatively undomesticated and retain abundant genetic and phenotypic variation (González-Martínez et al., 2006).

All types of breeding are associated with risks. If too few individuals are chosen or the chosen individuals are too similar at the genetic level, a large portion of the genetic diversity might be left behind and not make it into the breeding population. When only the best individuals are chosen for reproduction every generation, it induces a genetic bottleneck that reduces genetic diversity across the whole genome (Doebley, 1989). Two of the biggest factors determining how large the loss of genetic diversity will be are the duration of the domestication period and the size of the population during this period (Eyre-Walker et al., 1998).
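The dependence on bottleneck duration and population size can be illustrated with the standard drift expectation from population genetics, in which the expected heterozygosity after t generations in an idealised population of effective size N is H_t = H_0 (1 - 1/(2N))^t. The sketch below is only an illustration under that idealised model (random mating, constant size, no selection or mutation); the function name and the example values are hypothetical and do not come from the cited studies.

```python
def retained_heterozygosity(effective_size: float, generations: int) -> float:
    """Expected fraction of the initial heterozygosity that remains
    after `generations` of drift at a constant effective size.

    Uses the textbook expectation H_t / H_0 = (1 - 1 / (2N)) ** t.
    """
    if effective_size <= 0:
        raise ValueError("effective_size must be positive")
    if generations < 0:
        raise ValueError("generations must be non-negative")
    return (1.0 - 1.0 / (2.0 * effective_size)) ** generations


if __name__ == "__main__":
    # Hypothetical bottlenecks: a short, mild one and a long, severe one.
    for size, gens in [(500, 10), (50, 10), (10, 100)]:
        fraction = retained_heterozygosity(size, gens)
        print(f"N = {size:>3}, t = {gens:>3}: {fraction:.3f} of H_0 retained")
```

Under this model a smaller population or a longer bottleneck both lower the retained fraction, which is the qualitative point made above.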

However, the loss of genetic diversity is not equal for all genes in the genome. Genes encoding traits under selection undergo a stronger reduction in diversity, or alternatively a more severe bottleneck, that can even remove all variation from the target locus. Neutral genes (genes that do not affect a trait under selection) are not targeted by direct selection, and while they are also affected by the domestication bottleneck that acts to remove genetic variation, the reduction in diversity in neutral genes is often not as severe as that in the selected genes.

However, an exceptionally strong domestication bottleneck could also leave little variation in neutral genes, making it difficult to distinguish a neutral locus from a selected locus (Doebley et al., 2006, Wright et al., 2005). Traits that are selected for during domestication, so-called ‘domestication syndromes’ (Smýkal et al., 2018), are often specific phenotypes, and this is especially true for early domestication events where humans were unaware of genes and/or genetic techniques (Wright et al., 2005).

Loss of genetic diversity always poses a risk for the organism as it may lose the ability to evolve and adapt to a changing environment. With the current global warming crisis, it is important to ensure that genetic diversity remains in populations to give the organism a chance to cope with the new environmental conditions or habitats.

Study organism

Norway spruce (Picea abies (L.) Karst.) is an evergreen gymnosperm tree that belongs to the biggest group of gymnosperms, the family Pinaceae, which also includes, for instance, the genus Pinus (Källman, 2009). Norway spruce is one of the most important conifer species in Europe, with a distribution range from the west coast of Norway to mainland Russia and populations across the Alps, the Carpathians and the Balkans (Farjón, 1990). Norway spruce is wind pollinated and a predominantly outcrossing species (Burczyk et al., 2004). Shimono et al. (2011) found an average selfing rate of only 0.5% in a Norway spruce clone archive in Ekebo, Sweden. It is also one of the most economically important tree species for the production of timber and pulp (Koski et al., 1997). In Sweden there were around 3.5 billion cubic meters of standing forest on productive forest land in 2013, with 41% being made up of Norway spruce and 39.1% of Scots pine (Black-Samuelsson et al., 2020).

Colonization after the last glaciation

Around 18 000 years BP the last glaciation reached its peak, and around 16 000 years BP the climate started to warm and the ice retreated. This allowed species to expand northwards from refugia south of the maximal glaciation (Hewitt, 1999). By 8000 years BP most deciduous forest species had reached their present habitats, and by 5000 years BP the vegetation patterns were similar to today’s patterns, although pollen data indicate fluctuations in distribution and abundance depending on the climate at the time (Huntley, 1990). Italy, the Balkans, Greece and the southern parts of the Iberian Peninsula are confirmed refugia in Europe. From these, Northern Europe was colonized mostly from the Balkans, with the Alps being a large barrier and the Pyrenees a slightly smaller obstacle (Hewitt, 1999).

In a study from 2012, Parducci et al. identified a rare mitochondrial DNA (mtDNA) haplotype in standing trees and in pollen from sediments that suggests a refugium of spruce was present in western Norway as well as in Russia. Based on this, Parducci et al. (2012) suggest that recolonization of Scandinavia took place both from a refugium in Norway and from Russia through Finland and the Baltic states. The majority of studies, however, suggest that Norway spruce spread to Sweden after the latest glaciation from a population in Russia (Chen et al., 2012, Binney et al., 2009, Väliranta et al., 2011). Tollefsrud et al. (2008), based on a study of mtDNA and fossil pollen, found low genetic structure in the northern range of Norway spruce populations and argued that these results were consistent with a colonization event from a single refugium. Tollefsrud et al. (2009) investigated migration routes of Norway spruce using a combination of nuclear microsatellites and mtDNA microsatellites and concluded that the colonization took place from a large refugium in Russia via two different routes, one southwest across the Baltic Sea and one northwest over Finland.

Genome

Conifer genomes have a largely conserved karyotype, consisting of 12 (2n = 24) chromosomes (Morse et al., 2009), all of similar size and shape (Morse et al., 2009, Nystedt et al., 2013). Conifer genomes also contain a large amount of repetitive material. This, together with their large size (>15 Gbp), makes both sequencing and assembly of conifer genomes a hard task. In 2013, Nystedt et al. published the first draft assembly for any gymnosperm species, the Norway spruce (Picea abies) genome, and they concluded that the entire genome is 19.6 gigabase pairs. To put that in perspective, the model species Arabidopsis thaliana has a genome size of 125 Mbp (Kaul et al., 2000) and the human genome is 3.08 gigabase pairs (International Human Genome Sequencing Consortium, 2004).

There is no evidence that a recent whole genome duplication is responsible for the large genome size of Norway spruce (and other conifers). Instead, the genome contains a high abundance of repetitive sequences belonging to different families of transposable elements that have expanded and inserted into the spruce genome over tens of millions of years, creating large introns (>50 kb) and a high number of pseudogenes (Nystedt et al., 2013).

Genetic diversity

Genetic diversity quantifies genetic variability within a stand, population or entire species, depending on the scale of study. Genetic diversity is needed to preserve the ability to adapt to changing environmental conditions (Gregorius, 1991), and a reduced level of heterozygosity is directly linked to a reduction in individual and population fitness due to inbreeding depression. This connection leads to an expectation that, at the population level, fitness and level of heterozygosity will be correlated (Reed and Frankham, 2003). When performing breeding in small populations, inbreeding may increase due to the small number of individuals that are genetically different (Ellstrand and Elam, 1993). Choosing a limited number of individuals from a natural stand to establish, for example, a seed orchard results in a decrease in population size and therefore increases the risk of inbreeding. In seed orchards there is also a further risk of increased selfing, the most extreme form of inbreeding, since many ramets of the same clone are usually present, facilitating pollination among identical clones, which is equivalent to selfing (Muona and Harju, 1989).

Norway spruce is wind pollinated, which creates high levels of gene flow, one of the most important factors influencing the genetic structure of populations. High gene flow results in high genetic diversity within populations and reduces the difference between populations (Burczyk et al., 2004). Evidence of this gene flow can be found in studies examining the parentage of seeds from seed orchards. Adams and Burczyk (2000) reviewed six different conifer species

spruce seed orchards high levels of pollen contamination have been found. A seed orchard in Finland was found to have pollen contamination rates of 69-71% (Pakkanen et al., 2000), and two seed orchards in Sweden reported 43% and 59% pollen contamination, respectively (Paule et al., 1993). The high level of pollen contamination can have negative effects for both the seed orchard crops and the natural forest stands. For the seed orchard crop, pollen contamination results in an uneven quality of seeds between sites and years, and pollen contamination is also expected to lower the genetic gain (Kang et al., 2001, Adams and Burczyk, 2000).

If, on the other hand, gene flow from seed orchards into natural stands is large, this can reduce genetic diversity of surrounding populations and might also result in offspring that are less adapted to local environments since genotypes from seed orchards are often transplanted from elsewhere (Adams and Burczyk, 2000).

Studies based on paternally inherited chloroplast markers show that wind dispersal of pollen tends to have a homogenizing effect on genetic and genotypic diversity, over both long and short distances, whereas seed dispersal creates patches of genotypes in mitochondrial markers, which are only maternally transmitted through seeds (Scotti et al., 2008).

Conifers carry many lethal alleles and can therefore be strongly affected by inbreeding depression (Williams and Savolainen, 1996). Both Norway spruce and Scots pine have even been reported to fail to produce seeds after selfing (Tigerstedt, 1973, Kärkkäinen and Savolainen, 1993). A lower survival rate among selfed trees during germination compared to open pollination has also been found for four different mothers in an experimental plantation in Norway. Selfed trees also showed a reduction in average height, average trunk diameter and average trunk volume compared to open pollinated trees after 61 years of growth (Eriksson et al., 1973). Skrøppa (1996) monitored outcrossed and selfed families from three natural Norway spruce populations over 10 years in a nursery. The selfed families suffered from higher mortality, germinated more slowly, set buds earlier and had a shorter shoot growth period. Shoot growth rate, height and diameter growth were all reduced, and inbreeding depression was significantly different from zero for all mentioned traits (Skrøppa, 1996).

Forestry practices

Norway spruce and Scots pine (Pinus sylvestris L.) dominate Swedish forestry. As an example, an average of 199 million Norway spruce seedlings and 135 million Scots pine seedlings were delivered annually during 1998-2019 (Black-Samuelsson et al., 2020). In Sweden, the annual growth is larger for Norway spruce than for Scots pine and the rotation time is shorter (Lindgren et al., 2007), which makes Norway spruce the most economically important tree species in Sweden (Lindgren, 2009). In 2013, Sweden was the third largest exporter of pulp, paper and sawn timber according to the Swedish Forest Industries’ Federation.

The forest industry accounts for 9-12% of Swedish industry’s total employment, exports and sales, and the export value of forest industry products was estimated at 124 billion SEK in 2014 (Swedish Forest Industries’ Federation, 2015).

There are several methods for reforestation, both natural ones, such as leaving seed trees, and more artificial ones, such as direct sowing of seeds or planting already established seedlings. Planting improved seedlings from nurseries is the most common method in Sweden because it speeds up the reforestation process and allows for more homogeneous stands while also reducing the need for thinning (Hallsby, 2013). In Sweden, around 84% of all reforestation is achieved through planting seedlings and natural regeneration makes up approximately 10%. The remaining fraction is direct seeding or cases where no measures were taken, so that the regeneration method is unknown (Black-Samuelsson et al., 2020). One prominent negative aspect of planting seedlings is the increased cost (Hallsby, 2013), but there is also a fear that planting may decrease genetic diversity, limit species diversity and hinder natural selection (Black-Samuelsson, 2012). However, Hallsby (2013) argued that the genetic variation would likely be kept at least as large as in a naturally regenerated stand, since the plus trees forming seed orchards are derived from a wide geographic range.

The development of Swedish seed orchards is roughly divided into three different rounds, where methods and tree selection are improved in every step. During the first round (1949-1972) grafts were made from selected plus trees (trees that show the phenotypically desired qualities) from natural forests. To improve genetic gain in the second round (1981-1994), breeding values from the first round were used together with a new selection of trees of both Swedish and foreign origin. In the third round (from 2004) all material is selected based on field-testing. For every round the genetic gain has been increased to improve value production and volume production and to ensure a constant quality (Lindgren et al., 2007).

Today Skogforsk has around 20 breeding populations distributed across Sweden that are adapted to different photoperiods and temperature regimes. Each breeding population has a target effective population size (Ne) of at least 50 individual trees. In the breeding program, within-family selection is applied, raising Ne to 100 by ensuring equal contribution from all parents. With an Ne of 100 there is later room to vary founder contribution by selection among families while still keeping the Ne higher than 50 (Rosvall et al., 2011).
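The effect of equalising parental contributions can be illustrated with the classical relation between variance in family size and effective size for a diploid population of constant size, Ne = (4N - 2) / (Vk + 2), where N is the number of parents and Vk the variance in the number of offspring per parent. This is a textbook approximation used here as an assumption; Rosvall et al. (2011) may define Ne differently (for example as a status number), and the function name below is hypothetical.

```python
def effective_size(n_parents: int, family_size_variance: float) -> float:
    """Variance effective size for a diploid population of constant size.

    Ne = (4N - 2) / (Vk + 2), where Vk is the variance in the number of
    offspring contributed per parent (mean family size is 2).
    """
    if n_parents < 1:
        raise ValueError("n_parents must be at least 1")
    if family_size_variance < 0:
        raise ValueError("family_size_variance must be non-negative")
    return (4 * n_parents - 2) / (family_size_variance + 2)


if __name__ == "__main__":
    # Random (Poisson) contributions: Vk = 2, so Ne is close to N.
    print(effective_size(50, 2.0))
    # Strict within-family selection: every parent contributes equally (Vk = 0),
    # so Ne is close to 2N.
    print(effective_size(50, 0.0))
```

With 50 parents the approximation gives an Ne of roughly 50 under random contributions and roughly 100 under equal contributions, in line with the doubling described above.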

There is a deficit of Norway spruce seeds to cover annual plantation needs, and only around 74% of the seedlings have an origin in commercial seed orchards, with 69% coming from Swedish orchards and 5% from foreign orchards. The remaining seeds are collected from common production forests, 12% from Swedish stands and the remaining 13% from foreign stands (Skogsstyrelsen, 2014).

Genetic markers

There are several types of markers that have been used in an effort to characterize the genetic diversity in plants, including markers for paternally inherited chloroplasts, maternally inherited mitochondria and the biparentally inherited nuclear genome. When using nuclear markers, both the maternally and paternally inherited diversity is captured (Maghuly et al., 2006). One of the early marker types, isozymes (enzymes that differ in amino acid sequence but still catalyse the same chemical reaction), had limited genome coverage and was affected by the plant’s phenological stage (El-Kassaby, 1991). With random amplified polymorphic DNA (RAPD) and amplified fragment length polymorphism (AFLP) only the dominant allele is detected, preventing heterozygotes from being distinguished from homozygotes (Port and El-Kassaby, 2014). Simple sequence repeats (SSRs or microsatellites) are widely used for assessing genetic diversity, primarily because their high variability provides high power to resolve individuals and because the relatively simple laboratory work involved in scoring them makes the procedure replicable (Chambers and MacAvoy, 2000). SSRs do, however, have problems with null alleles that make it harder to estimate the true number of heterozygotes (Nybom et al., 2014). A problem with all markers described above is that relatively few are available, allowing for few points of comparison. In addition, it is time-consuming to design and test primers for them and later to score the results (Davey et al., 2011).

Next generation sequencing

With next generation sequencing (NGS) there are several methods available for genotyping genome-wide markers that can be applied to any non-model organism with no or limited access to complete genome sequence information (Davey et al., 2011). Genotyping by sequencing (GBS) is a method developed by Elshire et al. (2011) and is based on the use of restriction enzymes together with NGS to generate large numbers of single nucleotide polymorphism (SNP) markers (Pan et al., 2014). One of the benefits of GBS compared to SSRs is the large number of markers obtained without the need for laborious PCR work and pre-design of primers (Davey et al., 2011). For the GBS method around 100 ng of DNA is needed at the start (Elshire et al., 2011), and the first step is to cut genomic DNA into smaller pieces using one or more restriction enzymes. Adaptors, including both a barcode adaptor (which is later used to distinguish individuals from each other) and a common adaptor, are added by ligation to the fragments before they are pooled. The final fragments should be less than 1 kb in size to allow for efficient genotyping with the short read lengths common for most modern NGS platforms, and only the fragments with one barcode and one common adaptor will be genotyped (Davey et al., 2011).
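The digestion and size-selection steps can be sketched in silico. The example below is a simplified illustration, not the protocol of Elshire et al. (2011): it assumes a single enzyme, PstI (recognition site CTGCAG, cut between A and G), a toy sequence, and hypothetical function names. It ignores adaptor ligation, so the two end fragments are kept even though in a real library only fragments flanked by cut sites would receive adaptors.

```python
PSTI_SITE = "CTGCAG"   # PstI recognition sequence
PSTI_CUT_OFFSET = 5    # PstI cuts after the A: CTGCA^G


def digest(sequence: str, site: str = PSTI_SITE,
           cut_offset: int = PSTI_CUT_OFFSET) -> list[str]:
    """Cut `sequence` at every occurrence of `site` on the given strand."""
    sequence = sequence.upper()
    fragments = []
    fragment_start = 0
    site_start = sequence.find(site)
    while site_start != -1:
        cut = site_start + cut_offset
        fragments.append(sequence[fragment_start:cut])
        fragment_start = cut
        site_start = sequence.find(site, site_start + 1)
    fragments.append(sequence[fragment_start:])
    return fragments


def size_select(fragments: list[str], min_len: int, max_len: int) -> list[str]:
    """Keep fragments whose length lies in [min_len, max_len]."""
    return [f for f in fragments if min_len <= len(f) <= max_len]


if __name__ == "__main__":
    toy = "AACTGCAGTTTTCTGCAGGG"  # hypothetical sequence with two PstI sites
    pieces = digest(toy)
    print(pieces)
    print(size_select(pieces, 5, 8))
```

On real data the size window would be chosen to match the read length of the sequencing platform, as described above.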

Whole-genome re-sequencing (WGS) is made possible by existing reference genomes but remains a costly practice in Norway spruce due to its large genome size (Nystedt et al., 2013). A more cost-effective method compared to WGS is target capture, which utilizes biotinylated probes to extract known regions from a genome. The drawback here is that probes have to be designed in advance, thus requiring pre-existing information on genome structure, and that only areas with an existing probe can be genotyped (Hirsch et al., 2014). However, this method is highly replicable once probe design has been finished and validated (Gnirke et al., 2009). A large set of 40,018 target capture probes has recently been developed for Norway spruce by Vidalis et al. (2018). Each probe consists of the probe region (120 bp) but also includes 100 bp on each side of the probe, resulting in a total extended probe region of 320 bp.

Aims

The overall aim of this thesis was to investigate if and how modern forestry practices in Sweden affect the genetic diversity of Norway spruce (Picea abies).

The specific aim for each chapter in this thesis is:

Chapter I: Build a working data flow for variant calling from next-generation sequencing data in Norway spruce and examine possible issues that arise due to the complex genome structure.

Chapter II: Compare how effective exome capture probes, genotyping by sequencing and whole-genome re-sequencing are for assessing genetic diversity in Norway spruce.

Chapter III: Evaluate genetic diversity in pristine old-forest stands and compare it to young re-planted stands of Norway spruce.

Chapter IV: Perform a literature review of the current knowledge on the genetic effects of clonal forestry and use basic population genetic models to assess the possible impacts of clonal forestry on genetic and genotypic diversity at the stand level.

Materials and Methods

This thesis consists of two papers based on largely the same data set (chapter I and II), one paper based on a large number of sequenced pools (chapter III) and one paper based on a literature review and modelling (chapter IV).

Study area

The majority of spruce samples used in this thesis were collected from 15 different locations in northern Sweden (Figure 1), 14 in Västerbotten and one in Västernorrland, making up a total of 2250 trees. The remaining samples, used in chapter I and II, were collected from Norway spruce in Russia, Belarus, Poland, Romania, Norway, Finland and central Sweden.

Europe and central Sweden

Samples were taken as dormant buds or fresh needles and stored at -80°C until DNA extraction. More information on the sample collection and handling can be found in Bernhardsson et al. (2019) and Baison et al. (2019). These trees were genotyped using whole-genome re-sequencing (chapter I) or using capture probes (chapter I, Vidalis et al., 2018).

Northern Sweden

In collaboration with Länsstyrelsen Västerbotten, SCA Skog AB and Sveaskog, samples were collected from 14 different areas in Västerbotten and one area in Västernorrland, spanning from the east coast near the Baltic Sea to the mountain region of Västerbotten close to the Norwegian border in the west. The sampling design was based on a paired sampling set-up where we initially identified old (minimum 150 years) and

Figure 1: Map of Sweden and the 15 sampling sites used in chapter II and III. Sites 2 and 12 are also used in chapter I.

provided by Länsstyrelsen Västerbotten. Once the old stands had been identified we used information provided by SCA Skog AB and Sveaskog to locate two young and recently planted stands (<20 years old) that were located in close vicinity of the old stand. From all stands, fresh buds were taken from 50 random trees using transect sampling and stored on ice until they were brought back to the lab where we kept them at -80°C until DNA extraction. The location of all sampling sites is depicted in Figure 2.

Figure 2: Map of Västerbotten and Västernorrland and the 15 sampling sites used in chapter II and III. Sites 2 and 12 are also used in chapter I.

Lab work

For the whole-genome re-sequencing data, DNA extraction was made using a Qiagen plant mini kit (Qiagen, Hilden, Germany) and sequencing was performed at the National Genomics Infrastructure platform (NGI) at SciLifeLab facilities, Stockholm, Sweden. Sequencing was done using paired-end libraries with an insert size of 500 bp. Each library was sequenced using either Illumina HiSeq 2000 or Illumina HiSeq X instruments.

The capture probe data was based on 517 maternal trees derived from Skogforsk’s southern breeding population located in Ekebo, Sweden (Baison et al. 2019).

DNA was extracted from all samples using a Qiagen Plant DNA mini extraction kit and subsequently sent to RAPiD Genomics for sequencing (http://rapid-genomics.com). Sequence capture was performed using the 40 018 diploid probes designed from the Norway spruce v1.0 genome assembly (Nystedt et al., 2013) and evaluated for capture efficiency (Vidalis et al., 2018). After sequence capture, sequencing libraries were constructed following Agilent’s SureSelect Target Enrichment protocol (Agilent Technologies, https://www.agilent.com/) and sequenced using an Illumina HiSeq 2500 instrument.

Genotyping-by-sequencing was performed on a total of 1350 of the individuals collected in Västerbotten, Sweden. Here an Omega Bio-tek E-Z 96 plant kit (OMEGA Bio-Tek, Norcross, GA, USA) was used to extract DNA according to the instructions. DNA was extracted from 30 samples from each of the 45 different populations in northern Sweden, and the DNA concentration was measured on a Qubit (ThermoFisher Scientific). A detailed description of the method used for library preparation can be found in Pan et al. (2015). We made a few minor adjustments: a total amount of 200 ng DNA from each individual was pooled into 45 stand pools.

DNA digestion and ligation were performed in 50 µl reagent systems using the restriction enzyme PstI (New England BioLab, Woburn, MA, USA). We used five different adaptors that were ligated to the 45 stand pools, and nine superpools were then created, each combining 5 stands with different barcodes. All superpools were purified using a QIAquick PCR purification Kit (Qiagen, Hilden, Germany) and DNA concentrations were again measured on a Qubit (ThermoFisher Scientific). In a PCR step all nine superpools were amplified and then purified again. An E-gel Size-Select II pre-cast gel (ThermoFisher Scientific) was used together with 100 µl sample from each superpool. We targeted fragments in the range 350-450 bp (which include 125-132 bp barcodes and sequencing adapters). After approximately 20 minutes the targeted fragment size was cut from the gel. Using a QIAquick Gel Extraction Kit (Qiagen, Hilden, Germany), the resulting 9 libraries of 40 µl were created. After quantification the libraries were paired-end sequenced (2x150 bp) on an Illumina HiSeq X by Novogene (Hong Kong) with one HiSeq X lane per library, resulting in over 120 Gbp raw sequencing data per superpool.
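The pooling layout described above (45 stand pools, five barcoded adaptors, nine superpools of five stands each) can be written down as a small bookkeeping sketch. Which stand went into which superpool is not stated in the text, so the sequential assignment below, the stand numbering and the function name are hypothetical and serve only to show that five barcodes are enough to keep 45 stands distinguishable.

```python
def assign_superpools(stand_ids: list[int],
                      n_barcodes: int = 5) -> dict[int, tuple[int, int]]:
    """Map each stand to a (superpool, barcode) pair.

    Stands are filled into superpools sequentially (a hypothetical
    scheme); within a superpool every stand gets a different barcode.
    """
    if len(stand_ids) % n_barcodes != 0:
        raise ValueError("number of stands must be a multiple of n_barcodes")
    layout = {}
    for index, stand in enumerate(stand_ids):
        superpool = index // n_barcodes + 1
        barcode = index % n_barcodes + 1
        layout[stand] = (superpool, barcode)
    return layout


if __name__ == "__main__":
    layout = assign_superpools(list(range(1, 46)))
    n_superpools = len({superpool for superpool, _ in layout.values()})
    print(f"{len(layout)} stands in {n_superpools} superpools")
    print("stand 7 ->", layout[7])
```

Because every (superpool, barcode) pair is unique, reads can be traced back to a stand from the sequencing lane and the barcode alone.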

Genotyping

More information about the sequencing and SNP calling of our whole-genome re-sequencing data used in chapter I and chapter II can be found in Bernhardsson

sequence reads against the reference genome v.1.0 of P. abies (Nystedt et al. 2013). The genome was reduced by filtering out scaffolds shorter than 1 kb (Li et al. 2009), keeping roughly 9.9 Gbp of the 18 Gbp Norway spruce genome for further analyses. All PCR duplicates were marked using MarkDuplicates in Picard v2.0.1 (http://broadinstitute.github.io/picard/) to eliminate duplicate fragments generated by the PCR amplification step during library preparation. After mapping, GATK HaplotypeCaller was used to call SNPs. The raw variants were filtered to only include biallelic SNPs that met five pre-set criteria. All SNPs had to be positioned >5 bp away from an indel, they had to fulfil GATK’s quality parameter recommendations for hard filtering (https://gatk.broadinstitute.org/hc/en-us/articles/360035890471-Hard-filtering-germline-short-variants), they had to have a read depth between 6–30× and, finally, they had to have a p-value for excess of heterozygosity greater than 0.05.
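The filtering criteria listed above can be expressed as a single predicate. The sketch below is an illustration only and is not the pipeline used in chapter I: the record layout (a plain dictionary with the keys `pos`, `ref`, `alts`, `depth`, `exc_het_p` and `gatk_pass`) is hypothetical, the GATK hard-filter outcome is assumed to have been computed beforehand, and depth is treated as one value per site, which may differ from how depth was evaluated in the original analysis.

```python
def passes_filters(snp: dict, indel_positions: list[int],
                   min_depth: int = 6, max_depth: int = 30,
                   min_exc_het_p: float = 0.05,
                   indel_margin: int = 5) -> bool:
    """Return True if a variant record meets all filtering criteria."""
    # Biallelic SNP: exactly one alternative allele, single bases only.
    alts = snp["alts"]
    if len(alts) != 1 or len(snp["ref"]) != 1 or len(alts[0]) != 1:
        return False
    # Must lie more than `indel_margin` bp away from every indel.
    if any(abs(snp["pos"] - p) <= indel_margin for p in indel_positions):
        return False
    # Must have passed the GATK hard-filter recommendations (precomputed).
    if not snp["gatk_pass"]:
        return False
    # Read depth must lie within the accepted window.
    if not min_depth <= snp["depth"] <= max_depth:
        return False
    # No significant excess of heterozygosity.
    return snp["exc_het_p"] > min_exc_het_p


if __name__ == "__main__":
    example = {"pos": 100, "ref": "A", "alts": ["G"],
               "depth": 12, "exc_het_p": 0.4, "gatk_pass": True}
    print(passes_filters(example, indel_positions=[200]))  # far from any indel
    print(passes_filters(example, indel_positions=[104]))  # too close to an indel
```

Applying the cheap structural checks first and the statistical check last keeps the function easy to read; the order does not change which records pass.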

All capture probe data used in chapter II was extracted from the whole-genome re-sequencing data using positions of the probes designed by Vidalis et al. (2018) with BEDTools v2.26.0 (Quinlan, 2014). SNP data included extended probe regions: 120 bp probes plus 100 bp on either side, for a total of 320 bp per probe.
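The arithmetic behind the extended probe regions can be sketched as an interval operation on BED-style coordinates (0-based, half-open). This is a minimal illustration with a hypothetical function name and made-up coordinates; it is similar in spirit to padding intervals with a tool such as `bedtools slop`, but whether that tool was used here is not stated in the text.

```python
from typing import Optional


def extend_probe(start: int, end: int, flank: int = 100,
                 scaffold_length: Optional[int] = None) -> tuple[int, int]:
    """Extend a probe interval by `flank` bp on both sides.

    Coordinates are 0-based and half-open. The result is clipped at the
    scaffold start and, if `scaffold_length` is given, at the scaffold end.
    """
    if start < 0 or end <= start:
        raise ValueError("expected 0 <= start < end")
    new_start = max(0, start - flank)
    new_end = end + flank
    if scaffold_length is not None:
        new_end = min(new_end, scaffold_length)
    return new_start, new_end


if __name__ == "__main__":
    # A 120 bp probe in the middle of a scaffold becomes a 320 bp region.
    region = extend_probe(1000, 1120)
    print(region, region[1] - region[0])
    # Near a scaffold edge the extended region is shorter than 320 bp.
    print(extend_probe(50, 170, scaffold_length=200))
```

The second call shows why the extended regions can be shorter than 320 bp for probes that sit close to the ends of short scaffolds.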

In our examination of population structure in Norway spruce we also included previously published data (Baison et al., 2019) from 517 individuals from southern Sweden genotyped with the capture probes.

The raw sequencing reads from our GBS libraries (chapter III) were quality checked using FastQC v0.11.8 (https://www.bioinformatics.babraham.ac.uk/projects/fastqc/). After trimming adaptors with Trimmomatic v0.36 (Bolger et al. 2014), the data was demultiplexed using the ‘process_radtags’ routine from Stacks v2.2 (Catchen et al. 2011). All reads were mapped against the reference genome of P. abies v1.0 (Nystedt et al. 2013) using BWA-MEM with default parameters. To identify genomic regions common to all 45 stands and where there was enough sequencing coverage to ensure accurate SNP calling, BAM files for all stand pools were intersected using the multiinter tool from BEDTools v2.26.0 (Quinlan 2014). SNP calling was then performed on all genomic regions common to the 45 stands for each pool separately using Crisp (https://github.com/vibansal/crisp, Bansal 2010). SNP calling was performed assuming a pool size of 60 haploid genomes, or 30 diploid samples, for each stand. After filtration only biallelic SNPs (‘VT=SNV’) that also passed Crisp’s quality filters (‘FILTER=PASS’) were kept.
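The idea of restricting SNP calling to regions shared by all stand pools can be illustrated with a small sweep over interval boundaries. This is a conceptual sketch, not a reimplementation of the BEDTools multiinter tool: it handles a single scaffold, assumes that each pool's covered intervals are half-open and non-overlapping, and only models presence of coverage rather than a minimum depth. The function name and the example intervals are hypothetical.

```python
def common_regions(pools: list[list[tuple[int, int]]]) -> list[tuple[int, int]]:
    """Return the intervals covered by every pool.

    `pools` holds one list of (start, end) half-open intervals per pool;
    intervals within a pool must not overlap each other.
    """
    events = []
    for intervals in pools:
        for start, end in intervals:
            events.append((start, 1))   # coverage begins
            events.append((end, -1))    # coverage ends
    # At equal positions, ends (-1) sort before starts (+1), so intervals
    # that merely touch are not counted as overlapping.
    events.sort()

    shared = []
    covering = 0
    region_start = None
    for position, change in events:
        covering += change
        if change == 1 and covering == len(pools):
            region_start = position
        elif change == -1 and covering == len(pools) - 1 and region_start is not None:
            if position > region_start:
                shared.append((region_start, position))
            region_start = None
    return shared


if __name__ == "__main__":
    pool_a = [(0, 100), (200, 300)]
    pool_b = [(50, 250)]
    pool_c = [(80, 220)]
    print(common_regions([pool_a, pool_b, pool_c]))
```

With 45 pools the same logic applies: a position is kept only while the running count of covering pools equals the number of pools.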

To be able to create comparable data sets for the analyses presented in chapter II, whole-genome data was extracted for the genomic regions corresponding to the GBS data set using BEDTools v2.26.0 (Quinlan, 2014).


Data analysis

For chapters II and III most analyses were performed using RStudio (RStudio Team, 2020). In both chapters population structure was analysed using a principal component analysis (PCA). In chapter II we based it on a relatedness matrix calculated using the ‘relatedness2’ option in VCFtools (Danecek et al., 2011) and in chapter III we based it on the allele frequency matrix across all populations. Both PCAs were created with the function ‘prcomp’ in R v3.4.3 (R Core Team 2017). In chapter II we calculated Tajima’s D (Tajima, 1989) and pairwise theta (Korneliussen et al., 2013) using ANGSD v0.921 (Korneliussen et al., 2014) and used VCFtools to calculate FST values between the populations. In chapter III several common diversity statistics were calculated with custom-made functions in R v3.4.3 (R Core Team 2017), and t-tests and F-tests (the latter using the var.test function) were used to compare the means and variances of the different summary statistics, respectively. We also used the R package OutFLANK in chapter III to calculate FST values among populations. The distance between the stands based on GIS coordinates was calculated using the ‘geodist’ v0.0.6 package (https://github.com/hypertidy/geodist).
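To make the relationship between Tajima’s D and pairwise theta concrete, the statistic can be sketched in Python from hard-called allele counts. This is only an illustration of the classical formula (Tajima, 1989): ANGSD works from genotype likelihoods rather than called genotypes, and the function name `tajimas_d` and its input format are assumptions made here.

```python
import math


def tajimas_d(allele_counts, n):
    """Tajima's D from per-site allele counts in a sample of n chromosomes.

    allele_counts: for each site, the number of chromosomes carrying one of
    the two alleles. Sites that are not segregating are ignored.
    """
    seg = [k for k in allele_counts if 0 < k < n]
    s = len(seg)
    if s == 0 or n < 4:
        return float("nan")  # undefined (zero variance term for n < 4)

    # Pairwise theta: mean number of pairwise differences, summed over sites.
    pi = sum(2 * k * (n - k) for k in seg) / (n * (n - 1))

    a1 = sum(1 / i for i in range(1, n))
    a2 = sum(1 / i ** 2 for i in range(1, n))
    b1 = (n + 1) / (3 * (n - 1))
    b2 = 2 * (n ** 2 + n + 3) / (9 * n * (n - 1))
    c1 = b1 - 1 / a1
    c2 = b2 - (n + 2) / (a1 * n) + a2 / a1 ** 2
    e1 = c1 / a1
    e2 = c2 / (a1 ** 2 + a2)

    # Difference between pairwise theta and Watterson's theta, standardised.
    return (pi - s / a1) / math.sqrt(e1 * s + e2 * s * (s - 1))


# An excess of rare variants gives negative D,
# an excess of intermediate-frequency variants gives positive D.
print(tajimas_d([1, 1, 1, 1, 1], n=10))
print(tajimas_d([5, 5, 5, 5, 5], n=10))
```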


Main results and discussion

Chapter I

In chapter I we establish a ‘best practice’ pipeline for variant calling of NGS data in Norway spruce. To be able to utilise NGS data for e.g. population genetic analyses, accurate calling of SNPs and genotypes is required. As the Norway spruce genome is very large (~20 Gb) and contains sizable fractions of repetitive material (at least 70%, Nystedt et al. 2013), this presents a formidable challenge for using NGS data. In chapter I we highlight possible problems and pitfalls of genotyping individuals using NGS data in Norway spruce and describe methods that can be applied to minimize or eliminate possible sources of errors. We also show how different genomic regions, such as repeats, gene regions and exons, are affected by applying different hard filtering parameters.

The large and complex nature of the Norway spruce genome presents challenges for manipulating genome-wide sequencing data using existing software. The main limiting factor for the current Norway spruce draft genome (v1.0) is that the number of genomic scaffolds in the assembly, ~10M unique scaffolds, is substantially greater than what existing software can handle. In order to efficiently analyse the data and to reduce the computational complexity of the SNP calling pipeline, we first filtered out all scaffolds shorter than 1 kb and then subsetted the remaining data into 20 smaller data sets, each consisting of ~200,000 scaffolds. The smaller size of these subsets allowed us to process the data in parallel using existing software tools, such as GATK (Figure 3).
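The length filter and subsetting step can be sketched as follows. The text does not state how scaffolds were assigned to subsets, so the round-robin assignment used here, the function name `subset_scaffolds` and the toy scaffold names are assumptions for illustration only.

```python
def subset_scaffolds(scaffold_lengths, min_length=1000, n_subsets=20):
    """Drop scaffolds shorter than min_length and split the rest into subsets.

    scaffold_lengths: dict mapping scaffold name to length in bp.
    Round-robin assignment is an assumption; it keeps subset sizes within
    one scaffold of each other.
    """
    kept = [name for name, length in scaffold_lengths.items()
            if length >= min_length]
    return [kept[i::n_subsets] for i in range(n_subsets)]


# Toy example: 50 hypothetical scaffolds of 500 to 5,400 bp.
lengths = {f"scaffold_{i}": 500 + 100 * i for i in range(50)}
subsets = subset_scaffolds(lengths)
print(len(subsets), sum(len(s) for s in subsets))
```

Each subset can then be handed to a separate variant calling job, which is what makes parallel processing possible.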

Following variant calling, the raw output was filtered in several steps using several hard filtering criteria. In particular we kept only biallelic SNPs positioned >5 bp away from an indel where the SNP quality parameters fulfilled GATK recommendations for hard filtering. A common source of errors is read mapping to repetitive regions that are collapsed in the reference genome. To minimize the impact of this we utilised strict filtering based on sequencing depth, keeping only sites with read depths in the range 8-20x. Finally, we removed all SNPs that showed an excess of heterozygosity, as such SNPs could also be indicative of collapsed regions in the assembly.
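The site-level criteria described above can be combined into a single check, sketched here in Python. The function name `passes_site_filters` is hypothetical, indels are simplified to single positions, and the p-value threshold of 0.05 for excess heterozygosity is taken from the Methods section; the GATK quality parameters are not reproduced in this sketch.

```python
def passes_site_filters(pos, depth, indel_positions, excess_het_p,
                        min_indel_distance=5, depth_range=(8, 20),
                        min_excess_het_p=0.05):
    """Return True if a SNP passes the distance, depth and heterozygosity filters."""
    # The SNP must lie more than min_indel_distance bp from every indel.
    far_from_indels = all(abs(pos - indel) > min_indel_distance
                          for indel in indel_positions)
    # Very high depth can indicate collapsed repeats, very low depth unreliable calls.
    depth_ok = depth_range[0] <= depth <= depth_range[1]
    # A small p-value signals excess heterozygosity, another sign of collapsed regions.
    no_excess_het = excess_het_p > min_excess_het_p
    return far_from_indels and depth_ok and no_excess_het


print(passes_site_filters(pos=100, depth=12, indel_positions=[120], excess_het_p=0.4))
print(passes_site_filters(pos=100, depth=35, indel_positions=[120], excess_het_p=0.4))
```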


Figure 3: Variant calling pipeline for NGS data in Norway spruce.

Chapter II

In chapter II we examined the results from two different complexity reduction genotyping techniques to assess if these are comparable and if they can be used to genotype a large number of individuals to assess genetic diversity and population structure. Based on an original data set consisting of 33 Norway spruce trees that had been subjected to whole-genome re-sequencing, we could extract known genomic regions corresponding to target capture probes and GBS regions. We used 40,018 probe regions, covering approximately 2.3%, and 8731 GBS regions, covering 0.7%, of the available genome assembly from Norway spruce. We show that we can reliably estimate population structure based on the probe regions for our samples and a large number of samples originally genotyped by Baison et al. (2019). The population structure (figure 4) mirrors geography well, with northern Sweden and Norway separating from the rest and with Finland also showing a tendency to be separated. Central Europe and southern Sweden cluster together, indicating a large amount of material transfers from Central Europe to southern Sweden (Chen et al. 2019). It is interesting to note that the old samples from Västerbotten in northern Sweden also separate into a mountain cluster and a coast cluster.

[Figure 3 flowchart. Steps: pre-processing of raw reads; read mapping (BWA-MEM); reduce genome to scaffolds >1 kb; subset BAMs (Subset 1 to Subset 20); remove PCR duplicates (Picard); local realignment (GATK); variant calling (GATK HaplotypeCaller); VCF file. Annotations: 10M scaffolds, 12.6 Gb; 2M scaffolds, 9.9 Gb; 100k scaffolds per subset, 0.16-2.64 Gb.]


Figure 4: Population structure based on probe regions

We then examined Tajima’s D and pairwise theta for both genomic subsets (capture probes and GBS) and also assessed variation across the entire scaffolds targeted by the capture probes or GBS. The overall results show no systematic difference in Tajima’s D for capture probe data (figure 5) and only a slightly higher variation in pairwise theta for the capture probe bearing scaffolds compared to the extended probe regions. We found a similar result for the GBS data (figure 6), with one difference being fewer extreme values in both GBS regions and GBS scaffolds. There is also a clear difference in the number of data points between figure 5 and figure 6, simply because the GBS data set contains fewer regions compared to the capture probe data set.

The four populations from northern Sweden showed overall weak population differentiation with no major differences regardless of what data set we used for the analyses.

[Figure 4 plot. Title: Population structure in probe regions. Axes: PC1 (14.3 %) and PC2 (1.2 %). Legend: North Sweden and Norway; North Sweden old; Central Sweden (reference genome); Southern Sweden; Finland; Central Europe.]


Figure 4: Tajima’s D and pairwise theta (nucleotide diversity) in the Probe data. Red indicates the highest number of data points and blue the lowest number.

Figure 5: Tajima’s D and pairwise theta (nucleotide diversity) in the GBS data. Red indicates the highest number of data points and blue the lowest number.

All things considered it appears that the capture probe data is slightly more effective in measuring variation in diversity across the genome, mainly since you get more control in your complexity reduction protocol compared to the GBS data, where restriction enzymes are used for complexity reduction. Polymorphisms in restriction enzyme target sites can effectively eliminate cut sites, and size selection of fragments after digestion can also cause problems, since this can affect the subset of genomic fragments available in the final data set for different individuals. We did achieve the same general result when estimating the FST between populations independent of data choice, suggesting that estimates of population differentiation are less susceptible to variation due to genotyping method. In conclusion we believe that both GBS and capture probes are useful methods for genotyping large numbers of samples, as we cannot detect any systematic biases in either data set.

Chapter III

In chapter III we use GBS to genotype PoolSeq libraries in order to assess if and how genetic diversity and differentiation are affected by current Swedish forestry practices. To study this, we used 45 different populations sampled from northern Sweden, 15 old (>150 years) and 30 young (<20 years) planted stands. The complete data set consisted of 40,049 SNPs that we obtained after variant calling and filtering for genomic regions common across all 45 stands. In this study we estimated population structure based on allele frequencies but found no correspondence between the estimated population structure and the geographic location of the samples, regardless of the set of stands we investigated. It is possible that the overall genetic differentiation between populations was too small across our chosen area to detect a correspondence between genetic and geographic distance and that samples spanning a larger geographic area are needed to detect such patterns. Few of our statistical tests for the different summary statistics detected any significant differences in either mean or variance. The number of variable sites and the proportion of rare alleles (AF <5%) were not significantly different between old and planted stands, but as shown in figure 6, the values tend to vary more among the planted stands. Similarly, neither the number of private alleles nor the number of singletons was significantly different between old and planted stands, but these showed a similar trend with a larger variance among planted stands.
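Two of the summary statistics mentioned above can be sketched in Python from the estimated allele frequencies of a single stand. The analyses in chapter III were done with custom R functions that are not shown here, so the exact definitions below (variable sites as sites with a frequency strictly between 0 and 1, rare alleles judged by minor allele frequency and expressed as a proportion of the variable sites) and the name `stand_summary` are assumptions.

```python
def stand_summary(allele_freqs, rare_threshold=0.05):
    """Return (number of variable sites, proportion of rare alleles) for one stand.

    allele_freqs: estimated allele frequency per SNP in the pooled sample.
    """
    # A site is variable in the stand if both alleles are present.
    variable = [f for f in allele_freqs if 0.0 < f < 1.0]
    # A site carries a rare allele if its minor allele frequency is below the threshold.
    rare = [f for f in variable if min(f, 1.0 - f) < rare_threshold]
    proportion_rare = len(rare) / len(variable) if variable else float("nan")
    return len(variable), proportion_rare


# Hypothetical allele frequencies for six SNPs in one stand.
n_variable, prop_rare = stand_summary([0.0, 0.02, 0.5, 0.97, 1.0, 0.3])
print(n_variable, prop_rare)
```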


Figure 6: The number of variable sites in each stand and the proportion of rare alleles (AF <5%) in each stand. Blue dots represent the planted stands and red dots represent old stands. The horizontal black line indicates the median value and the vertical line shows how 50% of the values vary around the median.

When analysing the mean allele frequency and mean expected heterozygosity we once again found no difference in mean values between planted and old stands. However, there was a significantly larger variance among the planted stands for mean allele frequency but not for mean expected heterozygosity (figure 7).
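For a biallelic SNP with allele frequency p, expected heterozygosity is 2p(1 - p). A minimal Python sketch of the per-stand mean is given below; the function name is hypothetical and no small-sample correction is applied, which may differ from the custom R functions used in chapter III.

```python
def mean_expected_heterozygosity(allele_freqs):
    """Mean expected heterozygosity, 2p(1 - p), over biallelic SNPs."""
    if not allele_freqs:
        raise ValueError("at least one allele frequency is required")
    values = [2.0 * p * (1.0 - p) for p in allele_freqs]
    return sum(values) / len(values)


# Expected heterozygosity peaks at p = 0.5 and is zero for fixed sites.
print(mean_expected_heterozygosity([0.5, 0.5]))
print(mean_expected_heterozygosity([0.0, 1.0]))
```

Because 2p(1 - p) is flat around p = 0.5 and symmetric in p, stands can differ in their mean allele frequency while showing very similar mean expected heterozygosity.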

Genetic differentiation between the different stands within each class, planted or old, did not show any correlation with geographic distance between stands. The variance of FST among planted stands was, however, significantly greater than among old stands, and the planted stands also had a greater number of high FST values (FST >0.05).


Figure 7: Mean allele frequency in each stand and mean expected heterozygosity in each stand. Blue dots represent the planted stands and red dots represent old stands. The horizontal black line indicates the median value and the vertical line shows how 50% of the values vary around the median.

Chapter IV

In chapter IV we discuss the possible effects of a large-scale implementation of clonal forestry based on a literature overview of the subject and a number of quantitative models. From our modelling it is clear that clones have a bigger impact on genotypic diversity than on genetic diversity. Genetic diversity is simply a function of the number and frequency of alleles, while genotypic diversity measures the number and frequency of unique multilocus genotypes. The number of trees needed to capture sufficiently large fractions of the baseline genetic diversity and heterozygosity is substantially lower than the number needed to preserve genotypic diversity. The risk of losing a rare allele due to sampling effects is also greatly dependent on the sample size. Assuming that the two extreme choices are either clonally propagated trees or sexually produced offspring from cloned trees, you rapidly lose genotypic diversity when deploying clones compared to using their offspring. Key factors to consider for determining the effects of clonal forestry are the level of genetic diversity contained among the selected parents, the number of clones deployed (figure 8), the number of trees from the same clone that will be planted and the area over which they will be planted. It is also important to consider exactly how genetic diversity is measured, as rare variants are easy to lose but have minimal impact on genetic diversity in the population, whereas common alleles, which have the greatest effect on genetic diversity, are easy to capture even with relatively limited sampling. Taken together, it is important to monitor both genetic and genotypic diversity when deploying clonal material and also to compare these estimates to naturally occurring stands to better understand how clonal forestry might impact the genetic diversity and differentiation of production stands.
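The dependence of allele loss on sample size can be illustrated with a simple sampling calculation in Python. This sketch assumes that the selected trees are unrelated diploids drawn at random from a large population, so that an allele with frequency p is absent from N trees with probability (1 - p)^(2N); this is a textbook approximation and not necessarily the model used in chapter IV. The function name and the example values are illustrative.

```python
def prob_allele_lost(freq, n_trees, ploidy=2):
    """Probability that an allele with frequency `freq` is absent from a sample.

    Assumes random sampling of unrelated individuals from a large population.
    """
    return (1.0 - freq) ** (ploidy * n_trees)


# A rare allele (1%) is easily lost in small samples,
# whereas a common allele (50%) is captured by only a handful of trees.
for n in (2, 10, 25, 125):
    rare = prob_allele_lost(0.01, n)
    common = prob_allele_lost(0.5, n)
    print(f"N={n}: P(lose rare)={rare:.3f}, P(lose common)={common:.3g}")
```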

Figure 8: (A) Fraction of original genetic diversity maintained as a function of number of clones deployed per site. (B) Genetic diversity maintained relative to what would be maintained in a situation with no clonal forestry. More details can be found in chapter IV.

[Figure 8 panels. (A) x-axis: Number of unique clones per population (0 to 50); y-axis: Relative genetic diversity (0.0 to 1.0). (B) x-axis: Proportion of sites replanted using clones (0.0 to 1.0); y-axis: Relative genetic diversity. Curve labels: n=125, n=25, n=10, n=2.]


Conclusions and future research

With improved sequencing and variant calling techniques we are now able to assess genetic diversity in some of the most complex plant genomes available, such as conifers, with better and more accurate results. Using complexity reduction methods for genotyping we can obtain good results when assessing common diversity measurements. The main benefit of complexity reduction methods, compared to whole-genome sequencing, is that you are able to assess genetic diversity using a substantially greater number of individuals while still being able to assess a relatively large fraction of the genome of an organism.

Using capture probes allows you to target a larger number of genomic regions and usually gives a more consistent result across many individuals, with less missing data, compared to genotyping by sequencing, where large fractions of missing data are a known problem with the method. Using PoolSeq allows us to genotype an even greater number of individuals than individual genotyping, using either capture probes or GBS, would allow. The PoolSeq approach comes with the cost of not being able to assess individual genotypes, meaning that you can only describe genetic variation using aggregate statistics at the population level.

Our results from comparing old and planted stands agree with earlier studies that have been performed using fewer genetic markers. The results from this thesis therefore suggest that seedlings deployed 20 years ago have not had any major effects on the genetic diversity in Norway spruce in the area of northern Sweden we have studied. Forestry practices and re-planting of material derived from seed orchards can nevertheless have long-term effects in the form of changes in the distribution of genotypes and genetic diversity among stands. Also, depending on the rate, area and number of clones deployed in a possible clonal forestry scenario, the risk of losing genetic and genotypic diversity can increase substantially. On the other hand, if these things are considered, it is possible to design a clonal plantation that has minimal impact on the genetic diversity, especially if clones are derived from seeds generated by crossing genetically superior individuals, so-called family forestry.

Future research should focus on assessing the amount and distribution of genetic diversity across larger geographical areas in order to identify whether certain areas are at risk of losing genetic diversity and to what extent old, non-managed forest stands can help maintain novel alleles. The larger variation I have observed among planted stands also begs the question of whether planting alters the distribution of genotypes within stands. By examining the spatial distribution of genotypes within a stand we can assess how planting patches of specific genotypes or genetic diversity compares to naturally regenerated stands. To further assess natural vs managed ‘dispersal’ it would also be interesting to compare maternally inherited genetic markers, such as mitochondrial DNA, and paternally inherited genetic markers, such as chloroplast DNA. Finally, one thing this thesis has not examined is the importance of Norway spruce as a keystone species in the landscape (Aguilar and Boecklen, 1992, Fritz and Price, 1988, Messina et al., 1996, Whitham et al., 2003). In the future this should be considered when planning more experiments, since the signs of a healthy forest also manifest themselves in the surrounding ecosystem. Such studies would examine not only genetic diversity at the tree level but also how this diversity, at the genetic or genotypic level, impacts the communities of dependent organisms such as arthropods, lichens and mosses. Studies in other ‘keystone’ species have shown that changes in genetic diversity in foundation species can have cascading effects on the associated communities.


Acknowledgement

This work was funded by grants from the Swedish Foundation for Strategic Research (SSF Grant No. RBP14-0040) and the Knut and Alice Wallenberg Foundation. All analyses were performed on resources provided by the Swedish National Infrastructure for Computing (SNIC) at Uppsala Multidisciplinary Centre for Advanced Computational Science (UPPMAX) under the projects SNIC 2017/1-438, SNIC 2018/3-529, SNIC 2019/3-555 and uppstore2017066.


References

Adams, W. T., & Burczyk, J. (2000). Magnitude and implications of gene flow in gene conservation reserves. Forest Conservation Genetics: Principles and Practice, 215-244.

Aguilar, J. M., & Boecklen, W. J. (1992). Patterns of herbivory in the Quercus grisea× Quercus gambelii species complex. Oikos, 64(3), 498-504.

Baison, J., Vidalis, A., Zhou, L., Chen, Z. Q., Li, Z., Sillanpää, M. J., Bernhardsson, C., Scofield, D., Forsberg, N., Grahn, T., et al. (2019). Genome-wide association study identified novel candidate loci affecting wood formation in Norway spruce. The Plant Journal, 100(1), 83-100.

Bansal, V. (2010). A statistical method for the detection of variants from next-generation resequencing of DNA pools. Bioinformatics, 26(12), i318-i324.

Bernhardsson, C., Vidalis, A., Wang, X., Scofield, D. G., Schiffthaler, B., Baison, J., Street, N. R., García-Gil, M. R., & Ingvarsson, P. K. (2019). An ultra-dense haploid genetic map for evaluating the highly fragmented genome assembly of Norway spruce (Picea abies). G3: Genes, Genomes, Genetics, 9(5), 1623-1632.

Binney, H. A., Willis, K. J., Edwards, M. E., Bhagwat, S. A., Anderson, P. M., Andreev, A. A., ... & Kremenetski, K. V. (2009). The distribution of late-Quaternary woody taxa in northern Eurasia: evidence from a new macrofossil database. Quaternary Science Reviews, 28.23, 2445-2464.

Black-Samuelsson, S. (2012). The state of forest genetic resources in Sweden- report to FAO. Skogsstyrelsen, rapport, 12, 2012.

Black-Samuelsson, S., Eriksson, A., & Bergqvist J. (2020). The state of the world´s forest genetic resources. Country report Sweden. In English with a Swedish abstract. Skogsstyrelsen Rapport 2020/3.

Bolger, A. M., Lohse, M., & Usadel, B. (2014). Trimmomatic: a flexible trimmer for Illumina sequence data. Bioinformatics, 30(15), 2114-2120.

Burczyk, J., Lewandowski, A., & Chalupka, W. (2004). Local pollen dispersal and distant gene flow in Norway spruce (Picea abies [L.] Karst.). Forest


Catchen, J. M., Amores, A., Hohenlohe, P., Cresko, W., & Postlethwait, J. H. (2011). Stacks: building and genotyping loci de novo from short-read sequences. G3: Genes, Genomes, Genetics, 1(3), 171-182.

Chambers, G. K., & MacAvoy, E. S. (2000). Microsatellites: consensus and controversy. Comparative Biochemistry and Physiology Part B: Biochemistry and Molecular Biology, 126(4), 455-476.

Chen, J., Källman, T., Ma, X., Gyllenstrand, N., Zaina, G., Morgante, M., ... & Lagercrantz, U. (2012). Disentangling the roles of history and local selection in shaping clinal variation of allele frequencies and gene expression in Norway spruce (Picea abies). Genetics, 191(3), 865-881.

Chen, J., Li, L., Milesi, P., Jansson, G., Berlin, M., Karlsson, B., Aleksic, J., Vendramin, G. G., & Lascoux, M. (2019). Genomic data provide new insights on the demographic history and the extent of recent material transfers in Norway spruce. Evolutionary applications, 12(8), 1539-1551.

Danecek, P., Auton, A., Abecasis, G., Albers, C. A., Banks, E., DePristo, M. A., ... & 1000 Genomes Project Analysis Group. (2011). The variant call format and VCFtools. Bioinformatics, 27(15), 2156-2158.

Davey, J. W., Hohenlohe, P. A., Etter, P. D., Boone, J. Q., Catchen, J. M., & Blaxter, M. L. (2011). Genome-wide genetic marker discovery and genotyping using next-generation sequencing. Nature Reviews Genetics, 12(7), 499-510.

Doebley, J. (1989). Isozymic evidence and the evolution of crop plants. In: Isozymes in plant biology (pp. 165-191). Springer Netherlands.

Doebley, J. F., Gaut, B. S., & Smith, B. D. (2006). The molecular genetics of crop domestication. Cell, 127(7), 1309-1321.

El-Kassaby, Y. A. (1991). Genetic variation within and among conifer populations: review and evaluation of methods. Biochemical markers in the population genetics of forest trees, pp. 61-76.

Ellstrand, N. C., & Elam, D. R. (1993). Population genetic consequences of small population size: implications for plant conservation. Annual review of Ecology and Systematics, 24, 217-242.


Elshire, R. J., Glaubitz, J. C., Sun, Q., Poland, J. A., Kawamoto, K., Buckler, E. S., & Mitchell, S. E. (2011). A robust, simple genotyping-by-sequencing (GBS) approach for high diversity species. PLoS One, 6(5), e19379.

Eriksson, G., Schelander, B., & Åkebrand, V. (1973). Inbreeding depression in an old experimental plantation of Picea abies. Hereditas, 73(2), 185-193.

Eyre-Walker, A., Gaut, R. L., Hilton, H., Feldman, D. L., & Gaut, B. S. (1998). Investigation of the bottleneck leading to the domestication of maize. Proceedings of the National Academy of Sciences of U.S.A., 95(8), 4441-4446.

Farjón, A. (1990). Pinaceae: drawings and descriptions of the genera Abies, Cedrus, Pseudolarix, Keteleeria, Nothotsuga, Tsuga, Cathaya, Pseudotsuga, Larix and Picea. Königstein: Koeltz Scientific Books.

Fritz, R. S., & Price, P. W. (1988). Genetic variation among plants and insect community structure: willows and sawflies. Ecology, 69(3), 845-856.

Gnirke, A., Melnikov, A., Maguire, J., Rogov, P., LeProust, E. M., Brockman, W., ... & Gabriel, S. (2009). Solution hybrid selection with ultra-long oligonucleotides for massively parallel targeted sequencing. Nature biotechnology, 27(2), 182-189.

González-Martínez, S. C., Krutovsky, K. V., & Neale, D. B. (2006). Forest-tree population genomics and adaptive evolution. New Phytologist, 170(2), 227-238.

Gregorius, H. R. (1991). Gene conservation and the preservation of adaptability. Species conservation: a population-biological approach. Birkhäuser Basel. pp. 31-47.

Hallsby, G. (2013). Skogsskötselserien – Plantering av barrträd. Skogsstyrelsen.

Hewitt, G. M. (1999). Post-glacial re-colonization of European biota. Biological journal of the Linnean Society, 68(1-2), 87-112.

Hirsch, C. D., Evans, J., Buell, C. R., & Hirsch, C. N. (2014). Reduced representation approaches to interrogate genome diversity in large repetitive plant genomes. Briefings in functional genomics, 13(4), 257-267.
