Evaluation of de novo assembly using PacBio long reads

(1)

Evaluation of de novo assembly using PacBio long reads

Huiwen Che

Degree project inbioinformatics, 2016

Examensarbete ibioinformatik 15 hp tillmasterexamen, 2016 Biology Education Centre and SciLifeLab, Uppsala University Supervisor: Adam Ameur

(2)

(3)

Abstract

New sequencing technologies show promise for the construction of complete and accurate genome sequences, by a process called de novo assembly that joins reads by overlap to longer contiguous sequences without the need for a reference genome. High-quality de novo assembly leads to better understanding in genetic variations. The purpose of this thesis is to evaluate human genome sequences obtained from the PacBio sequencing platform, which is a new technology suitable for de novo assembly of large genomes. The evaluation focuses on comparing sequence identity between our own de novo assemblies and the available human reference and through that, benchmark accuracy of our data. Sequences that are absent from the reference genome, are investigated for potential unannotated genes coordinately. We also assess the complex structural variation using different approaches.

Our assemblies show high consensus with the human reference genome, with ⇠ 98.6% of the bases in the assemblies mapped to the human reference. We also detect more than ten thousand of structural variants, including some large rearrangements, with respect to the reference.

I

(4)

II

(5)

Evaluation of de novo assembly using PacBio long reads

Popular Science Summary Huiwen Che

Genome sequencing is widely used nowadays to determine the structure of human DNA, to characterize the function of genes, and to ultimately decipher disease-associated genetic variations. Characterizing a genome relies a lot on the quality of the data produced from modern DNA sequencing technologies. Much of genomic research on human samples is based on comparing random DNA fragments, obtained from a DNA sequencing instrument, to the human reference genome sequence. These DNA fragments, or sequence reads, which are often relatively short, have a fundamental problem of resolving variations in complex regions of the human genome. Therefore these fragments are reconstructed to make long stretch sequences, and the process is referred to as de novo assembly. While de novo assembly is theoretically complete and becomes increasingly available, high-quality de novo assembly remains difficult due to computational challenges.

Many questions about the genome structure for a human individual cannot be elucidated without a high-quality complete assembly. Efforts are made to develop ground-breaking sequencing technologies and state-of-the-art de novo assembly approaches. In this thesis work, we make use of long reads of two samples sequenced from a new single-molecule long-read technology and assemble them into accurate and contiguous human genomes. To assess the quality of our assemblies, we compare our de novo assemblies to the human reference genome. To further characterize structural variations on these two individual genomes, we use different approaches to detect genomic variants from the genomes.

Through the analysis, our assemblies show high sequence similarity with the reference genome.

We also acquire a few megabases of novel sequences from the two individual genomes. The novel sequences may contain novel genomic elements of functional importance. Although our analysis is still short of confirming the identification of novel genes, it highlights potential novel sequences and genomic elements missing from the reference genome, which await discovery.

We compared structural variants between two individuals and results from different detection approaches. While we found overlap between results, there are notable differences between individuals and different approaches.

Degree project in bioinformatics, Master of Science (1 year), 2016 Examensarbete i bioinformatik 15 hp till masterexamen, 2016 Biology Education Center and SciLifeLab, Uppsala University Supervisor: Adam Ameur

III

(6)

Abbreviations

DNA - Deoxyribonucleic acid Indels - insertions/deletions NGS - next-generation sequencing SMRT - Single Molecule Real-Time SV - structural variation

WGS - Whole-genome sequencing

IV

(7)

1 Introduction

The advent of the first DNA sequencing technology developed by Frederick Sanger in 1977 [1], has stimulated significant opportunities in the field of genomics. Since then, advances in sequencing technologies have continuously brought revolutions. The development of so-called next-generation sequencing technologies (NGS) has been the driving force behind the recent data explosion in the field of genomics [2, 3], and it spawned efforts to sequence complete individual human genomes. Whole-genome sequencing (WGS), not only provides a comprehensive collection of an individual’s genetic variation that facilitates better understanding on genetic disease and human health, but also provides data source to discover novel genes [4].

The availability of a human reference genome has served as a principal guide for investigating variations and provided considerable insights into genetics. Recently, the genomic landscape has been influenced by another wave of projects, focused on the genetic composition in specific geographical regions. These types of projects would benefit from using a local version of the human reference instead of the universal GRCh38 [5, 6]. A local reference genome would enable more reliable estimates of the genetic variation in the population [6, 7].

Because of developments of new long-read sequencing technologies, it is now becoming fea- sible to generate novel human reference genomes with relatively high quality to a reasonable cost. Long-read sequencing is geared to break some bottlenecks of short-read sequencing technologies, which still dominate the market for human DNA sequencing. De novo assembly - the process of assembling reads to a continuous sequence without using a reference, using short reads, struggles with repeat elements and does not give good results. With longer reads, however, a number of gaps in the reference can be fully or partially closed and complex genomic elements can be largely resolved [8, 9]. It is therefore promising for long-read sequencing to deliver more complete de novo genome assemblies with fewer gaps.

Under these circumstances, a project has been initiated to create local reference sequences and to accelerate exploration of complex variations in the Swedish population. The thesis work is part of the large-scale ’1000 Swedes project’ that aims at establishing a database of genetic variants to be used as a reference population both for research projects and for clinical routine investigations of patients in the healthcare system.

1.1 Aims of the thesis project

The main goal of this thesis is to investigate the value of de novo assembly, leveraging the advantages of long-read single molecule sequencing. More specifically, the thesis focuses on (1) digging into draft genome assembly, evaluating sequence consensus to the current human reference (GRCh38) and meanwhile assessing acquired novel sequences, and (2) structural variants analysis, including detection using different methods (both long reads and assembled contigs), and comparison between different approaches and individuals.

The description of the remaining thesis sections flow as: It starts with the scientific background on this project (section 2), covering both biological and technical information. Section 3 gives the detailed profile of approaches and algorithms used in this thesis. The subsequent section presents results obtained by the work. The last section briefly touch upon the need for changes and refinements, and ends with conclusion remarks.

3

(10)

2 Background

In the ’1000 Swedes project’, 1000 individuals have been sequenced using short-read WGS on Illumina platform to an intermediate depth of 30⇥. Two of these individuals have been selected for long-read single molecule real-time sequencing on the PacBio RS II platform to high depth ( 50⇥). While striking a balance between sample size and coverage, increased read depth or sequencing coverage is expected to offset some sequencing errors and improve precision. The data used in the thesis is primarily based on long reads from these two individuals with the male sample denoted as PB1 (PacBio1) and the female sample as PB2 (PacBio2), and short reads from these two individuals are also used for validation purpose.

2.1 Single Molecule Real-Time sequencing

Sequencing has become a common tool for a wide range of genomic analyses, varying in scopes, technologies and purpose. In this project we concentrate on the newer generation of single-molecule technologies.

While NGS has transformed the way we think and improved the sequencing capabilities, the third-generation sequencing technologies make a new avenue for understanding biology. As a major player in the third-generation sequencing market, Pacific Biosciences (PacBio) developed Single Molecule, Real-Time (SMRT) sequencing technology to produce long reads, currently averaging 10 kbp. An recognition of the fact that long reads can outstrip short reads in sequence assembly and analysis motivated a movement towards SMRT technology. Without template amplification, SMRT employs sequencing-by-synthesis strategy, which enzymatic reaction is observed in real time and imaging of fluorecent signals are captured as the reaction progresses [3, 10, 11].

The improvement in read length has, without doubt, raised the possibility of generating high- quality de novo assemblies, though it comes with a much higher cost compared to short read technologies. The lower read accuracy of SMRT sequencing has been a concern. However, sufficient coverage seems to be a remedy for relatively high error rates, since the errors are randomly distributed in long reads [11]; then high coverage can help to vote out errors.

2.2 De Novo Assembly

Given the collection of reads, assemblers are needed to assemble the reads into contigs. The quality of de novo assembly is affected by the particular genome under study, sequencing method adopted, and assembly strategy used. Assembly strategy can differ in terms of assembly method (non-hybrid assembly or hybrid assembly), and the specific assembler employed.

Hybrid assembly takes reads generated from different sequencing methods to glean advantages of varying inputs. Non-hybrid assembly, in this context, PacBio-only assembly, was used in the project, as studies [12, 13] have demonstrated that with adequate coverage and robust pipeline, PacBio data alone can produce more complete and high contiguity assemblies. Long-read diploid-aware assembler - FALCON [14] was utilized to generate assemblies. The FALCON assembler streamlines the assembly process to raw reads error correction, string graph construction that fuses reads based on read overlap, assembled contigs can be further refined by phasing heterozygous variants and process to an associated tool FALCON-Unzip to generate

4

(11)

updated assemblies. Diploid genomes assembly has been a challenge and many assemblers fail to address heterozygous structural variations that disturb read overlap. The FALCON assembler makes use of long reads to integrate haplotype-phasing information. The resulting contigs from FALCON hence include primary contigs (p_contig) and alternative contigs (a_contig, alternate haplotypes that contain heterozygous structural variations) [8, 14].

2.3 Genetic Variation

Genetic variation, defined as differences between genomes, can vary from single nucleotide variants to chromosomal changes. In this thesis, structural variants, referring to genomic rearrangements greater than 50 base pair (bp) in length, are investigated.

Structural variations (SV), including large indels, duplications, inversions, and translocations, tend to change the physical structure of genomes. The sequence-based computational methods that have been applied in SV discovery are read-pair, read-depth, split-read, and sequence assembly approaches [15]. Read-pair methods evaluate the discrepancy and orientation of mapping pattern between paired reads to highlight putative structural variants. Read-depth methods identify copy number changes by read mapping density signals. Split-read approaches, at- tempting to define intra-alignment gaps, are used to capture small and medium-size structural variants. Finally, sequence assembly is able to detect all types of structural variants by using high-quality de novo assembly to campare to reference [16].

These four strategies, depending on the signals they use, retain different pros and cons in SV detection. Read-pair is widely used to find SV signatures as long as paired reads are correctly aligned to reference, but ambiguous mapping can cause problems. Read-depth, notwithstand- ing the advantage of detecting copy numbers, lacks power for breakpoint resolution. Split-read can resolve breakpoints accurately and reliably, but the methods largely depend on coverage and read length. Though sequence assembly is supposed to identify all types of SV, it gains power only when coverage is sufficiently high and the repeats are well resolved [15].

5

(12)

3 Methods

As crucial as techniques are to generate sequencing data, analytical methods are indispensable to be coordinately addressed for meaningful data interpretation. Two major parts of analyses - de novo assembly evaluation and structural variants analysis, were carried out using PacBio sequencing data from two different swedish individuals. We refrained from diving deep into the mathematical and statistical aspects of algorithms, whereas we tried to clarify analysis workflow working through this section.

3.1 Assembly evaluation

The assembly assessment is largely based on mapping contigs. Figure 1 demonstrates the workflow of this mapping strategy.

Figure 1: Mapping strategy.

3.1.1 Whole-genome alignment

Besides looking at the widely-used key metrics, which includes average contig size, number of contigs, assembled genome size, and N50 (a statistic calculating the length of smallest contig needed in a set of assembled sequences whose sum length can meet the threshold of 50% of

6

(13)

the total assembly length [17]), we draw attention to measuring the mappability of the contigs. Finding a mapping from each position in the draft assembly to its corresponding genomic position in the reference genome can evaluate the consensus quality and the mapping information can also lend itself to alignments visulization. This whole-genome scale comparison was performed using MUMmer 3.0 [18].

MUMmer 3.0 employs suffix trees based algorithm, which allows fast implementation of sequence searching and aligning. One of the main functionalities - NUCmer, from the MUMmer suite was used for whole-genome alignment, as it is well suited for mapping a set of contigs to multiple reference sequences. There are three algorithms working collectively behind the scene of NUCmer. In the first place, all maximal matches of a specified minimum length between sequences are computed, using suffix tree data structure. Secondly, consistent maximal matches are clustered into larger groups with adjustable size, gap and distance parameters. The clustering algorithm makes it more robust to tackle with sequence rearrangements, duplications and inversions, and to filter out possible spurious matches. Finally, clusters are aligned and extended to high scoring pair-wise mappings.

The output delta files generated by NUCmer were subsequently processed by the delta-filter utility to filter alignments. Due to our interest in investigating the contigs consensus in respect to the reference genome, the filtering is an effort to find the best mapping of the contigs to the reference. Therefore, repetitive contig alignments were removed, and for each contig, only the longest identical alignment was kept. To get summary information about the mapping, we further used show-coords and dnadiff programs from MUMmer to analyze delta files. ¹

3.1.2 Novel sequences

While the majority of contigs could be mapped to the reference using MUMmer, a few contigs or regions failed to map to the reference. With the unmapped sequences came the need to identify possible novel sequences and to predict their potential functions.

The unmapped sequences in MUMmer were derived from two sources - one from entirely unmapped contigs and the other from mapped contigs with insertions that can be placed nowhere in the reference. Despite the fact that some unmapped sequences might arise from sequencing errors, it was likely that the whole-genome alignment was not sensitive enough to map some of the sequences to the reference. To refine the unmapped sequences, we remapped the unmapped seqeunces to the de novo assembly of a Chinese individual HX1 from a recent study ([9]) by NUCmer. The HX1 contigs were assembled adopting the same pipeline as ours, except that HX1 has higher coverage and did further assembly polishing to reduce error rates. We aligned all unmapped contigs and unmapped insertions that are longer than 100 bp in our data to HX1 polished contigs and to HX1 non-GRCh38 sequences with increased sensitivity by changing NUCmer parameters. HX1 non-GRCh38 sequences are novel sequences identified in the Chi- nese genome study, and no less importantly, mapping to HX1 novel sequences may assist in finding shared genome sequences that are currently missing in reference. The candidates of novel sequences, taken together sequences that are still unmapped from the second round mapping to HX1 and sequences that mapped to HX1 non-GRCh38, came under spotlight for further investigations. Due to the time limit, we have only focused on novel sequences from primary contigs so far.

1See Appendix for detailed commands

7

(14)

We explored the novel sequences mainly by utilizing the MAKER2 [19] pipeline, which provides annotation perspectives on data. In our case, MAKER2 is used as an effort to do ab initio gene predictions. Followed by gene predictions, we did functional annotation on protein products of these genes by BLAST [20] and InterProScan [21] approaches, which might shed light on identifying potential novel genes. BLAST comparisons of the protein sequences to NCBI non-redundant (nr) protein database were conducted to identify the homologous gene in a different species. Because of noise in data and complexity of inference, the information of BLAST annotations of these protein sequences is, in itself, hardly illuminating for validation.

To add information to homology analysis, InterProScan, which is a software that integrates various protein signature applications to match protein sequences to potential functions, was applied to identify conserved domains. We filtered InterProScan results with 1e 6 E-value threshold and BLAST results with at least 70% sequence identity and 100 alignment score.

Combining filtered best-hit from BLAST and predictions from InterProScan may lead to a more comprehensive profile of functional predictions.

3.2 Analysis of Genomic Variants

To detect structural variants, two approaches - sequence assembly and split-read, were used.

Figure 2 is the schematic illustration of analyses.

Figure 2: Flow chart of SV detection.

3.2.1 Sequence assembly analysis

Whole-genome alignment by MUMmer paves the way to perform sequence assembly analysis. One superiority that sequences assembly approaches possess over other approaches is the

8

(15)

ability to detect novel insertions. We acquired novel insertions as a product of whole-genome alignment. When anchoring contigs to the reference genome, some inserted sequences in the contigs are not present in the reference, and the inserted sequences can be either genuine novel sequences in the contigs or mis-assembled sequences.

Other types of SVs, including insertions, deletions, tandem duplications, duplications, inversions, translocations, were also provided by MUMmer. Owing to the sequences anchor process, the alignment gaps between the assembly and the reference are identified, and MUMmer will classify all these alignment breakpoints to estimate macroscopic differences between two genomes. While delivering a set of versatile SVs, the quantification and accuracy of the char- acterizations could be rough estimations due to the fact of alignment artifacts (spurious alignments and large repetitive elements). To get more confident and precise SV calls, we applied Assemblytics [22] to analyze MUMmer output. Assemblytics, pimarily targeting long-read sequencing and mapping, has a conservative filter to exclude alignments that does not meet the threshold of unique contig anchor length, and it centers on calling indels (insertions/deletions) of 50 bp - 10 kbp in size. Since alignments can confound with repetitive elements near large SVs and SV breakpoints spanning might be extremly unbalanced, Assemblytics restricts the SVs detection size to 10 kbp. Indels are characterized to more specialized form based on alignment, including insertions, deletions, tandem expansions, tandem contractions, repeat expansions, and repeat contractions. While insertions/deletions are characterized with defined alignment breakpoints either in reference or contig, tandem variants are defined when alignments overlap, and repeat variants are cases when alignment has a gap in both reference and contig. Expansions refer to the situation where there are more sequences in contigs than in the reference; on the other hand, contractions are cases where there are more sequences in the reference than in contigs.

3.2.2 Split-read analysis

As an alignment-based approach, Sniffles [23] was used for SV calling. Sniffles takes split-read methods, which gain more power in SV detection with long reads to span events. The structural variant indicates split-read signal when the full alignment of the read is disrupted. The current version of Sniffles is used in conjuction with BWA-MEM [24]. PacBio reads are first aligned by BWA-MEM ² and then piped to Sniffles for SV analysis based on split reads alignment.

Sniffles is able to detect insertions, deletions, duplications, inversions and translocations, and assess nested events as well.

2Appendix

9

(16)

4 Results

4.1 Assembly

4.1.1 Mapping

Table 1 shows the summary statistics of the whole-genome alignment result from MUMmer.

The overall consensus is high, with 98.60% and 98.64% in p_contig of PB1 and PB2, respectively. We are aware that there is a trade-off between specificity and sensitivity in whole- genome alignment. The Chinese genome HX1 has even higher consensus (99.73%) with the reference [9]. Comparing the parameters, the setting in our alignment has more gain in specificity to generate more reliable alignments. The feature estimates block gives an overview of alignment breakpoint classifications to assist evaluation of structural similarity of the assembly and the reference. For both assemblies, they share a similar structural quantification.

Table 1: Summary statistics on contigs mapping

PB1 PB2

Primary contig Alternative contig Primary contig Alternative contig [Seqeunces]

TotalSeqs 6,667 3,864 6,247 3,713

AlignedSeqs 4,940(74.10%) 3,670(94.98%) 4,751(76.05%) 3,494(94.10%)

UnalignedSeqs 1,727(25.90%) 194(5.02%) 1,497(23.95%) 219(5.90%)

[Bases]

TotalBases 2,836,568,842 92,419,689 2,842,340,423 89,717,795

AlignedBases 2,796,926,265(98.60%) 87,165,314(94.31%) 2,803,666,611(98.64%) 84,339,995(94.01%) UnalignedBases 39,642,577(1.40%) 5,254,375(5.69%) 38,673,812(1.36%) 5,377,800(5.99%) [Feature Estimates]

Breakpoints 33,596 5,437 34,935 5,334

Relocations 307 21 442 25

Translocations 1,783 291 1,641 290

Inversions 286 55 298 52

Insertions 14,874 2,694 16,249 2,631

InsertionSum 47,977,591 5,161,092 55,140,279 5,432,592

InsertionAvg 3225.60 1915.77 3393.46 2064.84

TandemIns 814 49 773 51

TandemInsSum 398,664 25,901 400,148 45,640

TandemInsAvg 489.76 528.59 517.66 894.90

To have better understanding how contigs are mapped to the reference, we transformed data into visualization, shown as Figure 3. Each panel displays data from the particular chromosome with the GRCh38 reference at the bottom track. The primary contig alignments from two individuals (top red track as PB1, second top blue track as PB2), in general, show concordant patterns, with high mappability to the reference, although there are notable alignment gaps around centromeres and telomeres. For chromosome Y, hardly any p_contig in PB2 aligns to reference as the individual is a female. The alternative contig alignment (middle red track as PB1, middle blue track as PB2) are more sparsely distributed as these contigs are only regions where alternate haplotypes exist. The bottom two multi-colored tracks are feature estimates of alignment gaps, with yellow indicating a new contig is needed to continue the alignment and green indicating that no contig can align to the reference. These two tracks are presented to facilitate interpretation of alignment gaps.

10

(17)

Figure3:Contigsmappingof2individuals.

11

(18)

(A) (B)

(C) (D)

Figure 4: Distribution of top BLASTp hits by species. (A, C) BLASTp hits by kingdom from the NCBI nr database showing the majority of hits were bacteria for PB1 (A) and PB2 (C). (B, D) BLASTp hits by species for sequences with hits to eukaryota for PB1 (B) and PB2 (D).

4.1.2 Novel sequence

The second round of mapping to HX1 resulted in 2,210 novel sequences covering ⇠18 Mb from PB1 and 1,818 novel sequences covering ⇠12 Mb from PB2. 19,612 and 10,831 genes were predicted by Maker2, from PB1 and PB2 novel sequences respectively. Filtering InterProScan protein sequence search results yielded 256 significant hits, representing 86 sequences and 129 predicted genes in PB1, and 157 significant hits, representing 69 sequences and 95 predicted genes in PB2. After filtering BLASTp hits, in PB1, 259 sequences and 1,671 predicted genes remained, while in PB2, 192 sequences and 1320 predicted genes have strong hits against the nr database. Among these strong hits, non-eukaryota hits were not considered as they might be contaminations, though some of the sequence hits may attribute to gene transfers. The rest 4% - 5% have fairly high sequence similarity to protein sequences in eukaryota (Figure 4).

While approximately half of the sequences matched with parasite proteins in both individuals, the other half match with vertebrate proteins, with strong hits to sea turtles by species. In PB1, except 2 homo sapiens protein hits, the remaining 84 eukaryota hit to predicted or hypothetical protein, and similarly in PB2, except 1 Pan troglodytes and 1 uncharacterized protein, the rest of 51 eukaryota hits have sequence identity to predicted or hypothetical proteins.

Merging results from InterProScan and BLAST, we identified 9 potential sequences that may contain potential novel genes from PB1 and 8 shortlisted from PB2. The results are detailed in AppendixList 1. Sequences with more InterProScan signatures were prioritized.

12

(19)

4.2 Structural Variants

4.2.1 Assemblytics

Table 2 shows SVs detected by Assemblytics. In aggregate, ⇠8M bases are affected by SVs in both individuals. In general, there are more and longer insertions or expansions than deletions or contractions. This may indicate additional sequences absent from the reference or population-specific (or individual-specific) sequences [8].

Table 2: Summary statistics on SVs by Assemblytics

PB1 PB2

Primary Alternative Primary Alternative

Number Base Number Base Number Base Number Base

Total 11,906 6,895,744 1,583 1,102,214 12,045 6,875,868 1,546 1,056,875

Insertion 4,617 1,723,578 805 453,238 4,711 1,818,093 756 425,711

Deletion 3,682 1,587,770 522 371,894 3,769 1,725,408 537 342,058

Tandem_expansion 1,818 1,999,992 56 74,790 1,757 1,905,490 61 72,711

Tandem_contraction 572 426,360 8 2,699 514 357,050 7 3,559

Repeat_expansion 644 670,413 107 95,131 649 628,758 102 95,729

Repeat_contraction 573 487,631 85 104,462 645 441,069 83 117,107

The length distribution of SVs by Assemblytics are consistent between two individuals (Figure 5). For indels from 50 bp - 1000 bp, the mass of the distributions are concentrated on smaller indels, ranging from 50 bp - 100 bp (Figure 5A,C). However, there is a peak around 300 bp.

It is due to abundant Alu elements whose body lengths are ⇠280 bases [25]. For SVs from 1 kbp to 10 kbp, Figure 5B,D agree with a clear deletion enrichment pattern around 6 kbp, and insertion seems to have a peak here as well, though it is less visible in PB1. The pattern contributes to LINE-1 (L1) retrotransposons.

4.2.2 Sniffles

Table 3 summarizes SVs calling from Sniffles. Comparing to Assemblytics results, the indels are of larger size. Although it is partly due to the maximum SV size (10 kbp) that Assemblytics manages to classify, Sniffles attempts to identify long-range complex SVs. The deletions and duplications in PB1, though share similar numbers as those in PB2, spans markedly more bases. It is likely that some deletions and duplications are of larger size in PB1 than those in PB2. The inversions have less number in PB1 than in PB2, whereas the average size is larger in PB1. One possible explanation for the differences might be that some complex SVs are nested and individual differences affect these SVs. The translocations are amounting several gigabases because the measuring of translocation takes the whole inter-chromosomal range into consideration.

13

(20)

Figure 5: SVs by Assemblytics. (A, C) Distribution of INDELs between 50 - 1000 bp for PB1 (A) and PB2 (C). (B, D) Size distribution from 1 - 10 Kbp of all types of SVs detected by Assemblytics for PB1 (B) and PB2 (D).

Table 3: Summary statistics on SVs by Sniffles

PB1 PB2

Number Base Number Base

Total 10,584 836,604,408,148 12,856 808,860,940,038

Insertion 4,004 5,715,557 3,843 4,966,864

Deletion 3,690 65,379,652 3,930 3,862,482

Duplication 1,327 62,660,331 1,359 1,313,617

Inversion 877 433,110,572 3,050 550,607,907

Translocation 686 836,037,542,036 674 808,300,189,168

While the size distributions of Sniffles SVs (50 bp - 1 kbp) show a consistent peak near 300 bp with Assemblytics SVs, only deletions seem to account for the peak (Figure 6 A, C). In- sertions of smaller size (50 - 100 bp) are more abundant than deletions. An apparent peak in deletions at 6 kbp can be observed as well (Figure 6 B, D). The circular plot (Figure 6 E) shows genomic distribution of larger SVs (>10 kbp). The outermost circle represents GRCh38 chromosome ideograms and the innermost circle shows interchromosomal translocations, with red lines standing for PB1 and blue lines for PB2. The remaining two circles (outer-PB1, inner- PB2) show duplications, inversions, deletions and insertions from outer to inner space in order with height difference. A large inversion on chromosome 1 present in both individuals, and a duplication and a deletion are nested in the genomic region of chromosome 11 in PB1. While some other large SV overlaps between two individuals are also visible, individual differences are noticeable.

14

(21)

E

Figure 6: SVs by Sniffles. (A, C) Distribution of INDELs from 50 bp - 1 kbp for PB1 (A) and PB2 (C).

(B, D) Size distribution from 1 - 10 Kbp of all types of SVs by Sniffles for PB1 (B) and PB2 (D). (E) Circular plot for SVs of size greater than 10 kbp. 15

(22)

4.2.3 Comparison among different datasets

To compare SVs from Assemblytics and Sniffles, indels of the same type and a 75% reciprocal overlap between SVs were required to assess shared calls. The total intersections between the two methods are 2188 in PB1 and 2194 in PB2. Excluding indels with a length greater than 10 kbp, the overlapping numbers dropped to 2023 with 588 insertions and 1435 deletions in PB1, and 2118 with 605 insertions and 1513 deletions in PB2. The deletions are evidently more consistent between the two methods; yet on the other hand, small insertions detected by Assemblytics show much longer length in Sniffles (Figure 7 A, B). Larger insertions detected by Assemblytics seem to deviate from insertions by Sniffles as barely no insertions longer than 5 kbp in Assemblytics coincide with those in Sniffles.

In addition, we compared the indels in PB1 and PB2 with those found in HX1. SV detection in HX1 is mainly sequence assembly based coupled with read-mapping by refining consensus sequences mapped to the reference. It is more likely that HX1 shares more SVs with sequence assembly based SVs in PB1 and PB2 (those from Assemblytics), though the pipelines of detection differ. In Assemblytics result, PB1 and HX1 have 3401 overlaps in sum, with 2203 (⇠65%) insertions and 1198 (⇠35%) deletions, and PB2 and HX1 have 3509 intersections, with 2249 (⇠64%) insertions and 1260 (⇠36%) deletions. While in Sniffles, PB1 and HX1 share 2475 SVs, among which insertions are 773 (⇠31%) and deletions are 1702 (⇠69%), and PB2 and HX1 have 2569 overlaps, among which insertions are 776 (⇠30%) and deletions are 1793 (⇠70%). The shared SVs between HX1 and Assemblytics are of higher consensus relative to those between HX1 and Sniffles (Figure 7. A - E). The overlaps of HX1 and Assem- blytics have more comparable SV size for both insertions and deletions. A greater proportion of these overlaps concentrates at smaller SVs (< 1 kbp), and as SV size increases, the shared calls drop.

Figure 7: SV size distribution of shared SVs from different approaches. (A, D) Assemblytics size versus Sniffles size of shared SVs, (A) PB1, (D) PB2. (B, E) Assemblytics size versus HX1 size of shared SVs, (B) PB1, (E) PB2. (C, F) HX1 size versus Sniffles size of shared SVs, (C) PB1, (F) PB2.

16

(23)

5 Discussion

The sequence identity of PB1 and PB2 compared to GRCh38 is high, which suggests good quality of the assemblies. Although we have mentioned in the previous section that our mapping strategy has more specificity inclination that lead to lower sequence consensus with the reference, the sequencing error rates and coverage are also likely to result in lower sequence accuracy [26]. Additional polishing on assemblies can reduce error rates and improve consensus accuracies. Coverage, at a fundamental level, may bring down error rates and is another factor that affects assembly quality. Though there is no best practice on read depth required for de novo assembly, in [13], the authors evaluated that SMRT sequencing on human BAC with ⇠80 - 100 fold coverage proceeded by a FALCON alike assembly method that could yield high- quality complete assemblies comparable to assemblies using hybrid sequencing and assembly methods. The acquisition of novel sequences and comparative analysis implies the possibilities of identifying functional elements from PacBio sequencing.

The SVs identified by Assemblytics and Sniffles demonstrated consistent patterns with a previous study [8]. Local SVs (< 10 kbp) detected by the sequence assembly method is in relatively high concordance with SVs in HX1, with more than 40% overlap in insertions and 28.5% overlap in deletions. In previous SMRT sequencing studies [8, 9, 26], the shared SVs for insertions are around 39% and deletions about 12%, although the percentage can be high up to ⇠53% for insertions and ⇠42% for deletions with exact the same SV detection pipeline. When comparing SVs from long-read data to short-read data, the shared SVs are less than ⇠25% for insertions and ⇠30% for deletions with 50% reciprocal overlap. Insertions called by Sniffles tend to have a rift with the sequence assembly method, but deletions seem to be in much better consistency.

SV differences, despite detection approaches, are evident between individuals.

5.1 Future work

There is definitely much more to be done with the analysis. A more robust novel sequence identification pipeline is needed to diminish data noise. The candidate novel sequences are not well filtered in current workflow, which leads to confoundings in each computational step.

We have listed a few candidates of novel sequence that may contain new genes; however ideal the list might be, higher confidence in valid identification is desired. Other approaches to conduct effective validation and to gain more supportive evidences should be done. More effort may also be poured into SV validations. There are a number of large SVs (> 100 kbp) detected by Sniffles and being able to validate these SVs may give insights in complex genomic rearrangements and long-read sequencing benefits.

6 Acknowledgements

I am grateful to people who have helped me to accomplish this thesis project. Much gratitude is owed to my supervisor, Adam Ameur, who is very helpful, responsible and inspiring. I would like to thank professor Ulf Gyllensten and his group for offering the opportunity to learn so much about bioinformtics and genomics.

17

(24)

7 References

[1] F. Sanger, S. Nicklen, and A. R. Coulson, “DNA sequencing with chain-terminating in- hibitors,” Proceedings of the National Academy of Sciences, vol. 74, no. 12, pp. 5463–

5467, 1977.

[2] M. L. Metzker, “Sequencing technologies - the next generation,” Nat Rev Genet, vol. 11, no. 1, pp. 31–46, 2010.

[3] L. Liu, Y. Li, S. Li, N. Hu, Y. He, R. Pong, D. Lin, L. Lu, and M. Law, “Compari- son of next-generation sequencing systems,” Journal of Biomedicine and Biotechnology, vol. 2012, p. 251364, 2012.

[4] D. A. Wheeler, M. Srinivasan, M. Egholm, Y. Shen, L. Chen, A. McGuire, W. He, Y.-J.

Chen, V. Makhijani, G. T. Roth, X. Gomes, K. Tartaro, F. Niazi, C. L. Turcotte, G. P. Irzyk, J. R. Lupski, C. Chinault, X.-z. Song, Y. Liu, Y. Yuan, L. Nazareth, X. Qin, D. M. Muzny, M. Margulies, G. M. Weinstock, R. A. Gibbs, and J. M. Rothberg, “The complete genome of an individual by massively parallel DNA sequencing,” Nature, vol. 452, pp. 872–876, 04 2008.

[5] The UK10K Consortium, “The UK10K project identifies rare variants in health and disease,” Nature, vol. 526, pp. 82–90, 10 2015.

[6] K. A. Fakhro, M. R. Staudt, M. D. Ramstetter, A. Robay, J. A. Malek, R. Badii, A. A.- N. Al-Marri, C. A. Khalil, A. Al-Shakaki, O. Chidiac, D. Stadler, M. Zirie, A. Jayyousi, J. Salit, J. G. Mezey, R. G. Crystal, and J. L. Rodriguez-Flores, “The Qatar genome:

a population-specific tool for precision medicine in the Middle East,” Human Genome Variation, vol. 3, pp. 16016 EP –, 06 2016.

[7] S. Besenbacher, S. Liu, J. G. Izarzugaza, J. Grove, K. Belling, J. Bork-Jensen, S. Huang, T. D. Als, S. Li, R. Yadav, A. Rubio-García, F. Lescai, D. Demontis, J. Rao, W. Ye, T. Mailund, R. M. Friborg, C. N. S. Pedersen, R. Xu, J. Sun, H. Liu, O. Wang, X. Cheng, D. Flores, E. Rydza, K. Rapacki, J. Damm Sørensen, P. Chmura, D. Westergaard, P. Dworzynski, T. I. A. Sørensen, O. Lund, T. Hansen, X. Xu, N. Li, L. Bolund, O. Ped- ersen, H. Eiberg, A. Krogh, A. D. Børglum, S. Brunak, K. Kristiansen, M. H. Schierup, J. Wang, R. Gupta, P. Villesen, and S. Rasmussen, “Novel variation and de novo mutation rates in population-wide de novo assembled Danish trios,” Nat Commun, vol. 6, 01 2015.

[8] M. J. P. Chaisson, J. Huddleston, M. Y. Dennis, P. H. Sudmant, M. Malig, F. Hormozdi- ari, F. Antonacci, U. Surti, R. Sandstrom, M. Boitano, J. M. Landolin, J. A. Stamatoy- annopoulos, M. W. Hunkapiller, J. Korlach, and E. E. Eichler, “Resolving the complexity of the human genome using single-molecule sequencing,” Nature, vol. 517, pp. 608–611, 01 2015.

[9] L. Shi, Y. Guo, C. Dong, J. Huddleston, H. Yang, X. Han, A. Fu, Q. Li, N. Li, S. Gong, K. E. Lintner, Q. Ding, Z. Wang, J. Hu, D. Wang, F. Wang, L. Wang, G. J. Lyon, Y. Guan, Y. Shen, O. V. Evgrafov, J. A. Knowles, F. Thibaud-Nissen, V. Schneider, C.-Y. Yu, L. Zhou, E. E. Eichler, K.-F. So, and K. Wang, “Long-read sequencing and de novo assembly of a Chinese genome,” Nat Commun, vol. 7, 06 2016.

[10] J. Eid, A. Fehr, J. Gray, K. Luong, J. Lyle, G. Otto, P. Peluso, D. Rank, P. Baybayan, B. Bettman, A. Bibillo, K. Bjornson, B. Chaudhuri, F. Christians, R. Cicero, S. Clark,

18

(25)

R. Dalal, A. deWinter, J. Dixon, M. Foquet, A. Gaertner, P. Hardenbol, C. Heiner, K. Hes- ter, D. Holden, G. Kearns, X. Kong, R. Kuse, Y. Lacroix, S. Lin, P. Lundquist, C. Ma, P. Marks, M. Maxham, D. Murphy, I. Park, T. Pham, M. Phillips, J. Roy, R. Sebra, G. Shen, J. Sorenson, A. Tomaney, K. Travers, M. Trulson, J. Vieceli, J. Wegener, D. Wu, A. Yang, D. Zaccarin, P. Zhao, F. Zhong, J. Korlach, and S. Turner, “Real-time DNA sequencing from single polymerase molecules,” Science, vol. 323, no. 5910, pp. 133–138, 2009.

[11] R. J. Roberts, M. O. Carneiro, and M. C. Schatz, “The advantages of SMRT sequencing,”

Genome Biology, vol. 14, no. 7, pp. 1–4, 2013.

[12] K. Berlin, S. Koren, C.-S. Chin, J. P. Drake, J. M. Landolin, and A. M. Phillippy, “As- sembling large genomes with single-molecule sequencing and locality-sensitive hashing,”

Nat Biotech, vol. 33, pp. 623–630, 06 2015.

[13] C.-S. Chin, D. H. Alexander, P. Marks, A. A. Klammer, J. Drake, C. Heiner, A. Clum, A. Copeland, J. Huddleston, E. E. Eichler, S. W. Turner, and J. Korlach, “Nonhybrid, finished microbial genome assemblies from long-read SMRT sequencing data,” Nat Meth, vol. 10, pp. 563–569, 06 2013.

[14] C.-S. Chin, P. Peluso, F. J. Sedlazeck, M. Nattestad, G. T. Concepcion, A. Clum, C. Dunn, R. O’Malley, R. Figueroa-Balderas, A. Morales-Cruz, G. R. Cramer, M. Delledonne, C. Luo, J. R. Ecker, D. Cantu, D. R. Rank, and M. C. Schatz, “Phased diploid genome assembly with single molecule real-time sequencing,” bioRxiv, 2016.

[15] C. Alkan, B. P. Coe, and E. E. Eichler, “Genome structural variation discovery and geno- typing,” Nat Rev Genet, vol. 12, pp. 363–376, 05 2011.

[16] G. H. Kai Ye and Z. Ning, “Structural variation detection from next generation sequencing,” Journal of Next Generation Sequencing & Applications, vol. S1, no. 007, pp. –, 2016.

[17] K. R. Bradnam, J. N. Fass, A. Alexandrov, P. Baranay, M. Bechner, I. Birol, S. Boisvert, J. A. Chapman, G. Chapuis, R. Chikhi, H. Chitsaz, W.-C. Chou, J. Corbeil, C. Del Fabbro, T. R. Docking, R. Durbin, D. Earl, S. Emrich, P. Fedotov, N. A. Fonseca, G. Ganapathy, R. A. Gibbs, S. Gnerre, É. Godzaridis, S. Goldstein, M. Haimel, G. Hall, D. Haussler, J. B. Hiatt, I. Y. Ho, J. Howard, M. Hunt, S. D. Jackman, D. B. Jaffe, E. D. Jarvis, H. Jiang, S. Kazakov, P. J. Kersey, J. O. Kitzman, J. R. Knight, S. Koren, T.-W. Lam, D. Lavenier, F. Laviolette, Y. Li, Z. Li, B. Liu, Y. Liu, R. Luo, I. MacCallum, M. D.

MacManes, N. Maillet, S. Melnikov, D. Naquin, Z. Ning, T. D. Otto, B. Paten, O. S. Paulo, A. M. Phillippy, F. Pina-Martins, M. Place, D. Przybylski, X. Qin, C. Qu, F. J. Ribeiro, S. Richards, D. S. Rokhsar, J. G. Ruby, S. Scalabrin, M. C. Schatz, D. C. Schwartz, A. Sergushichev, T. Sharpe, T. I. Shaw, J. Shendure, Y. Shi, J. T. Simpson, H. Song, F. Tsarev, F. Vezzi, R. Vicedomini, B. M. Vieira, J. Wang, K. C. Worley, S. Yin, S.-M.

Yiu, J. Yuan, G. Zhang, H. Zhang, S. Zhou, and I. F. Korf, “Assemblathon 2: evaluating de novo methods of genome assembly in three vertebrate species,” GigaScience, vol. 2, no. 1, pp. 1–31, 2013.

[18] S. Kurtz, A. Phillippy, A. L. Delcher, M. Smoot, M. Shumway, C. Antonescu, and S. L.

Salzberg, “Versatile and open software for comparing large genomes,” Genome Biology, vol. 5, no. 2, p. R12, 2004.

19

(26)

[19] C. Holt and M. Yandell, “Maker2: an annotation pipeline and genome-database manage- ment tool for second-generation genome projects,” BMC Bioinformatics, vol. 12, no. 1, pp. 1–14, 2011.

[20] C. Camacho, G. Coulouris, V. Avagyan, N. Ma, J. Papadopoulos, K. Bealer, and T. L.

Madden, “Blast+: architecture and applications,” BMC Bioinformatics, vol. 10, no. 1, pp. 1–9, 2009.

[21] P. Jones, D. Binns, H.-Y. Chang, M. Fraser, W. Li, C. McAnulla, H. McWilliam, J. Maslen, A. Mitchell, G. Nuka, S. Pesseat, A. F. Quinn, A. Sangrador-Vegas, M. Scheremetjew, S.-Y. Yong, R. Lopez, and S. Hunter, “Interproscan 5: genome-scale protein function classification,” Bioinformatics, vol. 30, pp. 1236–1240, 05 2014.

[22] M. Nattestad, and M. Schatz, Assemblytics: a web analytics tool for the detection of assembly-based variants, 2016. bioRxiv. doi: http://dx.doi.org/10.1101/044925.

[23] F. Sedazeck, Sniffles, 2015–2016. Available at https://github.com/fritzsedlazeck/Sniffles.

[24] H. Li, Aligning sequence reads, clone sequences and assembly contigs with BWA-MEM, 2013. Available at arXiv:1303.3997v2 [q-bio.GN].

[25] P. Deininger, “Alu elements: know the SINEs,” Genome Biology, vol. 12, pp. 236–236, 12 2011.

[26] M. Pendleton, R. Sebra, A. Pang, A. Ummat, O. Franzen, T. Rausch, A .Stutzand, W .Stedman, T .Anantharama, A .Hastie, H .Dai, M .Fritz, H. Cao, A. Cohain, G. Deikus, R. Durrett, S. Blanchard, R. Altman, C.S. Chin, Y. Guo, E.E. Paxinos, J.O. Korbel, R.B. Darnell, W.R. McCombie,P.Y. Kwok, C.E. Mason, E.E. Schadt, and A. Bashir, “As- sembly and diploid architecture of an individual human genome via single-molecule technologies,” Nat Meth, vol. 12, pp. 780–786, 08 2015.

20

(27)

8 Appendix

1 #Mapping contigs to GRCh38 by MUMmer

2 nucmer --maxmatch -l 500 -p $out $reference $query

3 delta-filter -q $out.delta > out.filtered.delta

4 show-coords -c -d -l -q -o out.filtered.delta > out.coords

5 dnadiff -p $dnadiff.out -d out.filtered.delta

1 #Mapping unmapped sequences to HX1 by MUMmer

2 nucmer --maxmatch -c 300 -l 300 -p $out $reference $query

1 #Aligning PacBio reads to the reference by BWA-MEM

2 bwa mem -x pacbio $reference $reads

21

Evaluation of de novo assembly using PacBio long reads

Evaluation of de novo assembly using PacBio long reads

Huiwen Che

Evaluation of de novo assembly using PacBio long reads

Abbreviations

Contents

1 Introduction

1.1 Aims of the thesis project

2 Background

2.1 Single Molecule Real-Time sequencing

2.2 De Novo Assembly

2.3 Genetic Variation

3 Methods

3.1 Assembly evaluation

3.2 Analysis of Genomic Variants

4 Results

4.1 Assembly

4.2 Structural Variants

5 Discussion

5.1 Future work

6 Acknowledgements

7 References

8 Appendix