
Development and validation of

bioinformatic methods for GRC assembly and annotation

Roberto Rossini

Degree project in bioinformatics, 2020

Examensarbete i bioinformatik 45 hp till masterexamen, 2020


Abstract

This thesis presents the work done during my master's degree project under the supervision of Alexander Suh and Francisco J. Ruiz-Ruano. My work focused on the development of in-silico methods to improve the assembly of the Germline Restricted Chromosome (GRC) of songbirds, more specifically that of the zebra finch. GRCs are a good example of the popular saying "the exception that proves the rule". For a very long time, it was assumed that every cell in a healthy multicellular organism carries the same genetic information. Cytogenetic evidence dating back as far as the early 20th century suggests that this is not always the case, as it has been documented that certain organisms carry supernumerary B chromosomes, which are dispensable chromosomes that are not part of the normal karyotype of a species. GRCs are often regarded as a special case of B chromosomes, where every individual of a species carries an additional chromosome whose presence is restricted to germline cells only. The presence of GRCs has been documented in insects, hagfishes and songbirds. A peculiar case is the GRC of the zebra finch, which has an estimated size of over 150 Mb, accounting for over 10% of the zebra finch's total genome size. Despite the first cytogenetic evidence of the zebra finch GRC dating back to 1998, it was only last year that the first comprehensive genomic study of this relatively large chromosome was published. This study shed some light on the gene content of the GRC, revealing that the GRC of the zebra finch mostly consists of paralogs of A-chromosomal genes. The GRC assembly and annotation that were published as part of this study included 115 GRC-linked genes that were identified through germline/soma read mapping, as well as 36 manually curated scaffolds with a median length of 3.6 kb. Considering the conspicuous size of the GRC of the zebra finch, it is clear that this is a very fragmented and likely incomplete GRC assembly. There are many factors that can have a negative impact on assembly completeness and contiguity. In the GRC's case, these factors collectively affect coverage in ways that are not properly handled by available genome assemblers.

In the course of my master's degree project I developed kFish, a bioinformatic software to perform alignment-free enrichment of GRC-linked barcodes from a 10x Genomics linked-read DNA Chromium library. kFish uses an iterative approach where the k-mer content of a set of GRC-linked sequences is compared with that of the reads corresponding to each individual 10x Genomics barcode. This comparison allows kFish to identify likely GRC-linked barcodes, and to then only use reads corresponding to these barcodes when trying to assemble the GRC. First benchmarking results, generated using five GRC-linked genes from zebra finch as reference sequences, show that kFish is capable of assembling not only already known GRC-linked sequences, but also new ones, with high confidence. kFish can do all of this in a matter of hours, using only a few gigabytes of system memory, while previous efforts took over two days to assemble the zebra finch genome and identify GRC-linked scaffolds using an approach based on read mapping.

High-quality genome assemblies and annotations are the foundations of modern genomics research, and their absence greatly limits the breadth of the questions that can be answered. There is still a lot that we do not understand about GRCs, and part of this is due to the lack of high-quality GRC assemblies and annotations. Producing such an assembly will likely require an integrated approach, where multiple sequencing technologies as well as bleeding-edge bioinformatic tools such as kFish are combined to produce a high-quality assembly, which


Assembling weird chromosomes with Next-Generation DNA Sequencing

Popular Science Summary Roberto Rossini

History teaches that technological progress doesn't happen in a linear fashion. Instead, prolonged periods of stagnation are punctuated by technological breakthroughs that change humans and human society forever. Over the last four centuries we went through several such revolutions, like the industrial revolution, the transportation revolution or the digital revolution.

Molecular biology is one of the youngest fields among the life sciences. In spite of its age, over the past few decades several technological breakthroughs have revolutionized this very young field.

Amongst these breakthroughs is Next-Generation Sequencing (NGS), a set of methodologies for DNA sequencing that have been developed and made commercially available over the last two decades.

Living organisms use DNA molecules to store and convey information. This information can be thought of as a set of instructions, known as genes, that encode how to perform specific functions, such as synthesizing collagen, the main component of your skin, or producing the lactase enzyme, which breaks down lactose molecules into glucose and galactose, and the lack of which causes lactose intolerance. Before the advent of NGS, accessing the information encoded by genes through DNA sequencing was a fairly slow and expensive process. For this reason, researchers in the life-science and medical sectors often had to answer their research questions without having access to the sequence of the genes involved in the process that they were studying. This made answering certain research questions very hard, and sometimes outright impossible.

NGS made DNA sequencing accessible to virtually every researcher by dramatically increasing sequencing throughput, leading to a massive drop in sequencing time and cost. This improvement was achieved by adopting a set of sequencing methodologies that were highly scalable and that could be easily automated. As an example, sequencing and producing a draft of the human genome in the early 2000s had an estimated cost between $500 million and $1 billion, while the same result can nowadays be achieved with an investment of as little as $4000!

One of the peculiarities of NGS is that it sequences long DNA molecules by breaking them down into small fragments, known as reads. The human genome consists of 23 pairs of DNA molecules, each tightly packed in a structure known as a chromosome. If we were to unwind every chromosome in the human genome and then concatenate them, we would obtain a DNA molecule that is about 3.2 billion base pairs (bp), or letters, long. NextSeq 2000 is the latest sequencing platform by Illumina, the leading company in NGS sequencing platforms. This sequencing machine is capable of producing up to two billion reads in a single run, where each read can be no more than 150 bp long, about 20 million times smaller than the human genome!

It is then clear that NGS sequencing platforms do not directly output the sequence of the DNA molecules that have been sequenced, but rather very short sequence fragments that, when overlapped and chained together in the right order in a process known as genome assembly, will, at least in theory, produce the sequence of letters representing the complete genome of the organism that has been sequenced.

The problem of assembling a genome from the data produced by NGS sequencing has been dubbed by some researchers the "Exploding Newspaper" problem.

Consider a stack of identical newspapers placed on top of a detonating device that can be triggered remotely. The stack of newspapers represents the information stored in the genome of the organism that we intend to sequence. The sequencing process corresponds to the explosion that shreds the newspapers to pieces, where a paper shred corresponds to a single read, and genome assembly corresponds to piecing a complete copy of the newspaper back together from the shreds produced by the explosion. This is of course an inherently hard problem to solve, and it often won't be possible to assemble certain portions of the newspaper. To automate the process of assembling NGS reads into full genomes (or at least portions of them), scientists have developed a class of very sophisticated software known as genome assemblers.

Having access to the genome sequence of the organisms that are the subject of a study can be an invaluable asset for researchers. Nowadays the questions that can be answered by life-science researchers often depend on the availability of genomic information such as genome assemblies and genome annotations (locations of genes that are labeled with a gene's known or predicted function).

For the most part, individuals of the same species have a constant number of chromosomes in all their cells throughout their whole life. As is often the case in the life sciences, there are plenty of exceptions to the previous statement. More specifically, by looking at chromosomes through microscopes, scientists have documented several cases of uneven chromosome numbers in nematodes and insects, some of which date back as far as 1887! A study published in 2019 analyzed the number, shape and size of chromosomes in different cell types from 16 representative species of the songbird clade, a group of birds that includes roughly half of the known bird species, among them sparrows, finches and crows. This study found that all the analyzed individuals carry an extra chromosome in cells from their reproductive organs (germ cells) when compared to the chromosomes from a control tissue. Given that the presence of these chromosomes appears to be restricted to cells from the germline, they have been dubbed Germline Restricted Chromosomes (GRCs).

Even though the first evidence of the GRC in zebra finch dates back to 1998, it was only last year that a GRC was assembled for the first time. Unfortunately the assembly turned out to be very fragmented and incomplete. This is due to several technical complications that range from sample preparation to the assembly process itself. One of these complications is that a considerable fraction of GRC genes show a high sequence similarity to sequences on the "standard chromosomes". Going back to the exploding newspaper example, assembling a genome with a GRC is the equivalent of trying to assemble an issue of a newspaper that contains several near-identical versions of the same article, where different versions only differ in a few words or sentences. The reconstruction produced by assembling the shreds from the explosion of such a newspaper will likely contain a single version of the article, where words and sentences from different versions have been merged and collapsed into a single article.

In the course of my master's thesis I have developed kFish, a bioinformatic software that, given linked-read data (a special type of short reads with barcode information that can be used to group reads based on their DNA molecule of origin) and a set of reference sequences, such as one or more GRC genes, is capable of enriching for reads corresponding to the reference sequences and nearby regions. This means that given a word or sentence that is specific to one of the versions of a news article, kFish is capable of reconstructing the content of that particular article, without collapsing words and sentences from similar articles into a single, chimeric article.

I have developed kFish to aid the assembly of GRCs in songbirds, but in the course of its development, it became clear how kFish may also be applied to a wider range of sequence assembly problems that require targeted assembly approaches, such as the assembly of sex chromosomes or solving the ordering of genes located in the same portion of a chromosome.

Applying kFish to the GRC of zebra finch produced results that highly overlap with previous results obtained with substantially different approaches, and also allowed for the reconstruction of new sequences.

Degree project in bioinformatics, 2020

Examensarbete i bioinformatik 45 hp till masterexamen, 2020

Biology Education Centre and Suh Group: Dept. of Organismal Biology, Uppsala University


Contents

Glossary 6

Acronyms 7

1 Introduction 9

1.1 Introducing Germline Restricted Chromosomes (GRCs) . . . 9

1.2 De-novo genome and GRC assembly . . . 10

1.2.1 Complications in de-novo GRC assembly . . . 10

1.2.2 Choosing a sequencing platform for de-novo GRC assembly . . . 11

1.3 How to improve GRC assembly . . . 13

1.3.1 Using k-mers to assess sequence similarity . . . 14

1.3.2 Introducing kFish, a software for alignment-free barcode enrichment . . 15

2 Materials and Methods 16

2.1 Architecture . . . 16

2.2 Database . . . 17

2.3 kFishing . . . 18

2.3.1 Initializing the CQFs . . . 22

2.3.2 First filtering stage . . . 23

2.3.3 Second filtering stage . . . 24

2.3.4 Wrapping up the first kFishing round . . . 25

2.3.5 Dynamically set the Overlap Coefficient threshold . . . 25

2.3.6 Dealing with repetitive regions . . . 26

2.4 Assembling the fishing yield . . . 28

2.5 Selection based on the coverage ratio . . . 28

3 Results 30

3.1 Data preprocessing . . . 31

3.2 Fishing and assembling GRC-linked sequences . . . 31

3.3 Identifying GRC-linked scaffolds . . . 33

3.4 Scaffolds annotation . . . 34

4 Discussion 39

4.1 Comparing kFish with previous efforts and different lines of evidence . . . 39

4.2 Studying GRCs, a wider perspective . . . 41

4.3 Future developments . . . 42

5 Conclusions 44

6 Acknowledgements 45

References

Appendices I


A Memory efficient encoding of k-mers I

B K-mers and hashing II

B.1 Choosing the right hash function . . . IV

C Space-efficient storage of hashed k-mers for high-throughput random lookups V

C.1 Storing k-mers using bare vectors . . . V

C.2 Storing k-mers using Hash Maps . . . VI

C.3 Locality of reference and CPU cache . . . IX

C.4 Bloom filter, a space-efficient probabilistic data structure for AMQ . . . X

C.5 Counting Quotient Filter (CQF), a cache-friendly probabilistic data structure for AMQ . . . XII

D Software used by kFish XVII

D.1 C/C++ libraries . . . XVII

D.2 Development tools . . . XVIII

D.3 Bioinformatic tools . . . XVIII

E Trimming settings (BBDuk) XIX


Glossary

K-mer Canonical Form Given a nucleic sequence S and one of its k-mers K, to write the k-mer K in its canonical form, the reverse complement of the k-mer K (K') is generated. Then the two k-mers K and K' are compared, and the one that is smaller according to lexicographical order is used as the canonical form of the k-mer K. IV
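As a minimal illustration (a C++ sketch of my own; kFish does not necessarily implement it this way):

    #include <algorithm>
    #include <string>

    // Reverse complement of a DNA sequence consisting of A, C, G and T.
    std::string revcomp(const std::string& kmer) {
        std::string rc(kmer.rbegin(), kmer.rend());
        for (char& c : rc) {
            switch (c) {
                case 'A': c = 'T'; break;
                case 'T': c = 'A'; break;
                case 'G': c = 'C'; break;
                case 'C': c = 'G'; break;
            }
        }
        return rc;
    }

    // Return the lexicographically smaller of a k-mer and its reverse complement.
    std::string canonical(const std::string& kmer) {
        return std::min(kmer, revcomp(kmer));
    }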

Binary Search is a search algorithm that can be used on a sorted collection of items whose elements can be accessed by index. Given a sorted vector V of length l and a query q, binary search starts by looking at the value stored in the vector V at index i = ⌊l/2⌋, which is V[i]. If V[i] = q, then the search is successful and binary search will return True. If V[i] ≠ q, then the value of q is compared with V[i]: if q < V[i], then binary search will keep looking for the query q only in the left portion of the vector V (V[0 : i−1]), while if q > V[i] it will only look in the right portion of the vector (V[i+1 : l]).

Binary search keeps halving the range of the vector that is being considered based on whether the query q is smaller or larger than the value stored at the index i, where i is the index that was used to partition the vector in the current iteration of binary search. The search stops when V[i] = q, or when the length of the range being considered is 1 (i.e. when the subrange being considered consists of a single element), in which case, if V[i] ≠ q, binary search will return False. V
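An iterative version of the procedure just described can be sketched as follows (my own illustration in C++, not code from kFish):

    #include <cstdint>
    #include <vector>

    // Return true if q is stored in the sorted vector v.
    bool binary_search_sorted(const std::vector<std::uint64_t>& v, std::uint64_t q) {
        std::size_t lo = 0, hi = v.size();       // current search range is [lo, hi)
        while (lo < hi) {
            std::size_t i = lo + (hi - lo) / 2;  // partitioning index
            if (v[i] == q) return true;          // query found
            if (q < v[i]) hi = i;                // keep searching the left portion
            else lo = i + 1;                     // keep searching the right portion
        }
        return false;                            // range exhausted: q is not in v
    }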

Bloom Filter Probabilistic data structure used to perform AMQ. VIII, X–XII

Cardinality of a set is a measure of the number of items belonging to a set. 22

Database ingestion The process of importing data into a database. 16

Deinterlaced FASTQ file is a set of one or more pairs of FASTQ files where the first and second read of a read pair are stored in two separate files (i.e. all the reads that come first in a read pair are stored in file A, and all the reads that come second in a read pair are stored in file B). 14

Hash Function is a function that can map data of arbitrary size to a finite set of values, often to a subset of ℕ. II, IV

Hash Map is a data structure that stores and maps a set of keys to a set of values. It is space-efficient and queries have constant time complexity in the average case, O(1). VI–X, XII, 22

Hash Table See hash map. VI

Interlaced FASTQ file FASTQ file storing paired-end reads where each forward mate alternates with the reverse mate from the same fragment. 14, 26

K-mer Given a sequence S of length l, k-mers are defined as all the substrings of length k that can be obtained from the sequence S. I–XII, 8, 12, 13, 16–26, 37, 41

MemTable is a class of in-memory data structures that are used by RocksDB to store data in system memory before flushing it to a Static Sorted Table (SST) file. 16

Overlap Coefficient The Overlap Coefficient (OC), or Szymkiewicz–Simpson coefficient, is a coefficient used to describe the overlap between two finite sets. In kFish the OC is used to compare the k-mer composition of barcodes and reference sequences. Given two sets A and B, the OC is defined as the ratio between the cardinality of their intersection and the cardinality of the smaller of the two sets:

OC(A, B) = |A ∩ B| / min(|A|, |B|). 22
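For illustration, the OC of two k-mer (hash) sets can be computed as follows (a minimal C++ sketch of my own, not kFish code):

    #include <cstdint>
    #include <unordered_set>

    using KmerSet = std::unordered_set<std::uint64_t>;

    // Overlap Coefficient: |A ∩ B| / min(|A|, |B|).
    double overlap_coefficient(const KmerSet& a, const KmerSet& b) {
        const KmerSet& small = a.size() <= b.size() ? a : b;
        const KmerSet& large = a.size() <= b.size() ? b : a;
        std::size_t shared = 0;
        for (const auto& k : small)
            if (large.count(k) != 0) ++shared;  // member of the intersection
        return small.empty() ? 0.0
                             : static_cast<double>(shared) / small.size();
    }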

Vector is a simple data structure used to store a collection of objects with the same data type in contiguous memory with extremely low space overhead. V, VI, X, XI, 16

Acronyms

ABM Advanced Bit Manipulation. XII

AMQ Approximate Membership Query. X, 16

ASCII American Standard Code for Information Interchange. I, II

BBF Blocked Bloom Filter. XII

BGZF Blocked GNU Zip Format. 15

BMI1 Bit Manipulation Instruction Set 1. XII

bp Base pairs. I, 20

CBF Counting Bloom Filter. X

CCS Circular Consensus Sequencing. 10

CIGAR Concise Idiosyncratic Gapped Alignment Report. 27

CLI Command Line Interface. 41

CQF Counting Quotient Filter. XI, XII, XV, XVI, 16, 17, 19–23, 25, 40

DBG De-Bruijn Graph. 8, 10

EASCII Extended ASCII. I

FNP False Negative Probability. XI

FPP False Positive Probability. X, XI

GEM Gel bead-in-EMulsion. 11, 18

GNU GNU's Not Unix. 15

GRC Germline Restricted Chromosome. 7–13, 18, 19, 26–28, 31–35, 37–42

Gzip GNU Zip. 15

HMW DNA High Molecular Weight DNA. 11, 18, 25, 26

IOPS Input/Output operations. 15

KB KiloByte. IX

LRU Least-Recently Used. IX

MB MegaByte. I, IX, XVI

NGS Next-Generation Sequencing. 8, 15

OC Overlap Coefficient. 16, 22–24, 28, 29

OLC Overlap Layout Consensus. 8, 10, 12

oligo Oligo Nucleotide. 11

PCR Polymerase Chain Reaction. 10, 11

QF Quotient Filter. XII–XVI

SBF Spectral Bloom Filter. X, XII

SHA-2 Secure Hash Algorithms 2. IV

SNP Single Nucleotide Polymorphism. 27

SST Static Sorted Table. 16

TB TeraByte. VI, IX

UTF-8 8-bit Unicode Transformation Format. I

VGP Vertebrate Genome Project. 11, 32, 33

ZSTD Zstandard. 15


1 Introduction

In the course of my master's degree project I developed and implemented kFish, an in-silico method to perform barcode enrichment of a 10x Genomics Chromium linked-read library given one or more reference sequences. kFish was tested on a set of known GRC-linked genes from zebra finch, and the results were compared with those of Kinsella et al. (2019). Overall, the results produced by kFish show a good level of agreement with previous findings and suggest that kFish may be applicable to a broader variety of bioinformatic problems, such as the assembly of sex chromosomes and rapid targeted reassembly of small portions of very large and repetitive genomes.

1.1 Introducing Germline Restricted Chromosomes (GRCs)

For a very long time it was believed that every cell in a healthy multicellular organism carries the same genetic information in the form of one or multiple chromosomes. Since the early 20th century there has been growing evidence that this is not always the case, as it has been documented that some individuals of certain species carry one or more additional, dispensable chromosomes, which are collectively known as B chromosomes (Wilson, 1907, 1909). Over the past few decades, the presence of B chromosomes has been confirmed in close to 1700 species, and it is currently estimated that B chromosomes might be present in 10-15% of known eukaryotic species (Camacho, 2005; Jones, 2017). B chromosomes have been defined by Beukeboom (1994) as "additional dispensable chromosomes that are present in some individuals from some populations in some species, which have probably arisen from the A chromosomes but follow their own evolutionary pathway", using the term "A chromosomes" to refer to the set of chromosomes that are found in every cell of healthy individuals from the same species.

GRCs are often regarded as B chromosomes, even though they are found in all the individuals of a species. As the name suggests, GRCs are special chromosomes that only occur in the germline of their carrier. The key difference here is that B chromosomes are not present in all individuals from the same population or species, while GRCs are present in all the individuals of a GRC-carrying species, but their presence is limited to the germline of their carriers. GRCs are known to occur in several multicellular eukaryotes, including insects, hagfishes and songbirds (Wang and Davis, 2014; Torgasheva et al., 2019).

A peculiar case of a GRC is that of zebra finch, which was first identified by Pigozzi and Solari (1998). This GRC has an estimated size of over 150 Mb and is the largest chromosome in the zebra finch genome. A recent study by Torgasheva et al. (2019) used cytogenetics and microdissection sequencing to look at the somatic and germline karyotypes of 24 avian species, including 16 representative species of the songbird clade. This study was able to document the GRC's presence in the germline of all 16 songbird species, as well as its absence in their somatic cell lines. Furthermore, they found no evidence of GRCs in the germline of the remaining eight bird species.

Although we have known of the very large GRC of zebra finch for over 20 years, a high-quality reference assembly for this peculiar chromosome has yet to be published. A study by Kinsella et al. (2019) identified 115 GRC-linked genes using mapping-based information. 19 of


The published GRC assembly consisted of 36 GRC-linked scaffolds with a median scaffold length of 3.6 kb. This study also suggests that the GRC of zebra finch mostly consists of genomic regions with high sequence similarity to 19 of the regular chromosomes, also known as A chromosomes, and that some of the GRC paralogs are present in very high copy number (up to around 300 copies in the case of the gene dph6). RNA-Seq and protein mass spectrometry evidence also suggests that some of the GRC paralogs are likely functional, as they are expressed both at the transcript and at the protein level.

1.2 De-novo genome and GRC assembly

There are two main types of de-novo genome assemblers: the first type is based on the Overlap Layout Consensus (OLC) algorithm, while the second one relies on De-Bruijn Graphs (DBGs).

Both types of assemblers use information from overlapping reads and their coverage to reconstruct the original sequence of a genome, but they do so in ways that are radically different.

OLC assemblers identify overlapping regions by computing all-vs-all pairwise alignments of the reads, and then generate an overlap graph where each read is represented by a node, and an edge between two nodes indicates that the corresponding reads have a sequence similarity above a certain threshold. DBG assemblers, on the other hand, do not use read sequences directly; instead they generate all possible substrings of length k (also known as k-mers) for a given read and then build a DBG where each node represents a distinct k-mer, and an edge between two nodes indicates that the two k-mers are neighbors (i.e. the suffix of length k−1 of one k-mer is the prefix of length k−1 of the other).
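To make the contrast concrete, the following sketch (my own illustration, not code from an actual assembler) builds a tiny DBG in which nodes are k-mers and an edge connects two k-mers whenever the (k−1)-suffix of one matches the (k−1)-prefix of the other:

    #include <string>
    #include <unordered_map>
    #include <unordered_set>
    #include <vector>

    // Build a map from each k-mer to the k-mers it is linked to in the DBG.
    std::unordered_map<std::string, std::vector<std::string>>
    build_dbg(const std::vector<std::string>& reads, std::size_t k) {
        std::unordered_set<std::string> kmers;
        for (const auto& r : reads)                       // enumerate all k-mers
            for (std::size_t i = 0; i + k <= r.size(); ++i)
                kmers.insert(r.substr(i, k));

        // Index k-mers by their (k-1)-prefix for fast neighbor lookups.
        std::unordered_map<std::string, std::vector<std::string>> by_prefix;
        for (const auto& km : kmers)
            by_prefix[km.substr(0, k - 1)].push_back(km);

        // Link each k-mer to the k-mers whose prefix matches its (k-1)-suffix.
        std::unordered_map<std::string, std::vector<std::string>> edges;
        for (const auto& km : kmers) {
            auto it = by_prefix.find(km.substr(1));
            if (it != by_prefix.end()) edges[km] = it->second;
        }
        return edges;
    }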

With the advent of NGS, OLC assemblers have become less popular, as computing an overlap graph by doing an all-vs-all read alignment does not scale well with the extremely high number of reads that are produced by NGS platforms. Furthermore, simplifying the graph produced by OLC (the Layout step) and generating the consensus sequence for each contig can be very time consuming. DBG assemblers are often the only way to assemble very large genomes or high-coverage short-read libraries, as building and storing DBGs requires fewer computing resources than OLC. The main weakness of DBG assemblers stems from the fact that they immediately split the already short reads into k-mers, which are even shorter sequences, thus losing the ability to resolve most of the repetitive regions that are found in a genome (Li et al., 2012; Langmead, 2018b,a).

1.2.1 Complications in de-novo GRC assembly

De-novo genome assembly is already a very complicated task due to repetitive regions and uneven depth of coverage. Assembling GRCs comes with an additional set of complications, as highlighted by Kinsella et al. (2019):

• Unusual effective ploidy. GRCs are a special type of chromosome that is only found in the germline of certain species (see section 1.1). Unfortunately the reproductive organs of most organisms, including songbirds, contain a large number of somatic cells, which do not carry a GRC. For this reason, even when taking extra care in isolating testes or ovaries from neighboring tissues, not every cell in the germline sample will contain a GRC.


Using zebra finch as an example, it is estimated that only one in every three cells in the testis is actually from the germline (Kinsella et al., 2019). For this reason, GRC sequences are underrepresented right from the early stages of sample preparation.

• A-chromosomal paralogs. Available evidence suggests that the sequence of the zebra finch GRC mostly consists of paralogs of A-chromosomal genes. Some of these paralogs differ from the A-chromosomal copy by only a few nucleotides, or none at all (Kinsella et al., 2019).

Genome assemblers are known to struggle when assembling near-identical sequences, often producing a chimeric assembly, or not assembling one or more of the paralog copies (Emrich et al., 2004). This is because most genome assemblers are optimized for haploid or diploid genomes (one or two haplotypes respectively), not for genomes with GRCs, where certain genomic regions are characterized by a third haplotype.

• High copy number. Kinsella et al. (2019) first discovered that most of the GRC-linked genes of zebra finch are present in very high copy number. According to Kinsella et al. (2019, Supplementary table 7), several of the GRC-linked genes are present in 25 or more copies, the most extreme case being dph6, which has an estimated copy number of 300.

All the above factors collectively influence coverage in a way that is not properly handled by traditional genome assemblers, increasing the likelihood that paths representing GRC-linked sequences will be erroneously removed by the graph simplification step. De-novo genome assemblers usually assume that the read data provided as input produces uniform coverage over the true sequence of a genome. This assumption is very useful when simplifying the assembly graph, as branches and tips that originated because of sequencing errors can be identified for removal by looking for portions of the graph where the average coverage is significantly lower than expected. Unfortunately, relying on coverage information to simplify the assembly graph of the GRC of songbirds can be expected to produce highly fragmented assemblies, as many of the branches and tips representing GRC-linked sequences will be removed during graph simplification.

1.2.2 Choosing a sequencing platform for de-novo GRC assembly

Because of the relatively short amount of time available to carry out a master's degree project and the costs of sequencing, I didn't have the option of choosing which sequencing approach to use to assemble the zebra finch GRC. Still, I would like to illustrate the available options and their advantages and disadvantages when applied to the GRCs of songbirds. At the time of writing, there are three well-established sequencing approaches for de-novo genome assembly: long-read sequencing (PacBio and Nanopore), short-read sequencing (Illumina) and linked-read sequencing (10x Genomics) (Quail et al., 2012; Bayley, 2006; Levene et al., 2003; Loman and Watson, 2015; Weisenfeld et al., 2017a).

As described in sections 1.2 and 1.2.1, having access to long-range sequence information is of vital importance when performing de-novo genome assembly, as it can make the difference between properly assembling repetitive regions and recently diverged paralogs, mis-assembling them, or not assembling them at all. Given this, it does not come as a surprise that long-read sequencing has improved available genome assemblies by assembling genomic regions that just a few years ago were impossible to assemble (Sedlazeck et al., 2018; Peona et al., 2019; Rhie et al., 2020). Long-read sequencing platforms currently have much lower throughput than well-established NGS platforms, such as Illumina. In the case of genome assembly this is not necessarily a disadvantage since, as postulated by the Lander–Waterman model, the number of reads required to perform a whole-genome assembly is inversely proportional to the read length, and assembling a relatively small number of reads can be done with an OLC assembler (Lander and Waterman, 1988; Li et al., 2012; Langmead, 2018b). Another important advantage offered by PacBio and Nanopore sequencing is that, by not involving a PCR amplification step, they are less affected by GC-bias and other sources of sequencing bias. This makes it possible to assemble at least part of the so-called "hidden genes", which are genes that, because of sequencing bias, end up being underrepresented in the sequencing data and are thus not assembled (Ross et al., 2013; Beauclair et al., 2019). Long-read sequencing platforms do not come without drawbacks. The first one is high sequencing cost. This is especially an issue for GRC assembly, as in order to identify GRC-linked scaffolds it is often necessary to sequence the gonad as well as a control tissue from the same individual. The second drawback is sequencing accuracy: long-read sequencing platforms tend to have a much higher error rate than Illumina sequencing. It should be noted that both PacBio and Nanopore have developed specialized protocols to significantly reduce sequencing errors, commercially known as PacBio Circular Consensus Sequencing (CCS) and Nanopore 2D respectively. Both protocols involve sequencing the same DNA molecule multiple times to later produce a consensus sequence with improved accuracy. Unfortunately, both protocols trade off sequencing throughput for accuracy, thus increasing sequencing costs. PacBio CCS also sacrifices read length for improved accuracy, as each read will contain multiple sequencing passes over the same DNA molecule (Wenger et al., 2019; Jain et al., 2015; Amarasinghe et al., 2020). The last drawback concerns the amount of DNA required for sequencing. PacBio and Nanopore require on the order of 10-20 µg, while Illumina and 10x Genomics require DNA on the order of 10-20 ng. This is an issue in the case of songbird GRCs, as it is often impractical and sometimes impossible to extract several µg of high-quality DNA from the germline of a single individual, due to the small size of their gonads.

The second option I will describe is short-read sequencing using Illumina. An important feature of Illumina sequencing is its lower sequencing cost. This can be a big advantage in the case of GRCs, as in order to produce an assembly and identify GRC-linked sequences, we have to sequence the germline as well as a control tissue from the same individual. The second advantage is the superior sequencing accuracy, which is always a desirable feature, but even more so in the case of GRCs, since we need the ability to discern between near-identical paralogs (Quail et al., 2012). The last advantage is the ability to achieve adequate sequencing throughput and coverage even in those cases where only a very small amount of DNA is available. The main disadvantage of short-read Illumina sequencing is the inability to resolve long and complex repeats. This is not only a direct consequence of the shorter read length, but is also due to the higher sequencing depth that is required to achieve adequate coverage, which forces us to resort to DBG assemblers.

The last disadvantage is sequencing bias, which, as described in Ross et al. (2013), might produce incomplete assemblies because certain genomic regions are not sequenced at all.

The last option that I will discuss is linked-read sequencing with 10x Genomics Chromium. 10x Genomics developed a barcoding protocol based on microfluidics and Illumina's sequencing platform in an attempt to combine some of the advantages of short- and long-read sequencing (Zheng et al., 2016a). This platform essentially consists of a chemistry kit and a microfluidic system known as the Chromium controller. The barcoding protocol starts with DNA extraction without fragmentation, yielding High Molecular Weight DNA (HMW DNA) on the order of 50-100 kb in length. The Chromium controller is then used to partition the HMW DNA so that each of the GEMs will contain around 10 HMW DNA molecules. Gel bead-in-EMulsions (GEMs) are gel beads containing the enzymes required to perform DNA fragmentation and PCR amplification, as well as a pool of identical Oligo Nucleotides (oligos) that can be used as a barcode to uniquely identify GEMs. At this point the HMW DNA molecules trapped inside GEMs are fragmented, labeled with the GEM's barcode and amplified by isothermal PCR, producing a set of amplicons with the same barcode. The assumption is that it is highly unlikely that a GEM contains two HMW DNA molecules from the same genomic region. The library produced by the Chromium controller can then be sequenced with Illumina sequencing. Barcode information can then be used to pool together reads that originated from the same genomic region. This information can be extremely useful for a variety of applications, including genome assembly, where barcodes are used to correct misassemblies, aid scaffolding and phase haplotypes. An important limitation of the 10x Genomics platform is the sequencing depth of individual HMW DNA molecules, which is often much less than one. For this reason it is not possible to reconstruct the sequence of an HMW DNA molecule by simply pooling and assembling reads with the same barcode; it is instead necessary to first identify several barcodes that, when considered collectively, provide enough coverage to assemble the region of interest. Another interesting feature of 10x Genomics sequencing is the ability to produce haplotype-phased genome assemblies with Supernova, a genome assembler that uses linked-read information to produce a haplotype-phased assembly, where the maternally and paternally inherited chromosomes are assembled separately (Weisenfeld et al., 2017b).

Considering the above, 10x Genomics can be seen as a cheaper alternative to long-read sequencing that can offer superior sequencing accuracy, as well as long-range haplotype information on the order of 50 kb, all while requiring as little as 1 ng of HMW DNA (Weisenfeld et al., 2017a).

It should not then come as a surprise that 10x Genomics linked-read technology is the sequencing platform that was chosen by my lab to assemble the GRC of songbirds.

Finally, I would like to mention that currently the best approach to genome assembly is to not rely on a single sequencing platform, but instead to integrate multiple sequencing technologies such as Illumina short-read sequencing, PacBio or Nanopore long-read sequencing, and Hi-C, as is done, for example, by the Vertebrate Genome Project (VGP) (Koepfli et al., 2015; Rhie et al., 2020; Peona et al., 2019).

1.3 How to improve GRC assembly

Section 1.2.1 addressed the main difficulties that one faces when assembling GRCs in songbirds, while the last paragraph of section 1.2.2 described the strengths and weaknesses of the most popular sequencing platforms.

Here I use zebra finch as an example, which has a genome size of approximately 1.3 Gbp and a GRC with an estimated size of over 150 Mb (Kinsella et al., 2019). When sequencing the zebra finch germline, even assuming that every cell contains a GRC, we expect that at most 10% of the barcodes have originated from the GRC. Given this, and the fact that assembling


with Alexander Suh, my supervisor, and Francisco J. Ruiz-Ruano Campaña, a postdoc from Alexander's lab, we decided to try and develop an in-silico method that would leverage linked-read information to enrich for GRC-linked sequences. The idea we had is rather simple: given a set of reference sequences, also known as baits, we want to select barcodes whose reads display significant sequence similarity to the baits' sequences. Given that the reads corresponding to a barcode are expected to represent a genomic region corresponding to the mean length of the input DNA molecules (which is usually around 50 kb), it is reasonable to expect that among the reads corresponding to a barcode that has been flagged as "similar" to one of the baits, there are some reads that map to the genomic regions located in the reference genome upstream or downstream of the bait sequence. Under this assumption we can widen our fishing net by looking for barcodes that show a significant sequence similarity with the reads that originated from genomic regions located upstream or downstream of the bait sequence. In an ideal situation, which consists of a GRC without repetitive regions and where barcodes are uniformly distributed across the whole GRC, this process would yield the complete GRC sequence. Figure 1 shows a simplified version of this iterative fishing approach.


Figure 1: Diagram showing a simplified iterative fishing approach.

Gene G1 is a GRC-linked gene. The first fishing iteration uses G1 as bait, and yields B1 and B2. The second iteration uses reads corresponding to B1 and B2 as baits and yields barcodes B3 and B4, while the last iteration uses reads from B3 and B4 as baits and yields barcodes B5 and B6.

1.3.1 Using k-mers to assess sequence similarity

The previous paragraph presented the idea of barcode fishing, which relies on the ability to detect sequence similarities between reads corresponding to different barcodes. But how can we actually measure sequence similarity? The most intuitive way is through sequence alignment.

Sequence alignment works well for identifying barcodes that display a significant sequence similarity in the first fishing iteration, where kFish is comparing the reads of a barcode to the sequence of one or a few baits: we can map reads to the bait(s) and select barcodes where a significant fraction of the reads map to a bait. Sequence alignment does not work as well when trying to estimate sequence similarity between pairs of barcodes, which is what happens from the second fishing iteration onward. The main issue is performance, as we would be performing many pairwise alignments, essentially running into the same scalability issues as the overlap stage of the OLC assembly algorithm described in section 1.2. A solution to this problem is to estimate sequence similarity by comparing the k-mer composition of a pair of barcodes. In this context, we can simply store the k-mer composition of the barcodes that are acting as bait, and then identify barcodes that originated from overlapping genomic regions by counting how many k-mers are shared between every barcode and each individual bait.
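A hedged sketch of this idea follows (illustrative only; as described in section 2.3, kFish actually stores hashed k-mers in space-efficient probabilistic data structures rather than plain hash sets):

    #include <cstdint>
    #include <string>
    #include <unordered_map>
    #include <unordered_set>

    using KmerSet = std::unordered_set<std::uint64_t>;  // hashed k-mers

    // Count, for every bait, how many k-mers it shares with a candidate barcode.
    std::unordered_map<std::string, std::size_t>
    shared_kmer_counts(const KmerSet& barcode,
                       const std::unordered_map<std::string, KmerSet>& baits) {
        std::unordered_map<std::string, std::size_t> counts;
        for (const auto& [bait_name, bait_kmers] : baits) {
            std::size_t n = 0;
            for (const auto& k : barcode)
                if (bait_kmers.count(k) != 0) ++n;  // k-mer present in this bait
            counts[bait_name] = n;
        }
        return counts;
    }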


1.3.2 Introducing kFish, a software for alignment-free barcode enrichment

Over the course of my master's degree project I have designed and implemented kFish, a bioinformatic software to perform alignment-free enrichment of bait-linked sequences. Using again zebra finch's GRC as an example, the user can provide a set of GRC-linked sequences as bait, and then run kFish to select barcodes that show a sufficient sequence similarity to one of the baits. The barcodes yielded by the current fishing round can then be used as baits for the next fishing round. This process can ideally continue until kFish has selected all barcodes that can be unambiguously identified as GRC-linked. By providing the genome assembler only with reads corresponding to the sequence that we are actually interested in assembling, we put the assembler in a much better position to produce a more complete and less fragmented GRC assembly, for two reasons: first, by decreasing the amount of input data we reduce the probability of encountering reads (or k-mers) from unrelated sequences that have high sequence similarity; second, we remove reads that originated from the two A-chromosomal haplotypes.

As outlined in sections 1.2.1 and 1.3, there are many factors that complicate the iterative fishing approach in the case of GRCs, the most important of which is repetitive regions. It is reasonable to expect that at some point kFish will select one or more barcodes that originated from repetitive sequences. Such a barcode would then be used as bait in the next fishing iteration, leading to the selection of a high number of barcodes from all over the genome. kFish currently relies on simple heuristics to detect barcodes containing a high number of reads from repetitive regions, and tries not to use them as bait for the next fishing round. This heuristic is described in section 2.3.6.

One of the priorities during kFish development was execution performance and efficient resource usage. To this aim, kFish is implemented in C++ and relies on RocksDB, a high-performance embedded database, for data storage. kFish achieves high performance by relying on a set of space-efficient and concurrent algorithms and data structures. Thanks to these design choices, kFish is capable of properly leveraging multicore CPUs and runs using a few GBs of system memory (RAM).


2 Materials and Methods

2.1 Architecture

kFish is a program to perform barcode enrichment of a genomic linked-read library (such as 10x Genomics Chromium) given a set of reference sequences (baits) to enrich for. kFish consists of a single executable with a set of subcommands that are used to run specific steps of the barcode-enrichment protocol.

These subcommands currently are:

• load: Read one or more libraries in FASTQ format, and load their content in kFish database.

• compact: Manually trigger a compaction on kFish database. This will reduce storage usage and significantly improve the database read performance.

• query: Execute queries on kFish database. Currently the only supported query accepts one or more barcodes as input, and returns the corresponding reads grouped by barcode.

• fish: Run the iterative fishing protocol. During this stage, kFish traverses its database multiple times and identifies barcodes that are related to one or more of the reference sequences. At the end of each run, the yield of the current run is used as reference for the next round. The default output of this step consists of two FASTQ files containing the paired and unpaired reads yielded by the final fishing round.

• assemble: Assemble the reads from an interleaved FASTQ file. The assembly is performed using ABySS (Jackman et al., 2017). The scaffolding step is aided by ARCS (Yeo et al., 2017), and the final assembly is then corrected using Tigmint (Jackman et al., 2018).

• eval: Given the output of an assembly, select significant scaffolds using the coverage ratio as the filtering criterion. This step uses minimap2 (Li, 2018) to map reads from two different libraries, and uses the normalized coverage ratio to identify tissue- or individual-specific sequences. Usually one of the two libraries contains the sequences that are being assembled (e.g. a germline sample), while the other does not (e.g. a somatic sample).

• unshuffle: Deinterlace one or more interlaced FASTQ files.

• interleave: Interlace reads from a deinterlaced library.

kFish has a general help page that can be accessed by running kFish --help. The help page for a specific subcommand can be viewed with kFish <subcommand> --help (e.g. to view the help page for the load subcommand, use kFish load --help).

A typical use of kFish consists in loading the read library that we want to enrich into the kFish database using the load subcommand, running the barcode-enrichment protocol with the fish subcommand, assembling the reads yielded by the fishing with the assemble subcommand, and finally identifying likely tissue- or individual-specific scaffolds with the eval subcommand. For most applications the fish, assemble and eval subcommands will be called repeatedly, and the output of the eval subcommand will be used as input by the subsequent fish subcommand.
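As a hedged illustration of this loop (all paths and file names below are hypothetical, and the exact input arguments of each subcommand may differ; consult kFish <subcommand> --help):

    # Load the trimmed Chromium library into the kFish database (done once;
    # the database is compacted automatically at the end of ingestion).
    kFish load --db /data/kfish_db ...

    # Enrich for GRC-linked barcodes using known GRC-linked genes as baits.
    kFish fish --db /data/kfish_db --prefix out/round1 \
        --reference-seq grc_genes.fa --seq-for-mask repeat_library.fa

    # Assemble the reads yielded by the final fishing round.
    kFish assemble ...

    # Keep scaffolds whose germline/soma coverage ratio suggests GRC linkage;
    # these scaffolds become the reference sequences of the next fish run.
    kFish eval ...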

2.2 Database

To store bulky read data, kFish relies on RocksDB, an embedded database with a key-value interface, which is a fork of LevelDB developed by Facebook (Borthakur, 2013; Ghemawat and Dean, 2011).

The main reason for using RocksDB instead of working with FASTQ files directly has to do with I/O performance, especially random Input/Output operations (IOPS).

Gzipped FASTQ is the de-facto standard for storing Next-Generation Sequencing (NGS) short-read libraries. One of the issues with the Gzip compression format is that it does not support efficient random access on the compressed files. This means that in order to access reads stored in the middle of a FASTQ.gz file, an application has to decompress from the beginning of the FASTQ file all the way to the reads of interest. This leads to unacceptably slow read performance, especially for random read operations. To circumvent this limitation, the Blocked GNU Zip Format (BGZF) was developed (Li, 2009, ch. 4.1). BGZF produces Gzip-compatible files with random access capabilities by compressing files into a set of small blocks and then building an index of the compressed blocks, thus enabling applications to decompress only the data blocks that contain useful data.

During the early stages of kFish development I decided to take a different route and use RocksDB as a data store instead of relying on BGZF files, due to the following factors:

• Support of a wide variety of compression algorithms, including ZSTD (Collet and Turner, 2016; Handte et al., 2018)

• Support of compression using dictionaries

• Multithreading

• Near lockless read operations

• Support of merge operations that can be used to efficiently group reads by barcode

• Efficient data bulk-loading

For a more in-depth overview of the database's capabilities, refer to the subsection dedicated to RocksDB, or to RocksDB's official documentation.


In the current kFish implementation, RocksDB is only used for the storage of read data, where reads are grouped by barcode. It is possible that in future releases, kFish will rely on RocksDB to also store scaffolds and other kinds of data that cannot be efficiently handled by a traditional SQL database.

As mentioned at the beginning of this section, RocksDB is an embedded database with a key-value interface. kFish uses 10x Genomics barcodes as keys and the reads sharing the same barcode as values. Reads are actually stored by RocksDB as a byte-string where reads belonging to the same read pair are separated by a tab character (\t), while reads that are unpaired or that belong to different read pairs are separated by a newline character (\n).
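For illustration, the value stored for a single barcode could be assembled as follows (a minimal sketch under my own naming; kFish's actual serialization code may differ in details):

    #include <string>
    #include <vector>

    struct ReadPair { std::string read1, read2; };  // read2 empty if unpaired

    // Serialize reads sharing a barcode: mates of the same pair separated by
    // '\t', different pairs (or unpaired reads) separated by '\n'.
    std::string serialize_barcode(const std::vector<ReadPair>& pairs) {
        std::string value;
        for (const auto& p : pairs) {
            if (!value.empty()) value += '\n';   // separator between read groups
            value += p.read1;
            if (!p.read2.empty()) value += '\t' + p.read2;
        }
        return value;
    }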

When ingesting data into the database, a special set of options is used to tweak the database for bulk-loading operations. The main changes introduced by these options consist in disabling automatic database compactions, using vector memtables and increasing the concurrency level of background flushes. When ingesting data, a single thread is used to both read the raw data from a set of FASTQ files and insert it into the kFish database. To improve I/O performance, key-value pairs are first buffered in memory and then written to disk in large batches. During this process, keys are grouped by barcode using a custom merge operator that essentially performs string-append operations on reads sharing the same barcode. At the end of the database ingestion process, reads with the same barcode are stored in contiguous memory on system storage, but they aren't in any particular order. Furthermore, some key-value pairs might even be stored across multiple SST files. This can be solved by running a database compaction. When using the load command, the database is automatically compacted once all the data has been ingested. This behavior can be disabled by passing the --no-compact flag when ingesting the database. This flag should only be used when debugging issues related to database compaction and its performance, as running the query or fish command on a non-optimized database will yield very low performance.
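A string-append merge of the kind described above can be sketched against RocksDB's AssociativeMergeOperator interface (a simplified illustration, not kFish's actual operator; RocksDB also ships a ready-made StringAppendOperator):

    #include <rocksdb/merge_operator.h>
    #include <rocksdb/slice.h>
    #include <string>

    // Append newly ingested reads to the byte-string already stored for a barcode.
    class ReadAppendOperator : public rocksdb::AssociativeMergeOperator {
     public:
      bool Merge(const rocksdb::Slice& /*key*/, const rocksdb::Slice* existing,
                 const rocksdb::Slice& value, std::string* new_value,
                 rocksdb::Logger* /*logger*/) const override {
        new_value->clear();
        if (existing != nullptr) {            // barcode already present
          new_value->assign(existing->data(), existing->size());
          new_value->push_back('\n');         // separator between read groups
        }
        new_value->append(value.data(), value.size());
        return true;                          // merge always succeeds
      }
      const char* Name() const override { return "ReadAppendOperator"; }
    };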

It should be noted that kFish's database compaction is one of the few steps that currently cannot be parallelized. This is because moving files from L0 (the level where key-value pairs are written when flushing a memtable to an SST file) to L1 cannot run in parallel. For this reason, database compaction can take up to several hours when working with very large libraries.

2.3 kFishing

kFish is a program to perform alignment-free enrichment of reference-linked barcodes. Barcode enrichment is achieved using an iterative approach based on two filtering stages. The first stage serves as a fast screening step to filter out barcodes that are clearly not related to any of the reference sequences, while the second stage performs a more thorough comparison, which involves computing the Overlap Coefficient (OC) between the set of distinct k-mers from the reads of the barcode being evaluated and the k-mers of every reference sequence considered individually.
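The two stages can be summarized by the following skeleton (an illustrative sketch where plain hash sets stand in for the probabilistic data structures kFish actually uses, described below; function names and thresholds are mine):

    #include <cstdint>
    #include <unordered_set>
    #include <vector>

    using KmerSet = std::unordered_set<std::uint64_t>;  // hashed k-mers

    // Stage 1: fast screen against the global k-mer set of all references.
    bool passes_first_stage(const KmerSet& barcode, const KmerSet& global,
                            double min_hit_fraction) {
        std::size_t hits = 0;
        for (const auto& k : barcode)
            if (global.count(k) != 0) ++hits;
        return !barcode.empty() &&
               static_cast<double>(hits) / barcode.size() >= min_hit_fraction;
    }

    // Stage 2: compute the Overlap Coefficient (see Glossary) against every
    // reference individually; keep the barcode if any OC clears the threshold.
    bool passes_second_stage(const KmerSet& barcode,
                             const std::vector<KmerSet>& references,
                             double min_oc) {
        for (const auto& ref : references) {
            const KmerSet& small = barcode.size() <= ref.size() ? barcode : ref;
            const KmerSet& large = barcode.size() <= ref.size() ? ref : barcode;
            std::size_t shared = 0;
            for (const auto& k : small)
                if (large.count(k) != 0) ++shared;
            double oc = small.empty() ? 0.0
                                      : static_cast<double>(shared) / small.size();
            if (oc >= min_oc) return true;
        }
        return false;
    }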

kFish doesn’t store nor operate on k-mers directly, instead k-mers are hashed to a 64-bit number using ntHash, and the hashes are stored in a Counting Quotient Filter (CQF), which is a prob- abilistic data structure for Approximate Membership Query (AMQ) (Mohamadi et al., 2016;

Pandey et al., 2017). For an in-depth description of k-mer hashing and storage refer to appen-

(23)

dices B and C. CQFs and other probabilistic data structures are described in sections C.4 and C.5.

The barcode-enrichment procedure, from here on referred to as “kFishing”, is initiated by the fish subcommand. fish has three mandatory arguments and several optional arguments and flags that can be used to tweak kFish behavior.

The three mandatory arguments are:

• --db to specify the path to the database instance to use for kFishing.

• --prefix to specify the prefix to use for the output files.

Passing --prefix '/tmp/kfish_test/output' will cause kFish to write its output files in the folder /tmp/kfish_test/, using output to prefix file names.

• At least one argument between --reference-seq and --reference-barcodes. The first option is used to specify the path to a FASTA file (can be gzipped) containing one or more sequences to use as reference or bait while kFishing. --reference-barcodes can instead be used to specify barcode identifier(s) to be used as baits. Barcodes can be specified directly on the command line as a comma-separated list

(e.g.--reference-barcodes 'BX:Z:AGCTCTCAGATTCGTC-1,BX:Z:GTCTGAAAGTAGCTAA-1').

When specifying more than a few barcodes, it might be more convenient to specify the path to a file storing the reference barcodes as a comma-separated list

(e.g. --reference-barcodes '/tmp/reference_barcodes.csv').

When providing the barcodes using a file, the --separator argument can be used to specify a different barcode separator (e.g. --separator '\t' would be used in case of tab-separated files).

The following are some of the most commonly used optional arguments:

• --dump-intermediate-results will cause kFish to write the fishing yield to file at every iteration.

• --force will disable some of the checks that are in place to prevent kFish from overwrit- ing files.

• --seq-for-mask can be used to specify a FASTA file (can be gzipped) storing a set of sequences to use for masking (usually repeat masking). It is good practice to specify a curated repeat library whenever one is available. K-mers contained in the sequences stored in this file will be ignored by all kFishing steps.

• --max-barcodes-to-fish can be used to specify the maximum number of barcodes to be fished (default: 1,000). It should be noted that this argument does not guarantee that kFish will return exactly 1,000 barcodes. It is possible that kFish will return less than 1,000 barcodes if kFishing is terminated because a round yields 0 barcodes. It is also possible that kFish will return more than 1,000 barcodes. If for instance the current fishing round yields 999 barcodes in total, kFish will run for another iteration, which is likely to yield more than 1 barcode, bringing the total number of barcodes that have been fished above 1,000.


• --max-fishing-rounds can be used to specify the maximum number of kFishing rounds to execute (default: 10).

• --k1 can be used to specify the k-mer size to use for the first filtering step (default: 56).

• --k2 can be used to specify the k-mer size to use for the second filtering step (default:

25).

• --min-number-of-reads-per-bc can be used to specify the minimum number of reads that a barcode must have in order to be processed by kFish (default: 100). Setting this value too low will negatively affect the false positive rate.

For a complete and up-to-date list of the available command line arguments, consult the help message printed by kFish --help.

For the sake of brevity, from here onward I will use the term barcode to refer to the set of reads or k-mers (depending on the context) that share the same barcode, while I will use the term barcode identifier to refer to the actual barcode sequence that is used to uniquely identify reads generated from HMW DNA molecules from the same GEM.

To illustrate the kFishing procedure I will give a walk-through of a fishing run aimed at the enrichment and assembly of GRC-linked sequences. All the settings have their default value unless stated differently.

--reference-seq will be used to pass a FASTA file containing two sequences (S0 and S1) that are believed to be specific to the GRC of zebra finch, and which will be used as references (from here on the terms reference sequence(s) and bait(s) will be used interchangeably). I will also use --seq-for-mask to pass a curated library of repetitive elements.



Figure 2: Simplified diagram of kFish core algorithm.

A Trimmed reads from a Chromium DNA library are ingested into the kFish database. The database only needs to be created once, as it can be reused across multiple fishing runs. During data ingestion, reads sharing the same barcode are grouped together. B The hashed k-mers of the reference sequences are used to build the global and single-sequence CQFs (CQF1 and CQF2 respectively). C Reads with the same barcode are pulled from the database and their hashed k-mers are looked up in the global CQF to rapidly discard barcodes that share no or very few k-mers with the reference sequences. D Barcodes that survived the first filtering step (C) are processed by the second filtering step, where the OC between the barcode being processed and each individual CQF is calculated. E Barcodes that survived both filtering stages are added to the pool of reference sequences, and steps B through E are repeated until enough barcodes have been fished. F After enough barcodes have been fished, reads corresponding to these barcodes are extracted from the database and fed to the genome assembler. G Scaffolds produced by the genome assembler are screened to identify likely GRC-linked scaffolds using the germline/soma coverage ratio selection criterion. H Scaffolds that have been identified as likely GRC-linked are used as reference sequences by the next fishing run.

Database icon made by Freepik from Flaticon.


2.3.1 Initializing the CQFs

The first step of kFishing involves building the CQFs for the reference sequences and the sequences for masking. kFish will build two different kinds of CQFs for the reference and masking sequences. The first kind is known as global CQF and will store the k-mers of all the reference sequences (and also reference barcodes when appropriate), while the second kind will only store the k-mers of a single reference sequence (or barcode). Furthermore, because kFish uses different k-mer sizes for the first and second filtering stages, in most situations it is necessary to build two sets of k-mers representing the same reference sequence(s) or barcode(s). The first set will store k-mers of size k1, which will be used by the first filtering step, while the second will use k2, the k-mer size used by the second filtering step.

In the current example, kFish will start by reading the sequences stored in the FASTA file passed with --seq-for-mask and will build two CQFs for masking, the first using k1 as k-mer size and the second using k2, which are respectively 56 and 25 by default. Sequences are processed one by one using a sliding-window approach, where a window k1 or k2 nucleotides wide moves from the 5' end to the 3' end one bp at a time, generating the hashes for all the k-mers in the sequence that is being processed and adding them to the appropriate Counting Quotient Filter (CQF). If a k-mer contains any character other than A, T, G or C, the k-mer will be discarded and the sliding window moved to the next position.
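The sliding-window logic can be summarized with the following Python sketch. This is a minimal illustration rather than kFish's actual implementation: Python's built-in hash() stands in for ntHash, and a plain set stands in for a CQF; all names are illustrative.

def kmer_hashes(sequence, k):
    """Yield a hash for every k-mer of size k, discarding k-mers that
    contain characters other than A, C, G or T."""
    valid = set("ACGT")
    for i in range(len(sequence) - k + 1):
        kmer = sequence[i:i + k]
        if set(kmer) <= valid:  # skip windows containing ambiguous bases
            yield hash(kmer)

# Example: populate one masking filter (a set standing in for a CQF).
mask_cqf_k1 = set()
for seq in ["ACGTNACGTACGTT"]:  # placeholder masking sequence
    mask_cqf_k1.update(kmer_hashes(seq, k=5))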

After building the two global CQFs for masking, kFish will proceed to build another pair of global CQFs that will be used in the filtering stage.

The procedure is very similar to what was done to generate the two masking CQFs. The sequences from the FASTA file specified with --reference-seq are processed one by one, and their hashed k-mers are used to build the two global CQFs that will be used for kFishing. The only difference is that in this case not only k-mers containing any character other than the standard nucleotides will be discarded, but also k-mers whose hash is found in any of the two masking CQFs. If one or more reference barcodes were specified by using the --reference-barcodes argument, the reads corresponding to those barcodes would have been fetched from the database, and their k-mers added to the global CQFs similarly to what was done with the reference sequences.

At this point kFish has generated all the global CQFs required by the kFishing procedure, and can move on to generate the CQFs for each reference sequence (or barcode) considered individually. These CQFs will only be used by the second filtering stage, and so kFish will generate one CQF for every reference using k2 as k-mer size. Sequences are once again processed one by one using a sliding window, and k-mers that are either masked or contain illegal characters are discarded. For a complete overview of CQFs and their capabilities refer to section C.5 and the original publication by Pandey et al. (2017).
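The initialization of the global and per-reference filters can be sketched as follows, reusing kmer_hashes() from the previous snippet. Again, Python sets stand in for the CQFs, and all names are assumptions made for illustration rather than identifiers from kFish's code base.

def build_filters(references, k1, k2, mask_k1, mask_k2):
    """references: dict mapping a sequence name to its nucleotide string.
    mask_k1/mask_k2: sets of hashed k-mers built from --seq-for-mask."""
    global_k1, global_k2 = set(), set()
    per_reference_k2 = {}  # one filter per bait; used by the second stage only
    for name, seq in references.items():
        kmers_k1 = {h for h in kmer_hashes(seq, k1) if h not in mask_k1}
        kmers_k2 = {h for h in kmer_hashes(seq, k2) if h not in mask_k2}
        global_k1 |= kmers_k1  # masked k-mers never enter any filter
        global_k2 |= kmers_k2
        per_reference_k2[name] = kmers_k2
    return global_k1, global_k2, per_reference_k2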

Once all the CQFs have been generated, kFish is ready to start the actual fishing procedure, which essentially consists of traversing the database to identify barcodes that show the desired degree of similarity with one or more of the baits.

(27)

2.3.2 First filtering stage

The first and second filtering stages take place one right after the other: barcodes that pass the first filtering stage are immediately processed by the second filtering stage, and barcodes that survive both filtering steps are selected by the fishing procedure. For the sake of simplicity this section only describes the first filtering stage, while the second one will be described in section 2.3.3.

The purpose of the first filtering stage is to efficiently process all the barcodes stored in the database to rapidly discard barcodes that share no or very few k-mers with the reference sequences. Given the k-mer composition of a reference sequence and that of an unrelated barcode, we ideally expect no shared k-mers. If the two sets share one or more k-mers, it is because of one of the following reasons:

• False positives due to hash collisions (i.e. ntHash maps two distinct k-mers to the same hash value).

• False positives due to the probabilistic nature of the CQF (i.e. two distinct k-mers that are hashed to two different values occupy the same slot in the CQF).

• The reference sequence and the genomic region the barcode originated from share one or more subsequences that are k1 bp or longer. These regions could for example be due to repetitive elements, low-complexity regions or a conserved motif.

To be effective, the first filtering stage should strike a good balance between sensitivity and specificity, so that it can discern between the case where a barcode and a bait share some k-mers by chance, and the case where they do because they both represent the same genomic region.

The specificity and sensitivity of the first filtering stage are mostly controlled through the --barcode-similarity-thresh and --read-similarity-thresh options, which will be described in the following paragraphs.

In the first filtering stage, kFish traverses the database processing batches of thousands of barcodes in parallel, using all the CPU cores available on the machine where kFish is run. Barcodes in the same batch are processed individually by generating the hashed k-mers of size k1 for every read belonging to the barcode that is being processed. The hashed k-mers are looked up in the appropriate global CQF, and the number of hits is recorded by keeping a counter for each individual read. When the counter of a read exceeds the threshold set by --read-similarity-thresh, the read is flagged as similar to the global CQF. The number of reads in a barcode that are flagged as similar to the global CQF is tracked by another counter. When this counter exceeds the threshold set by --barcode-similarity-thresh, the barcode is flagged as possibly related to one or more of the baits, and is passed to the second filtering stage.

The threshold rT specified with --read-similarity-thresh is expressed as a percentage of the maximum number of k-mers of size k1 that can be generated from a read R of length L. This upper bound |K|_MAX can be calculated with the expression |K|_MAX = L − k1 + 1. The threshold rT specifies the number of shared k-mers that are required in order for a read to be flagged as similar to the global CQF; rT can then be obtained by multiplying the upper bound on the number of possible k-mers, |K|_MAX, by the value of --read-similarity-thresh.

The threshold bcT set through --barcode-similarity-thresh can be expressed as a percentage of the reads in a barcode; the number of reads that are required to be flagged as similar to the global CQF for a barcode to pass the first filtering stage can then be obtained by multiplying bcT by the number of reads in the barcode that is being processed. When bcT ≥ 1, the threshold is instead interpreted literally, meaning that when bcT = 10, kFish requires 10 reads to be flagged as similar in order for the barcode to pass the first filtering stage.
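Putting the two thresholds together, the first filtering stage can be sketched as below. The exact boundary semantics (strict versus non-strict comparisons) and all names are assumptions made for illustration; sets again stand in for the CQFs, and kmer_hashes() is reused from the earlier sketch.

def barcode_passes_first_stage(reads, global_k1, k1, read_thresh, bc_thresh):
    """reads: list of read sequences sharing one barcode.
    read_thresh: --read-similarity-thresh, a fraction of |K|_MAX.
    bc_thresh: --barcode-similarity-thresh (fraction if < 1, literal count otherwise)."""
    similar_reads = 0
    for read in reads:
        k_max = len(read) - k1 + 1             # |K|_MAX = L - k1 + 1
        r_t = read_thresh * k_max              # per-read threshold rT
        hits = sum(1 for h in kmer_hashes(read, k1) if h in global_k1)
        if hits > r_t:                         # read is similar to the global CQF
            similar_reads += 1
    bc_t = bc_thresh * len(reads) if bc_thresh < 1 else bc_thresh
    return similar_reads >= bc_t               # pass on to the second stage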


2.3.3 Second filtering stage

Barcodes that pass the first filtering stage are immediately processed by the second filtering stage, which will decide whether or not a barcode is selected by the current fishing round.

The second filtering stage, from here on referred to as the evaluation step, essentially measures the degree of similarity between the k-mer composition of the barcode that is being evaluated and every reference sequence or barcode considered individually. The degree of similarity is assessed through the Overlap Coefficient (OC). Given two sets of k-mers A and B, the overlap coefficient can be calculated by dividing the cardinality of the intersection of sets A and B, |A ∩ B|, by the cardinality of the smaller of the two sets:

OC(A, B) = |A ∩ B| / min(|A|, |B|)

The first step of the evaluation stage involves computing the cardinality of the intersection between the k-mer composition of the barcode B that is being evaluated and the k-mer composition of each sequence or barcode R that was provided as reference. To generate the k-mer composition for barcode B we simply fetch its reads from the kFish database and process the reads one by one using the sliding-window approach described in section 2.3.1. This time the hashed k-mers will be stored in a set S, which is implemented on top of a hash map. This is done to avoid the false positives caused by the CQF, and also because kFish will iterate over the set of k-mers S rather than perform random queries, so using a CQF would introduce some complexity without providing any significant performance benefit. Adding k-mers to the set S serves the purpose of generating the set of distinct k-mers for the barcode B. For a brief overview of hash maps refer to section C.2.

At this point kFish proceeds to calculate the OC between the barcode B and every reference sequence (or barcode) R. If the highest amongst the calculated OCs is above a certain threshold (which can be a fixed value specified with --oc-lower-bound, or dynamically calculated as explained in section 2.3.5), then the barcode will be selected and end up in the yield of the current fishing run. To calculate the OC we only really need to compute the cardinality of the intersection set, as the size of a set of k-mers can be trivially obtained by calling the size() method on the set S or on the CQF QR, which stores the k-mer content of one of the reference sequences. The most straightforward way to calculate the intersection cardinality is to iterate over the set of k-mers S, querying the CQF QR while keeping track of the number of hits with a counter IR. Given that the set S and the CQF QR both store a deduplicated list of k-mers (meaning that a given k-mer is stored in a set or CQF once and only once), the value assumed by the counter IR after iterating over the k-mers in the set S will correspond to the cardinality of the intersection set.
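The evaluation step can be summarized with the following sketch, where a Python set S stands in for the hash map holding the barcode's distinct k-mers and per-reference sets stand in for the CQFs QR; kmer_hashes() is reused from the earlier sketch, and all names are illustrative.

def best_overlap_coefficient(barcode_reads, per_reference_k2, k2):
    """Return the highest OC between barcode B and any reference R."""
    s = set()                                  # distinct k-mers of barcode B
    for read in barcode_reads:
        s.update(kmer_hashes(read, k2))
    best = 0.0
    for q_r in per_reference_k2.values():
        if not s or not q_r:
            continue
        i_r = sum(1 for h in s if h in q_r)    # |S ∩ QR| via point queries
        best = max(best, i_r / min(len(s), len(q_r)))
    return best                                # selected if above the OC threshold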

(29)

While the results produced by the above procedure are correct, the procedure itself is fairly inefficient: it is reasonable to expect that most of the k-mers from the barcode under examination will not be part of any of the set intersections, and yet the procedure queries all the k-mers every time. A more efficient approach is to reduce the size of the set of k-mers that will be used to calculate the size of the intersections. This can be accomplished by creating two sets of k-mers, S0 and S1. When processing the reads from the barcode B, kFish will add all the hashed k-mers to the set S0. Before adding a k-mer k to S1, kFish will check whether k is stored in the global CQF that was built using k2 as k-mer size, and will add it if and only if the query was successful. Since a k-mer that is absent from the global CQF cannot occur in any of the per-reference CQFs, only the k-mers in S1 need to be looked up when computing the intersection cardinalities.
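The pruning step might look as follows; here global_k2 stands in for the global CQF built with k-mer size k2, the split into S0 and S1 mirrors the description above, and all names are assumptions made for illustration.

def split_kmers(barcode_reads, global_k2, k2):
    s0, s1 = set(), set()
    for read in barcode_reads:
        for h in kmer_hashes(read, k2):
            s0.add(h)            # all distinct k-mers; |S0| feeds min(|A|, |B|)
            if h in global_k2:   # a k-mer absent from the union of all baits
                s1.add(h)        # cannot appear in any intersection
    return s0, s1                # only S1 needs to be queried per reference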

2.3.4 Wrapping up the first kFishing round

After processing all the entries in kFish's database using the filtering steps described in sections 2.3.2 and 2.3.3, if --dump-intermediate-results was specified, kFish will proceed to write the current fishing yield to disk. With the default settings each iteration will output three FASTQ files storing the forward, reverse and unpaired reads corresponding to the barcodes that have been selected by the current fishing run.

At the end of every round, kFish will check whether any of the terminating conditions has been met, and if none has, it will prepare the CQFs for the next fishing round. With the default settings, this involves resetting all the CQFs, re-adding the k-mers from the reference sequences, and then also adding the k-mers from the barcodes yielded by the previous fishing round. If the --remove-kmer-singletons flag was specified, then all the singleton k-mers will be evicted from the global CQFs. This can be especially useful to remove k-mers that originated from reads containing one or more sequencing errors. It should be noted that this will also remove k-mers corresponding to valid genomic regions that are covered by a single read.

This alternation of updating the CQFs with the yield of the last fishing round and performing a new fishing round will continue until one or more of the terminating conditions are met.
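At a high level, the iterative procedure can be condensed as below. fish_one_round is a hypothetical callable that runs both filtering stages over the database and returns the set of barcode identifiers yielded by that round; the stopping conditions mirror --max-fishing-rounds and --max-barcodes-to-fish, including the possible overshoot discussed earlier.

def run_kfishing(fish_one_round, baits, max_rounds=10, max_barcodes=1000):
    """baits: set of reference identifiers used as the initial baits."""
    fished = set()
    for _ in range(max_rounds):
        yielded = fish_one_round(baits | fished) - fished
        if not yielded:                  # a round yielding 0 barcodes terminates
            break
        fished |= yielded                # the yield joins the baits of the next round
        if len(fished) >= max_barcodes:  # checked only between rounds, so the
            break                        # final tally may exceed the cap
    return fished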

The final output produced by the last fishing round consists of two FASTQ files, one storing the paired reads in interleaved format, and the second storing the unpaired reads.

2.3.5 Dynamically setting the Overlap Coefficient threshold

Using a fixed threshold for the OC can often be enough during the first fishing round, as kFish is comparing the k-mer content of a well characterized set of reference sequences with the k-mer content of a barcode that made it past the first filtering stage. From the second round onward, a fixed threshold is unlikely to be an optimal solution, as kFish will be making several comparisons between barcodes, each of which represents DNA molecules whose sizes can be significantly different from one another. To make up for this lack of flexibility, kFish provides the option to dynamically adjust the OC threshold, which can be toggled on and off by specifying --auto-adjust-similarity-thresh or --no-auto-adjust-similarity-thresh respectively (this option is enabled by default).

References

Pandey, P., Bender, M. A., Johnson, R. and Patro, R. (2017). A general-purpose counting filter: making every bit count. In Proceedings of the 2017 ACM International Conference on Management of Data (SIGMOD '17), pp. 775–787.
