
Algorithms for Aligning Genetic Sequences to Reference Genomes

Jens-Ola Ekström

VT 2017

Bachelor thesis, 15 credits
Supervisor: Lars-Erik Janlert
Examiner: Pedher Johansson

Bachelor of Science Programme in Computing Science, 180 credits


The technologies for sequencing genetic materials have improved vastly during the last fifteen years. Sequences can now be determined at affordable costs, and more genetic sequences are therefore available than ever before. The bottleneck is no longer to obtain genetic sequences but rather to analyze all the sequence data.

A primary step in sequence analysis is to determine a sequence fragment's position in the genome, a process called aligning. From a Computing Science point of view, this is essentially text matching. It is, however, a much more complex task than searching for strings in ordinary text documents. First, there is a large amount of data. An ordinary sequencing experiment could generate more than 100 million sequences. Each sequence should be matched to a reference genome sequence, which is some billion characters in size. Second, the obtained sequences may differ from the reference genome sequence. The algorithms are thus not only searching for exact matches, but for the best approximate matches.

In this work I review the algorithms behind modern sequence alignment software. I also propose evaluating the fast Fourier transform for the task.


Contents

1 Introduction

2 Background

2.1 Basics in molecular biology

2.2 Next Generation Sequencing

3 Algorithms for sequence alignments

3.1 Exact matching algorithms

3.1.1 Hash tables

3.1.2 Tree structures

3.1.3 The Burrows-Wheeler transform

3.1.4 The fast Fourier transform

3.2 Approximate matching algorithms

3.2.1 The Smith-Waterman and Needleman-Wunsch algorithms

4 Discussion

References

Appendix: Calculations


1 Introduction

The term Next Generation Sequencing (NGS) has been coined to describe the last fifteen years of development of new technologies to determine the sequences of genetic materials. NGS is essentially a massive parallelization of sequencing reactions. Sequencing of genetic material is now performed at affordable costs, which has led to an explosion in the amount of genetic data. The bottleneck is no longer to obtain genetic data but to analyze all the sequences.

With the sequencing technologies available today, each DNA molecule reveals only a short fragment of its genetic sequence. A sequence of between 30 and up to some hundred nucleotides may be determined for each DNA molecule. These short sequences are simply called reads.

By determining many reads it is possible, and today also routine, to cover the entire genetic material of an organism. Genome sizes differ between organisms; the human genome has about 3.3 × 10⁹ nucleotides. Plants may have up to 400-fold larger genomes, whereas bacteria have about 1/1000 the size of the human genome.

The main challenge is perhaps that sequencing with NGS produces a considerable amount of data. A regular small-scale experiment may generate more than 100 million sequence fragments. That is 25-50 GB of compressed text data from a single experiment (including identity and quality scores). It is truly a challenge for Computer Scientists and Programmers to construct software that is practically useful for the common Biologist or Clinician, who are the end users of these tools. The computational resources in a Biology research environment are commonly standard desktop computers, in which case the amount of accessible memory may be of concern. Clinical laboratories have greater economic possibilities to set up dedicated servers with increased memory capacity, but in that case the time to analyze each sample may be of the essence. Overall, NGS provides exciting challenges for Computer Scientists, Biologists and Clinicians alike, in their respective areas of profession.

The primary step in analyzing NGS data is to identify the position of each sequence fragment in the organism's reference genome sequence. This process is called aligning; the result is an alignment. From a Computing Science point of view, aligning DNA is essentially text searching. DNA sequences can be described as words over a finite alphabet with four characters. A hundred million or so short pieces of text (each 30-200 characters) should be placed in the right position relative to the corresponding reference text sequence (some billion characters). The challenge increases as biological reality also includes some degree of variation and deviation from previously known sequences. The task is thus not only to find all positions for exact matches, but to find all positions for the best approximate matches.

In this work I review the algorithms behind modern NGS alignment software. I also discuss how an alternative algorithm, the fast Fourier transform (FFT), could solve the task well, although this algorithm is not currently used in NGS software.


2 Background

2.1 Basics in molecular biology

A cell keeps all rules for its survival and reproduction in its deoxyribonucleic acid (DNA) archive. DNA is composed of a backbone of sugar molecules, called deoxyribose, chained together in such a way that they make a strand. Each deoxyribose is linked to one of four different nitrogenous bases, called adenine (A), cytosine (C), guanine (G) and thymine (T). A unit of deoxyribose and any of the nitrogenous bases is called a nucleotide. The order of the nucleotides in the DNA strand makes the genetic code. A DNA strand binds to another DNA strand, which together make the well-known double helix [11]. Putting all pieces of DNA in a cell together, they amount to about 3.3 × 10⁹ nucleotides.

DNA is used in mainly two ways. First, it can be copied to an identical¹ copy upon cell division. Second, some selected portions of DNA can be transcribed to a ribonucleic acid (RNA) copy. RNA is chemically almost identical to DNA, but it is much less stable and breaks down within minutes or hours. Those regions of DNA that are transcribed into RNA are called genes. RNA has a number of different roles in the cell, one being that it contains instructions for how amino acids should be assembled into proteins.

2.2 Next Generation Sequencing

The main difference between next generation sequencing (NGS) and previous sequencing technologies is the massive parallelization of the chemical reactions. Earlier, DNA sequences were determined in solution, with all molecules floating around in the test tube. Only one particular kind of DNA molecule could be studied in the entire test tube. In NGS, the DNA molecules are attached to a surface, which makes it possible to observe single DNA or RNA molecules during the chemical reactions. Analyzing single DNA molecules is possible due to the development of new chemistry and improved detection methods. Many different DNA molecules can therefore be studied in parallel in a single sample. [59]

Determining the sequence of single RNA molecules serves mainly two different purposes. One is counting the number of RNA copies of each gene. This is based on the assumption that the RNA copy number is proportional to the gene's biological activity in the cell. For example, many genes are active only during fetal development but are not expressed at all in the rest of the organism's life. In this case, sequencing of RNA serves to identify which gene, i.e. which position in the genome, was used to produce this piece of RNA. This is very useful data in basic research in Biology and Medicine, but also for diagnosing tumors.

¹ Changes may occur, sometimes by mistake but sometimes also deliberately, for example as a part of the immune defense's adaptation to fight various infectious agents. Changes in genetic sequences are commonly called mutations. These may cause tumors or other diseases, but they are also crucial in the evolution of species.


A second purpose is the mapping of mutations, for example to diagnose inherited diseases. The nucleotide sequence may have been changed by mutations such that a gene has lost its biological function. The instructions for protein assembly are simply erroneous and the produced proteins may therefore malfunction. As NGS is nowadays a relatively cheap technology, it may be economically advantageous to analyze this by sequencing all RNA molecules in the cell rather than analyzing single genes by other means.


3 Algorithms for sequence alignments

Most existing sequence mapping tools use a seed-and-extend technique. For each sequence fragment obtained by NGS, an exact match method determines candidate positions in the reference sequence. These candidate positions are called seeds. As there might be mismatches between the read sequence and the reference sequence, exact match searches can be performed on only a section of the read sequence. For example, Bowtie2 [28] takes the first half, the second half, and the middle half (excluding the outer 1/4 on each side) in three separate searches for every read.
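The section choice described for Bowtie2 could be sketched as follows (a toy illustration of my own; Bowtie2's actual seeding policy is configurable and more elaborate):

```python
def seed_sections(read):
    # Three overlapping sections used as exact-match seeds:
    # the first half, the second half, and the middle half of the read.
    n = len(read)
    return [read[: n // 2],             # first half
            read[n // 2 :],             # second half
            read[n // 4 : n - n // 4]]  # middle half (outer quarters excluded)
```

Each section is then searched exactly against the reference; a mismatch in one part of the read still leaves at least one section that can match exactly.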

In a second step, the seed positions are extended to examine how well the entire query sequence matches the reference sequence. Approximate matching methods are used for this.

Many NGS alignment tools apply yet further processing to adjust the read sequence fragment to the reference genome, taking biological models into consideration.

Only the first two steps are reviewed in this work. Additional steps, involving biological models, concern Biology more than Computing Science, which is why they are left out of this Computing Science thesis. Most tools use one of the algorithms described below, whereas the adaptions to biological models have not yet converged on common standards. The latter may contribute to the large number of tools available for aligning genetic data. Examples of NGS sequence aligners, published in the last ten years, may be found in Table 1.

3.1 Exact matching algorithms

Exact matching algorithms are used in the first step of the analysis as they are faster than approximate matching methods. However, considering the size of the reference genome sequences, linear search methods such as the popular Boyer-Moore algorithm [2] are not practically useful; instead, methods with a more advantageous time complexity are used.

3.1.1 Hash tables

One way to determine sequence positions is by using hash tables. For each position in the reference genome, k characters are used as a key in the hash table. The data corresponding to each unique key could be a pointer to a list that stores the genome positions. Each key needs the capacity to link to several genome positions, as the key's sequence may be present in many different positions in the genome sequence. The keys' sequences should together cover the entire genome sequence, for example:

GCAAAAAGGCCCCTGGGGGGGGGTTAATGAGTACTGGAAAAAGAAGCGCGAGATACCACT...

GCAAAAAGGC index 0    GTTAATGAGT index 22    GCGCGAGATA index 44
CAAAAAGGCC index 1    TTAATGAGTA index 23    CGCGAGATAC index 45
AAAAAGGCCC index 2    TAATGAGTAC index 24    GCGAGATACC index 46
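A minimal sketch of such an index (the function name and the choice of a plain Python dict are my own; real aligners use far more compact encodings):

```python
def build_index(genome, k):
    # Map every k-character substring (k-mer) of the genome to the
    # list of positions where it occurs.
    index = {}
    for i in range(len(genome) - k + 1):
        index.setdefault(genome[i:i+k], []).append(i)
    return index

# Looking up a k-mer taken from a read then gives all candidate
# seed positions, e.g. build_index(reference, 10).get("GCAAAAAGGC", [])
```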


Table 1 Alignment software and their methods to determine seed positions.

Hash table         Tree structures    Burrows-Wheeler
ERNE2 [45]         Masai [52]         Hpg-Aligner [41]
rHAT [37]          STAR [12]          CUSHAW3 [36]
MOSAIK [29]        YAHA [14]          mrsFAST-Ultra [18]
Hobbes2 [25]       Segemehl [20]      CRAC [44]
NextGenMap [51]    PatMan [46]        GEM [40]
SubRead [33]                          BatMeth [34]
Srmapper [17]                         BLASR [5]
SeqAlto [42]                          Batmis [55]
OSA [22]                              BarraCUDA [26]
SplazerS [13]                         Bowtie2 [28]
Stampy [38]                           BWA [30]
GASSST [48]                           SOAP2 [32]
GNUMAP [8]
BFAST [21]
PerM [7]
RazerS [56]
SHRiMP [49]
CloudBurst [50]
PASS [4]
Slider [39]
MAQ [31]
SeqMap [23]
ZOOM [35]
RMAP [53]

Hash tables are the fastest among the methods described here, but also the most memory-demanding [60]. Representing the entire human genome in this way would take about 26 GB of memory for the pointers (calculations may be found in the Appendix). Storing the human genome positions takes an additional 13 GB, list structures excluded. Thus, at least 40 GB would be required to construct a hash table for the complete human genome. Memory demands could be relaxed by dividing the genome into sections and analyzing one section at a time (a bucketization approach). However, this comes with the cost that the entire set of sequence fragments needs to be analyzed for each section of the genome. Another strategy might be to keep only every second key. In this case the cost is that each sequence fragment needs to be analyzed twice, excluding the first character the second time.

3.1.2 Tree structures

A 16-mer DNA fragment can encode 4.3 × 10⁹ unique combinations. A number of these combinations will likely remain unused. For comparison, the human genome is about 3.3 × 10⁹ nucleotides, and the pigeonhole principle gives that at least a billion combinations will remain unused in human sequence analyses. In addition, most eukaryotic genomes contain repeated sequences¹, which further reduces the number of unique combinations that are actually used. Representing only those keys that occur in the genome sequence in a tree structure saves memory space compared to hash tables, as significantly fewer pointers to genome position lists need to be allocated. For example, the PatMan program [46] generates a directed acyclic graph (DAG) of the reference sequence and transforms it into a non-deterministic automaton that is used in the sequence position match analyses.
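As an illustration of the idea (a plain nested-dict trie of my own construction, not PatMan's DAG representation), only k-mers that actually occur in the genome allocate nodes and position lists:

```python
def build_trie(genome, k):
    # Insert every k-mer of the genome into a trie of nested dicts.
    # Only k-mers that occur allocate nodes; leaves collect positions.
    root = {}
    for i in range(len(genome) - k + 1):
        node = root
        for ch in genome[i:i+k]:
            node = node.setdefault(ch, {})
        node.setdefault("positions", []).append(i)
    return root

def lookup(root, kmer):
    # Walk the trie; an absent edge means the k-mer is not in the genome.
    node = root
    for ch in kmer:
        if ch not in node:
            return []
        node = node[ch]
    return node.get("positions", [])
```

Absent k-mers cost nothing here, which is exactly where the savings over a full hash table of all 4^k keys come from.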

3.1.3 The Burrows-Wheeler transform

The Burrows-Wheeler (BW) transform [3] was first reported as an algorithm to improve compression of text data. Strings transformed in this way tend to group similar characters, which is advantageous for entropy-based compression algorithms [3]. The transform also turned out to be a useful algorithm for exact text matching when extended with so-called FM indexes [16]. Its main advantage is a small memory footprint relative to other matching algorithms.

To produce a Burrows-Wheeler transformed string, a unique character is added to the end of the original string. Next, all rotations² of the string are lexicographically sorted. The last characters of all sorted rotations make up the BW-transformed string. Here is an example on the word 'BANANA' ($ is the added string end character):

All rotations:    Sorted:    Last character:
BANANA$           ANANA$B    B
$BANANA           ANA$BAN    N
A$BANAN           A$BANAN    N
NA$BANA           BANANA$    $
ANA$BAN           NANA$BA    A
NANA$BA           NA$BANA    A
ANANA$B           $BANANA    A

BW-transformed string: BNN$AAA

Inverse-transforming a BW-transformed string takes a few more steps. The BW-transformed string is copied and all characters in the copy are sorted lexicographically. The characters of the transformed string are arranged in one column and the sorted copy in the next column. The string end character is found in the sorted column and the character at the corresponding position in the unsorted column is read. This is the last character of the inverse-transformed string. The corresponding character in the sorted column is then found, and the second to last character can be read in the unsorted column. The entire original string can be reconstructed by iterating these steps. See the example:

¹ Over two-thirds of the human genome consists of repeated genetic elements [27].

² Take a copy of the string and move its first character to the last position. Iterate on the last produced string until every character in the original string has been seen in the first position.


Unsorted:  Sorted:  Step:                                     Inverse-transformed word:
B          A        1. Sorted $ -> third unsorted A           A$
N          A        2. Third sorted A -> second unsorted N    NA$
N          A        3. Second sorted N -> second unsorted A   ANA$
$          B        4. Second sorted A -> first unsorted N    NANA$
A          N        5. First sorted N -> first unsorted A     ANANA$
A          N        6. First sorted A -> unsorted B           BANANA$
A          $        7. Sorted B -> unsorted $; finished.      BANANA$
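The transform and its inverse can be sketched naively (an illustration of my own; note that Python's default ordering sorts '$' before the letters, so this version yields 'ANNB$AA' for 'BANANA' rather than the 'BNN$AAA' of the example above, which sorts the end marker last — both conventions give a valid, invertible transform):

```python
def bwt(s):
    # Append a unique end marker, sort all rotations, keep the last column.
    s = s + "$"
    rotations = sorted(s[i:] + s[:i] for i in range(len(s)))
    return "".join(rot[-1] for rot in rotations)

def inverse_bwt(b):
    # Repeatedly prepend the transformed string as a column and re-sort;
    # after len(b) rounds each row is a full sorted rotation.
    table = [""] * len(b)
    for _ in range(len(b)):
        table = sorted(b[i] + table[i] for i in range(len(b)))
    original = next(row for row in table if row.endswith("$"))
    return original[:-1]
```

The inversion here rebuilds the whole rotation table at once; the column-stepping procedure described above recovers the same string one character at a time.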

All rotations are represented in the transformed string, which also means that every substring that exists in the original string can be found at the beginning of some rotation. All rotations are also lexicographically sorted by the BW transformation. Of crucial importance, this enables binary search algorithms for substrings; searches can be made in logarithmic time.

It can, however, be a laborious task to identify a substring's position in the original string. The position becomes known only when the string end character is seen. The FM index extension of the BW transform addresses this problem [16]. Each character of a certain type, say every T, is tagged with its position in the original string. To save memory space, all but, for example, every fourth T in the transformed string could drop their position assignment. To determine its position, the substring then needs to be back-transformed only until a T character that has a position linked to it is found.

Look-ups in the sorted transformed string cause no serious problem, as the characters are lexicographically sorted. An algorithm can simply sum the total count of each character that lexicographically comes before the character of interest. Look-ups in the unsorted transformed string take more effort and are again addressed by the FM-index approach. The position of, for example, every 16th character of each type is kept in an array, which allows fast look-up of the approximate position. The exact position is determined by a subsequent linear search.

The memory space required to use the BW transform as sketched above, applied to the human genome, would be about 2.5 GB in total (see the Appendix for calculations). Memory requirements can be both increased and decreased by changing the sizes of the FM indexes. This affects how frequently the position information is available to the algorithm, which in turn impacts the look-up times in an inverse linear relation. The BW transform combined with FM indexes has m × log(n) time complexity, where m is the number of sequence fragments to align and n is the size of the reference sequence. It has, however, significantly lower memory space requirements than the other algorithms discussed in this work.
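The exact-match search over a BW-transformed string can be sketched as the classical FM-index backward search (a minimal illustration of my own: the occurrence function is computed naively here, where a real implementation would use the sampled count arrays described above):

```python
def bwt(s):
    # Naive Burrows-Wheeler transform ('$' sorts before the letters here).
    s = s + "$"
    return "".join(rot[-1] for rot in sorted(s[i:] + s[:i] for i in range(len(s))))

def count_occurrences(bw, pattern):
    # C[c] = number of characters in bw strictly smaller than c.
    counts = {}
    for ch in bw:
        counts[ch] = counts.get(ch, 0) + 1
    C, total = {}, 0
    for ch in sorted(counts):
        C[ch] = total
        total += counts[ch]
    # Occ(c, i) = occurrences of c in bw[:i] (naive; FM indexes sample this).
    occ = lambda c, i: bw[:i].count(c)
    # Backward search: process the pattern right to left, narrowing the
    # interval [lo, hi) of sorted rotations starting with the suffix seen so far.
    lo, hi = 0, len(bw)
    for c in reversed(pattern):
        if c not in C:
            return 0
        lo = C[c] + occ(c, lo)
        hi = C[c] + occ(c, hi)
        if lo >= hi:
            return 0
    return hi - lo  # number of occurrences of pattern in the original string
```

Each pattern character costs one interval update, which is where the logarithmic-like per-read behaviour of BW-based aligners comes from.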

3.1.4 The fast Fourier transform

During this work it has struck me that the Fourier transform [57] is not used in modern NGS alignment software, but I think it would be an interesting option to explore. The Fourier transform converts a function from its time domain to its frequency domain (Figure 1), and it is used in many image and sound compression programs. A special case is the fast Fourier transform (FFT) [58], which has the attractive n × log(n) time complexity and can be applied to data sizes of 2ⁿ values.


The idea is to represent, for example, every A as the value -1, C as 0, G as 1 and T as 2, as sample points in a waveform, which makes a function of the sequence. The Fourier transform would reveal this function's frequency components. If all frequencies of the candidate sequence are present in the reference sequence, and the phase angles also have identical differences at some point, the position in the reference sequence can be determined. The Fourier transform has been used in a number of classical sequence alignment programs [1, 6, 15, 24, 47], providing proof of concept.
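A sketch of how FFT-based matching could look (a toy construction of my own, using one indicator vector per base and cross-correlation rather than the single-valued encoding suggested above, and a textbook radix-2 FFT rather than a production routine): one correlation per base yields the number of matching characters at every offset of the read against the reference.

```python
import cmath

def fft(a, invert=False):
    # Recursive radix-2 Cooley-Tukey FFT; len(a) must be a power of two.
    n = len(a)
    if n == 1:
        return list(a)
    even = fft(a[0::2], invert)
    odd = fft(a[1::2], invert)
    sign = 1 if invert else -1
    out = [0j] * n
    for k in range(n // 2):
        w = cmath.exp(sign * 2j * cmath.pi * k / n)
        out[k] = even[k] + w * odd[k]
        out[k + n // 2] = even[k] - w * odd[k]
    return out

def ifft(a):
    return [x / len(a) for x in fft(a, invert=True)]

def best_offset(reference, read):
    # Pad to a power of two at least the reference length.
    n = 1
    while n < len(reference):
        n *= 2
    total = [0.0] * n
    for base in "ACGT":
        # Indicator vectors: 1.0 where the base occurs, else 0.0.
        A = [1.0 if i < len(reference) and reference[i] == base else 0.0
             for i in range(n)]
        B = [1.0 if j < len(read) and read[j] == base else 0.0
             for j in range(n)]
        # Cross-correlation via FFT: corr[s] = sum_j A[s+j] * B[j].
        FA, FB = fft(A), fft(B)
        corr = ifft([fa * fb.conjugate() for fa, fb in zip(FA, FB)])
        total = [t + c.real for t, c in zip(total, corr)]
    # total[s] is now the number of matching characters at offset s.
    scores = [round(t) for t in total[: len(reference) - len(read) + 1]]
    best = max(range(len(scores)), key=lambda s: scores[s])
    return best, scores[best]
```

All offsets are scored in O(n log n) regardless of read length, which is the property that makes the transform interesting for this task.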

Figure 1: A. The Fourier transform divides a complex wave (blue) into its frequency components (green, yellow and red). B. The transform returns amplitude and phase angle for each frequency component.

3.2 Approximate matching algorithms

The biological reality is that there are numerous differences between an individual's genome and the reference genome sequence. Alignment tools therefore also need to search for the best approximate matches. There seems to be a consensus across different NGS alignment tools to use either the Smith-Waterman [54] or the Needleman-Wunsch [43] algorithm, which have obvious similarities.

One way to manage approximate matching in sequence analyses is the so-called edit-distance array [43, 54]. This is a linear search procedure to find the best match between two strings. Each character in the substring is compared to the corresponding position in the other string. If the characters differ, the substring receives a penalty score. In the next round, the substring is moved one character position and a new penalty score is determined. The best match is naturally the position with the lowest penalty score. Allowing gaps in the sequences could produce lower penalty scores, for example:


Without gaps (six penalties):
GCGTATGCGGCTAAGCG
GCTATGCGGCTATAGCG

With gaps (two penalties):
GCGTATGCGGCTA-AGCG
|| |||||||||| ||||
GC-TATGCGGCTATAGCG
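The gapless sliding comparison described above can be written down in a few lines (a naive quadratic sketch of my own, only to make the procedure concrete):

```python
def best_gapless_match(reference, read):
    # Slide the read along the reference; count mismatches at each offset.
    best_pos, best_penalty = 0, len(read) + 1
    for s in range(len(reference) - len(read) + 1):
        window = reference[s : s + len(read)]
        penalty = sum(1 for a, b in zip(read, window) if a != b)
        if penalty < best_penalty:
            best_pos, best_penalty = s, penalty
    return best_pos, best_penalty
```

Handling gaps, as in the second alignment above, requires the dynamic programming approach described in the next section rather than a simple slide.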

In Biology, mutations are not completely random; nucleotides mutate more often to a specific other nucleotide. Transition-type mutations (A ↔ G or C ↔ T) can be observed in about 1/3000 nucleotides, transversions (A ↔ C, A ↔ T, C ↔ G or G ↔ T) in about 1/6000 nucleotides, and insertions or deletions in roughly 1/12000 nucleotides. It can also be discussed whether a gap that extends over more than one nucleotide should be considered a single mutation event or one mutation event for each nucleotide it covers.

For efficient computation, a penalty matrix can be created that corresponds to the observed mutation frequencies (og = opening gap, eg = extending gap):

     A  C  G  T  og  eg
A    0  2  1  2   4   1
C    2  0  2  1   4   1
G    1  2  0  2   4   1
T    2  1  2  0   4   1
og   4  4  4  4   0   1
eg   1  1  1  1   1   0

Allowing an unlimited number of substitutions and gaps increases the number of calculations dramatically; more than 5ⁿ different combinations could be tested. This is not practically useful for efficient computing. Looking closer at such an algorithm, it becomes apparent that many tested candidate strings have partly identical sequences, and these subsequences are therefore tested numerous times. The standard technique to avoid repeated calculation of identical values is dynamic programming, i.e. to store the result of each calculation for later use.

3.2.1 The Smith-Waterman and Needleman-Wunsch algorithms

The Needleman-Wunsch [43] and Smith-Waterman [54] algorithms have obvious similarities. The main difference between the two algorithms is the penalty matrix. The Needleman-Wunsch algorithm is described here. An array is constructed where the reference string is laid out along the columns and the candidate string along the rows. The first row is assigned zeros and the first column is assigned increasing integers. The value for each remaining element is calculated by dynamic programming such that the least value of three separate calculations from previous iterations is selected. These calculations are: 1) the penalty score for comparing the characters in the two strings, plus the value in the diagonally up-left element; 2) the penalty to introduce or extend a gap, plus the value in the element above; 3) the penalty to introduce or extend a gap, plus the value in the element to the left. This could look like:


    -  T  A  T  T  G  G  C  T  A  T  A  C  G  G  T  T
-   0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0
G   1  2  1  2  2  0  0  2  2  1  2  1  2  0  0  2  2
C   2  2  4  2  3  4  2  0  3  4  2  4  1  4  2  1  3
G   3  4  4  6  4  3  4  4  2  4  6  4  5  1  4  4  3
T   4  3  6  4  6  6  3  5  4  4  4  8  5  5  3  4  4
A   5  6  3  7  6  8  7  5  7  4  6  4  8  7  7  5  6
T   6  5  7  3  7  8 10  8  5  8  4  8  5 10  9  7  5
G   7  8  7  7  5  7  8 12  9  7  8  6  9  5  9 11  9
C   8  8 10  8  8  7  9  8 12 11  8 10  6  9  7 10 12

To select the best match, the element with the least penalty score in the last row is identified and the path of least values through the array is followed. This path is indicated by dots in the example below. Exact matches would give a path diagonally up-left and gaps are indicated by a vertical or horizontal path.

    -  T  A  T  T  G  G  C  T  A  T  A  C  G  G  T  T
-   0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0
G   1  2  1  2  2  0  .  2  2  1  2  1  2  0  0  2  2
C   2  2  4  2  3  4  2  .  3  4  2  4  1  4  2  1  3
G   3  4  4  6  4  3  4  4  .  4  6  4  5  1  4  4  3
T   4  3  6  4  6  6  3  5  .  4  4  8  5  5  3  4  4
A   5  6  3  7  6  8  7  5  7  .  6  4  8  7  7  5  6
T   6  5  7  3  7  8 10  8  5  8  .  8  5 10  9  7  5
G   7  8  7  7  5  7  8 12  9  7  8  .  9  5  9 11  9
C   8  8 10  8  8  7  9  8 12 11  8 10  .  9  7 10 12

The best match in this example would be:

TATTGGC-TATACGGTT
     || ||| |
     GCGTATGC

A path could branch into different matches with equal penalty scores. Edit-distance algorithms may therefore require some delicate decision-making to select which branch to report back.
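The scheme above can be sketched as follows (a simplified version of my own: a single mismatch penalty and a single gap penalty instead of the full penalty matrix with separate gap-opening and extension costs, and only the best score and end column are returned, not the traceback):

```python
def semi_global_align(reference, query, mismatch=2, gap=4):
    # Penalty-minimizing dynamic programming as described above.
    # First row is zeros, so the query may start anywhere in the reference;
    # first column holds increasing gap penalties.
    rows, cols = len(query) + 1, len(reference) + 1
    D = [[0] * cols for _ in range(rows)]
    for i in range(1, rows):
        D[i][0] = i * gap
    for i in range(1, rows):
        for j in range(1, cols):
            sub = 0 if query[i - 1] == reference[j - 1] else mismatch
            D[i][j] = min(D[i - 1][j - 1] + sub,  # match/mismatch
                          D[i - 1][j] + gap,      # gap in the reference
                          D[i][j - 1] + gap)      # gap in the query
    # Best match: least penalty anywhere in the last row.
    end = min(range(cols), key=lambda j: D[rows - 1][j])
    return D[rows - 1][end], end
```

A full implementation would additionally record, for each cell, which of the three candidates was chosen, so that the alignment path can be traced back from the best cell in the last row.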

The Smith-Waterman algorithm is almost identical to the Needleman-Wunsch algorithm, the main difference being the penalty score matrix. In Smith-Waterman, matching characters get a positive score whereas mismatches receive a negative score. All scores in the array are, however, always kept non-negative. In this case, the best path is the one with the highest scores. The starting point of this path may be found not only in the last row but anywhere in the matrix.

This seemingly subtle difference between the Smith-Waterman and Needleman-Wunsch algorithms has a big impact on how the results are interpreted. Needleman-Wunsch returns the best position for the entire query sequence, whereas Smith-Waterman gives the best position for a part of the query sequence, while at the same time another part may match poorly. Smith-Waterman is therefore often referred to as local alignment. The Smith-Waterman and Needleman-Wunsch algorithms have m × n complexity in both time and space.
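For comparison, the local variant can be sketched with the same structure (again with a simplified scoring scheme of my own choosing: +2 match, -1 mismatch, -2 gap; real tools use tuned score matrices and affine gap costs):

```python
def smith_waterman_score(reference, query, match=2, mismatch=-1, gap=-2):
    # Score-maximizing dynamic programming with scores clamped at zero,
    # so an alignment may start and end anywhere (local alignment).
    rows, cols = len(query) + 1, len(reference) + 1
    H = [[0] * cols for _ in range(rows)]
    best = 0
    for i in range(1, rows):
        for j in range(1, cols):
            s = match if query[i - 1] == reference[j - 1] else mismatch
            H[i][j] = max(0,
                          H[i - 1][j - 1] + s,  # match/mismatch
                          H[i - 1][j] + gap,    # gap in the reference
                          H[i][j - 1] + gap)    # gap in the query
            best = max(best, H[i][j])
    return best  # score of the best local alignment
```

The clamp at zero is what lets a high-scoring region stand on its own instead of being dragged down by poorly matching flanks.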

To improve execution time, dynamic programming is an obvious target for parallel processing. However, both the Smith-Waterman and Needleman-Wunsch algorithms have dependencies on previous iterations such that a diagonal wavefront of data is processed across the array. This is not optimal for parallelization. The dependencies on data both above and to the left in the array are due to the possibility of handling gaps in both sequences. Gaps are, however, a relatively uncommon feature. A number of implementations process an entire row or an entire column in each round of parallel processing. In case gap scores need to be taken into consideration, the entire row or column is processed repeatedly until all scores are set.


4 Discussion

Next Generation Sequencing has opened up a previously hidden dimension of information in basic Biology and Biomedical research. Due to funding constraints, large-scale sequencing of genetic materials was previously only possible for large consortia specializing in a few model organisms. Since the first complete genomes were published there have been strong expectations and pressure to develop affordable sequencing technologies. Big leaps have been taken in the last fifteen years in the chemical processing to determine sequences. These technologies are nowadays affordable for essentially any laboratory. The bottleneck is now instead to analyze the enormous amount of data this generates.

Most tools use the seed-and-extend technique, i.e. find a piece of the candidate sequence that matches the reference genome by exact match algorithms, and then attempt to extend this match to cover the entire candidate sequence. Traditional linear text searches are simply not suitable due to the large reference genome sequences, which may have billions (for some plants even more than a trillion) of nucleotides. Instead, the algorithms are based on data structures that allow fast look-ups (hash tables, tree structures), or the reference sequence is organized in a way that allows binary searches (the Burrows-Wheeler transform). Further optimizations and improved parallelization can surely be applied to most tools. I believe, however, that radically better performance would require that someone manages to think outside the box and comes up with a fresh idea.

During this work it struck me that modern NGS alignment tools do not utilize the Fourier transform to identify the sequence fragment positions. This transform has been used in signal analysis for decades and it has been proven to work also in the sequence alignment context. Whether the reason for not using the Fourier transform is that it has been outcompeted by the other algorithms or that it simply has been overlooked remains an open question to me. Sometimes researchers tend to just do what everybody else does. Perhaps it could be worth the effort to evaluate the FFT algorithm also in the Next Generation Sequencing context.

Affordable access to genetic sequences has certainly opened new and exciting possibilities for Biologists, Biomedical researchers and Clinicians in their respective disciplines. Besides improved research opportunities and analyses in Biology and Medicine, this technology also poses interesting questions to Computing Scientists.


References

[1] D.C. Benson. Fourier methods for biosequence analysis. Nucleic Acids Research 18:6305–6310, 1990.

[2] R.S. Boyer, J.S. Moore. A fast string searching algorithm. Communications of the ACM 20:762-772, 1977.

[3] M. Burrows, D.J. Wheeler. A block-sorting lossless data compression algorithm. Dig- ital Systems Research Center SRC-RR-124, 1994.

[4] D. Campagna, A. Albiero, A. Bilardi, E. Caniato, C. Forcato, S. Manavski, N. Vitulo, G. Valle G. PASS: a program to align short sequences. Bioinformatics 25:967-698, 2009.

[5] M.J.Chaisson, G. Tesler Mapping single molecule sequencing reads using basic local alignment with successive refinement (BLASR): application and theory BMC Bioin- formatics 13:238, 2012.

[6] E.A. Cheever, G.C. Overton, D.B. Searls. Fast Fourier transform-based correlation of DNA sequences using complex plane encoding. Computational Applications in Bio- science 7:143–154, 1991.

[7] Y. Chen, T, Souaiaia, T. Chen. PerM: efficient mapping of short sequencing reads with periodic full sensitive spaced seeds. Bioinformatics 25:2514-2521, 2009.

[8] N.L. Clement, Q. Snell, M.J. Clement, P.C. Hollenhorst, J. Purwar, B.J. Graves, B.R.

Cairns, W.E. Johnson. The GNUMAP algorithm: unbiased probabilistic mapping of oligonucleotides from next-generation sequencing. Bioinformatics 26:38-45.

[9] R. Cole. Tight bounds on the complexity of the Boyer-Moore string matching algo- rithm. Proceedings of the second annual ACM-SIAM symposium on Discrete algo- rithms 222-233, 1991.

[10] A. Cornish-Bowden. Nomenclature for incompletely specified bases in nucleic acid sequences: Recommendations 1984. Nucleic Acids Research 13:3021-3030, 1985.

[11] F. Crick. Central dogma of molecular biology. Nature 227:561-563, 1970.

[12] A. Dobin, C.A. Davis, F. Schlesinger, J. Drenkow, C. Zaleski, S. Jha, P. Batut, M.

Chaisson, T.R. Gingeras. STAR: ultrafast universal RNA-seq aligner. Bioinformatics 29:15–21, 2013.

[13] A.K. Emde, M.H. Schulz, D. Weese, R. Sun, M. Vingron, V.M. Kalscheuer, S.A.

Haas, K. Reinert. Detecting genomic indel variants with exact breakpoints in single-

and paired-end sequencing data using SplazerS. Bioinformatics 28:619-27, 2012.

(20)

[14] G.G. Faust, I.M. Hall. YAHA: fast and flexible long-read alignment with optimal breakpoint detection. Bioinformatics 28: 2417–2424, 2012.

[15] J Felsenstein, S Sawyer, R Kochin. An efficient method for matching nucleic acid sequences. Nucleic Acids Research 10:133-139, 1982.

[16] P. Ferragina, G. Manzini. Opportunistic data structures with applications. 41st Annual Symposium on foundations of computer science. 2000.

[17] P.M. Gontarz, J. Berger, C.F. Wong. SRmapper: a fast and sensitive genome-hashing alignment tool. Bioinformatics 29:316-21, 2013.

[18] F. Hach, I. Sarrafi, F. Hormozdiari, C. Alkan, E.E. Eichler, S.C. Sahinalp mrsFAST- Ultra: a compact, SNP-aware mapper for high performance sequencing applications Nucleic Acids Research 42:W494–W500, 2014.

[19] M.T. Heideman, D.H. Johnson, C.S. Burrus. Gauss and the History of the Fast Fourier Transform. IEEE ASSP Magazine 1:14-21, 1984.

[20] S. Hoffmann, C. Otto, S. Kurtz, C.M. Sharma, P. Khaitovich, J. Vogel, P.F. Stadler, J. Hackermüller. Fast mapping of short sequences with mismatches, insertions and deletions using index structures. PLoS Computational Biology 5:e1000502, 2009.

[21] N. Homer, B. Merriman, S.F. Nelson. BFAST: an alignment tool for large scale genome resequencing. PLoS One 4:e7767, 2009.

[22] J. Hu, H. Ge, M. Newman, K. Liu. OSA: a fast and accurate alignment tool for RNA-Seq. Bioinformatics 28:1933-1934, 2012.

[23] H. Jiang, W.H. Wong. SeqMap: mapping massive amount of oligonucleotides to the genome. Bioinformatics 24:2395-2396, 2008.

[24] K. Katoh, K. Misawa, K. Kuma, T. Miyata. MAFFT: a novel method for rapid multiple sequence alignment based on fast Fourier transform. Nucleic Acids Research 30:3059-3066, 2002.

[25] J. Kim, C. Li, X. Xie. Improving read mapping using additional prefix grams. BMC Bioinformatics, 15:42, 2014.

[26] P. Klus, S. Lam, D. Lyberg, M.S. Cheung, G. Pullan, I. McFarlane, G.S. Yeo, B.Y. Lam. BarraCUDA - a fast short read sequence aligner using graphics processing units. BMC Research Notes 5:27, 2012.

[27] A.P.J. de Koning, W. Gu, T.A. Castoe, M.A. Batzer, D.D. Pollock. Repetitive Elements May Comprise Over Two-Thirds of the Human Genome. PLoS Genetics 7:e1002384, 2011.

[28] B. Langmead, C. Trapnell, M. Pop, S.L. Salzberg. Ultrafast and memory-efficient alignment of short DNA sequences to the human genome. Genome Biology 10:R25, 2009.

[29] W.-P. Lee, M.P. Stromberg, A. Ward, C. Stewart, E.P. Garrison, G.T. Marth. MOSAIK: A Hash-Based Algorithm for Accurate Next-Generation Sequencing Short-Read Mapping. PLoS One 9:e90581, 2014.


[30] H. Li, R. Durbin. Fast and accurate short read alignment with Burrows-Wheeler Transform. Bioinformatics 25:1754-1760, 2009.

[31] H. Li, J. Ruan, R. Durbin. Mapping short DNA sequencing reads and calling variants using mapping quality scores. Genome Research 18:1851-1858, 2008.

[32] R. Li, C. Yu, Y. Li, T.-W. Lam, S.-M. Yiu, K. Kristiansen, J. Wang. SOAP2: an improved ultrafast tool for short read alignment. Bioinformatics 25:1966-1967, 2009.

[33] Y. Liao, G.K. Smyth, W. Shi. The Subread aligner: fast, accurate and scalable read mapping by seed-and-vote. Nucleic Acids Research 41:e108, 2013.

[34] J.-Q. Lim, C. Tennakoon, G. Li, E. Wong, Y. Ruan, C.-L. Wei, W.-K. Sung. BatMeth: improved mapper for bisulfite sequencing reads on DNA methylation. Genome Biology 13:R82, 2012.

[35] H. Lin, Z. Zhang, M.Q. Zhang, B. Ma, M. Li. ZOOM! Zillions of oligos mapped. Bioinformatics 24:2431-2437, 2008.

[36] Y. Liu, B. Popp, B. Schmidt. CUSHAW3: sensitive and accurate base-space and color- space short-read alignment with hybrid seeding. PLoS One 9:e86869, 2014.

[37] B. Liu, D. Guan, M. Teng, Y. Wang. rHAT: fast alignment of noisy long reads with regional hashing. Bioinformatics 32:1625-1631, 2016.

[38] G. Lunter, M. Goodson. Stampy: A statistical algorithm for sensitive and fast mapping of Illumina sequence reads. Genome Research 21:936-939, 2011.

[39] N. Malhis, Y.S. Butterfield, M. Ester, S.J. Jones. Slider - Maximum use of probability information for alignment of short sequence reads and SNP detection. Bioinformatics 25:6-13, 2009.

[40] S. Marco-Sola, M. Sammeth, R. Guigó, P. Ribeca. The GEM mapper: fast, accurate and versatile alignment by filtration. Nature Methods 9:1185-1188, 2012.

[41] I. Medina, J. Tárraga, H. Martínez, S. Barrachina, M.I. Castillo, J. Paschall, J. Salavert-Torres, I. Blanquer-Espert, V. Hernández-García, E.S. Quintana-Ortí, J. Dopazo. Highly sensitive and ultrafast read mapping for RNA-seq analysis. DNA Research 23:93-100, 2016.

[42] J.C. Mu, H. Jiang, A. Kiani, M. Mohiyuddin, N.B. Asadi, W.H. Wong. Fast and accu- rate read alignment for resequencing. Bioinformatics 28:2366–2373, 2012.

[43] S.B. Needleman, C.D. Wunsch. A general method applicable to the search for similarities in the amino acid sequence of two proteins. Journal of Molecular Biology 48:443-453, 1970.

[44] N. Philippe, M. Salson, T. Commes, E. Rivals. CRAC: an integrated approach to the analysis of RNA-seq reads. Genome Biology 14:R30, 2013.

[45] N. Prezza, F. Vezzi, M. Käller, A. Policriti. Fast, accurate, and lightweight analysis of BS-treated reads with ERNE 2. BMC Bioinformatics 17:69, 2016.


[46] K. Prüfer, U. Stenzel, M. Dannemann, R.E. Green, M. Lachmann, J. Kelso. PatMaN: rapid alignment of short sequences to large databases. Bioinformatics 24:1530-1531, 2008.

[47] S. Rajasekaran, X. Jin, J.L. Spouge. The efficient computation of position-specific match scores with the Fast Fourier Transform. Journal of Computational Biology 9:23-33, 2004.

[48] G. Rizk, D. Lavenier. GASSST: global alignment short sequence search tool. Bioinformatics 26:2534-2540, 2010.

[49] S.M. Rumble, P. Lacroute, A.V. Dalca, M. Fiume, A. Sidow, M. Brudno. SHRiMP: Accurate Mapping of Short Color-space Reads. PLoS Computational Biology 5:e1000386, 2009.

[50] M.C. Schatz. CloudBurst: highly sensitive read mapping with MapReduce. Bioinfor- matics 25:1363-1369, 2009.

[51] F.J. Sedlazeck, P. Rescheneder, A. von Haeseler. NextGenMap: fast and accurate read mapping in highly polymorphic genomes. Bioinformatics 29:2790-2791, 2013.

[52] E. Siragusa, D. Weese, K. Reinert. Fast and accurate read mapping with approximate seeds and multiple backtracking. Nucleic Acids Research 41:e78, 2013.

[53] A.D. Smith, Z. Xuan, M.Q. Zhang. Using quality scores and longer reads improves accuracy of Solexa read mapping. BMC Bioinformatics 9:128, 2008.

[54] T.F. Smith, M.S. Waterman. Identification of common molecular subsequences. Jour- nal of Molecular Biology. 147:195–197, 1981.

[55] C. Tennakoon, R.W. Purbojati, W.K. Sung. BatMis: a fast algorithm for k-mismatch mapping. Bioinformatics 28:2122-2128, 2012.

[56] D. Weese, A.K. Emde, T. Rausch, A. Döring, K. Reinert. RazerS - fast read mapping with sensitivity control. Genome Research 19:1646-1654, 2009.

[57] https://en.wikipedia.org/wiki/Fourier_transform. Retrieved June 3, 2017.

[58] https://en.wikipedia.org/wiki/Fast_Fourier_transform. Retrieved June 3, 2017.

[59] https://en.wikipedia.org/wiki/DNA_sequencing. Retrieved June 3, 2017.

[60] https://en.wikipedia.org/wiki/Hash_table. Retrieved June 3, 2017.


Appendix: Calculations

This appendix describes calculations for some of the data stated in the text.

In section 3.1.1 Hash tables

These calculations are based on array sizes rather than hash tables. The size of the hash table is calculated from a worst-case scenario that saturates all possible combinations of k nucleotides, which is unlikely in a real-life experiment. On the other hand, hash tables need considerable overhead to function well.

The text states that it takes 26 GB of memory space to keep the pointers and 13 GB to store all positions in the human genome for the hash table. This follows from the human genome containing 3.3 × 10^9 nucleotides, i.e. 3.3 × 10^9 key-value pairs if the offset for each k-mer is one nucleotide. Assuming 64-bit pointers (8 bytes per pointer) and 32-bit integers (4 bytes per integer) to keep genome positions, that is:

3.3 × 10^9 × 8 bytes = 26.4 × 10^9 bytes ≈ 26 GB.

3.3 × 10^9 × 4 bytes = 13.2 × 10^9 bytes ≈ 13 GB.
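The arithmetic above can be expressed as a short calculation. This is a minimal sketch under the assumptions stated in the text (3.3 × 10^9 key-value pairs, 64-bit pointers, 32-bit integer positions); the variable names are illustrative only:

```python
# Memory estimate for a position-per-nucleotide hash index of the
# human genome, using the assumptions stated in the appendix text.
GENOME_SIZE = 3.3e9      # nucleotides, i.e. key-value pairs
POINTER_BYTES = 8        # 64-bit pointer
POSITION_BYTES = 4       # 32-bit integer genome position

pointer_space = GENOME_SIZE * POINTER_BYTES   # space for the pointers
position_space = GENOME_SIZE * POSITION_BYTES # space for the positions

print(f"pointers:  {pointer_space / 1e9:.1f} GB")   # 26.4 GB
print(f"positions: {position_space / 1e9:.1f} GB")  # 13.2 GB
```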

In section 3.1.3 The Burrows-Wheeler transform

As genetic sequences are represented by an alphabet of four characters, each character can be represented by two bits. This makes four nucleotides per byte. The size of the human genome, 3.3 × 10^9 nucleotides, divided by four makes approximately 825 MB of memory space to keep the human genome sequence.

For the FM indexes, calculating the memory space needed for tagging every fourth T character, I first assume the four nucleotides are present in roughly equal numbers in the reference sequence. There are thus approximately 8.25 × 10^8 T characters in the reference sequence in total; one fourth of these is about 2.1 × 10^8. Assuming each position is stored as a 32-bit integer (4 bytes), the multiplication by four cancels the last division and we are back at 8.25 × 10^8 bytes, or 825 MB, of storage space.

For the second part of the FM indexes, storing the position of every 16th character of each kind, it is simply the whole genome size divided by 16, times the storage space for each position (here 32 bits, or four bytes):

3.3 × 10^9 × 4 bytes / 16 = 825 MB.

In total, memory space is needed for the BW-transformed sequence and the FM indexes. This makes 825 MB × 3 ≈ 2.5 GB.
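The same estimate can be sketched in code, again under the assumptions in the text (equal nucleotide frequencies, every fourth occurrence tagged, every 16th position sampled, 4-byte positions); the variable names are illustrative only:

```python
# Memory estimate for the BW-transformed genome plus its FM indexes,
# following the assumptions in the appendix text.
GENOME_SIZE = 3.3e9     # nucleotides in the human genome
POSITION_BYTES = 4      # 32-bit integer per stored position

# 2 bits per nucleotide -> four nucleotides per byte.
bwt_space = GENOME_SIZE / 4

# ~3.3e9/4 occurrences of each character; every fourth one is tagged
# with a 4-byte position, so the division by four cancels out.
tag_space = (GENOME_SIZE / 4 / 4) * POSITION_BYTES

# Store the position of every 16th character of each kind.
sample_space = (GENOME_SIZE / 16) * POSITION_BYTES

total = bwt_space + tag_space + sample_space  # the text rounds this to ~2.5 GB
print(f"BWT:     {bwt_space / 1e6:.0f} MB")     # 825 MB
print(f"tags:    {tag_space / 1e6:.0f} MB")     # 825 MB
print(f"samples: {sample_space / 1e6:.0f} MB")  # 825 MB
```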

