Application of machine learning techniques to perform base-calling in next-generation DNA sequencing

(1)

IN

DEGREE PROJECT ENGINEERING PHYSICS, SECOND CYCLE, 30 CREDITS

,

STOCKHOLM SWEDEN 2020

Application of machine learning

techniques to perform base-calling

in next-generation DNA sequencing

ELIN TEGFALK

(2)

Application of machine learning

techniques to perform base-calling in

next-generation DNA sequencing

Elin Tegfalk

Degree Project in Applied Physics (30 ECTS credits) Degree Program in Engineering Physics

KTH Royal Institute of Technology, 2020

Supervisor at Single Technologies: Jacob Kowalewski Examiner at KTH: Hjalmar Brismar

(3)

I

Abstract

This thesis project is performed in the field of biomedical data analysis and DNA sequencing, with the aim to study next-generation sequencing technologies, and to investigate and assess the potential of using machine learning techniques to perform base-calling in next-generation sequencing platforms.

DNA sequencing, with a special focus on next-generation sequencing, is reviewed through a literature study, while the machine learning approach to sequencing data analysis is investigated by performing base-calling with four different supervised machine learning algorithms and evaluating the accuracy of the base-calls, as well as the training time. Base-calling is the process of determining the DNA bases from sequencing image data, and for the purpose of this project, sequencing data with known bases are used to evaluate the performance of the machine learning algorithms. Two different datasets are used for training and testing.

For training and testing on the same dataset, accuracy scores ranging from 98.0 % to 98.9 %, and 90.3 % to 96.0 %, respectively are achieved for the two datasets. However, when training and testing are performed on different datasets, the accuracy scores decrease to around 50 %. The training times required were only a few seconds, ranging from 0.7 to 16 seconds, for all but one of the algorithms which stood out with a running time of over one hour for the second, larger dataset.

(4)

II

Sammanfattning

Detta examensarbete behandlar biomedicinsk dataanalys samt DNA-sekvensering och syftet med projektet är att både studera DNA-sekvensering, framförallt med fokus på de nyare sekvenseringsmetoder som benämns som next-generation sequencing, samt att undersöka och utvärdera möjligheterna med att använda maskininlärning för att utföra så kallad base-calling, eller basbestämning, vilket är en del av dataanalysen i dagens

sekvenseringsplattformar. Projektet är uppdelat i två delar: en teoretisk litteraturstudie av DNA-sekvensering, samt en del som behandlar maskininlärning och base-calling.

Base-calling innebär att bestämma DNA-baserna utifrån det bilddata som genereras i sekvenseringsprocessen, och möjligheten att använda maskininlärning för att utföra base-calling undersöks med hjälp av fyra olika maskininlärningsalgoritmer. Två olika dataset med kända DNA-baser används också för att träna och testa algoritmerna, vilka utvärderas utifrån den procentuella delen av korrekta base-calls, samt träningstiden för algoritmerna.

Vid test och träning på samma dataset så varierade andelen korrekta base-calls mellan 98,0 % och 98,9 % respektive 90,3 % och 96,0 % för de två dataseten. Vid test och träning på olika dataset sjönk däremot andelen korrekta base-calls till ungefär 50%.

Träningstiderna uppmättes till ett antal sekunder, mellan 0,7 och 16 sekunder, för alla algoritmer utom en, vilken utmärkte sig med att behöva mer än en timmes träningstid för det andra, större datasetet.

(5)

III

Acknowledgements

I would like to give a special thanks to my supervisor at Single Technologies, Jacob

(6)

IV

1. Introduction

1.1 The thesis project

This thesis project is performed in the field of biomedical data analysis and DNA sequencing. The aim is to study next-generation sequencing techniques, and to apply machine learning algorithms to the image data analysis that is an integral part of the present-day sequencing technologies.

The project is performed at Single Technologies, in Stockholm, Sweden; a company with a long tradition in fibre optic, ultra-sensitive fluorescence spectroscopy, proteomics and genomics; with the ambition to become a leading provider of single molecule imaging technologies for the 21st_{century diagnostics and pharmacology.}1

1.2 Introduction to DNA sequencing

DNA, with its spiralling double-helical structure and its function as the carrier of genetic material in all living organisms, is a molecule that since its discovery has, and still does, fascinate and capture the imagination and interest of both researchers and others alike. It has been widely studied since its function was discovered in the 1940s, and although we have learned a great deal about genomics and the DNA molecule itself, it is by no means a solved puzzle, but very much remains an important research topic today.

DNA sequencing is an essential part of genomic research, i.e. the study of genomes, and is the process of determining the order of the nucleoside bases in a given segment of DNA. The first genomes were sequenced in the 1970s and ever since the research community and sequencing companies have worked on developing new, better sequencing techniques to be able to reliably sequence large amounts of genomic data in a realistic timeframe, and to a reasonable cost.

So, what benefits are there to DNA sequencing that makes it so important to further study and improve? Well, the study of DNA and specific genes is a cornerstone in our quest to understand the mysteries of life. Perhaps most importantly, it provides key information in the area of medicine, where the study of genes for example can be used to discover and investigate diseases and genetic illnesses, as well as to explore new possible treatments. There are also other interesting and useful applications of DNA sequencing, such as in forensics, and in the study of human evolution and historical societies, where DNA can provide information of for example ethnic characteristics and ancestry.

(9)

2

1.3 Purpose and goals

As mentioned at the beginning of the introduction the aim of this thesis project is to study next-generation sequencing techniques, and to apply machine learning algorithms to the image data analysis in the sequencing process.

The project therefore consists of two separate but related parts: a literature study of DNA sequencing and its development from the early sequencing till the present-day next-generation sequencing methods; and a machine learning and computer programming part where the aim is to use machine learning to perform a part of the data analysis known as base-calling.

With this in mind I have formulated the following two main goals of the project: • To review DNA sequencing methods, especially focussing on the physics and

biology of next-generation sequencing; with the aim to attain knowledge and understanding of the sequencing process.

• To apply machine learning algorithms to perform base-calling, i.e. to determine the DNA bases from sequencing image data; with the aim to investigate and assess the potential of machine learning in next-generation sequencing platforms.

1.4 Overview of the report

Below follows a short description of the content of the report and the different chapters. Chapter 2: Background, provides a theoretical introduction to the research and is the result of the first part of the project, the literature study of DNA sequencing. It begins with a description of the DNA molecule, continues with the history and development of sequencing from the discovery of the DNA molecule to the present-day sequencing techniques, and then moves on to a more detailed explanation of 21st_century

(10)

3

2. Background

As described in the report overview, this chapter provides a theoretical introduction to the research. It begins with a basic description of DNA in chapter 2.1, and continues on to sequencing in chapters 2.2 and 2.3. Chapter 2.4 then explains confocal fluorescence

microscopy, and chapter 2.5 gives an introduction to machine learning.

2.1 The DNA molecule

Essential for all life on our planet is a single, but most complex molecule –

deoxyribonucleic acid, commonly known as DNA. Inside this molecule lies all hereditary information that an organism needs to develop, function, and reproduce. It also holds information that both determine the characteristics of a species as a whole, as well as that of each individual within it. A copy of DNA is stored in the nucleus of every cell in living organisms, and at cell division and reproduction, it is passed on from cell to cell, as well as from generation to generation.

2.1.1 The structure of DNA

The DNA molecule consists of two complimentary strands where each strand is itself a long polymeric chain made from nucleotide monomers. Each nucleotide in turn consists of three groups of molecules: a nitrogen-containing nucleoside base, the sugar molecule deoxyribose, and one or several phosphate groups.

There are four different nucleoside bases occurring in DNA: adenine, cytosine, guanine, and thymine, commonly denoted by the letters A, C, G, and T. These are in turn structurally divided into two sub-groups: pyrimidines and purines, where

pyrimidines, including cytosine and thymine,

contains one six-membered ring, while purines, including adenine and guanine, contain both a six-membered ring as well as an attached five-membered ring. The structures of the bases are displayed in figure 2.2.

The nucleotide monomers then bind together by covalent bonds, where the phosphate group of one nucleotide binds to the sugar molecule of the next nucleotide. A structure is then created, where the sugar-phosphate groups line up to form the backbone of each single strand of DNA, while the bases protrude from this backbone.

(11)

4

The full DNA molecule consists of two such complimentary single strands that bind together and shape into the famous double helical structure. The strands of the double helix are held together by hydrogen bonds between the bases of each strand. However, the bases do not pair at random, but form only two combinations. A pairs with T by two hydrogen bonds, while C pairs with G by three hydrogen bonds. In

both cases a single-ring pyrimidine is paired with a two-ringed purine base, a prerequisite for both pairs to be of the same size. The specific pairing is why the two DNA strands are said to be complimentary, or antiparallel.

The length, weight, and size of DNA are usually measured in the number of these base pairs (b.p.). For example, the human genome consists of about 3.2 billion base pairs.2

2.1.2 Cell function and genes

DNA is also vital for the function of cells and accordingly for the function of an organism as a whole. Most of the cell’s functions are performed by proteins, which are produced internally in an organism. Instructions on how to produce these proteins are stored in specific regions of the DNA which are called genes. Generally, a gene is defined as a segment of DNA that contains instructions for making a particular protein.

Figure 2.3: Helical structure of DNA. (Courtesy: National Human Genome Research Institute, www.genome.gov) Figure 2.2: Structure the four nucleoside bases

(Courtesy: National Human Genome Research Institute, www.genome.gov)

(12)

5

The number of genes of different organisms varies greatly and generally increases with the complexity of the organism. For example, humans have about 30 000 genes. Together all genes also constitute the genome, that is, the total genetic information of an organism.2

2.1.3 RNA

RNA, short for ribonucleic acid, is another nucleic acid, very similar to DNA. However, it does differ from DNA in three important aspects: it is typically one-stranded instead of double-stranded, contains the sugar molecule ribose instead of deoxy-ribose, and has a base called uracil instead of thymine. The RNA molecule also performs an important function in gene expression. There, DNA is replicated into RNA in a process called

transcription. Sections of the RNA are then translated into particular amino acids, and so, a certain protein can be produced by the instructions in the RNA.

Although DNA is the carrier of genetic material in all known advanced lifeforms, RNA can also carry genetic information and does so in some viruses, commonly referred to as RNA viruses. Such viruses can also replicate its genome by the process of reverse

transcription, where the original RNA molecule serves as a template for a complimentary RNA strand, which in turn serves as a template for the construction of a RNA strand identical to the original.3

2.2 DNA and RNA sequencing

DNA sequencing is the process of determining the order of the nucleoside bases (A, C, G, and T) in a given segment of DNA. Today sequencing of genomes is an important part of biological and medical research, and the field has developed rapidly since the structure of the DNA molecule was determined in the 1950s.

2.2.1 The history of sequencing and development of first-generation

sequencing

The DNA molecule was discovered in 1869 by the Swiss scientist Fredrich Meischer. Meischer worked in the field of physiological chemistry, an at the time relatively new field of research that aims to discover the biochemistry of life. At the time of his

discovery, Meischer worked on the biochemical composition of lymphocytes. During his work he also came to study leukocytes (white blood cells), and it was during his

(13)

6

undiscovered molecule in the nuclei of the leukocytes. He named the newfound molecule nuclein, and it is this molecule that today is known as DNA.

The purpose of DNA, however, remained unknown for many years after its discovery, and genetic information was thought to be contained in proteins. It wasn’t until the 1940s that scientists drew the conclusion that DNA was indeed the carrier of genetic information. This was suggested by Oswald T. Avery, Colin MacLeod, and Maclyn McCarty, in their 1944 paper. Eight years later, this was confirmed by Alfred Hershey and Martha Chase.4

The same year, in 1952, Rosalind Franklin managed to take diffraction photographs of DNA using x-ray crystallography, and from these she started to discern the double-helical structure of DNA with the nucleoside base-pairs on the inside of a backbone of sugar phosphate groups.5_{Her work then led James Watson and Francis Crick to fully solve the}

three-dimensional structure of DNA, which they presented in 1953.

Once the molecular structure was determined, researchers progressed toward sequencing DNA, and we can say that it is at this stage the history of sequencing really begins. It would, however, take until the 1970s before reliable methods for sequencing DNA were developed. In the early stages, researchers focussed on reading RNA sequences, due to its availability and its simpler single-strand structure. The first nucleic acid sequence was produced by Robert Holley and colleagues in 1965, and in 1972 Walter Fiers and his team of scientists managed to produce the first complete protein-coding gene sequence. Four years later they also succeeded in determining the whole genome of the same organism, namely the RNA bacteriophage MS2.

In the wake of these landmarks, the research focus shifted towards sequencing DNA, and it wasn’t long until Frederick Sanger succeeded in sequencing the first DNA genome, that of the bacteriophage X174, also denoted PhiX. Even to this day this particular genome is used in sequencing research, usually as a positive control genome. Sanger would then go on to become one of the leading figures of the early DNA sequencing and his most important breakthrough came in 1977, with the development of a new sequencing technique known as the chain-termination technique or the dideoxy technique.

Sequencing DNA using this technique has thereafter become known as Sanger sequencing after its inventor, and for many years to come it was the most frequently used method to sequence DNA. Today, when new sequencing techniques have been developed, Sanger sequencing is also known as first-generation sequencing.

In the years after the invention of Sanger’s initial sequencing method, several

(14)

7

2.2.2 Human genome sequencing

After the successes of sequencing whole genomes of certain bacteriophages, attention was turned towards sequencing the full human genome. It was considered an ambitious task, considering that the previously sequenced organisms had a genome of only a few

thousand base pairs, compared to the approximately 3.2 billion bases in the human genome. Nevertheless, the project, named the Human Genome Project, received funding from the US government and several other actors, and was initiated in 1990. The goal of the project was to, within a timeframe of 15 years, determine the full nucleotide sequence of the human genome, as well as identify all our genes. The project was finished in 2003, two years ahead of time and is considered to be a great scientific success.7

Sequencing whole human genomes is of interest in both medicine and scientific research, and could lead to an era of personalised medicine. However, for this to be feasible on a large scale the time and cost of whole genome sequencing must not be too large. Sequencing the first genome in the Human Genome Project took over a decade and is estimated to have cost just under $3 billion.

However, after the completion of the Human Genome Project, the scientific community began contemplating the future for genomic research and the US National Human

Genome Research Institute formulated several new visions and goals to be achieved. One of these was to significantly lower the cost of whole genome sequencing more specifically to be able to sequence an entire human genome for $1000 or less.8

(15)

8

2.3 Next-generation sequencing

2.3.1 General introduction to next-generation sequencing

In the first years of the 21st_{century, a paradigm shift took place in the field of DNA}

sequencing with the emergence of a new, revolutionary sequencing technique. This technique, aptly named next-generation sequencing (NGS) came to dramatically change the sequencing science, and by significantly reducing both the cost of sequencing as well as the time, it has made it possible to perform DNA sequencing on a much broader scale than ever before.

Today there are several next-generation sequencing platforms developed by different companies. They are all, however, based on the same fundamental principles that distinguish next-generation sequencing, and their working principles are similar.

The most fundamental difference of next-generation sequencing, compared to first-generation Sanger sequencing, is parallelisation. Instead of sequencing long DNA

fragments base by base, the DNA strands are split into short reads that can be sequenced in parallel. The reads in next-generation sequencing are typically below 100 bases, but massive parallelisation makes it possible to sequence millions of such reads at the same time. This makes next-generation sequencing a high-throughput method that can produce

(16)

9

large sets of data per instrument run, and it is mainly this principle that makes the technique both very much faster and cheaper than earlier sequencing techniques.

DNA sequencing today is very much a developing field and further advances have led to the subsequent development of new next-generation sequencing methods. Considering this, it is sometimes relevant to further divide next-generation sequencing methods into the sub-categories, second- and third-generation sequencing.10

2.3.2 Second-generation sequencing

Second-generation sequencing methods are the first next-generation sequencing

techniques and it was machines based on these principles that replaced Sanger sequencing in the early 21st_century.

The workflow of second-generation sequencing methods can be divided into the following three main steps:

1. Library preparation 2. Sequencing

3. Data analysis

2.3.2.1 Library preparation

The first step of the sequencing process is called library preparation, and is performed before the actual DNA sequencing takes place. This process can in turn be divided into three steps, namely:

1. Extraction and purification of DNA 2. Fragmentation

3. Amplification

The very first step is to obtain a sample of genomic DNA from the person or organism whose DNA is to be sequenced, and then to purify it before it can be further processed. Next, the DNA is fragmented, i.e. cleaved into shorter reads of desired length.11_{There are}

(17)

10

is subjected to ultrasonic waves, producing gas-filled cavities in the liquid that when they collapse generate vibrations that break the DNA molecules. The next set of fragmenting techniques are enzymatic methods, which utilise specific fragmentase enzymes that cleave bonds within polynucleotides and thereby shears the DNA strands. Finally, the third set of methods are chemical shearing, which includes the method of using metal ions to hydrolyse DNA molecules, i.e. to cleave them in a process that consumes one water

molecule per cleavage. The generated DNA fragments then constitute the DNA library. As fragmentation is a very important step in high-throughput sequencing, this is also an area where much research is performed, resulting in a continuous development of new

techniques as well as improvements of already existing methods.12, 13

When fragmentation is complete, the sequencing process continues with amplification. However, before this the fragments of the template library are placed on the device where they will later be imaged, typically some sort of microfluidic cell. The amplification is then performed by cloning the DNA, to produce many copies of every fragment. This step is necessary to be able to generate a strong enough signal in the sequencing step, which enables base-calling to be performed in the final data analysis. Hence, amplification refers to amplifying the detected signal. The most common amplification technique, used on many sequencing platforms, is polymerase chain reaction (PCR). With this technique, thousands to millions of copies of DNA can rapidly be generated. Then, when

amplification is completed, all preparation steps are performed, and the DNA is ready to be sequenced.11

2.3.2.2 Sequencing

The second step is called sequencing, and also this stage of the process can be divided into parts, namely:

• Sequencing chemistry, and • Imaging

However, in this case the above steps are not performed only once, as the steps of library preparation, but they are consecutively performed in iterative cycles of first sequencing chemistry, and then image acquisition.

The sequencing chemistry varies slightly between different platforms, but the basic principles that all platforms base their methods on are the same. However, due to this, I will not explain the sequencing process generally, but I will instead describe the workflow of the Illumina platform14_{, since this is one of the most widely used sequencing platforms}

(18)

11

On the Illumina platform the single-stranded DNA fragments from the library are placed on a flow cell, where they are amplified into clusters of identical DNA fragments. The sequencing then proceeds by a process called sequencing-by-synthesis (SBS). During this process, one nucleotide is sequenced at a time. However, this occurs for all clusters at once, which is what creates the parallelisation in NGS. The sequencing then proceeds one nucleotide at time, until all bases of the DNA reads are sequenced. In the sequencing-by-synthesis process, each single-stranded DNA fragment serves as a template for the

incorporation of its complementary base, i.e. A for T, and G for C, and vice versa. The sequencing process begins by the preparation of a mixture of DNA polymerase and nucleotides, containing all four bases, that are each labelled with a fluorescent tag and a reversible terminator. This mixture is then added to the flow cell, where the DNA

polymerase prompts the binding of one nucleotide to its complimentary base in each DNA strand in every cluster. The reversible terminator blocks the incorporation of further bases, thus ensuring that only one base is incorporated in every strand; and the fluorescent tag, which is individual for each base, (A, C, G, and T), serves as an identifier for that particular base. After base incorporation, non-incorporated nucleotides are washed away, and imaging is performed.

In the imaging step, the fluorophores are first excited by lasers, and then a set of four images, one for every base, of the fluorescence emission are acquired, using confocal fluorescence microscopy. After image collection the fluorophores and the reversible terminators are cleaved off so that the next nucleotide base can bind to the template strand. The DNA polymerase and nucleotide mixture is then added again and all the above steps are repeated for the consecutive bases until images have been collected for all bases in the DNA reads.

The sequencing can also be made either single-end or paired-end. In single-end sequencing only one DNA strand is sequenced, while in paired-end sequencing the process is also repeated for the reverse strand, which increases the accuracy of the sequencing. When sequencing large DNA sets, such as in whole genome sequencing, paired-end methods are advantageous and used today on most sequencing platforms.11, 15

2.3.2.3 Data analysis

When the sequencing is completed, the remaining step is the data analysis which aims to determine the base-sequence of the analysed DNA. The data analysis step can also be divided into two separate parts:

(19)

12

First, image-analysis needs to be performed to identify the DNA clusters and separate them from the background. Secondly, the correct sequence of nucleotide bases must be inferred from the light-intensity signals in a process known as base-calling. This process is performed using the acquired images of the fluorescent intensity for all four bases, and the idea of the base-calling is simply that the channel, i.e. the base with the highest

fluorescent intensity, is the correct base. This is true in an ideal world; however, reality is seldom ideal, and nor are the present-day sequencing technologies. Instead, there are a number of factors, particularly arising from imperfections in the sequencing chemistry and the fluorescence emission spectra, that complicate the base-calling process.

Due to imperfections in the sequencing chemistry, sequencing-by-synthesis methods can give rise to two related problems called phasing and pre-phasing. Phasing, also known as lagging, occurs when enzymes in the synthesis process do not work properly, and fail to incorporate or synthesise the next base. If the synthesis process proceeds correctly after this, that read will then lag behind the rest of the correctly synthesised reads in the cluster, and will emit another fluorescent signal. This will in turn affect the measured channel intensities in the acquired images, which directly leads to a higher risk of error in the base-calls. The opposite problem, that, however, has similar consequences for the base-calls, is pre-phasing, or leading. This occurs when two bases are synthesised and incorporated in one sequencing cycle, i.e. where only one base should be incorporated. This will cause that particular read in the cluster to be one base ahead of the rest, which in turn will affect the base-calling in the same way as phasing.

Another phenomenon that affects the measured intensities, and thereby the base-calling, is crosstalk, also known as crossover or bleed-through. This occurs when the emission spectra of two fluorophores overlap so that fluorescent intensity from both fluorophores is detected in a channel designated to measure only the intensity from one of the

fluorophores. Optimally, fluorophores with non-overlapping emission spectra should be used in the sequencing process to avoid this problem. However, this is rarely possible due to the broad bandwidths of many common fluorophores, which make crosstalk a

commonly occurring phenomenon in most applications of fluorescent microscopy. The phenomenon is illustrated in figure 2.5 where it can be seen how the emission spectra of

In

te

n

sity

Wavelength

Emission spectrum of fluorophore tagging base A

Emission spectrum of fluorophore tagging base C

Figure 2.5: Illustration of crosstalk between the A and C channels C channel

(20)

13

the A and C bases overlap and that the fluorescent intensity from the C emission spectrum bleeds into the channel where fluorescent intensity from base A is to be detected.

The quality of the intensity signals usually decreases in later sequencing cycles, making the base-calling less accurate at the end of long reads. This is because, even if the above errors occur with low probability, added together over many sequencing cycles they together contaminate the signal noticeably, and thereby affect the data quality and base-calling results. A significant part of the base-base-calling software is therefore to address and correct for the above-mentioned errors and imperfections, to be able to generate accurate base-calls.

All next-generation sequencing platforms have a built-in base-caller. However, for several of the large platforms, such as the Illumina technology, external parties have developed a significant number of other base-callers that might be used as alternatives to the built-in algorithms. Base-calling algorithms can either be model-based, or empirical, i.e. based on machine learning methods. Most base-callers for the Illumina platform are model-based, such as the built-in base-caller Bustard, and other well-known algorithms such as Rolexa and BayesCall. However, a few regularly used empirical algorithms have also been

developed, such as the base-callers Ibis, freeIbis, and Alta-cyclic. Common for these three algorithms is that they are all based on the same machine learning technique, namely support vector machines, to train the program and perform the base-calling. For example, Ibis uses multi-class support vector machines with polynomial kernels for each cycle.16

2.3.3 Third-generation sequencing

The continuous research in the field of DNA sequencing and the aspiration to improve existing second-generation sequencing technologies has led to the emergence of new sequencing methods, generally categorised as third-generation sequencing techniques to distinguish them from earlier methods.

(21)

14

Another promising technique is nanopore sequencing which also produces real-time sequence data. This sequencing technique works by driving an ionic current, containing the DNA sample to be sequenced, through nanopores and record how the conductivity of the pore changes when the pore is blocked by a nucleotide base. Each of the four

nucleotides yields a different, and recognisable change in the conductivity which enables base-calling to be made based on the shape of the recorded signal. 17, 18

2.3.4 RNA sequencing

Not only DNA is sequenced in the present-day research, but in some applications sequencing of RNA can be advantageous and yield different information from DNA. As described in chapter 2.1.3, RNA has very important functions in living organisms, and is directly involved in the synthesis of proteins. Proteins are in turn responsible for various cell functions, and malfunctions of RNAs can therefore lead to several diseases.

Sequencing RNA makes it is possible to study the steps in the transcription and translation processes, and thereby give a better understanding of the function of RNA.

Several methods to sequence RNA exists today. However, many of them rely on the process of reverse transcription to convert RNA to cDNA (complimentary DNA) using enzymes known as reverse transcriptase. This method is advantageous much due to the fact that DNA is more stable than RNA and therefore better suited for the required analyses.19

2.4 Confocal fluorescence microscopy

Confocal fluorescence microscopy is an imaging technique that can provide high-resolution three-dimensional images of a sample by the acquisition of a stack of images from consecutive cross-sections of the sample. The technique is based on the detection of fluorescence from fluorophores attached to the sample, and on the creation of specific in-focus planes. This imaging technique is commonly used in the sequencing process for the acquisition of image data.

(22)

15

usually has a lower average energy, and lower peak wavelength than the spectrum of the incident light.

High-resolution images are obtained by creating an in-focus plane, so that the detected fluorescence only originates from that particular plane, and signals from out-of-focus planes are blocked. This is achieved by placing a pinhole in front of the detector, so that only light from the chosen plane will pass through the pinhole, whereas it will block light originating from other depths in the sample.

A confocal fluorescence microscope consists of the following main components: a laser, a detection pinhole; a dichroic mirror, a microscope objective, a photo-detector, and some device to select the incident light, usually a single-mode optical fibre, or sometimes an excitation pinhole. The light from the laser and the emitted fluorescence follow two different pathways in the microscope, referred to as the excitation and emission light pathways. The basic composition of the microscope, and the light pathways, are illustrated in figure 2.6, below. The excitation light pathway consists in the laser light passing

through the excitation pinhole and being reflected by the dichroic mirror, before it is focused on a small spot in the sample by the objective. The emitted fluorescence is instead focussed by the microscope objective, transmitted through the dichroic mirror, and

imaged on the detector.20

Figure 2.6: Light pathways in a confocal fluorescence microscope

(23)

16

2.5 Introduction to machine learning

Machine learning can be said to be the science and application of programming computers so that they, by learning from data, can be taught to perform certain tasks and improve through experience. The field of machine learning and artificial intelligence is thought by many to be a relatively new science. However, its roots go back to the mid-20th_century,

and one pioneer in the field was the computer scientist Arthur Lee Samuel who famously created a checkers playing program that could learn and improve from good and bad moves in the game, and eventually outplay human chess champions. The field of machine learning though, has grown rapidly during the 21st_{century and will most likely continue}

to do so. Applying machine learning algorithms to obtain a high-standard performance usually requires processing a lot of data, and the growth of the field can mainly be attributed to the increase in computing power, which makes it possible to use machine learning in an increasing number of applications. Examples of such applications where machine learning algorithms can be advantageous are in classification problems, for example classification of different areas of interest in images or texts; in recommender systems that predict products that a customer may be interested in: in creating programs to play certain games, and in many more areas.21, 22

2.5.1 General structure of machine learning data

The data to be used in machine learning algorithms is usually arranged in a matrix with rows and columns, where each row represents one instance or observation and each column correspond to an attribute. The general structure is displayed in the table 2.1, below.

No/ID Attribute 1 Attribute 2 … Attribute n Label

1 6.5 Blue 12 A

2 4.2 Green 15 B

3 3.1 Brown 22 A

m 5.8 Brown 8 D

The first column is called an ID, and is, as its name suggests, an identifier that should be unique for each instance, and is used for reference purposes. The following columns contains the attributes, which are the data variables that are used to make the predictions. The number of attributes vary and is up to the programmer to choose. You may want to

(24)

17

use only a few variables, or up to several hundred depending on your problem. Finally, the last column contains the labels, i.e. the things that you wish to predict, and can be seen as the key to a given problem.

Attributes and labels can be of two different types, numbers or text, as illustrated in table 2.1. These are called numeric and categorical variables, where numeric variables are the most usual and also the type required in most machine learning algorithms. In some instances, however; categorical variables are of importance, and these can be either two-valued like male or female; or multi-two-valued like for example A, B, C, or D. The type of labels of a problem also specifies it as either a regression problem, when labels are numeric; or a classification problem, when labels are categorical. Classification problems can be further divided into binary classification problems, when the categorical labels are two-valued; and multiclass classification problems, when the categorical labels are multi-valued.

Attributes and labels also go by different names, and in varying literature attributes might also be referred to as predictors, features, or independent variables; while labels are sometimes called outcomes, targets, dependent variables, or responses.23

2.5.2 Types of machine learning

Machine learning algorithms can be divided into categories based on different criteria, for example what data they require to learn and how they learn.

Supervised, unsupervised, semi-supervised, and reinforced learning

Commonly, algorithms are classified according to what data and supervision they need during training. In such a classification system, algorithms are divided into the following four major categories: supervised learning, unsupervised learning, semi-supervised learning, and reinforced learning.

(25)

18

Lastly, we have reinforced learning which is very different from the differently supervised methods. In this case the learning program, called an agent, learns by receiving positive or negative feedback, rewards, or penalties, when performing an action. This method of machine learning is typically used to teach computer programs to play certain computer games, as well as in automation in robotics.22

Regression methods, instance-based methods, decision trees, and ensemble methods

Methods can also be sorted according to what mathematical framework they are based on. Some of the most common classes of algorithms are: regression methods that, as their name suggests, are based on regression analysis; instance-based, or memory-based, methods that work by comparing new data instances to previously seen instances memorised in the training process; and decision tree models that use decision trees as a predictive model to make conclusions about the observed data. Moreover there are

ensemble methods that work by combining multiple learning algorithms to obtain a better performance than any of the algorithms do alone.22

2.5.3 List of machine learning methods

Below follows a list of some commonly used machine learning algorithms.

Regression methods • Linear regression • Logistic regression • Penalised regression Instance-based methods • K-nearest neighbour (kNN) • Support vector machines (SVM)

o SVM with linear kernel o SVM with polynomial kernel

o SVM with gaussian radial basis function (RBF) kernel

Decision trees

(26)

19 Ensemble methods • Bagging • Boosting • Random Forest Optimisation methods

The following algorithms are not specific machine learning algorithms in themselves, but are optimisation techniques that often are applied to machine learning problems to optimise algorithm parameters.24

• Gradient descent

• Stochastic gradient descent (SGD) • Batch gradient descent

2.5.4 Performance measures of binary classifiers

To evaluate the performance of a binary classifier, it is common to use measures derived from values in the so-called confusion matrix. To construct the confusion matrix, the number of times a prediction is correctly, or incorrectly classified, are counted. A distinction is also made between positive and negative predictions, yielding a two times two matrix with a total of four values. Correct positive and negative predictions are referred to as true positives and negatives, whereas incorrect predictions are known as false positives and negatives, as illustrated in figure 2.7, below.

A number of useful performance measures can be derived from the confusion matrix. The most common of these are:

(27)

20 accuracy TP TN TP TN FP FN + = + + + precision TP TP FP = + recall TP TP FN = + precision recall F1- score 2 precision recall  =  +

The accuracy is determined as the sum of the correctly classified instances, both positive and negative, divided by the total population. Simply put it is the percentage of correct classifications. The following two measures are, precision, which is the accuracy of the positive predictions; and recall, also known as sensitivity or the true positive rate, which is the ratio of correctly classified positive instances. The last measure, the F1-score, is a combination of the precision and recall values, and is calculated as the harmonic mean of the two. By using the harmonic mean, a classifier will only achieve a high F1-score if both the precision and the recall scores are high, since the harmonic mean gives more weight to low values. When evaluating classifiers, the F1-score is often the chosen performance measure since it summarises the performance in a single value. However, the F1-score favours classifiers that have similar precision and recall scores, but depending on your problem, a high precision might be of greater importance that the recall, and vice versa. Which is the most appropriate performance measure therefore depends on the aim of the classifier.22

2.5.5 Principal component analysis

Principal component analysis (PCA) is a commonly used algorithm that performs dimensionality reduction. Machine learning data often contains many attributes and reduction of these to only two or three variables makes it possible to visualise the data. The idea is therefore to reduce the number of variables but at the same time preserve as much information as possible. The principal components are constructed from

(28)

21

3. Materials and method

This chapter firstly provides descriptions of the materials used, i.e. primarily the

microscopy data, and the tools for the machine learning. It then continues with a review of the method and working process of the second part of the project, where machine learning is applied to perform base-calling. Finally, it also touches upon methods of data visualisation.

3.1 The microscopy data

This section describes the data used in this particular thesis project. Other, in some cases similar methods, might be used in other sequencing platforms, but details such as how images are stored, and how features are selected may differ.

In the imaging step of the sequencing process the fluorophores attached to the

incorporated bases are first excited by lasers, and then a set of four images, one for every base, of the fluorescence emission are acquired using high-resolution confocal fluorescent microscopy. The images are stored as OME-TIFF files, and to be more exact every image file contains a set of images each recorded at a different depth of the sample so that they together display the full three-dimensional structure of the DNA-containing sample.

A two-dimensional cross-section of a typical microscopy image is shown in figure 3.1, below, where the DNA clusters, henceforth referred to as blobs, can be seen as small bright dots against the dark background.

(29)

22

The last part of the sequencing process is the image analysis which consists of cluster identification followed by base-calling. In this thesis project, already existing software is used to perform the cluster analysis, which yields the positions of the blobs and their size in form of the blob radius. The blob positions and radii are then used to extract a number of features from each blob and its immediate background. These features are then used as attributes, i.e. input data in the machine learning algorithms to perform the base-calling.

The feature extraction is performed by studying the light intensity of each pixel of a certain rectangular area of the blob and some surrounding background. A total of eight statistical measures are then calculated, four for the background, and four for the foreground, i.e. the blob. The process is illustrated in figure 3.2, below.

The extracted statistical measures are:

Background: max, mean, median, and mode Foreground: max, mean, pct90, and pct99

where max is the maximum intensity value, mean is the arithmetic mean value, mode is the most commonly occurring value, and pct90 and pct99 are the 90th_{and 99}th_percentiles,

respectively.

Each set of images used to perform the base-calling consists of four microscopy images, one for each base. Since eight statistical measures are extracted from each image, this means that there is a total of thirty-two statistical measures for every blob, that are used as attributes in the machine learning algorithms.

(30)

23

As described in chapter 2.3.1, one of the most fundamental working principles of second-generation sequencing is parallelisation, which means that the DNA is fragmented into short reads that are sequenced simultaneously. That means that there will be different pieces of DNA in every blob, and that the base-calling software should determine the correct base of each blob in the images. However, for the purpose of this project, samples with copies of the same, known DNA sequence in every blob are used. This means that the data to be used in the machine learning consists of a number of attributes for each blob, as well as its correct base.

3.2 Working process

A general machine learning problem can be approached by breaking it down into a number of smaller, separate parts. These main steps are:

1. Look at the big picture and frame the problem 2. Get the data

3. Discover and visualise the data to gain insights 4. Prepare the data for machine learning algorithms 5. Select a model and train it

6. Fine-tune your model 7. Present your solution

8. Launch, monitor, and maintain your system

This method of dividing an extensive problem into more manageable parts is applicable in most machine learning problems, and it is the method I have worked after when tackling the machine learning part of this thesis project.22

Framing the problem

(31)

24

The aim of this project is to use machine-learning algorithms to perform base-calling. This means that the task is to classify blobs into one of the four categories: A (adenine), C (cytosine), G (guanine), or T (thymine); hence the problem can be defined as a multiclass classification problem with four categories. As described in chapter 3.1, above, the input data consists of both attributes and labels, i.e. the correct bases for each blob; which means that the problem also falls under the category of supervised learning. The

conclusion is therefore that the task constitutes a multiclass classification problem to be approached by supervised machine-learning algorithms.

Visualising and preparing data for machine learning algorithms

Once you have a general understanding of your problem. the next step is to take a closer look at the data and analyse it. First it is important to note if the attributes and labels are numerical or categorical. Many machine learning algorithms require numerical data, so if some or all the features are categorical, they will probably need to be converted into corresponding numeric attributes. It is also of importance to check if there are missing values in the data, and if so to handle these datapoints in a suitable way. Another

important factor to consider is whether the machine learning algorithm require the data to be scaled, or if the results might improve using appropriately scaled data, even if this is not required. For example, if the attribute values are of different ranges, it is necessary to normalise the data to a common scale before inserting it into learning algorithms.

Once the data is appropriately investigated and read into a suitably organised data structure, the data set needs to be randomly divided into two parts: a train and a test set. In the early stages of a machine learning project, only the train set is used, both to train algorithms and for a first evaluation. Usually, a larger part of the data, typically around eighty percent, is chosen as training data, while the remaining data constitutes the test set.

In this project, I have worked with Python as the programming language to implement the machine learning algorithms. Additionally, to organise and store the data I have used NumPy, which is a Python library developed to handle and operate on large, multi-dimensional matrices. As described in the previous section, chapter 3.1, eight features for ever base are extracted from the microscopy images. These are stored in NumPy files which I read into NumPy arrays in Python, one image at a time, together with the corresponding known base. The arrays are then concatenated column-wise so that the final data structure is an array with the dimensions: number of blobs times thirty-three (the thirty-two attributes plus the label), as illustrated in table 3.1, below.

A: FG max … C: FG max … G: FG max … T: FG max … Label

(32)

25

Since the labels are categorical and numerical labels are required in the algorithms I have worked with, I have simply converted each base letter into a specific number, so that, for example, adenine is represented by the number two instead of the letter A.

Select, Train, and Fine-Tune Your Models

Once the data is appropriately investigated, read into a suitably organised data structure, and divided into a train and a test set, time has come to select a model to work with, and to train it. After successfully training the algorithm and making a first set of predications, the results can usually be improved somewhat by fine-tuning the model. This is done by altering the model’s input parameters, and sometimes also by adding parameters.

Since the aim of the machine learning part of this thesis project is to investigate the possibility of using machine learning to perform base-calling, and to assess its potential in real, commercial sequencing; I have chosen to work with a few different models, and then compare their base-calling performance. I have mainly focussed on four machine learning algorithms: the Stochastic Gradient Descent (SGD), a Support Vector Machine (SVM) with standard parameters, a Support Vector Machine with polynomial kernel, and

Random Forest. To implement the algorithms I have used Scikit-learn25_{which is a Python}

library especially made for machine learning purposes. (For theoretical background on the machine learning algorithms, see ref. 22 and 24, as well as the suggested readings I-V.)

I have chosen the above-mentioned models, firstly because they are all suitable for classification problems, and secondly because the algorithms are based on fundamentally different ideas. The Stochastic Gradient Descent is a binary classifier, whereas the Support Vector Machine and Random Forest are multi-class classifiers. Furthermore, Random Forest is an ensemble method, meaning that is uses multiple learning algorithms to obtain a better performance than any of the algorithms do alone. I also chose the Support Vector Machine with polynomial kernel specifically because this is the model used in the

commercially available machine learning-based base-callers for the widely used Illumina sequencing platform (see chapter 2.3.2.3).

(33)

26

Present your results and maintain your system

Finally, when you are satisfied with your algorithms and results, all that remains is to present them in a suitable way. For the purpose of this project I have chosen to present the accuracy scores from the base-calling with all four algorithms and for both datasets, as well as the training times. These are presented in chapter 4.

For a machine learning project where a complete and fully functional system is the end product the last step is then of course to launch and maintain that system. However, this lies outside the boundaries for this thesis project.

3.3 Data visualisation

Datasets in machine learning are usually both large and multi-dimensional, i.e. containing a high number of features. It is therefore almost impossible to get at good understanding of your data by simply looking at it. Instead different statistical methods and algorithms can be used to describe and visualise the data in ways that makes it easier to observe patterns and deviating datapoints etc.

To investigate and visualise the data used in this project I have used boxplots and principal component analysis. Firstly, boxplots of the raw, unscaled data show the range of values in the data; and secondly, boxplots of the data scaled to zero mean and unit variance are used to investigate the occurrence of deviating values. Principal component analysis is then applied to visualise patterns and clusters for the different bases. Plots of the above

(34)

27

4. Results

In this chapter, graphs and tables of the achieved results are displayed. First, section 4.1, provides graphs for visualising and analysing the data. Secondly, in section 4.2, tables of the results of the base-calling with the chosen algorithms are presented. Next, in section 4.3, the algorithm with four binary SGD classifiers is further explained, and finally, in section 4.4, graphs of the feature importance are shown.

4.1 Visualisation of data

In this section, graphs displaying the results of the data analyses with the two selected visualisation methods, boxplots, and principal component analysis graphs, are shown.

4.1.1 Boxplots

Figures 4.1 to 4.4 show boxplots of both datasets. Each dataset is visualised using two different plots. The first is of the raw, unscaled data and aims to show the range of values in the data. In the second the data is scaled to zero mean and unit variance, to better highlight deviating values.

The numbers on the attribute axis corresponds to the thirty-two attributes used in the machine learning algorithms. They are in order:

FG max, FG mean, FG pct90, FG pct99, BG max, BG mean, BG median and BG mode,

for all the four bases in order A, C, G, and T. FG and BG denotes foreground and

background values, and pct90 and 99 stands for the 90th_{and 99}th_{percentile, respectively.}

The vertical axis corresponds to light intensity values.

(35)

28 Figure 4.1: Box plot of dataset 1 with the raw, unscaled data

(36)

29 Figure 4.3: Box plot of dataset 2 with the raw, unscaled data

(37)

30

4.1.2 Principal component analysis (PCA)

Figures 4.5 to 4.8, below, shows graphs of the data reduced to two dimensions using principal component analysis. The aim of this analysis is to visualise the data to highlight patterns and clusters for the different bases. Two graphs for each dataset are plotted to clearly illustrate how the datapoints from different bases overlap.

Figure 4.5: PCA of dataset 1 with C and T plotted atop

(38)

31 Figure 4.7: PCA of dataset 2 with A and G plotted atop

(39)

32

4.2 Result from base-calling with different machine learning

algorithms

Below, three sets of two tables are presented, one for each of the two datasets. The first two tables present the results of the base-calling when the training and test data comes from the same dataset, and the remaining four tables present the same results but when training has been performed on one dataset, and the test data is taken from another. The results presented are the total percentage of correctly identified bases in the test data, as well as the percentage of correctly identified bases for each of the four bases A, C, G, and T. The time required to train each algorithm is also measured.

Table 4.1: Training and test data from dataset 1

(40)

33

Table 4.2: Training and test data from dataset 2

Correctly called bases (%) Correctly called A (%) Correctly called C (%) Correctly called G (%) Correctly called T (%) Time (seconds) SVM (default) 94.6 97.3 79.7 96.3 98.6 16 SVM poly kernel 96.0 98.6 87.1 96.1 98.9 >1 h 4 SGD Classifiers 90.3 98.5 93.3 79.2 95.8 2.6 Random Forest 95.8 98.7 85.3 95.9 99.1 4.0 Total A C G T No of blobs in training data 27 614 6 861 4 149 9 721 6 883 No of blobs in test data 6 904 1 732 1 072 2 375 1 725

Table 4.3: Training data from dataset 1, and test data from dataset 2, raw data (unscaled)

Correctly called bases (%) Correctly called A (%) Correctly called C (%) Correctly called G (%) Correctly called T (%) SVM (default) 37.9 0.1 50.3 15.6 99.6 Random Forest 31.1 0.08 18.2 9.7 99.8

Table 4.4: Training data from dataset 1, and test data from dataset 2, scaled data

(41)

34

Table 4.5: Training data from dataset 2, and test data from dataset 1, raw data (unscaled)

Table 4.6: Training data from dataset 2, and test data from dataset 1, scaled data

4.3 The SGD classifier algorithm

The SGD classifier algorithm is constructed from four SGD classifiers, one for each base. The results of all classifiers are then placed beside each other in a prediction matrix as described in chapter 3.2. The idea then that in the best-case scenario the base A would be classified as A for the A classifier and as not A for the remaining classifiers, and so on for all bases. However, running the algorithm some blobs are not classified as any of the bases, and some blobs are classified as two bases. To deal with this and classify each blob as one single base, values from the confusion matrix are used to select the most probable one.

(42)

35

after adjustment. It can then be seen that all above mentioned blobs have been given one single classifications, and that all blobs except number 61 are correctly classified.

Figure 4.9: Prediction matrix after initial classification

(43)

36 Figure 4.11: Feature importance with data from dataset 1

This SGD classification algorithm does not only yield predictions but can also distinguish between more or less confidently classified blobs. Blobs with one original classification are deemed to be confident classifications, while originally unclassified or doubly classified blobs are regarded as less certain. Below, is a table displaying numbers of the percentage of confidently classified blobs, and correctly called confident bases, unclassified bases, and doubly classified bases.

Dataset 1 Dataset 2 Confidently classified blobs

(%)

96.8 82.4

Correctly called confident bases (%)

99.4 97.8

Correctly called unclassified bases (%)

56.1 42.9

Correctly called multiply classified bases (%)

73.5 56.6

4.4 Feature importance

This section displays graphs of the feature importance of the attributes for the Random Forest algorithm, where the thirty-two attributes are sorted from most to least important.

(44)

37 Figure ?: Feature Importance with data from dataset 12

(45)

38

5. Discussion

This chapter provides a discussion of the results that have been presented, and the conclusions that can be drawn from them. It also relates back to the purpose and goals of the project and the research questions posed in the introductory chapter are evaluated. Finally, it also provides a look ahead, and suggestions of how the project could be continued and developed are discussed.

5.1 Evaluation of part 1: The literature review of DNA sequencing

The first part of the project is strictly theoretical since it consists of a literature review of DNA sequencing. This part is presented in chapter 2 of this report, and the purpose of it was for me to learn about DNA sequencing and give an overall understanding of all the steps involved in next-generation sequencing techniques. At the end of the project I can now say that it has done so and that this understanding of the whole process was very valuable for me to better understand the data I was working with and how it had been generated, considering that I worked with the data analysis which is the final step of the sequencing process. Aside from this it was also very interesting to learn more about the historical perspective, including the discovery of the DNA molecule, the realisation that DNA was actually the carrier of genetic material, and the subsequent development of increasingly better sequencing techniques.

5.2 Evaluation of part 2: The machine learning and base-calling

I will now proceed to discuss and evaluate the second part of the project which included the application of machine learning algorithms to perform base-calling. The aim of this part was first to investigate whether or not it was possible to perform base-calling with machine learning using the available data. Secondly, given that it was possible, the aim was to further investigate and assess the potential of machine learning in commercial next-generation sequencing platforms. This was investigated by performing base-calling with four different machine learning algorithms and evaluating the accuracy of the base-calls, as well as the training times.

Training and testing on the same, or different data

(46)

39

first the results when training and testing on the same dataset, these numbers evidently show that it is possible to use machine learning to perform base-calling, and more so the accuracy scores of up to almost 99 % clearly indicate a great potential to perform accurate base-calling using machine learning. However, the drastic decrease of accuracy when training and testing on different datasets show that the data is not stable between runs, but that conditions affecting the image acquisition might change to a degree that makes the base-calling unreliable. This is problematic since the idea is to first train the algorithm using known data and then to be able to do several runs to sequence unknown data. Considering, however, that the training times of a few seconds generally are short in comparison with the sequencing times, an idea to overcome this problem is to always sequence a known reference DNA segment before the target DNA, and to use the reference as training data.

However, when making these conclusions it is also important to note that I have only had access to two datasets and that more data is required to draw more general and better supported conclusions. The same evidently applies for all other conclusions and suggestions discussed later on in this chapter.

Comparison of algorithms

Another question to evaluate is whether any of the machine learning algorithms is better than the others. Looking at tables 4.1 and 4.2, we see that there is no big difference between the algorithms but that they all perform quite similarly. The support vector machine with polynomial kernel performs marginally better in both cases, but this algorithm does, however, suffer from the problem of becoming very slow with increasing amount of data. Therefore, considering both accuracy and time I would say that the random forest algorithm might be the best choice. However, considering training and testing on different data, see result tables 4.3 to 4.6, the support vector machine generally performs better than the random forest. However, all the results are unfortunately so bad that I don’t consider them relevant to evaluate the algorithms. An idea to improve the performance was to feed the algorithms scaled data, since it can be seen in the boxplots, figures 4.1 and 4.3, that the range of values is much larger for the first dataset than for the second. However, this did not solve the problem but merely improved the results

somewhat. Unfortunately, it was not nearly enough for the base-calling to be as accurate as is required.

What is also worth to notice is that the SGD algorithm yield more information in terms of certain and less certain classifications, than the other algorithms. This might be

(47)

40

that the precision is high for quality data but lower if the data quality decreases. The question of the best algorithm therefore depends on the application, and an algorithm like the four SGD classifiers might be desirable in some cases, whereas another of the

algorithms might be the best choice in another case.

Relating results to boxplots

It is also interesting to compare the results of the two datasets when training and testing on the same data, and to look at them together with the boxplots. The boxplots where the data is scaled to zero mean and unit variance are especially interesting to study since they highlight deviating values in the data as described in chapter 3.3. Better results are

achieved with the first dataset than the second (see tables 4.1 and 4.2) and when we look at the boxplots (figures 4.2 and 4.4) we can see that the diagram for the first dataset looks quite even over all bases, while for the second dataset the background variables for base C have a much larger spread that the other bases. Then going back to study the result table we see that for the second dataset the results for bases A, G and T are similar to those of dataset one, while the results for base C are much lower and are responsible for lowering the average results for dataset two. The lower results of dataset two can thus be explained from the data quality of the C base. A way to ensure a high base-calling accuracy would consequently be to apply some sort of data quality control. Another thing worth noting in regard to this is that the sequencing equipment is currently being developed and

improved. With this in mind I would therefore like to argue that it is reasonable to believe that accuracy scores close to those of dataset one can be achieved in future sequencing experiments.

The amount of data

Application of machine learning techniques to perform base-calling in next-generation DNA sequencing

Application of machine learning

techniques to perform base-calling

in next-generation DNA sequencing

ELIN TEGFALK

Application of machine learning

techniques to perform base-calling in

next-generation DNA sequencing

Elin Tegfalk

Abstract

Sammanfattning

Acknowledgements

Table of contents

1. Introduction

1.1 The thesis project

1.2 Introduction to DNA sequencing

1.3 Purpose and goals

1.4 Overview of the report

2. Background

2.1 The DNA molecule

2.1.1 The structure of DNA

2.1.2 Cell function and genes

2.1.3 RNA

2.2 DNA and RNA sequencing

2.2.1 The history of sequencing and development of first-generation

sequencing

2.2.2 Human genome sequencing

2.3 Next-generation sequencing

2.3.1 General introduction to next-generation sequencing

2.3.2 Second-generation sequencing

2.3.3 Third-generation sequencing

2.3.4 RNA sequencing

2.4 Confocal fluorescence microscopy

2.5 Introduction to machine learning

2.5.1 General structure of machine learning data

2.5.2 Types of machine learning

2.5.3 List of machine learning methods

2.5.4 Performance measures of binary classifiers

2.5.5 Principal component analysis

3. Materials and method

3.1 The microscopy data

3.2 Working process

3.3 Data visualisation

4. Results

4.1 Visualisation of data

4.1.1 Boxplots

4.1.2 Principal component analysis (PCA)

4.2 Result from base-calling with different machine learning

algorithms

4.3 The SGD classifier algorithm

4.4 Feature importance

5. Discussion

5.1 Evaluation of part 1: The literature review of DNA sequencing

5.2 Evaluation of part 2: The machine learning and base-calling