
DEGREE PROJECT IN COMPUTER SCIENCE AND ENGINEERING, SECOND CYCLE, 30 CREDITS
STOCKHOLM, SWEDEN 2017

Clustering metagenome contigs using coverage with CONCOCT

BRYNJAR SMÁRI BJARNASON

KTH ROYAL INSTITUTE OF TECHNOLOGY

SCHOOL OF COMPUTER SCIENCE AND COMMUNICATION

Clustering metagenome contigs using coverage with CONCOCT

BRYNJAR SMÁRI BJARNASON

Master's program in Computer Science
Date: May 31, 2017

Email: bsbj@kth.se

Supervisor: Anders Andersson
Examiner: Jens Lagergren

Swedish title: Klustring av metagenom-kontiger baserat på abundans-profiler med CONCOCT

School of Computer Science and Communication


Abstract

Metagenomics allows studying the genetic potential of microorganisms without prior cultivation. Since metagenome assembly results in fragmented genomes, a key challenge is to cluster the genome fragments (contigs) into more or less complete genomes.

The goal of this project was to investigate how well CONCOCT bins assembled contigs into taxonomically relevant clusters using the abundance profiles of the contigs over multiple samples. This was done by studying the effects of different parameter settings for CONCOCT on the clustering results when clustering metagenome contigs from in silico model communities generated by mixing data from isolate genomes. These parameters control how the model that CONCOCT trains is tuned and how the model then fits contigs to their clusters.

Each parameter was tested in isolation while the others were kept at their default values. For each data set used, the number of clusters was kept constant at the known number of species and strains in that data set.

The resulting configuration was to use a tied covariance model, principal components explaining 90% of the variance, and to filter out contigs shorter than 3000 bp. The results also suggested that all available samples should be used for the abundance profiles.

Using these parameters, CONCOCT was then executed so that it estimated the number of clusters automatically. This gave poor results, which led to the conclusion that the process for selecting the number of clusters implemented in CONCOCT, the Bayesian Information Criterion, was not good enough. That led to testing a similar mathematical model, the Dirichlet Process Gaussian Mixture Model, which uses a different algorithm to estimate the number of clusters. This new model gave much better results, and CONCOCT has adopted a similar model in later versions.


Sammanfattning

Metagenomics makes it possible to analyse the genomes of microbial communities without first having to cultivate the microorganisms. The method entails reading short DNA fragments, which are then pieced together into longer genome fragments (contigs). By grouping contigs that originate from the same organism, more or less complete genomes can then be reconstructed, but this is a difficult bioinformatic challenge.

The goal of this project was to evaluate the precision with which the software CONCOCT, which we recently developed, groups contigs originating from the same organism based on information about the contigs' sequence composition and abundance profile over different samples. We tested how different parameters affected the clustering of contigs in artificial metagenome datasets of varying complexity, which we created in silico by mixing data from previously sequenced genomes. The parameters tested concerned the input data as well as the statistical model that CONCOCT uses to perform the clustering. The parameters were varied one at a time while the other parameters were kept constant. The number of clusters was also kept constant and corresponded to the number of different organisms in the communities. The best results were obtained when we used a tied covariance model, used principal components explaining 90% of the variance, and filtered out contigs shorter than 3000 base pairs. We also obtained the best results when we used all available samples.

We then used these parameter settings and let CONCOCT itself determine a suitable number of clusters in the datasets with the "Bayesian Information Criterion" method that was implemented in CONCOCT at the time. This gave unsatisfactory results, generally with too few and too large clusters. We therefore tested an alternative method, the "Dirichlet Process Gaussian Mixture Model", to estimate the number of clusters. This method gave considerably better results, and in later versions of CONCOCT a similar method has been implemented.


Contents

Contents iii

1 Introduction 1

2 Theory 3
2.1 Metagenomics 3
2.1.1 Composition 5
2.1.2 Coverage 5
2.2 Machine Learning 6
2.2.1 Supervised learning 7
2.2.2 Unsupervised 8
2.2.3 Dimensionality reduction 11
2.2.4 Metrics for evaluating clustering results 11

3 Methods 13
3.1 Datasets 13
3.2 Software 13
3.2.1 Python 13
3.2.2 Snakemake, GitHub & BitBucket 14
3.2.3 CONCOCT 14

4 Results 16
4.1 Setup 16
4.2 Results per parameter 17
4.2.1 Covariance 17
4.2.2 Principal Component Analysis 18
4.2.3 Threshold on contig length 18
4.3 Number of samples 21
4.4 CONCOCT with selected parameters 23
4.4.1 Coverage only 23
4.4.2 Coverage and composition 24

5 Discussion 27
5.1 CONCOCT parameters 27
5.1.1 Covariance 27
5.1.2 Principal Components Analysis 27
5.1.3 Threshold on contig length 27
5.2 Number of samples 28
5.3 CONCOCT with selected parameters 28
5.3.1 Coverage only 28
5.3.2 Coverage and composition 28

6 Conclusion 29

Bibliography 30


Chapter 1

Introduction

We know that the blueprint of every known living organism is its genetic code. DNA sequencing has given us a tool to read these blueprints, and computer programs based on biological models and theory have been developed to help us derive meaning from what we have read so far. It is, however, not enough to read and understand the genetic code of an organism to get the full picture of its life and potential. Every living being thrives in its environment, which it shares with many other organisms, big and small. We know that organisms interact with each other and the environment; some compete for resources while others work together for their mutual benefit. They react to the current state of the environment and the surrounding community of organisms to give themselves the best possibility of surviving and reproducing, to pass on their genetic material to future generations.

Mostly hidden to us is the world of micro-organisms: microbes. These microbes are usually so small that a microscope is required to see them. They live in every environment of the biosphere: soil, water, air, the earth's crust and in symbiotic relationships with other organisms, to name a few. Studying these microbes can be difficult since most of them are extremely hard to isolate for cultivation, and many of them do not grow in a laboratory setting. A method that circumvents the problem of obtaining pure cultures of microbes, and which is currently quickly gaining in popularity, is metagenomics. Using metagenomics we can study the DNA material that is gathered from environmental samples.

Metagenomics, using today's high throughput sequencing (HTS) technology in combination with assembly programs, allows us to reconstruct parts of the genomes of microbes found in such environmental samples. These parts are called contigs, and each contig represents a sequence from a microbial genome. Depending on the complexity of the community, which is affected by the number of species and their relative abundance in a sample, these contigs can represent quite different things. In a simple sample with few species in unequal abundances it can be possible to assemble the contigs into almost whole genomes, while complex samples with multiple species of similar abundance can result in shorter contigs and therefore more fragmented genomes. Other complications when creating contigs arise from the facts that different microbes share the same or very similar DNA sequences in parts of their genomes, and that one microbe can have a DNA sequence copied within its own genome. These intra- and inter-genomic repeats can make the task of creating pure contigs, each created from a single microbe, extremely hard.

At this stage we can ask ourselves "How many microbes are in the samples?" and "To which microbe does this particular contig belong?". Answering these questions is hard. The approach often used for automatically grouping contigs is a computational technique called clustering, or binning.


These clustering techniques use similarities of contigs' DNA composition to attempt to cluster together contigs from the same organism. Another source of information, which has not been used as much as the composition, is the abundance profile, or coverage, of contigs over multiple samples from the environment. Most of the organisms found in a sample should also be found in subsequent samples, and therefore their contigs should as well. This information gives us the variability of the abundance of a contig over different samples. An example might be that contigs from an organism that has a growth optimum at a certain temperature should be more abundant in a sample taken at that temperature than in samples taken at lower or higher temperatures.

This project evaluates how valuable an information source the coverage, from multiple samples of the same environment, is for clustering contigs. This is done by using the program CONCOCT (Clustering cONtigs with COverage and ComposiTion), which was developed based on work done in this project in collaboration with another master's project that looked at composition as the information source. CONCOCT allows both coverage and composition information to be used, but in this project only the coverage part is analysed.


Chapter 2

Theory

In this chapter we give a brief history of metagenomics and explain what problem this project addresses. We look at the concepts of the composition of a contig and the coverage of a contig over multiple samples. We also look at the theory behind the machine learning techniques, the mathematical models and the evaluations used.

2.1 Metagenomics

Since the 1970s it has been possible to read the genetic code of organisms by DNA sequencing using the Sanger sequencing method[1]. With the Sanger method the whole genomes of multiple organisms have been sequenced; the first one to be fully sequenced was bacteriophage ϕX174 in 1977. This opened up a whole new way to study micro-organisms.

Early work in microbial studies focused on microbes that could be isolated from an environmental sample and grown into a culture of a single organism. Soon other experiments started suggesting that only a fraction of the microbes found in the environment were being successfully cultured. Microscopic counts of microbes in a sample suggested that several orders of magnitude more microbes were present in the sample than were actually cultured. This was later called the "Great Plate Count Anomaly"[2]. We now know that most microbes have not been or cannot be successfully cultured in a laboratory setting. This is due to many factors, such as anaerobic microbes being exposed to air, symbiotic relationships between species of microbes, and many others. To start estimating the true number of species found in these samples, still without being able to sequence whole genomes, researchers started looking at a known gene that is found in all organisms, the small subunit ribosomal RNA (rRNA) gene (16S rRNA in prokaryotes and 18S rRNA in eukaryotes). It is known to be evolutionarily conserved and rarely laterally transferred, and serves as a reliable taxonomic marker. By using bacterial artificial chromosomes (BAC) to amplify DNA from samples and polymerase chain reaction (PCR) to select 16S sequences from the BACs, without isolating and culturing microbes, it was shown that there were vast numbers of microbes that had not previously been detected because they could not be cultivated[3, 4]. These methods provided ways to sequence those parts of the community that were already known, such as the 16S rRNA, without culturing. This opened up possibilities for detecting genes and predicting functions found in these communities.

This resulted in the birth of metagenomics[5], which has been defined as "the application of modern genomics techniques to the study of communities of microbial organisms directly in their natural environments, bypassing the need for isolation and lab cultivation of individual species"[6]. Since only a fraction of microbes can be cultured, new approaches were needed. The technique that opened up the world of metagenomics is called shotgun sequencing, or environmental shotgun sequencing (ESS) in the context of metagenomics[7].

Figure 2.1: Whole genome shotgun sequencing and assembly.

The process of ESS is to first gather samples from an environment, such as soil, sea or the human gut, which are then filtered to recover microbes of a certain size. These microbes are then prepared for DNA extraction, and their DNA is sheared into many short fragments. These fragments are sequenced into reads[8] and then assembled back into consensus sequences called contigs[9], see Figure 2.1.

A nice mental model of shotgun sequencing and assembly in metagenomics is as follows. Let's say that we have a sample which contains microbes from two different species. If we think of the DNA of a single microbe as a book, then the microbes from the same species have almost identical books, for example a first and second edition where some spelling errors were fixed in the second edition, while comparing the books between species shows more difference, for example the Oxford English Dictionary and the New Oxford American Dictionary. After collecting all the books, they are shredded into pieces of 50-200 words at random and mixed together. From the mix, we then compare two fragments at a time and see if they have overlapping text. From the overlaps we can start rebuilding segments of the books, with the goal of creating as long continuous texts as possible. We call these continuous segments contigs.

The goal of this project is to automatically group contigs together, such that contigs from microbes of the same species are in the same group while contigs from other species are not; that is, every species should have its own group, and thereby it should be possible to recreate that species' whole genome. By doing this, and by knowing which microbes are in these samples and in what quantities, it becomes possible to compare, for example, a healthy patient to a diseased patient to figure out if and how microbes affect our health.

1 Drawn using https://www.draw.io/, stored on https://drive.google.com/a/binnisb.com/file/d/0B7Ytf_dJxL6-NGhrelgzY2RMamM/view?usp=sharing


AA AC AG AT CA CC CG CT GA GC GG GT TA TC TG TT
 0  0  2  1  0  0  0  0  2  0  0  0  0  0  1  0

Table 2.1: k-mer composition vector of the sequence AGAGATG with k = 2.


2.1.1 Composition

Each contig that is generated by ESS and assembly has a sequence composition profile, called a k-mer vector. A k-mer vector is generated by counting how often each word of length k appears in the contig. For example, if k = 2 and the contig is AGAGATG, then the k-mer vector becomes as in Table 2.1.
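As a concrete illustration of the counting behind Table 2.1, the sketch below builds a k-mer vector for a sequence in plain Python. It only shows the straightforward counting scheme from the table; CONCOCT's own implementation may differ in details such as how reverse complements or ambiguous bases are handled, so treat it as a minimal sketch rather than the program's actual code.

```python
from itertools import product

def kmer_vector(sequence, k=2):
    """Count overlapping k-mers over the alphabet A, C, G, T."""
    kmers = ["".join(p) for p in product("ACGT", repeat=k)]
    counts = dict.fromkeys(kmers, 0)
    for i in range(len(sequence) - k + 1):
        word = sequence[i:i + k]
        if word in counts:  # skip words containing ambiguous bases such as N
            counts[word] += 1
    return [counts[km] for km in kmers]

# Reproduces the vector in Table 2.1:
print(kmer_vector("AGAGATG", k=2))
# [0, 0, 2, 1, 0, 0, 0, 0, 2, 0, 0, 0, 0, 0, 1, 0]
```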

Using this information we can estimate the relatedness of two contigs, since it has been shown that k-mer profiles from organisms of one species are more similar to each other than to k-mer profiles of other species[10]. The composition is only used in this project when running the full CONCOCT with the default parameters for composition, without any exploration of the best parameters for composition.

2.1.2 Coverage

Figure 2.2: Coverage of different microbial species across samples.

In metagenomics, multiple samples of the metagenome from the same or a similar environment are gathered for analysis, for example sea samples taken from the same location every month for a year. After performing ESS on each sample, all the samples are co-assembled into contigs; that is, reads from all samples are assembled together, since all samples should contain roughly the same organisms and therefore reads from the same genomes. By doing this, contigs are created over all the samples, which allows reads from different samples to be mapped to the same contigs. By counting how many reads from each sample map to each contig, a coverage, or abundance, profile over the samples can be generated per contig, see Figure 2.2. CONCOCT uses coverage to cluster the contigs, and in this project we explore how different parameter settings affect the clustering.
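To make the idea of an abundance profile concrete, here is a small sketch of how a coverage matrix can be assembled once the number of reads mapping to each contig in each sample is known. The names and counts are made up for illustration; they are not the thesis's data, and the per-sample read counting itself would come from a read mapper.

```python
import numpy as np

# Hypothetical read counts: reads_mapped[sample][contig] = number of reads from
# that sample mapping to that contig (produced by some read-mapping step).
reads_mapped = {
    "sample_1": {"contig_1": 120, "contig_2": 30, "contig_3": 5},
    "sample_2": {"contig_1": 80,  "contig_2": 45, "contig_3": 12},
    "sample_3": {"contig_1": 95,  "contig_2": 10, "contig_3": 40},
}
contigs = ["contig_1", "contig_2", "contig_3"]
samples = sorted(reads_mapped)

# N x M coverage matrix: one abundance profile (row) per contig, one column per sample.
coverage = np.array(
    [[reads_mapped[s].get(c, 0) for s in samples] for c in contigs],
    dtype=float,
)
print(coverage)  # each row is the kind of coverage profile CONCOCT clusters on
```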

2.2 Machine Learning

Machine learning is one of the cornerstones of today's technical world. From effective web search, speech and handwriting recognition and self-driving cars to applications in the -omics fields and science in general, it has proven an extremely important tool. You probably use a machine learning system (learner) every day, maybe without even realizing it. Spotify, Netflix and Amazon recommend products that you might like by applying machine learning to your historical behavior, and face detection on smart phones helps you take better pictures.

A general description of a learner is that it should learn from a collection of input data, called training samples, and then apply what it has learnt to predict some property of unseen samples of the same type. This principle of generalizing from experience is the core of machine learning. A definition of the machine learning process put forth by Tom M. Mitchell in his book Machine Learning[11, p. 2] captures this well:

“A computer program is said to learn from experience E with respect to some class of tasks T and performance measure P, if its performance at tasks in T, as measured by P, improves with experience E.”

A simple example of how this definition could be used is to learn to recognize the species of an animal.

• Task T: Recognize animals’ species

• Performance measure P: Percent of individual animals correctly classified to its species

• Training experience E: Collection of animals which have already been classified as certain species.

After learning this task, a learner can then apply its experience by generalizing to new ani- mals and predict what species they belong to.

The training samples, the collection of animals, were described a bit vaguely. To be useful to a learner, the information about the animals needs to be standardized to some quantitative measure, for example: number of legs, number of eyes, has exoskeleton, etc. These are called features of the animals. Each sample in the input data is then said to have features that the learner uses to learn. It is often handy to think of the input data as a matrix representing these features. If we look at a simple case for the animals, the training samples and the true species classification can be something like Table 2.2. The training sample animal1 has the value 4 for the feature #legs, and so on. The species is not a feature, but the true classification of the samples. A learner that learns these features and receives a new animal, animal4 = [4, 2, True], should predict that it is a Turtle.

To be able to recognize more animals, more features should be added. It can happen that the input data has a colossal number of features, referred to as high-dimensional input data. Such data can be hard to learn from.

2 Image acquired from the author, Michael Imelfort, published in a talk at ISME 2015


#legs #eyes exoskeleton species

animal1 4 2 False Zebra

animal2 6 5 False Bee

animal3 4 2 True Turtle

Table 2.2: Features of animals and their label as species.

The main problems with such high-dimensional data are the limited information each individual feature might carry, and that the time it takes some learners to learn depends on the number of features. One way to tackle these problems is to perform dimensionality reduction on the input data, transforming it into a lower-dimensional space that still retains as much information as possible.

Learners are divided into a few categories based on the type of input they can work with or the desired output of the system. Those that are used in this work are:

• Classification (Supervised learning): The input training samples have a known desired output. After learning, the system can predict the output for unseen input samples.

• Clustering (Unsupervised learning): The input samples have unknown desired output. Usually these systems do not predict based on new samples; rather they attempt to discover structure in the data.

In the animal example above, there is a known relation between the features of the training samples and the classification to a specific species. We want to predict which species future animal samples belong to, which means that we can apply supervised learning to that problem.

CONCOCT uses unsupervised clustering to bin contigs together, using Gaussian Mixture Models for the contigs. It performs dimensionality reduction on the input variables and uses a supervised learning mechanism for estimating the quality of the clustering.

2.2.1 Supervised learning

Supervised learning is usually divided into two categories, classification and regression. In both cases the training data is a collection of examples which are input data points paired with their correct output values. After learning from the training data, the learner can generalize to assign an output to new, previously unseen data points.

In classification this means assigning a class to the new data point, as we can see in Figure 2.3. In this case, after learning from the training data, a new point in the green area would be classified to have the same class as the green points.

In regression the task can be thought of as estimating a function that best describes the observed data. In Figure 2.4 the mapping from the input data (x-axis) to the output (y-axis) was a function of the form f(x) = sin(2πx) + noise. If we now assume that the true function is unknown, we can try a cubic regression, f(x) = ax^3 + bx^2 + cx + d, to estimate a function that could have generated the data. As we can see in (B), the cubic function gets close to the true function. Future input (an x value) can then be mapped to an output by applying the cubic function to it.
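A quick way to reproduce the cubic-regression idea of Figure 2.4 is NumPy's least-squares polynomial fit. This is only an illustrative sketch of the figure's setup (noisy sine, degree-3 fit), not code from the thesis.

```python
import numpy as np

rng = np.random.default_rng(0)
x = np.linspace(0.0, 1.0, 30)
y = np.sin(2 * np.pi * x) + rng.normal(scale=0.2, size=x.size)  # noisy sine, as in (A)

# Least-squares fit of f(x) = a*x^3 + b*x^2 + c*x + d.
coeffs = np.polyfit(x, y, deg=3)
f = np.poly1d(coeffs)

# A new input can now be mapped to an output by applying the fitted cubic.
print(f(0.25))  # should be reasonably close to sin(pi/2) = 1
```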


Figure 2.3: (A) Before learning (B) Background colour indicates how a new data point would be classified (coloured).

Figure 2.4: Given the data points in (A) we fit a polynomial of degree 3 (cubic polynomial) to the data (B) which gives us an estimate of the true underlying function (sine with noise).

2.2.2 Unsupervised

The difference between supervised and unsupervised learning is the labeling of the data. So far the training data has had the correct output values, such as the correct species or correct color. In unsupervised learning these values are not known for the training data. The features are known but not the correct output values. Unsupervised learning therefore tries to find structure in unlabeled data.

One common unsupervised task is clustering, that is, grouping together data points so that the ones in the same group (cluster) are in some way more similar to each other than to data points in other clusters. A common procedure for this is to apply expectation-maximization (EM) to cluster data points that are described by statistical models. In Figure 2.5 we can see a simple EM algorithm cluster the data points into three clusters without knowledge of the "true" labels of the data.


Figure 2.5: (A) Data points, (B) Three clusters, (C) Coloured as in the supervised classification.

EM

The expectation-maximization method is used for finding maximum likelihood solutions for parameters in statistical models that depend on latent variables. It iterates between the expectation step and the maximization step, where the E step uses the current estimates of the parameters of the model to create a function for the expectation of the log likelihood, while the M step computes the parameters that maximize the expected log likelihood function from the E step. The distribution of the latent variables of the model is then determined by these parameter estimates. The general EM algorithm is as follows[12]:

Given a joint distribution \(p(X, Z \mid \theta)\) over observed variables \(X\) and latent variables \(Z\), governed by parameters \(\theta\), the goal is to maximize the likelihood function \(p(X \mid \theta)\) with respect to \(\theta\).

1. Choose an initial setting for the parameters \(\theta^{old}\).

2. E step: Evaluate \(p(Z \mid X, \theta^{old})\).

3. M step: Evaluate \(\theta^{new}\) given by
\[ \theta^{new} = \arg\max_{\theta} Q(\theta, \theta^{old}), \quad \text{where} \quad Q(\theta, \theta^{old}) = \sum_{Z} p(Z \mid X, \theta^{old}) \ln p(X, Z \mid \theta). \]

4. Check for convergence of either the log likelihood or the parameter values. If the convergence criterion is not satisfied, then let \(\theta^{old} \leftarrow \theta^{new}\) and return to step 2.
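To make the E and M steps concrete, here is a minimal NumPy sketch of EM for a one-dimensional mixture of two Gaussians, where Z is the unknown component of each point and θ = (π, µ, σ). It is a toy illustration of the recipe above, not CONCOCT's implementation.

```python
import numpy as np

rng = np.random.default_rng(1)
# Observed data X: a mixture of two 1-D Gaussians (latent variable Z = component).
x = np.concatenate([rng.normal(-2.0, 1.0, 200), rng.normal(3.0, 0.5, 100)])

# Initial guesses for theta = (weights pi, means mu, standard deviations sigma).
pi = np.array([0.5, 0.5])
mu = np.array([-1.0, 1.0])
sigma = np.array([1.0, 1.0])

def normal_pdf(x, mu, sigma):
    return np.exp(-0.5 * ((x - mu) / sigma) ** 2) / (sigma * np.sqrt(2 * np.pi))

prev_ll = -np.inf
for _ in range(200):
    # E step: responsibilities p(Z | X, theta_old) for each point and component.
    weighted = pi * normal_pdf(x[:, None], mu, sigma)        # shape (N, 2)
    resp = weighted / weighted.sum(axis=1, keepdims=True)

    # M step: parameters that maximise the expected log likelihood.
    nk = resp.sum(axis=0)
    pi = nk / x.size
    mu = (resp * x[:, None]).sum(axis=0) / nk
    sigma = np.sqrt((resp * (x[:, None] - mu) ** 2).sum(axis=0) / nk)

    # Convergence check on the log likelihood.
    ll = np.log(weighted.sum(axis=1)).sum()
    if ll - prev_ll < 1e-6:
        break
    prev_ll = ll

print(pi, mu, sigma)   # should approach the true mixture parameters
```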

Gaussian Mixture Model with EM

One of the statistical models that the EM algorithm can cluster with is the Gaussian mixture model (GMM). Mixture models (MM) are used to find subsets in observed data without any direct identification of which subset an individual observation belongs to. Using an MM with EM, however, actually attributes observations to the postulated subsets, and by doing so creates clusters of observations.

A Gaussian mixture distribution can be written as a linear superposition of Gaussians in the form
\[ p(x) = \sum_{k=1}^{K} \pi_k \, \mathcal{N}(x \mid \mu_k, \Sigma_k). \]

One drawback of this method is that it uses all suggested clusters while learning, which results in the need to run it multiple times with different numbers of clusters to get some estimate of the true number of clusters, if they are not known beforehand. Here that is done with the Bayesian information criterion (BIC), which is a model selection criterion. It favours models with a high likelihood while punishing the use of more free parameters; the model with the lowest BIC is preferred:
\[ \mathrm{BIC} = -2 \ln L + k \ln n, \]
where \(L\) is the maximized likelihood, \(k\) is the number of free parameters and \(n\) is the number of data points.

It is used to select the best performing GMM when estimating the number of clusters.

For a GMM, it is possible to use different types of covariance models Σ, for example full, tied, diagonal and spherical, see Figure 2.6. These types dictate the geometric features of the clusters, that is their orientation, volume and shape. Spherical restricts the geometry the most (fewest free parameters) while full allows the most flexible geometry[13].

Figure 2.6: Shapes of different covariance types.
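The following sketch shows how the combination of covariance type, EM fitting and BIC-based model selection can be expressed with scikit-learn. Note that the thesis used CONCOCT 0.1 with the scikit-learn GMM classes available at the time; this example uses the current GaussianMixture API and made-up toy data, so it illustrates the idea rather than reproducing the thesis setup.

```python
import numpy as np
from sklearn.mixture import GaussianMixture

rng = np.random.default_rng(2)
# Toy 2-D data from three blobs, standing in for PCA-reduced contig profiles.
X = np.vstack([rng.normal(loc, 0.4, size=(150, 2))
               for loc in ([0.0, 0.0], [3.0, 1.0], [1.0, 4.0])])

# Fit GMMs for each covariance type and several candidate cluster numbers,
# keeping the model with the lowest BIC.
best = None
for cov_type in ("spherical", "diag", "tied", "full"):
    for k in range(2, 7):
        gmm = GaussianMixture(n_components=k, covariance_type=cov_type,
                              n_init=5, random_state=0).fit(X)
        bic = gmm.bic(X)
        if best is None or bic < best[0]:
            best = (bic, cov_type, k, gmm)

bic, cov_type, k, gmm = best
labels = gmm.predict(X)        # cluster assignment for every data point
print(cov_type, k)
```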

Dirichlet Process Gaussian Mixture Model (DPGMM)

The DPGMM is a method that uses a variational inference algorithm for estimation of the Gaussian mixture[15]. It uses a stick-breaking process[14] to estimate the number of components in the model. Using the DPGMM we get around running the GMM multiple times with different numbers of components to find the optimal number using the BIC score.
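The DPGMM class used in the thesis has since been removed from scikit-learn; its present-day counterpart is BayesianGaussianMixture with a Dirichlet process prior. The sketch below shows the general usage on the same kind of toy data as in the previous example: an upper bound on the number of components plus a concentration parameter α, with unneeded components effectively switched off. It is an illustration of the idea, not the thesis's exact configuration.

```python
import numpy as np
from sklearn.mixture import BayesianGaussianMixture

rng = np.random.default_rng(2)
X = np.vstack([rng.normal(loc, 0.4, size=(150, 2))
               for loc in ([0.0, 0.0], [3.0, 1.0], [1.0, 4.0])])

dpgmm = BayesianGaussianMixture(
    n_components=30,                                   # upper bound, not the final answer
    weight_concentration_prior_type="dirichlet_process",
    weight_concentration_prior=1.0,                    # the concentration parameter alpha
    covariance_type="tied",
    max_iter=500,
    random_state=0,
).fit(X)

labels = dpgmm.predict(X)
print(len(set(labels)))  # effective number of clusters actually used
```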

3 Covariance types, generated and saved from http://scikit-learn.org/stable/auto_examples/mixture/plot_gmm_covariances.html


2.2.3 Dimensionality reduction

Figure 2.7: This shows the two eigenvectors of the covariance matrix.

When looking at data we see two things: the samples and the features each sample has. When applying machine learning to data it is good to have many samples, but it is also extremely important to have enough good features to be able to learn the properties of the data we are interested in. This is, however, not trivial. One could assume that the more features you have, the easier it is to learn. This has proven not to be the case. It is common to hit a point where new features do not add information that helps learning from the data but rather add noise that hinders it; the data becomes sparse, and sometimes this also impacts the training time of the learners. This is known as the curse of dimensionality[16].

To battle this curse, a common procedure used in machine learning is principal component analysis (PCA). PCA maps data linearly from the high-dimensional space of all features to a lower-dimensional space where the variance of the data is maximised. This is done by calculating the eigenvectors[12] of the covariance matrix of the original data and then selecting the vectors corresponding to the largest eigenvalues, see Figure 2.7. These are then used to transform the data into the lower dimension while retaining as much of the variance as possible.

The result of doing this is that the new data has fewer features, which usually means a faster learner.
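In scikit-learn, the "explain at least x% of the variance" rule can be expressed directly by passing a fraction to PCA. The data below is synthetic and only meant to show the call; it is not how CONCOCT prepares its input.

```python
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(3)
# 500 samples with 40 correlated features (a stand-in for high-dimensional profiles).
latent = rng.normal(size=(500, 5))
X_high = latent @ rng.normal(size=(5, 40)) + 0.1 * rng.normal(size=(500, 40))

# A float in (0, 1) keeps the smallest number of components whose cumulative
# explained variance reaches that fraction (here 90%).
pca = PCA(n_components=0.90)
X_low = pca.fit_transform(X_high)

print(X_low.shape[1], round(pca.explained_variance_ratio_.sum(), 3))
```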

2.2.4 Metrics for evaluating clustering results

Precision and Recall

Measuring the quality of a clustering when labeled data is available can be done with precision and recall. Given K known classes to be clustered, precision describes how pure a cluster is (high precision) or whether the cluster is mixed with samples from multiple classes (low precision).

4 "GaussianScatterPCA" by Ben FrantzDale, licensed under Creative Commons Attribution-Share Alike 3.0 via Wikimedia Commons: https://commons.wikimedia.org/wiki/File:GaussianScatterPCA.png


\[ P = \frac{T_p}{T_p + F_p} \]

Recall, on the other hand, describes how completely a class is contained within a cluster (high recall) or whether the class is distributed over multiple clusters (low recall):

\[ R = \frac{T_p}{T_p + F_n} \]

Here \(P\) is precision, \(R\) is recall, \(T_p\) is true positives, \(F_p\) is false positives and \(F_n\) is false negatives.
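One common way to turn these definitions into numbers for a clustering is to build a contingency table between clusters and known classes and use majority counts, as in the sketch below. This is a standard convention for cluster precision and recall, but the thesis's own evaluation scripts may differ in detail.

```python
import numpy as np

def cluster_precision_recall(true_labels, cluster_labels):
    """Majority-count precision/recall of a clustering against known classes."""
    true_labels = np.asarray(true_labels)
    cluster_labels = np.asarray(cluster_labels)
    classes = np.unique(true_labels)
    clusters = np.unique(cluster_labels)

    # Contingency table: rows are clusters, columns are classes.
    table = np.array([[np.sum((cluster_labels == cl) & (true_labels == c))
                       for c in classes] for cl in clusters])

    precision = table.max(axis=1).sum() / table.sum()  # how pure the clusters are
    recall = table.max(axis=0).sum() / table.sum()     # how complete the classes are
    return precision, recall

print(cluster_precision_recall([0, 0, 1, 1, 2, 2], [0, 0, 0, 1, 2, 2]))
```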

Adjusted Rand Index

The Rand index[17, 18] is used for comparing the similarity between two clusterings. It looks at all pairs of samples and determines whether they are in the same cluster or not in the true clustering, and then compares that to the predicted clustering. That way it can calculate how many pairs are clustered in the same way in both clusterings. The RI is defined as follows:

Let C be the true class assignment and let K be the clustering; then we can define a and b such that:

• a, the number of pairs of elements that are in the same set in C and in the same set in K

• b, the number of pairs of elements that are in different sets in C and in different sets in K

The unadjusted Rand index can then be defined as:
\[ RI = \frac{a + b}{\binom{n}{2}} \]
where \(\binom{n}{2}\) is the total number of possible pairs in the dataset (without ordering).

The problem with the Rand index is that a random assignment of labels will usually not result in a Rand index of zero but rather in some higher positive value. That is where the adjusted Rand index comes into play. By discounting the expected Rand index E[RI] of random labelling, the adjusted Rand index can be defined as:
\[ ARI = \frac{RI - E[RI]}{\max(RI) - E[RI]} \]
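Scikit-learn, which is listed among the software used in this project, exposes this metric directly as adjusted_rand_score; the snippet below is a small usage sketch with made-up labels.

```python
from sklearn.metrics import adjusted_rand_score

true_assignment = [0, 0, 0, 1, 1, 1, 2, 2, 2]  # C: known genome of each contig
clustering      = [2, 2, 2, 0, 0, 1, 1, 1, 1]  # K: predicted cluster labels

# Label names do not matter; only how the pairs are grouped does.
print(adjusted_rand_score(true_assignment, clustering))
```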


Chapter 3

Methods

In this chapter we look at the datasets used for evaluating the coverage-based performance of CONCOCT and give a description of the algorithm behind CONCOCT. The most important software packages used for testing different parameters of CONCOCT and analysing the results are also discussed.

3.1 Datasets

In this project, two synthetic mock metagenome datasets were used. These datasets were generated for the evaluation of CONCOCT, the software built from the joint effort of this master's project and a fellow student's master's project. Here they are used to evaluate how well CONCOCT is able to cluster using the coverage of contigs. A species mock was generated for evaluating its ability to resolve complex communities down to species level, while the strain mock was generated for evaluating its ability to resolve communities down to strain level[19].

3.2 Software

3.2.1 Python

Python is a dynamic programming language that has become very popular for scientific computing. The adoption of Python has been driven by the development of modules that extend the functionality of the language, such as interactive development environments[20], scientific calculations[21], bioinformatics[22], machine learning[13], plotting[23] and data analysis, to name a few of value to many scientists. The main reason these modules have been successful is that they rely on a module that has become the core of most scientific work done in Python, NumPy[21]. NumPy implements fast array manipulation and basic linear algebra operations for Python. Other modules can then use NumPy for their own purposes.

This project utilises Python version 2.7 with NumPy, Pandas, Scikit-Learn and BioPython for their various capabilities.

1 http://www.python.org/about/
2 http://pandas.pydata.org/


3.2.2 Snakemake, GitHub & BitBucket

To make the work reproducible, GitHub was used for CONCOCT and BitBucket for the thesis work. A build system called Snakemake[24] was also used for the thesis work. The combination of these tools made all the difference when problems were detected in the file system where the results for the thesis were generated and all the results were lost. By getting the latest version from BitBucket and running the build pipeline in Snakemake, all results, figures and data were recreated without problems. This shows how important it is for researchers to make their analyses reproducible.

The analysis for the thesis work is publicly available here: https://bitbucket.org/binnisb/thesisdata/src

3.2.3 CONCOCT

CONCOCT is a program that joins contig coverage across multiple samples and contig k-mer composition to automatically cluster contigs into taxonomically relevant clusters. The general algorithm of CONCOCT is as follows, for N contigs, K k-mers, M samples, contig length threshold T and C clusters:

1. Fasta file converted to N vectors, where each vector is the k-mer profile for that contig (N × K matrix).

2. Coverage file converted to N vectors, where each vector is the abundance profile for the corresponding contig, measured in reads per sample that map to the contig (N × M matrix).

3. k-mer and abundance profiles joined and normalised, resulting in an N × (K + M) matrix.

4. PCA performed to reduce dimensions from K + M to D, where D is selected to explain x% of the variance of the data.

5. Rows with contigs shorter than the threshold T are filtered out from the N contigs, leaving F contigs and resulting in training data of size F × D.

6. Fit the F × D matrix to a mixture of C Gaussians using the EM algorithm.

7. Predict all N rows to their clusters.

This is repeated for different numbers of clusters C, and the one with the best result (lowest BIC score) is selected as the optimal clustering. To use the DPGMM instead, the only step that needs to change in the above is step 6, which becomes:

6. Fit the F × D matrix to a mixture of Gaussians, using a Dirichlet process with variational inference to infer the number of clusters.

To ignore the k-mer profiles, CONCOCT can set the dimension K to zero so the joined matrix will be only N × M.
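As a compact way to read the algorithm, the sketch below strings steps 2-7 together for the coverage-only case (K = 0) using the modern scikit-learn API. The normalisation is a crude stand-in and the function is an illustration of the flow, not a reimplementation of CONCOCT 0.1.

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.mixture import GaussianMixture

def cluster_by_coverage(coverage, contig_lengths, n_clusters,
                        length_threshold=1000, variance=0.90,
                        covariance_type="full", seed=0):
    """Coverage-only sketch of steps 2-7 above; composition (K) is dropped."""
    # Crude normalisation placeholder; CONCOCT applies its own normalisation.
    profiles = np.log(np.asarray(coverage, dtype=float) + 1.0)       # N x M

    # Step 4: keep enough principal components to explain `variance` of the data.
    reduced = PCA(n_components=variance).fit_transform(profiles)     # N x D

    # Step 5: train only on contigs passing the length threshold (F x D).
    keep = np.asarray(contig_lengths) >= length_threshold

    # Step 6: fit a mixture of C Gaussians with EM.
    gmm = GaussianMixture(n_components=n_clusters,
                          covariance_type=covariance_type,
                          random_state=seed).fit(reduced[keep])

    # Step 7: assign every contig, including the short ones, to a cluster.
    return gmm.predict(reduced)
```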

CONCOCT takes two mandatory input files: a fasta file with the contigs, and a tab-separated file containing the calculated coverage table of these contigs in each sample. The fasta file is generated by the assembly, and the coverage file is generated by mapping the reads

3 https://github.com/
4 https://bitbucket.org/


to the contigs using the script map-bowtie2-markduplicates.sh, which utilises Bowtie2[25], SAMtools[26] and BEDTools[27]. Using the script gen_input_table.py we then generate the coverage table. These scripts are provided with CONCOCT. The procedure to create the coverage file from the assembly is explained in an example on how to run the whole CONCOCT pipeline.

In this project we examine how well CONCOCT can cluster based only on the coverage information by having it ignore the sequence composition. The analysis done here uses version 0.1, which can be found on GitHub. The development of CONCOCT has continued and is currently at version 0.4. Since version 0.2 it has changed to a different mathematical model for fitting the data and estimating the number of clusters.

All the results were generated using UPPMAX resources for high performance computing.

5 https://github.com/BinPro/CONCOCT/blob/0.1/doc/complete_example.md#map-the-reads-onto-the-contigs
6 https://github.com/BinPro/CONCOCT/tree/0.1
7 http://www.uppmax.uu.se/

Chapter 4

Results

In this chapter we look at the results from changing various parameters of CONCOCT to estimate their effect on clustering based on coverage. We look at the performance of the most promising combination of parameter values. Finally, alternative methods for finding the optimal number of clusters were tested.

4.1 Setup

Four variables were altered to experiment with their effect on the quality of the clustering by CONCOCT when using coverage data only. The variables altered were:

Covariance model: This variable controls the variance of the clusters, that is, what shape the Gaussian distributions can take. Values tested: full, tied, diag, spherical.

Explained variance / dimensions after PCA: This variable controls what percentage of the variance should be explained by the PCA. The algorithm then selects the number of PCA components accordingly, so that at least that level of variance is explained. Values tested: 70, 71, ..., 100.

Contig length threshold: This variable controls the minimum length of the contigs used for clustering. Shorter contigs are filtered out before training the model. Values tested: 1, 500, 1000, ..., 10000.

Number of samples: This variable was tested to see the effect of using different numbers of samples. Values tested: 2, 4, ..., all.

While testing one variable, the others were kept at their default values as set by CONCOCT. These are the parameters changed and their default values:

--coverage_percentage_pca: default 90%.

--covariance_type: default full.

--length_threshold: default 1000.

The other parameters that were fixed to something other than their default values are:


Figure 4.1: Precision, Recall and ARI for the different Covariance types.

--clusters: Set to the true number of organisms found in each dataset, 20 for the strain mock and 101 for the species mock. Only in the final experiment was CONCOCT asked to figure out the number of clusters automatically.

--kmer_length: Set to 2 in all runs to make generating the k-mer table as fast as possible, since it is not used when only working with coverage.

--force_seed: Set uniquely per execution to allow re-execution to generate the same results.

--split_pca: This boolean parameter is set to allow the composition to be ignored.

--composition_percentage_pca: Set to 0 to ignore the composition data.

Since the EM algorithm is only guaranteed to converge to a local maximum, depending on the initial values of the clusters, the random restart approach was applied in the hope of finding the global maximum, or at least a good local maximum.

All of these experiments used a fixed number of clusters (the true number of clusters). Using a combination of the parameter values that gave good results individually, an experiment was then run to test CONCOCT's ability to figure out the cluster number using the BIC score. Using the BIC score to estimate the number of clusters turned out to perform very badly.

Since Scikit-Learn offers other GMM classifiers, it was simple to test a classifier called the Dirichlet Process Gaussian Mixture Model (DPGMM) as an almost drop-in replacement for the GMM. With the DPGMM the algorithm is given an upper bound on the number of clusters to use and a concentration parameter α, and it estimates the number of clusters using the Dirichlet process.

4.2 Results per parameter

4.2.1 Covariance

Both datasets in Figure 4.1 show that the ARI gets better with more complex covariance matrices (spherical < diag < tied), until the full matrix, which shows worse ARI. This suggests that the full matrix has too many free parameters and not enough data to fit them correctly.

Figure 4.2: Execution time in minutes for the different covariance types.

The number of PCA components needed to fulfill the 90% explained variance criterion is the same for all covariance types, that is, 11 components for the strain mock and 36 components for the species mock.

It is interesting to see in Figure 4.2 how these covariance matrices affect the execution time. The main difference in execution time (1 minute vs 10 minutes) can be explained by the dimensions of the data (11 components and 9411 contigs for the strain mock, 36 components and 37627 contigs for the species mock). There is not much to say about the time difference for the strain dataset since the total time is so short, but there is obviously a large difference when fitting the full matrix for the species dataset compared to the other types.

4.2.2 Principal Component Analysis

CONCOCT allows us to set the minimum percentage of variance that should be explained, which then determines how many PCA components to keep. In Figure 4.3 it can be seen that the metrics increase when more of the variance is explained.

Figure 4.4 shows how many principal components are required to explain a given percentage of the variance. By definition, explaining more variance requires as many or more components.

The execution time was quite stable, but increased towards higher explained variance, where more components are used.

4.2.3 Threshold on contig length

Figure 4.5 shows the effects of removing contigs below a certain length threshold before fitting the GMMs. After fitting, all contigs are predicted to the clusters. The results improve when removing short contigs, until contigs of length 3000 are removed. After that the ARI starts decreasing.

Figure 4.6 shows that the number of components needed to explain 90% of the variance is similar for different thresholds, 11 for the strain mock and between 34 and 37 for the species mock.


Figure 4.3: Precision, Recall and ARI for different levels of PCA.

Figure 4.4: Number of PCA components needed to explain the given percentage of variance.


Figure 4.5: Precision, Recall and ARI when filtering out contigs under threshold.

Figure 4.6: Filtering out contigs can affect the number of components after PCA.


Figure 4.7: Execution in minutes for the different thresholds.

Figure 4.8: The number of contigs that passed the threshold, used for fitting.

In Figure 4.7 both datasets show a general decrease in execution time when the threshold is increased.

Figure 4.8 shows how many contigs are left for fitting after removing those that are shorter than or equal in length to the threshold. The number ranges from 11471 down to 5538 before dropping to 630 for the strain mock, and from 41029 down to 31654 before dropping to 3626 for the species mock. The big drop at the end happens because contigs of length 20 kb or longer were cut into 10 kb pieces until the last piece was greater than 10 kb and less than 20 kb. Therefore all contigs are less than 20 kb in length, so an originally long contig will create multiple 10 kb pieces and one piece between 10 kb and 20 kb.
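The cutting rule described above can be written down in a few lines; the sketch below mirrors the description (pieces of 10 kb, with the remainder fused into a final 10-20 kb piece) but is not necessarily identical to the script used in the thesis pipeline.

```python
def cut_up_contig(seq, chunk=10000):
    """Split sequences of length >= 2*chunk into chunk-sized pieces,
    keeping the remainder fused to the last piece (chunk <= last < 2*chunk)."""
    if len(seq) < 2 * chunk:
        return [seq]
    n_pieces = len(seq) // chunk  # at least 2
    pieces = [seq[i * chunk:(i + 1) * chunk] for i in range(n_pieces - 1)]
    pieces.append(seq[(n_pieces - 1) * chunk:])  # final 10-20 kb piece
    return pieces

print([len(p) for p in cut_up_contig("A" * 43000)])  # [10000, 10000, 10000, 13000]
```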

4.3 Number of samples

Figure 4.9 shows that, generally, having more samples gives better ARI.


Figure 4.9: Precision, Recall and ARI when using different number of samples.

Figure 4.10: Number of components after PCA for different number of samples.


Dataset  Model   Type    Coverage   Cov + Comp
Strain   DPGMM   0.01          39           36
                 0.1           39           37
                 1             38           38
                 10            39           39
                 100           38           38
         GMM     Fixed         20           20
                 BIC           17           11
Species  DPGMM   0.01         148          139
                 0.1          151          138
                 1            145          138
                 10           146          140
                 100          145          146
         GMM     Fixed        101          101
                 BIC           71           71

Table 4.1: Number of clusters from CONCOCT using coverage and using coverage + composition.

As the number of samples increases, more components are needed to explain 90% of the variance, as can be seen in Figure 4.10. They range from 3 to 11 for the strain mock and from 2 to 36 for the species mock.

4.4 CONCOCT with selected parameters

4.4.1 Coverage only

Using the results from the previous experiments, the following parameters for CONCOCT were selected:

Covariance model: tied
Explained variance: 90%
Contig length filtering: 3000
Number of samples: all samples

Note that the explained variance is set to 90% instead of the 100% that gave the best results in the experiments above. That is because running CONCOCT with all components (100% explained variance) generated very poor results. An experiment using 90% variance gave much better results, which are shown here.

Since CONCOCT did not produce good results when estimating the number of clusters using the GMM and BIC, the DPGMM was tested as well. For the DPGMM, the same parameters were used as for the GMM, except that an upper bound was given for the number of clusters and the concentration parameter α was tested at 0.01, 0.1, 1, 10 and 100.


Figure 4.11: Precision, Recall and ARI for GMM and DPGMM, using only coverage.

As can be seen from Table 4.1, the true numbers of clusters are 101 and 20 for the species and strain datasets respectively, while when CONCOCT is set to estimate them automatically it arrives at 71 and 17 clusters respectively. This underestimation of the number of clusters represents an underestimation of the true number of organisms found in the datasets.

The DPGMM overestimates the number of clusters, representing split genomes.

Figure 4.11 shows that CONCOCT performs well (ARI 0.9916 and 0.8958 for species and strain respectively) when using a fixed number of clusters (101 and 20). Having CONCOCT estimate the number of clusters (71 and 17) results in worse ARI (0.7429 and 0.7982) and precision but better recall. The metrics can be seen in Table 4.2.

The number of PCA components used are 11 and 37 for Fixed and Estimated respectively.

The DPGMM uses the same parameters for CONCOCT as above. The DPGMM performs much better than using the GMM with BIC scores to estimate the number of clusters, resulting in ARIs around 0.95 and 0.88 for species and strain respectively.

4.4.2 Coverage and composition

To test how CONCOCT does when adding the composition, it was run with the default values for composition (k-mer length 4, 90% PCA variance explained) while still using the same parameters as in the last experiment: tied model, 90% explained variance, threshold of 3000 bases and all samples.

In Table 4.1 the BIC-estimated numbers of clusters are 71 and 11 for species and strain respectively, while the DPGMM resulted in around 140 and 37 respectively.

Figure 4.12 shows similar results as for coverage only. However, comparing the BIC runs between the experiments shows that using composition improved the ARI score for species. The metrics are shown in Table 4.2.


Dataset  Model   Type   Metric                Coverage   Cov + Comp
Strain   DPGMM   0.01   Adjusted Rand Index   0.899546   0.875807
                        Precision             0.943792   0.945219
                        Recall                0.882644   0.866396
                 0.1    Adjusted Rand Index   0.885944   0.891173
                        Precision             0.949720   0.941157
                        Recall                0.866945   0.864859
                 1      Adjusted Rand Index   0.866746   0.894588
                        Precision             0.947634   0.946646
                        Recall                0.872434   0.857723
                 10     Adjusted Rand Index   0.886101   0.890259
                        Precision             0.949610   0.942035
                        Recall                0.865957   0.855747
                 100    Adjusted Rand Index   0.866699   0.890290
                        Precision             0.944451   0.941706
                        Recall                0.869470   0.859370
         GMM     Fixed  Adjusted Rand Index   0.895806   0.882690
                        Precision             0.921616   0.915249
                        Recall                0.924251   0.924361
                 BIC    Adjusted Rand Index   0.798199   0.531455
                        Precision             0.889340   0.720496
                        Recall                0.956856   0.996377
Species  DPGMM   0.01   Adjusted Rand Index   0.945262   0.971051
                        Precision             0.980167   0.995182
                        Recall                0.955968   0.968081
                 0.1    Adjusted Rand Index   0.941944   0.980584
                        Precision             0.983149   0.995128
                        Recall                0.948302   0.973326
                 1      Adjusted Rand Index   0.968073   0.930226
                        Precision             0.987248   0.967788
                        Recall                0.968028   0.955170
                 10     Adjusted Rand Index   0.945682   0.953705
                        Precision             0.973725   0.979608
                        Recall                0.962837   0.959110
                 100    Adjusted Rand Index   0.940674   0.884186
                        Precision             0.964620   0.979608
                        Recall                0.961373   0.938452
         GMM     Fixed  Adjusted Rand Index   0.991637   0.992065
                        Precision             0.995820   0.996033
                        Recall                0.995820   0.996033
                 BIC    Adjusted Rand Index   0.742939   0.794312
                        Precision             0.803509   0.810909
                        Recall                0.994649   0.994090

Table 4.2: Metrics from CONCOCT using coverage and using coverage + composition.


Figure 4.12: Precision, Recall and ARI for GMM and DPGMM, using coverage and composition.


Chapter 5

Discussion

In this chapter we reflect on the parameters that were varied and their effects.

5.1 CONCOCT parameters

5.1.1 Covariance

Varying the covariance parameter suggests that using either the full or the tied covariance matrix results in better ARI, precision and recall than using the simpler matrices. It needs to be taken into account that the full covariance matrix has many more free parameters to fit than the tied matrix, which results in both longer execution time and the possibility of overfitting if there is not enough data available. Therefore the tied model is probably a good choice for the covariance type, though when enough data is available and execution time is not an issue the full model should be considered.

5.1.2 Principal Components Analysis

Varying the PCA parameter shows that explaining more variance results in better ARI, precision and recall. However, keeping more components results in more free parameters for the fitting, which can lead to longer execution times and overfitting if not enough data is available. Using 100% explained variance was tested. The results were worse than expected, and the theory was that the combination of using all PCA components and the tied covariance model had this effect. Since the PCA experiment showed that using a lower percentage of explained variance would still give good metrics, and since using the full model increased the running time quite drastically, it was decided to use 90% explained variance. This looks like a good balance between accuracy, reduction of free parameters and execution time.

5.1.3 Threshold on contig length

Filtering out short contigs before doing PCA and clustering gives better ARI, precision and recall than using all available contigs, even when only using the coverage data. This could be because short contigs add more noise to the data, since they recruit fewer reads. The performance improves until the threshold reaches contigs of about 2000 to 3000 bp in length. Beyond that, contigs with more informative coverage profiles are being removed.


5.2 Number of samples

It is evident that multiple samples give better coverage information, and better ARI, precision and recall, than using few samples. It is therefore preferable to use as many samples as are available, though the gain in performance after 40 samples is maybe not worth the effort of collecting, preparing, sequencing and analysing those samples. This should, however, be examined further with different datasets, as it might vary quite a lot between environments whose complexity is vastly different.

5.3 CONCOCT with selected parameters

5.3.1 Coverage only

With the parameters selected for the final experiment, a few things come to light. When having CONCOCT estimate the number of clusters using the BIC score, it arrives at too few clusters. The high recall indicates that the clusters contain mostly complete genomes, and the low precision suggests that clusters might contain multiple genomes. This indicates that similar genomes might be merged into a single cluster. Using the DPGMM to estimate the number of clusters works much better than using the BIC score with the GMM.

5.3.2 Coverage and composition

Here the same parameters were used, with the addition of the composition data with default parameters. The results are unexpected in that composition improves neither the cluster number estimation nor the ARI and other metrics. The likely explanation is that having so many samples outweighs the information found in the k-mer composition. The hope is that by exploring the parameters for the composition, better results could be accomplished. What can also be seen is that the DPGMM outperforms the GMM with BIC score quite drastically when estimating the number of clusters. This suggests that the DPGMM, or another method to estimate the number of clusters, should be used instead of the GMM with BIC.


Chapter 6

Conclusion

This project looked at the ability of CONCOCT to cluster metagenomic contigs using the coverage of the contigs. The effects of different parameters, and of variation within them, on the clustering results have been examined. A combination of parameters was selected, with the aim of performing as well as possible while reducing the complexity of the model for better execution performance.

Using these parameters, CONCOCT performs fairly well when the actual number of clusters is set beforehand. CONCOCT underestimates the number of clusters when it has to estimate the number itself. When CONCOCT uses fewer clusters than the true number, some clusters represent more than one organism (low precision), but the clusters contain almost all contigs from the organisms that are found in that cluster (high recall).

Including the composition of the contigs, using the default values for CONCOCT, did not perform better than only using the coverage. The likely explanation is that when using all samples for the coverage (64 and 96 for species and strain respectively) no additional information is obtained by adding composition information. However, if there are fewer samples, or the samples have different abundances of organisms, the composition could improve the results. Another point to look at is the default parameters for the composition; for example, using longer k-mers could make the composition more informative. The fact that coverage alone works so well makes it worth considering leaving out composition data when many samples are available, since certain genomic regions, such as recently laterally transferred genes or ribosomal and transfer RNA genes, have deviating composition and may be wrongly clustered.

Switching models for estimating the number of clusters, from the BIC score to the Dirichlet process, improved performance. The DPGMM overestimates the number of clusters, but still gives very good ARI. This suggests that CONCOCT should use some other method of estimating the number of clusters than the BIC score, which is actually what happened in later versions (v0.2) of CONCOCT. From v0.2 a variational Bayesian approach is used for estimating the number of clusters. The version of CONCOCT with the variational Bayesian algorithm was published in Nature Methods[19].

Given these datasets, it can be said that CONCOCT can cluster well using coverage alone. It should however be tested on real datasets, for instance using single-copy housekeeping genes for evaluating clusters.


[1] F. Sanger, G. M. Air, B. G. Barrell, N. L. Brown, A. R. Coulson, J. C. Fiddes, C. A. Hutchison, P. M. Slocombe, and M. Smith. Nucleotide sequence of bacteriophage φX174 DNA. Nature, 265(5596):687–695, feb 1977. ISSN 0028-0836. doi: 10.1038/265687a0. URL http://www.nature.com/doifinder/10.1038/265687a0.

[2] J T Staley and A Konopka. Measurement of in Situ Activities of Nonphotosynthetic Microorganisms in Aquatic and Terrestrial Habitats. Annual Review of Microbiology, 39(1):321–346, oct 1985. ISSN 0066-4227. doi: 10.1146/annurev.mi.39.100185.001541. URL http://www.annualreviews.org/doi/10.1146/annurev.mi.39.100185.001541.

[3] T M Schmidt, E F DeLong, and N R Pace. Analysis of a marine picoplankton community by 16S rRNA gene cloning and sequencing. Journal of Bacteriology, 173(14):4371–8, jul 1991. ISSN 0021-9193. URL http://www.ncbi.nlm.nih.gov/pubmed/2066334.

[4] P Hugenholtz, B M Goebel, and N R Pace. Impact of culture-independent studies on the emerging phylogenetic view of bacterial diversity. Journal of Bacteriology, 180(18):4765–74, sep 1998. ISSN 0021-9193. URL http://www.ncbi.nlm.nih.gov/pubmed/9733676.

[5] Jo Handelsman, Michelle R. Rondon, Sean F. Brady, Jon Clardy, and Robert M. Goodman. Molecular biological access to the chemistry of unknown soil microbes: a new frontier for natural products. Chemistry & Biology, 5(10):R245–R249, oct 1998. ISSN 10745521. doi: 10.1016/S1074-5521(98)90108-9. URL http://www.cell.com/chemistry-biology/abstract/S1074-5521(98)90108-9.

[6] Kevin Chen and Lior Pachter. Bioinformatics for whole-genome shotgun sequencing of microbial communities. PLoS Computational Biology, 1(2):106–12, jul 2005. ISSN 1553-734X. doi: 10.1371/journal.pcbi.0010024. URL http://dx.plos.org/10.1371/journal.pcbi.0010024.

[7] J Craig Venter, Karin Remington, John F Heidelberg, Aaron L Halpern, Doug Rusch, Jonathan A Eisen, Dongying Wu, Ian Paulsen, Karen E Nelson, William Nelson, Derrick E Fouts, Samuel Levy, Anthony H Knap, Michael W Lomas, Ken Nealson, Owen White, Jeremy Peterson, Jeff Hoffman, Rachel Parsons, Holly Baden-Tillson, Cynthia Pfannkoch, Yu-Hui Rogers, and Hamilton O Smith. Environmental genome shotgun sequencing of the Sargasso Sea. Science (New York, N.Y.), 304(5667):66–74, apr 2004. ISSN 1095-9203. doi: 10.1126/science.1093857. URL http://www.sciencemag.org/cgi/doi/10.1126/science.1093857.


[8] Heidi Chial. DNA sequencing technologies key to the Human Genome Project. Nature Education, 1(1):219, 2008. URL http://www.nature.com/scitable/topicpage/DNA-Sequencing-Technologies-Key-to-the-Human-828?auTags=.

[9] Jason R Miller, Sergey Koren, and Granger Sutton. Assembly algorithms for next-generation sequencing data. Genomics, 95(6):315–27, jun 2010. ISSN 1089-8646. doi: 10.1016/j.ygeno.2010.03.001. URL http://www.pubmedcentral.nih.gov/articlerender.fcgi?artid=PMC2874646.

[10] Hanno Teeling, Anke Meyerdierks, Margarete Bauer, Rudolf Amann, and Frank Oliver Glockner. Application of tetranucleotide frequencies for the assignment of genomic fragments. Environmental Microbiology, 6(9):938–947, sep 2004. ISSN 1462-2912. doi: 10.1111/j.1462-2920.2004.00624.x. URL http://doi.wiley.com/10.1111/j.1462-2920.2004.00624.x.

[11] Tom M Mitchell. Machine Learning. Machine Learning, 1:432, 1997. URL http://www.amazon.com/Machine-Learning-Tom-M-Mitchell/dp/0070428077.

[12] Christopher M Bishop. Pattern Recognition and Machine Learning. Journal of Electronic Imaging, 16(4):049901, jan 2007. ISSN 1017-9909. doi: 10.1117/1.2819119. URL http://electronicimaging.spiedigitallibrary.org/article.aspx?doi=10.1117/1.2819119.

[13] Fabian Pedregosa, Gaël Varoquaux, Alexandre Gramfort, Vincent Michel, Bertrand Thirion, Olivier Grisel, Mathieu Blondel, Peter Prettenhofer, Ron Weiss, Vincent Dubourg, Jake Vanderplas, Alexandre Passos, David Cournapeau, Matthieu Brucher, Matthieu Perrot, and Édouard Duchesnay. Scikit-learn: Machine Learning in Python. Journal of Machine Learning Research, 12:2825–2830, jan 2012. ISSN 15324435. URL http://arxiv.org/abs/1201.0490.

[14] J. Sethuraman. A constructive definition of Dirichlet priors, 1994. URL http://www3.stat.sinica.edu.tw/statistica/j4n2/j4n27/..%5Cj4n216%5Cj4n216.htm.

[15] David M. Blei and Michael I. Jordan. Variational inference for Dirichlet process mixtures. Bayesian Analysis, 1(1):121–143, mar 2006. ISSN 1936-0975. doi: 10.1214/06-BA104. URL http://projecteuclid.org/euclid.ba/1340371077.

[16] Kevin Beyer, Jonathan Goldstein, Raghu Ramakrishnan, and Uri Shaft. When Is “Nearest Neighbor” Meaningful? In Database Theory-ICDT 99, pages 217–235, 1999. ISBN 978-3-540-65452-0. doi: 10.1007/3-540-49257-7_15. URL http://link.springer.com/10.1007/3-540-49257-7_15.

[17] William M Rand. Objective Criteria for the Evaluation of Clustering Methods. Journal of the American Statistical Association, 66(336):846, dec 1971. ISSN 01621459. doi: 10.2307/2284239. URL http://www.jstor.org/stable/2284239?origin=crossref.

[18] Lawrence Hubert and Phipps Arabie. Comparing partitions. Journal of Classification, 2(1):193–218, dec 1985. ISSN 0176-4268. doi: 10.1007/BF01908075. URL http://link.springer.com/10.1007/BF01908075.


[19] Johannes Alneberg, Brynjar Smári Bjarnason, Ino de Bruijn, Melanie Schirmer, Joshua Quick, Umer Z Ijaz, Leo Lahti, Nicholas J Loman, Anders F Andersson, and Christopher Quince. Binning metagenomic contigs by coverage and composition. Nature Methods, 11(11):1144–1146, sep 2014. ISSN 1548-7091. doi: 10.1038/nmeth.3103. URL http://www.nature.com/doifinder/10.1038/nmeth.3103.

[20] Fernando Perez and Brian E. Granger. IPython: A System for Interactive Scientific Computing. Computing in Science & Engineering, 9(3):21–29, 2007. ISSN 1521-9615. doi: 10.1109/MCSE.2007.53. URL http://ieeexplore.ieee.org/lpdocs/epic03/wrapper.htm?arnumber=4160251.

[21] Travis E. Oliphant. Python for Scientific Computing. Computing in Science & Engineering, 9(3):10–20, 2007. ISSN 1521-9615. doi: 10.1109/MCSE.2007.58. URL http://ieeexplore.ieee.org/lpdocs/epic03/wrapper.htm?arnumber=4160250.

[22] Peter J A Cock, Tiago Antao, Jeffrey T Chang, Brad A Chapman, Cymon J Cox, Andrew Dalke, Iddo Friedberg, Thomas Hamelryck, Frank Kauff, Bartek Wilczynski, and Michiel J L de Hoon. Biopython: freely available Python tools for computational molecular biology and bioinformatics. Bioinformatics, 25(11):1422–1423, jun 2009. ISSN 1367-4803. doi: 10.1093/bioinformatics/btp163. URL http://bioinformatics.oxfordjournals.org/cgi/doi/10.1093/bioinformatics/btp163.

[23] John D. Hunter. Matplotlib: A 2D Graphics Environment. Computing in Science & Engineering, 9(3):90–95, 2007. ISSN 1521-9615. doi: 10.1109/MCSE.2007.55. URL http://ieeexplore.ieee.org/lpdocs/epic03/wrapper.htm?arnumber=4160265.

[24] J. Koster and Sven Rahmann. Snakemake–a scalable bioinformatics workflow engine. Bioinformatics, 28(19):2520–2522, oct 2012. ISSN 1367-4803. doi: 10.1093/bioinformatics/bts480. URL http://bioinformatics.oxfordjournals.org/cgi/doi/10.1093/bioinformatics/bts480.

[25] Ben Langmead and Steven L Salzberg. Fast gapped-read alignment with Bowtie 2. Nature Methods, 9(4):357–359, mar 2012. ISSN 1548-7091. doi: 10.1038/nmeth.1923. URL http://www.nature.com/doifinder/10.1038/nmeth.1923.

[26] Heng Li, Bob Handsaker, Alec Wysoker, Tim Fennell, Jue Ruan, Nils Homer, Gabor Marth, Goncalo Abecasis, and Richard Durbin. The Sequence Alignment/Map format and SAMtools. Bioinformatics, 25(16):2078–2079, aug 2009. ISSN 1367-4803. doi: 10.1093/bioinformatics/btp352. URL http://bioinformatics.oxfordjournals.org/cgi/doi/10.1093/bioinformatics/btp352.

[27] Aaron R Quinlan and Ira M Hall. BEDTools: a flexible suite of utilities for comparing genomic features. Bioinformatics, 26(6):841–842, mar 2010. ISSN 1367-4803. doi: 10.1093/bioinformatics/btq033. URL http://bioinformatics.oxfordjournals.org/cgi/doi/10.1093/bioinformatics/btq033.
