

Master's thesis, 30 credits

Master of Science in Industrial Engineering and Management, Industrial Statistics, 300 credits

Metagenomic Classification using Machine Learning

Applied to SARS-CoV-2 and Viruses

Authors:

Eric Tjärnström and Nicolai Granholm


Copyright © 2020 Eric Tjärnström and Nicolai Granholm. All rights reserved.

METAGENOMIC CLASSIFICATION USING MACHINE LEARNING - APPLIED TO SARS-COV-2 AND VIRUSES

Department of Mathematics and Mathematical Statistics
Umeå University
90187 Umeå, Sweden

Supervisors:
Patrik Rydén, Umeå University
Andreas Sjödin, Swedish Defence Research Agency

Examiner:
Konrad Abramowicz, Umeå University


Abstract

The use of machine learning within the field of metagenomic classification is becoming more relevant, since the increasing sequencing speed demands faster and more accurate classification algorithms. This thesis explores the possible application of machine learning methods, used individually and in an ensemble solution, for binary classification of short DNA sequences. The models Convolutional Neural Network, Recurrent Neural Network, Support Vector Machine, Random Forest, Gradient Boosting and K-Nearest Neighbour are trained to distinguish viruses and SARS-CoV-2 from other organisms. The models are evaluated on generated data, as well as real samples.

The outcome of this thesis shows that machine learning methods achieve satisfying results when classifying short DNA sequences, in terms of both accuracy and speed. The best overall accuracies are obtained using ensemble solutions consisting of several machine learning models.

Sammanfattning

The use of machine learning within metagenomic classification is becoming increasingly relevant, as the rising sequencing speed places higher demands on faster and more precise classification algorithms. This thesis explores the possibility of using machine learning models, both individually and combined, for binary classification of short DNA sequences. The models Convolutional Neural Network, Recurrent Neural Network, Support Vector Machine, Random Forest, Gradient Boosting and K-Nearest Neighbour are trained to distinguish viruses and SARS-CoV-2 from other organisms. The models are evaluated on generated as well as real data.

The conclusion of this thesis is that machine learning methods achieve satisfying results when classifying short DNA sequences, in terms of both accuracy and runtime. The best accuracies come from combined solutions consisting of several machine learning models.


Acknowledgements

We would like to express our appreciation to FOI and the people at the biology unit for making this thesis possible by providing us with the project and all the resources needed to succeed. We are grateful for having been able to conduct our thesis on site, despite the ongoing global pandemic. Our deepest gratitude goes to our supervisor Dr. Andreas Sjödin at FOI, who from start to finish has provided us with insights and guidance. This thesis would not be where it is today without your valuable support.

We would like to thank our supervisor Assoc. Prof. Patrik Rydén at Umeå University. Your combined knowledge in both bioinformatics and mathematical statistics gave us a valuable link between the two fields.

Finally, we would like to give special thanks to Hannah Wallin for spending hours reading our thesis and providing us with feedback.

Yours sincerely, Eric & Nicolai


Contents

Abstract
Sammanfattning
Acknowledgements
1 Introduction
  1.1 Background
  1.2 Company
  1.3 Aim
  1.4 Related Work
2 Data
  2.1 Data generation
  2.2 Data structure
  2.3 Datasets
    2.3.1 Virus, Eukaryotic, Bacteria
    2.3.2 SARS-CoV-2
    2.3.3 Real data
3 Method
  3.1 Initial model selection
  3.2 Tools setup
    3.2.1 Sequencing Equipment
  3.3 Feature extraction
  3.4 Hyperparameter tuning
  3.5 Classification of Virus/Non-Virus
    3.5.1 Convolutional Neural Network
    3.5.2 Recurrent Neural Network
    3.5.3 Support Vector Machine
    3.5.4 Random Forest
    3.5.5 K-Nearest Neighbour
    3.5.6 Gradient Boosting
    3.5.7 Model Comparison and Model Ensemble
  3.6 Classification SARS-CoV-2/Non-SARS-CoV-2
    3.6.1 Convolutional Neural Network
    3.6.2 Recurrent Neural Network
    3.6.3 Support Vector Machine
    3.6.4 Random Forest
    3.6.5 K-Nearest Neighbour
    3.6.6 Gradient Boosting
    3.6.7 Model Comparison and Model Ensemble
    3.6.8 Real Data Application
4 Results
  4.1 Principal Component Analysis
  4.2 Virus/Non-Virus
    4.2.1 Convolutional Neural Network
    4.2.2 Recurrent Neural Network
    4.2.3 Support Vector Machine
    4.2.4 Random Forest
    4.2.5 Gradient Boosting
    4.2.6 K-Nearest Neighbour
    4.2.7 Model comparison
  4.3 SARS-CoV-2/Non-SARS-CoV-2
    4.3.1 Convolutional Neural Network
    4.3.2 Recurrent Neural Network
    4.3.3 Random Forest
    4.3.4 Gradient Boosting
    4.3.5 K-Nearest Neighbour
    4.3.6 Model comparison
    4.3.7 Real data
5 Discussion
  5.1 Taxonomic Level
  5.2 Model Performance and Time Complexity
    5.2.1 Real data
  5.3 Further research
6 Conclusions
References
Appendix
  A Complementary Theory
  B Tuning - Virus/Non-Virus
    B.1 Convolutional Neural Network
    B.2 Recurrent Neural Network
    B.3 Support Vector Machine
    B.4 Random Forest
    B.5 K-Nearest Neighbours
    B.6 Gradient Boosting
  C Tuning - SARS-CoV-2/Non-SARS-CoV-2
    C.1 Convolutional Neural Network
    C.2 Recurrent Neural Network
    C.3 Random Forest
    C.4 K-Nearest Neighbour
    C.5 Gradient Boosting
  D Result Virus/Non-Virus
  E Result SARS-CoV-2/Non-SARS-CoV-2


1 Introduction

1.1 Background

Living organisms are built from genes, which are made of the molecule deoxyribonucleic acid, more commonly referred to as DNA. DNA is material inherited from an organism's parents and is unique for each species; there is even small variation within a species. DNA can, for example, be used to distinguish species or families of organisms, the so-called biological taxonomy[6].

The structure of DNA can be represented as a phosphate backbone with pairs of bases that link the DNA together. The four bases that make up the DNA chain are adenine (A), cytosine (C), thymine (T) and guanine (G)[35]. The bases that form pairs are A with T and C with G, and vice versa. These compositions of bases in the DNA chain work as a blueprint for the organism's structure and can be used to identify the specific organism. The DNA structure is visualized in Figure 1.

Figure 1: Visualization of the DNA structure. (Source: www.commons.wikimedia.org)

A sequence of a DNA molecule is the order of the base pairs as they are expressed in the DNA chain, and determining that order of bases is called DNA sequencing. Since A can only pair with T, and C only with G, the sequence does not need to contain both sides of the DNA chain; information about one side is sufficient, since the other side can easily be derived from it[35].
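A minimal sketch of this base-pairing rule in Python (the function name is ours, not part of any library):

```python
# Minimal sketch of the base pairing rule; the function name is ours.
COMPLEMENT = {"A": "T", "T": "A", "C": "G", "G": "C"}

def reverse_complement(sequence: str) -> str:
    """Recover the opposite strand of a DNA sequence."""
    return "".join(COMPLEMENT[base] for base in reversed(sequence))

print(reverse_complement("ATTACG"))  # CGTAAT
```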

Provided a database of complete DNA sequences for organisms, hereafter referred to as genomes, and tools for sequencing, it is possible to determine what a sample of unknown organisms contains. This is done by matching short DNA sequences from the unknown sample, hereafter referred to as reads, against a database of genomes. The study of genome sequences from an environment is called metagenomics[30]. Classification within metagenomics is based on evaluating whether the specific order of bases in a read can be found within a genome in the reference database. There are well-established algorithms developed for that specific task. One of these algorithms is BLAST[4], the Basic Local Alignment Search Tool. Another widely used method is Kraken 2[30].

One problem within metagenomic classification is that reads are short subsequences of larger genomes, which makes it hard to determine the origin of a read. Another problem is the lack of characterised genomes in the reference database; samples often contain organisms whose genomes do not exist in the database.

Sequencing methods are becoming more and more cost-efficient over time, which results in a rapid increase in the number of read sequences extracted from organisms. When more data can be extracted, there is an increasing demand for computational power to classify the origin of each sequence. The general trade-off when classifying is accuracy versus computational time: while BLAST is the most precise method, Kraken 2 is much faster[22].

1.2 Company

The Swedish Defence Research Agency, FOI, is a government agency reporting to the Swedish Ministry of Defence and is one of Europe’s leading research institutes within security and defence. Most of the research conducted by FOI is done for the Swedish military defence, but it can also be conducted for other parties, like private companies and foreign clients[12].

FOI has four research divisions, located in Linköping, Kista, Grindsjön and Umeå. The research facility in Umeå, where this thesis is conducted, is specialized in CBRN (Chemical, Biological, Radiological and Nuclear) security and protection.

The biology-related work under the CBRN department in Umeå is mainly focused on the analysis of biological samples suspected of containing dangerous substances. The conducted research can be summarised into three categories: operative support, technology development and microbial risks[13]. The people at FOI provide support and insights regarding the thesis project.

1.3 Aim

The aim of the thesis project is to examine which machine learning methods give the most satisfying classification results when classifying short DNA sequences. This work will serve as a proof of concept to show the potential application of machine learning methods within metagenomic classification. The practical solution will yield two classifiers, aimed at different levels of the taxonomy. One classifier will be used to determine whether a sequence read belongs to the virus kingdom or not. Another classifier, created at the strain level of the taxonomy, will be used to determine whether a sequence read originates from the coronavirus SARS-CoV-2.

1.4 Related Work

There are several papers that explore the area of metagenomic classification using machine learning methods.

The use of Deep Learning to classify coronaviruses given their DNA and RNA sequences is examined in [9]. The experiments were compiled using data from the Novel Coronavirus Resource. The data consisted of full genome sequences with lengths between 1260 and 31029 base pairs. The obtained result shows an average classification accuracy of 97.75% using a Convolutional Neural Network. This shows that the CNN architecture can be useful in DNA classification of coronaviruses, including SARS-CoV-2, using complete genome sequences.



The possibility of using a Recurrent Neural Network to classify metagenomic sequences is discussed in [19]. That work shows that a Recurrent Neural Network built with long short-term memory provides a computational time approximately 36 times faster than the traditional method BLAST. Even with the faster computational time, the classification result is almost identical to BLAST.

The paper [28] explores different input features for Deep Learning to distinguish between kingdom-level taxonomy classes. The conclusion of the paper is that Deep Learning has promising prospects to outperform established methods like Kraken 2.

The use of a Convolutional Neural Network and a Deep Belief Network in metagenomic classification is evaluated in [1]. K-mers are used as features, and the models yield a maximum classification accuracy of 91.3%.

The existing papers within the subject show that machine learning methods have good overall potential when it comes to metagenomic classification. However, none of the examined articles covers the specific topic this thesis examines.


2 Data

2.1 Data generation

The data used consists of sequence reads generated from DNA genomes, where each read is a short sequence with a length of 150 bases. When sequencing machines sequence DNA, the errors that occur in the data follow a typical structure: the beginning of the sequenced read is often accurate with little error, but further down the read the error gets progressively worse. The errors that occur are typically substituted or falsely inserted bases.

When reads are simulated using the Python-based InSilicoSeq, noise is added in a way that represents the error profile of Illumina NovaSeq sequencing machines[16]. This error profile is very similar to the one FOI would get when sequencing DNA in its lab.
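A hypothetical sketch of driving such a simulation from Python is shown below; the flag names follow our reading of the InSilicoSeq documentation and should be verified against the installed version.

```python
# Hypothetical sketch of invoking InSilicoSeq (iss) from Python. The flag
# names are from the InSilicoSeq documentation as we read it; verify them
# against the installed version before use.
import subprocess

subprocess.run(
    [
        "iss", "generate",
        "--genomes", "genomes.fasta",    # reference genomes to sample from
        "--model", "novaseq",            # Illumina NovaSeq error profile
        "--n_reads", "1000000",          # number of simulated reads
        "--output", "simulated_reads",   # output prefix for FASTQ files
    ],
    check=True,
)
```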

2.2 Data structure

When simulating reads using InSilicoSeq, the origin of all sequences is known. The overall structure of the data follows the example in Table 1. Each row represents one simulated read, which is 150 bases long and has a corresponding label, the origin from which the read was generated. The size of the datasets can vary, but the length of each read is strictly 150 bases.

| Sequenced DNA read | Origin |
|---|---|
| ATGCTATCTTACTACTGTCAGACTGC ... G | Class 1 |
| TTATCCGCATGACTACCCTCAGCTCA ... T | Class 2 |
| CGCCTTGCATACGATCAGCCATACGA ... C | Class 1 |
| CTGGGACCTTAAGGTTACCTACCTAG ... T | Class 1 |
| GACCGGTCCGCATCGTAAGCATCTAT ... T | Class 1 |
| ACTAGGGTGTCATGCACCGTAGTACT ... G | Class 2 |
| ACATTCAAAATCACTTGCTTTTGCGA ... A | Class 2 |
| TCAGCCGCAGGCATAGTCATTATAGC ... G | Class 1 |
| GTACAAGCATGATATACCGGCGACCT ... A | Class 2 |
| ... | ... |
| ACTGTCTACGTCATATATTTTTTTCT ... T | Class 2 |

Table 1: Overall structure of the data. N corresponds to the number of reads, and every read consists of 150 bases of the categorical variables A, T, C or G. Each read has a known class label, presented in the right column, giving the origin of the sequence.

2.3 Datasets

The data used for the two implementations is the same size and is split in the same way. The total number of observations (reads) is 2 million for each implementation. The data is split into two main parts: one for tuning the hyperparameters of the models, and one used to train and evaluate the final models. A visualization of how the data is split can be seen in Figure 2.

The Tuningset, as it will be called from here on, consists of 800 thousand reads and is randomly split into 85% training data and 15% test data when tuning the hyperparameters.


The data used in the creation of the final models consists of 1.2 million reads. This data has a fixed split between the Trainingset and the Testset, as they will be called from here on, with 800 thousand and 400 thousand reads respectively.

Figure 2: Visualization of how the data is split into training and test sets.

2.3.1 Virus, Eukaryotic, Bacteria

The data presented in this section is used to create a classifier that can distinguish virus genomes from bacterial and eukaryotic genomes. The virus and bacteria data were randomly gathered on 2020-02-11 from NCBI using InSilicoSeq[16], and the eukaryotic data was downloaded from NCBI[27] on 2020-03-17.

From the bacterial and eukaryotic genomes, 500 thousand reads are simulated from each domain. From viruses, 1 million reads are generated. Hence, the generated dataset consists of 2 million sequence reads over three kingdoms, where the number of virus reads is doubled to obtain equal class proportions, given the aim of the classifier. The generated data is split as described in Section 2.3.

Virus genomes can be up to hundreds of thousands of base pairs long, and bacterial genomes up to approximately 10 million base pairs[18]. Eukaryotic genomes can contain multiple billions of bases; the eukaryotic genome used is a human genome, which has approximately 3 billion base pairs[8]. Given these differences in genome length, the data generation uses different numbers of genomes from the different kingdoms: the bacteria reads are generated using 500 genomes, the eukaryotic reads using only one genome, and the virus reads using 5 thousand genomes.

2.3.2 SARS-CoV-2

The data presented in this section is used to create a classifier that can separate the SARS-CoV-2 virus from other viruses, bacteria and eukaryotes. Reads from the SARS-CoV-2 virus are labeled as one class, while other virus, bacteria and eukaryotic reads are compiled into another class. The proportion of reads is 50% SARS-CoV-2 and 50% virus, bacteria and eukaryotic.

The SARS-CoV-2 data were gathered from NCBI on 2020-04-06, and a total of 1 million SARS-CoV-2 reads are generated. The Non-SARS-CoV-2 class consists of virus, bacteria and eukaryotic reads, one third from each kingdom; the eukaryotic kingdom has 1 additional read to make it an even 1 million. The data is split as described in Section 2.3.


The SARS-CoV-2 genome has an approximate length of 30 thousand bases, and the data is sampled from 319 genomes of the virus.

2.3.3 Real data

For the validation of the SARS-CoV-2 classifiers, the final models are tested on three sets of real, non-generated data samples. All sequence reads in these sets are 150 base pairs long.

The first dataset, called setF, is from the first confirmed SARS-CoV-2 case in Sweden. The sample is SARS-CoV-2 positive and contains 320896 reads.

The second dataset, called setM, is a human feces sample. The sample is SARS-CoV-2 negative and contains 997061 reads.

The third dataset, called setU, is sewage water from Uppsala. The sample is SARS-CoV-2 negative and contains 4363463 reads.


3 Method

Underlying complementary theory for the work can be found in Appendix A.

3.1 Initial model selection

When selecting the initial machine learning models to evaluate, a set of six models is chosen. The selected models are Convolutional Neural Network, Recurrent Neural Network, Support Vector Machine, Random Forest, Gradient Boosting and K-Nearest Neighbour. More models could be selected, but since these models are well tested and fundamentally different from each other, they give a good representation of different machine learning techniques.

3.2 Tools setup

All model training and prediction is done remotely on a server provided by FOI. The most important specifications of the server are as follows:

• CPU: Intel(R) Xeon(R) CPU E5-2640 v4 @ 2.40GHz
• GPU: GeForce RTX 2080 8GB
• RAM: 256GB
• OS: Ubuntu SMP 18.04.4 LTS (Bionic Beaver)

Python has been the programming language used in this thesis. Python is installed using Conda, an environment management system that handles packages and their dependencies. All calculations are run through MobaXterm on the server presented above. The raw data comes in the FASTA file format, and when smaller feature datasets are created and stored, the CSV and NPY file formats are used. NPY is used for larger datasets since it demands much less memory and time when being stored and loaded. The Numpy library is used to make the data processing efficient.

The neural networks are built in the Keras library. Keras is built on the Tensorflow library and operates with Tensorflow GPU. The neural networks can therefore use both GPU and CPU, which is an advantage in terms of processing speed. The models created within Keras are saved in a file format called HDF5.

In the process of creating the K-Nearest Neighbour, Support Vector Machine, Random Forest, Gradient Boosting and Principal Component Analysis, the Scikit-Learn library is used. Unfortunately, Scikit-Learn does its calculations on the CPU and does not support GPU. The prediction results from the models are stored with Pandas, a Python data frame library built on Numpy, and are then exported to a CSV file. The trained models are stored in a Joblib format. It is worth noting that Scikit-Learn offers the possibility to run KNN and RF in parallel on several cores; to not use up all the CPU capacity on the server, they are limited to a maximum of ten cores each.

Graph tools in Microsoft Excel are used to visualize the data.

3.2.1 Sequencing Equipment

The Illumina sequencing machine that FOI in Umeå has in its lab is a MiSeq Series[20]. It has a run time of approximately 27.5 hours for a maximum of 25 million reads with a length of 150 base pairs. If two sequencing runs are done sequentially, classification of the reads from the first run should be completed before the second sequencing run is completed.

The maximum prediction time for 25 million reads when using this sequencing machine should therefore be under around 27.5 hours. This corresponds to around 250 reads/second. Since the Testset consists of 400 thousand reads, the prediction time of our models can be multiplied by 62.5 to simulate the time it would take to predict 25 million reads, under the assumption that time scales linearly with the number of reads.

A top-of-the-line benchtop sequencing machine today is the Illumina NextSeq 550 Series[20]. It can sequence 400 million reads in approximately 30 hours, so the classification time for 400 million reads should be no more than around 30 hours. This corresponds to around 3700 reads/second. The prediction time for the Testset can be multiplied by a factor of 1000 to simulate the time it would take to predict 400 million reads, under the assumption that it scales linearly.
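The throughput requirements above follow from simple arithmetic, sketched here:

```python
# Worked arithmetic behind the throughput requirements stated above.
def required_reads_per_second(n_reads: int, run_time_hours: float) -> float:
    """Reads per second needed to keep up with a sequencing run."""
    return n_reads / (run_time_hours * 3600)

miseq = required_reads_per_second(25_000_000, 27.5)     # ~252 reads/second
nextseq = required_reads_per_second(400_000_000, 30.0)  # ~3704 reads/second

# Scaling factors from the 400 thousand read Testset, assuming linearity:
print(25_000_000 / 400_000, 400_000_000 / 400_000)  # 62.5 1000.0
print(round(miseq), round(nextseq))                 # 253 3704
```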

3.3 Feature extraction

Since the machine learning models are built differently, the required input format varies between models. The data in its raw format comes as read sequences. As presented in the background, Section 1.1, the information within the reads is combinations of the categorical variables A, T, C and G.

Different approaches are used when transforming the reads to a more manageable format. One approach, which preserves the order of the nucleotide sequence, is to simply translate each letter to an integer: A, T, C and G are converted to 0, 1, 2 and 3. This representation is used in the neural networks, since they can process complex data with non-linear structure.
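A minimal sketch of this encoding (helper names are ours):

```python
# Minimal sketch of the integer encoding fed to the neural networks.
ENCODING = {"A": 0, "T": 1, "C": 2, "G": 3}

def encode_read(read: str) -> list:
    """Translate a read of A/T/C/G into a list of integers 0-3."""
    return [ENCODING[base] for base in read]

print(encode_read("ATCGATCAGC"))  # [0, 1, 2, 3, 0, 1, 2, 0, 3, 2]
```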

Another well-established method in metagenomic classification is to transform the information kept in the reads into so-called k-mers. A k-mer is a subsequence of a DNA sequence: a nucleotide sequence of length $L$ has $L - k + 1$ k-mers of length $k$. The sequence ATTACG therefore has 4 k-mers of length 3 (ATT, TTA, TAC, ACG) and 3 k-mers of length 4 (ATTA, TTAC, TACG). The number of possible unique k-mers for nucleotide sequences is $4^k$, since there are four possible characters. The drawback of this method is that the order of the nucleotide sequence is lost. However, it is still a very robust way of representing the data, since many of the characteristics of a genome are determined by the frequency of specific patterns. For this experiment, a representation of k-mers ranging from $k = 1$ to $k = 6$ is used. This range could be expanded to cover more information on the structure of the reads, but there is a trade-off between the time it takes to process and generate k-mers and the extra information gained. Since the number of k-mers grows as $4^k$, using $k$ from 1 to 6 results in $\sum_{k=1}^{6} 4^k = 5460$ k-mers, while adding one more $k$ results in $\sum_{k=1}^{7} 4^k = 21844$ k-mers. To reduce the number of features to generate, the k-mers are limited to a maximum length of 6.

The different k-mers found in a read are represented in an array of size 1x5460, where each element corresponds to a specific k-mer. Each time a k-mer is found within the read, the element corresponding to that k-mer is increased by one. The translation of a read to k-mer counts could look like this:

| A | T | C | G | AA | … | GGGGGT | GGGGGC | GGGGGG |
|---|---|---|---|----|---|--------|--------|--------|
| 45 | 42 | 30 | 33 | 12 | … | 0 | 0 | 1 |
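A minimal sketch of this k-mer count representation (helper names are ours):

```python
# Minimal sketch of the k-mer count representation described above.
from itertools import product

BASES = "ATCG"

# All possible k-mers for k = 1..6 in a fixed order, 5460 in total.
ALL_KMERS = ["".join(p) for k in range(1, 7) for p in product(BASES, repeat=k)]
KMER_INDEX = {kmer: i for i, kmer in enumerate(ALL_KMERS)}
assert len(ALL_KMERS) == 5460

def kmer_counts(read: str) -> list:
    """Count every k-mer of length 1 to 6 occurring in a read."""
    counts = [0] * len(ALL_KMERS)
    for k in range(1, 7):
        for start in range(len(read) - k + 1):
            counts[KMER_INDEX[read[start:start + k]]] += 1
    return counts

print(sum(kmer_counts("A" * 150)[:4]))  # 150 one-mers in a 150 base read
```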

When creating k-mers, the smaller k-mers are represented within the larger k-mers. For example, if the 6-mer AGAATC exists in a read, then the 5-mer AGAAT exists in the read as well. Some of the features are therefore highly correlated. In order to reduce the correlation and further reduce the dimensionality of the problem, a principal component analysis is conducted on the k-mer transformed data to extract the principal components which explain the most variance. All k-mer transformed reads are represented as coordinates in a feature space, where each dimension corresponds to a specific principal component. Each principal component is a different linear combination of the k-mer distribution.

For the Virus/Non-Virus problem, the PCA is applied on the Virus/Non-Virus Tuningset to extract the most significant features.

For the SARS-CoV-2 problem, the PCA is applied on the SARS-CoV-2/Non-SARS-CoV-2 Tuningset to extract the most significant features.

This format of the data works well with RF, KNN, SVM and GB, since it can be represented as points in a continuous feature space.
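A sketch of this reduction using Scikit-Learn's PCA, with stand-in data in place of the real k-mer matrix:

```python
# Sketch of the PCA reduction with Scikit-Learn; stand-in data replaces
# the real (n_reads, 5460) k-mer count matrix from the Tuningset.
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)
X_kmers = rng.poisson(1.0, size=(1000, 5460)).astype(float)

pca = PCA(n_components=50)            # 50 components for Virus/Non-Virus
X_reduced = pca.fit_transform(X_kmers)
print(X_reduced.shape)                # (1000, 50)
```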

To summarize the data transformation: the first translation of the sequenced read converts A, T, C, G to 0, 1, 2, 3 correspondingly; this is used as input for the neural networks CNN and RNN. The second method fits principal components using the distribution of k-mers; this transformation is compatible with the machine learning models RF, KNN, SVM and GB. An example of the two read transformations is presented in Table 2.

Table 2: Different ways the sequence read is transformed before being fed into the machine learning models as input.

| Sequence format | Example | Input to model |
|---|---|---|
| Sequenced read in raw format | ATCGATCAGCTACGTTC ... | None |
| ATCG transformed to 0123 | 01230120321023112 ... | CNN, RNN |
| K-mer and PC transformation | [ -13.7, 11.8, 4.2, -0.9, -4.6, ... ] | RF, KNN, SVM, GB |

3.4 Hyperparameter tuning

The general approach to tuning hyperparameters in this thesis is to create a sparse grid of possible combinations to establish where an optimum could be found. The grid is then refined to hopefully include the hyperparameter settings that cover the potential optimum.

Tuning in general is sensitive to the number of hyperparameters to be tuned, since the number of combinations grows exponentially. One additional hyperparameter to tune can drastically increase the tuning time. This makes it impractical to fine-tune every hyperparameter in every model. Hence, the hyperparameters believed to have the largest impact on the model are tuned, while the other hyperparameters are set to fixed values.
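A sketch of this coarse-grid strategy with Scikit-Learn, using a Random Forest and illustrative grid values on stand-in data; note that the thesis uses a single 85/15 split, while GridSearchCV's k-fold cross-validation is used here only for brevity:

```python
# Sketch of the coarse grid strategy described above; grid values are
# illustrative and stand-in data replaces the Tuningset sample.
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV

rng = np.random.default_rng(0)
X_sample = rng.normal(size=(1000, 50))    # stand-in for 50 PCA features
y_sample = rng.integers(0, 2, size=1000)  # stand-in for class labels

coarse_grid = {"n_estimators": [50, 200, 500], "max_depth": [5, 20, 50]}
search = GridSearchCV(RandomForestClassifier(), coarse_grid, cv=3)
search.fit(X_sample, y_sample)

# A second, denser grid would then be centered on search.best_params_.
print(search.best_params_)
```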

3.5 Classification of Virus/Non-Virus

In this section the binary classification models suggested in Section 3.1 are presented, tuned and evaluated, both individually and in an ensemble solution, for the Virus/Non-Virus data.

As presented in Section 3.2, the neural networks have the advantage of using the GPU for their calculations, while the other machine learning methods use the CPU. The processes that run on both GPU and CPU will be faster than those that only run on the CPU. The neural networks can therefore use the full Tuningset when tuning the hyperparameters, while the other models are limited to sampling smaller sets from the Tuningset due to the increased training time. When tuning the models, the data used is randomly divided into 85% training data and 15% validation data.

3.5.1 Convolutional Neural Network

A Convolutional Neural Network can be constructed in many different ways. In general, if more layers are added, more hyperparameters need to be tuned. More layers can also imply a slower prediction time. Taking previous work regarding Convolutional Neural Networks into account[1][9], combinations of convolutional and maxpooling layers have been shown to generate high prediction results. The CNN architecture in this work is therefore constructed in a similar way. The architecture is the same for all CNNs created and consists of the following layers:

• 1-dimensional Convolutional Layer with ReLU activation

• Maxpooling with pool size 2

• 1-dimensional Convolutional Layer with ReLU activation

• Maxpooling with pool size 2

• Flatten

• Dropout with droprate 0.5

• Fully Connected Layer with softmax activation.

When training the Convolutional Neural Networks, some of the hyperparameters are set to fixed values to reduce the tuning time of the models. One additional CNN is constructed with an embedding layer, to examine whether there is a performance gain from converting the input before it is fed to the first convolutional layer. The tuned hyperparameters are the embedding layer size (for the model with an embedding layer) and the number of filters and kernel sizes in both convolutional layers. The activation function in the convolutional layers is the rectified linear unit, ReLU. The flatten layer is used to reduce the dimension so that the fully connected layer can output the right dimension out of the network; this dimension reduction is needed due to the use of multiple convolutional layers. The fully connected layer operates with a softmax activation function. When training the network, a dropout of 0.5 is applied to reduce overfitting. The optimizer used when training the network is stochastic gradient descent.
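A sketch of this architecture in Keras is given below; the filter counts, kernel sizes and embedding size are set to the tuned values reported later in Section 4.2.1, and the Embedding layer is the optional first layer discussed above.

```python
# Sketch of the CNN architecture listed above, in Keras. Hyperparameter
# values are the tuned ones from Section 4.2.1; the Embedding layer is
# optional and omitted in the model without embedding.
from tensorflow.keras import Sequential, layers, optimizers

model = Sequential([
    layers.Embedding(input_dim=4, output_dim=100, input_length=150),
    layers.Conv1D(filters=350, kernel_size=30, activation="relu"),
    layers.MaxPooling1D(pool_size=2),
    layers.Conv1D(filters=175, kernel_size=45, activation="relu"),
    layers.MaxPooling1D(pool_size=2),
    layers.Flatten(),
    layers.Dropout(0.5),
    layers.Dense(2, activation="softmax"),
])
model.compile(optimizer=optimizers.SGD(),
              loss="sparse_categorical_crossentropy",  # labels are 0/1
              metrics=["accuracy"])
model.summary()
```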

To illustrate the CNN, a simplified flowchart is presented in Figure 3. The process of tuning the hyperparameters on the Tuningset can be seen in Appendix B.1. All CNNs created rely on the Keras default settings for all hyperparameters not mentioned[21].

Figure 3: Simplified flowchart of the Convolutional Neural Network structure used.


After the tuning is completed and the optimal hyperparameter combination is found, two models are trained over 100 epochs on the Trainingset, meaning that each model iterates over the training data 100 times. One of the models is trained with embedding and one without. A comparison of their performance on the Testset is done to see whether embedding should be included or not. The model with the best performance is considered the final CNN and is used in the ensemble solution.

3.5.2 Recurrent Neural Network

The structure of a Recurrent Neural Network can vary depending on its layers. Previous research[19] has shown that the use of Long Short-Term Memory cells yields satisfying prediction results on similar classification problems. The architecture used in this thesis is the same for all RNNs and is constructed using two Long Short-Term Memory layers. The RNN architecture consists of the following layers:

• LSTM with tanh and sigmoid activation

• LSTM with tanh and sigmoid activation

• Fully Connected Layer with softmax activation.

Some of the hyperparameters are set to fixed values to reduce the tuning time of the RNN models. Two RNN models are constructed, one with an embedding layer and one without. The hyperparameters to tune for the RNN are the embedding layer size (for the model with an embedding layer) and the LSTM output space for both LSTM layers. The fully connected layer operates with a softmax activation function. The optimizer used is stochastic gradient descent.
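A sketch of this architecture in Keras is given below; the embedding size and LSTM output sizes are set to the tuned values reported later in Section 4.2.2, and the Embedding layer is again the optional first layer.

```python
# Sketch of the two-layer LSTM architecture described above, in Keras.
# The LSTM layers use the default tanh and sigmoid activations.
from tensorflow.keras import Sequential, layers, optimizers

model = Sequential([
    layers.Embedding(input_dim=4, output_dim=450, input_length=150),
    layers.LSTM(27, return_sequences=True),  # first LSTM feeds the second
    layers.LSTM(50),
    layers.Dense(2, activation="softmax"),
])
model.compile(optimizer=optimizers.SGD(),
              loss="sparse_categorical_crossentropy",
              metrics=["accuracy"])
```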

To illustrate the RNN with two LSTM layers, a simplified flowchart of the model architecture is presented in Figure 4. The process of tuning the hyperparameters can be seen in Appendix B.2. The RNNs created rely on the Keras default settings for all hyperparameters not mentioned[21].

Figure 4: Simplified flowchart of the Recurrent Neural Network structure used.

After the tuning is completed and the optimal hyperparameter combination is found, two models are trained over 100 epochs on the Trainingset, one with embedding and one without. A comparison of their performance on the Testset is done to see whether embedding should be included or not. The model with the best performance is considered the final RNN and is used in the ensemble solution.


3.5.3 Support Vector Machine

Using three sample sets of 80 thousand reads generated from the Tuningset, a linear and a radial basis function (RBF) kernel Support Vector Machine are compared in terms of accuracy. The values of the cost parameter used in the comparison are 0.1, 1, 10 and 100.

The kernel found to fit the data best is used in the tuning of the Support Vector Machine model. The hyperparameter tuned is the cost parameter, C. The rest of the hyperparameters are set to the default settings in the Scikit-learn module for SVM[31]. The tuning is done on sample datasets generated from the Tuningset. The tuning grid and the sizes of the sample datasets can be seen in Table 17 in Appendix B.3.

When the optimal hyperparameter is found, a Support Vector Machine model is fit on the Trainingset and evaluated on the Testset. The final Scikit-learn model, with all the default settings, is presented in Appendix B.3.
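A sketch of the kernel comparison with Scikit-learn, on stand-in data in place of the Tuningset samples:

```python
# Sketch of the linear vs RBF kernel comparison; stand-in data replaces
# the 80 thousand read samples drawn from the Tuningset.
import numpy as np
from sklearn.svm import SVC

rng = np.random.default_rng(0)
X_train = rng.normal(size=(500, 50))    # stand-in for 50 PCA features
y_train = rng.integers(0, 2, size=500)
X_val = rng.normal(size=(100, 50))
y_val = rng.integers(0, 2, size=100)

for kernel in ("linear", "rbf"):
    for C in (0.1, 1, 10, 100):
        clf = SVC(kernel=kernel, C=C).fit(X_train, y_train)
        print(kernel, C, clf.score(X_val, y_val))
```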

3.5.4 Random Forest

When tuning the Random Forest model, sample datasets generated from the Tuningset are used. The hyperparameters tuned are the number of trees, T, and the maximum depth of the trees, D. The rest of the hyperparameters are set to the default settings in the Scikit-learn module for RF[32]. The tuning grid and the sizes of the sample datasets can be seen in Table 18 in Appendix B.4.

When the optimal hyperparameters are found, a Random Forest model is fit on the Trainingset and evaluated on the Testset. The final Scikit-learn model, with all the default settings, is presented in Appendix B.4.

3.5.5 K-Nearest Neighbour

The only hyperparameter tuned for K-Nearest Neighbour is the number of neighbours, K. The rest of the hyperparameters are set to the default settings in the Scikit-learn module for KNN[34]. The tuning is done on sample datasets generated from the Tuningset. The tuning grid and the sizes of the sample datasets can be seen in Table 19 in Appendix B.5.

When the optimal hyperparameter is found, a K-Nearest Neighbour model is fit on the Trainingset and evaluated on the Testset. The final Scikit-learn model, with all the default settings, is presented in Appendix B.5.

3.5.6 Gradient Boosting

The hyperparameters tuned for the Gradient Boosting model are the number of estimators, E, and the learning rate, L. The rest of the hyperparameters are set to the default settings in the Scikit-learn module for GB[33]. The tuning is done on sample datasets generated from the Tuningset. The tuning grid and the sizes of the sample datasets can be seen in Table 20 in Appendix B.6.

When the optimal hyperparameters are found, a Gradient Boosting model is fit on the Trainingset and evaluated on the Testset. The final Scikit-learn model, with all the default settings, is presented in Appendix B.6.
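A sketch of the three remaining Scikit-learn models, instantiated with the tuned hyperparameter values reported later in Section 4.2; everything else is left at the library defaults, and n_jobs reflects the ten-core cap mentioned in Section 3.2:

```python
# The three remaining Scikit-learn models with the tuned hyperparameters;
# GradientBoostingClassifier has no parallel option, so no n_jobs there.
from sklearn.ensemble import GradientBoostingClassifier, RandomForestClassifier
from sklearn.neighbors import KNeighborsClassifier

rf = RandomForestClassifier(n_estimators=2000, max_depth=50, n_jobs=10)
knn = KNeighborsClassifier(n_neighbors=1, n_jobs=10)
gb = GradientBoostingClassifier(learning_rate=0.2, n_estimators=12000)

# Each model is then fit on the Trainingset and evaluated on the Testset:
# rf.fit(X_train, y_train); rf.score(X_test, y_test)
```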


3.5.7 Model Comparison and Model Ensemble

When comparing and evaluating the different models against each other, all models are used to predict the Testset. The models are evaluated individually and in ensemble models. The individual metrics evaluated are prediction time, accuracy and recall. Two similar ensemble methods are evaluated to decide which ensemble solution is best. Since the approach for both ensemble methods is very similar, the difference in prediction time is negligible; the ensemble methods are therefore only evaluated based on their accuracy.

Both ensemble methods use the predictions of the already trained machine learning models and let them vote on which class each read belongs to. The difference between the two ensemble methods is how the voting is conducted.

One ensemble method uses each individual model's class recalls to vote. Since the class recalls are used as weights when predicting the Testset, they must be estimated on a different dataset; by letting the individual models predict the Tuningset, class recalls for the models can be obtained. When each individual model predicts the Testset, an array containing the prediction results is acquired. This vector contains values of 1 or 0, depending on whether the model predicted the read as virus or non-virus. The votes are then weighted by replacing 1 with the value of the model's virus recall and 0 with the negative of the model's non-virus recall. Different models will thereby have a different amount of say in the ensemble voting, depending on their class recalls. For example, if Model 1 has a recall of 0.7 for viruses and 0.8 for non-viruses, its prediction vector is replaced in the following way:

[1 0 1 1 0 0 0 ...] → [0.7 -0.8 0.7 0.7 -0.8 -0.8 -0.8 ...],

where 1 is replaced by the virus recall and 0 by the negative of the non-virus recall. Introducing Model 2 to the example above, with the vector [0.75 0.75 -0.85 0.75 0.75 -0.85 0.75 ...], the element-wise sum of Model 1 and Model 2 is [1.45 -0.05 -0.15 1.45 -0.05 -1.65 -0.05 ...]. Summed elements greater than or equal to zero are classified as virus, and elements less than zero are classified as non-virus. This results in an ensemble model with prediction vector [1 0 0 1 0 0 0 ...]. The solution can be extended to include all six models.
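A sketch of this voting scheme in NumPy, reproducing the worked example above:

```python
# Sketch of the recall-weighted voting, reproducing the worked example
# above with NumPy.
import numpy as np

def weighted_votes(preds, pos_recall, neg_recall):
    """Map 0/1 predictions to +pos_recall / -neg_recall votes."""
    return np.where(np.asarray(preds) == 1, pos_recall, -neg_recall)

m1 = weighted_votes([1, 0, 1, 1, 0, 0, 0], pos_recall=0.7, neg_recall=0.8)
m2 = np.array([0.75, 0.75, -0.85, 0.75, 0.75, -0.85, 0.75])

ensemble = (m1 + m2 >= 0).astype(int)
print(ensemble)  # [1 0 0 1 0 0 0]
```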

The other ensemble method is very similar to the first one, but instead of using class recalls, accuracy is used.

The prediction time of the ensemble models is of importance. Some of the processes can run in parallel: creating the k-mer features and translating the nucleotides to digits can run in parallel, while the principal components of the k-mers can only be computed after the k-mers are created. All of the model predictions can run in parallel, but the neural networks must wait for the translation into digits to finish, and the other models must wait for the k-mer and principal component extraction to be done. The ensemble model can only be created when all the included models have finished predicting the reads; in other words, the model that takes the longest to predict the reads dictates the time for the ensemble model. The classification time is compared with the sequencing machine speeds described in Section 3.2.1, to make sure that no bottlenecks are introduced that slow down the overall process of sequencing and classification.

To get an overview of how similarly the individual models predict, each model is compared with every other model. By taking the accuracy score between two models, the percentage of identical predictions can be extracted.


3.6 Classification SARS-CoV-2/Non-SARS-CoV-2

The selection of models to include in the SARS-CoV-2 classifier is based on the models that were found to have a feasible prediction time and accuracy in the Virus/Non-Virus classification in Section 3.5.7.

The neural networks tune their hyperparameters on the full Tuningset, while the other models sample smaller sets from the Tuningset. When tuning the models, the datasets are randomly divided into 85% training data and 15% validation data.

3.6.1 Convolutional Neural Network

The Convolutional Neural Network is constructed with a similar approach as in Section 3.5.1. The difference is that only one CNN is constructed, not two; whether the CNN has an embedding layer or not is determined by the result of the Virus/Non-Virus classification. The process of tuning the hyperparameters is presented in Appendix C.1. The CNN created for SARS-CoV-2 relies on the Keras default settings for all hyperparameters not mentioned[21].

3.6.2 Recurrent Neural Network

The Recurrent Neural Network is constructed with a similar approach as in Section 3.5.2. The only difference is that only one RNN is constructed, not two; whether the RNN has an embedding layer or not is determined by the result of the Virus/Non-Virus classification. The process of tuning the hyperparameters is presented in Appendix C.2. The RNN created for SARS-CoV-2 relies on the Keras default settings for all hyperparameters not mentioned[21].

3.6.3 Support Vector Machine

Due to the very slow prediction time found in the Virus/Non-Virus classification, this model is excluded from the SARS-CoV-2/Non-SARS-CoV-2 classification.

3.6.4 Random Forest

The procedure is the same as in Section 3.5.4. The tuning grid and sizes of the sample datasets can be seen in Table 24 in Appendix C.3. The final settings of the Scikit-learn model can be seen in Appendix C.3.

3.6.5 K-Nearest Neighbour

The procedure is the same as in Section 3.5.5. The tuning grid and sizes of the sample datasets can be seen in Table 25 in Appendix C.4. The final settings of the Scikit-learn model can be seen in Appendix C.4.

3.6.6 Gradient Boosting

The procedure is the same as in Section 3.5.6. The tuning grid and sizes of the sample datasets can be seen in Table 26 in Appendix C.5. The final settings of the Scikit-learn model can be seen in Appendix C.5.


3.6.7 Model Comparison and Model Ensemble

The model comparison and model ensemble are done in the same manner as in Section 3.5.7.

3.6.8 Real Data Application

To test whether samples contain SARS-CoV-2, a statistical test is constructed.

For a given sample containing $n$ reads, $X$ reads are predicted positive for SARS-CoV-2. $X$ contains both true positives and false positives, $X = X_{TP} + X_{FP}$. The probability of obtaining a false positive is $P_{FP}$ and the probability of obtaining a true positive is $P_{TP}$. With these probabilities, the true positives follow the distribution

$$X_{TP} \sim B(nQ,\ P_{TP})$$

and the false positives follow the distribution

$$X_{FP} \sim B(n(1-Q),\ P_{FP}),$$

where $Q$ represents the true fraction of positive SARS-CoV-2 reads in the sample.

The statistical test is constructed to test the assumption that the proportion of true SARS-CoV-2 reads, $Q$, is 0. Hence the null hypothesis is

$$H_0:\ Q = 0,$$

with the corresponding alternative hypothesis that the sample contains SARS-CoV-2 reads,

$$H_1:\ Q > 0.$$

Under the null hypothesis $Q = 0$, no true positives $X_{TP}$ exist, and $X$ follows the binomial distribution of positive reads,

$$X \sim B(n,\ P_{FP}).$$

Since the sample sizes are very large, the binomial distribution is approximated by a normal distribution,

$$X \sim N\left(nP_{FP},\ nP_{FP}(1-P_{FP})\right),$$

with expected value $\mu(X) = nP_{FP}$ and standard deviation $\sigma(X) = \sqrt{nP_{FP}(1-P_{FP})}$.

The probability of obtaining a false positive, $P_{FP}$, is estimated by predicting a known negative sample, in our case the Non-SARS-CoV-2 part of the Tuningset. Using this estimated probability, a test is constructed to validate the hypotheses. Given $x$ observed positive reads, the null hypothesis is rejected at a 1% significance level if

$$P\left(X \geq x \mid X \sim B(n, P_{FP})\right) \approx P\left(Z \geq \frac{x - nP_{FP}}{\sqrt{nP_{FP}(1-P_{FP})}}\right) \leq 0.01.$$
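A sketch of this test in Python, where the observed count x is hypothetical and the false-positive probability is the estimate reported in Section 4.3.7:

```python
# Sketch of the one-sided test above; norm.sf(z) gives P(Z >= z).
import math
from scipy.stats import norm

def rejects_h0(x: int, n: int, p_fp: float, alpha: float = 0.01) -> bool:
    """Reject H0: Q = 0 if the normal-approximated p-value is <= alpha."""
    mu = n * p_fp
    sigma = math.sqrt(n * p_fp * (1 - p_fp))
    return norm.sf((x - mu) / sigma) <= alpha

# Hypothetical count x; P_FP is the estimate from Section 4.3.7 (19/400000).
print(rejects_h0(x=500, n=320896, p_fp=19 / 400000))  # True
```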


4 Results

4.1 Principal Component Analysis

The principal component analysis on the Virus/Non-Virus Tuningset, which can be seen in Figure 5, shows that there is an elbow point around 79% of the variance explained. This corresponds to the first 50 principal components. To keep the number of dimensions down, the first 50 principal components are used.

The variance explained for the SARS-CoV-2/Non-SARS-CoV-2 Tuningset does not have an obvious elbow point, as seen in Figure 5. To keep the number of dimensions down, the number of principal components used is set to 60, corresponding to around 80% of the variance explained.

Figure 5: The cumulative sum of the variance explained by the principal components for the SARS-CoV-2/Non-SARS-CoV-2 data and the Virus/Non-Virus data. The graph shows the principal components of all 5460 different k-mers. The smaller graph with grey background is a zoomed-in version of the small transparent area to the left.

4.2 Virus/Non-Virus

In this section, the optimal hyperparameters and performance of the different models are presented. For more in-depth tuning of hyperparameters, see Appendix B.


4.2.1 Convolutional Neural Network

The resulting hyperparameters for the CNN without an embedding layer are a first convolutional layer with 290 filters and a kernel size of 7, and a second convolutional layer with 225 filters and a kernel size of 35. Training the CNN without embedding on the Trainingset over 100 epochs took 5 hours and 30 minutes. The model yields an accuracy of 85.252%, a virus recall of 85.187% and a non-virus recall of 85.321%, at a speed of 9901 predicted reads per second.

The resulting hyperparameters for the CNN with an embedding layer are an embedding layer size of 100, a first convolutional layer with 350 filters and a kernel size of 30, and a second convolutional layer with 175 filters and a kernel size of 45. Training the CNN with embedding on the Trainingset over 100 epochs took 10 hours and 8 minutes. The model including embedding yields an accuracy of 87.118%, a virus recall of 86.764% and a non-virus recall of 87.472%. The prediction speed of this model is 8187 reads per second.

A comparison of how accuracy changes over the epochs is shown in Figure 6. The CNN with embedding converges to approximately 87% accuracy and the CNN without to approximately 85%. Of the two CNN models fitted, better performance is given by the model including the embedding layer, since the overall accuracy and recall increase with a minimal change in prediction time. The CNN with embedding also reaches higher accuracy over fewer epochs, which reduces the training time for the model. The CNN with embedding is considered the best model and is used in the ensemble solution.

Figure 6: Accuracy development over 100 epochs when training Convolutional Neural Networks on the Virus/Non-Virus Trainingset.

4.2.2 Recurrent Neural Network

After tuning the hyperparameters for the RNN without an embedding layer, the best performance is obtained with a first LSTM layer with 5 memory cells and a second LSTM layer with 110 memory cells. Training the RNN on the Trainingset over 100 epochs took 17 hours and 58 minutes. This yields an accuracy of 80.360%, a virus recall of 80.109% and a non-virus recall of 80.613%. The prediction speed is 2188 reads per second.


The resulting hyperparameters for the RNN with embedding are an embedding layer size of 450, a first LSTM layer with 27 memory cells and a second LSTM layer with 50 memory cells. Training the RNN with embedding on the Trainingset over 100 epochs took 23 hours and 57 minutes. This gives an accuracy of 80.599%, a virus recall of 78.860% and a non-virus recall of 82.338%. Prediction is done at a speed of 2532 reads per second.

A comparison of how the models' accuracy changes over the epochs is presented in Figure 7. Both RNN models converge around 80% accuracy after 100 epochs. The RNN with embedding is considered better because of its slightly higher accuracy; it also reaches higher accuracy over fewer epochs, which can be useful when training new models. The difference in training time is not big enough to exclude the embedding. The RNN with an embedding layer is considered the best and is used in the ensemble solution.

Figure 7: Accuracy development over 100 epochs when training Recurrent Neural Networks on the Virus/Non-Virus Trainingset.

4.2.3 Support Vector Machine

The comparison between a Support Vector Machine using a linear kernel and one using a Radial Basis Function kernel is shown in Figure 8. The accuracy is much lower for the linear kernel for all cost parameters; the data does not seem to be linearly separable in a satisfying way. The RBF kernel is therefore used in the final SVM model.

From the tuning result, the optimal hyperparameter for C is 60. Training an SVM model with the settings presented in Appendix B.3 on the Trainingset took approximately 330 hours. The accuracy measures from predicting the Testset are presented in Table 3.


Figure 8: Comparison of accuracy for a Support Vector Machine when using a linear kernel and an RBF kernel for different settings of the cost parameter C.

4.2.4 Random Forest

From the tuning results, the optimal hyperparameter settings for the Random Forest are 2000 trees with a maximum depth of 50. Training an RF model with the settings presented in Appendix B.4 took approximately 45 minutes. The accuracy measures from predicting the Testset are presented in Table 3.

4.2.5 Gradient Boosting

From the tuning results, the optimal hyperparameter settings for the Gradient Boosting model are a learning rate of 0.2 with 12000 estimators. Training a GB model with the settings presented in Appendix B.6 took approximately 95 hours. The accuracy measures from predicting the Testset are presented in Table 3.

4.2.6 K-Nearest Neighbour

From the tuning results, the optimal hyperparameter setting for the KNN is 1 neighbour. Since KNN does not need to be trained, it took approximately 12 seconds to fit the model on the Trainingset with the settings presented in Appendix B.5. The accuracy measures from predicting the Testset are presented in Table 3. It is worth noting that even though 10 cores are allocated for the KNN, it sometimes uses only 5 cores in our experiment, which slows down the prediction time.

4.2.7 Model comparison

The individual model performances on the Virus/Non-Virus data are presented in Table 3. All models have a fairly high accuracy, with a mean of 84.03% correct predictions, but the number of predicted reads per second varies considerably more among the models.


Table 3: Evaluation metrics for the tuned models. The predicted reads/second exclude the time for feature extraction.

| Model | Accuracy | Recall Virus | Recall Non-Virus | Predicted reads/second |
|---|---|---|---|---|
| Convolutional Neural Network | 0.87118 | 0.86764 | 0.87472 | 8187 |
| Recurrent Neural Network | 0.80599 | 0.78860 | 0.82338 | 2532 |
| Support Vector Machine | 0.84872 | 0.86268 | 0.83476 | 34 |
| Random Forest | 0.85672 | 0.85161 | 0.86184 | 7895 |
| Gradient Boosting | 0.81803 | 0.82601 | 0.81006 | 3491 |
| K-Nearest Neighbour | 0.84616 | 0.90549 | 0.78684 | 195 |

Comparing the models' similarity in prediction on the Testset, seen in Table 4, the models least similar in prediction are KNN and RNN, which agree on 77.4% of the predictions. The highest similarity is between RF and GB, which predict equally on 88.3% of the data.

Table 4: Similarity between model predictions when classifying the Virus/Non-Virus Testset.

|  | CNN | RNN | SVM | RF | KNN | GB |
|---|---|---|---|---|---|---|
| CNN | 1 | 0.82881 | 0.84542 | 0.86716 | 0.81635 | 0.83267 |
| RNN | 0.82881 | 1 | 0.84549 | 0.84912 | 0.77405 | 0.85698 |
| SVM | 0.84542 | 0.84549 | 1 | 0.87955 | 0.82079 | 0.87830 |
| RF | 0.86716 | 0.84912 | 0.87955 | 1 | 0.83121 | 0.88326 |
| KNN | 0.81635 | 0.77405 | 0.82079 | 0.83121 | 1 | 0.79539 |
| GB | 0.83267 | 0.85698 | 0.87830 | 0.88326 | 0.79539 | 1 |

The ensemble evaluation tests all possible combinations of models. One ensemble is created using recall and one using accuracy, as described in Section 3.5.7. Of these two ensemble methods, the best performance is observed when using accuracy in the model voting. Out of the 64 possible model combinations tested, there is only one combination (RNN, RF, KNN, GB) where the accuracy is better with the recall ensemble, and the difference in accuracy for this combination is as low as 0.034%. The results for all ensemble combinations are presented in Appendix D.

Transforming the Testset for the neural networks took 17 seconds. Creating k-mers and principal components took 141 and 173 seconds, respectively.

The highest accuracy is achieved when assembling CNN, RNN, SVM and KNN. This combination yields an accuracy of 89.02%. The time to predict the Testset of 400 thousand reads for this combination is as high as 3 hours and 17 minutes, which translates to 33 reads per second. This long prediction time arises because SVM is the slowest of all presented models when it comes to predicting, as can also be seen in Table 3.

The computational time cannot exceed the sequencing machine run times presented in Section 3.2.1. The machine FOI has today, a MiSeq, gives a time limit of 1584 seconds to classify 400 thousand reads. If a NextSeq 550 is used, the time limit is 108 seconds for 400 thousand reads. This limit is shown as a green vertical line in Figure 9.

Since ensemble combinations containing SVM and KNN exceed the time limit given by the MiSeq equipment, ensemble models including them are not presented in the comparison between accuracy and time in Figure 9. When focusing on the green line at 108 seconds, only the CNN by itself has a fast enough prediction time to meet the requirements when sequencing with the NextSeq 550.


Figure 9: Prediction accuracy versus computational time for ensemble combinations on the Virus/Non-Virus data. The green vertical line at time 108 corresponds to the limit when using NextSeq 550 sequencing equipment.

Given the limitations of the MiSeq, combining CNN, RNN, RF and GB gives the best overall performance when classifying the Virus/Non-Virus Testset, with an accuracy of 87.325% correct classifications, a virus recall of 87.542% and a non-virus recall of 87.108%. The prediction speed is 900 predicted reads per second.

If the NextSeq 550 is used instead, the only acceptable model given the limitations, the CNN, has an accuracy of 87.118%, a virus recall of 86.764% and a non-virus recall of 87.472%. This result is obtained with a prediction speed of 6024 reads per second.

4.3 SARS-CoV-2/Non-SARS-CoV-2

The results for the models constructed for the SARS-CoV-2 data are presented in this section. For a more in-depth description of the tuning, see Appendix C.

4.3.1 Convolutional Neural Network

Given the results from using CNNs on the Virus/Non-Virus data, the CNN used for SARS-CoV-2 classification is constructed with an embedding layer. The optimal hyperparameters after tuning are an embedding layer of size 150, a first convolutional layer with 300 filters, a second convolutional layer with 150 filters, and kernel sizes of 30. The accuracy measures from predicting the Testset are presented in Table 5.


4.3.2 Recurrent Neural Network

As in the CNN case, the RNN created with embedding on the Virus/Non-Virus data outperformed the model without. Hence, the RNN created to classify SARS-CoV-2 uses an embedding layer. The optimal hyperparameter settings are an embedding layer size of 500, a first memory cell size of 40 and a second memory cell size of 65. The accuracy measures from predicting the Testset are presented in Table 5.

4.3.3 Random Forest

From the tuning results, the optimal hyperparameter settings for the Random Forest are 1000 trees with a maximum depth of 80. Training an RF model with the settings presented in Appendix C.3 took approximately 30 minutes. The accuracy measures from predicting the Testset are presented in Table 5.

4.3.4 Gradient Boosting

From the tuning results, the optimal hyperparameter settings for the Gradient Boosting model are a learning rate of 0.35 with 4000 estimators. Training a GB model with the settings presented in Appendix C.5 took approximately 28 hours. The accuracy measures from predicting the Testset are presented in Table 5.

4.3.5 K-Nearest Neighbour

From the tuning results, the optimal hyperparameter setting for the KNN is 2 neighbours. Since KNN does not need to be trained, it took approximately 14 seconds to fit the model on the Trainingset with the settings presented in Appendix C.4. The accuracy measures from predicting the Testset are presented in Table 5. It is worth noting that even though 10 cores are allocated for the KNN, it sometimes uses only 5 cores in our experiment, which slows down the prediction time.

4.3.6 Model comparison

In general, all models perform very well when classifying reads as SARS-CoV-2 or Non-SARS-CoV-2. As seen in Table 5, none of the trained models has an accuracy under 99.9%. The lowest accuracy is obtained from the RNN, with 99.909%; on the other side of the span, the best accuracy is obtained from the CNN, with 99.995%. Another thing to note is that the KNN has a recall of 100% for SARS-CoV-2. The most time-consuming model when predicting is the KNN.

When comparing the model predictions on the Testset, seen in Table 6, all models show a high similarity. This follows naturally from their high individual accuracies. The most similar models are the CNN and GB, with a prediction similarity of 99.938%, while the least similar are the RNN and KNN, with a similarity of 99.830%.


Table 5: Evaluation metrics from the tuned models. The predicted reads/second exclude the time for feature extraction.

Model                          Accuracy   Recall       Recall           Predicted
                                          SARS-CoV-2   Non-SARS-CoV-2   reads/second
Convolutional Neural Network   0.99995    0.99999      0.99992          8423
Recurrent Neural Network       0.99909    0.99908      0.99911          2544
Random Forest                  0.99937    0.99914      0.99961          30641
Gradient Boosting              0.99941    0.99972      0.99911          13345
K-Nearest Neighbour            0.99918    1            0.99836          325

Table 6: Similarity between model predictions when classifying the SARS-CoV-2/Non-SARS-CoV-2 Testset.

        CNN       RNN       RF        KNN       GB
CNN     1         0.99905   0.99934   0.99914   0.99938
RNN     0.99905   1         0.99857   0.99830   0.99857
RF      0.99934   0.99857   1         0.99858   0.99906
KNN     0.99914   0.99830   0.99858   1         0.99866
GB      0.99938   0.99857   0.99906   0.99866   1

When testing all ensemble combinations, with both accuracy and recall used as constants in the voting, which can be seen in Appendix E, the ensemble method using accuracy as constants performs better. Only one combination, RNN-RF-KNN-GB, gains accuracy when recall is used instead, and the increase is only 0.0035%.
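A minimal sketch of how majority voting with accuracy as constants could be implemented is given below; the predictions and weights are placeholders for illustration, and the exact voting scheme used in the thesis may differ in detail.

```python
# Hedged sketch of accuracy-weighted majority voting across models.
# Predictions and weights below are placeholders for illustration.
import numpy as np

def weighted_vote(predictions, weights):
    """predictions: (n_models, n_reads) array of 0/1 class labels.
    weights: one constant per model, e.g. its Testset accuracy.
    Returns, for each read, the class with the largest summed weight."""
    predictions = np.asarray(predictions, dtype=float)
    weights = np.asarray(weights)[:, None]
    score_class1 = (predictions * weights).sum(axis=0)
    score_class0 = ((1 - predictions) * weights).sum(axis=0)
    return (score_class1 > score_class0).astype(int)

# Example: three models vote on four reads.
preds = [[1, 0, 1, 1],   # e.g. CNN
         [1, 0, 0, 1],   # e.g. RNN
         [0, 0, 1, 1]]   # e.g. RF
acc = [0.99995, 0.99909, 0.99937]  # accuracies as voting constants
print(weighted_vote(preds, acc))   # -> [1 0 1 1]
```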

Transforming the Testset for the neural networks took 17 seconds. Creating the k-mers and the principal components took 141 and 174 seconds, respectively.
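For reference, the k-mer feature extraction timed above can be sketched as follows; the choice k = 4 and the use of raw counts are assumptions for illustration, and the actual feature definition used in the thesis may differ.

```python
# Hedged sketch of k-mer counting for one read; k and the use of raw
# counts (rather than frequencies) are illustrative assumptions.
from itertools import product

def kmer_counts(read, k=4):
    """Count every possible k-mer over {A, C, G, T} in a DNA read and
    return the counts as a fixed-order feature vector."""
    kmers = ["".join(p) for p in product("ACGT", repeat=k)]
    index = {kmer: i for i, kmer in enumerate(kmers)}
    counts = [0] * len(kmers)
    for i in range(len(read) - k + 1):
        kmer = read[i:i + k]
        if kmer in index:          # skips k-mers with ambiguous bases
            counts[index[kmer]] += 1
    return counts

print(sum(kmer_counts("ACGTACGTAGCT")))  # 9 overlapping 4-mers
```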

The highest accuracy of all ensemble models is obtained by the combination CNN-RNN-RF-KNN, which yields an accuracy of 99.999%. The prediction time for the Testset is 33.5 minutes, corresponding to 199 reads per second. This slow speed is due to the combination including the KNN, which is the slowest predictor of all trained models on the SARS-CoV-2 data.

If FOI uses its MiSeq sequencing equipment, the time limitation for 400 thousand reads is 1584 seconds. If the NextSeq 550 is used, the time limitation is 108 seconds for the same number of reads. All models with a prediction time under 1584 seconds are presented in Figure 10, where the limitation from using the NextSeq 550 is drawn as a green vertical line at 108 seconds.

The best model to be used at FOI today, with the MiSeq sequencing equipment, is the ensemble solution combining CNN, RNN, RF and GB. This model has an accuracy of 99.998%, a SARS-CoV-2 recall of 99.997% and a Non-SARS-CoV-2 recall of 99.999%. The prediction speed is 1125 reads per second.


Figure 10: Prediction accuracy versus computational time for ensemble combinations on the SARS-CoV-2 data. The green vertical line at 108 seconds corresponds to the time limitation when using the NextSeq 550 sequencing equipment.

If the NextSeq 550 sequencing equipment is used, the only model fast enough is the CNN, which has an accuracy of 99.995%, a SARS-CoV-2 recall of 99.999% and a Non-SARS-CoV-2 recall of 99.992%. The prediction speed for this CNN is 6144 reads per second.

4.3.7 Real data

The best models are now applied to the real data. The ensemble combination of CNN, RNN, RF and GB, used on the known negative part of the Tuningset, results in 19 false positive reads. This means that 19 out of 400 thousand reads are classified as positive when they are in fact negative, which translates to a probability of 4.75e-5 of wrongly classifying a negative read as positive.

Given the number of SARS-CoV-2 reads found in each sample using the ensemble classifier, which can be seen in Table 7, the only sample set where the null hypothesis can be rejected is setF. The rejection of the null hypothesis means that a sample can be declared SARS-CoV-2 positive. The statistical tests thus give a strong indication of whether each sample is SARS-CoV-2 positive or negative.

Table 7: Results from predicting setF, setM and setU using the ensemble solution. Positive and negative reads are presented together with the statistical test and prediction for every dataset.

Dataset   Predicted      Predicted   Z score       Statistical Test   SARS-CoV-2 Prediction
          Positive (x)   Negative                  P(Z ≥ (x−µ)/σ)     (Correct label)
setF      5596           315300      1429.46837    0                  Positive (Positive)
setM      9              997052      -5.57424      1                  Negative (Negative)
setU      95             4363368     -7.79812      1                  Negative (Negative)
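The Z scores in Table 7 are consistent with a normal approximation of a binomial null model: under the null hypothesis that a sample is negative, the x predicted positives among the n reads are false positives with rate p, so the expected count is µ = np and the standard deviation is σ = sqrt(np(1−p)). A minimal sketch of the test, under that assumption:

```python
# Hedged sketch of the statistical test: under the null hypothesis the
# x predicted positives among n reads are false positives from a
# Binomial(n, p), approximated by a normal distribution with mean n*p
# and standard deviation sqrt(n*p*(1 - p)).
from math import sqrt
from statistics import NormalDist

def sars_cov_2_test(x, n, p):
    mu = n * p
    sigma = sqrt(n * p * (1 - p))
    z = (x - mu) / sigma
    p_value = 1 - NormalDist().cdf(z)  # P(Z >= (x - mu) / sigma)
    return z, p_value

# setF with the ensemble classifier: 5596 positives out of
# 5596 + 315300 reads, false positive rate 4.75e-5 from the Tuningset.
print(sars_cov_2_test(5596, 5596 + 315300, 4.75e-5))  # Z ~ 1429.47, P ~ 0
```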


When testing the CNN on the known negative part of the Tuningset, 36 out of 400 thousand reads are misclassified as positive. This gives a probability of 9e-5 of wrongly classifying a negative read as positive.

Given the number of SARS-CoV-2 reads found in each sample using the CNN, which can be seen in Table 8, the only sample set where the null hypothesis can be rejected is again setF. As before, the statistical tests give a strong indication of whether each sample is SARS-CoV-2 positive or negative.

Table 8: Results from predicting setF, setM and setU using the CNN. Positive and negative reads are presented together with the statistical test and prediction for every dataset.

Dataset   Predicted      Predicted   Z score       Statistical Test   SARS-CoV-2 Prediction
          Positive (x)   Negative                  P(Z ≥ (x−µ)/σ)     (Correct label)
setF      5611           315285      1038.76014    0                  Positive (Positive)
setM      35             997061      -5.77838      1                  Negative (Negative)
setU      345            4363118     -2.40772      0.99197            Negative (Negative)


5 Discussion

5.1 Taxonomic Level

When comparing the results in Sections 3.5.7 and 4.3.6, it is clear that there is a significant difference in accuracy between the two classification problems. Classifying Virus/Non-Virus gives a considerably lower accuracy than classifying SARS-CoV-2/Non-SARS-CoV-2. This could reflect the difference in taxonomic level between the two problems. The Virus/Non-Virus data contain reads from the taxonomic level of kingdom, which is defined high in the taxonomy, resulting in two very diverse groups of reads. There is also a possibility that some reads are indistinguishable, since both groups might share the same specific sequence of nucleotides.

SARS-CoV-2, on the other hand, is a strain of a virus species and is defined at a low level in the taxonomic tree. Given its relatively short and specific genome, it is possible to sample reads that together cover the entire genome. Models trained on such well-sampled data will be better at recognising the specific pattern of SARS-CoV-2. Finding one specific genome within a diverse group of genomes is easier than separating two diverse groups of genomes from each other. Due to this difference in taxonomy, an accuracy of the kind seen in the SARS-CoV-2/Non-SARS-CoV-2 problem might not be attainable in the Virus/Non-Virus problem.

5.2 Model Performance and Time Complexity

The best classifiers for the Virus/Non-Virus problem in terms of accuracy, which can be seen in Appendix D, are the ensemble solutions CNN-SVM-KNN, CNN-SVM-KNN-GB and CNN-RNN-SVM-KNN. All of these combinations achieve the same accuracy on the Testset, and, as can be noted, they all contain the models CNN, SVM and KNN. The conclusion is that the non-existent influence of RNN and GB in the voting can be explained by their lower accuracy. The combination CNN-SVM-KNN could be seen as the best ensemble, since it has the lowest complexity. However, both the SVM and the KNN must be excluded due to the time limitation.

The same goes for the SARS-CoV-2/Non-SARS-CoV-2 problem, where the most accurate ensemble combination is CNN-RNN-RF-KNN, but the computational time of the KNN exceeds the limitation, as can be seen in Appendix E.

The feature extraction of k-mers and the transformation of the data to principal components are also processes that take up a lot of time when predicting. Looking at Table 5, RF and GB are the fastest models at predicting reads when the creation of features is excluded from the time measurement. When the overall time is considered, see Figure 10, the models that include RF and GB are instead among the most time-consuming within the predefined time limit.

The problems presented above come down to computational resources and the allocation of computational power. An even more powerful setup than the one presented in Section 3.2 would increase the classification speed. Another significant speed-up could be achieved by using all available CPU power to predict several batches in parallel, as sketched below. By doing this, the ensemble combinations that were excluded would perform better and could therefore possibly fall within the limitation of the MiSeq sequencing machine; some combinations could even fall under the threshold for the NextSeq 550 sequencer.
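A minimal sketch of such batch-parallel prediction, assuming joblib and a scikit-learn-style model; the model, data, batch count and worker count are all placeholders, not the setup used in the thesis.

```python
# Hedged sketch of predicting batches of reads in parallel across CPU
# cores; the model and the data are placeholders for illustration.
import numpy as np
from joblib import Parallel, delayed
from sklearn.ensemble import RandomForestClassifier

rng = np.random.default_rng(0)
X_train = rng.random((200, 50))
y_train = rng.integers(0, 2, 200)
model = RandomForestClassifier(n_estimators=10).fit(X_train, y_train)

X = rng.random((10_000, 50))          # reads to classify
batches = np.array_split(X, 8)        # one batch per worker
preds = Parallel(n_jobs=8)(
    delayed(model.predict)(batch) for batch in batches)
y_pred = np.concatenate(preds)
```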

Regarding specific model performance, after training the RNNs on the Trainingset for Virus/Non-Virus, one can see in Figure 7 that even at epoch 100 the accuracy is still slowly increasing. The RNNs might therefore benefit from training over more epochs and possibly achieve better performance.

If a new ensemble classifier were to be developed, and development time is of importance, the models to include must be chosen with care. Training an SVM is time-consuming, especially on large datasets: the SVM training process for the 1 million Virus/Non-Virus reads took approximately 330 hours, which is almost two full weeks.

Depending on the hyperparameter settings, training a GB model can also be time-consuming. Increasing the learning rate and decreasing the number of estimators reduces the training time significantly, as can be seen when comparing the two classifiers in Sections 3.5.6 and 4.3.4: increasing the learning rate by 0.15 and decreasing the number of estimators by 8 thousand reduced the training time by 67 hours.

If two or more models had a high similarity, it could affect the ensemble accuracy, since the highly correlated models would vote for the same class and therefore carry a disproportionate weight in the voting compared with the other included models. The prediction similarities for both implementations can be seen in Tables 4 and 6. Some models have a higher similarity, but considering the models' individual accuracies, the similarity scores are not concerning; they show that the models use different approaches when separating the two classes.

Probabilities for each read belonging to a specific class can be extracted from the models. A possible improvement to our ensemble methodology would be to construct the ensemble from these class probabilities instead of hard class labels, which would hopefully yield ensemble models with increased robustness.
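A sketch of what such probability-based (soft) voting could look like; the probabilities and weights below are placeholders for illustration, not values from our models.

```python
# Hedged sketch of soft voting: average the models' class-1
# probabilities (optionally weighted) instead of their hard labels.
import numpy as np

probs = np.array([[0.97, 0.10, 0.55],   # e.g. CNN P(class 1) per read
                  [0.90, 0.30, 0.40],   # e.g. RNN
                  [0.99, 0.05, 0.70]])  # e.g. RF
weights = np.array([0.99995, 0.99909, 0.99937])  # e.g. accuracies

avg = np.average(probs, axis=0, weights=weights)
y_pred = (avg >= 0.5).astype(int)
print(y_pred)  # -> [1 0 1]
```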

5.2.1 Real data

When performing the statistical test based on the predictions from the ensemble solution, there is no doubt about which datasets are SARS-CoV-2 positive and negative. The confidence level of the statistical test is close to one for the negative datasets and close to zero for the positive dataset. Our classification results, seen in Table 7, strongly indicate that the ensemble model is reliable when it comes to classifying SARS-CoV-2.

Using the CNN on the same data gives almost as satisfying results as the ensemble solution, except for setU, which only reaches a confidence level of 0.99197. This indicates that the CNN is not as good as the ensemble solution, which is to be expected since the CNN had a lower accuracy on the Testset and predicted more false positives on the Tuningset.

In the process of determining the false positive rate of a sample, we used the Non-SARS-CoV-2 part of the Tuningset. This gave us a false positive rate, which we then used in the statistical test for the real data. However, there are some possible flaws to be noted. Since we used the Tuningset to find the optimal hyperparameters for our models, the models are somewhat fitted to that data through the tuning, even though it is not used to train the final models. Also, the proportions of virus, bacteria and eukaryotic reads in the Tuningset are one third each; when handling real data, sampled from an actual human being, we do not know what the fractions of these groups would be. We would therefore propose that this method would be more accurate if real negative samples from humans were used to determine the average proportion of false positives. Using samples from humans infected with virus diseases other than SARS-CoV-2 would arguably be even better, since such samples probably contain a higher amount of viruses, and viruses in general are more similar to SARS-CoV-2, falling under the same kingdom within the taxonomy.

When our classifier for SARS-CoV-2 was trained, some viruses similar to SARS-CoV-2 were
