• No results found

PREDICTING INTERSPECIES TRANSMISSION AND PANDEMIC RISKS OF CORONAVIRUSES Master Degree Project in bioinformatics 30 ECTS Autumn term 2020 Nigus Fikrie Telele Supervisor: Zelmina Lubovac Examiner: Björn Olsson

N/A
N/A
Protected

Academic year: 2021

Share "PREDICTING INTERSPECIES TRANSMISSION AND PANDEMIC RISKS OF CORONAVIRUSES Master Degree Project in bioinformatics 30 ECTS Autumn term 2020 Nigus Fikrie Telele Supervisor: Zelmina Lubovac Examiner: Björn Olsson"

Copied!
47
0
0

Loading.... (view fulltext now)

Full text

(1)

PREDICTING INTERSPECIES

TRANSMISSION AND PANDEMIC RISKS OF CORONAVIRUSES

Master Degree Project in bioinformatics 30 ECTS

Autumn term 2020 Nigus Fikrie Telele

(2)

2

Contents

Abbreviations ... 3

1. Introduction ... 4

1.1. Background... 4

1.1.1. Brief overview of coronaviruses ... 4

1.1.2. Feature extraction and machine learning algorithms ... 6

1.1.2.1. Amino Acid Composition ... 7

1.1.2.2. Pseudo Amino Acid Composition ... 8

1.1.2.3. Random Forest ... 9

1.1.2.4. Näive Bayes (NB) ... 9

1.2. Research hypothesis ... 10

1.3. Research aim ... 10

1.4. Research objectives ... 10

2. Materials and methods... 11

2.1. Dataset construction ... 11

2.2. Feature Extraction ... 11

2.3. Feature Selection ... 14

2.4. Machine learning classification models ... 14

2.5. Evaluating the performance of classification models ... 16

2.6. Interspecies transmission and pandemic risk predictions ... 16

3. Results ... 18

3.1. Feature Extraction ... 18

3.1.1. Amino Acid Composition ... 19

3.1.2. Pseudo Amino Acid Compositions ... 19

3.2. Machine learning classification ... 20

3.2.1. Random Forest... 20

3.2.1.1. Random Forest Model for features extracted from Amino Acid Composition (AAC) ... 21

3.2.1.2. Random Forest Model for features extracted from Dipeptide Composition (DC) ... 22

3.2.1.3. Random Forest Model for features extracted from Tripeptide Composition (TC) ... 22

3.2.1.4. Random Forest Model for features extracted from Pseudo Amino Acid Composition (PseAAC) .. 23

3.2.1.5. Random Forest model for features extracted from Amphiphilic Pseudo Amino Acid Composition (APseAAC) ... 24

3.2.2. Näive Bayes Classification ... 25

3.3. Interspecies Transmission and Pandemic Risk Predictions ... 26

3.3.1. Distance Matrix ... 27

3.3.2. Multidimensional Scaling (MDS) ... 30

3.3.3. Phylogenetic tree ... 33

4. Discussion and Conclusions ... 34

4.1. Discussion ... 34

4.2. Conclusion ... 39

5. Bioinformatics pipeline for interspecies transmission and pandemic risk predictions ... 40

6. Scientific contributions... 42

7. Future studies ... 42

8. Ethical aspects and impact on society ... 43

9. Acknowledgement ... 43

(3)

3 Abbreviations

AAC Amino Acid Composition

APseAAC Amphipathic pseudo amino acid composition

CoV Coronavirus

DC Dipeptide Composition

HCoV Human coronavirus

K-NN K-nearest neighbor

MERS Middle East Respiratory Syndrome

MSA Multidimensional scaling

mmds matric multidimensional scaling

NB Näive Bayes

PseAAC Pseudo Amino Acid Composition

RF Random Forest

SARS Severe acute respiratory syndrome

SVM Support vector machines

TC Tripeptide Composition

(4)

4

1.

Introduction 1.1. Background

Most pandemics caused by viruses such as human immunodeficiency virus (HIV), pandemic influenza, and severe acute respiratory syndrome (SARS) originated from animals (Pike et al., 2010). Zoonotic viruses can jump from animals to humans and could establish human-to-human transmission with wide geographical coverage leading to potential pandemics. As a result, predicting cross-species transmission and pandemic risks of viruses with the help of bioinformatics tools become important global research questions (Chan et al., 2013, Eng et al., 2014, Petropoulos and Makridakis, 2020).

1.1.1. Brief overview of coronaviruses

Coronaviruses (CoVs) belong to the family of Coronaviridae, sub-family of Coronavirinae (Corman et al., 2018, Cui et al., 2019). The sub-family Coronavirinae further classified into four genera: alpha-, beta-, gamma- and delta-coronavirus (α-, β-, γ- and δ- CoV) (Corman et al., 2018). The viral genome is composed of single-stranded positive sense RNA. CoVs infect birds (γ- and δ- CoVs) and various mammalian species including human beings (primary α- and β- CoVs) (Chan et al., 2013, Petropoulos and Makridakis, 2020). Human coronavirus (HCoV) infections have been documented since 1960s (Corman et al., 2018, Hamre and Procknow, 1966, McIntosh et al., 1967, Petropoulos and Makridakis, 2020).

So far, seven human coronavirus infections have been described from the α-CoV and β-CoV genera including HCoV-NL63, HCoV-229E, HCoV-OC43, HCoV-HKU1, MERS-CoV, SARS-CoV, and the on-going SARS-CoV-2 (Corman et al., 2018, Cui et al., 2019). The first four human coronaviruses caused only mild upper respiratory tract infections in immunocompetent hosts, whereas the remaining three caused severe respiratory syndrome in human hosts during the last two decades with considerably large mortality rates (Decaro and Lorusso, 2020). Human coronaviruses circulate worldwide, but the predominant species may vary by region or year (Su et al., 2016).

(5)

5

were reported in 2004 and 2005, respectively (van der Hoek et al., 2004, Woo et al., 2005). During the last two decades, two more epidemic human coronaviruses emerged including severe acute respiratory syndrome coronavirus (SARS-CoV) in 2003 and the Middle East respiratory syndrome coronavirus (MERS-CoV) in 2012 (Drosten et al., 2003, Zaki et al., 2012).

Both SARS-CoV and MERS-CoV caused outbreaks with high case fatality rates in different parts of the world. The SARS-CoV outbreak affected about 8 000 individuals, which caused pneumonia with fatality rate of approximately 10 percent (Cheng et al., 2007, Corman et al., 2018, Drosten et al., 2003, Yang et al., 2015). Although SARS-CoV or the closely related CoVs in bats possibly could cause infections among the human population, there were no reports of human SARS-CoV cases since 2004 (Corman et al., 2018, Ge et al., 2012, Ge et al., 2013). The MERS-CoV was first reported in a fatal human case of pneumonia in Saudi Arabia in 2012 (Zaki et al., 2012). Subsequent studies suggest that human beings mostly acquired the virus as zoonotic infection from dromedary camels in the Middle East (Chu et al., 2015, Chu et al., 2014, Saqib et al., 2017, Corman et al., 2018). Since the virus was discovered, over 2 000 human MERS-CoV infections with case fatality rates of approximately 35% (Corman et al., 2018). In addition, a stepwise human-to-human transmission and spread of MERS-CoV infections were reported in hospitalized situation (Oboho et al., 2015, Corman et al., 2018, Kim et al., 2017). This shows that MERS-CoV has unknown epidemic and pandemic potential threats among the human population.

Although the on-going SARS-CoV-2 has successfully established a human-to-human transmission, the virus infection has been believed to start from animals in Wuhan city, China in December 2019 (Riou and Althaus, 2020). Since the beginning of the 21st century, the world has been seriously challenged by successive human coronaviruses with increasingly high fatality rates, especially from the on-going SARS-CoV-2 (Corman et al., 2018, Qiang et al., 2020). As of 23 December 2020, over 75 million cumulative confirmed cases and nearly 1.7 million deaths have been reported globally (WHO, 2020). Although SARS-CoV-2 is believed to have started in China, the pandemic severely affected the Americas and Europe accounting for over 85% of new cases and 86% of new deaths globally.

(6)

6

zoonotic transmission of the virus with viral genomic reservoir in wild animals; its ability to directly cross the species barrier and infect the human host; rapid virus evolution and emergence of human-to-human transmission in a completely naïve human population; the novelty of virus antigens for the human host; and wide range of the target cells in the human host (Eng et al., 2014, Kasibhatla et al., 2020, Phan, 2020, Qiang et al., 2020).

The host tropism is primarily determined by the viral spike (S) entry protein, which plays a vital role in the binding of cell receptors and membrane fusion whereby the host range is known to be determined (Chan et al., 2013, Eng et al., 2014, Li, 2012). Studying the key roles of the S protein certainly contributes in advancing our knowledge on the cross-species transmission and pathogenesis of CoVs towards developing early warning and intervention strategies (Hulswit et al., 2016).

Importantly, the CoV genome has been a focus of intensive study. Currently, a huge number of sequences of the CoV have been made available in public databases. As of 12 December 2020, over 350 000 entries of CoV protein sequences including spike protein sequences were made available in the NCBI virus database (NCBI, 2020). This gives a unique opportunity to gain insight on the evolution of the CoV, for instance, through identifying shared structural topologies in the S protein domains and studying their evolutionary pattern. This could further enable to predict the cross-species and pandemic risks of CoV infections using bioinformatic tools.

1.1.2. Feature extraction and machine learning algorithms

Machine learning bioinformatics methods have been introduced to predict protein structures and functions. In the last few years, these methods have been used in predicting pathogenic microorganisms and their clinical features (Li et al., 2020a, Li et al., 2020b). For instance, recent researches applied machine learning tools for predicting influenza (Eng et al., 2014, Eng et al., 2017, Li et al., 2020a, Qiang and Kou, 2019) as well as to predict the evolutionary dynamics of coronaviruses (Qiang et al., 2020). Similarly, these bioinformatic methods could aid in systematically analyzing the increasingly available S protein sequence data to predict the risk of potential CoV pandemics in a completely naïve human population.

(7)

7

various applications including prediction of protein structural and functional classes, protein-protein interactions, protein-protein–ligand interactions, subcellular locations, and enzyme substrates (Chen et al., 2018).

During the past few decades, an increasing number of web based and standalone software packages have been developed to calculate a variety of structural and physicochemical features from protein sequences including PROFEAT, BioBayesNet, SPiCE, PseAAC, PseAAC-Builder, propy, protr/ProtrWeb, Rcpi and PseKRAAC (Chen et al., 2018, Xiao et al., 2015, Shen and Chou, 2008, Xiao, 2019). Moreover, feature selection and ranking analysis is an equally crucial step in machine learning algorithm studies of protein structures and functions (Chen et al., 2018). In this study, we employed five protein feature extraction algorithms that can represent samples of a protein in two different forms: a discrete form and a sequential form (Chuo, 2005, Kuric, 2010). The five feature extraction algorithm include amino acid composition (AAC), dipeptide composition (DC), tripeptide composition (TC), pseudo amino acid composition (PseAAC), and amphipathic pseudo amino acid composition (APseAAC). The AAC, DC, and TC represent a protein sample in the discrete form, whereas PseAAC and APseAAC represent in the sequential form.

1.1.2.1. Amino Acid Composition

With the discrete form, a protein is represented by a set of discrete numbers or a multiple dimension vector (Kuric, 2010). For example, the amino acid composition is a typical discrete form that has been widely used in predicting protein structural class. The amino acid composition (AAC) does not depend on the order of the amino acid residues and is commonly used to measure the occurrence frequency of the 20 amino acids within a given sequence fragment. It relies on the frequency of an amino acid or a set of amino acids in the sequence that may capture some information that could help in predicting the subject of interest (Chou and Cai, 2005). AAC includes uni-amino acids, dipeptides, tripeptides, or any setting of amino acids in any order. Dipeptide amino acid composition and tripeptide amino acid composition may capture more local information than the uni-amino acid composition.

(8)

8

an amino acid and each element represents the relative frequency of an amino acid (Chuo, 2005, Bhasin and Raghava, 2004). It is the simplest and commonly used amino acid composition form (Chou and Cai, 2005, Chuo, 2005). Although many methods for prediction of various protein attributes were based on the AAC model, all the sequence order effects would be lost by using the AAC discrete model, and hence the prediction quality thus obtained might be limited (Chou, 2009).

With the dipeptide amino acid composition (DC), a feature value represents the frequency of a dipeptide, which is the number of occurrence of an amino acid in two adjacent positions in a protein sequence (Cao et al., 2013). The number of possible dipeptides is 400, which is the number of feature elements. The dipeptide is adding a new meaning to the amino acid composition as the frequency of any contiguous two amino acids may capture some local order information (Bhasin and Raghava, 2004). Tripeptide amino acid composition (TC) is an extension of the frequency of the adjacent amino acids implemented with the DC adding more local order value as the frequency of the three possible amino acids is considered. It is an efficient building block for protein structure prediction and the number of feature elements in TC is 8 000, which is substantially large (Anishetty et al., 2002). However, most machine learning methods do not favor high dimensional data because of high computational demand, which may lead to poor convergence or no convergence state (Cao et al., 2013).

1.1.2.2. Pseudo Amino Acid Composition

Unlike the discrete form, the sequential form is designed to reflect a protein by a series of amino acids according to the order of their positions in the protein chain. Therefore, the sequential form can naturally reflect all the information about the sequence order and length of a protein (Chuo, 2005). Pseudo amino acid composition (PseAAC) is the hybrid features that combines both discrete and sequential features (Shen and Chou, 2008). Compared to the AAC, PseAAC can incorporate much more information of a protein sequence that could remarkably enhance the power of using a discrete model to predict various attributes of a protein (Chou and Cai, 2005, Chuo, 2005). The essence of PseAAC includes the main feature of AAC, but it also includes information beyond amino acid composition. For the PseAAC there are some other elements in addition to the 20 components represented in the AAC.

(9)

9

properties mainly with the hydrophobicity and hydrophilicity of the amino acids (Chuo, 2005). The arrangement of the hydrophobicity and hydrophilicity along a protein chain play an important role in protein folding, catalytic mechanism and protein interaction with other molecules and environment (Limongelli et al., 2015). The APseAAC considers the hydrophobicity, which is often a major contributor of binding affinity between a protein and its ligand. However, the PseAAC and APseAAC may take more computational time than the conventional AAC method.

In addition to the feature extraction methods, this study employed a Random Forest (RF) features screening algorithm followed by construction of machine learning classification models using RF and Naïve Bayes (NB) algorithms (Xiao, 2019). Finally, with the selected features as input, interspecies and pandemic risks of the coronavirus infections were predicted using distance matrix and multidimensional scaling methods through exploring the evolutionary dynamics of the S protein.

1.1.2.3. Random Forest

The RF model classifier has been widely used to model biological data and successfully employed for protein function prediction (Kandaswamy et al., 2010). The model generates a set of decision trees (DTs), which are grown by a subset of features. The RF model repeatedly performs the computing process before making a final prediction for each sample. The DTs trained on different bootstrap samples from a training data. The model elects the classification based on majority vote (over all the trees in the forest). The final prediction is the mean of each prediction with bootstrapping algorithm. Each DT in the RF classifies the new object, and votes for that class.

1.1.2.4. Näive Bayes (NB)

(10)

10

1.2. Research hypothesis

This research hypothesized that there was no significant difference between human coronavirus and non-human coronavirus spike protein features in terms of interspecies transmission and pandemic risks.

1.3. Research aim

The aims of this research project were to screen features of S protein sequences using machine learning models and predict interspecies transmission and pandemic risks of coronavirus infections.

1.4. Research objectives

i) To perform protein feature extraction from coronavirus spike protein sequences using five descriptors including amino acid composition (AAC), dipeptide composition (DC), tripeptide composition (TC), pseudo amino acid composition (PseAAC), and amphiphilic pseudo amino acid composition (ApseAAC).

ii) To perform protein feature selection using a Random Forest (RF) predictive model. iii) To perform classification on the protein features using RF and Näive Bayes (NB)

classifier machine learning algorithms.

iv) To compare performances of the protein feature extraction methods and evaluate their accuracies.

v) To compare performances of the machine learning classifier algorithms and evaluate their accuracies.

vi) To predict interspecies transmission and pandemic risks of coronaviruses using machine learning algorithms.

(11)

11

2.

Materials and methods

The research project performed the following stepwise analyses: 2.1. Dataset construction

Spike protein sequences of coronaviruses were collected from the NCBI Virus database. As of 19 October 2020, there were about 360 000entries of protein sequences from the coronaviruses in the database (NCBI, 2020). We collected complete spike protein sequences of the coronaviruses from the NCBI Virus database, where there were 12 084 coronavirus spike protein sequences (2 586 complete – and 9 498 partial – sequences).

For this project, the 2 586 complete spike protein sequences were collected in FASTA format. Of the available 2 586 coronavirus complete spike protein sequences, 481 were of human – origin and 1 917 of non-human – origin. However, for the remaining 188 sequences the hosts were missing and hence, we excluded them from downstream analyses. Although many spike protein sequences could be downloaded from the NCBI Virus database, there was no any spike protein sequence for the ongoing SARS-CoV-2 virus in this database at that time point. Therefore, we downloaded a complete spike protein sequence for this virus from the China National Center for Bioinformation database (CNCB-NGDC, 2020) and included in the dataset we constructed. The downloaded spike protein sequences were saved into two separate FASTA formatted files (i.e., one for the human coronavirus and another for non-human coronavirus spike protein sequences).

Once target spike protein sequences were collected, incomplete and duplicate sequences were removed to minimize potential biases in the machine learning training process. The sequences were aligned using MEGA X version 10.2.2 (https://www.megasoftware.net/history). The protein sequences also were stratified in relation to host, where the samples were collected from. The coronavirus strains isolated from human samples were considered as positive samples, whereas non-human isolated samples considered as negative samples.

2.2. Feature Extraction

(12)

12

(ACDEFGHIKLMNPQRSTVWY) and 5 ambiguous amino acids (BJUXZ) assignments were converted into numerical vectors for input to machine learning algorithms (Kotyk, 1999). A protr R package was installed and used for reading the spike protein sequences. This package was also used for generating numerical representation schemes of the protein sequences (Xiao et al., 2015). Using the readFASTA() function in the protr package, the two FASTA formatted files were separately uploaded and read in the R environment. The number of sequences uploaded from the two files was checked using the length() function in the protr package. In order to ensure the spike protein sequences only contain the 20 standard amino acid types (which are mostly required for the descriptor computation), the protcheck() function in the protr package was employed, which removed the non-standard sequences.

Once the amino acid type sanity checks were done, a total of five descriptors including three from amino acid composition [i.e., amino acid composition (AAC), dipeptide composition (DC), tripeptide composition)], and two from pseudo amino acid composition [i.e., pseudo amino acid composition (PseAAC), and amphiphilic pseudo amino acid composition (APseAAC)] were computed for those spike protein sequences containing only the 20 standard amino acids. The following functions in the protr package were used to calculate the five descriptors extractAAC(), extractDC(), extractTC(), extractPAAC(), and extractAPAAC(), respectively (Xiao et al., 2015). Also class labels were made for classification modeling using the as.factor() function. As described (section 1.1.2), these feature encoding algorithms were employed for capturing key information of the spike proteins in preparation for subsequent feature selection and machine learning model classification (Shen and Chou, 2008).

AAC gives a 20-dimensional descriptor, which is defined as:

f(r) = Nr r = 1, 2, . . . , 20 N

where Nr is the number of the amino acid type r. N is the length of the sequence. With extractAAC() function, we can easily obtain the length 20 of descriptor.

DC gives a 400-dimensional descriptor, which is defined as:

(13)

13

where, Nrs is the number of dipeptide represented by amino acid type r and type s. With function extractDC(), we can easily obtain the length 400 of descriptor.

TC gives a 8 000-dimensional descriptor, defined as:

f(r, s, t) = Nrst r, s, t = 1, 2, . . . , 20 N−2

where Nrst is the number of tripeptide represented by amino acid type r, type s, and type t. With function extractTC(), we can easily obtain the length 8 000 of descriptor.

For the PseAAC, extractPAAC() functions is used to calculate the descriptor (dim: 20 + lambda, default is 50).

extractPAAC(x, props = c("Hydrophobicity", "Hydrophilicity", "SideChainMass"), lambda = 30, w = 0.05, customprops = NULL)

x is a character vector, as the input protein sequence.

props: A character vector, specifying the properties used. 3 properties are used by default, as listed below:

'Hydrophobicity': Hydrophobicity value of the 20 amino acids 'Hydrophilicity': Hydrophilicity value of the 20 amino acids 'SideChainMass': Side-chain mass of the 20 amino acids lambda: lambda parameter for the PseAAC descriptors, default is 30. w: The weighting factor, default is 0.05.

customprops: A n x 21 named data frame contains n customized property. Each row con-tains one property. The column order for different amino acid types is

'Ac-cNo', 'A', 'R', 'N', 'D', 'C', 'E', 'Q', 'G', 'H', 'I', 'L', 'K', 'M', 'F', 'P', 'S', 'T', 'W', 'Y', 'V', and the

col-umns should also be exactly named like this. The AccNo column contains the properties' names. Then users should explicitly specify these properties with these names in the argu-ment props. The default value for customprops is NULL.

For the APseAAC, extractAPAAC() function is used to calculate the descriptor (dim: 20 + (n * lambda), n is the number of properties selected, default is 80).

(14)

14

extractAPAAC(x, props = c("Hydrophobicity", "Hydrophilicity"), lambda = 30, w = 0.05, customprops = NULL)

props: A character vector, specifying the properties used. The 2 properties (instead of 3) including “Hydrophobicity” and “Hydropilicity” are used by default.

2.3. Feature Selection

The spike protein features generated by the feature extraction methods may be irrelevant to the prediction of protein types, which can result in over-fitting, information redundancy and dimension disaster. For selecting high discriminative spike protein features and for reducing computational complexities, feature selection procedures are essential in protein function predictions based on machine learning methods (Ebina et al., 2011). Random Forest (RF) is commonly used for feature selection of prediction problems and can rank the importance of the features in a large scale to discriminate the different categories (Svetnik et al., 2003). Hence, the RF method was employed for feature selection before machine learning classification model development.

2.4. Machine learning classification models

(15)

15

Figure 1. Schematic workflow of the machine learning prediction models. Once the protein sequences were collected from the NCBI Virus database and China National Center for Bioinformation database, S protein feature representations were obtained from the feature extraction encoding algorithms. A RF feature selection method was employed followed by two classifier models for training and testing the dataset for predicting interspecies transmission and pandemic risks of coronaviruses (AAC: amino acid composition; DC: dipeptide composition; TC: tripeptide composition; PseAAC: pseudo-amino-acid composition; APseAAC: amphipathic pseudo-amino-acid composition; RF: random forest; NB: Naïve Bayes; ROC: receiver operating characteristic; AUC: area under the curve; ACC: accuracy; Sn: sensitivity; Sp: specificity, and MCC: Matthew’s correlation coefficient).

(16)

16

feature selection. Once the spike protein feature vectors were generated by the five amino acid feature extraction methods (AAC, DC, TC, PseAAC, and APseAAC), RF model implementation was employed for feature selection and machine learning classification. To make results from the RF model reproducible, the seed of the random number generator in R was specified. The datasets were split into a 75% training set and a 25% test set. Using the randomForest package, the RF classification model was trained on the training set with 10-fold cross validation.

For comparison with the Random Forest (RF) classifier, Näive Bayes (NB) classification models were constructed for the spike protein features extracted from three of the five descriptors including AAC, PseAAC, and APseAAC. Alike in the RF classifier model, the datasets were split into a 75% training set and a 25% test set. Using the klaR package, the NB classification model was trained on the training set with 10-fold cross validation.

2.5. Evaluating the performance of classification models

Once the machine learning classifier models were trained on positive and negative samples randomly selected from the spike protein dataset, the remaining 25% samples were reserved as a test dataset for assessing the performances of the classifiers. With the model trained on the training set, prediction was done on the test set and a receiver operating characteristic (ROC) curve was plotted using a pROC package. The area under the ROC curve (AUC) was used to reveal the optimal parameters in the classifier models. The AUC was calculated in R (Sing et al., 2005). Confusion matrices of correctly and wrongly classified spike protein sequences were constructed and the four commonly used metrics for model performance evaluation including sensitivity, specificity, accuracy and Matthew’s correlation coefficient were computed.

2.6. Interspecies transmission and pandemic risk predictions

(17)

17

Once the spike protein sequences of the human coronaviruses and non-human coronaviruses were imported and uploaded in the R environment, a multiple alignment of the sequences was performed with the multiple sequence alignment (msa) package. The readFASTA() function in the protr package was used to import the spike protein sequences stored in FASTA formatted files and uploaded in the R environment followed by multiple sequence alignment analysis using the msa package, and converting the multiple alignment object in an object of class align as defined by the bios2mds package (Bodenhofer et al., 2015). The primary purpose of the consensus sequence was to obtain an impression of the conservation of the sequences at the individual positions or columns of the multiple alignment. The result from the multiple sequence alignment was used for later processing in relation to the distance matrix analyses. The result from the multiple alignments were converted from similarities into dissimilarities for later processing with the seqinr package (for distance matrix computation) (Bodenhofer et al., 2015). Distance matrices were computed for the spike protein sequences using the dist.alignment() function from the seqinr package in reference to the spike protein of the severe acute respiratory syndrome coronavirus 2 (SARS-CoV-2) with accession number QIC53213.1. Using the computed distance matrices evolutionary distances were predicted for the spike protein sequences of human coronaviruses as well as for the all (human and non-human) coronavirus spike protein sequences.

(18)

18

3.

Results

A total of 2 587 complete coronavirus spike protein sequences (482 of human origin and 1 917 of non-human – origin) were analyzed for this study. Using five amino acid feature extraction methods or descriptors discrete features of the spike protein sequences were extracted followed by Random Forest and Näive Bayes machine learning classification models and evaluating their performances. In addition, distance matrices were computationally generated across the sequences for exploring evolutionary dynamics of the coronaviruses and visualized with the multidimensional scaling (MDS) method in R. Evolutionary distances of human coronaviruses and non-human coronaviruses were analyzed for prediction of their interspecies transmission and pandemic risks. Finally, a bioinformatics pipeline has been developed and presented with an attempt to build an initial pipeline for surveillance of interspecies transmission and pandemic risk prediction of human coronaviruses.

Once the amino acid type sanity checks were done with the protr R package, 15 sequences with non-standard amino acids were excluded from the file containing human coronavirus spike protein sequences and remained with 467 sequences containing only the 20 standard amino acids. Similarly, 83 sequences with the non-standard amino acids were excluded from the other file containing non-human coronavirus sequences and remained with 1 834 sequences with only the 20 standard amino acids.

3.1. Feature Extraction

(19)

19 3.1.1. Amino Acid Composition

The amount or number of attributes (dimensions) generated by the amino acid composition (ACC) descriptor was 20 and the number of entries was 2 301 (Figure 2).

Figure 2. A discrete model by amino acid composition descriptor reflecting the spike protein sequence key features.

The number of attributes generated from the dipeptide composition (DC) feature extraction method was 400, which showed the calculated percentage of all possible dipeptides found in the sequence. The number of attributes generated by the tripeptide composition (TC) descriptor was 8 000.

3.1.2. Pseudo Amino Acid Compositions

(20)

20 3.2. Machine learning classification

Two machine learning classification models – Random Forest (RF) and Näive Bayes (NB) were employed. Before the machine learning classification models were constructed, protein feature selections were conducted using the commonly used feature selection algorithm – Random Forest.

3.2.1. Random Forest

The Random Forest classification model was trained on the training set and the 10-fold cross validation results showed that well performance was achieved by all five feature extraction methods. All the five feature extraction methods including AAC, DC, TC, PseAAC, and APseAAC had been shown to have optimal feature with the best performances, where the RF predictive models achieved overall accuracy (ACC) of over 94% and Matthews correlation coefficient (MCC) of over 0.80. Comparative analyses of the five amino acid feature descriptors presented below (Table 1).

Table 1. Comparison of the results for the five amino acid feature extraction methods using 10-fold cross validation.

Method Sensitivity Specificity Accuracy MCC AUC OOB No. variables

AAC 0.8844 0.9528 0.9410 0.8041 0.964 5.90 4

DC 0.8186 0.9793 0.9447 0.8318 0.968 5.53 20

TC 0.7995 0.9832 0.9420 0.8284 0.960 5.80 89

PseAAC 0.9000 0.9561 0.9464 0.8223 0.967 5.36 7

APseAAC 0.8944 0.9560 0.9453 0.8189 0.969 5.47 8

AAC: amino acid composition, DC: dipeptide composition, TC: tripeptide composition, PseAAC: pseudo amino acid composition, APseAAC: amphiphilic pseudo amino acid

composition, MCC: Matthews correlation coefficient, AUC: area under the receiver operating characteristic curve, No. variables: number of variables tried at each site, OOB: out-of-bag estimate of error rate.

(21)

21 3.2.1.1. Random Forest Model for features extracted from Amino Acid

Composition (AAC)

A confusion matrix was generated for the Random Forest (RF) classification model developed using the amino acid composition (AAC) feature. For evaluation of the performance of the classification model, the four commonly used metrics including sensitivity (SN), specificity (SP), the overall accuracy (ACC) and Matthews correlation coefficient (MCC) were calculated from the confusion matrix and the respective values were 0.8844, 0.9528, 0.9410, and 0.8041 (Table 1).

In addition, the area under the receiver operating characteristic (ROC) curve (AUC) was calculated in R using plot.roc() function in an pROC package. As shown below in the ROC curve the AUC value for the spike protein features extracted by the AAC algorithm was 0.964 (Figure 3).

(22)

22 3.2.1.2. Random Forest Model for features extracted from Dipeptide

Composition (DC)

For the Random Forest classification model constructed on features extracted from the dipeptide composition (DC), a confusion matrix was generated. The SN, SP, ACC, and MCC values computed from the confusion matrix were 0.8186, 0.9793, 0.9447, and 0.8318, respectively (Table 1). As shown in the ROC curve, the AUC value for the spike protein features extracted by the DC descriptor was 0.968 (Figure 4).

Figure 4. A receiver operating characteristic (ROC) curve for the model trained on the training set from the dipeptide composition (DC) descriptor.

3.2.1.3. Random Forest Model for features extracted from Tripeptide Composition (TC)

(23)

23

Figure 5. A receiver operating characteristic (ROC) curve for the model trained on the training set from the tripeptide composition (DC) descriptor.

3.2.1.4. Random Forest Model for features extracted from Pseudo Amino Acid Composition (PseAAC)

(24)

24

Figure 6. A receiver operating characteristic (ROC) curve for the model trained on the training set from the pseudo amino acid composition (PseAAC) descriptor.

3.2.1.5. Random Forest model for features extracted from Amphiphilic Pseudo Amino Acid Composition (APseAAC)

(25)

25

Figure 7. A receiver operating characteristic (ROC) curve for the model trained on the training set from the amphiphilic pseudo amino acid composition (APseAAC) descriptor.

3.2.2. Näive Bayes Classification

In addition to the Random Forest, a Näive Bayes (NB) classification models were successfully developed for the spike protein features extracted from three of the five descriptors including AAC, PseAAC, and APseAAC. Although the NB classification model performed well for AAC, PseAAC, and APseAAC, the model was found to be less stable for the spike protein features extracted from the other two descriptors DC and TC. For the DC descriptor, the accuracy metric values were missing. With the TC descriptor, the sensitivity was 0.00 and the specificity was 1.00. The 10-fold cross validation results for the NB on coronavirus spike protein features from the AAC, PseAAC, and APseAAC descriptors as well as their performance evaluation briefly presented below.

(26)

26

correlation coefficient (MCC) were calculated from the confusion matrix and the respective values were 0.9500, 0.8090, 0.8414, and 0.6631. In addition, the area under ROC curve (AUC) was calculated. The AUC value for the spike protein features extracted by the AAC algorithm was 0.918.

A confusion matrix was also computed for the Näive Bayes classification models constructed on features extracted from the Pseudo Amino Acid composition (PSeAAC). The SN, SP, ACC, and MCC values calculated from the confusion matrix were 0.9700, 0.8000, 0.8391, and 0.6685, respectively. From the ROC curve, the AUC value for the spike protein features extracted by the PseAAC descriptor was shown to be 0.916.

Once more, a confusion matrix was generated for the Näive Bayes classification model developed with the features extracted from the amphipathic pseudo amino acid composition (APseAAC) descriptor. The four matrices SN, SP, ACC, and MCC were computed from the confusion matrix, and their respective values were 0.9500, 0.8090, 0.8414, and 0.6631. The AUC value obtained from the ROC curve for the spike protein features extracted by the APseAAC descriptor was 0.918.

3.3. Interspecies Transmission and Pandemic Risk Predictions

Multiple alignment of the coronavirus spike protein sequences was performed using in the R environment. The result obtained from the multiple sequence alignment (Figure 8) were used for later processing in relation to the distance matrix analyses.

(27)

27

b)

Figure 8. CLUSTAL multiple alignment of coronavirus spike protein sequences. a) for human – origin coronaviruses. b) for human origin and non-human origin coronaviruses.

3.3.1. Distance Matrix

Distance matrices were computed for the spike protein sequences with reference to the spike protein of the severe acute respiratory syndrome coronavirus 2 (SARS-CoV-2) with accession number QIC53213.1 and their evolutionary distances were predicted. Partial results of the distance matrices of the spike protein sequences shown below for human coronaviruses (Table 2) and for all coronaviruses (Table 3).

Although it could not be shown in Table 2 and Table 3, majority of the spike protein sequences of both human coronaviruses and non-human coronaviruses found to have similar ranges of distance matrices between 0.8 and 0.9 to the reference sequence SARS-CoV-2. Whereas, only a small proportion of the coronavirus spike protein sequences were found to have distance matrices below 0.5.

(28)

28

Table 2. Partial results of the distance matrix for human coronavirus spike protein sequences. The row highlighted in yellow is the reference sequence (SARS-CoV-2) for the distance matrix computation. DM: distance matrix.

Coronavirus spike glycoprotein DM

YP_009825051.1 |spike glycoprotein [SARS coronavirus Tor2] 0.4747838

AAP41037.1 |spike glycoprotein [SARS coronavirus Tor2] 0.4747838

AYV99747.1 |spike [SARS coronavirus Urbani] 0.4739412

AYV99775.1 |spike [SARS coronavirus Urbani] 0.4739412

AYV99803.1 |spike [SARS coronavirus Urbani] 0.4739412

AYV99761.1 |spike [SARS coronavirus Urbani] 0.4747838

AYV99789.1 |spike [SARS coronavirus Urbani] 0.4747838

AYV99817.1 |spike [SARS coronavirus Urbani] 0.4747838

QIC53213.1 |spike glycoprotein [SARS coronavirus 2] 0.0000000

ALW82720.1 |S protein [MERS-related coronavirus] 0.8491852

ALW82731.1 |S protein [MERS-related coronavirus] 0.8491852

ALW82709.1 |S protein [MERS-related coronavirus] 0.8491852

ALW82680.1 |S protein [MERS-related coronavirus] 0.8491852

AXG21665.1 |spike glycoprotein [MERS-related coronavirus] 0.8491852

ALW82691.1 |S protein [MERS-related coronavirus] 0.8496543

ALW82742.1 |S protein [MERS-related coronavirus] 0.8491852

ALB08246.1 |S protein [MERS-related coronavirus] 0.8491852

ALB08257.1 |S protein [MERS-related coronavirus] 0.8491852

QGW51898.1 |spike glycoprotein [MERS-related coronavirus] 0.8491852

ALX27243.1 |S protein [MERS-related coronavirus] 0.8487159

YP_009047204.1 |spike protein [MERS-related coronavirus] 0.8491852

YP_007188579.1 |spike protein [Betacoronavirus England 1] 0.8491852

ALW82753.1 |S protein [MERS-related coronavirus] 0.8491852

ALX27232.1 |S protein [MERS-related coronavirus] 0.8491852

YP_173238.1 |spike glycoprotein [Human coronavirus HKU1] 0.8472938

AAT98580.1 |spike glycoprotein [Human coronavirus HKU1] 0.8472938

AQN78680.1 |spike glycoprotein [Human coronavirus OC43] 0.8454266

AQN78688.1 |spike glycoprotein [Human coronavirus OC43] 0.8459031

AOL02453.1 |spike glycoprotein [Human coronavirus OC43] 0.8457675

AWW13575.1 |spike protein [Human coronavirus OC43] 0.8468551

AWW13566.1 |spike protein [Human coronavirus OC43] 0.8473307

QJY77954.1 |spike protein [Human coronavirus 229E] 0.8742874

QJY77962.1 |spike protein [Human coronavirus 229E] 0.8742874

QJY77978.1 |spike protein [Human coronavirus 229E] 0.8744123

QJY77970.1 |spike protein [Human coronavirus 229E] 0.8738434

QJY77946.1 |spike protein [Human coronavirus 229E] 0.8726944

APF29062.1 |spike protein [Human coronavirus NL63] 0.8901982

APF29063.1 |spike protein [Human coronavirus NL63] 0.8901982

APF29071.1 |spike protein [Human coronavirus NL63] 0.8906588

(29)

29

Table 3. Partial results of the distance matrix for all (human and non-human) human coronavirus spike protein sequences. The row highlighted in yellow is the reference sequence (SARS-CoV-2) for the distance matrix computation and the row highlighted in red is the coronavirus, which had the smallest distance with the reference sequence.

Coronavirus spike glycoprotein DM

AEK31974.1 |spike glycoprotein, partial [Turkey coronavirus] 0.8889597

AEK31976.1 |spike glycoprotein, partial [Turkey coronavirus] 0.8889597

AEK31977.1 |spike glycoprotein, partial [Turkey coronavirus] 0.8889597 AEK31978.1 |spike glycoprotein, partial [Turkey coronavirus] 0.8889597 AEK31979.1 |spike glycoprotein, partial [Turkey coronavirus] 0.8889597 AEK31980.1 |spike glycoprotein, partial [Turkey coronavirus] 0.8889597 AEK31981.1 |spike glycoprotein, partial [Turkey coronavirus] 0.8889597 AEK31982.1 |spike glycoprotein, partial [Turkey coronavirus] 0.8889597 AEK31983.1 |spike glycoprotein, partial [Turkey coronavirus] 0.8889597 AEK31984.1 |spike glycoprotein, partial [Turkey coronavirus] 0.8908708 AEK31975.1 |spike glycoprotein, partial [Turkey coronavirus] 0.8895972

ACV87265.1 |spike [Turkey coronavirus] 0.8888086

ACV87276.1 |spike [Turkey coronavirus] 0.8873779

ACV87254.1 |spike [Turkey coronavirus] 0.8897791

ALQ43515.1 |spike protein [European turkey coronavirus 080385d] 0.8867169

QDA76255.1 |S protein [Pheasant coronavirus] 0.8839893

QDA76264.1 |S protein [Pheasant coronavirus] 0.8839893

QIC53213.1 |spike glycoprotein [Severe acute respiratory syndrome coronavirus 2] 0.0000000

QHR63300.2 |spike glycoprotein [Bat coronavirus RaTG13] 0.1511709

QIA48632.1 |spike protein [Pangolin coronavirus] 0.2694670

QIQ54048.1 |spike protein [Pangolin coronavirus] 0.2723803

QIA48623.1 |spike protein [Pangolin coronavirus] 0.2696799

QIA48614.1 |spike protein [Pangolin coronavirus] 0.2709275

QIA48641.1 |spike protein [Pangolin coronavirus] 0.2709275

BCG66627.1 |spike protein [Severe acute respiratory syndrome-related coronavirus] 0.4604593

AAP41037.1 |spike glycoprotein [SARS coronavirus Tor2] 0.4847804

ABF65836.1 |S protein [Severe acute respiratory syndrome-related coronavirus] 0.4847804 AAV49730.1 |spike glycoprotein [SARS coronavirus B039] 0.4847804

AAV91631.1 |spike glycoprotein [SARS coronavirus A022] 0.4864265

ATO98205.1 |spike protein [Bat SARS-like coronavirus] 0.4722516

ALK02457.1 |spike protein [SARS-like coronavirus WIV16] 0.4722516

ATO98157.1 |spike protein [Bat SARS-like coronavirus] 0.4730972

ATO98218.1 |spike protein [Bat SARS-like coronavirus] 0.4747838

ATO98231.1 |spike protein [Bat SARS-like coronavirus] 0.4756249

QDF43825.1 |spike glycoprotein [Coronavirus BtRs-BetaCoV/YN2018B] 0.4747838

AGZ48818.1 |spike protein [Bat SARS-like coronavirus Rs3367] 0.4739412

(30)

30 3.3.2. Multidimensional Scaling (MDS)

Patterns of the coronavirus spike protein sequences were clustered and visualized using the multidimensional scaling (MDS) method based on principal component analysis of distance matrix. For example, the resulting scree plot of eigenvalues generated by the scree.plot() function from the bios2mds package was as shown in Figure 9.

(31)

31

Again the mmds.2D.plot() function from the bios2mds package was employed on the human coronavirus spike protein sequences and generated a scatter plot of the MDS coordinates as shown below (Figure 10).

(32)

32

Once more the mmds.2D.plot() function from the bios2mds package was employed and generated a projection of the coronaviruses from coronavirus spike protein sequences of all coronaviruses on the space of human coronavirus spike protein sequences as shown below on the scatter plot of the MDS coordinates (Figure 11).

(33)

33 3.3.3. Phylogenetic tree

Further, phylogenetic trees were constructed with the neighbor joining algorithm using the njs() function from the ape package in R for the computed distance matrices among the coronavirus spike protein sequences. However, results only from 40 randomly selected spike protein sequences of the human coronaviruses including SARS-CoV-2 shown in Figure 12 for the sake of visualization.

(34)

34 4. Discussion and Conclusions

4.1. Discussion

During the last few decades the world has been immeasurably challenged by successive coronaviruses with increasingly high fatality rates in the human population, especially from the on-going SARS-CoV-2 pandemic (Qiang et al., 2020). With advancements of high throughput technologies in the molecular biology field an increasingly high number of virus sequences including coronaviruses were made available in the public database. However, only a few studies applied bioinformatics tools such as machine learning algorithms for predicting interspecies transmission and pandemic risks of viral diseases. Therefore, in this study, we analyzed 2 587 complete spike protein sequences using bioinformatic tools for predicting the interspecies transmission and pandemic risks of the coronaviruses in the human population. The number of sequences analyzed in this study was comparable to a recently published study conducted on a total of 2 666 spike protein sequences of coronaviruses (Qiang et al., 2020). The number of human coronaviruses and non-human coronavirus spike protein sequences were also comparable. However, compared to this study we used only complete spike protein sequences. Further, the number of spike protein sequences used in our study was larger when compared to two other more recently published studies aimed to predict the animal hosts of coronaviruses from compositional biases of spike protein (Brierley and Fowler, 2020) and to predict host specificity of coronaviruses based on spike protein sequences (Kuzmin et al., 2020).

Compared to the abovementioned recently published studies on coronavirus spike protein sequences (Brierley and Fowler, 2020, Qiang et al., 2020, Kuzmin et al., 2020), our study employed a method of checking the amino acid type sanity. Hence, we excluded the non-standard amino acids and proceed only with sequences containing the 20 non-standard amino acids in the downstream analyses.

(35)

35

The feature vectors generated with all of the five descriptors employed in this study were easily combined with the Random Forest machine learning algorithm for developing computational predictors, which was in line with the study done by Qiang et al. (Qiang et al., 2020). Furthermore, our study was also found to be able to combine the extracted features with an additional machine learning algorithm – Näive Bayes (NB).

As revealed above in the comparative performance of the descriptors all the five feature extraction methods have sufficient discriminatory power to predict the spike protein function of the coronaviruses. The feature extractions transformed and extracted numeric values from textual sequences, which were envisioned to be informative to facilitate knowledge learning from the data using machine learning tools (Karlin and Altschul, 1990).

Although a previous study conducted to predict anti-angiogenic activity peptides in silico using a generalized linear model and feature selection methods showed best performance result from AAC as compared to DC and TC (Blanco et al., 2018), our study obtained relatively similar performances across all descriptors. The study done by Blanco et al. (2018) found that the DC and TC had achieved lower performances, whereas our study showed both descriptors achieved higher performances with the Random Forest algorithm. In addition, this study stated that the PseAAC descriptors have been shown to achieve less stable performances than AAC, which was not in line with to our findings.

The 10-fold cross validation results of the two machine learning models – Näive Bayes (NB) and Random Forest (RF) employed for classification of the spike protein features extracted by the three descriptors (i.e., AAC, PseAAC, and APseAAC) apparently revealed that performances of the RF classification models were much better as compared to the NB. Our findings were in line with an earlier study conducted on scoring amino acid mutations for predicting pandemic risks of avian influenza, where the performance of the NB model was slightly poorer and less stable than RF classifier (Qiang and Kou, 2019). In addition, the observed errors in the NB models trained with DC and TC could be due to their high-dimensional representations that might cause over-fitting or dimension disaster problems (Anishetty et al., 2002, Cao et al., 2013).

(36)

36

classifier model had good performance (Kandaswamy et al., 2010, Svetnik et al., 2003). For instance, the study done by Qiang and Kou (Qiang and Kou, 2019) showed that the AUC medians of the RF classifier model was approximately 1.0, whereas it was 0.5 for the K-Nearest Neighbor (KNN) classifier model and slightly lower for the NB model. Similarly, the study on the coronavirus spike protein also showed comparable good performances, though the study did not use machine learning classifier models other than the RF model (Qiang et al., 2020). The Random Forest algorithms also found to work well for the spike protein feature selection process in this study, which was consistent with a previous study on anti-angiogenic activity peptides in silico using a generalized linear model and feature selection (Blanco et al., 2018). The protein feature selection results in this study also get supported by earlier published scientific literatures, where the RF algorithm commonly used for feature selections of prediction problems and can rank the importance of the features in a large scale to discriminate the different categories (Ebina et al., 2011, Svetnik et al., 2003).

In the performance evaluation of the feature selection process, although our study found that the Random Forest machine learning had comparable results from the five amino acid feature extraction methods we employed, Qiang et al. (Qiang et al., 2020) concluded that the performance of the RF was superior for the spike protein sequence features extracted by G-GAP algorithm than for the other two employed in the study (i.e., the AAC and PseAAC). Nevertheless, we had not tested features extracted by a G-GAP method, which partly constrained the comparison of our findings with this study.

Nevertheless, considering the large number of sequences in our dataset, the RF classifier model performed optimally for the machine learning model classification of the spike protein features extracted by the five descriptors we employed. Our study also compared the performances of the RF model with the NB machine learning classifier model on the same dataset and clearly reveled that it fits well for big data as accompanied by published literatures (Liao Y et al., 2013, Victor M. Herrera et al., 2019).

(37)

37

only a few spike protein sequences from the non-human coronaviruses (Bat and Pangolin coronaviruses) had shorter distances with the SARS-CoV-2. One reason for the observed misclassification of sequences could be due to the fact that most of the human coronavirus infection were originated from these animal coronaviruses, particularly the Bat associated strains (Corman et al., 2018, Cui et al., 2019).

Results of the distance matrix showed that a large proportion of the spike protein sequences of both human coronaviruses and non-human coronaviruses had similar ranges of distance matrices between 0.8 and 0.9 to the reference sequence SARS-CoV-2, while only a small proportion had distance matrices below 0.5. This suggested that only a small proportion of the coronaviruses of both human origin and non-human origin would efficiently infect and cause infection in the human population with high probabilities of human-to-human transmission potential. Of all the spike protein sequences analyzed in this study, the Bat coronavirus RaTG13 had the smallest distance matrix to the reference SARS-CoV-2 sequence followed by the Pangolin coronavirus. These findings revealed that the on-going SARS-COV-2 was evolutionary close to Bat coronaviruses and Pangolin coronaviruses even than the coronavirus spike protein sequences originated from the human host.

Therefore, the distance matrix results of this study suggested that Bat-origin and Pangolin-origin coronaviruses have high probability pandemic risk among the human population if efficient adaptations to human cells with human-to-human transmission meet. Our findings were supported by a previously published review study on hosts and sources of endemic human coronaviruses, where there is a strong evidences suggesting the origin of four endemic coronaviruses existed with bats and rodents (Corman et al., 2018).

In addition, the epidemic lineage of the 2002/2003 SARS-CoV was believed to have been originated from bats though human beings got the infection from wild game CoVs such as civet cats (Drexler et al., 2010, Drosten et al., 2003, Ge et al., 2013, Yang et al., 2015). Other published articles also stated that while its human emergence was believed to have been facilitated through intermediate hosts in wet markets of southern China, the SARS-CoV outbreak arisen from bats of the genus Rhinolophus (Pike et al., 2010).

(38)

38

humans acquired the CoV infection. However, genetically the bat-associated MERS-CoV was assessed to be far away than the bat-associated MERS-CoVs related to SARS-MERS-CoV (Corman et al., 2018).

Overall, these studies supported our findings that the spike protein sequence of the Bat coronavirus had the smallest distance matrix to the on-going SARS-CoV-2 compared to other coronaviruses. This could suggest that spike protein sequences of bat-associated CoVs would evolutionarily be the most related ones to the on-going SARS-CoV-2 pandemic. Therefore, bat-associated CoVs still have unknown risks of causing epidemic and pandemic coronavirus diseases in a completely naïve human population would they find appropriate intermediate hosts and establish human-to-human transmissions.

The metric multidimensional scaling analyses plots (Figure 9 – 11) represented the distance matrix and visualized with a sequence space. For instance, Figure 10 demonstrated a scree plot with a histogram showing the eigenvalues of each component that determined the optimal number of components used to describe the coronavirus spike protein sequence data in the context of metric MDS. It is also a diagnostic tool for checking whether the principal component analysis (PCA) works well on our data or not. As the principal components were created in order of the amount of variation they covered, our data had three component numbers where the first captured the most variation, the next captured the second most variations, and the third captured the least variations in the coronavirus spike protein sequences analyzed in this study. Figure 10 presented a scatter plot of human origin coronaviruses corresponding to the first two components of the MDS analysis of spike protein sequences alignment. The scatter plot plausibly presented six spike protein sequence groups/linages from the human origin coronaviruses that could potentially reveal the evolutionary drift. The projection presented on Figure 11 also demonstrated the active data space represented by the black dots/balls obtained by the MDS analyses of the human origin coronavirus spike protein sequences and the supplementary data represented by pink color by MDS analyses of all coronavirus spike protein sequences. The multidimensional scaling analyses presented in this figure also revealed six clusters, which enabled to group the coronavirus spike protein sequences showing their evolutionary pathways.

(39)

39

4.2. Conclusion

The Random Forest algorithm revealed that all the five feature extraction descriptors including AAC, DC, TC, PseAAC, and APseAAC had good performances and achieved over 94% overall accuracy. Compared to the Näive Bayes classifier model, the Random Forest model had better performances in classifying the coronavirus spike protein features. However, only the Random Forest and Näive Bayes classifier models were employed in this study and future benchmark studies with additional machine learning classification models might be valuable to evaluate their performances in classifying the virus protein sequences.

The distance matrices and multidimensional scaling analyses of the spike protein sequences in this study found that Bat coronaviruses and Pangolin coronaviruses had short evolutionary distances with SARS-CoV-2. This revealed that potently pandemic risk of the human coronaviruses could still arise from Bat-associated coronavirus strains. The metric multidimensional scaling analyses also demonstrated that the coronavirus spike protein sequences clustered to several groups.

In this study, we successfully employed several R packages mainly protr, msa and bios2mds with the attempt to build an initial pipeline, where we synchronized their flexible frameworks in performing the different functions including importing large number of spike protein sequences of the coronaviruses, multiple sequence alignments, computing distance matrices, and visualization the evolutionary dynamics of the coronaviruses.

(40)

40

5. Bioinformatics pipeline for interspecies transmission and pandemic risk

predictions

Since the beginning of the 21st century, the world has been seriously challenged by successive coronavirus outbreaks with increasingly high fatality rates, especially from the SARS-CoV-2 (also known as COVID-19). This study demonstrates a possible bioinformatics approach that could serve as a starting point for predicting interspecies transmission and pandemic risks of coronaviruses for early warning.

Here is a step-by-step description of the proposed approach: 1. It is design based on R programming.

2. FASTA formatted complete sequences of the spike proteins are the input sequences. 3. The readFASTA function in the protr R package used to load the FASTA sequences. 4. The feature extraction methods that could be used include amino acid composition,

dipeptide composition, tripeptide composition, pseudo amino acid composition, or amphiphilic pseudo amino acid composition.

5. randomForest package in R is used for Random Forest classification model. 6. Using the plot.roc() function in pROC package in R the area under the receiver

operating characteristic (ROC) curve (AUC) could be calculated

7. The readAAStringSet() function from multiple sequence alignment (msa) package in R should be used to read the FASTA formatted sequence. Then result should be

converted from similarities into dissimilarities for later processing using msaConvert().

8. The seqinr package in R used for distance matrix computation, where distance matrices computed for protein sequences using the dist.alignment() function.

9. The mat.dif() function from the bios2mds package used to compute the distance matrix of pairwise distances between sequences on the basis of sequence difference and used for the MDS analysis.

(41)

41

11. mmds.project() function from the bios2mds package performs projection of supplementary elements onto the active space.

12. scree.plot() function from the bios2mds package displays a bar plot of the eigenvalues obtained by MDS.

13. mmds.2D.plot() function from bios2mds draws a scatter plot of the active elements on a 2D space, if present, of supplementary elements, after a metric MDS analysis. 14. For phylogenetic tree construction a plot() function from the ape package in R should

(42)

42

6. Scientific contributions

This research employed bioinformatic tools including protein feature extraction descriptors, machine learning models and distance matrix and multidimensional scaling studies of the spike protein sequences of the coronaviruses. The results from this research project could potentially help in analyzing the coronavirus spike protein patterns in relation to the infection risks of the viruses to humans from their hosts. It also enabled to make predictions of the interspecies transmissions and potential pandemic risks of coronaviruses. This potentially provides useful information for developing early warning and intervention strategies against potential coronavirus pandemics in a completely naïve human population.

In addition, such machine learning bioinformatic study could possibly be extended to assess antigenicity of spike proteins, to screen the protein features for potential candidate vaccines, as well as for development of potential anti-coronavirus drugs acting as virus entry inhibitors. Further this bioinformatic study could be improved for future treatment efficacy assessments and for monitoring emergences of drug resistance mutations for pathogenic microbial infections.

7. Future studies

Zoonotic virus outbreaks including coronavirus pandemic have seriously challenged the world witnessing the need for fast, but also robust methods for understanding of animal viruses. Despite the considerably huge efforts of the public health sectors, several important scientific research questions remain to be answered in relation to the ongoing SARS-CoV-2. This study tried to predict the interspecies transmission and pandemic risks of coronaviruses and proposed a possible bioinformatics pipeline for similar studies on viral infections.

(43)

43

In this study, we employed only five of the feature extraction descriptors and two of the existing machine learning classification models. Hence, future studies with additional feature extraction methods and machine learning classifier models might be valuable. In addition, this study analyzed only the spike protein sequences, a more complete information might be achieved from analysis of the entire viral genome the coronaviruses.

8. Ethical aspects and impact on society

Only publicly available protein sequences were used. Therefore, prior ethical approval was not required for this study. However, all sources of the data analyzed in this study were clearly cited.

The coronavirus has challenged the world with successive outbreaks during the last two decades, especially the emergence of the on-going SARS-CoV-2 pandemic has imposed a global socioeconomic impact affecting the health, economy, and social relations. Results from this research maybe useful in providing information on cross-species transmission of the coronaviruses that could help the public health authorities to design effective prevention and early warning strategies against potential pandemics in a completely naïve human population.

9. Acknowledgement

(44)

44

10. References:

ANISHETTY, S., PENNATHUR, G. & ANISHETTY, R. 2002. Tripeptide analysis of protein structures. BMC

Struct Biol, 2, 9.

BENGIO, Y., COURVILLE, A. & VINCENT, P. 2013. Representation learning: a review and new perspectives. IEEE Trans Pattern Anal Mach Intell, 35, 1798-828.

BHASIN, M. & RAGHAVA, G. P. 2004. Classification of nuclear receptors based on amino acid composition and dipeptide composition. J Biol Chem, 279, 23262-6.

BLANCO, J. L., PORTO-PAZOS, A. B., PAZOS, A. & FERNANDEZ-LOZANO, C. 2018. Prediction of high anti-angiogenic activity peptides in silico using a generalized linear model and feature selection.

Sci Rep, 8, 15688.

BODENHOFER, U., BONATESTA, E., HOREJS-KAINRATH, C. & HOCHREITER, S. 2015. msa: an R package for multiple sequence alignment. Bioinformatics, 31, 3997-9.

BRIERLEY, L. & FOWLER, A. 2020. Predicting the animal hosts of coronaviruses from compositional biases of spike protein and whole genome sequences through machine learning.

CAO, D. S., XU, Q. S. & LIANG, Y. Z. 2013. propy: a tool to generate various modes of Chou's PseAAC.

Bioinformatics, 29, 960-2.

CHAN, J. F., TO, K. K., TSE, H., JIN, D. Y. & YUEN, K. Y. 2013. Interspecies transmission and emergence of novel viruses: lessons from bats and birds. Trends Microbiol, 21, 544-55.

CHEN, Z., ZHAO, P., LI, F., LEIER, A., MARQUEZ-LAGO, T. T., WANG, Y., WEBB, G. I., SMITH, A. I., DALY, R. J., CHOU, K. C. & SONG, J. 2018. iFeature: a Python package and web server for features extraction and selection from protein and peptide sequences. Bioinformatics, 34, 2499-2502. CHENG, V. C., LAU, S. K., WOO, P. C. & YUEN, K. Y. 2007. Severe acute respiratory syndrome

coronavirus as an agent of emerging and reemerging infection. Clin Microbiol Rev, 20, 660-94. CHOU, K.-C. 2009. Pseudo Amino Acid Composition and its Applications in Bioinformatics, Proteomics

and System Biology. Current Proteomics. Current Proteomics, Volume 6

CHOU, K. C. & CAI, Y. D. 2005. Prediction of membrane protein types by incorporating amphipathic effects. J Chem Inf Model, 45, 407-13.

CHU, D. K., OLADIPO, J. O., PERERA, R. A., KURANGA, S. A., CHAN, S. M., POON, L. L. & PEIRIS, M. 2015. Middle East respiratory syndrome coronavirus (MERS-CoV) in dromedary camels in Nigeria, 2015. Euro Surveill, 20.

CHU, D. K., POON, L. L., GOMAA, M. M., SHEHATA, M. M., PERERA, R. A., ABU ZEID, D., EL RIFAY, A. S., SIU, L. Y., GUAN, Y., WEBBY, R. J., ALI, M. A., PEIRIS, M. & KAYALI, G. 2014. MERS coronaviruses in dromedary camels, Egypt. Emerg Infect Dis, 20, 1049-53.

CHUO, K.-C. 2005. Using amphiphilic pseudo amino acid composition to predict enzyme subfamily classes. Bioinformatics, Volume 21, Issue 1, 1 January 2005, Pages 10–19,

https://doi.org/10.1093/bioinformatics/bth466. Bioinformatics, Volume 21, 10. CNCB-NGDC 2020. China National Center for Bioinformation 2019 Novel Coronavirus Resource

(2019nCoVR). Available at: https://bigd.big.ac.cn/ncov/?lang=en, accessed 19 November 2020.

CORMAN, V. M., MUTH, D., NIEMEYER, D. & DROSTEN, C. 2018. Hosts and Sources of Endemic Human Coronaviruses. Adv Virus Res, 100, 163-188.

CUI, J., LI, F. & SHI, Z. L. 2019. Origin and evolution of pathogenic coronaviruses. Nat Rev Microbiol, 17, 181-192.

DECARO, N. & LORUSSO, A. 2020. Novel human coronavirus (SARS-CoV-2): A lesson from animal coronaviruses. Vet Microbiol, 244, 108693.

References

Related documents

Regarding the results of the assembly line using production flow control, even after the improvement of the hypothesis approach developed in Scenario 2.1, the

By introducing electron withdrawing groups to the dienophile, the LUMO will be stabilized and lower in energy.. This will decrease the gap to the HOMO, consequently facilitating

The prediction algorithm uses three types of data: (i) target protein sequences among which complexes are to be predicted, (ii) structures of protein complexes to be used as

An index patient with an idiopathic autoinflammatory diseased was sequenced for over ~1900 immunological genes, and their regulatory elements, in a Targeted Sequence Capture

It is using hydrophobicity and charge bias in the pre-processing and a hidden Markov model as classifier (which hydrophobicity scale and learning algorithm used are not men- tioned

It is using hydrophobicity and charge bias in the pre-processing and a hidden Markov model as classifier (which hydrophobicity scale and learning algorithm used are not men- tioned

Saccharomyces cerevisiae protein sequence regions that lack homologs in other species were termed orphan domains. Orphans were defined at four different levels with

Regarding to the effectiveness of KPIs, the “12 characteristics of Effective KPIs” defined by Wayne W.Eckerson[3] was chosen, because I believe that the 12