Machine learning methods for seasonal allergic rhinitis studies

(1)

Linköping University | Department of Computer and Information Science

Master’s thesis, 30 ECTS | Statistics and Machine Learning

2021 | LIU-IDA/STAT-A--21/006--SE

Machine learning methods for

seasonal allergic rhinitis studies

Zijie Feng

Supervisor : Oleg Sysoev Examiner : Jose M. Peña

(2)

Upphovsrätt

Detta dokument hålls tillgängligt på Internet - eller dess framtida ersättare - under 25 år från publicer-ingsdatum under förutsättning att inga extraordinära omständigheter uppstår.

Tillgång till dokumentet innebär tillstånd för var och en att läsa, ladda ner, skriva ut enstaka ko-pior för enskilt bruk och att använda det oförändrat för ickekommersiell forskning och för undervis-ning. Överföring av upphovsrätten vid en senare tidpunkt kan inte upphäva detta tillstånd. All annan användning av dokumentet kräver upphovsmannens medgivande. För att garantera äktheten, säker-heten och tillgängligsäker-heten ﬁnns lösningar av teknisk och administrativ art.

Upphovsmannens ideella rätt innefattar rätt att bli nämnd som upphovsman i den omfattning som god sed kräver vid användning av dokumentet på ovan beskrivna sätt samt skydd mot att dokumentet ändras eller presenteras i sådan form eller i sådant sammanhang som är kränkande för upphovsman-nens litterära eller konstnärliga anseende eller egenart.

För ytterligare information om Linköping University Electronic Press se förlagets hemsida http://www.ep.liu.se/.

Copyright

The publishers will keep this document online on the Internet - or its possible replacement - for a period of 25 years starting from the date of publication barring exceptional circumstances.

The online availability of the document implies permanent permission for anyone to read, to down-load, or to print out single copies for his/hers own use and to use it unchanged for non-commercial research and educational purpose. Subsequent transfers of copyright cannot revoke this permission. All other uses of the document are conditional upon the consent of the copyright owner. The publisher has taken technical and administrative measures to assure authenticity, security and accessibility.

According to intellectual property law the author has the right to be mentioned when his/her work is accessed as described above and to be protected against infringement.

For additional information about the Linköping University Electronic Press and its procedures for publication and for assurance of document integrity, please refer to its www home page: http://www.ep.liu.se/.

(3)

Abstract

Seasonal allergic rhinitis (SAR) is a disease caused by allergens from both environmen-tal and genetic factors. Some researchers have studied the SAR based on traditional genetic methodologies. As technology develops, a new technique called single-cell RNA sequenc-ing (scRNA-seq) is developed, which can generate high-dimension data. We apply two machine learning (ML) algorithms, random forest (RF) and partial least squares discrimi-nant analysis (PLS-DA), for cell source classification and gene selection based on the SAR scRNA-seq time-series data from three allergic patients and four healthy controls denoised by single-cell variational inference (scVI). We additionally propose a new fitting method consisting of bootstrap and cubic smoothing splines to fit the averaged gene expressions per cell from different populations. To sum up, we find that both RF and PLS-DA could provide high classification accuracy, and RF is more preferable, considering its stable per-formance and strong gene-selection ability. Based on our analysis, there are 10 genes hav-ing discriminatory power to classify cells of allergic patients and healthy controls at any timepoints. Although there is no literature founded to show the direct connections be-tween such 10 genes and SAR, the potential associations are indirectly confirmed by some studies. It shows a possibility that we can alarm allergic patients before a disease outbreak based on their genetic information. Meanwhile, our experiment results indicate that ML algorithms may discover something between genes and SAR compared with traditional techniques, which needs to be analyzed in genetics in the future.

key words: Machine learning, Seasonal allergic rhinitis, Random forest, Partial least squares discriminant analysis, Bootstrap, Cubic smoothing splines, Single-cell RNA se-quencing.

(4)

Acknowledgments

First of all, I would like to show my deepest gratitude to Oleg Sysoev, my supervisor. Thank you for your guidance and encouragement. Your recognition for my work is a great motivation for my future study.

Thanks to my examiner, Jose M. Peña, for concise feedback on my thesis project. All your feedback is important to make my thesis more academic and constructive. I would also like to thank Aashana Nijhawan, my opponent, to do peer review.

Furthermore, thanks to everyone in the CPMed research group. Throughout my thesis process, you introduced me to an interesting and necessary field in statistical application and left me a precious experience in my master’s study.

Finally, I would like to thank Ting Xie, my girlfriend, for daily caring and her support in my thesis writing. Additionally, thanks to my friends Jiawei Wu, Andreas C. Charitos and Phillip Hölscher.

(5)

Glossary

MDA Mean decrease in accuracy.6 MDG Mean decrease in Gini.6 ML Machine learning.2,6

MT-genes mitochondrial genes.15 OOB Out-of-bag.10,13

PBMCs Peripheral blood mononuclear cells.1,3

PLS-DA Partial least squares discriminant analysis.2,6,13 RF Random forest.2,6,13

SAR Seasonal allergic rhinitis.1

scRNA-seq Single-cell RNA sequencing.1 scVI single-cell variational inference.2,5,6,19 VIP Variable importance in projection.6

(6)

List of Figures

2.2 Example matrix . . . 4

2.3 Data reorganization process . . . 5

3.1 Decision tree . . . 9

4.1 Genes’ heatmap with top20 ranked by the importance summation at 7 timepoints based on reorganized data . . . 14

4.2 Mitochondrial membrane protects content during cell lysis . . . 15

4.3 Extra quality control . . . 16

4.4 Genes’ heatmap with top20 ranked by the importance summation at 6 timepoints based on reorganized data . . . 17

4.5 Genes’ heatmap example 1 from figure 4.3. . . 18

4.6 Genes’ heatmap example 2 from figure 4.3. . . 18

4.7 Genes’ heatmap sorted by the importance summation at 6 timepoints based on denoised data . . . 20

4.8 Importance comparison between figures 4.4 and 4.7 . . . 21

4.9 Histograms of variable importances computed from section 4.3 . . . 22

4.10 Histograms of variable importances computed from section 4.4 . . . 22

(9)

List of Tables

2.1 Sampling records . . . 4

4.1 Genes’ count in section 4.1 . . . 13

4.2 Error rates in section 4.1 . . . 14

4.3 Genes’ count in section 4.4 . . . 19

4.4 Error rates in section 4.4 . . . 19

(10)

1 Introduction

1.1 Background

1.1.1 Seasonal allergic rhinitis

Allergic diseases are complex syndromes, which occurs when the immune system overre-acts to allergens in the environment. Allergic rhinitis is a common allergic disease affecting 40% of the population in Europe (D’Amato et al., 2007), and around 10-25% of the world-wide population (Wang,2005). However, the actual prevalence might be underestimated. The symptoms of allergic rhinitis are similar to the common cold. So it is hard for patients to recognize rhinitis themselves, and thereby seeing the doctors. Allergic rhinitis can separate into three different kinds, including perennial, seasonal, and occupational. In this thesis, we only focus on seasonal allergic rhinitis (SAR). SAR is mainly triggered by seasonal allergens, such as pollens (Bousquet et al.,2001). It is general that SAR resulted from environmental factors, but can be a hereditary disease (Akdis and Akdis,2007). For instance, research about twins and families proves the genetic predisposition in SAR (Yokouchi et al.,2002).

In medicine, it is useful to know the diagnostic biomarkers or factors for early disease prevention before disease symptom. Hence, it is possible to predict the earliest timepoint when SAR outbreak based on genetic knowledge. Some important SAR factors have been found in previous genetic studies. For example, Benson et al. (2002) identify that gene VEGF-A may be an important mediator in SVEGF-AR based on microarray technology. VEGF-Additionally, Wang et al. (2010) find 209 differentially-expressed genes in accordance with the microarray data from allergen-challenged peripheral blood mononuclear cells (PBMCs).

As technology develops, some new approaches are developed and popular in the past few years. Single-cell RNA sequencing (scRNA-seq) is a new technique that generates high-resolution data allowing to do the genome-wide analysis of mRNAs (a molecule contain-ing the genetic sequences and ready to produce proteins) in cells (Mongia et al.,2018). The scRNA-seq can measure gene expressions at a single cell, thus providing more information about individual cells. By contrast, the microarray is the technique used in the previous two SAR studies (Benson et al.,2002; Wang et al.,2010). It can only be used for bulk data and may whereby resulting in the loss of genetic information. Due to the high dimensionality of

(11)

1.2. Research aim

scRNA-seq data, it overcomes the limitations of traditional experiments yield (Torroja and Sanchez-Cabo,2019).

1.1.2 Examples of machine learning applications in genetics

Machine learning (ML) algorithms are increasing popular in bioinformatics since of their great performance to treat high-dimensional genetic data. Pérez-Enciso and Tenenhaus (2003) find that partial least squares discriminant analysis (PLS-DA) shows a great performance for microarray data from breast cancer. After that, Hsueh et al. (2013) notice that the random for-est (RF) method has discriminative power for different gene sets and the variable importance of RF can measure genes’ importance concurrently.

Except for traditional ML algorithms, the neural network is also widely used. For exam-ple, Lopez et al. (2018) develop single-cell variational inference (scVI), a multipurpose tool for both scRNA-seq data processing and analysis via stochastic optimization and neural net-work. Besides, Torroja and Sanchez-Cabo (2019) quantify the scRNA-Seq data about colorec-tal and breast cancer and show high prediction accuracy for cell livability via neural network.

1.2 Research aim

As mentioned in section1.1.1, some SAR-associated genes have been found based on mi-croarray technique and bioinformatics analysis. The technique of scRNA-seq and the method of ML show their potential for better predicting SAR, which is still not applied in the existing studies. Thus, this thesis aims to apply ML algorithms for SAR time-series scRNA-seq data and answer the following questions:

a. Which ML algorithm is suitable for classification based on scRNA-seq data and can measure the gene importance for classification?

b. Among the genes which might be the biomarker of SAR, how such genes’ expressions change over time?

c. What are the earliest timepoints (in hours) such genes’ expressions differ between dif-ferent classes?

d. Do these genes have any biological functions associated with SAR?

We hope this study can fill the disease-associated gene selection gap based on scRNA-seq data and applying by ML algorithms, and provide basics for further research focusing on SAR disease.

(12)

2 Data

Fig. 2.1.The data collection process

Figure2.1explains the process of sample sequencing and data collection. The scRNA-seq data is collected from blood samples by the Centre for Personalised Medicine (CPMed) group at Linköping University.

The immune cells reacting with SAR belong to the peripheral blood mononuclear cells (PBMCs), which are the subject of sequencing. The PBMCs consist of lymphocytes and mono-cytes, and they are two different kinds of cells. Furthermore, this thesis’s samples were ex-tracted from three allergic persons (patients) and four non-allergic persons (healthy controls). All samples were sequenced once before they were challenged by an allergen (Birch pollen). After that, the samples were left to stand after the allergen challenge, and researchers se-quenced the same samples several times to collect scRNA-seq time-series data.

Table2.1presents the sampling records. The cell numbers are not positively relative to the sample numbers at different timepoints. Such a phenomenon results from the randomness of the scRNA-seq technique in belief. When sequencing a specific sample, around 2000 cells were loaded into the sequencer’s experiment wells based on the volume and concentration of such a sample. There should be one cell in one well, but it is possible that some wells failed to catch cells or caught more cells. The sequencer read the information from wells randomly.

(13)

Only the cells with enough read times would be shown in the output data. Consequently, the randomness of scRNA-seq causes the difference in cell numbers in different samples.

0h 12h day1 day2 day3 day4 day5 day6 day7

Patient 1 C A A A A A A Patient 2 C A A A A A A Patient 3 C A A A A A A Healthy control 2 C A A A A Healthy control 5 A A Healthy control 7 C A A A Healthy control 8 C A A A

Total number of cells 5115 1259 2957 3552 2271 2884 2074 Total number of genes 25713 25713 25713 25713 25713 25713 25713

Table 2.1: This table is the sequencing records. Each row represents a person, and each column represents a specific timepoint. The timepoints consisting of the initial timepoint (0h), 12 hour after challenge (12h), first day after challenge (day1), second day after challenge (day2), third day after challenge (day3), fifth day after challenge (day5) and seventh day after challenge (day7). Except for the last two lines, in the table are the indicate controls. C means the data without challenge, A means the data with challenge. The blank area indicates either invalid sequencing or sequencing absence. The last two lines show the total number of cells and genes in the scRNA-seq data.

The scRNA-seq data was processed into digital gene expression matrices following drop-seq core computational protocol (Nemesh,2016). The gene alignment and identification was referred from human genome overview GRCh38 (April 2017, Ensembl). To be more detailed, the scRNA-seq data is a matrix for a sample at a specific timepoint. In the rows are different genes, and in the columns are different cells. Each cell is represented by a combination of the person’s title, timepoint, indicate control, and cell identifier. The cell identifier has no meaning. It is unique in each cell, but the identifier could be duplicated in different cells. The value in the matrix represents the gene expression level of such a gene in a specific cell counted by singe-cell RNA sequencer.

Fig. 2.2. An example matrix from patient 3 at timepoint 0h. It exemplifies the expression levels of six different genes in two different cells.

(14)

2.1. Data reorganization

2.1 Data reorganization

Fig. 2.3.Assume each matrix is sequenced from one person at a specific timepoint (e.g., day1). One matrix represents an allergic person and another represents a non-allergic person. The indexes of genes are the same in both matrices, but * distinguishes the indexes of cells. After concatenation and transposition, the cells are on the row. Meanwhile, the genes and target are on the columns.

A special feature of scRNA-seq data is that the sequenced cells are different from person to person. Meanwhile, the sequenced cells from one person could also be different during different timepoints. Thus, it is hard to apply traditional time-series algorithm. By contrast, a gene can express in any cells, but its expressions are different in different cells. Also, its expression in one cell are changeable as time grows.

According to the cells’ independence in scRNA-seq data, figure2.3indicates how we reor-ganize the data. We extract the matrix from each person’s sample and concatenate the matri-ces by the genes’ intersection and cells’ union at each timepoint. After that, we add a target column to all the cells, representing whether such cells are from allergic patients or healthy controls. The data we finally utilize is the one after transposition. Therefore, the different cells are treated as observations and genes as variables. Before modeling, the final matrix should be separated into training and testing sets. To balance different sets, we randomly select 70% cells from allergic and non-allergic data for the training set, and treat all remaining 30% cells as the testing set.

2.2 Data denoising

Another special feature of scRNA-seq is the sparsity of the data, which means there are abundant zeros in scRNA-seq data. This unpredictable noise arises for both biological and technical reasons. For example, a gene is not expressed for some cells, or a gene is expressed but not detected by the sequencer. The sparsity of scRNA-seq data increases the cost of mod-eling and analysis difficulty (Vallejos et al.,2017).

To reduce the data sparsity of scRNA-seq, Lopez et al. (2018) propose a method called single-cell variational inference (scVI) to denoise scRNA-seq data. It completes and normal-izes the scRNA-seq data based on the neural network automatically. In this thesis, the data would be denoised by scVI.

(15)

3 Methodology

In this thesis, two widely-used ML algorithms in bioinformatics will be applied to classify whether the cells are from allergic or non-allergic persons based on the data from section2.1. The first is partial least squares discriminant analysis (PLS-DA), and the second is random forest (RF). Since the gene expression is changeable over time, the correlation of cells between persons is difficult to measure. For the convenience of modeling, we assume that all the cells are statistically independent. After modeling, the algorithm with better performance will be chosen, and its result will be analyzed. Next, we model once again based on the denoised data by scVIfor comparison. We will select the combination of algorithms and data that generates the best result. The genes shown in such combination’s result are more reasonable to be SAR-associated.

The gene importance for classification is measured based on variable importance. Notwith-standing, there is noMLcriterion for disease-associated gene selection at present. For exam-ple, Pérez-Enciso and Tenenhaus (2003) choose disease-related genes due to variable impor-tance in projection (VIP) by PLS-DA with cutoff. Hsueh et al. (2013) confirm the reliability of the mean decrease in Gini (MDG) of RF. Moreover, Shaik and Ramakrishna (2014) choose significant genes depending both on mean decrease in accuracy (MDA) by RF and VIP by PLS-DA based on their corresponding cutoffs. Furthermore, all studies above are not based on scRNA-seq data, which cannot guide us to some extent.

To avoid this problem, we only pay attention to the genes with non-zero variable impor-tances and appearing at all timepoints. For a ML algorithm, we will train a model at each specific timepoint, and thereby obtaining seven independent time-based models. The gene with non-zero importance in a particular model indicates that such a gene contributes to the classification of whether the cells are from patients or controls at the corresponding time-point. Hence, a gene is possible to be a biomarker of SAR, if it has non-zero importances in

(16)

3.1. Partial least squares discriminant analysis

After the gene selection, our second step is to fit such selected genes’ expressions in a partic-ular cell from different populations. Thus, we propose a method consisting of bootstrap and cubic smoothing splines to fit the averaged expression of cells from allergic and non-allergic persons, respectively. Then we can analyze how the genes’ expressions varying over time and find the earliest timepoints when such genes express differently between populations.

3.1 Partial least squares discriminant analysis

Partial least squares (PLS) or partial least squares regression is a statistical approach for prediction when data has multi-collinearity and the number of variables is larger than of ob-servations, such as scRNA-seq data. It is also a method combining dimension reduction and regression analysis. Although PLS is not designed for classification initially, it has demon-strated good results in classification problem (Rohart et al.,2017).

3.1.1 Partial least squares

Assume the data contains a variable matrix X and a response vector y, while there are J variables and K observations in X, and y is K-dimensional as well. The aim of PLS is to compute a component matrix T (K ˆ M) satisfying

X=TP+E, (3.1)

y=Tq+f, (3.2)

where E and f are error terms. Items P and q are corresponding loadings of X and y to the component matrix T, respectively. To find the best number M of components, the PLS weight vector is firstly computed as

w=X1y, (3.3)

and a corresponding component is

t= Xw

ař w2. (3.4)

Secondly, the loading from y to the current component space is

q= y

1 t

ř t2, (3.5)

and the residuals of y are

y˚=y ´ tq. (3.6)

Meanwhile, the loading from X to the current component space is

p= t

1 X

ř t2, (3.7)

and the residuals of X are

X˚₌_{X ´ tp.} _(3.8)

If more components are considered, the step will return to equation3.3, but X and y would be substituted by their residuals X˚_{and y}˚_{, respectively. The number of components M are}

extracted until

Q2Ym=1 ´PRESSm

RSSm´1

ă0, (3.9)

where PRESSm means the predicted residual sum of squares of model including the

cur-rent component m estimated by cross-validation with y-permutation testing (Thévenot et al., 2015). Additionally, RSSm´1means the residual sum of squares of model with the previous

(17)

3.2. Random Forest

components. It is noticed that each w is a J-dimensional vector. A J ˆ M weight matrix W will be created if we consider all components. Also, the component matrix T is the collection of all components t, and t and p are two K-dimensional vectors. A prediction process can be written as

ˆy=X(W(PW)´1q) +f. (3.10)

3.1.2 Partial least squares discriminant analysis

Partial least squares discriminant analysis (PLS-DA) is a special case of PLS when the re-sponse vector y is a sequence of categorical variables, which can be assigned as +1 and -1 if there are just two discrete classes. For PLS-DA, if a specific element ˆy from ˆy is a positive value it belongs to a class, otherwise to another class if ˆy is negative (Brereton and Lloyd, 2014).

3.1.3 Variable importance in projection

The variable importance of PLS can be identified using variable importance in projection (VIP) score, which is a weighted sum of squares of partial least squares loadings (Shaik and Ramakrishna,2014). Its formula for a specific variable j follows

V IPj=

?

M ˆ |wj| where j=1, 2, ..., J, (3.11)

where wjis a M-dimensional vector from weight matrix W (Thévenot et al.,2015).

3.2 Random Forest

Random forest (RF) is an ML algorithm that combines decision tree and bagging. It has good performance to the data with high-dimension, without dimension reduction. Since the decision tree is a model easy to be over-fitting, it can benefit greatly from bagging.

(18)

3.2. Random Forest

3.2.1 Decision tree

Fig. 3.1.A decision tree for classification.

A decision tree is a model that can be visualized as a tree. It consists of one root node, many branch nodes and many leaf nodes. Each leaf represents an outcome, and each branch represents an attribute test in decision progress. The path from the root to a leaf corresponds to a series of decisions.

In the training process, the root contains all the samples, and such samples will be sepa-rated into child nodes according to various tests until they are assigned to leaves. The accu-racy of the decision tree can be improved if the tree is deep enough. However, the decision tree procedure also results in high-variance and low-bias (over-fitting) if the tree depth is large (Zhou, Wang, et al.,2006).

3.2.2 Bagging

Bagging is an extensive method of bootstrap (see section3.3.1) that assembles weak clas-sifiers into strong clasclas-sifiers. It is an idea to reduce variance based on the average of many noisy but approximately unbias models. For a specific data Z and bagging iteration B, it contains two steps for each iteration b ă B:

1. Extract samples randomly from Z with replacement, as new training set Zb,

2. Train a weak classifier fbbased on Zb.

For classification, the outcome is the majority of all weak classifiers’ results. For regression, the outcome is the mean of all weak classifiers’ results.

(19)

3.3. Averaged gene expression fitting

When the iteration number B is large, some observations are duplicated in each bagged set Zb, and some others are absent on the contrary. The possibility of a certain observation not

extracted by infinite sampling is lim BÑ8(1 ´ 1 B) B₌ 1 e «0.368, (3.12)

which means that around 36.8% observations are not selected in Zb. Such un-extracted

obser-vations ZzZbis called out-of-bag (OOB) data, and the overall OOB error rate is approximately

the same as cross-validation error rate (Hastie et al.,2009). Hence, such overall OOB error is usually used for hyper-parameter selection.

3.2.3 Random forest classifier

RF is a forest made of such de-correlated decision trees and outcomes the average of all trees’ results. For M observations with J variables, the modeling process and prediction of RF can be concluded as:

1. Draw a bagged set Zbwith M observations.

2. Grow a tree Tbbased on Zbby recursively repeating the following steps:

a. Select j ă J variables at random, b. Pick the best variable as split-point, c. Split the node into two child nodes.

The outcome of RF is the class in the majority of these trees’ results. Moreover, the inventors suggest that the maximal j is?J for classification (Breiman,2001).

3.2.4 Mean decrease in accuracy

In this thesis, mean decrease in accuracy (MDA) is selected as the variable importance of RF. MDA is also called permutation importance and is computed during the RF modeling. For each iteration b ă B, the calculation steps are:

1. Shuffle the observed values in variable j of OOB data and get a permuted OOB data, 2. Compute the permuted OOB accuracy a˚

b,

3. Compute the difference d˚

b =ab´a˚b.

For each variable j, the corresponding MDA is the average of a sequence of differences

MDAn = 1 B B ÿ b=1 d˚ b. (3.13)

If the MDA of a variable is small, it means that permuting the values of such variable would not affect the prediction accuracy. On the contrary, the larger the MDA is, the more important for prediction accuracy such variable has.

(20)

Besides, cubic smoothing splines is such an algorithm that balances the goodness of piece-wise polynomial regression and fitting by adding a smoothing penalty. It is utilized to fit gene expression during bootstrapping since our training data is not time-continuous.

3.3.1 Bootstrap

For a given data set Z= tx1, x2, x3, ..., xnu, bootstrap is a method to estimate the range of

a quantity ˆθ of such data without knowing the data distribution. If we extract another n-size samples from Z with replacement, it equals to sample from a discrete distribution like

P(X=x) =

" ₁

n x P(x1, x2, ..., xn),

0 otherwise. (3.14)

(Zhou, Xie, et al.,2008). The extract data set Zbis called bootstrap samples. After sampling

B times, we will get a sequence of data sets tZ1, Z2, ..., ZBu. An empirical quantity ˆθbis

com-puted for each Zb, and the variance of ˆθ can be estimated by the sample standard variance of

sequence t ˆθbuaccording to ˆσ= g f f e 1 B ´ 1 B ÿ b=1 (ˆθ_b´E(ˆθ)). (3.15)

3.3.2 Cubic regression splines

Cubic regression splines is also called cubic spline. Differ from normal cubic polynomial regression (three order polynomial regression), It firstly divides the data range into K+1 sub-ranges of equal length depending on K knots and assumes a piecewise cubic polynomial f =β(k)₀ +β(k)₁ xi+β(k)₂ x2_i +β(k)₃ x3_i k=1, 2, ..., K+1, (3.16) which can be fitted by the least-squares in piecewise sub-range particularly. Before regression, three constraints should be followed between each pair of adjacent polynomials:

1. The fitted line should be continuous at knots.

2. The first-order derivatives of polynomials from both sides of knots should be the same. 3. The second-order derivatives of polynomials from both sides of knots should be the

same as well.

3.3.3 Cubic smoothing splines

Cubic smoothing splines is a regularized cubic regression splines function. It adds a penalty to all the second-order derivate of piecewise polynomial and implies to minimize the residual sum of squares

RSS(f , λ) = N ÿ i=1 [yi´f(xi)]2+λ ż [f2(x)]2dx, (3.17) where λ is the penalty parameter and can be estimated using leave-one-out cross-validation (James et al.,2013). The piecewise polynomial will be a straight line if λ = 8. It is shrunk toward the linear fitting even there are many parameters (Hastie et al.,2009).

(21)

3.3.4 Fitting method combining bootstrap and cubic smoothing splines

Detailed fitting progress in this thesis is demonstrated in algorithm1. The earliest time-points that selected genes are differently expressed in cells between allergic and non-allergic persons would be the timepoints when the confidence intervals of their averaged expressions are not intersected any more.

Algorithm 1:Gene expression fitting by bootstrap and cubic smoothing splines Data:Data tAtucontains all cells from allergic persons where t means sequenced

timepoint, genes tGu, and bootstrap iteration number B. for G in tGu do

for t in 0h, ..., day7 do for b in 1,2,...,B do

1. Select an allergic person α randomly, 2. Extract all cells A˚

t of α from At,

3. Bootstrap sample a collection of cells A(b)_t from A˚ t,

4. Extract the expression levels of G from A(b)_t , 5. Calculate the averaged expression level mtof G.

end end

a. Train cubic smoothing splines using a list of averaged expressions tmtuas target and known timpoints as variables,

b. Fit the gaped timepoints from 0h to day7.

At present, there are B values computed by different cubic smoothing splines for every precise timepoints. The confidence intervals can be computed relying on 2.5% and 97.5% percentiles of these values.

end

Except for data tAtu, for cells from healthy controls (tHtu), the fitting process is the same

as above, except that the person α selected in step 1 is a non-allergic person instead.

It is noticed that we randomly select a person before bootstrap sampling. The reason is that the sequenced person populations are different at different timepoints, and we want to fit the genetic expression of a specific person. Depending on table2.1, six persons’ data are sequenced successfully at timepoint 0h, but only four at timepoint day7.

(22)

4 Experiment results and

discussion

In this thesis, we applyPLS-DAbased on ropls R-package (Thévenot et al.,2015) andRF based on ranger R-package (Wright and Ziegler,2015). At last, the bootstrap function is im-plemented from scratch, but the cubic smoothing spline function we use is built on splines R-package. All hyper-parameters are default, except that the number of decision trees is se-lected from 150 to 1200 with increment 150 for RF, and the best RF has the minimalOOBerror. The number of bootstrap replicates is B=1000 for the fitting of averaged gene expression.

4.1 Results based on reorganized data

Algorithm

Genes’ number Timepoint

0h 12h day1 day2 day3 day5 day7

RF 9503 7108 9279 9412 3524 11313 3164

PLS-DA 21929 17340 21060 21555 20629 22288 22129

Table 4.1: The above line represents the numbers of genes selected by RF at seven time-points, and the below represents the ones of PLS-DA. All models are trained relying on the reorganized data in section2.1, and only the genes with non-zero variable importances will be considered.

After modeling, table4.1shows the number of selected genes at each timepoint. The genes with zero importances would be ignored. It is obvious that RF select fewer genes than PLS-DA at all timepoints.

(23)

4.1. Results based on reorganized data

Algorithm

Error rate Timepoint

0h 12h day1 day2 day3 day5 day7 PLS-DA, train 0.000 0.001 0.000 0.000 0.000 0.170 0.000 PLS-DA, test 0.005 0.024 0.045 0.024 0.088 0.248 0.042 RF, train 0.000 0.000 0.000 0.000 0.000 0.000 0.000 RF, test 0.008 0.011 0.071 0.018 0.108 0.059 0.119

Table 4.2: This table shows the error rates at different timepoints of RF and PLS-DA, re-spectively. Both training data and testing data are from section2.1. Noted that all values are rounded to three decimal places, and the very low classification error rate 0.000 is not absolute zero.

Table4.2reveals that both ML algorithms of PLS-DA and RF are slightly over-fitting, but they can provide low classification error rates. Although there are four timepoints that PLS-DA’s test error rates are lower than RF’s, the test error rate of PLS-DA dramatically increases at timepoint day5, with 24.8% misclassification rate. It seems that the classification stability of PLS-DA is worse than RF.

(24)

4.2. Extra quality control

Considering the intersection, RF and PLS-DA select 1140 and 14888 genes appearing at all timepoints, respectively. While these genes are the ones we are looking for, because they have non-zero classification importances every timepoints and are possible to be SAR biomarkers between allergic and non-allergic persons.

Since RF’s gene selection ability and classification stability are better than PLS-DA, we use the former result for the further analysis. We rank all RF-selected genes based on the summa-tion of variable importances (MDA from RF) of all seven timepoints, and draw a figure4.1 with the top 20 genes. Since all seven models are independent, the importances of one gene might not be the same important if its MDAs are the same in two different timepoints. It is necessary to consider the proportion of a particular gene’s importances in the longitudinal (column) direction before discussing the importance change of such a gene at different time points.

It is worth noting that there are 23 mitochondrial genes (MT-genes) among 1140 genes selected by RF based on reorganized data. According to figure4.1, MT.RNR2 (a kind of MT-genes) contributes more to distinguishing the allergic persons’ cells than the other genes, especially at timepoint 12h. It means that the cell quality might be a significant index for distinguishing patients’ cells, which is abnormal in bioinformatics. SAR is not a disease that can lead to cell death, and the cell quality should be similar between different populations. Therefore, an extra quality control is needed to check the quality of reorganized data.

4.2 Extra quality control

The proportion ofMT-genes, such as MT.RNR2, MTRNR2L1, MTRNR2L12 and MT.RNR1 in figure4.1, in the cells is a vital quality indication of the cells (Luecken and Theis,2019). A mitochondrion is an organelle in a cell. Suppose a cell is lysed (dead) and its correspond-ing cell membrane is disrupted. In that case, some gene expression content will leak out of the cell, while the mitochondrial expressed content will not. It is because the mitochondrial membrane protects such mitochondrial expressed content. Thus, the higher proportion of MT-genes’ expression in a cell is, the lower quality of such a cell has.

(25)

4.2. Extra quality control

Fig. 4.3.There is a collection of timepoint-based scatter plots. Each point represents a cell. By calculating the reorganized data mentioned in section2.1(i.e., getting the row summation of all genes for a cell), the x-axis is the gene counts per cell. The y-axis is MT-genes’ proportion per cell (the row summation of MT-genes divided by the gene counts for a cell). In the sub-plot pair of each timepoint, the left represents the cells from healthy controls, and the right represents these from allergic patients.

In section2.1, we have removed the genes only appearing either in allergic or non-allergic persons at each timepoint. So the proportion of MT-genes per cell may slightly increase, and the gene counts per cell may decrease on the contrary. As shown in figure4.3, result shows that the minimal gene count per cell is 399. Only several cells have around 25% of MT-genes, and there is no cell with MT-genes proportion larger than 30%.

(26)

4.3. Results based on reorganized data without timepoint 12h

treat samples in the same ways (Vallejos et al., 2017). This is so called the high variability between technical replicates. According to the time and technical limitation, it is difficult to review whether it is because of some technical mistakes or a real biological variety. Since it is the only timepoint with unbalanced data, we decide to remove the data at timepoint 12h and its corresponding result after discussing with the whole research group. Referring to the data at timepoint 12h, further discussion could be done in the future studies. After removing timepoint 12h, a new heatmap will be drawn only based on the remaining six results.

4.3 Results based on reorganized data without timepoint 12h

Fig. 4.4.A heatmap based on on the reorganized data from six timepoints. It is drawn in the same way as figure4.1.

Since the models are independent based on timepoints, the results from the other six time-points will not change if the timepoint 12h is skipped. Algorithm RF selects 1204 genes in-tersected at six different timepoints. Meanwhile, Algorithm PLS-DA picks 15040 inin-tersected

(27)

4.3. Results based on reorganized data without timepoint 12h

genes. Because the performance of RF is still better than that of PLS-DA, we decide to analyze the result of RF.

There are still 23 MT-genes among such 1204 RF-selected genes. Figure4.4is drawn in the same way as figure4.1, besides ignoring timepoint 12h. In figure4.4, MT.RNR2 has a continuous-contribution for distinguishment between the cells from patients and controls at the beginning timepoints. Its pattern pecks at timepoint day2 but climbs down at the latter timepoints. The genes such as AL591846.1, MTRNR2L12, RPS14P8 and RPL13AP5 have sim-ilar patterns of variable inportances with MT.RNR2. In contrast, genes such as MTRNR2L1, AC093484.4, GIMAP7 and RPS27A have little importances in the early timepoints, but their contributions are greater for later timepoints.

Fig. 4.5.Such 5 genes have large variable importances in the early timepoints, but less in the later timepoints.

Fig. 4.6. Such 4 genes have small variable impotances in the earply timepoints, but they are more important in the end.

Except the genes with continuous-contribution, there are some genes that only have sig-nificant classification importance at one timepoint. For example, the gene S100A8 has huge importance at timepoint 0h compared with other timepoints and locates at top 20, although we have already ranked the genes in the figure4.4.

(28)

4.4. Results based on denoised data without timepoint 12h

other genes in general. We hypothesize that MT-genes are very selected as high-important genes due to some artifacts during sequencing, such as the technical noise specified in section 2.2. Therefore, we decide to usescVIto denoise the reorganized data and remodeling via RF and PLS-DA. The scVI depends on scvi-tools Python-package with default parameters (Lopez et al.,2018).

4.4 Results based on denoised data without timepoint 12h

Algorithm

Genes’ number Timepoint

0h day1 day2 day3 day5 day7 RF 3468 5885 4558 3337 8408 3642 PLS-DA 22662 21838 22402 21825 23097 24438

Table 4.3: This table reveals the number of selected genes based on denoised data via two ML algorithms.

Algorithm

Error rate

Timepoint

0h

day1

day2

day3

day5

day7

PLS-DA, train

0.000

0.002

0.000

0.003

0.001 PLS-DA, test

0.005

0.021

0.007

0.022

0.020

0.014 RF, train

0.003

0.014

0.003

0.021

0.016

0.010 RF, test

0.002

0.005

0.002

0.013

0.007

0.006 RF with 10 genes only, test

0.016

0.020

0.008

0.028

0.021

0.010

Table 4.4:The values in the table are error rates of RF and PLS-DA at six different timepoints. All models are trained relying on the denoised data.

Comparing table4.1and table4.3, the gene-selection ability of RF is extremely strong, ben-efiting from scVI denoising. The numbers of genes that RF chooses decrease by half approxi-mately. Oppositely, the numbers of genes selected by PLS-DA climb up slightly. Additionally, there are 10 and 17836 genes appearing all timepoints chosen by RF and PLS-DA, respectively. Table4.4is similar with table4.2, and it illustrates that the RF’s and PLS-DA’s error rates based on denoised data. Compared with table4.2, it is obvious that the classification accuracy of both algorithms improve greatly after scVI denoising. Although the classification stability and accuracy of PLS-DA increases greatly comparing with those based on reorganized data, it is still over-fitting. Also, the test error rates of PLS-DA are worse than RF’s at any timepoints in table4.4.

Therefore, it is still preferable to analyze the result of RF, due to its better performance to classify allergic persons’ cells. We create six extra RFs based on such 10 RF-selected genes, and their misclassification rates are low as well. The last row of table4.4demonstrates that the high association between such ten genes selected by RF and SAR. This is interesting because it means that the discriminative power of such 10 genes is close to the power of more than 3000 genes in table4.3.

According to tables4.2and4.4, there is another interesting phenomenon. The classifica-tion error rates at timepoint 0h are always equal and lower than at other timepoints when

(29)

considering all the genes, no matter which ML algorithms and data we use. One probable ex-planation is that more genes participate in the disease response after the allergen challenge, and then the scRNA-seq data becomes noisier. Regardless of the reasons, such a high ac-curacy for distinguishing the cells of allergic patients illustrates that the genetic expressions differ between allergic and non-allergic persons even before they are affected by allergens (pollen).

Fig. 4.7. A heatmap depends on the denoised data from six timepoints. It contains all 10 genes selected by RF, and ranks the genes using the same way as figure4.1.

The maximal MDA is only around 0.004 in figure 4.7, but 0.014 in figure4.4. In figure 4.7, MT.RNR2 and MT.RNR1 disappear, but MTRNR2L1 and MTRNR2L12 are still selected. Except for these two MT-genes, there are eight other genes, while seven of them are also selected in 1204 genes from reorganized data. The gene AL136133.1 is the only additional one found based on denoised data.

(30)

Fig. 4.8.Nine genes selected in both section4.3(left) and section4.4(right).

Nine tenth of genes in section4.4are included by the RF-selected result in section4.3, but the MDAs’ patterns of these nine genes are quite different in accordance with figure4.8. For example, GIMAP7 contributes greatly only at timepoint day3 in section4.4, but all timepoints except for day1 in section 4.3. On the other hand, although the contributions’ patterns of two MT-genes in section4.4are similar with in section4.3, they are not as important as in section4.3. For example, MTRNR2L1 has large variable importances based on the analysis of reorganized data except for the data at timepoint 0h, but only has obvious importances at timepoints day3 and day5 based on the denoised data.

These changes exist because more genes may have a greater discriminative power between patients and controls than when the data is not denoised. According to figures4.9and4.10, the maximal frequencies of MDAs from reorganized data (around ten thousand) are higher than from denoised data (around one thousand), and the histograms of MDAs from reor-ganized data are much narrower than those from denoised data. It means that most genes in section4.3have small variable importances (nearly zero), while the importances of most genes increase when the data is denoised. it is possible that more genes share the contribution to distinguishing patients’ cells after data denoising by scVI. Such a phenomenon may cause the continuous contribution of a gene in figure4.7being illegible.

Consequently, the performance of RF increases after denoising, and the data denoised by scVI could produce a more balanced variable importances compared with the reorganized data. Besides, the result obtained from denoised data are also highly coincident with from reorganized data. Hereafter, we continue the analyze the result of RF based on denoised data.

(31)

Fig. 4.9.Histograms of MDAs at each timepoints based on section4.3. The x-axis represents the different MDAs, while the y-axis represents the corresponding frequencies of such MDAs. Additionally, the y-axis is transposed to log10 scale, and bins=100 in geom_histogram function.

(32)

4.5. Fitted averaged gene expressions per cell

4.5 Fitted averaged gene expressions per cell

Fig. 4.11. Each subplot denotes a RF-selected gene from section4.4. The y-axis is the fitted gene expression levels, and the x-axis is the continuous timpoints. In each subplot, the red dotted lines are from allergic persons and the blue solid lines from healthy controls. The middle line of the same kind of lines denotes the averaged expression levels of a certain gene per cell for seven days, and the others indicate the upper and lower 95% quantile confidence intervals, respectively.

4.5.1 Gene-to-gene discussion

By fitting the result in section4.4, figure4.11shows that the averaged expressions of some genes are different between allergic and non-allergic persons even from the beginning (no overlapping between confidence intervals), including AL591846.1, MTRNR2L12, PFN1P1, RPL15P3 and TMSB10P1. Those might be the early response mechanisms in different popu-lations.

Subsequently, according to figure4.11GIMAP7 and HNRNPH1 are the only two genes that do not express differently between allergic and non-allergic persons before the allergen chal-lenge, but show different expressions over time. When such genes are differently expressed, the earliest timepoints are 3h and 24h for GIMAP7 and HNRNPH1.

Last but not least, such genes as AL136133.1, MTRNR2L1, and RPL27A are hard to define when they express differently since the confidence intervals of their gene expressions from patients and controls cross in a certain time. Also, it is worth noting that AL136133.1 is the only gene that seems not to express differently between allergic and non-allergic persons for seven days since their huge overlapping area. However, it does not mean that AL136133.1 has less contribution than other genes according to figure4.7. For example, its importance is larger than TMSB10P1 at day5, while TMSB10P1 expressions differ between patients and con-trols clearly at the same timepoint. One reason could be that both AL136133.1 and TMSB10P1 are useful genes to classify patients’ cells, but they are not significantly powerful.

(33)

4.5.2 Overall discussion

Considering the genes other than GIMAP7, RPL27A and AL136133.1, the confidence inter-vals of all genes’ expressions have been significantly expanded in patients’ cells than those of controls’ during seven days after the allergen challenge. It reflects that the expressions of these seven genes from allergic persons increase rampantly but unequally.

By contrast, the confidence interval of GIMAP7 from controls is expanded just after the allergen challenge, compared with that from patients. Later it climbs steadily until 1, which means that such a gene can almost be detected in all cells from controls.

At last, the expressions of the remaining two genes (AL136133.1 and RPL27A) are ex-pended in all populations.

Additionally, it can be noticed that AL591846.1 and RPL15P3 are differently expressed for the entire seven days after challenging by allergen, but AL591846.1 seems to have more im-portances than RPL15P3 at most timepoints based on figure4.8.

4.5.3 Discussion with biological functions

There is no reference found that refers to the direct connection between our genes and SAR. Gene is a vital segment containing genetic information for producing RNA. One possible hy-pothesis is that the inflammation (a symptom, not a disease) resulting from SAR needs more RNA production, and such a situation increases the variable importances of the correspond-ing genes.

On the other hand, some researchers have indirectly confirmed their possible SAR connec-tions. Table4.5shows some existing studies discussing about the genes we found or their associated diseases with SAR. For example, PFN1P1 is related to CD4-CTL, which are im-portant cells of immune system in human body. Furthermore, AL591846.1 (or RF00019) is a component of the Ro60 ribonucleoprotein particle, a target of autoimmune antibodies in pa-tients with systemic lupus erythematosus. The potential relative risk between allergic rhinitis and systemic lupus erythematosus is confirmed by meta-analysis (Wongtrakul et al.,2020).

(34)

Gene Associated

disease(s) Studies possible related to SAR

PFN1P1 - The products of such gene may also play important roles in the development or function of CD4-CTLs (Patil et al.,2018).

RPL15P3 -

-TMSB10P1 -

-MTRNR2L12 Hirschsprung Disease 1

Allergic colitis is a disease causing the similar symptoms as Hirschsprung disease (Bloom et al.,1999).

AL591846.1 (or RF00019)

Systemic Lupus Erythematosus

(SLE)

"A significantly increased 1.4 fold risk of SLE among patients with allergic rhinitis was observed in this systematic review and meta-analysis." (Wongtrakul et al.,

2020)

RPL27A

Rectum Adenoma (or colorectal

cancer)

"The current meta-analysis provides further evidence that allergic conditions were inversely associated with colorectal cancer risk and mortality." (Ma et al.,2017)

AL136133.1

(or SWT1) Kidney cancer

"Patient with allergic rhinitis or asthma had an elevated risk of developing cancer of kidney and urinary bladder based on standard incidence ratio." (Hwang et al.,2012)

GIMAP7

Early infantile epileptic encephalopathy 4

"GIMAP family members play a role in T-helper cell differentiation. GIMAP7 is also a regulator of lymphocyte survival and homeostasis." (Nee’Betal et al.,2019)

HNRNPH1 Leukemia and Hereditary Lymphedema -MTRNR2L1 Dystonia 23

-Table 4.5: This table shows the possible connections between the genes found in this thesis and some existing studies related to SAR. The first, second and third columns in the table are the genes, their associated disease(s) and the existing studies, respectively.

(35)

5 Conclusion

Our results demonstrate that RF is a more suitable algorithm to classify whether the cells are from allergic or non-allergic persons based on scRNA-seq data compared with PLS-DA. PLS-DA is an algorithm having high classification accuracy as well, but it might be over-fitting and unstable. By contrast, RF can not only provide better prediction accuracy for distinguishing cell source at any timepoints, but also has relatively stronger ability of gene selection. Moreover, the performance of RF can be greatly improved by denoising the data via scVI. It confirms that denoising can remove the technical noises in scRNA-seq. The high clas-sification accuracy clarifies that there might be an obvious difference in genetic code between allergic and non-allergic persons, no matter how much they are affected by the allergen. It shows us the possibility of preventing allergic patients from a disease outbreak by predicting in advance, which could benefit a large population.

Secondly, we find 10 genes having high classification power, which might be SAR biomark-ers. Applying our proposed fitting methods, the variations of selected genes’ expressions can be clearly visualized. Based on our results, for non-allergic persons, the gene expressions of most genes we found change negligibly in the period of seven days after the allergen chal-lenge. On the contrary, the expressions of such genes from allergic persons have a great variation, explicated by wider confidence intervals in figure4.11.

After that, we find that some of 10 selected genes may reveal the early response mecha-nisms in different populations. To be more specific, AL591846.1 and RPL15P3 even differently express between populations at all timepoints. Besides, we detect the earliest differently-expressed timepoints of GIMAP7 and HNRNPH1 are 3h and 24h. Nevertheless, it is difficult to describe when are the earliest timepoints the other six genes differently expressed. A de-tailed characterization can be seen in section4.5.1.

(36)

Moreover, we have not found any literature discussing the direct relation between our selected genes and SAR. However, as mentioned in section4.5.3, there are some existing researches revealing the potential relation between the genes we find and SAR (Patil et al., 2018; Bloom et al.,1999; Wongtrakul et al.,2020; Ma et al.,2017; Hwang et al.,2012; Nee’Betal et al., 2019). It is unexpected that such 10 genes found in this study could support such low misclassification rates in table4.4. Based on our results, ML shows its potential to help researchers discover something that traditional analysis has not found, which needs further bioinformatics and experimental analysis.

(37)

6 Study limitation and future work

First of all, we assumed that all cells in scRNA-seq data are statistically independent for convenience. However, the similarity between the same kind of cells from different persons’ samples is likely to exist. Such a condition can result in prediction bias. Thus, it is required to apply some models considering data correlation for comparison in further study. Meanwhile, removing the bias caused by the intrinsic difference between allergic and non-allergic people while predicting the disease status could also be an interesting direction in the future.

Secondly, the sample sizes of allergic and non-allergic persons participating in scRNA-seq varies at different timepoints. Although we have modified our methods in section3.3.4, this data defect might have some latent negative effects on our studies, such as unbalanced confi-dence interval and over-fitting in the latter timepoints. So the verification of our methodology based on other scRNA-seq data in future research is very important.

Additionally, we find the classification accuracy of allergic or non-allergic persons applied by PLS-DA or RF before the allergen challenge is higher than after the challenge. Although we put forward a hypothesis, it still needs biological verification.

Furthermore, the scRNA-seq data at timepoint 12h is abnormal according to figure4.3, which needs to be reviewed and further analyzed. If the abnormality in such data only stems from biological difference rather than technique results, it is may of value to do a further discussion about the results at timepoint 12h.

Last but not least, for the 10 genes we find, their fitted gene expressions show unexpected complexity. Based on our methodology, such 10 genes do have discriminative power, but we cannot rank them based on their different importances at different timepoints. Besides, we have not found the specific explanations of the connections between the 10 genes found in this

(38)

Bibliography

[1] M. Akdis and C. A. Akdis. “Mechanisms of allergen-specific immunotherapy”. In: Jour-nal of Allergy and Clinical Immunology 119.4 (2007), pp. 780–789.

[2] M. Benson et al. “Increased expression of Vascular Endothelial Growth Factor-A in sea-sonal allergic rhinitis”. In: Cytokine 20.6 (2002), pp. 268–273.

[3] D. Bloom et al. “Allergic colitis: a mimic of Hirschsprung disease”. In: Pediatric Radiol-ogy 29.1 (1999), pp. 37–41.

[4] J. Bousquet et al. “Allergic Rhinitis and Its Impact on Asthma”. In: Journal of Allergy and Clinical Immunology 108.5, Supplement (2001), S147–S334.

[5] L. Breiman. “Random forests”. In: Machine learning 45.1 (2001), pp. 5–32.

[6] R. G. Brereton and G. R. Lloyd. “Partial least squares discriminant analysis: taking the magic away”. In: Journal of Chemometrics 28.4 (2014), pp. 213–225.

[7] G. D’Amato et al. “Allergenic pollen and pollen allergy in Europe”. In: Allergy 62.9 (2007), pp. 976–990.

[8] T. Hastie et al. The elements of statistical learning: data mining, inference, and prediction. Springer Science & Business Media, 2009.

[9] H.-M. Hsueh et al. “Random forests-based differential analysis of gene sets for gene expression data”. In: Gene 518.1 (2013), pp. 179–186.

[10] C.-Y. Hwang et al. “Cancer risk in patients with allergic rhinitis, asthma and atopic der-matitis: a nationwide cohort study in Taiwan”. In: International Journal of Cancer 130.5 (2012), pp. 1160–1167.

[11] G. James et al. An introduction to statistical learning. Vol. 112. Springer, 2013.

[12] R. Lopez et al. “Deep generative modeling for single-cell transcriptomics”. In: Nature Methods 15.12 (2018), pp. 1053–1058.

[13] M. D. Luecken and F. J. Theis. “Current best practices in single-cell RNA-seq analysis: a tutorial”. In: Molecular Systems Biology 15.6 (2019), e8746.

[14] W. Ma et al. “Association between allergic conditions and colorectal cancer risk/mor-tality: a meta-analysis of prospective studies”. In: Scientific Reports 7.1 (2017), pp. 1–7. [15] A. Mongia et al. “deepMc: deep Matrix Completion for imputation of single cell

(39)

Bibliography

[16] S. G. Nee’Betal et al. “Histological Chorioamnionitis Induces Differential Gene Expres-sion in Human Cord Blood Mononuclear Leukocytes from Term Neonates”. In: Scien-tific reports 9.1 (2019), pp. 1–12.

[17] J. Nemesh. Drop-seq core computational protocol. http://mccarrolllab.com/wp-content/uploads/2016/03/DropseqAlignmentCookbookv1. 2Jan2016.pdf. 2016. [18] V. S. Patil et al. “Precursors of human CD4+ cytotoxic T lymphocytes identified by

single-cell transcriptome analysis”. In: Science immunology 3.19 (2018).

[19] F. Pedregosa et al. “Scikit-learn: Machine learning in Python”. In: Journal of Machine Learning Research 12 (2011), pp. 2825–2830.

[20] M. Pérez-Enciso and M. Tenenhaus. “Prediction of clinical outcome with microarray data: a partial least squares discriminant analysis (PLS-DA) approach”. In: Human ge-netics 112.5-6 (2003), pp. 581–592.

[21] F. Rohart et al. “mixOmics: An R package for ’omics feature selection and multiple data integration”. In: PLoS Computational Biology 13.11 (2017), e1005752.

[22] R. Shaik and W. Ramakrishna. “Machine learning approaches distinguish multiple stress conditions using stress-responsive genes and identify candidate genes for broad resistance in rice”. In: Plant Physiology 164.1 (2014), pp. 481–495.

[23] E. A. Thévenot et al. “Analysis of the Human Adult Urinary Metabolome Variations with Age, Body Mass Index, and Gender by Implementing a Comprehensive Work-flow for Univariate and OPLS Statistical Analyses”. In: Journal of Proteome Research 14.8 (2015), pp. 3322–3335.

[24] C. Torroja and F. Sanchez-Cabo. “DigitalDLSorter: Deep-Learning on scRNA-Seq to de-convolute gene expression data”. In: Frontiers in Genetics 10 (2019), p. 978.

[25] C. A. Vallejos et al. “Normalizing single-cell RNA sequencing data: challenges and op-portunities”. In: Nature Methods 14.6 (2017), p. 565.

[26] H. Wang et al. “Allergen challenge of peripheral blood mononuclear cells from patients with seasonal allergic rhinitis increases IL-17RB, which regulates basophil apoptosis and degranulation”. In: Clinical & Experimental Allergy 40.8 (2010), pp. 1194–1202. [27] D.-Y. Wang. “Risk factors of allergic rhinitis: genetic or environmental?” In: Therapeutics

and Clinical Risk Management 1.2 (2005), p. 115.

[28] W. Wongtrakul et al. “Allergic rhinitis and risk of systemic lupus erythematosus: A sys-tematic review and meta-analysis”. In: International Journal of Rheumatic Diseases 23.11 (2020), pp. 1460–1467.

[29] M. N. Wright and A. Ziegler. “ranger: A Fast Implementation of Random Forests for High Dimensional Data in C++ and R”. In: arXiv (2015), arXiv–1508.

[30] Y. Yokouchi et al. “A genome-wide linkage analysis of orchard grass-sensitive child-hood seasonal allergic rhinitis in Japanese families”. In: Genes & Immunity 3.1 (2002), pp. 9–13.

[31] S. Zhou, S. Xie, et al. Probability theory and mathematical statistics. Beijing: Higher Educa-tion Press, 2008.

Machine learning methods for seasonal allergic rhinitis studies

Linköping University | Department of Computer and Information Science

Master’s thesis, 30 ECTS | Statistics and Machine Learning

2021 | LIU-IDA/STAT-A--21/006--SE

Machine learning methods for

seasonal allergic rhinitis studies

Zijie Feng

Upphovsrätt

Copyright

Acknowledgments

Glossary

Contents

List of Figures

List of Tables

1

Introduction

1.1

Background

1.1.1

Seasonal allergic rhinitis

1.1.2

Examples of machine learning applications in genetics

1.2

Research aim

2

Data

2.1

Data reorganization

2.2

Data denoising

3

Methodology

3.1

Partial least squares discriminant analysis

3.1.1

Partial least squares

3.1.2

Partial least squares discriminant analysis

3.1.3

Variable importance in projection

3.2

Random Forest

3.2.1

Decision tree

3.2.2

Bagging

3.2.3

Random forest classifier

3.2.4

Mean decrease in accuracy

3.3.1

Bootstrap

3.3.2

Cubic regression splines

3.3.3

Cubic smoothing splines

3.3.4

Fitting method combining bootstrap and cubic smoothing splines

4

Experiment results and

discussion

4.1

Results based on reorganized data

4.2

Extra quality control

4.3

Results based on reorganized data without timepoint 12h

4.4

Results based on denoised data without timepoint 12h

Algorithm

Error rate

Timepoint

0h

day1

day2

day3

day5

day7

PLS-DA, train

0.000