DEGREE PROJECT IN COMPUTER SCIENCE AND ENGINEERING, SECOND CYCLE, 60 CREDITS
STOCKHOLM, SWEDEN 2020

Machine Learning Classification of Response to Internet-based Cognitive-Behavioural Therapy using Genome-Wide Association Study Data

KTH Thesis Report

REN XIN

KTH ROYAL INSTITUTE OF TECHNOLOGY


Authors

Ren Xin <xinr@kth.se>

Electrical Engineering and Computer Science
KTH Royal Institute of Technology

Place for Project

Stockholm, Sweden

Examiner

Henrik Boström
KTH Royal Institute of Technology
Stockholm, Sweden

Supervisor

Magnus Boman
Stockholm, Sweden


Abstract


Sammanfattning


Acknowledgements

I would like to thank my supervisor Magnus Boman for giving me the opportunity to conduct my thesis within this interesting project. I have learnt a lot about bioinformatics and machine learning from this project. I would also like to thank Magnus Boman, Fehmi Ben Abdesslem and Henrik Boström for all the guidance and suggestions throughout the work. Thanks to Christian Rück and his team at KI for providing the dataset and answering my questions regarding it.

Support from my manager Martin Söderlund and my colleagues at Ericsson is gratefully acknowledged, since it was crucial for finishing my studies at KTH alongside a full-time job. I would also like to take this opportunity to express my gratitude to all the teachers and classmates I have met during my time studying at KTH. Without their help, I would not have been able to acquire the knowledge needed for this project.


Acronyms

ALT Alternative Allele

AUC Area Under the Curve

CNN Convolutional Neural Network

CPU Central Processing Unit

DNA Deoxyribonucleic Acid

FN False Negatives

FP False Positives

FPR False Positive Rate

GAN Generative Adversarial Network

GWAS Genome-Wide Association Studies

ICBT Internet-based Cognitive-Behavioural Therapy

KI Karolinska Institutet

KTH Kungliga Tekniska Högskolan

LD Linkage Disequilibrium

MADRS Montgomery-Åsberg Depression Rating Scale

MAF Minor Allele Frequency

ML Machine Learning

MLP Multilayer Perceptron


PRC Precision Recall Curve

REF Reference Allele

RNN Recurrent Neural Network

ROC Receiver Operating Characteristic

SNP Single Nucleotide Polymorphism

TN True Negatives

TNR True Negative Rate


Contents

1 Introduction
1.1 Background
1.2 Problem
1.3 Purpose
1.4 Goal
1.5 Benefits, Ethics and Sustainability
1.6 Methodology
1.7 Outline

2 Theoretical Background
2.1 SNP and GWAS
2.2 SNP Selection
2.3 Machine Learning
2.3.1 Logistic Regression
2.4 Deep Learning
2.4.1 MLP
2.4.2 CNN
2.5 Performance Evaluation
2.5.1 Confusion Matrix
2.5.2 Precision and Recall
2.5.3 The ROC Curve
2.5.4 Cross-Validation

3 Methodologies and Methods
3.1 Research Method
3.2 The Data
3.3 Data Pre-processing
3.4 Feature Selection
3.5 Our Models
3.5.1 Evaluation
3.5.2 Logistic Regression
3.5.3 CNN
3.6 Training Environment

4 Results
4.1 Non-discriminating Classifier
4.2 Our Models
4.3 Shannon Entropy and Conditional Entropy
4.4 UMAP

5 Discussion

6 Conclusions


Chapter 1

Introduction

A learning machine is defined as an autonomous self-regulating open reasoning machine that actively learns in a decentralized manner, over multiple domains [5]. It has been employed in a project, led by Karolinska Institutet (KI) with Kungliga Tekniska Högskolan (KTH) involved and conducted in the Internet Psychiatry Clinic at Psykiatri Sydväst in Stockholm, to predict future patient behavior and the clinical outcome of treatment [4]. The learning machine uses a number of machine learning methods. Each of them helps identify and amplify signals of bias, but due to the different nature of data points, none of them is good enough for analyzing every weak signal. The learning machine fuses and unifies all the signal analyses and adapts the procedure over time with new data added.

1.1 Background

Since 2002, GWAS have been performed to analyze the genetic variation to identify genotypes that would cause certain phenotypes, including observable traits and diseases, in both humans and plants. One form of genetic variations is called Single Nucleotide Polymorphism (SNP). As of September 2018, over 5687 GWAS have examined over 71673 variant-trait associations from 3567 publications [7]. The challenges in GWAS include the huge number of genotypes to test, the weak association between truly associated genotypes and the phenotypes, and the possible interactions between the genotypes [28].

A number of statistical methodologies have been used in GWAS. The simplest one is single-marker regression, which looks at every genotype-phenotype combination, one genotype at a time. A linear regression statistical test gives how much of the phenotype is associated with that specific genotype. Because the signal between a single genotype and the phenotype is relatively weak, and interactions between genotypes are ignored, this method has limited power. Thus, a number of approaches have been developed that consider the correlative structure of the genetic data, giving rise to multiple-marker methods, where multiple genotypes are tested at the same time. For example, [9] reviewed and discussed three methods of achieving this: meta-analysis, which combines results from multiple single-SNP GWAS; lasso-penalized ordinary linear regression and logistic regression, which search for epistasis within a single GWAS study in order to identify stronger results that are revealed when genes interact; and pathway analysis of GWAS results, which prioritizes genes and pathways within a biological context.


phenotype prediction problems.

Recently, deep learning models have been widely employed in GWAS to predict phenotypes from genotypes, and a variety of models have been used. For example, Liu, Yang, et al. used a deep convolutional neural network to predict phenotypes of soybean with GWAS data [29], a multilayer feedforward network was used to predict age-related macular degeneration risk in [53] and to classify polygenic obesity in [32], and stacked autoencoders were used to identify higher-order SNP interactions in [15]. According to [37], convolutional neural networks seem more promising than multilayer perceptrons in genomic prediction.

Different approaches have been taken to handle the problem of a large feature size and a small sample size in deep learning models. Logistic regression has been used in feature selection to select possibly associated SNPs [30]. The authors of [41] used auxiliary networks to predict the parameters of the first layer in order to reduce the number of free parameters in the model. In [54], the researchers divided each chromosome into regions; each region was provided as input to a CNN for classification, and only the top eight regions from each chromosome that yielded the highest prediction accuracy were given to the main model as input.

1.2 Problem

1.3 Purpose

The purpose of this thesis project is to explore the possibility of employing deep learning models to find the relationship between patients' GWAS data and treatment responses. Specifically, this thesis project aims to answer the following research question:

With GWAS data, can deep learning networks predict ICBT outcome better than a non-discriminating classifier?

The work involves researching feature selection, defining a deep learning model, and analyzing the performance of the model.

1.4 Goal

The goal of this thesis project is to deliver a deep learning model that can predict treatment outcomes with GWAS data, to the learning machine mentioned previously, in order to further assist therapists in their daily work. As applying deep learning on GWAS data is still in its early days, the secondary goal is to motivate more research in this field.

1.5 Benefits, Ethics and Sustainability

This thesis project contributes to the completeness of the learning machine. It is beneficial to geneticists by identifying the relationship between SNPs and treatment outcome, therapists by allowing them to adjust treatments for each individual according to their genotypes, patients by allowing them to get better treatments and clinics by allowing better allocation of healthcare resources.

The model and results will be made public. Hopefully, this thesis project would encourage more research on predicting phenotypes from genotypes with deep learning models and eventually contribute to the healthcare of humankind.


held by KI, and with the digital policy of the research, as provided by professor Magnus Boman.

1.6 Methodology

The project will be conducted with a data-driven approach. It consists of a literature study and experimental research. The literature study aims to provide theoretical background of the project and methods to be applied in the project. During the experimental period, the data will be processed so that it can be loaded to the memory efficiently and fit in the memory. A CNN model will be defined and trained with hyper-parameters tuned with cross-validation. Feature selection will be applied to select possibly associated SNPs in order to reduce the dimension of the input to the CNN. The model will be evaluated and its performance will be compared with traditional statistics or machine learning models that have been widely used for GWAS data, for example, logistic regression or random forest. The conclusion of the thesis project will be based on the performance of the CNN model.

1.7 Outline


Chapter 2

Theoretical Background

This chapter briefly explains the theoretical background needed to understand the dataset, analysis and results. First, SNPs and GWAS are described. Then, several SNP selection methods are discussed. Last but not least, machine learning and deep learning are explained, followed by the performance metrics.

2.1 SNP and GWAS

A genome is the genetic material of an organism, which is organised in chromosome pairs. The chromosomes are the carriers of genes, which are functional sequences of DNA. DNA is composed of two polynucleotide chains, which are built from simpler monomeric units called nucleotides. Each nucleotide is composed of one nitrogen-containing nucleobase (either cytosine [C], guanine [G], adenine [A] or thymine [T]), a monosaccharide sugar called deoxyribose, and a phosphate group. Two nucleobases at the same location of two strands of DNA form a base pair. No two people have identical DNA, unless they are monozygotic twins. There are different forms of genetic variation, and the most common form is the SNP, as shown in Figure 2.1.1: a single base-pair change in the DNA sequence that occurs with high frequency (typically more than 1%) [8].


Two SNPs can be located close to each other on the chromosome. Alleles at SNPs close together on the same chromosome tend to occur together more often than expected by chance; in this case, the SNPs are said to be in linkage disequilibrium [45]. When SNPs are independent, they are said to be in linkage equilibrium.

GWAS studies SNPs. According to the U.S. National Institutes of Health [33] [34], a GWAS is a method used in genetics research to find specific genetic variations associated with particular diseases. The method involves scanning the genomes of many different people, including a case group (people with the disease or trait) and a control group (people without the disease or trait), and looking for genetic markers that occur more frequently in cases than in controls. These genetic markers, once identified, can help in understanding how genes contribute to the disease and in developing better prevention and treatment strategies.


Figure 2.1.1: The upper and lower DNA molecules are different at a single base-pair location. SNP model by David Eccles (gringer) / CC BY (https://creativecommons.org/licenses/by/4.0)

2.2 SNP Selection


that are usually used by geneticists in the pre-processing of GWAS data. MAF considers individual SNPs. It measures how often the second most common allele occurs in a given population. SNPs that have MAF lower than a threshold can be removed for practical reasons; for example, the threshold is set to 0.05 in [11]. LD considers SNP pairs, as formally defined in [45]. Assume $A$ and $B$ are a pair of alleles of two SNPs. We denote $P_A$ and $P_B$ as the frequencies of those alleles, and $P_{AB}$ as the frequency of $A$ and $B$ showing up together. Then we have the LD between alleles $A$ and $B$ as

$$D_{AB} = P_{AB} - P_{A}P_{B} \quad (2.1)$$

Since virtually all SNPs are diallelic, which means having only two alleles, the LD between the two SNPs, $D$, can be characterized as $D_{AB}$. When $D \neq 0$, the SNPs are in linkage disequilibrium; otherwise, they are in linkage equilibrium. It is also possible to define LD for more than two SNPs. [6] suggests that causal variants (SNPs), each with small effect sizes, may be fairly uniformly distributed, hinting that we can reduce the number of SNPs in the data by selecting one representative SNP from each LD region (a set of SNPs that are in LD). This method is called LD clumping, and can be conducted with PLINK.
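As a concrete illustration of Equation 2.1, the following minimal Python sketch (not from the thesis; the function name and the 0/1 haplotype encoding are assumptions) computes D_AB for one pair of diallelic SNPs:

import numpy as np

def linkage_disequilibrium(a: np.ndarray, b: np.ndarray) -> float:
    # a and b are 0/1 indicator arrays marking which haplotypes carry
    # allele A and allele B, respectively.
    p_a = a.mean()            # P_A, frequency of allele A
    p_b = b.mean()            # P_B, frequency of allele B
    p_ab = (a & b).mean()     # P_AB, frequency of A and B occurring together
    return p_ab - p_a * p_b   # D_AB = 0 means linkage equilibrium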

P-value considers an individual SNP and its phenotype. It measures the probability of obtaining data samples at least as extreme as the actual data, assuming the null hypothesis that the SNP is not associated with the phenotype is correct. The exact value can be calculated via Fisher's exact test [16]. The lower the p-value, the more likely the null hypothesis is false, and the more likely the SNP is associated with the phenotype. The most commonly accepted p-value threshold is $5 \times 10^{-8}$ [52]. SNPs that have a p-value lower than that are called genome-wide significant to the phenotype. Recently, this value has been suggested to be updated for low-frequency variants [13]. [6] and [51] suggest that in order to capture more phenotypic variance, a higher p-value threshold should be used. SNPs that are not genome-wide significant to a phenotype can be considered for removal in the prediction of that phenotype.

There are also other methods that can be used, for example, Shannon entropy and conditional entropy which are widely used in information theory. In data science, Shannon entropy [44] is usually the first choice to calculate the variance of each feature


(SNP) independently. It is defined by

$$H = -\sum_{i=1}^{M} P_i \log_2(P_i) \quad (2.2)$$

where $P_i$ is the fraction of values equal to $i$ observed for that feature, and $M$ is the number of distinct observed values for that feature. We can calculate the Shannon entropy for all the SNPs, rank them and focus on the SNPs with the most entropy. Shannon entropy is computed independently from phenotype targets, while conditional entropy [44] considers a target when computing an entropy. It quantifies the amount of information needed to describe the outcome of a random variable Y (the target) given the value of another random variable X (a SNP). It can also be defined as the uncertainty of encountering Y when we know X. The conditional entropy for a given SNP is defined as

$$H(Y \mid X) = -\sum_{x \in \mathcal{X},\, y \in \mathcal{Y}} p(x, y) \log \frac{p(x, y)}{p(x)} \quad (2.3)$$

where $\mathcal{X}$ is the set of observed values for X, and $\mathcal{Y}$ is the set of values for Y. When $H(Y \mid X) = 0$, the value of Y is completely determined by the value of X, and when $H(Y \mid X) = H(Y)$, X and Y are independent random variables. We can calculate the conditional entropy for all the SNPs, with the target being whether the patient is a treatment responder, rank them, and focus on the SNPs with the least conditional entropy.

UMAP [31] is a dimension reduction technique that can be used for visualisation and has become widely used in genomics-related information processing. Unlike the aforementioned methods, which reduce the data dimension while keeping the original features, UMAP projects the data from high dimensions to lower dimensions, where the new dimensions are not necessarily among the original ones. UMAP is itself a machine learning algorithm that can be trained in both an unsupervised and a supervised fashion on training data to reduce the data dimension and cluster the data. The trained UMAP model can be used to reduce the dimension of test data and label them. Additionally, the low-dimensional data produced by a UMAP model can be used as input to other machine learning algorithms. All of this can be achieved with the Python package umap-learn.
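The two entropy measures in Equations 2.2 and 2.3 can be computed per SNP with a few lines of Python. The sketch below is illustrative only: the function names and the Pandas-based implementation are assumptions, and base-2 logarithms are used for both measures for consistency with Equation 2.2.

import numpy as np
import pandas as pd

def shannon_entropy(snp: pd.Series) -> float:
    # H = -sum_i P_i * log2(P_i) over the observed genotype values of one SNP.
    p = snp.value_counts(normalize=True)
    return float(-(p * np.log2(p)).sum())

def conditional_entropy(target: pd.Series, snp: pd.Series) -> float:
    # H(Y|X) = -sum_{x,y} p(x, y) * log2( p(x, y) / p(x) ).
    joint = pd.crosstab(snp, target, normalize=True)   # p(x, y)
    p_x = joint.sum(axis=1)                            # p(x)
    cond = joint.div(p_x, axis=0)                      # p(y | x)
    with np.errstate(divide="ignore", invalid="ignore"):
        terms = joint.values * np.log2(cond.values)
    return float(-np.nansum(terms))                    # zero cells contribute nothing

# Example: rank SNP columns of a genotype DataFrame `geno` against a binary
# responder label `y`, keeping the SNPs with the lowest conditional entropy.
# ranked = sorted(geno.columns, key=lambda c: conditional_entropy(y, geno[c]))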

2.3 Machine Learning

The two most commonly used machine learning models in GWAS are logistic regression and random forest. In this project, we consider logistic regression as it has been widely applied to classification problems in data science and also used for phenotype prediction.

2.3.1 Logistic Regression

Logistic regression is a generalized linear model used for binomial classification problems. Let $x_1, x_2, \ldots, x_n$ be the predictor variables (features), and $y$ be the binary response variable (label). We denote $p = p(y = 1)$, which is the expected value of $y$. The log-odds of the event that $y = 1$ is assumed to have a linear relationship with the predictor variables:

$$\log \frac{p}{1-p} = \beta + W^{T}X = \beta + w_1 x_1 + w_2 x_2 + \ldots + w_n x_n \quad (2.4)$$

$\beta, w_1, w_2, \ldots, w_n$ are model parameters, with $\beta$ called the intercept and the rest called weights. By inverse transformation we have the logistic function for the expected value $p$:

$$p = \frac{\exp(\beta + W^{T}X)}{1 + \exp(\beta + W^{T}X)} \quad (2.5)$$

These expected values will be between 0 and 1 and can be treated as probabilities. We denote $\hat{y}$ as these probabilities, then we have

$$p(y = 1 \mid X; W; \beta) = \hat{y}, \qquad p(y = 0 \mid X; W; \beta) = 1 - \hat{y} \quad (2.6)$$

This can be written more compactly as

$$p(y \mid X; W; \beta) = \hat{y}^{\,y}(1 - \hat{y})^{1-y} \quad (2.7)$$

Assuming the samples are independent, we can write the likelihood of the parameters as

$$L(W) = p(y \mid X; W; \beta) = \prod_{i=1}^{m} p(y^{(i)} \mid x^{(i)}; W; \beta) = \prod_{i=1}^{m} (\hat{y}^{(i)})^{y^{(i)}} (1 - \hat{y}^{(i)})^{1 - y^{(i)}} \quad (2.8)$$

and the negative log likelihood as

$$-\log(L(W)) = -\sum_{i=1}^{m} \left( y^{(i)} \log(\hat{y}^{(i)}) + (1 - y^{(i)}) \log(1 - \hat{y}^{(i)}) \right) \quad (2.9)$$

where $i = 1, 2, \ldots, m$ and $m$ is the sample size. We would like the parameters to maximize the likelihood, i.e. to minimize the negative log likelihood, so we define the loss function $J(W) = -\log(L(W))$. The model parameters are usually randomly initialized at the beginning of the training, and updated during the training with a gradient descent algorithm to minimize $J(W)$.

One way to reduce the risk of overfitting in logistic regression is regularization, which is achieved by constraining the weights of the model:

$$Q(W) = J(W) + \lambda R(W) \quad (2.10)$$

where $R(W)$ is a function of $W$. The most commonly used regularizations are Lasso, where $R(W) = \sum_{i=1}^{n} |w_i|$; Ridge, where $R(W) = \sum_{i=1}^{n} w_i^2$; and ElasticNet, which is a hybrid of Lasso and Ridge with $R(W) = \alpha \sum_{i=1}^{n} |w_i| + (1 - \alpha) \sum_{i=1}^{n} w_i^2$. $\lambda$ and $\alpha$ are called hyperparameters and are defined before the training. With regularization, $Q(W)$ instead of $J(W)$ is minimized during the training.
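For concreteness, a minimal NumPy sketch of the regularized objective in Equations 2.9 and 2.10 with a Ridge penalty might look as follows (the function and variable names are assumptions, not thesis code):

import numpy as np

def ridge_logistic_loss(W, beta, X, y, lam):
    # Q(W) = J(W) + lambda * R(W): negative log likelihood (Eq. 2.9)
    # plus the Ridge penalty sum_i w_i^2 (Eq. 2.10).
    y_hat = 1.0 / (1.0 + np.exp(-(beta + X @ W)))   # logistic function, Eq. 2.5
    eps = 1e-12                                     # guard against log(0)
    J = -np.sum(y * np.log(y_hat + eps) + (1 - y) * np.log(1 - y_hat + eps))
    return J + lam * np.sum(W ** 2)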

2.4 Deep Learning


we describe MLP first.

2.4.1 MLP

An MLP is a fully connected network that consists of one input layer, at least one hidden layer, and one final output layer (see Figure 2.4.1). The input layer contains a set of neurons representing the input features. Each neuron in the hidden layers and the final output layer transforms the outputs from the previous layer with a weighted linear summation, followed by a non-linear activation function which produces the neuron's output. Subsequent layers receive the outputs from the previous layers, and the final output layer transforms them into the output of the network. Learning happens in the MLP by updating each weight ($w_i$) and bias ($b$) after processing every piece of data, and the goal is to minimize the loss function (also called error) [3]. The learning process is called back-propagation [42]. In brief, it is Gradient Descent with an efficient technique to compute the gradients automatically: in one forward and one backward pass through the network, the back-propagation algorithm computes the gradient of the network's error with respect to each model parameter. That is, it is able to find out how each weight ($w_i$) and bias ($b$) should be updated to reduce the error. After it gets the gradients, it simply performs a regular Gradient Descent step, and the whole procedure is repeated until the network converges to the solution [18].

When the target is a real value, the output of the network would be a vector of numbers, while when the target is a class as in a classification problem, the output of the network would be an array containing probabilities for each class. MLPs are powerful in solving classification and regression problems, but they are not the best option when data is spatial or temporal. CNN is one of the neural networks that are proposed to address these issues [37].

2.4.2 CNN


Figure 2.4.1: An MLP with a set of SNPs as input and four hidden layers; a basic “neuron” with n inputs, illustrating that what one neuron does is apply a weighted linear combination (inputs $x_i$, weights $w_i$, and bias $b$) followed by a nonlinear transformation. / CC BY [37]

building block of a CNN. Neurons in the first convolutional layer are not connected to every single feature in the input, but only to features in their receptive fields (see Figure 2.4.2). Correspondingly, every neuron in the subsequent convolutional layer is connected only to neurons located within a small rectangle in the previous layer. This architecture lets CNN focus on small low-level features in the first hidden layer, and assemble them into larger higher-level features in the next hidden layer, and so on [18].

In every convolutional layer, a kernel (also called a filter) is applied to the input with a predefined width and stride. Each kernel roughly corresponds to a neuron in a layer of an MLP. The convolutional layer output is produced by an activation function applied after each kernel. A pooling layer is usually used to reduce the spatial dimensions of the convolutional layer output; it merges values of successive positions by taking their maximum, minimum or average [37]. Figure 2.4.3 shows a simple representation of a one-dimensional (1D) convolutional kernel and a complete representation of a 1D CNN.


Figure 2.4.2: CNN layers with rectangular local receptive fields / CC BY [18]

features, which the model can then use for classification. The learning process of the CNN is also conducted with back-propagation [42].

2.5 Performance Evaluation

To evaluate the performance of the classification models, performance measures are needed. This section describes the measures and methods used to evaluate the models. According to [18], there are several metrics for a binomial classifier, such as confusion matrix, precision and recall, and AUC.

2.5.1 Confusion Matrix


Figure 2.4.3: (a) A simple representation of an one-dimension (1D) convolutional kernel. (b) A complete representation of an 1D CNN with a matrix of SNPs as input, convolutional layers followed by pooling layers and a standard MLP to produce the network output. / CC BY [37]

positives). A perfect classifier, that classifies every instance correctly, would only have true positives and true negatives, which implies that its confusion matrix would only have nonzero values on the top left to bottom right diagonal.

                  Predicted class 0       Predicted class 1
Actual class 0    100 (True Negatives)    400 (False Positives)
Actual class 1    200 (False Negatives)   300 (True Positives)

Table 2.5.1: Example of a confusion matrix with two classes 0 and 1.

2.5.2 Precision and Recall

The confusion matrix gives lots of information, but in some cases, a more concise metric is preferred. What is interesting to look at is the accuracy of the positive predictions, which is called the precision of the classifier (Equation 2.11).

$$\text{precision} = \frac{TP}{TP + FP} \quad (2.11)$$

where TP represents the number of true positives, while FP represents the number of false positives. A simple way to get perfect precision of 100% is to predict one single instance as positive and make sure the prediction is correct. However, this classifier would not be very useful, since it would ignore all but one positive instance. Therefore, precision is commonly used with another metric called recall, which is also named sensitivity or the true positive rate (TPR): the ratio of positive instances that are classified as positive by the classifier (Equation 2.12).

$$\text{recall} = \frac{TP}{TP + FN} \quad (2.12)$$

where FN represents the number of false negatives.

In order to compare two classifiers in a simple way, precision and recall are often combined into a single metric called the F1 score, which is the harmonic mean of precision and recall (Equation 2.13). While all values are treated equally by the regular mean, much more weight is given to low values by the harmonic mean. Thus, a classifier needs both high precision and high recall to get a high F1 score.

$$F_1 = \frac{2}{\frac{1}{\text{precision}} + \frac{1}{\text{recall}}} \quad (2.13)$$

There is a trade-off between precision and recall such that increasing precision reduces recall, and vice versa. It can be plotted with a precision-recall curve. The area under the curve (PRC AUC) can be measured to compare two classifiers. The baseline of the precision-recall curve is determined by the ratio of positives (P) and negatives (N) in the data as $y = \frac{P}{P+N}$, so the PRC AUC of a random classifier is $\frac{P}{P+N}$ [43].
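Assuming scikit-learn is available, the metrics above can be computed as in the following sketch; the data and the 0.5 threshold are illustrative only, and average_precision_score is used here as a standard approximation of PRC AUC.

import numpy as np
from sklearn.metrics import (precision_score, recall_score, f1_score,
                             average_precision_score)

y_true = np.array([0, 0, 1, 1, 1, 0, 1, 0])
y_score = np.array([0.2, 0.4, 0.8, 0.3, 0.9, 0.1, 0.6, 0.7])  # model probabilities
y_pred = (y_score >= 0.5).astype(int)                          # one possible threshold

precision = precision_score(y_true, y_pred)        # TP / (TP + FP), Eq. 2.11
recall = recall_score(y_true, y_pred)               # TP / (TP + FN), Eq. 2.12
f1 = f1_score(y_true, y_pred)                       # harmonic mean, Eq. 2.13
prc_auc = average_precision_score(y_true, y_score)  # area under the precision-recall curve
baseline = y_true.mean()                            # P / (P + N), the non-discriminating classifier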

2.5.3 The ROC Curve

The receiver operating characteristic (ROC) curve plots the true positive rate (TPR, another name for recall) against the false positive rate (FPR). The FPR is the ratio of negative instances wrongly classified as positive. It is equal to 1-TNR (true negative rate, also called specificity), which is the ratio of negative instances correctly classified as negative. So the ROC curve plots sensitivity (also another name for recall) against 1-specificity.
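Analogously, a small scikit-learn sketch for the ROC curve and its AUC, using the same y_true and y_score arrays as in the previous sketch (variable names assumed):

from sklearn.metrics import roc_curve, roc_auc_score

fpr, tpr, thresholds = roc_curve(y_true, y_score)  # one (FPR, TPR) point per threshold
roc_auc = roc_auc_score(y_true, y_score)           # 1.0 for a perfect classifier, 0.5 for random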


Figure 2.5.1: An example of ROC curve, which plots the false positive rate against the true positive rate for all possible thresholds; the red dot features a chosen ratio at 43.68% recall / CC BY [18]

dotted line stands for the ROC curve a purely random classifier would have. A good classifier should stay above that line and as far away from it as possible. To compare classifiers, we can measure the area under the curve (AUC). A perfect classifier would have an AUC of 1, while a purely random classifier would have 0.5.

2.5.4 Cross-Validation


dataset of size n by sampling n instances uniformly with replacement, and the test set contains the instances that are left out.

In [26], the authors reviewed these three methods, with the latter two compared on a variety of real-world datasets with differing characteristics. They found that stratified k-fold cross-validation is generally preferred and that bootstrap has extremely high bias on some problems. They recommend using stratified ten-fold cross-validation.


Chapter 3

Methodologies and Methods

In this chapter, we first motivate the research method adopted in this project, followed by an explanation of the data and its pre-processing. Then the feature selection is described. Finally, we describe the machine learning and deep learning models, their training and their evaluation.

3.1 Research Method

To answer our research question, we choose the CNN as the deep learning network, and build a classification model based on a CNN to predict ICBT outcome from GWAS data. One motivation for the choice of CNN is its ability to extract and learn high-level features from the input data and its success in image classification problems [48]. The other motivation is that the authors of [37] reviewed a list of reports on phenotype prediction with deep learning, and found that CNNs performed better than MLPs because CNNs were able to exploit interactions among nearby SNPs.

3.2 The Data

The genetic data contains 964 patients and 7249461 SNPs for each patient. There are four versions of the data: raw, pre-quality-control, quality-control-pass, and imputed. Each version consists of three files generated by PLINK, a genome association analysis tool set. The authors of [14] explored the imputed version, removed SNPs that were related to Insertions and Deletions (indels), and generated a noindel version of the data, containing 6810471 SNPs. The focus of the project is the noindel version, and its three files:

1. .bim file: containing variants for each SNP:

(a) ALT: the alternate allele, which can be A, C, G, or T.

(b) REF: the reference allele, which can also be A, C, G, or T.

2. .bed file: containing a matrix in which elements represent the variant for each SNP and each patient. That value can be missing, 0, 1, or 2.

(a) 0: if the SNP variant is the ALT in both chromosomes for that patient (Homozygous)

(b) 1: if the SNP variant is the ALT in one chromosome and the REF in the other for that patient (Heterozygous)

(c) 2: if the SNP variant is the REF in both chromosomes for that patient (Homozygous)

3. .fam file: containing information (for example, sex) of each patient.

Along with the genetic data sets, there is also a file containing phenotype information about patients, such as gender, pre-treatment MADRS score and post-treatment MADRS score.

3.3 Data Pre-processing

In [14], the same data has been successfully pre-processed and loaded into memory of a computer for some statistical analyses. In this project, we follow the same procedures


to pre-process the data so that it can be loaded into computer memory for machine learning.

The Pandas-PLINK module (v1.2.29) is used to read the genetic data set files and convert them into Pandas dataframes, which are data structures suitable for machine learning. The Pandas dataframe version of the .bed file is then modified to map NaN to the integer 3. (In [14], the .bed file is modified to map 0, 1 and 2 to actual alleles; for example, for a SNP with ALT as A and REF as T, 0 is mapped to AA, 1 is mapped to AT and 2 is mapped to TT. The actual alleles and NaN are then mapped to numbers from 0-16. We found this is not necessary, since for each SNP there are only four possible values, which can be represented by 0-3. Using 0-3 also requires less computer memory than using 0-16 when each number is later encoded by a one-hot encoder. Thus, we decided to re-execute the data pre-processing procedures instead of using the processed data from [14].) The result is saved to the hard drive as 964 CSV files, one per patient. Each file contains values for all SNPs, and can be represented by one row of Table 3.3.1. To speed up data loading, the following optimisations are applied (see the sketch at the end of this section):

1. Converting the CSV files to a binary file format: HDF5.

2. Transposing the data structure, to 964 columns and 6810471 rows, since it is time consuming for the Pandas module to infer the type of each of the 6810471 columns, even though the type is specified.

3. Merging the 964 HDF5 patient files into seven files of 150 patients each, so that they can be loaded into compute memory in parallel by seven processes.

Patient   SNP1   SNP2   SNP3   …   SNP6810470   SNP6810471
0         1      2      1      …   0            1
1         1      0      2      …   3            0
...       ...    ...    ...    …   ...          ...
962       0      3      1      …   2            1
963       1      2      3      …   0            2

Table 3.3.1: Unified data structure containing SNP values for each patient.

There are eight cores on the machine we are using, which allows parallel loading of the seven files. In less than two seconds, all SNP values for all patients are loaded into computer memory. After that, we merge the genetic data set with phenotype data set. We exclude patients in the genetic data set that are not found in the phenotype data


set, leaving us with 894 patients. Further, we exclude patients that do not have both pre- and post-treatment MADRS scores, leaving us with 788 patients.
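A rough Python sketch of this pre-processing pipeline is shown below; the file names, the block size handling and the exact calls are assumptions for illustration, not the code used in the project.

import numpy as np
import pandas as pd
from pandas_plink import read_plink

# Read the three PLINK files (.bed/.bim/.fam); `bed` is a lazy array
# of shape (n_variants, n_samples).
bim, fam, bed = read_plink("data/noindel")

n_per_file = 150
for start in range(0, bed.shape[1], n_per_file):
    # Load one block of patients into memory, map missing genotypes (NaN)
    # to the integer 3 and keep 0/1/2 as-is, so each SNP has four possible values.
    block = bed[:, start:start + n_per_file].compute()
    block = np.nan_to_num(block, nan=3.0).astype(np.int8)
    # SNPs as rows and patients as columns, stored as HDF5 so that the
    # resulting files can be loaded in parallel later.
    pd.DataFrame(block).to_hdf(f"data/patients_{start}.h5", key="snps", mode="w")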

3.4 Feature Selection

After pre-processing, we have a data set with sample size 788 and feature size 6810471. The fact that the feature size is much bigger than the sample size makes the data set unsuitable for machine learning algorithms, since any model is doomed to overfit. Thus, we need to reduce the feature size, i.e. the number of SNPs. To do so, we get all the SNPs associated with the trait “Unipolar depression” from the GWAS Catalog. There are in total 1547 distinct SNPs, of which 875 exist in our genetic dataset. There are two reasons that we choose this path. First, it is interesting to see if the SNPs associated with the trait “Unipolar depression” are associated with response to the treatment. Secondly, the dataset we got has already been processed with MAF and LD, and we plan to use p-value, Shannon entropy, conditional entropy and UMAP to further reduce the dimensions if 875 SNPs are still too many.

The label of the data set is defined as 0 or 1, where 1 means the patient is a responder to the treatment, and 0 means the patient is not a responder. A responder is a patient that has 50% symptom reduction from pre- to post-treatment [17], i.e. the MADRS score decreases by 50% after treatment. Among the 788 patients, 353 are responders and 435 are not; that is, the positive rate is 0.451.
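As a small illustration (the file and column names are hypothetical), the responder label can be derived from the MADRS scores as follows:

import pandas as pd

pheno = pd.read_csv("data/phenotypes.csv")                       # hypothetical phenotype file
pheno = pheno.dropna(subset=["madrs_pre", "madrs_post"])         # keep patients with both scores
# Responder = MADRS score reduced by at least half from pre- to post-treatment.
pheno["responder"] = (pheno["madrs_post"] <= 0.5 * pheno["madrs_pre"]).astype(int)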

So our final dataset includes 875 features (SNPs) and one label (trait), which is whether the patient is a responder to the treatment.

3.5 Our Models

We train two kinds of models with the final dataset: one is a logistic regression model and the other is a CNN.

3.5.1 Evaluation

The models are evaluated with PRC AUC. We choose it over ROC AUC, because we have a dataset with class imbalance. According to [12] and [43], ROC curves present


an optimistic picture of the model on datasets like this; however, precision recall curves can present a more accurate picture of classification performance since they evaluate the fraction of true positives among positive predictions.

3.5.2 Logistic Regression

For the logistic regression, we use the LogisticRegressionCV model provided by scikit-learn [36]. We apply stratified k*l-fold cross-validation, where both k and l are 10. This is achieved by dividing the dataset into 10 parts with stratification, each of which is used as test data for a LogisticRegressionCV model trained on the other 9 parts; when training a LogisticRegressionCV model, we set the parameter cv to 10, which applies stratified 10-fold cross-validation to the training. The parameter penalty is left at its default, which means Ridge; the parameter scoring is set to a customized function that computes PRC AUC; and the parameter Cs is set to 10, which generates 10 candidate values for the inverse of the regularization strength, with cross-validation selecting the value associated with the model that has the highest validation PRC AUC. Before training, we apply one-hot encoding to the data, so that each column in the data set no longer stands for a SNP but for one category of a SNP; to avoid multicollinearity, the first category of each SNP is removed from the dataset. This gives us 2227 features in the input to the models.
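A minimal sketch of this set-up, with assumed variable names and without the project's exact parameters, could look like this (needs_proba is the pre-1.4 scikit-learn scorer argument; newer versions use response_method="predict_proba"):

import numpy as np
from sklearn.linear_model import LogisticRegressionCV
from sklearn.metrics import average_precision_score, make_scorer
from sklearn.model_selection import StratifiedKFold
from sklearn.preprocessing import OneHotEncoder

# X: (788, 875) genotype matrix with values 0-3, y: binary responder label.
# Drop the first category of each SNP to avoid multicollinearity.
X_enc = OneHotEncoder(drop="first").fit_transform(X)

prc_auc = make_scorer(average_precision_score, needs_proba=True)

outer = StratifiedKFold(n_splits=10, shuffle=True, random_state=0)
scores = []
for train_idx, test_idx in outer.split(X_enc, y):
    # Inner stratified 10-fold CV selects the regularization strength (Cs=10).
    model = LogisticRegressionCV(Cs=10, cv=10, scoring=prc_auc, max_iter=1000)
    model.fit(X_enc[train_idx], y[train_idx])
    proba = model.predict_proba(X_enc[test_idx])[:, 1]
    scores.append(average_precision_score(y[test_idx], proba))

print("mean PRC AUC:", np.mean(scores))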

3.5.3 CNN

Our CNN contains an input layer, four inception-like blocks, two shortcut connections and an output processing block, as shown in Figure 3.5.2. The input layer contains one-hot encoded genotypes. The inception-like block, as shown in Figure 3.5.1, is inspired by [48]. It contains four parallel Conv1D layers with 10 filters each and different kernel sizes (1, 5, 10 and 20). Its output is the concatenation of these four layers. The purpose of this is to consider the interaction of SNPs at various scales and then aggregate them, so that the next stage can abstract features from different scales simultaneously [48]. The shortcut connections are used to avoid the vanishing gradient problem and to reduce the degradation of training accuracy in deep networks once the network has started converging [22]. The shortcut connections serve as residual connections around the stacked inception-like blocks.

4https://scikit-learn.org/stable/modules/generated/sklearn.linear_model.


This technique has also been used in [29]. Each shortcut connection contains a Conv1D layer with 40 filters and kernel size 1, so that the output of the shortcut connection has the same dimension as the output of the stacked inception-like blocks, which is a prerequisite for the subsequent add operation. The output processing block contains one Conv1D layer followed by a flatten layer to integrate all the extracted features from previous stages, and a dense layer, consisting of a single neuron, to convert the features to the predicted phenotype. Batch normalization is adopted after each convolutional layer and before the activation layer, following [24], and to reduce the effect of overfitting, we add dropout layers [46]. The dropout rate in the inception-like blocks is 0.7 and in the output processing block 0.4. We adopt ReLU [19] [35] as the activation function in the activation layers following the Conv1D layers, and sigmoid [35] as the activation function in the last dense layer. Ridge regularization is applied on each Conv1D layer and the dense layer to apply a penalty on the layer's kernel, with regularization factor 0.0001. Padding is enabled in all the Conv1D layers.

Figure 3.5.1: Inception-like block in the CNN model

The model is implemented and trained with Keras [10], since it is powerful, easy to use and widely adopted in the industry and the research community. During the training, we use stratified k-fold cross-validation where k is 10 and Adam optimizer [25]. To further reduce the effect of overfitting, we stop the training if the PRC AUC on the validation data does not increase in five epochs and restore the model to the previous best.
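To make the block structure concrete, the following Keras sketch builds one inception-like block and one shortcut connection using the layer sizes given above; everything else (function names, exact wiring, omitted regularizers) is an assumption rather than the thesis implementation.

from tensorflow.keras import layers

def inception_block(x, dropout=0.7):
    # Four parallel Conv1D branches with kernel sizes 1, 5, 10 and 20, 10 filters each.
    branches = []
    for k in (1, 5, 10, 20):
        b = layers.Conv1D(10, kernel_size=k, padding="same")(x)
        b = layers.BatchNormalization()(b)      # batch norm before the activation
        b = layers.Activation("relu")(b)
        branches.append(b)
    x = layers.Concatenate()(branches)          # 4 x 10 = 40 channels out
    return layers.Dropout(dropout)(x)

def shortcut(x_in, x_out):
    # Project the block input to 40 channels with a kernel-size-1 Conv1D and
    # add it to the stacked blocks' output (residual connection).
    proj = layers.Conv1D(40, kernel_size=1, padding="same")(x_in)
    return layers.Add()([proj, x_out])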

3.6 Training Environment


Figure 3.5.2: Proposed CNN Model


Chapter 4

Results

In order to know if the 875 SNPs we got from the GWAS Catalog have stronger prediction power on response to the treatment than other SNPs, we trained both logistic regression models and CNN models with both our final dataset and a dataset containing 875 random SNPs outside the list of SNPs we got from the GWAS Catalog.

4.1 Non-discriminating Classifier

According to [43], the baseline of precision-recall curve is determined by the ratio of positives (P) and negatives (N) as y = P / (P + N) and a random classifier has PRC AUC of P / (P + N). In this project, we refer to the random classifier as non-discriminating classifier, and with regard to our data, it has PRC AUC of 0.451.

4.2 Our Models

Figure 4.2.1: Mean PRC AUC for logistic regression and CNN models trained with both GWAS Catalog SNPs and random SNPs.

trained with GWAS Catalog SNPs have a slightly higher mean PRC AUC than the CNN models trained with the same data. However, this does not necessarily imply that our models trained with GWAS Catalog SNPs have better performance than the non-discriminating classifier, or that the GWAS Catalog SNPs have stronger prediction power on treatment outcome than random SNPs.


p-values, we can see that, regarding the performance of the models, the logistic regression models trained with GWAS Catalog SNPs perform better than the non-discriminating classifier, with the CNN models trained with GWAS Catalog SNPs in between, but likely closer to the logistic regression models.

Model Set 1   Model Set 2          P-value
LR GWAS       LR Random            0.003
CNN GWAS      CNN Random           0.047
CNN GWAS      LR GWAS              0.862
CNN GWAS      Non-discriminating   0.056
LR GWAS       Non-discriminating   0.006

Table 4.2.1: P-values from t-tests on the PRC AUC values of different models.

4.3 Shannon Entropy and Conditional Entropy

We computed the Shannon entropy for all the 875 GWAS Catalog SNPs and for the trait of response to the treatment. Figure 4.3.1 shows the entropy for the SNPs; the entropy of the trait is 0.9922. We also computed the conditional entropy for the trait with regard to all GWAS Catalog SNPs and random SNPs, and the results are presented in Table 4.3.1. The fact that the mean conditional entropy of the GWAS Catalog SNPs is lower than that of the random SNPs, and that the two sets of conditional entropy values differ with a p-value of 0.02, implies that the trait is more associated with the GWAS Catalog SNPs than with the random SNPs.

Dataset             Mean     Max      Min      P-value
GWAS Catalog SNPs   0.9896   0.9922   0.9735   0.02
Random SNPs         0.9898   0.9922   0.9770

Table 4.3.1: Conditional entropy for the trait with regard to all GWAS Catalog SNPs and random SNPs, as well as the p-value from Welch's t-test on the two sets of conditional entropy values.
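The comparison in Table 4.3.1 corresponds to Welch's t-test on the two sets of per-SNP conditional entropies, which in Python (variable names assumed) is a one-liner with SciPy:

from scipy import stats

# Welch's t-test: equal_var=False does not assume equal population variances.
t_stat, p_value = stats.ttest_ind(cond_entropy_gwas, cond_entropy_random, equal_var=False)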

4.4 UMAP

Figure 4.3.1: Shannon Entropy distribution of all the 875 GWAS Catalog SNPs

Figure 4.4.2. The model succeeded in clustering the training data, but failed to project the test data into the correct cluster.

We also trained logistic regression models and CNN models with low-dimensional data projected by UMAP models. Before the training of each fold, a UMAP model was trained on the original training data so that it could be used to reduce the dimension of the data (in logistic regression to 2D and in CNN to 20D); then the low-dimensional data were used to train and evaluate the logistic regression models and CNN models. The mean scores of the performance metric from logistic regression models with stratified 10*10-fold cross-validation and CNN models with stratified 10-fold cross-validation are listed in Table 4.4.1. Reducing the dimension with UMAP failed to improve the performance of the logistic regression models and CNN models.

                             Mean PRC AUC
Logistic Regression Models   0.453
CNN Models                   0.455

Table 4.4.1: Mean PRC AUC for models trained on UMAP-projected data.
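A minimal umap-learn sketch of this per-fold procedure (parameters and variable names assumed) is:

import umap

# Supervised UMAP: the labels of the training fold guide the embedding.
reducer = umap.UMAP(n_components=20, random_state=0)       # 2 for plotting, 20 for the CNN
X_train_low = reducer.fit_transform(X_train, y=y_train)
X_test_low = reducer.transform(X_test)                      # project unseen patients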


Figure 4.4.1: Projection of the training data in 2D space by UMAP model


Chapter 5

Discussion

In an attempt to see if machine learning could predict patients' responses to treatment from GWAS data, we have performed classification with logistic regression and CNN. Our dataset contains GWAS data and clinical data for 788 patients, where the GWAS data contains 6810471 SNPs for each patient, and the clinical data contains pre- and post-treatment MADRS scores, which can be used to compute whether a patient is a responder to the treatment, i.e. the target. The difference between the sample size and the feature size was so huge in our dataset that we had to reduce the number of features, otherwise the machine learning models were doomed to overfit. We selected the SNPs that are associated with the trait “Unipolar depression” according to the GWAS Catalog, since it is interesting to see if these SNPs are also associated with the treatment response, and by doing this, we were able to reduce the feature size to 875.


the CNN models, for example, reducing the depth of the network, increasing the depth of the network, and tuning the batch size, dropout rate and regularizers, but without success. The reason that the CNN underperformed logistic regression is probably the small sample size. According to [38], neural networks may need over 10 times as much data as logistic regression to achieve stable performance and small optimism. We also notice that dimension reduction by UMAP fails to improve the prediction performance; this is probably because some information is lost when the data is projected to lower dimensions. With regard to our research question, we failed to present a CNN model that predicts treatment outcome from GWAS data better than the non-discriminating classifier, but we succeeded in presenting a logistic regression model that does, using the 875 GWAS Catalog SNPs.

However, the results of this project are not decisive. More research needs to be done for the same purpose as this project. The future work can be conducted in seven directions.

1. Collecting data from more patients. A large sample size is crucial in machine learning. With more samples available, the CNN model defined in this project can be revisited to see if it can outperform logistic regression model.

2. Continuing to reduce the number of SNPs. A small feature size can compensate when the sample size is small. One way of further reducing the SNPs is to divide the 875 SNPs into small groups based on their locations on the genes, and train the models with each group of SNPs. This assumes that the further apart two SNPs are located, the less likely they are to interact and have joint prediction power. We can also apply this to our original SNP set to see if the SNPs that are not associated with the trait “Unipolar depression” have any prediction power on treatment outcome. A similar approach has been applied in [54]. Another way is to apply the aforementioned SNP selection methods that have not been applied yet, such as p-value, Shannon entropy and conditional entropy, to reduce the number of SNPs; however, useful SNPs may be removed because of the small sample size.

3. Increasing the integrity of the data. This can be done by, for example, collecting the missing SNPs that are associated with trait “Unipolar depression”, and adding more patient information such as gender and age into the input to the machine learning models.


have got lucky and obtained a set of random SNPs that is less associated with the treatment outcome than the GWAS Catalog SNPs, and the results might be different with other sets of random SNPs.

5. Continuing to tune the parameters of the CNN model.

6. Coefficients of the best logistic regression model can be used to determine the contribution of each SNP, and to identify the SNPs that are most associated with the treatment outcome. For the CNN model, this can be achieved with saliency maps as used in [29]. The results can be compared with results from a Multiple Correspondence Analysis [1].


Chapter 6

Conclusions

In this project, GWAS data was used with machine learning models to predict the future treatment outcome of Internet-based Cognitive-Behavioural Therapy for patients suffering from depression, as part of the learning machine employed in the Internet Psychiatry Clinic at Psykiatri Sydväst in Stockholm. The original data had a very small sample size but a huge feature size, which would make any machine learning model prone to overfitting. We reduced the number of SNPs by selecting the ones associated with the trait “Unipolar depression” according to the GWAS Catalog. We defined and trained a CNN model with the new data containing only the selected SNPs. For comparison, we also trained a logistic regression model with the new data, and trained both models with a same-sized dataset containing random SNPs. The results show that the selected SNPs have stronger prediction power than the random SNPs, and that the models trained with the selected SNPs perform better than the non-discriminating classifier, with the logistic regression model performing better than the CNN model.


knowledge, this is the first work to predict ICBT treatment outcome with GWAS data and machine learning models and should motivate more research on adopting machine learning models, especially CNN, in genetic prediction.


References

[1] Abdi, Hervé and Valentin, Dominique. “Multiple correspondence analysis”. In:

Encyclopedia of measurement and statistics 2 (2007), pp. 651–66.

[2] Analyzing Network Data in Biology and Medicine: An Interdisciplinary Textbook for Biological, Medical and Computational Scientists. Cambridge

University Press, 2019. DOI:10.1017/9781108377706.

[3] Bellot, Pau, Campos, Gustavo de los, and Pérez-Enciso, Miguel. “Can deep learning improve genomic prediction of complex human traits?” In: Genetics 210.3 (2018), pp. 809–819.

[4] Boman, Magnus, Abdesslem, Fehmi Ben, Forsell, Erik, Gillblad, Daniel, Görnerup, Olof, Isacsson, Nils, Sahlgren, Magnus, and Kaldo, Viktor. “Learning machines in Internet-delivered psychological treatment”. In: Progress in

Artificial Intelligence (2019), pp. 1–11.

[5] Boman, Magnus, Sahlgren, Magnus, Görnerup, Olof, and Gillblad, Daniel. “Learning Machines”. In: Learning, Inference and Control of Multi-Agent

Systems (2018), pp. 610–613.

[6] Boyle, Evan A, Li, Yang I, and Pritchard, Jonathan K. “An expanded view of complex traits: from polygenic to omnigenic”. In: Cell 169.7 (2017), pp. 1177– 1186.

[7] Buniello, Annalisa, MacArthur, Jacqueline A L, Cerezo, Maria, Harris, Laura W, Hayhurst, James, Malangone, Cinzia, McMahon, Aoife, Morales, Joannella, Mountjoy, Edward, Sollis, Elliot, et al. “The NHGRI-EBI GWAS Catalog of published genome-wide association studies, targeted arrays and summary statistics 2019”. In: Nucleic acids research 47.D1 (2018), pp. D1005–D1012.

[8] Bush, William S and Moore, Jason H. “Genome-wide association studies”. In:


[9] Cantor, Rita M, Lange, Kenneth, and Sinsheimer, Janet S. “Prioritizing GWAS results: a review of statistical methods and recommendations for their application”. In: The American Journal of Human Genetics 86.1 (2010), pp. 6– 22.

[10] Chollet, François et al. Keras.https://keras.io. 2015.

[11] Consortium, International HapMap et al. “A haplotype map of the human genome”. In: Nature 437.7063 (2005), p. 1299.

[12] Davis, Jesse and Goadrich, Mark. “The relationship between Precision-Recall and ROC curves”. In: Proceedings of the 23rd international conference on

Machine learning. 2006, pp. 233–240.

[13] Fadista, João, Manning, Alisa K, Florez, Jose C, and Groop, Leif. “The (in) famous GWAS P-value threshold revisited and updated for low-frequency variants”. In: European Journal of Human Genetics 24.8 (2016), pp. 1202– 1205.

[14] Ben Abdesslem, Fehmi, Olsson, Fredrik, and Boman, Magnus. RISE Internal Report. Tech. rep. 2019.

[15] Fergus, Paul, Montanez, Aday, Abdulaimma, Basma, Lisboa, Paulo, Chalmers, Carl, and Pineles, Beth. “Utilising deep learning and genome wide association studies for epistatic-driven preterm birth classification in African-American women”. In: IEEE/ACM transactions on computational biology and bioinformatics (2018).

[16] Fisher, Ronald A. “On the interpretation of χ 2 from contingency tables, and the calculation of P”. In: Journal of the Royal Statistical Society 85.1 (1922), pp. 87–94.

[17] Forsell, Erik, Isacsson, Nils, Blom, Kerstin, Jernelöv, Susanna, Ben Abdesslem, Fehmi, Lindefors, Nils, Boman, Magnus, and Kaldo, Viktor. “Predicting treatment failure in regular care Internet-Delivered Cognitive Behavior Therapy for depression and anxiety using only weekly symptom measures.” In: Journal

of Consulting and Clinical Psychology (2019).

[18] Géron, Aurélien. Hands-On Machine Learning with Scikit-Learn, Keras, and

TensorFlow: Concepts, Tools, and Techniques to Build Intelligent Systems.


[19] Glorot, Xavier, Bordes, Antoine, and Bengio, Yoshua. “Deep sparse rectifier neural networks”. In: Proceedings of the fourteenth international conference

on artificial intelligence and statistics. 2011, pp. 315–323.

[20] Goldstein, Benjamin A, Hubbard, Alan E, Cutler, Adele, and Barcellos, Lisa F. “An application of Random Forests to a genome-wide association dataset: methodological considerations & new findings”. In: BMC genetics 11.1 (2010), p. 49.

[21] Grinberg, Nastasiya F, Orhobor, Oghenejokpeme I, and King, Ross D. “An evaluation of machine-learning for predicting phenotype: studies in yeast, rice, and wheat”. In: Machine Learning (2019), pp. 1–27.

[22] He, Kaiming, Zhang, Xiangyu, Ren, Shaoqing, and Sun, Jian. “Deep residual learning for image recognition”. In: Proceedings of the IEEE conference on

computer vision and pattern recognition. 2016, pp. 770–778.

[23] Huang, Xiuzhen, Zhou, Wei, Bellis, Emily S, Stubblefield, Jonathan, Causey, Jason, Qualls, Jake, and Walker, Karl. “Minor QTLs mining through the combination of GWAS and machine learning feature selection”. In: BioRxiv (2019), p. 712190.

[24] Ioffe, Sergey and Szegedy, Christian. “Batch normalization: Accelerating deep network training by reducing internal covariate shift”. In: arXiv preprint

arXiv:1502.03167 (2015).

[25] Kingma, Diederik P and Ba, Jimmy. “Adam: A method for stochastic optimization”. In: arXiv preprint arXiv:1412.6980 (2014).

[26] Kohavi, Ron et al. “A study of cross-validation and bootstrap for accuracy estimation and model selection”. In: Ijcai. Vol. 14. 2. Montreal, Canada. 1995, pp. 1137–1145.

[27] LeCun, Yann, Bottou, Léon, Bengio, Yoshua, and Haffner, Patrick. “Gradient-based learning applied to document recognition”. In: Proceedings of the IEEE 86.11 (1998), pp. 2278–2324.


[29] Liu, Yang, Wang, Duolin, He, Fei, Wang, Juexin, Joshi, Trupti, and Xu, Dong. “Phenotype prediction and genome-wide association study using deep convolutional neural network of soybean”. In: Frontiers in Genetics 10 (2019), p. 1091.

[30] Maciukiewicz, Malgorzata, Marshe, Victoria S, Hauschild, Anne-Christin, Foster, Jane A, Rotzinger, Susan, Kennedy, James L, Kennedy, Sidney H, Müller, Daniel J, and Geraci, Joseph. “GWAS-based machine learning approach to predict duloxetine response in major depressive disorder”. In: Journal of

psychiatric research 99 (2018), pp. 62–68.

[31] McInnes, Leland, Healy, John, and Melville, James. “Umap: Uniform manifold approximation and projection for dimension reduction”. In: arXiv preprint

arXiv:1802.03426 (2018).

[32] Montaez, Casimiro A Curbelo, Fergus, Paul, Montaez, Almudena Curbelo, Hussain, Abir, Al-Jumeily, Dhiya, and Chalmers, Carl. “Deep learning classification of polygenic obesity using genome wide association study SNPs”. In: 2018 International Joint Conference on Neural Networks (IJCNN). IEEE. 2018, pp. 1–8.

[33] National Cancer Institute: GWAS Definition. https://www.cancer.gov/publications/dictionaries/genetics-dictionary/def/gwas. Accessed: 2020-03-30.

[34] National Human Genome Research Institute: GWAS Definition. https://www.genome.gov/genetics-glossary/Genome-Wide-Association-Studies. Accessed: 2020-03-30.

[35] Nwankpa, Chigozie, Ijomah, Winifred, Gachagan, Anthony, and Marshall, Stephen. “Activation functions: Comparison of trends in practice and research for deep learning”. In: arXiv preprint arXiv:1811.03378 (2018).

[36] Pedregosa, F., Varoquaux, G., Gramfort, A., Michel, V., Thirion, B., Grisel, O., Blondel, M., Prettenhofer, P., Weiss, R., Dubourg, V., Vanderplas, J., Passos, A., Cournapeau, D., Brucher, M., Perrot, M., and Duchesnay, E. “Scikit-learn: Machine Learning in Python”. In: Journal of Machine Learning Research 12 (2011), pp. 2825–2830.


[38] Ploeg, Tjeerd van der, Austin, Peter C, and Steyerberg, Ewout W. “Modern modelling techniques are data hungry: a simulation study for predicting dichotomous endpoints”. In: BMC medical research methodology 14.1 (2014), p. 137.

[39] Poplin, Ryan, Chang, Pi-Chuan, Alexander, David, Schwartz, Scott, Colthurst, Thomas, Ku, Alexander, Newburger, Dan, Dijamco, Jojo, Nguyen, Nam, Afshar, Pegah T, et al. “A universal SNP and small-indel variant caller using deep neural networks”. In: Nature biotechnology 36.10 (2018), pp. 983–987.

[40] Rasmussen, Kristin. Encyclopedia of measurement and statistics. Vol. 1. Sage, 2007.

[41] Romero, Adriana, Carrier, Pierre Luc, Erraqabi, Akram, Sylvain, Tristan, Auvolat, Alex, Dejoie, Etienne, Legault, Marc-André, Dubé, Marie-Pierre, Hussin, Julie G, and Bengio, Yoshua. “Diet networks: Thin parameters for fat genomics”. In: arXiv preprint arXiv:1611.09340 (2016).

[42] Rumelhart, David E, Hinton, Geoffrey E, and Williams, Ronald J. Learning

internal representations by error propagation. Tech. rep. California Univ San

Diego La Jolla Inst for Cognitive Science, 1985.

[43] Saito, Takaya and Rehmsmeier, Marc. “The precision-recall plot is more informative than the ROC plot when evaluating binary classifiers on imbalanced datasets”. In: PloS one 10.3 (2015).

[44] Shannon, Claude E. “A mathematical theory of communication”. In: Bell system

technical journal 27.3 (1948), pp. 379–423.

[45] Slatkin, Montgomery. “Linkage disequilibrium—understanding the evolutionary past and mapping the medical future”. In: Nature Reviews Genetics 9.6 (2008), pp. 477–485.

[46] Srivastava, Nitish, Hinton, Geoffrey, Krizhevsky, Alex, Sutskever, Ilya, and Salakhutdinov, Ruslan. “Dropout: a simple way to prevent neural networks from overfitting”. In: The journal of machine learning research 15.1 (2014), pp. 1929–1958.

[47] Stephan, Johannes, Stegle, Oliver, and Beyer, Andreas. “A random forest approach to capture genetic effects in the presence of population structure”. In:


[48] Szegedy, Christian, Liu, Wei, Jia, Yangqing, Sermanet, Pierre, Reed, Scott, Anguelov, Dragomir, Erhan, Dumitru, Vanhoucke, Vincent, and Rabinovich, Andrew. “Going deeper with convolutions”. In: Proceedings of the IEEE

conference on computer vision and pattern recognition. 2015, pp. 1–9.

[49] Wei, Zhi, Wang, Wei, Bradfield, Jonathan, Li, Jin, Cardinale, Christopher, Frackelton, Edward, Kim, Cecilia, Mentch, Frank, Van Steen, Kristel, Visscher, Peter M, et al. “Large sample size, wide variant spectrum, and advanced machine-learning technique boost risk prediction for inflammatory bowel disease”. In: The American Journal of Human Genetics 92.6 (2013), pp. 1008– 1012.

[50] Welch, Bernard L. “The generalization of ‘Student's’ problem when several different population variances are involved”. In: Biometrika 34.1/2 (1947), pp. 28–35.

[51] Wood, Andrew R, Esko, Tonu, Yang, Jian, Vedantam, Sailaja, Pers, Tune H, Gustafsson, Stefan, Chu, Audrey Y, Estrada, Karol, Kutalik, Zoltán, Amin, Najaf, et al. “Defining the role of common variation in the genomic and biological architecture of adult human height”. In: Nature genetics 46.11 (2014), p. 1173.

[52] Xu, ChangJiang, Tachmazidou, Ioanna, Walter, Klaudia, Ciampi, Antonio, Zeggini, Eleftheria, Greenwood, Celia MT, and Consortium, UK10K. “Estimating genome-wide significance for whole-genome sequencing studies”. In: Genetic epidemiology 38.4 (2014), pp. 281–290.

[53] Yan, Qi, Jiang, Yale, Huang, Heng, Swaroop, Anand, Chew, Emily Y, Weeks, Daniel E, Chen, Wei, and Ding, Ying. “GWAS-based Machine Learning for Prediction of Age-Related Macular Degeneration Risk”. In: medRxiv (2019), p. 19006155.


Appendix A

Precision Recall Curves

We present the precision recall curves of the trained models here.


Figure A.0.2: Precision recall curves for CNN models trained with random SNPs.


TRITA-EECS-EX-2020:587
