
ACTA UNIVERSITATIS UPSALIENSIS

UPPSALA

Digital Comprehensive Summaries of Uppsala Dissertations from the Faculty of Medicine 1069

Machine Learning Based Analysis of DNA Methylation Patterns in Pediatric Acute Leukemia

CHRISTOFER BÄCKLIN

ISSN 1651-6206
ISBN 978-91-554-9151-2


Dissertation presented at Uppsala University to be publicly examined in Auditorium minus, Museum Gustavianum, Akademigatan 3, Uppsala, Friday, 13 March 2015 at 14:00 for the degree of Doctor of Philosophy (Faculty of Medicine). The examination will be conducted in English. Faculty examiner: Prof. Markus Ringnér (Lunds universitet).

Abstract

Bäcklin, C. 2015. Machine Learning Based Analysis of DNA Methylation Patterns in Pediatric Acute Leukemia. (Maskininlärningsbaserad analys av DNA-metyleringsmönster i pediatrisk akut lymfatisk leukemi). Digital Comprehensive Summaries of Uppsala Dissertations from the Faculty of Medicine 1069. 68 pp. Uppsala: Acta Universitatis Upsaliensis. ISBN 978-91-554-9151-2.

Acute lymphoblastic leukemia (ALL) is the most common pediatric cancer in the Nordic countries. Recent evidence indicates that DNA methylation (DNAm) plays a central role in the development and progression of the disease.

DNAm profiles of a collection of ALL patient samples and a panel of non-leukemic reference samples were analyzed using the Infinium 450k methylation assay. State-of-the-art machine learning algorithms were used to search the large amounts of data produced for patterns predictive of future relapses, in vitro drug resistance, and cytogenetic subtypes, with the aim of improving our understanding of the disease and ultimately improving treatment.

In paper I, the predictive modeling framework developed to perform the analyses of the DNAm datasets was presented. It focused on uncompromising statistical rigor and computational efficiency, while allowing a high level of modeling flexibility and usability. In paper II, the DNAm landscape of ALL was comprehensively characterized, discovering widespread aberrant methylation at diagnosis, strongly influenced by cytogenetic subtype. The aberrantly methylated regions were enriched for genes repressed by polycomb group proteins, repressively marked histones in healthy cells, and genes associated with embryonic development. A consistent trend of hypermethylation at relapse was also discovered. In paper III, a tool for DNAm-based subtyping was presented, validated using blinded samples, and used to re-classify samples with incomplete phenotypic information. Using RNA-sequencing, previously undetected non-canonical aberrations were found in many re-classified samples. In paper IV, the relationship between DNAm and in vitro drug resistance was investigated and predictive signatures were obtained for seven of the eight therapeutic drugs studied. Interpretation was challenging due to poor correlation between DNAm and gene expression, further complicated by the discovery that random subsets of the array can yield comparable classification accuracy. Paper V presents a novel Bayesian method for multivariate density estimation with variable bandwidths. Simulations showed performance comparable to the current state-of-the-art methods and an advantage on skewed distributions.

In conclusion, the studies characterize the information contained in the aberrant DNAm patterns of ALL and assess its predictive capabilities for future relapses, in vitro drug sensitivity and subtyping. They also present three publicly available tools for the scientific community to use.

Christofer Bäcklin, Department of Medical Sciences, Cancer Pharmacology and Computational Medicine, Akademiska sjukhuset, Uppsala University, SE-75185 Uppsala, Sweden.

© Christofer Bäcklin 2015

ISSN 1651-6206

ISBN 978-91-554-9151-2

urn:nbn:se:uu:diva-242544 (http://urn.kb.se/resolve?urn=urn:nbn:se:uu:diva-242544)


The thing that differentiates scientists is purely an artistic ability to discern what is a good idea,

what is a beautiful idea,

what is worth spending time on, and most importantly,

what problems are sufficiently interesting, yet sufficiently difficult to not already be solved, and for which the time to be solved has now come.

— Savas Dimopoulos [70]

To my grandparents.


List of papers

This thesis is based on the following papers, which are referred to in the text by their Roman numerals.

Shared first authorship.

I Bäcklin, C.L., and Gustafsson, M.G. Developer Friendly and Computationally Efficient Predictive Modeling without Information Leakage: The emil Package for R. Accepted with minor revisions by the Journal of Statistical Software, 2015.

II Nordlund, J., Bäcklin, C.L., Wahlberg, P., Busche, S., Berglund, E.C., Eloranta, M-L., Flaegstad, T., Forestier, E., Frost, B-M., Harila-Saari, A., Heyman, M.M., Jónsson, Ó.G., Larsson, R., Palle, J., Rönnblom, L., Schmiegelow, K., Sinnett, D., Söderhäll, S., Pastinen, T., Gustafsson, M.G., Lönnerholm, G., and Syvänen, A-C. Genome-wide signatures of differential DNA methylation in pediatric acute lymphoblastic leukemia. Genome Biology, 14(9):r105, October 2013.

III Nordlund, J., Bäcklin, C.L., Zachariadis, V., Cavelier, L., Dahlberg, J., Öfverholm, I., Barbany, G., Nordgren, A., Övernäs, E., Abrahamsson, J., Flaegstad, T., Heyman, M.M., Jónsson, Ó.G., Kanerva, J., Larsson, R., Palle, J., Schmiegelow, K., Gustafsson, M.G., Lönnerholm, G., Forestier, E., and Syvänen, A-C. DNA methylation-based subtype prediction in pediatric acute lymphoblastic leukemia. Clinical Epigenetics, 2015.

IV Bäcklin, C.L., Freyhult, E., Nordlund, J., Forestier, E., Frost, B-M., Harila-Saari, A., Heyman, M.M., Jónsson, Ó.G., Palle, J., Flaegstad, T., Schmiegelow, K., Larsson, R., Syvänen, A-C., Lönnerholm, G., and Gustafsson, M.G. DNA methylation-based prediction of in vitro drug resistance in primary pediatric acute lymphoblastic leukemia patient samples. Manuscript to be submitted to Pharmacogenetics, 2015.

V Bäcklin, C.L., Andersson, C., and Gustafsson, M.G. A simple and well-performing Bayesian approach to adaptive multivariate kernel density estimation. Manuscript to be submitted to Pattern Recognition, 2015.

Reprints were made with permission from the publishers.


Contents

1 Introduction
1.1 Acute lymphoblastic leukemia
1.2 Cytogenetic classification of ALL subtypes
1.3 Epigenetics of ALL
1.4 DNA methylation on a molecular level
1.5 Machine learning in molecular biology
1.5.1 Performance evaluation
1.5.2 The interplay of problem formulation, model complexity and difficulty to solve
2 Aims
3 Materials and methods
3.1 Samples and phenotype assessment
3.2 Methylation assay
3.3 In vitro drug resistance assay
3.4 Gene expression and RNA quantification
3.5 Mathematical and computational methods
3.5.1 Unsupervised methods
3.5.2 Nearest shrunken centroids
3.5.3 Ensemble methods and random forest
3.5.4 Statistical testing
3.6 Open and reproducible research
4 Summary and context of publications
4.1 Paper I
4.1.1 Purpose
4.1.2 Principle driven development
4.1.3 Comparison to similar software
4.2 Paper II
4.2.1 Overview of the dataset
4.2.2 Characterization and annotation of differential methylation
4.2.3 Aberrant methylation was more pronounced at relapse
4.2.4 DNAm had only limited prognostic value
4.2.5 Conclusions
4.3 Paper III
4.3.1 Motivation and setup
4.3.2 Modeling procedure
4.3.3 Performance estimation
4.3.4 Review of samples with unknown subtype
4.4 Paper IV
4.4.1 Introduction
4.4.2 Method
4.4.3 Results
4.4.4 Conclusions
4.5 Paper V
4.5.1 Motivation and definition
4.5.2 Results
4.5.3 Conclusions and future extensions
5 Concluding remarks
6 Acknowledgements
References


Abbreviations

ALL Acute lymphoblastic leukemia, Section 1.1.

BCP-ALL B-cell precursor acute lymphoblastic leukemia, Section 1.1.

CDX+ Cluster of differentiation X positive, e.g. CD3+, Section 3.1.

CpG-site CG dinucleotide, Section 1.4.

DNMT DNA methyltransferase, Section 1.4.

FISH Fluorescent in-situ hybridization, Section 1.2.

FMCA Fluorometric microculture cytotoxicity assay, Section 3.3.

GMM Gaussian mixture model.

NOPHO Nordic Society of Pediatric Hematology and Oncology (http://www.nopho.org), Section 1.1.

NSC Nearest shrunken centroids, Section 3.5.2.

PCA Principal component analysis, Section 3.5.1.

R The R language and environment for statistical computing (http://r-project.org), Section 3.5.

RF Random forest, Section 3.5.3.

RT-PCR Reverse transcription polymerase chain reaction, Section 1.2.

T-ALL T-cell acute lymphoblastic leukemia, Section 1.1.


1. Introduction

1.1 Acute lymphoblastic leukemia

Acute lymphoblastic leukemia (ALL) is a malignant disease occurring when T-cells or precursor B-cells of the bone marrow deviate from the normal maturation process, lose their natural function and rapidly increase in numbers [93]. It is the most common form of pediatric cancer, with an incidence peak between two and five years of age and roughly 160 cases per year in the Nordic countries, as diagnosed by the Nordic Society of Paediatric Haematology and Oncology (NOPHO) during the years 1981–2007 [95]. Immediate treatment is essential for survival, and with contemporary protocols 77% of the patients are cured (5 years of event-free survival). The remaining patients either experience relapses after entering remission, do not respond to the treatment, or suffer treatment-related complications. Although there has been great improvement since the 1970s, when an ALL diagnosis was a sure death sentence, there is still a need to more accurately detect patients at high risk of future complications and to elucidate the biological mechanisms underlying the development and progression of the disease.

In the Nordic countries pediatric ALL is treated with intensive combined chemotherapy specified in three protocols of increasing intensity: standard risk, intermediate risk, and high risk (SR, IR, and HR). There is also a special infant protocol for patients less than a year old and an international EsPhALL protocol for Philadelphia-positive (Ph+) patients (Section 1.2). Patients are assigned to one of these protocols at diagnosis based on their age, white blood cell count, and disease subtype, which is defined by the immunophenotype (T-ALL or BCP-ALL) and cytogenetic aberrations recurrent in the population (Section 1.2).

Relapsed ALL is more resistant to chemotherapy than the initial ALL, and is treated with intensified chemotherapy and bone marrow transplantation in severe cases. Both the intensified therapy and transplantation are associated with significant risks of serious complications, such as infections and graft versus host disease [28], and the overall survival is poorer [38].

1.2 Cytogenetic classification of ALL subtypes

Leukemic cells typically carry many somatic large-scale genomic rearrangements, including chromosomal translocations, aneuploidies, amplifications, duplications, and deletions [84]. Many of these are not random, but appear recurrently in the population and are associated with either better or worse prognosis. Oncogenic mutations have also been shown to be associated with ALL, both somatic [118] and germline [98], but they are not yet part of routine diagnostics, as is the case in the related disease acute myeloid leukemia (AML) [25].

Figure 1.1. Overview of the ALL cohort. Left: The number of cases in each cytogenetic subtype as a function of age at diagnosis: HeH (n=191), t(12;21) (n=165), undefined (n=113), non-recurrent (n=108), T-ALL (n=101), 11q23/MLL (n=31), t(1;19) (n=23), t(9;22) (n=20), dic(9;20) (n=20), susp HeH (n=12), iAMP21 (n=10), <45chr (n=5), >67chr (n=3). Note that the oldest patient was 18.9 years and that the right tail above 18.9 is an artifact of the smoothing algorithm used to generate the graph. Right: Fraction of relapse-free patients (y-axis) as a function of time in months (x-axis). Each vertical tick marks a censored patient, either because of competing events such as death from unrelated causes or simply because no more time has passed since the diagnosis.

The composition of the ALL cohort studied in this thesis in terms of cytogenetic subtypes is presented in Figure 1.1, together with Kaplan-Meier estimates of relapse-free survival. The largest subtype of ALL is high hyperdiploid (HeH), characterized by a large number of non-random chromosomal duplications, with a modal chromosome count between 51 and 67 [88]. Although the exact threshold values for inclusion in the subtype can appear slightly arbitrary, these patients are surprisingly consistent with respect to which chromosomes are duplicated and the lack of features characteristic of other subtypes, e.g. translocations. The far less common cases of patients with fewer than 45 chromosomes are referred to as hypodiploid, while patients with more than 67 chromosomes typically have a very high modal number approaching tetraploid. Patients labelled as "susp HeH" are suspected HeH, but not yet convincingly validated.

The translocation subtypes t(12;21), t(1;19), t(9;22) (also known as Ph+), and 11q23/MLL are characterized by aberrant fusions between different chromosomes. The breakpoints often occur within protein coding genes, producing fusion proteins such as ETV6/RUNX1 of t(12;21), formed by fusion of the genes ETV6 on chromosome 12 and RUNX1 on chromosome 21. The 11q23/MLL subtype contains fusions between the MLL gene on chromosome 11 and various other partners. The dicentric translocation dic(9;20) produces a fusion chromosome with both centromeres from chromosomes 9 and 20. Although some controversy exists as to whether or not it should be considered a subtype of its own, NOPHO finds that dic(9;20) patients share a poor prognosis and often lack defining features of other subtypes [116]. Patients with intrachromosomal amplifications (iAMP21) are characterized by a large number of copy number increases on chromosome 21 [45].

To determine karyotype and subtype membership, chromosome-banding and a number of targeted reverse transcription polymerase chain reaction (RT-PCR) and fluorescent in situ hybridisation (FISH) analyses are performed. RT-PCR is used to detect expressed fusion genes, using one primer for each of the two translocation partners. If the fusion gene is expressed, the two primers will bind to the same RNA and allow amplification; if not, they will bind to different RNAs and not produce any product. FISH can be used to detect non-expressed translocations, using a pair of fluorescently labelled probes designed for the two translocation partners. If there is a translocation, the two signals will co-localize when binding to the DNA; if not, they appear as separate signals.

A large number of patients lack subtype membership due to negative or failed tests, labelled “undefined” in Figure 1.1, or only display non-recurrent cytogenetic aberrations, labelled “non-recurrent”.

1.3 Epigenetics of ALL

Epigenetics¹ encompasses all heritable changes to a cell that regulate phenotype but do not involve any changes in the nucleotide sequence [29], with histone coding and DNA methylation (DNAm) being the most well studied phenomena (Section 1.4). These mechanisms are intimately connected to cell differentiation and development, but also serve ordinary cellular functions like chromatin arrangement, regulation of gene expression, and genomic imprinting [4, Chapter 11].

Although these changes are heritable on a cellular level, they are generally not heritable on a population level, since humans undergo epigenetic reprogramming during embryo development [91] that erases and resets the DNAm patterns. However, cases of transgenerational inheritance of epigenetically acquired traits have been observed from parent to offspring, e.g. adaptation to starvation [61], pathogens, or temperature [22], which has caused some confusion regarding the meaning of the term epigenetics. Transgenerational inheritance is not discussed in this thesis.

There are several reasons to believe that epigenetics plays a central role in ALL: (1) Epigenetic states are carefully regulated and maintained in normal healthy cells, but this control is disrupted in ALL and many other diseases. (2) It has long been known that the DNA of cancer cells harbours many mutations and structural changes, some of which are recurrent but the majority of which are rare and seemingly random. In some cases, recurrent aberrations have been shown to be the initiating event of the disease, e.g. the RARα translocations of acute promyelocytic leukemia (APL) [96], but in most cases the initiating events remain unknown despite intensive study. This is partly because it is hard to identify true causal variants among the large number of passenger aberrations, and presumably because genetics alone cannot explain everything. The general belief is still, though, that genetic alterations are the initiating events of the disease but that epigenetic modifications contribute significantly to its development. (3) Cancer cells are highly adaptable to new conditions, such as changes in the local environment during tumor growth or exposure to drugs. Acquired traits have been observed to be reversible in several studies [71, 112], which rules out genetics as the underlying mechanism. (4) In solid cancers, studies have shown that cancer cells can dedifferentiate to acquire traits desirable for growth and metastasis [97]. Since epigenetic regulation is central in cellular differentiation, it is highly likely to also be important for disease progression. (5) Connections between DNAm and disease development and outcome have been observed in earlier studies of other cancers [18, 109, 73], as well as in a smaller-scale ALL study by our own lab [82]. Recently, additional studies with similar results have also appeared [103, 79, 114]. (6) Patient survival has been improved using demethylating agents in myelodysplastic syndromes (MDS) [33, 42], a group of blood diseases that often develop into AML. The most successful of these drugs is 5-azacytidine, described in more detail below.

¹The epi- prefix is Greek (ἐπί) for over, above, outer.

1.4 DNA methylation on a molecular level

Methyl groups may be attached to DNA at any cytosine nucleotide directly followed by a guanine nucleotide, so called CpG-sites, by DNA methyltransferases (DNMT, Figure 1.2). The "p" in "CpG" is short for the phosphate separating the C and G nucleosides. A key feature of the CpG-sites is that they are palindromic, meaning that their sequence is identical on both the forward and reverse strands, allowing their states to be mirrored by maintenance DNMTs and to persist through DNA replication.

The mechanisms for removing DNAm have only recently been understood in full detail [64]. It can either be actively removed by iterative oxidation and base excision repair, passing through hydroxymethylation (Figure 1.2, right) as an intermediate state, or passively removed if the supply of methyl groups is depleted during DNA replication, if the DNA is damaged and replaced, or if the DNMTs are disabled. This is the mechanism exploited by the demethylating drug 5-azacytidine, a nucleoside analogue that can replace cytosine during replication and then binds covalently to mDNMT when it tries to copy the methylation status from the template strand. At low doses, cells respond by degrading the mDNMT and repairing the DNA, losing the methylation status in the process. At high doses 5-azacytidine is cytotoxic, which was in fact the intended use when the drug was first investigated in the 1950s.

Figure 1.2. Left: The top DNA strand is de novo methylated by a DNMT. A maintenance DNMT (mDNMT) then recognizes the mismatch and adds a complementary methyl group to the opposite strand. Right: The structure of a cytosine nucleotide without methylation, with methylation, and with hydroxymethylation.

There are roughly 28 million CpG-sites in the human genome, of which many are clustered in regions of high CG-content, so called CpG-islands. These are frequently found close to transcription start sites (TSS) where many regulatory factors bind, providing a natural interpretation of the function of DNAm. However, the methylation status correlates with the expression of nearby genes for only a small fraction of the CpG-sites, and there are also plenty of CpG-sites far from any gene, indicating that DNAm is predominantly involved in long range interactions, or may be put out of use through chromatin remodeling.

DNAm in the context of chromatin remodeling

To understand the function of DNAm it is necessary to examine the context in which it occurs. DNA is never found floating around freely and naked in the nucleus, despite traditionally being depicted as a bowl of spaghetti in cartoon figures. It is wrapped around histone complexes called nucleosomes to form chromatin, which can be packed tightly (heterochromatin) or loosely (euchromatin) to prevent or allow access by the transcription machinery [4, Chapter 3].

On a higher level, the chromatin is attached to a polymerous matrix (similar to a scaffold meshwork) called the nucleoskeleton, spanning the interior of the nucleus and attached to the nuclear lamina, a thick fibrillous mesh coating the inner nuclear membrane [92, 21, 4]. This structure is highly dynamic and responsible for much of the regulation of gene products. One may think of the nucleus as a library where some books (genes) are put on shelves (euchromatin) in the main room (nucleoplasm), ready to be browsed by students and researchers (transcription machinery), while other books are put in cabinets (heterochromatin) in the basement archive (matrix and lamina), not immediately accessible but not impossible to retrieve. However, computer simulations of histone polymer chemistry have revealed that chromatin can form several additional states [108], which will likely lead to an extended structural model in the near future.

Although the mechanisms controlling the chromatin structure are far from worked out in detail, much of the information is believed to be encoded in small molecules attached to histone tails and directly onto the DNA itself. During replication the structure of the chromatin is completely changed, but the modifications are preserved, copied and passed on to the daughter cells, which restore the organisation.

As for the actual function of epigenetic marks, a variety of histone marks have been discovered, coding for different chromatin structure and activity, e.g. open or closed chromatin or active transcription of a gene. They are believed to form a complex code with much interaction, actively altered and read by enzymes, rather than affecting chromatin structure through physical and chemical properties. DNAm is less dynamic and informative, but more persistent. The variation in methylation can only in part be attributed to known functions, like suppression of gene expression through promoter hypermethylation. A major hurdle for interpretation is that both histone marks and DNAm are time and tissue specific, but few studies are designed in such a way that causality can be determined. Nevertheless, since many diseases are associated with increased methylation levels and variability, treatment with demethylating agents has been found effective in myelodysplastic syndrome (MDS), and two such drugs, azacitidine and decitabine, have already been approved by the US Food and Drug Administration.

1.5 Machine learning in molecular biology

Technological advances in the past decades have caused our capacity to generate and store data to explode in many fields. This is particularly noticeable in molecular biology, where for example the cost of sequencing a genome was reduced from $3 billion to $30,000 (US) in the ten years following the completion of the first genome [110, 77], and is currently on the order of $1,000 [17]. It has therefore become increasingly important to be able to automatically identify meaningful patterns in large volumes of data, especially when the prior knowledge of the systems at hand is limited. The field of machine learning encompasses concepts and methods for accomplishing this.

The techniques of machine learning are divided into two main categories, unsupervised and supervised learning [46]. Unsupervised techniques summarize data and create overviews in forms that are easier for humans to understand than raw data (Section 3.5.1). Supervised techniques construct models that estimate or "predict" a response of interest based on collected data. More specifically, a supervised technique sets up a model

$$y = f(x \mid \hat{\theta}, \mathcal{D}) + \varepsilon \qquad (1.1)$$

which can map an observed data point $x$ to a response $y$ using a model family $f$ parametrized by $\theta$ (the hat $\hat{\ }$ meaning estimate). The term $\varepsilon$ is a residual error. An example of such a model family that can map an $x \in \mathbb{R}^p \to y \in \mathbb{R}$ is linear regression

$$\hat{y} = \hat{\theta}^T x \qquad (1.2)$$

where $\hat{\theta}$ is a vector of coefficients of the same length as $x$. To estimate $\theta$, one tries to minimize the estimated error $\hat{q}$ of the predictions on a training set consisting of observations $\mathcal{D} = \{x_1, \ldots, x_n\}$ with associated known responses $Y = \{y_1, \ldots, y_n\}$. The error can be estimated in different ways, but a classical error function suitable for linear regression is the mean square error (MSE):

$$\hat{q} = \mathrm{MSE}(Y, \hat{Y}) = \frac{1}{n} \sum_{i=1}^{n} (y_i - \hat{y}_i)^2 \qquad (1.3)$$

The $\hat{\theta}$ that minimizes $\hat{q}$ can be hard to find, but for this particular setup there is a closed-form expression for it:

$$\hat{\theta} = (D^T D)^{-1} D^T Y \,. \qquad (1.4)$$

Generally, the estimation method differs from model family to model family.
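To make the estimation concrete, the following R sketch fits the linear regression example using the closed-form estimate (1.4) on simulated data; all numbers are invented for illustration, and in practice R's built-in lm function would be the idiomatic choice.

```r
## Fitting a linear model by the closed-form least squares estimate (1.4).
set.seed(1)
n <- 100; p <- 3
D <- matrix(rnorm(n * p), nrow = n)         # n observations of p features
theta <- c(2, -1, 0.5)                      # "true" coefficients
Y <- D %*% theta + rnorm(n, sd = 0.1)       # responses with residual error, eq. (1.1)
theta_hat <- solve(t(D) %*% D, t(D) %*% Y)  # closed-form estimate, eq. (1.4)
Y_hat <- D %*% theta_hat                    # predictions, eq. (1.2)
mean((Y - Y_hat)^2)                         # mean square error, eq. (1.3)
```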

There is a plethora of model families available with accompanying methods for estimating $\theta$, and new ones can be created either from scratch or by combining existing techniques, allowing models to be customized to fit any particular application. New knowledge about the modelled system can also be gained through examining $\hat{\theta}$, making machine learning popular in molecular biology where data is abundant and new knowledge is frequently appearing.

1.5.1 Performance evaluation

The true performance $q$ of a model depends both on how well $f$ matches the true structure of the data and how well the true value of $\theta$ was approximated by the corresponding estimate $\hat{\theta}$. If a training set is sufficiently large and represents well the underlying distribution from which it was sampled, $\hat{\theta}$ will be similar to the unknown ideal $\theta$, producing a model that accurately predicts $y$ for new unseen observations that are not part of $\mathcal{D}$. If, on the contrary, the training set is too small or biased, $\hat{\theta}$ will differ substantially from $\theta$, producing an overfitted model that will not be able to predict $y$ well for new unseen observations. The only way to tell how well an estimate $\hat{\theta}$ performs is by testing the model on an independent test set, with corresponding responses, that was not part of the estimation of $\theta$. Just as the size of the training set is important for obtaining a well-performing $\hat{\theta}$, the size of the test set is important for getting a reliable estimate of the performance $\hat{q}$. It is not obvious how to divide available data into design and test sets in the most beneficial way, particularly not if the amount is limited, but a popular approach is to use resampling methods or cross-validation [46, chapter 7.10]. Cross-validation divides the entire data set into $K$ mutually exclusive subsets called folds, and in turn lets each of them act as a test set for a model designed on the observations present in the remaining folds. This produces a collection of $K$ performance estimates $\{\hat{q}_k\}_{k=1}^{K}$ with a mean estimate $\hat{Q}$. If $K$ is large, each design set will be large and produce models similar to the one obtained if all data was used for design, thus generating an approximately unbiased $\hat{Q}$. However, this comes at the price of high variance of $\{\hat{q}_k\}_{k=1}^{K}$, since the test sets will be small. If $K$ is small, the design sets will be relatively small compared to the total number of examples available for design and test, and result in models with worse performance than for a large $K$, negatively biasing the mean $\hat{Q}$ but reducing the variance of $\{\hat{q}_k\}_{k=1}^{K}$. A sound strategy is therefore to estimate a lower bound on $\hat{Q}$ with a small $K$ and build a final model for future use based on all available data.
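A minimal sketch of the procedure just described, reusing D, Y, and n from the regression example above (the choice of K = 5 is arbitrary):

```r
## K-fold cross-validation of the linear model: each fold is held out once,
## yielding fold-wise error estimates q_k and their mean Q.
K <- 5
folds <- sample(rep(1:K, length.out = n))  # random fold assignment
q <- sapply(1:K, function(k) {
  tr <- folds != k                         # design set indicator for fold k
  th <- solve(t(D[tr, ]) %*% D[tr, ], t(D[tr, ]) %*% Y[tr])
  mean((Y[!tr] - D[!tr, ] %*% th)^2)       # test error on the held-out fold
})
Q <- mean(q)
```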

In practical applications, model training typically contains a number of steps from data pre-processing to performance evaluation, and it is important to realize that $\hat{Q}$ will be positively biased if not all steps of the training are performed separately for each cross-validation fold. This is extremely important when feature selection techniques are used prior to the estimation of model parameters. A non-obvious example relevant to applications in molecular biology is the case of dimensionality reduction of gene expression or DNAm data sets. In such applications the number of observations is typically much smaller than the number of measured features per observation (mRNAs or CpG-sites), making it difficult or impossible to estimate $\theta$ for simple model families, such as the linear regression example given in (1.1)–(1.4). A powerful method to solve this problem is to use principal component analysis [46, Pages 62–63, 485–491], non-negative matrix factorization [69], or some other compression method to reduce the number of features before estimating $\theta$. However, if all the data is used for performing the compression before the data is divided into design and test sets, the observations of the test set will influence the definition of the new set of features by which the data will be described, which can substantially aid the training procedure. This is referred to as information leakage.
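The leakage-free way to combine compression with cross-validation is to refit the compression inside each fold, on the design observations only. A sketch, reusing D, Y, and folds from above (the two-component reduction is an arbitrary choice for the example):

```r
## PCA compression fitted per fold on design data only; the held-out
## observations are projected afterwards with the design-set loadings,
## so they never influence the feature definition.
q_pca <- sapply(1:K, function(k) {
  tr <- folds != k
  pca <- prcomp(D[tr, ], rank. = 2)        # compression fitted without test data
  Ztr <- pca$x                             # design-set scores
  Zte <- predict(pca, newdata = D[!tr, ])  # projection of the held-out fold
  th <- solve(t(Ztr) %*% Ztr, t(Ztr) %*% Y[tr])
  mean((Y[!tr] - Zte %*% th)^2)
})
```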

Ideally, all steps starting from raw data processing and quality control should be incorporated into the cross-validation procedure to completely rid the results of biases from information leakage. In reality this is usually very time-consuming and impractical, especially if data is produced and pre-processed by an external service, and performance estimation typically takes place only after the data set has been quality controlled and normalized. Whether or not this confers significant biases differs between applications, and it is important to always evaluate the risks of not including each step in the performance estimation.

Figure 1.3. Three properties of supervised learning problems that influence the successfulness of a modeling attempt (panel labels: Easy, Challenging, Futile). Left: Response ambiguity refers to how often similar observations have different responses. Center: The model complexity can lazily be described as how complicated the response surface or decision boundary is. Right: Data sparsity refers to how far apart the observations are in the space spanned by the features of the dataset. The more features measured, the further apart the observations will be, with larger areas in between them left unseen (unless they reside in a subspace, that is).

1.5.2 The interplay of problem formulation, model complexity and difficulty to solve

The successfulness of a modeling attempt can be traced to three sources: response ambiguity, model complexity, and data sparsity [48] (Figure 1.3). These can to some extent be controlled by the user, but are often dependent on each other. For example, the response can be made less ambiguous by adding more features, at the expense of increased data sparsity, or a less complex model family may be chosen if a continuous response is divided into discrete classes, at the expense of reduced interpretability.

Concerning model complexity, the best result is obtained when the structure of $f$ allows the model to come as close to the true underlying pattern as possible, while $\theta$ is simple enough to be estimated well with the available data. If there is plenty of data, a simple $f, \theta$ might not be able to make the best use of it; vice versa, if $f, \theta$ are too complex with respect to the available data, a well performing estimate of $\theta$ cannot be found, even if the true pattern lies perfectly within the scope of $f$ (Figure 1.4).

Figure 1.4. Demonstration of a model complexity trade-off. Polynomials of different degrees (black) were fitted to 12 data points generated from an underlying function that was a sum of two sine functions (grey). No polynomial model can exactly match the true pattern, but the higher the degree the closer an approximation can get. Polynomial models of degree 1 (left) and 3 (middle) had enough data to obtain sensible estimates of their parameters, but were inevitably further from the truth than a well estimated polynomial model of higher degree would have been. A polynomial model of degree 8 (right) theoretically had the ability to come closer, but in this particular case there was not enough data to estimate its parameters properly. The result was an overfitted model with poor performance in areas of the space where no training data was available. Of the models shown here, the polynomial of degree 3 would be the overall winner in terms of mean square error.
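The experiment in Figure 1.4 can be reproduced in spirit with a few lines of R; the true function, noise level, and degrees below are assumptions for illustration, not the exact settings behind the figure:

```r
## Polynomials of degree 1, 3, and 8 fitted to 12 noisy points drawn from
## a sum of two sines, scored against the true function.
set.seed(2)
f_true <- function(x) sin(x) + 0.5 * sin(3 * x)
x <- sort(runif(12, 0, 10))
y <- f_true(x) + rnorm(12, sd = 0.2)
grid <- data.frame(x = seq(0, 10, length.out = 200))
for (deg in c(1, 3, 8)) {
  fit <- lm(y ~ poly(x, deg))
  err <- mean((predict(fit, grid) - f_true(grid$x))^2)
  cat("degree", deg, "- MSE against truth:", round(err, 3), "\n")
}
```

With this few points, the degree-8 fit typically shows the overfitting described in the caption: near-perfect on the training points, poor between them.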

Three kinds of supervised learning problems appear in this thesis: classification, regression and survival analysis. Classification deals with categorical responses (e.g. a patient has a t(12;21) translocation or not) and is the simplest of the three; regression deals with continuous responses (e.g. how resistant a patient is to a drug on a scale from 0 to 100), which are more informative but also require more complex models; and survival analysis deals with a combination of the two, i.e. types and time points of events that occurred (e.g. a patient experienced a relapse 3 years and 2 months after the initial diagnosis), and is the most complex type of problem. Survival analysis is further complicated by the fact that the response is only certain in cases where an event has already occurred; event-free cases are not guaranteed to remain so indefinitely, regardless of how long they have been observed. They are instead said to be censored at the last observed time point. The classical way to model such a response is to use a regression model with time to event as response, and include the censoring information to the extent possible (e.g. using the Cox proportional hazards model [20]). However, unless the dataset is sufficiently large to fit such a complex model, we are probably better off simplifying it, exchanging some potential to accurately capture the underlying pattern for robustness to variation in the data. Two common ways to do this are to either divide the observations into discrete groups and test for a difference in outcome with a statistical test (e.g. Gray's test, Section 3.5.4), or to dichotomize the response and treat it as a classification problem, e.g. an event did or did not occur before a time point t (as was the case in the survival analysis of Paper II).
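A dichotomization of this kind can be expressed in a few lines of R; the time values below are invented, and the handling of patients censored before the cut-off (excluded as indeterminate) is one common convention rather than the specific choice made in Paper II:

```r
## Dichotomizing a censored time-to-event response at t = 36 months.
time  <- c(10, 40, 25, 60, 5, 50)  # months to event or censoring (toy values)
event <- c(1, 0, 1, 0, 0, 1)       # 1 = relapse observed, 0 = censored
t_cut <- 36
cls <- ifelse(event == 1 & time <= t_cut, "event",
       ifelse(time >= t_cut, "no event", NA))  # NA: censored before t_cut
cls
```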

Another example of non-trivial modeling decisions is how to treat groups that may not be directly comparable, e.g. treatment groups or cytogenetic subtypes. If the groups are analysed together the data is less sparse, but at the possible cost of increased complexity of the resulting prediction model. How much the complexity increases depends on how similar the groups are, so whether or not it is a good strategy depends on the application. If there are patterns in the data shared across the groups, it might be better to analyse them together and adjust for group effects with covariates or stratification. If there are no shared patterns, it is better to analyse the groups independently, since the reduced sparsity will not fully compensate for the increased complexity of having to solve two different problems simultaneously.

This matter is of great importance when working with hypothesis-generating problems, which are common in medicine and biology today. Modeling simplifications and trade-offs of the type described above were necessary in both Papers II and IV of this thesis, since at the time of study design there were good reasons to believe that there are meaningful patterns in DNAm profiles related to disease development, progression, and drug resistance, but their extent and complexity were not known.

In summary, if you strike gold in the sense that the investigated trait has a simple pattern (e.g. the successful GWASes), it can be found with nearly any method; if not (e.g. all the other GWASes), machine learning offers a toolbox for getting the most out of your data.


2. Aims

The aim of this thesis was to investigate the prognostic value and biological function of DNA methylation in childhood ALL using state-of-the-art machine learning tools. More specifically, the aims were:

• To broadly characterize the methylome of ALL and to search for DNAm patterns at diagnosis predictive of future relapses (Paper II).

• To improve patient subtyping by augmenting current practices with DNAm-based prediction (Paper III).

• To gain knowledge of the epigenetic mechanisms of cellular drug resis- tance (Paper IV).

• To develop easy-to-use tools for multivariate analyses of DNA methylation data and to make them publicly available (Papers I, III, and V).


3. Materials and methods

3.1 Samples and phenotype assessment

Bone marrow aspirates and peripheral blood samples were collected by the NOPHO clinical centers, vitally frozen, and biobanked during the years 1996–2011. Non-lymphocytic cells were removed by Ficoll-Isopaque centrifugation (Pharmacia). All patients were enrolled in the NOPHO 1992 or NOPHO 2000 treatment protocols [95] and provided written informed consent themselves and/or through their parents or guardians. The studies were approved by the ethics committee of Uppsala University.

Clinical and phenotypic information was recorded or produced by the NOPHO clinical centers. The leukemic cell fraction of the ALL samples was assessed by experienced pathologists using light microscopy. Karyotypes and cytogenetic subtype were analyzed with chromosome-banding, RT-PCR, and FISH (Section 1.2).

A reference panel of non-leukemic samples was used for comparison with the ALL samples in the studies. The panel included healthy bone-marrow samples from an age-matched subset of the patients from whom the diagnostic samples were taken, collected at routine follow-up screening for minimal residual disease within one year from diagnosis. The panel also included T-cells and B-cells isolated from peripheral blood samples of healthy Swedish blood donors using positive selection and MACS cell separation reagents [83]. The T-cells were identified by the presence of the cell-type-specific surface protein cluster of differentiation 3 (CD3+), and the B-cells by CD19+. Finally, one pooled sample of CD34+ hematopoietic stem cells from 5 blood donors was purchased from the company 3H Biomedical (Uppsala, Sweden), because of the difficulty of obtaining a pure CD34+ sample in-house due to the small number of CD34+ cells in peripheral blood.

3.2 Methylation assay

The Infinium 450k methylation assay (Illumina) [55] was used to measure the methylation levels of 478,855 CpG-sites, performed by the SNP&SEQ platform of the Molecular Medicine group at Uppsala University. The DNA was bisulfite converted (Figure 3.1), changing unmethylated cytosines into uracils but leaving methylated cytosines unchanged, and then amplified, changing the uracils into thymines in the process. The amplified DNA was then fragmented and applied to a chip containing beads with attached probes to which the DNA fragments may hybridize. Single base extension was then performed using hapten-labelled dideoxynucleotides that were subsequently fluorescently stained and identified by optical scanning.

Figure 3.1. Bisulfite conversion. Unmethylated cytosines are converted to uracils when exposed to bisulfite treatment (left), but methylated cytosines are not (right). The figure was originally published in the Nucleic Acids Book [5] and is used here with permission from the author.

There are two types of probes on the 450k: type I, originally developed for the predecessor Infinium 27k array, and the newly developed and simpler type II.

Type I probes come in pairs that differ in the last nucleotide. If the complementary fragment was unmethylated at the interrogated CpG-site, only the unmethylated copy of the probe pair can be extended, due to the mismatch at the last base of the methylated copy of the probe pair (Figure 3.2 A, left), and vice versa: if the fragment was methylated at the interrogated CpG-site, only the methylated copy of the probe pair can be extended (Figure 3.2 A, right). The degree of methylation at the CpG-site is then calculated as $\beta = \frac{M}{M + U + 100}$, where $M$ and $U$ are the signal intensities detected from the two copies of the probe pair ($M$ = methylated and $U$ = unmethylated). The constant in the denominator is a regularisation parameter guarding against division by zero.

Type II probes do not come in pairs, but in only a single version that hybridizes up to the G nucleotide of the interrogated CpG-site (Figure 3.2 B). Single base extension then incorporates an A nucleotide if the locus was unmethylated and therefore converted to T, or a G nucleotide if the locus was methylated and thereby protected from conversion. The degree of methylation at the CpG-site is then calculated as $\beta = \frac{G}{R + G + 100}$, where $G$ and $R$ are the green and red intensities.
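Expressed in R, the two β-value computations are one-liners (the example intensities are invented):

```r
## Beta-values for the two probe types; the +100 term regularises
## against division by zero at low intensities.
beta_type1 <- function(M, U) M / (M + U + 100)  # type I: methylated / unmethylated intensities
beta_type2 <- function(G, R) G / (R + G + 100)  # type II: green / red channel intensities
beta_type1(M = 4000, U = 1000)                  # ~0.78, i.e. mostly methylated
```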

The raw intensities were processed in GenomeStudio (Illumina) using the methylation module [56] and exported as β-values together with associated detection p-values. The p-values were calculated by comparing each CpG-site's signal intensity to the intensity distribution of a set of negative control positions on the array. The data was then imported into R, where CpG-sites with p-value > 0.01 were filtered out, producing an average fraction of failed CpG-sites of 0.22% across the samples. The data was finally normalized to account for probe-type-specific biases using peak-based normalization [24].

Figure 3.2. The chemistry of the two probe types of the Infinium 450k methylation assay. (A) Type I probes use allele-specific hybridization, where the probes come in pairs with identical sequences apart from the last base, which overlaps the C of a CpG-dinucleotide. If the site was unmethylated, the C in the fragment was converted to a T, allowing the unmethylated probe to be extended (top left), but not the methylated one (bottom left). If the site was methylated, the C remains, allowing the methylated probe to be extended (bottom right), but not the unmethylated one (top right). (B) Type II probes use single base extension, where the probes do not include the C of the CpG-site, allowing all probes to be extended, but with different colour depending on the incorporated nucleotide. The figure was originally published by Bibikova et al. (2011) [9] and is used here with permission from the author.

Sites measured with unreliable probes were removed from all analyses. Such probes include those located on the X and Y chromosomes, since they would introduce sex-related biases; 65 single nucleotide polymorphism (SNP) genotyping probes included on the array for DNA tracing¹; probes hybridizing to multiple locations in the genome; and probes with a common SNP or indel within the first 2 bp from the 3' end of the probe, since these affect probe hybridization and produce noisy results (see the supplementary methods of Paper II for details). 435,941 CpG-sites remained for analysis.

¹These probes do not map to CpG-sites, but to SNPs. Their β-values therefore relate to the genotype of the sample, which can be used for identification, and do not measure methylation status.
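A sketch of this filtering in R, with assumed object names (beta: a CpG-by-sample matrix of β-values, detp: the matching detection p-values, unreliable_probes: the IDs of the excluded probe categories):

```r
## Mask failed measurements and drop unreliable probes before analysis.
beta[detp > 0.01] <- NA                                 # failed CpG-site measurements
beta <- beta[!rownames(beta) %in% unreliable_probes, ]  # X/Y, SNP, cross-reactive, polymorphic probes
mean(is.na(beta))                                       # fraction of failed measurements
```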

3.3 In vitro drug resistance assay

Data from the fluorometric microculture cytotoxicity assay (FMCA) [74] was available to assess in vitro drug resistance of primary ALL samples [37]. It is a microplate-based cell viability assay where cells are seeded in wells and exposed to drugs for 72 h, after which the fraction of viable cells is detected by fluorescent staining and optical scanning. More specifically, fluorescein diacetate is added, which is hydrolyzed in cells with intact cell membranes and emits a light signal. Matching wells without drug exposure are used for comparison, and additional wells with medium only are used for background correction.

3.4 Gene expression and RNA quantification

Three forms of gene expression data were used in the studies in this thesis: digital gene expression [86], array-based RNA quantification, and RNA-sequencing.

Briefly, the digital gene expression method consists of sequencing the 17 bp closest to the polyadenylated tails of mRNAs and annotating them to the human transcriptome (Ensembl version 58) using both the coding and non-coding strands. Tags matching more than one gene were discarded. The data was produced for an earlier study [86] and was readily available.

The array-based expression profiling data was produced using the Human Genome U133 Plus 2.0 GeneChip (Affymetrix) [2]. In short, it works by first synthesizing double-stranded cDNA from the RNA sample that is to be assayed. The cDNA is then used as a template for an in vitro transcription reaction producing a large amount of biotin-labelled cRNAs. The cRNAs are applied to a chip containing probes for a predefined set of genes, to which the cRNAs hybridize. Finally, the hybridized cRNAs were stained with a fluorescent molecule (streptavidin-phycoerythrin, which binds to the biotin) and subsequently detected by measuring the fluorescent light intensity. The data was produced and normalized by the Uppsala array platform using the robust multi-array average method [59].

Fusion genes were detected in Paper III using RNA-sequencing, performed by the SNP&SEQ platform of the Molecular Medicine group at Uppsala University. The ScriptSeq v1.2 kits were used for library preparation (Epicentre, Madison, WI, USA), and sequencing was performed using either HiSeq2000/2500 or MiSeq instruments (Illumina Inc., San Diego, CA, USA).

Gene fusions were detected using the FusionCatcher software [85].

3.5 Mathematical and computational methods

The R environment for statistical computing [90] was used for performing most of the analyses, thanks to its wide range of available packages, excellent graphical facilities, and well written interfaces to computer clusters, other programming languages such as C++, database tools such as MySQL, and programs for statistical analysis such as SPSS. The following subsections describe the theoretical foundations of the methods used, together with the relevant R commands.

3.5.1 Unsupervised methods

Principal component analysis (PCA) [46, Pages 62–63, 485–491], implemented in the prcomp function in R, was used to describe the dominating structures in the DNAm dataset in Paper II. PCA calculates an alternative set of orthogonal basis vectors, ordered by how much variance in the dataset they capture, called loading vectors, eigenvectors, or principal directions. The associated score or coordinate values in the transformed space are called principal components.

Interpreted geometrically, PCA rotates the dataset so that the first principal direction describes the direction in the dataset with the maximal variance, the second principal direction describes the direction with the maximal remaining variance, and so on. This can reveal interesting structures in the data, particularly if the unprocessed data is of a high dimension that cannot easily be overviewed in unprocessed form.
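A typical invocation, assuming the filtered β-value matrix from Section 3.2 (CpG-sites as rows, no missing values), might look like:

```r
## PCA overview of a methylation dataset with prcomp.
pc <- prcomp(t(beta))          # transpose so that observations are rows
summary(pc)$importance[, 1:3]  # variance captured by the first directions
plot(pc$x[, 1], pc$x[, 2], xlab = "PC1", ylab = "PC2")  # score plot
```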

Hierarchical agglomerative clustering [46, Section 14.3.12], implemented in the hclust function in R, was used in Papers II, III, and IV to search for groups of similar observations. Complete linkage and Euclidean distances were used throughout to merge the observations into clusters, unless otherwise stated. Still, since finer details tend to be wiped out by the bulk information of the profiles, a weighted Euclidean distance was used in Paper IV to detect clusters relevant to in vitro drug resistance. The weights were defined as the absolute variable importance scores of the CpG-sites, estimated by the classification algorithms, in effect producing a semi-supervised clustering method.
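The weighted distance can be obtained by scaling each feature by the square root of its weight before calling dist, since $\sqrt{\sum_i w_i (x_i - y_i)^2}$ is the ordinary Euclidean distance of the rescaled data. A sketch, with w standing in for the (assumed) importance score vector:

```r
## Semi-supervised clustering via importance-weighted Euclidean distances.
X <- t(beta)                              # observations as rows
wdist <- dist(sweep(X, 2, sqrt(w), `*`))  # weighted Euclidean distances
hc <- hclust(wdist, method = "complete")  # complete linkage, as in the papers
plot(hc)
```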

Traditionally, clusterings are plotted with the observations distributed along the x-axis and the branch lengths denoted by the y-axis, as in Figure 1 of Paper II. The dendrograms presented in Paper IV are regular hierarchical clusterings produced with hclust, but plotted using polar coordinates, with the observations distributed across the angles and branch lengths denoted by the radii. This was found to make better use of the plot area, since the perimeter of a circle provides more space for the many leaf nodes than a single side of a rectangle, and to also be aesthetically pleasing.

3.5.2 Nearest shrunken centroids

The nearest shrunken centroids classification method (NSC) used in Papers II, III, and IV was introduced for multi-class tumor classification based on gene expression data [106]. The general idea of the method is to represent each class k by a prototype observation called a centroid, and assign new observations to the class they are spatially closest to. Initially the centroids are defined as the mean feature vector of each class; they are then "shrunken" towards the overall centroid of the whole dataset.

Let $x_{ij}$ be the values of features $i = 1, 2, \ldots, p$ and observations $j = 1, 2, \ldots, n$. The $i$th component of the unshrunken centroid for class $C_k$, where $k = 1, 2, \ldots, K$, is defined by the mean of its observations

$$\bar{x}_{ik} = \frac{\sum_{j \in C_k} x_{ij}}{n_k} \qquad (3.1)$$

where $n_k$ is the number of observations in class $C_k$. Similarly, the $i$th component of the overall centroid is $\bar{x}_i = \sum_{j=1}^{n} x_{ij}/n$. The standardized distance between the class centroids and the overall centroid is $d_{ik} = (\bar{x}_{ik} - \bar{x}_i)/s_i$ for feature $i$, where $s_i$ is the pooled within-class standard deviation for feature $i$, defined as

$$s_i^2 = \frac{1}{n - K} \sum_{k} \sum_{j \in C_k} (x_{ij} - \bar{x}_{ik})^2 \,. \qquad (3.2)$$

The components are then shrunk towards $\bar{x}_i$, generating shrunken centroid components $\bar{x}'_{ik} = \bar{x}_i + s_i d'_{ik}$, where $d'_{ik}$ is defined as

$$d'_{ik} = \operatorname{sign}(d_{ik}) \left( |d_{ik}| - \Delta \right)_+ \qquad (3.3)$$

where $\Delta$ is a user-defined threshold and $(x)_+$ is defined as

$$(x)_+ = \begin{cases} x & \text{if } x \geq 0 \\ 0 & \text{if } x < 0 \end{cases} \qquad (3.4)$$
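A direct transcription of (3.1)–(3.4) into R might look as follows; this is a sketch of the equations as presented here (the published method includes additional normalization factors), and the pamr package provides a full implementation:

```r
## Shrunken centroids: X is a feature-by-observation matrix, y a vector of
## class labels, Delta the user-defined shrinkage threshold.
shrunken_centroids <- function(X, y, Delta) {
  classes <- unique(y)
  n <- ncol(X); K <- length(classes)
  xbar <- rowMeans(X)                               # overall centroid
  s <- sqrt(rowSums(sapply(classes, function(k) {
    Xk <- X[, y == k, drop = FALSE]
    rowSums((Xk - rowMeans(Xk))^2)                  # within-class squared deviations
  })) / (n - K))                                    # pooled within-class sd, eq. (3.2)
  sapply(classes, function(k) {
    d <- (rowMeans(X[, y == k, drop = FALSE]) - xbar) / s  # standardized distances
    xbar + s * sign(d) * pmax(abs(d) - Delta, 0)    # soft thresholding, eqs. (3.3)-(3.4)
  })
}
```

New observations are then assigned to the class whose shrunken centroid is closest.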
