
Jessica Nordlund, Christofer L. Bäcklin, Vasilios Zachariadis, Lucia Cavelier, Johan Dahlberg, Ingegerd Öfverholm, Gisela Barbany, Ann Nordgren, Elin Övernäs, Jonas Abrahamsson, Trond Flaegstad, Mats M. Heyman, Ólafur G. Jónsson, Jukka Kanerva, Rolf Larsson, Josefine Palle, Kjeld Schmiegelow, Mats G. Gustafsson, Gudmar Lönnerholm, Erik Forestier, and Ann-Christine Syvänen.

DNA methylation-based subtype prediction in pediatric acute lymphoblastic leukemia. Accepted for publication in Clinical Epigenetics, December 2014.

CLB’s contributions: Conceived the study together with JN, MGG, GL, EF, and ACS. Analyzed the data together with JN, VZ, and JD.

4.3.1 Motivation and setup

In the clinic, the subtype of an ALL patient is an important factor for selecting the appropriate treatment. To determine the subtype, diagnostic samples are routinely analyzed with chromosome banding, reverse transcriptase PCR (RT-PCR), fluorescence in-situ hybridization (FISH), Southern blot, and array-based methods. Since the DNAm dataset presented in Paper II contained very strong patterns related to subtype, this study was launched to assess the value of adding DNAm to the repertoire of subtyping methods.

Erik Forestier, an experienced expert on ALL cytogenetics, manually revised all clinical information collected by the regional hospitals at which the original subtyping was performed, producing an updated dataset of 546 samples belonging to eight clinically well-established recurrent subtypes and 210 samples with limited or unclear subtyping information. A supervised learning procedure based on NSC was used to identify predictive DNAm patterns and to predict the subtype of the unknown samples as well as of 39 blinded samples used only for a final validation. The blinded samples were newly diagnosed and had not been previously analyzed in Paper II.

4.3.2 Modeling procedure

Classification problems with three or more classes can be addressed as a single all-vs.-all problem where all classes are modeled simultaneously, or as a series of pairwise one-vs.-one or one-vs.-rest problems. Since the all-vs.-all approach generally does not allow patients to belong to multiple subtypes (with the notable exception of neural networks), and the one-vs.-one results can be cumbersome to aggregate into a single decision, the one-vs.-rest approach was chosen. Very few ALL patients carry cytogenetic aberrations of more than one subtype, but since the subtypes vary greatly in frequency and presumably also in detection accuracy, it was found easier to interpret separate class-wise decisions rather than a combined decision. Classifiers were also designed to discriminate ALL samples from normal healthy samples and females from males, to detect possible contaminations and mix-ups.
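As a minimal sketch of what separate class-wise decisions look like in practice, the snippet below turns subtype labels into eight binary one-vs.-rest targets and reads the per-class scores back as a list of assigned subtypes. The function names, the score scale, and the 0.5 cut-off are illustrative assumptions and not part of the published procedure.

```python
import numpy as np

# The eight recurrent subtypes named in this section.
SUBTYPES = ["T-ALL", "HeH", "t(12;21)", "11q23/MLL",
            "t(1;19)", "dic(9;20)", "t(9;22)", "iAMP21"]

def one_vs_rest_targets(labels, classes=SUBTYPES):
    """Return a boolean (n_samples, n_classes) matrix: one binary problem per column."""
    labels = np.asarray(labels)
    return np.stack([labels == c for c in classes], axis=1)

def assigned_subtypes(scores, threshold=0.5, classes=SUBTYPES):
    """List the subtypes whose binary classifier fires for each sample.

    scores: (n_samples, n_classes) array of per-class scores (hypothetical
    scale in [0, 1]). Each column is judged on its own, so a sample can end
    up with zero, one, or several subtypes.
    """
    scores = np.asarray(scores)
    return [[c for c, s in zip(classes, row) if s >= threshold] for row in scores]
```

Because every column is thresholded independently, an ambiguous sample shows up as a list with two entries and an unrecognized sample as an empty list, rather than being forced into a single class.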

NSC is by nature good at dealing with noisy high-dimensional data, but to gain some extra protection against outliers and small-sample biases, multiple NSC classifiers were trained on 25 random subsets of the training set, and only CpG-sites that were selected repeatedly were allowed to be part of the final classifier. These are referred to as the consensus sites (Figure 4.6 A).
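The consensus idea can be sketched for a single one-vs.-rest problem as follows. This is a simplified stand-in for NSC written with NumPy only, not the implementation used in the study: the shrinkage threshold `delta`, the stratified subsampling, the subset fraction `frac`, and all function names are assumptions made for illustration. Both classes are assumed to contain enough samples for each subset to include at least two of each.

```python
import numpy as np

def nsc_selected_sites(X, y, delta):
    """Boolean mask of sites whose shrunken centroid offset is nonzero.

    X: (n_samples, n_sites) methylation values; y: boolean class labels.
    Simplified NSC step: standardize each class centroid's offset from the
    overall centroid and keep sites where |offset| exceeds `delta`, i.e.
    where soft-thresholding would leave a nonzero value.
    """
    X = np.asarray(X, dtype=float)
    y = np.asarray(y, dtype=bool)
    n = len(y)
    overall = X.mean(axis=0)
    # Pooled within-class standard deviation per site.
    ss = sum(((X[m] - X[m].mean(axis=0)) ** 2).sum(axis=0) for m in (y, ~y))
    s = np.sqrt(ss / (n - 2))
    s0 = np.median(s)  # offset guarding against sites with tiny variance
    survives = np.zeros(X.shape[1], dtype=bool)
    for m in (y, ~y):
        m_k = np.sqrt(1.0 / m.sum() - 1.0 / n)
        d = (X[m].mean(axis=0) - overall) / (m_k * (s + s0))
        survives |= np.abs(d) > delta
    return survives

def consensus_sites(X, y, delta, n_subsets=25, tau=17, frac=0.8, seed=0):
    """Sites selected in at least `tau` of `n_subsets` random training subsets."""
    rng = np.random.default_rng(seed)
    X = np.asarray(X, dtype=float)
    y = np.asarray(y, dtype=bool)
    pos, neg = np.flatnonzero(y), np.flatnonzero(~y)
    counts = np.zeros(X.shape[1], dtype=int)
    for _ in range(n_subsets):
        # Assumed scheme: stratified subsample so both classes stay represented.
        idx = np.concatenate([
            rng.choice(pos, size=max(2, int(frac * len(pos))), replace=False),
            rng.choice(neg, size=max(2, int(frac * len(neg))), replace=False),
        ])
        counts += nsc_selected_sites(X[idx], y[idx], delta)
    return np.flatnonzero(counts >= tau), counts
```

A site that is picked up only because of a few outlying samples tends to drop out of some subsets and therefore falls below the count threshold, which is the protection the repeated training is meant to provide.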

4.3.3 Performance estimation

Five replicates of 5-fold cross-validation were used to estimate the performance of the modeling procedure (Figure 4.6 B). To establish the set of consensus sites needed to train one complete classifier for the eight subtypes, 200 NSC models (8 subtypes × 25 subsets) had to be trained on the whole array, resulting in a total of 5000 NSC models (200 models × 25 training sets) trained on the whole array for the performance estimation. Since the size of the consensus sets was then controlled through the parameter τ, one should also add an additional level of cross-validation to account for the parameter selection. It was, however, found to be computationally infeasible to fit the 125000 NSC models required to do so, and when it was found that the choice of τ had a negligible impact on the final performance, a value of τ = 17 was instead chosen as a trade-off between consensus set redundancy and convenient size. We were aware that this did not follow our general rule of performance estimation free of information leakage, but had good reasons to believe the decision did not introduce any noteworthy biases in the estimated performance.
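The model counts quoted above follow from simple multiplication, reproduced below together with a small generator for replicated k-fold splits. Reading the 125000 figure as one further 5 × 5-fold cross-validation nested inside each training set is an assumption that happens to match the number; the text does not spell out the scheme. The generator is also unstratified, whereas Figure 4.6 describes balanced folds, so it is only a rough sketch.

```python
import numpy as np

N_SUBTYPES = 8                 # recurrent subtypes, each modeled one-vs.-rest
N_SUBSETS = 25                 # random training subsets per subtype
N_REPLICATES, N_FOLDS = 5, 5   # replicated cross-validation layout

models_per_classifier_set = N_SUBTYPES * N_SUBSETS                # 8 * 25
n_cv_splits = N_REPLICATES * N_FOLDS                              # 5 * 5
models_for_estimation = models_per_classifier_set * n_cv_splits   # 200 * 25
# Assumption: tuning tau with another 5 x 5-fold cross-validation inside
# every training set multiplies the cost by a further 25.
models_with_nested_tuning = models_for_estimation * n_cv_splits   # 5000 * 25

def repeated_kfold(n_samples, n_replicates=N_REPLICATES, n_folds=N_FOLDS, seed=0):
    """Yield (train_idx, test_idx) pairs for replicated k-fold cross-validation."""
    rng = np.random.default_rng(seed)
    for _ in range(n_replicates):
        # A fresh shuffle per replicate, cut into n_folds nearly equal parts.
        folds = np.array_split(rng.permutation(n_samples), n_folds)
        for k in range(n_folds):
            train = np.concatenate([f for i, f in enumerate(folds) if i != k])
            yield train, folds[k]
```

Calling `list(repeated_kfold(546))` gives the 25 training and test pairs for the 546 samples with established subtype.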

The classifiers achieved sensitivities > 90% and specificities of 99 ± 1% on the subtypes that were easiest to classify: T-ALL, t(12;21), 11q23/MLL, and t(1;19). The less common subtypes dic(9;20), t(9;22), and iAMP21 were more difficult to sift out, and the associated classifiers achieved sensitivities of 70–90% and specificities of 99 ± 1%. HeH displayed the unique behaviour of a high sensitivity (> 90%) but a comparatively low specificity of 95 ± 2%. This could be because it is quite common for patients in other subgroups to also carry extra copies of chromosomes, which blurs the definition of the subtype. iAMP21 samples in particular were often mistaken for HeH by the HeH classifier, but non-iAMP21 samples were never incorrectly detected by the iAMP21 classifier. Taken together, 91% of the samples were correctly assigned to a single correct subtype during the cross-validation, 3.4% were assigned to multiple subtypes including the correct one, and the remaining 5.6% were assigned either to only incorrect subtypes or to no subtype.
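For reference, the two kinds of summary used above can be computed as in the following sketch: per-classifier sensitivity and specificity for one binary one-vs.-rest problem, and the three-way outcome category for a sample given its list of predicted subtypes. The function names and category strings are illustrative and not taken from the study's code.

```python
import numpy as np

def sensitivity_specificity(y_true, y_pred):
    """Sensitivity and specificity of one binary (one-vs.-rest) classifier."""
    y_true = np.asarray(y_true, dtype=bool)
    y_pred = np.asarray(y_pred, dtype=bool)
    tp = np.sum(y_true & y_pred)     # subtype present and detected
    fn = np.sum(y_true & ~y_pred)    # subtype present but missed
    tn = np.sum(~y_true & ~y_pred)   # subtype absent and not called
    fp = np.sum(~y_true & y_pred)    # subtype absent but called
    return tp / (tp + fn), tn / (tn + fp)

def outcome_category(true_subtype, predicted_subtypes):
    """Classify one sample's combined class-wise decisions."""
    predicted = set(predicted_subtypes)
    if predicted == {true_subtype}:
        return "single correct"
    if true_subtype in predicted:
        return "multiple including correct"
    return "incorrect or none"
```

Under this scheme, an iAMP21 sample that is also flagged by the HeH classifier counts as "multiple including correct", while it lowers the specificity of the HeH classifier alone.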

Of the 39 blinded validation samples, 36 were correctly classified, and the three misclassified samples all had atypical results from the original chromosomal analyses.

Figure 4.6. (a) Classifier training procedure. Rather than modeling all subtypes simultaneously as a multi-class problem, they were modeled as a series of binary one-vs.-rest problems. In order to find the most coherent CpG-sites among many with highly similar content, a series of nearest shrunken centroid (NSC) classifiers were trained on 25 randomly selected subsets of the training data (performed separately for each subtype). Sites chosen at least τ times were defined as the subtype-specific consensus sites. All training data was then used to create subtype-specific classifiers using the consensus sites only. See Section 4.3.3 for details on the tuning of τ. (b) Performance estimation procedure. The entire dataset was randomly divided into 25 pairs of training and test sets (5 replicates of balanced 5-fold cross-validation). One set of subtype-specific classifiers was trained on each training set and its performance was evaluated using its corresponding test set. (c) Final classifier training. The entire dataset was used to train the final classifiers that were used to predict the subtypes of the samples with unknown subtype and the blinded validation samples.

4.3.4 Review of samples with unknown subtype

The 210 samples with unknown subtype comprised samples with no result (n = 18), normal karyotype (n = 87), and non-recurrent karyotype (n = 105) according to the original cytogenetic analyses. Using the established classifier, 106 of the unknown samples could be assigned to a recurrent subtype (Figure 4.7). The newly classified samples displayed DNAm profiles highly similar to those of the samples with previously established subtype and are therefore referred to as subtype-like. The subtype-like samples also followed what could be expected from the whole ALL population in terms of number of patients (Figure 4.7C), clinical features, and treatment outcome.

Three reasons were identified for why samples with relatively unambiguous DNAm profiles had not previously been assigned to any of the recurrent subtypes. Firstly, not all patients were tested for their true subtype at the time of diagnosis, since not all of NOPHO’s clinical centers have the time and ability to run all the required tests on all samples. Only 57% of the subtype-like patients of the translocation subtypes t(12;21), t(1;19), and t(9;22) and the MLL-rearranged subtype 11q23/MLL had originally been tested for their correct subtype using RT-PCR or FISH. Upon running the required RT-PCR analysis for eight randomly selected t(12;21)-like samples, the translocation could be identified in five, and RNA sequencing (RNA-seq) of 17 additional samples could identify nine more.

The second reason was that not all patients carried the canonical translocations and fusion genes of their respective subtypes. Among the samples that were RNA sequenced, the non-canonical translocations t(20;21) RUNX1/ASXL1, t(7;12) ETV6/CBX3, and t(3;12) ETV6/AK125726 were discovered among the t(12;21)-like samples instead of the expected ETV6/RUNX1. The known but not routinely analyzed translocations t(9;12) PAX5/ETV6 and inv(9p13.2) PAX5/ZCCHC7 were also detected, as were the novel translocations t(9;14) PAX5/ESRRB and t(5;15) BRD9/NUTM1. Furthermore, together with the readily available karyotyping data, the fluorescence intensity data of the 450k array was used to detect large-scale genomic gains and losses, which were suggestive of, but not always conclusive for, the assigned subtype in a large number of the subtype-like samples of HeH, iAMP21, dic(9;20), and t(1;19).

Finally, the sensitivity and accuracy of the cytogenetic analyses have improved along with the advances in molecular biology during the past decades. Some of the newly classified samples were taken as early as 1996 and could not be satisfactorily analyzed at the time. This illustrates the utility of DNAm for classification of historically obtained biobanked samples, which might be of limited amount. From just a single experiment, all known subtypes can be tested and several other attributes can be investigated, such as patient age [50], sex, tissue of origin, and cell type fractions. Since they work on DNA rather than RNA, the methylation arrays are also less sensitive to sample degradation than gene expression arrays.
