
Classifying patients' response to tumour treatment from PET/CT data: a machine learning approach



STOCKHOLM, SWEDEN 2017

Classifying patients' response to tumour treatment from PET/CT data: a machine learning approach

GIULIA BUIZZA

KTH ROYAL INSTITUTE OF TECHNOLOGY


I would like to thank the Medical Image Processing and Visualization research group at the School of Technology and Health of KTH, Royal Institute of Technology, where this work was conducted. The feedback I received helped me find the right path during the project work, and the stimulating activities that were organized made me learn much more than expected. In particular, I thank my supervisor, Dr. Chunliang Wang, for being always patient, resourceful and positive about the development of the project.

I would also like to show my gratitude to Professor Örjan Smedby for the opportunity of working in such a stimulating research group, and to Dr. Yongjun Chang for his help in designing a feature selection method.

I would like to thank Professor Marco Riboldi for his help, together with Chiara Paganelli for her patience, her many suggestions and her support from Milan.

For helping in designing my report and for making my own thoughts clearer for the rest of the world, I would like to thank Alexis, Edvin, Frans and Magnus, together with our guide, Dr. Dmitry Grishenkov.

A huge thank you must go to my family, who always supported me and discussed life-changing choices constructively: without them I would not have gone to Sweden and had this amazing time. You will always be home to me.

Finally, my friends should know how much I value their presence in my life. A great mention should go to Francesca and Laura, who tolerated my complaints and made me smile and laugh countless times during these years: I hope we will be the Trio de Janeiro for many years to come.


Early assessment of tumour response has lately attracted great interest in the medical field, given the possibility to modify treatments during their delivery. Radiomics aims to quantitatively describe images in radiology by automatically extracting a large number of image features. In this context, PET/CT (Positron Emission Tomography/Computed Tomography) images are of great interest since they encode functional and anatomical information, respectively. In order to assess patients' responses from many image features, appropriate methods should be applied. Machine learning offers different procedures that can deal with this possibly high-dimensional problem.

The main objective of this work was to develop a method to classify lung cancer patients as responding or not to chemoradiation treatment, relying on repeated PET/CT images.

Patients were divided into two groups, based on the type of chemoradiation treatment they underwent (sequential or concurrent radiation therapy with respect to chemotherapy), but image features were extracted using the same procedure. Support vector machines performed classification using features from the Radiomics field, mostly describing tumour texture, or handcrafted features, which described image intensity changes as a function of tumour depth. Classification performance was described by the area under the curve (AUC) of ROC (Receiver Operating Characteristic) curves after leave-one-out cross-validation. For sequential patients, 0.98 was the best AUC obtained, while for concurrent patients 0.93 was the best one. In terms of classification results, handcrafted features were comparable to those from Radiomics and from previous studies. Also, features from PET alone and CT alone were found to be suitable for the task, yielding performance better than random.

Keywords: treatment response, PET/CT, Radiomics, feature extraction, support vector machines


Tidig bedömning av tumörers respons på pågående behandling har under senare tid tilldragit sig stort intresse inom det medicinska området, på grund av möjligheten att anpassa behandlingen medan den pågår. Radiomik (Radiomics) syftar till att kvantitativt beskriva radiologiska bilder genom att automatiskt extrahera ett stort antal bildsärdrag.

Inom detta område är bilder från PET/CT (kombinerad positronemissionstomografi och datortomografi) av stort intresse eftersom de innehåller både funktionell och anatomisk information. Lämpliga metoder behöver användas för att bedöma patientens gensvar från många bildsärdrag. Maskininlärning erbjuder ett antal olika metoder för att lösa detta problem, som kan vara mycket högdimensionellt.

Huvudsyftet med detta arbete har varit att utveckla en metod för att bedöma huruvida en lungcancerpatient svarar eller inte svarar på en radiokemoterapibehandling genom att använda PET/CT-bilder från upprepade undersökningstillfällen.

Patienterna delades in i två grupper baserat på den typ av radiokemoterapi de fick (sekventiell eller simultan), men bildsärdragen hämtades med samma metod. Support vector machines klassificerade patienterna med hjälp av särdrag från radiomikområdet, som huvudsakligen beskrev tumörens struktur, eller med anpassade särdrag. De anpassade särdragen beskrev bildintensitetsförändringar som en funktion av tumördjupet. Klassificeringens prestation beskrevs av ytan under kurvan (AUC) i en ROC (Receiver Operating Characteristic) kurva efter en lämna-en-ute-korsvalidering. Den bästa AUC som erhölls för sekventiella patienter var 0.98, medan motsvarande siffra för simultana patienter var 0.93. Klassificeringsresultaten från de anpassade särdragen var jämförbara med de från radiomik och tidigare studier. Vidare var resultaten från PET-särdragen och CT-särdragen bättre än slumpmässig, och därmed bedöms de lämpliga för uppgiften.




Contents

Acknowledgements
Abstract
Sammanfattning
1 Introduction
2 Materials and Methods
2.1 Clinical Data
2.2 Image Registration
2.3 Image Segmentation
2.4 Image Features
2.4.1 Image Normalization
2.4.2 Radiomics Features
2.4.3 Handcrafted Features
2.5 Classification
2.5.1 Algorithm
2.5.2 Evaluation
2.6 Feature Selection
2.6.1 Forward Feature Selection
2.6.2 Feature Ranking
3 Results
3.1 Radiomics and Handcrafted Features
3.2 CT and PET Features
3.3 Performance Testing
3.4 Unbalanced Data
4 Discussions
4.1 Radiomics and Handcrafted Features
4.2 CT and PET Features
4.3 Performance Testing
4.4 SVM Settings
4.5 Limitations
4.5.1 Registration
4.5.2 Segmentation
4.5.3 Handcrafted Features
5 Conclusions
A State of the Art
A.1 Clinical Background
A.1.1 Lung Cancer
A.1.2 Prognosis and Therapies
A.2 Machine Learning
A.2.1 Overview
A.2.2 Main Issues
A.2.3 Data Quality
A.3 Data Features
A.3.1 Image Features: Radiomics
A.3.2 Feature Selection
A.4 Supervised Classification
A.4.1 Learning Algorithms
A.4.2 Performance Evaluation
B Additional Tables
B.1 AUC Variability
B.2 Confidence Intervals
Bibliography


Introduction

Lung cancer is found to be among the deadliest and most frequent cancers worldwide [1]. Diagnosis and cure are equally challenging tasks, but the recent possibility of modifying the treatment during its delivery has made it necessary to improve the assessment of tumour response. In particular, early response should be assessed within a couple of weeks so that any treatment change is more effective [2].

Biopsies are still the most reliable method to evaluate the state of any tissue, but imaging strongly helps in monitoring changes over time and space, given its lower invasiveness [3].

Lately, the field of Radiomics [4] has raised great interest in the clinical context. It refers to the extraction and quantification of features (also called predictors or descriptors) from radiological images, and it assumes that macroscopic changes in images are related to physical and biological changes [5]. In this framework, the features to be extracted and the classification methods are two of the most debated steps [6], even if many more sources of variability exist [7].

Features from PET (Positron Emission Tomography) alone [8–11], from CT (Computed Tomography) alone [12, 13] or from their combination [14–16] were found to be predictors for survival or treatment response. Prediction power, however, is not the only important characteristic for descriptors. Repeatability, reproducibility, and stability to image registration and segmentation are desirable as well [5]. In this context, the number of features that can be tested increases fast and feature selection might be necessary to obtain interpretable results. On the other hand, prediction models exploiting


several features might be able to integrate different types of information and outperform univariate models [17].

Machine learning offers methods to extract valuable information from big amounts of data, which are now becoming available in the medical field with the creation of public databases. However, it also provides algorithms that can perform reasonably well with fewer data and high dimensions [18].

The objective of this thesis was to develop a method to classify early response of lung cancer patients who underwent two different treatments, based on repeated PET/CT images.

This method included all the necessary steps from raw acquired images to performance evaluation, such as image registration, segmentation, feature extraction and classification.

Additionally, image normalization and feature selection were investigated.

A new set of features was proposed and compared to features from Radiomics literature.

Linear support vector machines as classifiers were able to reach or outperform results from previous studies, being able to use information from both PET and CT. However, the proposed features may need to be further validated and analysed.


Materials and Methods

In this chapter available data and processing methods are described. Methods are presented following the order in which they were developed.


2.1 Clinical Data

Repeated 18F-FDG PET/CT (PET using 2-deoxy-2-[18F]-fluoro-D-glucose as radiotracer) scans and treatment information were available for 32 NSCLC (non-small cell lung cancer) patients who underwent two types of treatment. Sequential chemoradiation treatment consisted of three cycles of chemotherapy drugs before radiation. Concurrent chemoradiation, instead, required one cycle of chemotherapy agents before radiation and two cycles during radiotherapy. Detailed information about treatments and ethical approval can be found in [8, 14].

PET/CT scans were acquired with a Biograph 40 scanner (Siemens Medical Solutions) while patients were lying in radiotherapy position. Acquisition and reconstruction protocols are detailed in [14]. The first scan was taken before radiotherapy for planning purposes, for both chemotherapy groups. The second scan was taken between two and three weeks after radiotherapy had started.

One patient showed bilateral tumoural masses, which were analyzed as separate cancers.

Another patient showed two big masses on the left side and the biggest one was chosen as the mass of interest. Moreover, two patients who did not undergo chemotherapy were assigned to the concurrent group. Finally, two patients were excluded because the chosen segmentation strategy was inadequate. The final dataset consisted of 31 cases (Table 2.1.1).

The endpoint of classification was the overall survival (OS) at two years after the end of the treatment, according to data availability.

Table 2.1.1 – Number of patients belonging to the concurrent and sequential chemotherapy treatment groups. Class 1 refers to patients who survived (OS=1), while class 0 refers to patients who were not alive after two years (OS=0).

Treatment OS=1 OS=0 Total

Sequential 8 7 15

Concurrent 11 5 16

All 19 12 31


2.2 Image Registration

PET and CT images from the same exam, for each patient, were automatically aligned on the scanner during image acquisition. Alignment was also visually verified (Figure 2.1).

Then, CT images from the second exam were registered to CT images from the first one, for each patient separately. The resulting transformation was applied to PET images from the second exam, so that every set of images was aligned to the first CT.

Registration was manually performed using MeVisLab [19], aiming at better aligning the lesion rather than the overall volume (Figure 2.2). Global rigid registration was chosen in order not to deform anatomical structures or cancel the effects of tumour shrinkage.

After registration, isotropic resampling was applied to avoid artifacts in subsequent processing steps. Spatial resolution was changed from (0.98 x 0.98 x 3.00) mm in CT and (4.07 x 4.07 x 3.00) mm in PET to (0.98 x 0.98 x 0.98) mm using a Lanczos kernel.

Additionally, a fine-tuning step was performed to automatically register CT images from the first and the second exam, using a rigid 3D registration driven by the mean square intensity difference [20]. A mask was used to constrain the registration algorithm to match primarily the tumour region and disregard far-away structures. The mask was obtained by enlarging the union of the segmentation masks from the first and the second scan.
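A minimal sketch of such a masked, mean-squares-driven rigid registration is given below using SimpleITK. The thesis used MeVisLab and a separate fine-tuning implementation [20], so the library choice, file names and optimizer settings here are illustrative assumptions only.

```python
# Illustrative sketch (not the thesis implementation): rigid 3D registration of the
# second CT onto the first, driven by the mean squared intensity difference and
# restricted to a mask around the tumour. File names and parameters are assumed.
import SimpleITK as sitk

fixed = sitk.ReadImage("ct_exam1.nii.gz", sitk.sitkFloat32)
moving = sitk.ReadImage("ct_exam2.nii.gz", sitk.sitkFloat32)
mask = sitk.ReadImage("tumour_mask_dilated.nii.gz", sitk.sitkUInt8)  # enlarged union of both masks

reg = sitk.ImageRegistrationMethod()
reg.SetMetricAsMeanSquares()                      # mean square intensity difference
reg.SetMetricFixedMask(mask)                      # focus the metric on the tumour region
reg.SetInterpolator(sitk.sitkLinear)
reg.SetOptimizerAsRegularStepGradientDescent(learningRate=1.0,
                                             minStep=1e-4,
                                             numberOfIterations=200)
reg.SetInitialTransform(
    sitk.CenteredTransformInitializer(fixed, moving, sitk.Euler3DTransform(),
                                      sitk.CenteredTransformInitializerFilter.GEOMETRY),
    inPlace=False)                                # rigid: rotations + translations only

transform = reg.Execute(fixed, moving)

# Apply the same transform to the second PET so that all images align with the first CT.
pet2 = sitk.ReadImage("pet_exam2.nii.gz", sitk.sitkFloat32)
pet2_aligned = sitk.Resample(pet2, fixed, transform, sitk.sitkLinear, 0.0)
sitk.WriteImage(pet2_aligned, "pet_exam2_aligned.nii.gz")
```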


Figure 2.1 – Verification of the alignment between the PET and CT scans in sagittal, coronal and axial views (clockwise, from the bottom).

Figure 2.2 – Image registration between the CT scan from the first exam (green) and the second one (red). On the left, images are shown in sagittal, coronal and axial views (clockwise, from the bottom). On the right, the command panel is displayed.


2.3 Image Segmentation

MiaLite [21] software was used to semi-automatically delineate tumours on CT (Figure 2.3) and PET images. A threshold-based 3D level-set algorithm was applied. The user was required to set lower and upper intensity thresholds, blocking regions and smoothing factor (Table 2.3.1) to constrain the region that would expand from manually placed seeds.

Table 2.3.1 – Parameters used for segmentation procedure.

Parameter           CT               PET
Lower threshold     -100 HU          40% of maximum intensity in the tumour
Upper threshold     +100 HU          maximum intensity in the tumour
Smoothing factor    less than 0.5    less than 0.2

Figure 2.3 – Results of the segmentation of a CT scan.


2.4 Image Features

In this study, a conventional set of image features from the Radiomics literature was tested at first. Then a new set of handcrafted features, which describe regional intensity changes from the tumour border to the core, was proposed and tested. All the computations described in the following paragraphs were performed in MATLAB [22].

2.4.1 Image Normalization

In order to make intensity values of both CT and PET images as independent of the acquisition procedure as possible, they were rescaled and normalized.

CT raw values (I_CT) were transformed into Hounsfield units (I_HU) using the slope (a_DICOM) and intercept (b_DICOM) parameters from the DICOM (Digital Imaging and Communications in Medicine) fields,

\[ I_{HU} = a_{DICOM} \cdot I_{CT} + b_{DICOM} \tag{2.1} \]

Additionally, for each CT scan, two ROIs (Regions Of Interest) were manually drawn to compute the mean intensities of air (I_AIR) and of the cardiac blood pool (I_HEART), respectively. They were used to estimate two parameters, a and b (Eq. 2.2), which were then applied voxel-wise as in Eq. 2.1 to normalize CT intensities.

\[ \begin{cases} -1000 = a \cdot I_{AIR} + b \\ \;\;\;\;\;50 = a \cdot I_{HEART} + b \end{cases} \tag{2.2} \]
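A NumPy sketch of this two-step normalization (DICOM rescale followed by the two-point air/blood calibration of Eq. 2.2) is shown below. The thesis computations were done in MATLAB, so the function names and example values here are illustrative assumptions.

```python
# Illustrative sketch of the CT intensity normalization (Eqs. 2.1-2.2); not the original MATLAB code.
import numpy as np

def rescale_to_hu(ct_raw, slope, intercept):
    """Eq. 2.1: apply the DICOM rescale slope/intercept voxel-wise."""
    return slope * ct_raw + intercept

def air_blood_normalization(ct_hu, air_mean, heart_mean):
    """Eq. 2.2: solve -1000 = a*I_AIR + b and 50 = a*I_HEART + b, then apply voxel-wise."""
    a = (50.0 - (-1000.0)) / (heart_mean - air_mean)
    b = -1000.0 - a * air_mean
    return a * ct_hu + b

# Toy example with assumed values (slope/intercept come from the DICOM header,
# air_mean/heart_mean from manually drawn ROIs).
ct_raw = np.random.randint(0, 4096, size=(4, 4, 4)).astype(np.float32)
ct_hu = rescale_to_hu(ct_raw, slope=1.0, intercept=-1024.0)
ct_norm = air_blood_normalization(ct_hu, air_mean=ct_hu.min(), heart_mean=ct_hu.max())
```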

PET raw values were linearly rescaled into PET intensities (I_PET), as in Eq. 2.1. They were then transformed into SUV (Standardized Uptake Value) units by applying Eq. 2.3:

\[ SUV = \frac{(a_{DICOM} \cdot I_{PET} + b_{DICOM}) \cdot \text{body weight}}{\text{decay corrected dose}} \tag{2.3} \]

where

\[ \text{decay corrected dose} = \text{injected dose} \cdot \exp\!\left( -\frac{\ln 2 \cdot \Delta t}{\text{half-life time}} \right) \tag{2.4} \]

and Δt is the time between the injection and the beginning of the decay process, and the half-life time is the radionuclide half-life.
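The conversion of Eqs. 2.3 and 2.4 can be sketched as follows; again, the original computations were in MATLAB, so the function, units and example numbers below are assumptions for illustration only.

```python
# Illustrative sketch of the SUV conversion (Eqs. 2.3-2.4); not the original MATLAB code.
import numpy as np

def to_suv(pet_raw, slope, intercept, body_weight_g, injected_dose_bq,
           delta_t_s, half_life_s):
    """Convert raw PET values to SUV using the DICOM rescale and the decay-corrected dose."""
    activity = slope * pet_raw + intercept                                             # rescale as in Eq. 2.1
    decay_corrected = injected_dose_bq * np.exp(-np.log(2) * delta_t_s / half_life_s)  # Eq. 2.4
    return activity * body_weight_g / decay_corrected                                  # Eq. 2.3

pet_raw = np.random.rand(4, 4, 4).astype(np.float32) * 1000
suv = to_suv(pet_raw, slope=1.0, intercept=0.0, body_weight_g=70_000.0,
             injected_dose_bq=3.5e8, delta_t_s=3600.0,
             half_life_s=6586.2)   # 18F half-life, about 109.8 minutes
```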

2.4.2 Radiomics Features

Radiomics features from literature [23] were computed from the segmented regions, both from PET and CT images. The same feature f was extracted from images from the first and the second exam; also, the percent response index (RI) (Eq. 2.5) was derived [24] as a separate feature (f_RI).

\[ f_{RI} = \frac{f_{exam1} - f_{exam2}}{f_{exam1}} \cdot 100 \tag{2.5} \]

A toolbox [25] was adapted to be applied to the available data. For both PET and CT, 3 global intensity-based features (variance, skewness, kurtosis), 2 geometrical features (solidity and eccentricity) and 40 textural features were computed. For CT, volume and energy [12] were added, while for PET, SUV-related features (max, peak, mean, AUC of the cumulative SUV-volume histogram) were obtained. Textural features included metrics from GLC (Gray-Level Co-occurrence), GLRL (Gray-Level Run-Length), GLSZ (Gray-Level Size Zone) and NGTD (Neighbourhood Gray-Tone Difference) matrices that described relationships among different groups of voxels. GLC matrices described how often gray-level pairs (second-order matrices) are present in a pre-defined neighbourhood, across the whole image. Then, energy, contrast, homogeneity, variance, sum average, entropy, dissimilarity and correlation measures were computed as image features. The remaining matrices described textures of higher order. GLRL matrices counted how often iso-intensity lines (runs) of the same length are found along 13 directions in 3D space. Nonuniformity, variance and emphasis were computed with respect to short or long run lengths and to high or low gray-level values. GLSZ matrices related the size of the biggest regions in a pre-defined neighbourhood to their intensity, across the whole image. Nonuniformity, variance and emphasis were again provided. Finally, an NGTD matrix accumulated the differences between the gray value of a voxel and the mean value of all voxels in a local area, for all gray levels. Coarseness, contrast, busyness, complexity and strength were the output features.

In total, each patient was described by 290 values.
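As an illustration of how second-order texture descriptors of this kind can be derived, the sketch below builds a simple 2D grey-level co-occurrence matrix with NumPy and computes contrast, homogeneity and energy from it. The thesis used an adapted 3D MATLAB toolbox [25], so this single-offset, single-slice version is only a simplified stand-in.

```python
# Simplified GLCM example (single offset, 2D slice); the thesis used a full 3D Radiomics toolbox.
import numpy as np

def glcm(image, levels=8, offset=(0, 1)):
    """Count how often grey-level pairs occur at the given pixel offset, then normalize."""
    img = np.digitize(image, np.linspace(image.min(), image.max(), levels + 1)[1:-1])
    mat = np.zeros((levels, levels), dtype=np.float64)
    dr, dc = offset
    rows, cols = img.shape
    for r in range(max(0, -dr), min(rows, rows - dr)):
        for c in range(max(0, -dc), min(cols, cols - dc)):
            mat[img[r, c], img[r + dr, c + dc]] += 1
    return mat / mat.sum()

def glcm_features(p):
    i, j = np.indices(p.shape)
    return {
        "contrast": np.sum(p * (i - j) ** 2),
        "homogeneity": np.sum(p / (1.0 + np.abs(i - j))),
        "energy": np.sum(p ** 2),
    }

slice_2d = np.random.rand(32, 32)
print(glcm_features(glcm(slice_2d)))
```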


2.4.3 Handcrafted Features

A new set of features was designed in order to describe intensity changes at different depths in the tumour.

After intensity normalization, a normalized distance map was obtained. For each patient, the segmentation masks from both exams were combined and the 3D Euclidean distance from the border of the unified tumour mask was computed voxel-wise (Figure 2.4). These values were then normalized by the maximum distance, to make them comparable across different tumour sizes. Finally, distances were binned into ten intervals and the difference of mean intensities between the two exams was computed for each distance bin and for each image set (Figure 2.5).

Thus, each patient was described by ten intensity differences for CT and ten for PET.
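A NumPy/SciPy sketch of this computation is given below; the actual features were computed in MATLAB, so the function and variable names here are assumptions used only to illustrate the idea.

```python
# Illustrative sketch of the handcrafted depth features; not the original MATLAB implementation.
import numpy as np
from scipy.ndimage import distance_transform_edt

def depth_features(img_exam1, img_exam2, mask, n_bins=10):
    """Mean-intensity difference between the two exams in normalized-depth bins.

    mask: union of the segmentation masks from both exams (boolean array).
    Returns one value per bin: mean(exam1) - mean(exam2) at that depth.
    """
    # Euclidean distance from the tumour border, computed inside the mask only,
    # then normalized by its maximum so different tumour sizes become comparable.
    dist = distance_transform_edt(mask)
    dist = dist / dist.max()

    edges = np.linspace(0.0, 1.0, n_bins + 1)
    feats = []
    for lo, hi in zip(edges[:-1], edges[1:]):
        sel = mask & (dist > lo) & (dist <= hi)
        if sel.any():
            feats.append(img_exam1[sel].mean() - img_exam2[sel].mean())
        else:
            feats.append(0.0)   # empty bin; a real implementation might handle this differently
    return np.array(feats)

# Toy example on random volumes; in the study this would be run on CT and PET separately,
# giving 10 + 10 features per patient.
mask = np.zeros((40, 40, 40), dtype=bool)
mask[10:30, 10:30, 10:30] = True
f_ct = depth_features(np.random.rand(40, 40, 40), np.random.rand(40, 40, 40), mask)
```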

Figure 2.4 – Superimposition of distance bins (red for [0.2-0.3] and green for [0.5-0.6]) on the central slice of the tumour in the first CT (a), the second CT (b), the first PET (c) and the second PET (d). For CT, the displayed area refers to the mask used for computations, but the volume reported in ml refers to the single segmentation masks. Patient 3 is shown as an example.


Figure 2.5 – In a), mean intensities from each CT as a function of the normalized distance are reported. In b), the differences between the two are shown. In c) and d) the same is applied to PET. The final set of features is the concatenation of the features in b) and d). Features for patient 3, a survivor from the concurrent group, are shown as an example.


2.5 Classification

2.5.1 Algorithm

A margin-maximizing Support Vector Machine (SVM) was chosen as the classification algorithm.

It can be described as a solver of a convex constrained quadratic optimization problem using Lagrange multipliers α_t [26]. This means maximizing

\[ L_p = -\frac{1}{2}\, w^T w + \sum_t \alpha_t, \qquad \text{where } w = \sum_t \alpha_t c_t x_t, \tag{2.6} \]

with respect to α_t, under the constraints that

\[ \sum_t \alpha_t c_t = 0 \quad \text{and} \quad \alpha_t \ge 0, \tag{2.7} \]

where c_t is the class of the corresponding sample x_t (the t-th sample) and w are the parameters of the model. Classification of unseen data would be performed by testing the output of

\[ c_t (w^T x_t + w_0) \ge 1, \tag{2.8} \]

where w_0 is a constant parameter of the model, previously estimated, and c = ±1.

Given the small sample size and the potentially high number of features, SVM had greater generalization abilities compared to other learning methods. Two parameters were varied: kernel type and data standardization. Linear and RBF (Radial Basis Function) kernels were tested. Data standardization is generally suggested, so the impact of this pre-processing step was tested as well by creating distributions with zero mean and unitary standard deviation. Also, in case of unbalanced data (when one class is represented by a higher number of samples than the other class), ADASYN rebalancing was performed [27].
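A scikit-learn sketch of this classification setup (linear or RBF kernel, optional standardization, optional ADASYN rebalancing) is shown below; the thesis work was done in MATLAB, so the libraries and parameter choices here are assumptions used only for illustration.

```python
# Illustrative scikit-learn setup; the thesis used MATLAB implementations of SVM and ADASYN.
import numpy as np
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

def build_svm(kernel="linear", standardize=True):
    """Linear or RBF SVM, optionally preceded by z-score standardization."""
    svm = SVC(kernel=kernel, C=1.0, probability=True)
    return make_pipeline(StandardScaler(), svm) if standardize else make_pipeline(svm)

# Toy data: 31 patients, 20 handcrafted features, binary overall-survival label.
rng = np.random.default_rng(0)
X = rng.normal(size=(31, 20))
y = rng.integers(0, 2, size=31)

clf = build_svm(kernel="linear", standardize=True).fit(X, y)

# Optional rebalancing of the minority class (requires the imbalanced-learn package):
# from imblearn.over_sampling import ADASYN
# X_res, y_res = ADASYN().fit_resample(X, y)
```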


2.5.2 Evaluation

To compare results from different procedures, an appropriate evaluation framework was set. Leave-one-out (LOO) cross-validation was performed to obtain a more generalizable and unbiased evaluation of the classifier performance, with respect to bootstrap [28]. This method is a type of n-fold cross-validation where the number of folds equals the number of samples N. It exploits N − 1 samples for training and tests the estimated model on the Nth sample, producing a single prediction estimate. The process is repeated N times so that each sample belongs to the test set once. The same procedure applies to n-fold CV, but the patients are divided into n groups and, at each run, n − 1 groups are used for training and the remaining group for testing. The procedure is repeated n times.

Classification performance was defined by the AUC (Area Under the Curve) of a ROC (Receiver Operating Characteristic) curve. In order to build ROC curves from the SVMs' binary classification output, a sigmoidal score function was fit [29]. The optimal threshold was found as the point of intersection between the ROC curve and the highest iso-performance line tangent to it [30]. For this study the slope of the line was given by the ratio between the number of negative and positive samples. Each AUC value was evaluated to be different from random at the 95% confidence level if the pointwise confidence intervals did not include 0.50. Such values are marked with * in Chapter 3.
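The evaluation loop can be sketched as follows with scikit-learn: leave-one-out predictions are pooled, a ROC curve and its AUC are computed, and the operating point is chosen where the ROC curve touches the highest iso-performance line with slope N_neg/N_pos. This is an illustrative reimplementation, not the original MATLAB code, and the confidence-interval computation is omitted.

```python
# Illustrative leave-one-out evaluation with ROC/AUC and iso-performance threshold selection.
import numpy as np
from sklearn.model_selection import LeaveOneOut
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC
from sklearn.metrics import roc_curve, roc_auc_score

rng = np.random.default_rng(0)
X = rng.normal(size=(31, 20))
y = rng.integers(0, 2, size=31)

scores = np.zeros_like(y, dtype=float)
for train, test in LeaveOneOut().split(X):
    clf = make_pipeline(StandardScaler(),
                        SVC(kernel="linear", probability=True))  # sigmoid (Platt) scores
    clf.fit(X[train], y[train])
    scores[test] = clf.predict_proba(X[test])[:, 1]

auc = roc_auc_score(y, scores)
fpr, tpr, thresholds = roc_curve(y, scores)

# Iso-performance line with slope = N_neg / N_pos: pick the ROC point maximizing
# tpr - slope * fpr (the point where the highest such line touches the curve).
slope = np.sum(y == 0) / np.sum(y == 1)
best = np.argmax(tpr - slope * fpr)
print(f"AUC = {auc:.2f}, chosen threshold = {thresholds[best]:.3f}")
```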


2.6 Feature Selection

Given the small sample size with respect to the large number of features, and the high correlation among features, most learning algorithms tend to overfit the training data and lose accuracy on the testing data. Feature selection was therefore applied to improve the overall classification accuracy by limiting the number of features to a few of the most important ones. In this study, two feature selection techniques were tested: forward feature selection and feature ranking.

2.6.1 Forward Feature Selection

The implemented forward feature selection strategy (FFS) performed a fine search among candidate features. It iteratively selected features that improved the accuracy of a cross-validated SVM. At the first iteration each feature was tested individually; then features were combined only with the ones chosen before and, if accuracy improved, a new subset was created. The process stopped when no improvement was detected. Given the random choice of a feature among those that led to the same improvement, the chosen feature set changed at every run. To limit this variability, the best feature set was chosen by applying FFS several times and then selecting those features that were most frequently chosen.
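A compact sketch of this greedy procedure is given below. The stopping rule and the inner cross-validated accuracy mirror the description above, but details such as the tie-breaking and the repetition over several runs are simplified, and the thesis implementation was in MATLAB rather than Python.

```python
# Illustrative greedy forward feature selection wrapped around a cross-validated SVM.
import numpy as np
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

def forward_feature_selection(X, y, cv=5):
    clf = make_pipeline(StandardScaler(), SVC(kernel="linear"))
    selected, best_acc = [], 0.0
    while True:
        candidates = [j for j in range(X.shape[1]) if j not in selected]
        accs = [cross_val_score(clf, X[:, selected + [j]], y, cv=cv).mean()
                for j in candidates]
        if not candidates or max(accs) <= best_acc:
            break                       # stop when no candidate improves the accuracy
        best_acc = max(accs)
        # Ties are broken by taking the first best candidate here; the thesis picked one at
        # random and repeated the whole procedure, keeping the most frequently chosen features.
        selected.append(candidates[int(np.argmax(accs))])
    return selected, best_acc

rng = np.random.default_rng(0)
X, y = rng.normal(size=(31, 30)), rng.integers(0, 2, size=31)
features, acc = forward_feature_selection(X, y)
```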

2.6.2 Feature Ranking

The second method (PCA-ranking) made use of Principal Component Analysis (PCA) to rank features [31], based on their contribution to the first N principal components (PCs).

PCA is used as a ranking method in order not to lose the interpretability of the result, which would happen if PCs were used as new features.

The score for the ith feature was defined as the sum, over the first N PCs, of the absolute loading of that feature on each PC, weighted by the corresponding eigenvalue (λ), as in Eq. 2.9:

\[ \text{score}_i = \sum_{n=1}^{N} \lambda_n \cdot \lvert \text{loading}_{n,i} \rvert \tag{2.9} \]

Arbitrary thresholds were set to define the number of PCs and the number of ranked features that should be considered. In order to generalize better, this procedure was repeated and the most frequent features were selected.
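The scoring rule of Eq. 2.9 can be sketched as follows; the numbers of retained components and features are the arbitrary thresholds mentioned above, and the implementation is an illustrative NumPy/scikit-learn version rather than the original MATLAB one.

```python
# Illustrative PCA-based feature ranking (Eq. 2.9): score_i = sum_n lambda_n * |loading_{n,i}|.
import numpy as np
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

def pca_ranking(X, n_components=5, n_keep=10):
    Xs = StandardScaler().fit_transform(X)
    pca = PCA(n_components=n_components).fit(Xs)
    loadings = pca.components_               # shape (n_components, n_features)
    eigenvalues = pca.explained_variance_    # lambda_n
    scores = (eigenvalues[:, None] * np.abs(loadings)).sum(axis=0)
    return np.argsort(scores)[::-1][:n_keep] # indices of the highest-scoring features

rng = np.random.default_rng(0)
X = rng.normal(size=(31, 290))
top_features = pca_ranking(X, n_components=5, n_keep=10)
```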


Results

In this chapter results from linear SVM classification under different conditions are shown.

Leave-one-out cross-validation and 10-fold cross-validation were applied in order to test the stability of classification with respect to changes in the training set.


3.1 Radiomics and Handcrafted Features

Classification results using Radiomics and Handcrafted features (Figure 3.3) were compared. In general, linear SVM performed better when using Handcrafted features than Radiomics ones (Table 3.1.1).

Table 3.1.1 – Mean AUC values for classification on different groups of patients, using Radiomics (Rad) and Handcrafted features as a whole (HandTOT), for CT only (HandCT) and for PET only (HandPET).

AUC - no feature selection

Group        Folds     Rad    HandTOT   HandCT   HandPET
Sequential   LOO       0.59   0.91*     0.80     0.89*
Sequential   10-fold   0.61   0.92*     0.84     0.90*
Concurrent   LOO       0.64   0.76      0.73     0.84
Concurrent   10-fold   0.63   0.79      0.67     0.84
All          LOO       0.54   0.89*     0.61     0.89*
All          10-fold   0.59   0.86*     0.61     0.89*

When an appropriate feature selection was applied (Table 3.1.2), results improved and Radiomics features could outperform Handcrafted ones for concurrent patients. Feature selection was implemented at a later stage, as a method to improve classification, mostly because of the many significant correlations found within Radiomics (Figure 3.1) and Handcrafted (Figure 3.2) features.

The best results were obtained when applying FFS. For sequential patients, the AUC reached 0.71 and 0.98* when using Radiomics and Handcrafted features, respectively. For concurrent patients, the AUC values reached 0.93 and 0.87*, respectively.


Figure 3.1 – Spearman's correlation heat maps for Radiomics features for sequential (left) and concurrent (right) patients. In the upper images, significant correlation pairs (95% confidence) are marked in red. The last column and row of each map show correlations between features and OS.


Figure 3.2 – Spearman's correlation heat maps of Handcrafted features from both CT (from 1 to 10) and PET (from 11 to 20), for sequential (left) and concurrent (right) patients. The last column and row show correlations between features and OS. Significant correlations at 95% confidence are marked in red.

Table 3.1.2 – Mean AUC values for classification on different groups of patients, using selected Radiomics (Rad) and Handcrafted features as a whole (HandTOT), for CT only (HandCT) and for PET only (HandPET).

AUC - FFS

Group        Folds     Rad     HandTOT   HandCT   HandPET
Sequential   LOO       0.71    0.98*     0.91*    0.98*
Sequential   10-fold   0.72    0.98*     0.92*    0.97*
Concurrent   LOO       0.93    0.87*     0.65     0.80
Concurrent   10-fold   0.86    0.90*     0.66     0.81
All          LOO       0.95*   0.93*     0.66     0.91*
All          10-fold   0.92*   0.92*     0.60     0.90*

AUC - PCA-ranking

Group        Folds     Rad     HandTOT   HandCT   HandPET
Sequential   LOO       0.71    0.91*     0.77     0.96*
Sequential   10-fold   0.73    0.90*     0.80     0.96*
Concurrent   LOO       0.60    0.67      0.65     0.93*
Concurrent   10-fold   0.66    0.67      0.69     0.92*
All          LOO       0.67    0.86*     0.65     0.90*


Figure 3.3 – Representation of CT and PET Handcrafted features for sequential (a and c) and concurrent (b and d) patients. Survived patients are shown in blue and non-survived ones in red.


3.2 CT and PET Features

The contribution of CT and PET was investigated by isolating features related to either imaging modality. From Tables 3.2.1 and 3.2.2, CT features appeared to be less powerful than PET features. However, they mostly performed better than random, and HandCT improved the PET-based prediction for sequential patients, for concurrent ones (Table 3.1.1) and for the whole dataset (FFS in Table 3.1.2). Results for HandCT and HandPET might differ from those in Section 3.1 due to differences in the selected features or in data splitting. In Appendix B.1 the variability of the mean AUC values is reported to account for these discrepancies.

Table 3.2.1 – Mean AUC values for classification on different groups of patients, using CT and PET information separately for both Radiomic (RadCT, RadPET) and Handcrafted (HandCT, HandPET) features.

AUC - no feature selection

Group        Folds     RadCT   RadPET   HandCT   HandPET
Sequential   LOO       0.73    0.73     0.80     0.89*
Sequential   10-fold   0.73    0.70     0.82     0.90*
Concurrent   LOO       0.76    0.53     0.73     0.84
Concurrent   10-fold   0.71    0.50     0.71     0.84


Table 3.2.2 – Mean AUC values for classification on different groups of patients, using selected CT and PET information separately for both Radiomic (RadCT, RadPET) and Handcrafted (HandCT, HandPET) features.

AUC - FFS

Group        Folds     RadCT   RadPET   HandCT   HandPET
Sequential   LOO       0.89*   0.86     0.89*    0.98*
Sequential   10-fold   0.89*   0.86     0.91*    0.98*
Concurrent   LOO       0.89    0.69     0.82     0.87*
Concurrent   10-fold   0.77    0.66     0.78     0.86

AUC - PCA-ranking

Group        Folds     RadCT   RadPET   HandCT   HandPET
Sequential   LOO       0.82    0.86     0.34     0.96*
Sequential   10-fold   0.81    0.86     0.36     0.97*
Concurrent   LOO       0.60    0.73     0.65     0.93*
Concurrent   10-fold   0.65    0.51     0.69     0.92*


3.3 Performance Testing

In order to compare the performance of the proposed features with predictors from a previous work [8], a subset of subjects (26 out of 31) was used. In this way, SVM classification was performed on the exact same dataset, described by different features.

The mean effective radiosensitivity (ᾱ_eff) was given as input feature to the SVM either alone or together with the negative fraction (Both). In Table 3.3.1 these classification results are compared to the ones from Radiomics (Rad) and Handcrafted features (HandTOT, HandCT, HandPET). When applying feature selection before linear SVM classification, the newly proposed features outperformed the reported features for sequential patients (0.98* against 0.88) and reached comparable AUC values for concurrent ones (0.94*).


Table 3.3.1 – Mean AUC values on the subset of 26 patients, using the features from [8] (ᾱ_eff alone, or together with the negative fraction: Both) and the Handcrafted features (HandTOT, HandCT, HandPET) without feature selection (no FS), with FFS and with PCA-ranking (PCA).

Group / Folds           Both    ᾱ_eff   HandTOT (no FS, FFS, PCA)   HandCT (no FS, FFS, PCA)   HandPET (no FS, FFS, PCA)
Sequential, LOO         0.88    0.86    0.88, 0.86, 0.93*           0.71, 0.81, 0.81           0.86, 0.98*, 0.95*
Sequential, 10-fold     0.85    0.86    0.89, 0.88, 0.92*           0.73, 0.81, 0.81           0.85, 0.98*, 0.94*
Concurrent, LOO         0.94*   0.94*   0.84, 0.88, 0.84            0.66, 0.47, 0.53           0.88, 0.91, 0.94*
Concurrent, 10-fold     0.92    0.93*   0.84, 0.88, 0.86            0.64, 0.51, 0.46           0.88, 0.91, 0.93*
All, LOO                0.91*   0.91*   0.93*, 0.97*, 0.94*         0.69, 0.63, 0.82*          0.91*, 0.94*, 0.94*
All, 10-fold            0.90*   0.91*   0.94*, 0.96*, 0.95*         0.70, 0.67, 0.81*          0.92*, 0.94*, 0.94*


3.4 Unbalanced Data

By applying data rebalancing [27] to the concurrent patients' group, results did not show agreement across different cases (Table 3.4.1). For example, this strategy improved classification when Radiomics features without any feature selection were used, but not when FFS was applied. The dataset showed an imbalance ratio of around 2:1 (survived to dead).

Table 3.4.1 – Mean AUC values for classification on concurrent patients with and without rebalancing. Mean AUCs are computed over 5 repetitions, where the rebalanced data are newly generated each time.

                           No FS              FFS                PCA-ranking
Group            Folds     Rad    HandTOT     Rad    HandTOT     Rad    HandTOT
Without ADASYN   LOO       0.64   0.76        0.93   0.87*       0.84   0.76
Without ADASYN   10-fold   0.64   0.81        0.84   0.89*       0.79   0.78
With ADASYN      LOO       0.97*  0.75        0.63   0.66        0.73   0.76
With ADASYN      10-fold   0.93*  0.79        0.65   0.67        0.72   0.59


Discussions

In this chapter the results from the previous chapter are analysed, the choices of some SVM parameters are explained and the main limitations of this work are highlighted.


4.1 Radiomics and Handcrafted Features

In the field of Radiomics, features can be provided through two main approaches. Feature discovery consists of selecting the most relevant features from a large pool, while a candidate-feature approach generates only relevant ones [32].

In this study HandTOT features consistently outperformed Rad ones both when no feature selection was applied (candidate feature, see Table 3.1.1) and when feature selection was introduced (feature discovery, Table 3.1.2). By comparing these cases, PCA-ranking did not seem to improve performance much in general, while FFS improved the classification of both sequential (all cases) and concurrent patients (Rad and HandTOT). However, PCA-ranking was more reproducible than FFS, because FFS randomly chooses a feature when two or more produce the same classification accuracy. This can be positive in exploratory studies, where feature relevance is under investigation, but not in the clinic. So, other criteria should be used instead of random choice, for example AUC maximization. Also, the usefulness of FFS in combination with classifiers should be tested, since it involves the optimization of an SVM, the same method used for classification. On the other hand, PCA-ranking needed only a few parameters to be defined, and their optimization should be further improved.

The main motivation for designing Handcrafted features was that blood supply patterns influence the local oxygenation level and, in turn, the efficacy of radiation therapy. Blood vessels are irregularly distributed in fast-growing tissues such as tumours and can generate, especially in the central part, necrotic regions where the efficacy of radiation is strongly diminished. PET is able to capture blood supply patterns, as the radiotracer uptake of a metabolically active region depends on the ability of the tracer to reach it. HandPET features detect uptake changes as a function of depth, possibly describing changes in those patterns. Likewise, textural features from Radiomics might account for local blood supply inhomogeneities. In CT, attenuation coefficients could change due to tissue inflammation or volume modifications, but texture could also be of interest (see Section 4.2).

In general, classification results and selected features varied for different datasets (by excluding a few patients, for example) and for different tests (features and patient groups).

This may be explained by the high correlations within both Radiomics (Figure 3.1) and Handcrafted (Figure 3.2) features.


From these correlation maps and from Figure 3.3, features for sequential and concurrent patients seemed to have different behaviours. However, SVM classification results for the whole dataset (All) were comparable to those for the separate groups (Table 3.1.1 and Table 3.1.2). This might relate to the larger patient number rather than to strictly clinical reasons, but it could be an option worth considering, especially when few patients are available, for example in small facilities. Further investigation is needed in any case.

4.2 CT and PET Features

RECIST (Response Evaluation Criteria in Solid Tumours) guidelines suggested the number of lesions (invasiveness) and the measurement of the longest axis of the tumour as predictors of response to treatment. PERCIST (Positron Emission Response Criteria in Solid Tumours) guidelines suggested that SUV-based metrics would improve anatomy-based prediction, given the higher biological predictive power of PET [33]. However, in previous studies on lung cancer, the prediction power of both CT and PET features has been investigated and found to perform better than current guidelines [4]. In [13] the AUC of a single-CT heterogeneity feature was 0.60, at a medium spatial scale. In a more recent work, selected single-CT features reached a concordance index of 0.65 on lung cancer patients who underwent either radiation or concurrent chemoradiation therapy. Energy (describing tumour density), shape compactness, grey-level nonuniformity and wavelet grey-level nonuniformity were the employed descriptors [12]. Customized repeated-PET features accounting for dose information reached a significant AUC value of 0.89 [8], while selected features from the Radiomics literature, namely contrast and coarseness, reached 0.82 and 0.80 respectively [10]. Few studies evaluated combined CT and PET features, and they were suggested to improve classification [15]. Likewise, repeated exams were shown to outperform predictions based on clinical (age and tumour stage), geometrical (maximum diameter) and SUV (maximum) features from a single time point [24], going from 0.69 to 0.86 in AUC values.

In this work, there was no agreement on whether CT or PET features performed better, for either Radiomics or Handcrafted features (Table 3.2.1). RadCT were outperformed by RadPET only in one case (PCA-ranking) and the overall best result was obtained


with FFS, both for sequential (0.89*) and concurrent (0.89) patients. On the other hand, HandPET features always gave better results than HandCT ones but, for sequential patients, adding CT to PET features raised the AUC value from 0.89* and 0.80 to 0.91* (Table 3.1.1). This could be understood by noticing that HandCT features showed less overlap between dead and survived patients for sequential than for concurrent patients (Figure 3.3). Also, for sequential patients, HandCT seemed to differentiate more in the peripheral region (at lower distance values) while HandPET differentiated more in a deeper region (towards unity distance). This suggested that they carried complementary information about tumour changes, which seemed to agree with the correlation pattern. HandCT features closer to the border (first rows and columns) and HandPET features closer to the core (first rows, last columns) were the least correlated pairs (Figure 3.2). This might imply that HandCT features better detect changes in size, occurring at the border, while HandPET features better describe deep metabolic activity, these two pieces of information being complementary. AUC values from CT features were better than random (AUC=0.5) in all cases except when PCA-ranking was applied to sequential patients. So, CT features may be used to improve PET performance, or even alone when PET is not clinically available, since they seemed to add meaningful information.

In general, HandPET showed the best performance (0.98* with FFS, sequential) in almost every case except when applying FFS to the concurrent group (0.87* against 0.89 from RadCT). This agreed with results from the literature, where PET was found to be useful for classification of early tumour response [34]. Moreover, HandPET features visually showed less overlap than the corresponding HandCT ones (Figure 3.3).

Finally, when evaluating these results, feature stability should be considered. PET is known to depend on many confounding factors related to the acquisition process [35], while attenuation values vary less. Also, the classifier type influences the classification accuracy: SVM was chosen a priori as the most reasonable classifier, but others might perform better, relying on different features.


4.3 Performance Testing

From Table 3.3.1 it can be observed that the prediction power of the Handcrafted features was comparable to that of descriptors from the literature. In particular, Handcrafted features performed better for sequential patients (0.98* against 0.88) and similarly for concurrent patients (0.94*), with feature selection. PET features appeared to be quite important (0.98*) for discriminating between responding and non-responding groups, as previously reported.

4.4 SVM Settings

Linear SVMs were used to obtain the results in Chapter 3, but RBF-SVMs were also tested and some general observations could be made. Only on the complete dataset (All) was the number of support vectors for RBF-SVM less than the number of training samples, and the difference was only a few samples. However, when applying RBF-SVM to sequential patients described by HandCT features the AUC reached 1.00. So, if more data become available, non-linear SVMs may be considered, since a larger dataset would provide a more robust estimate of the RBF kernel parameters. This model selection procedure is not theoretically guaranteed to avoid overfitting, while the resistance of SVMs towards overfitting can be theoretically proved with regularization theory.

Interestingly, linear SVM found a varying number of support vectors, describing how hard the separation problem was. The classification cost was set to be equal for the two classes because of the clinical context, for which the decisions of keeping or modifying a treatment were equally important. Data standardization was applied both because in some cases the solver failed without it, probably due to high data sparsity, and to be consistent within this study and with the literature.

In general, LOO and 10-fold cross-validation produced coherent results for the same problem: this was considered to be a positive behaviour regarding generalization.


4.5 Limitations

The first limitation of this work was the sample size, considering the context of machine learning and the high variability of results when using different sample sets. For this reason, lower importance was given to the statistical significance of the results. However, some further improvements along the processing chain could be made.

4.5.1 Registration

Qualitative observations of the impact of the registration error on Handcrafted features were carried out by artificially translating one image in each spatial direction. The impact on the AUC was not assessed, but registration errors are expected to influence it. This was also the motivation for adding automatic registration as a fine-tuning step.

A quantitative and systematic study of the impact of this error is needed. Additionally, registration could be optimized for CT and PET images separately or by using information from both modalities.

4.5.2 Segmentation

Some patients were excluded from the analysis because tumour delineation was too difficult to perform and expert physicians were not available for the task. In particular, the level-set-based approach could not distinguish between clamped lung tissue or inflammation and tumour masses. Further studies are needed to evaluate the impact of segmentation errors and, possibly, the use of enlarged masks that also include the surroundings of the tumour mass. Planning contours could also be interesting in the clinic, since they might simplify treatment routines.

4.5.3 Handcrafted Features

The clinical meaning of the Handcrafted features has not been assessed yet, but it is expected to relate better to local or regional tumour control probability than to the overall survival at 2 years. Also, Handcrafted descriptors should be tested for repeatability.


The number of distance bins, the computation of the mean intensity and the voxel-to-voxel difference were all reasonable choices that could be further investigated. For example, distance bins could be varied depending on tumour size, but this would add complexity to the classification. Likewise, CT images could be smoothed so that changes at coarser scales are evaluated, and shape analyses might better describe the separation between classes.


Conclusions

A framework for the classification of patients' response to treatment has been set up. Results were obtained for two patient groups, evaluating two imaging modalities (PET and CT) and two classes of features (Radiomics and Handcrafted) by means of different methods from machine learning (SVM classifier, feature selection, data rebalancing).

Among the different remarks, a new set of features was suggested, and its results were found to be comparable to, or to outperform, metrics from the literature. Tumour descriptors from CT were found to be helpful for patients who underwent radiotherapy and sequential chemotherapy, while PET was confirmed to be highly informative for early treatment assessment in general.

The influence of feature selection algorithms was taken into account by comparing two different methods and no feature selection procedure, while the SVM classifier was found to be a powerful tool when high-dimensional data are involved.

However, the number of cases is quite low in the context of machine learning, and the results must be considered carefully until a larger dataset is evaluated. Also, OS at 2 years might not be the best endpoint for classification, while local tumour control could be considered. The newly suggested set of features needs a deeper investigation of its reproducibility and robustness, but the preliminary results are encouraging.


State of the Art


A.1 Clinical Background

A.1.1 Lung Cancer

Lung cancer was the primary cause of tumour-related deaths in the world and the most frequent tumour by incidence in 2012. It affected millions of people (Figure I) and showed a mortality rate of 0.87 [1]. Its incidence follows smoking trends, and it was estimated to cause 158 080 deaths in the US in 2016, accounting for 26.5% of all cancer-related deaths [36].

Figure I – Number of cancer-related deaths in the world in 2012 [37].

From a medical point of view, lung cancers can be grouped into two main classes: small cell lung cancer and non-small cell lung cancer (NSCLC). The latter is the most common form, occurring in 85% of lung cancer cases, and it can be further divided into squamous cell carcinoma, large cell carcinoma, adenocarcinoma and other minor types [38].

A.1.2 Prognosis and Therapies

Based on tumour characteristics, medical doctors should choose the best treatment option among surgery, chemotherapy, radiation therapy, targeted therapies or a combination of them [39]. Radiotherapy and chemotherapy are suggested for inoperable lesions and locally advanced cancers (stage III). These therapies are mostly delivered in fractions given at different time points, and new technologies allow for adaptation of the treatment during its delivery. If the patient is responding to the treatment, the attention can be focused on reducing normal tissue toxicity, while for non-responding patients the dose can be increased and the tumour control probability enhanced [2].

In this context, two key elements are the assessment of the effectiveness of the treatment and the development of decision support systems.

Guidelines, such as RECIST (Response Evaluation Criteria in Solid Tumours), EORTC (European Organization for Research and Treatment of Cancer) and PERCIST (Positron Emission Response Criteria in Solid Tumours), suggest that many factors could be prognostic, among which patient risk profile, histology [40], genetic mutations, overall health conditions, tumour metabolism and TNM (gross Tumour, lymph Nodes, Metastases) tumour staging [41]. Imaging exams provide a non-invasive method to repeatedly sample the lesion in space and time, thus capturing its heterogeneity [42], which is related to worse outcomes. In particular, PET/CT (Positron Emission Tomography/Computed Tomography) scans are recommended for accurate cancer staging and early prediction of treatment outcome [33]. Low-dose CT is employed to compute anatomical response metrics, such as volume and shape of the lesion, while PET is commonly used to assess the metabolic response. 18F-FDG (2-deoxy-2-[18F]-fluoro-D-glucose) is nowadays the most common radiotracer in PET imaging, as it highlights cells with high glucose consumption, which is often the case for tumoural cells. Since PET is able to detect smaller active nodules than CT, because metabolic changes precede anatomical ones, it is suitable to assess early tumour response [11] and to improve survival [43]. However, PET's spatial resolution is worse than CT's, and PET alone could miss small lesions or lesions with small radiotracer uptake [33]. Moreover, PET quantification is hindered by many confounding factors, such as acquisition mode, radiotracer injection modality, reconstruction algorithm, smoothing filter, image noise and partial volume effect [35]. All these factors should be taken into account when evaluating image features coming from this imaging modality.

Once all the parameters are defined and the data gathered, medical doctors must choose among possible treatments or treatment modifications. However, the number of variables is high and a single person may not be able to take all of them into account. This is why, lately, machine learning strategies have been used as decision support tools in the clinical context [44].


A.2 Machine Learning

A.2.1 Overview

Machine learning is a branch of artificial intelligence, as it produces 'intelligent' models able to learn. It is applied to data mining problems, where big amounts of data are efficiently processed to obtain simpler and more valuable information [26]. In particular, it addresses the problem of making a model learn from data samples [45], thus reducing or eliminating the need for a priori knowledge of the relations between input variables and output. This black-box approach is useful when deterministic models are unable to fully describe a complex problem or when the problem under investigation is not fully understood, e.g. when predicting the tumour response to a treatment [17]. At the same time, an appropriate algorithm could include known information about the input data so that all available knowledge is exploited.

A typical problem in this field is classification. Its goal is to provide models (also called learners or classifiers) that can predict the class of a new sample based on the attributes (also called variables or features) describing that sample [18].

The learning procedure includes three phases, namely training, validation and testing, for each of which different subsets of data are used. During the training phase, the parameters of the learning algorithm are defined following different procedures, depending on whether the true class (label) of each sample is known or not. Unsupervised learning is used when labels are not known, while supervised learning makes use of a labelled training dataset. For example, a doctor may define a few patients (the samples) as cured or not (the label) based on imaging data (the features). Supervised learning would train a model by using both features and labels. Unsupervised learning would make use of features only and could precede the doctor's choice. Labelled data refer to samples for which the true classification output is known, so that the actual error made by the learning algorithm can be computed. The goal of validation is to assess how well different models perform on a validation dataset. Models can be learnt with different parameters, different input features or from different samples, and several strategies can be implemented (see Section A.4.2), mostly depending on data availability. Finally, the testing phase makes use of a test set to estimate the performance of the final model on unseen data.


If robust and reliable results are needed, it is necessary to account for three important components [46]: quality of available data (section A.2.3), data features (section A.3) and type of learning method (section A.4.1). The overall process is described in Figure III.

A.2.2 Main Issues

Overfitting, high dimensionality and the bias-variance trade-off are the main challenges encountered when dealing with any machine learning task.

Overfitting describes the situation in which the model is too close to training data and does not perform well on unseen data. It occurs especially when the complexity of the model is high or its generalization ability is low, or both. Overfitting is the reason why the resubstitution error (the error computed on the training set) is not a good measure of the accuracy of the classifier.

Data dimensionality is closely related to the complexity of the model: the higher the dimension, i.e. the number of variables describing each sample, the greater the complexity that may be required to reach good performance. In particular, the "curse of dimensionality" refers to the need to increase the number of samples with an increasing number of descriptors in order to keep the robustness of the model constant [47]. Feature selection and dimensionality reduction are pre-processing steps that can help reduce the impact of high dimensionality on the final result.

Figure II – Error and model complexity. The error (black line) is given by the contribution of variance (cyan line), squared bias (red line) and an irreducible error, represented by the non-null asymptotes (figure from [48]).


Finally, bias and variance are two tunable components that contribute to the final error (Figure II). Bias refers to the average difference between the true and the predicted output, while variance describes the variations of the output across different experiments. Their trade-off describes the impossibility of having both low bias and low variance for a certain classifier. There are strategies to cope with this effect, for example by combining the answers coming from several learners, as in ensemble learning [18].

Before approaching any machine learning task, the "No Free Lunch" theorem should be kept in mind. It states that a certain model cannot outperform other learners in all the tasks on which it can be tested. Some characteristics of the problem could suggest which classifier could potentially be the best choice, although this choice would be biased by the experience of the person making it. Therefore the most widely suggested approach is trial and error.

A.2.3 Data Quality

In machine learning, data is a collection of samples that are characterized by certain attributes. Each sample is a vector of p feature values and it can be thought of as a single point in a p-dimensional space, in which a dataset is a cloud of N points.

Relying upon high-quality data is important. The performance of a classifier could be biased by a non-representative dataset, or it could be flattened by merging data with different levels of complexity [49]. Missing or duplicate data, outliers, noise and imbalanced class representation worsen the accuracy of a classifier that would otherwise give acceptable results. Thus, the analysis of the available data and appropriate pre-processing steps can significantly improve the results obtained from a finite dataset and increase their robustness. The most common procedures are data cleaning, data transformation and data reduction.

Data cleaning includes:

• compensation for missing values, through imputation methods [50];

• noise smoothing, through binning or regression methods;

• removal of outliers and inconsistencies.


Figure III – General procedure for data analysis and classification.


It is important that any of these procedures is applied cautiously, especially when prior information about the data is not available, in order not to introduce or modify existing trends [51].

Data transformation is recommended, and sometimes necessary, in order not to bias the result towards input features with larger values. Typically, data are standardized or rescaled to a [0, 1] range to avoid dependencies on units of measure, following min-max, z-score or decimal scaling strategies [18]. Depending on the classifier type and speed requirements, data may need to be discretized or represented in a counter-intuitive way.
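For instance, z-score standardization and min-max rescaling can be expressed as follows; this is a generic NumPy sketch added for illustration and is not tied to any specific study.

```python
# Generic examples of the two most common rescaling strategies.
import numpy as np

X = np.random.default_rng(0).normal(loc=50.0, scale=10.0, size=(31, 5))

X_zscore = (X - X.mean(axis=0)) / X.std(axis=0)                    # zero mean, unit std per feature
X_minmax = (X - X.min(axis=0)) / (X.max(axis=0) - X.min(axis=0))   # rescaled to [0, 1] per feature
```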

Data reduction includes:

• numerosity reduction, which does not usually apply to the medical field, where data are usually scarce;

• feature selection, which can strongly improve the classification result by, for example, eliminating correlation among variables and irrelevant features.

Finally, it is suggested to create stratified and balanced subsets when building training, validation and testing sets, in order for the finite dataset to more closely represent the population from which it is sampled [28].

A.3 Data Features

Classifiers can handle data from different sources, if properly pre-processed. Therefore mixed data coming from genetics, clinical indexes [52], demographics and imaging [32] can be exploited. Genetic data can be useful to describe patients' response, but they are not routinely acquired in clinical practice. For example, genomic heterogeneity of tumour tissues was found to be related to a higher probability of resistance to treatment [3]: if this indicator were robustly associated with macroscopic heterogeneity in medical images [12], it could be assessed for a greater number of patients. Clinical data, instead, may be used to stratify patients into risk groups.

A.3.1 Image Features: Radiomics

Radiomics is a field of study in which quantitative and automatic image analysis is performed under the assumption that image features are related to the phenotypic and genetic descriptors of the tumour [5], thus being able to predict treatment outcomes and patient overall survival. This hypothesis has been studied according to two different approaches: in some cases predictive genetic factors are directly compared to image features, while in other cases the prediction power of the image features themselves is investigated. This latter method is useful for assessing early tumour response, since repeated biopsies in space (to determine heterogeneity) and time (to determine lesion evolution) are not clinically feasible due to their invasiveness [53]. Imaging methods, instead, embody an almost cost-free assessment strategy, since they are able to extract information from exams that are already routinely acquired for staging purposes [16].

The prognostic power of a single feature strongly depends on the cancer and treatment type. Additionally, model performance does not always agree across institutions. However, some general concepts regarding image processing, feature taxonomy and feature selection can be commonly applied (Figure IV). The following description refers to the framework of 18F-FDG PET/CT scans repeated in time.

Figure IV – Main steps of Radiomics [5].

Image registration and segmentation are preprocessing steps required before computing features from different images. The accuracy of the registration process determines the quality of the integration of information between scans from different modalities or taken at different time points. The transformation type has to be chosen mainly between rigid, which applies only translations and rotations to the whole image, and deformable, which computes a different displacement for each voxel in the image. If the tumour shrinkage between scans is not significant and volume comparison is required, rigid registration may be suitable. Image segmentation, instead, defines the tumour mass: manual contouring by experts is still the safest procedure [54], even if inter- and intra-operator variability should always be reported as a quality indicator. Automatic procedures are currently being investigated, but neither conclusive results nor established routines are available in clinical practice.
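As an illustrative sketch, a rigid registration between two CT volumes could be set up with SimpleITK roughly as follows; the file names, the metric and the optimizer settings are placeholders and not those used in this work:

import SimpleITK as sitk

fixed = sitk.ReadImage("ct_baseline.nii", sitk.sitkFloat32)    # hypothetical paths
moving = sitk.ReadImage("ct_followup.nii", sitk.sitkFloat32)

# Rigid (rotation + translation) transform, initialized on the image centres
initial = sitk.CenteredTransformInitializer(
    fixed, moving, sitk.Euler3DTransform(),
    sitk.CenteredTransformInitializerFilter.GEOMETRY)

reg = sitk.ImageRegistrationMethod()
reg.SetMetricAsMattesMutualInformation(numberOfHistogramBins=50)
reg.SetOptimizerAsRegularStepGradientDescent(learningRate=1.0,
                                             minStep=1e-4,
                                             numberOfIterations=200)
reg.SetInterpolator(sitk.sitkLinear)
reg.SetInitialTransform(initial, inPlace=False)

transform = reg.Execute(fixed, moving)
resampled = sitk.Resample(moving, fixed, transform, sitk.sitkLinear, 0.0)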

Image features can be divided into three main groups: intensity, shape and texture features.

Intensity features are typically first-order statistical descriptors that include global measures without taking into account spatial relationships among pixels [56]. For PET, images are expressed in SUV, because it normalizes for injected dose, activity of the source and patient size. If volume measurements are available, TLG can be obtained by multiplying the gross tumour volume (GTV) by the mean SUV. For CT, images are normalized to Hounsfield Units (HU). In both cases, mean value, standard deviation, minimum, maximum and other statistics may be computed from appropriate tumour regions. Likewise, the IVH may provide statistical descriptors and metrics that relate intensities to volume percentages.
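A minimal sketch of such first-order descriptors, assuming a PET volume already converted to SUV and a binary tumour mask (array sizes and the voxel volume are illustrative assumptions):

import numpy as np

suv = np.random.rand(64, 64, 32) * 10     # hypothetical SUV volume
mask = np.zeros_like(suv, dtype=bool)
mask[20:30, 20:30, 10:15] = True          # hypothetical tumour segmentation
voxel_volume_ml = 0.4 * 0.4 * 0.3         # hypothetical voxel size in cm^3

tumour_suv = suv[mask]
stats = {
    "mean": tumour_suv.mean(),
    "std": tumour_suv.std(),
    "min": tumour_suv.min(),
    "max": tumour_suv.max(),
}
gtv_ml = mask.sum() * voxel_volume_ml     # gross tumour volume
tlg = gtv_ml * stats["mean"]              # TLG = GTV x mean SUV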

Shape features describe the geometry of the lesion in both PET and CT by its eccentricity, solidity or extent, for example, but they strongly depend on the segmentation performance.

Textural features have raised strong interest in the scientific community because heterogeneity is found to be related to adverse tumour biology [34]. For example, blood vessel heterogeneity may indicate the presence of hypoxic regions that suggest treatment resistance and high metastatic potential, thus being a predictor of adverse outcome [13]. Textural features can be further divided into [2]:

• statistical-based, which describe the relationship between one pixel and another one (second-order) or groups of neighbouring pixels (higher-order) [3];

• model-based, which make use of fractal analysis for example;

• transform-based, which employ wavelets or other transform functions.

As can be seen from Table I and from [2, 12, 57–60], the number of features that can be computed is huge. If both PET and CT data are available this number may double, and it grows further if scans repeated in time are taken. Data would then be described in an extremely high-dimensional space.
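As an illustration of a second-order statistical descriptor, a grey-level co-occurrence matrix (GLCM) and a few derived texture measures could be computed with scikit-image; the 2D slice, the quantization to 32 grey levels and the pixel offsets are illustrative assumptions (the function names follow recent scikit-image versions, where they are spelled graycomatrix/graycoprops):

import numpy as np
from skimage.feature import graycomatrix, graycoprops

# Hypothetical 2D tumour slice quantized to 32 grey levels
slice_2d = (np.random.rand(40, 40) * 32).astype(np.uint8)

# Co-occurrence matrix for a one-pixel offset and four directions
glcm = graycomatrix(slice_2d, distances=[1],
                    angles=[0, np.pi / 4, np.pi / 2, 3 * np.pi / 4],
                    levels=32, symmetric=True, normed=True)

contrast = graycoprops(glcm, "contrast").mean()
homogeneity = graycoprops(glcm, "homogeneity").mean()
energy = graycoprops(glcm, "energy").mean()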

Abbreviations: SUV = Standardized Uptake Value; TLG = Total Lesion Glycolysis; IVH = Intensity Volume Histogram; BED = Biologically Effective Dose.


Table I – Summary of previous works that use image features in lung cancer to predict treatment outcome.

Study   Endpoint         Image Type        Main Features
[12]    Survival         baseline CT       440 features: intensity (energy), shape (compactness), texture (nonuniformity), wavelet, volume
[52]    OS               baseline CT       66 texture features, 12 clinical features
[17]    Recurrence       baseline CT       Maximum 2D diameter, BED metrics
[11]    OS               baseline PET      SUV peak, GTV, IVH-based metrics
[61]    OS               baseline PET      GTV size, TNM staging
[10]    Survival         baseline PET      SUV, TLG, texture (coarseness, contrast, busyness, complexity)
[8]     OS               repeated PET      Effective radiosensitivity
[15]    Tumour control   baseline PET/CT   SUV, HU, TLG, IVH, texture (energy, contrast, entropy, local homogeneity)
[13]    Survival         baseline PET/CT   Multiscale uniformity, SUV
[14]    OS               repeated PET/CT   Change in: SUV, CT and PET volumes


A.3.2 Feature Selection

To deal with high dimensionality and to avoid overfitting, dimensionality reduction should be applied, either by selecting a subset of the many available features or by extracting a few new ones, in order not to hinder the robustness of the results. The main intent is to describe data through uncorrelated and relevant features. Principal Component Analysis (PCA) is a well-established method [62] that describes data in a new space of uncorrelated variables; the most relevant ones are then selected based on variance maximization.
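A minimal PCA sketch with scikit-learn, retaining enough components to explain a fixed fraction of the variance (the 95% threshold and the data dimensions are illustrative assumptions):

import numpy as np
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

X = np.random.rand(20, 100)                # hypothetical: 20 patients, 100 features

X_std = StandardScaler().fit_transform(X)  # PCA is sensitive to feature scales
pca = PCA(n_components=0.95)               # keep components explaining 95% of the variance
X_reduced = pca.fit_transform(X_std)
print(X_reduced.shape, pca.explained_variance_ratio_)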

Regardless of the particular method, features should be chosen based on independence, reproducibility and prominence [5]. Independence refers to the fact that features should not be redundant, i.e. correlated, in order to maximize the classifier's generalization ability.

However, independent variables should also be stable, both with respect to the selection method itself [6] and to the acquisition process, especially when dealing with PET data, which are affected by many confounding factors [35]. In the latter case, studies of test-retest reliability should be performed, together with studies of intra- and inter-operator variability for procedures that are not fully automatic [59]. Finally, features should be chosen based on their prediction power, which can be assessed with different strategies.

Simplicity of computation and clinical interpretability could also be taken into account, depending on the final goal of the study.
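One simple way to enforce the independence criterion discussed above is to discard one feature from every highly correlated pair; the 0.9 correlation threshold in the sketch below is an illustrative assumption, not a value used in this work:

import numpy as np

X = np.random.rand(20, 50)                   # hypothetical feature matrix

corr = np.abs(np.corrcoef(X, rowvar=False))  # feature-to-feature correlation
keep = np.ones(X.shape[1], dtype=bool)
for i in range(X.shape[1]):
    if not keep[i]:
        continue
    # discard later features that correlate strongly with feature i
    for j in range(i + 1, X.shape[1]):
        if keep[j] and corr[i, j] > 0.9:
            keep[j] = False
X_independent = X[:, keep]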

A.4 Supervised Classification

A.4.1 Learning Algorithms

A number of machine learning algorithms can perform classification [18, 63, 64]; the most relevant methods are presented in the following sections.

A Standard Approach: Regression

Logistic regression [9] is a mathematical modelling approach to a multiple regression problem where the logistic function LR links the probability of a binary output to the linear predictor function:

LR = \frac{1}{1 + \exp(-w^{T}x)} = \frac{\exp(w^{T}x)}{1 + \exp(w^{T}x)},   (1)

where w are the model's parameters and x are the input variables. It can also be seen as an extension of the Naïve Bayes classifier in which the assumption of independence is removed. The training set is used to drive an optimization algorithm that finds the best weights w. Cox regression (or proportional hazards regression) [16], instead, is used in tumour response prediction when survival analysis is conducted. It models the incidence per unit time, i.e. the rate at which the disease occurs in the population at risk.
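A minimal sketch of fitting this model with scikit-learn and recovering the probability of Eq. (1) from the fitted weights (the data are hypothetical):

import numpy as np
from sklearn.linear_model import LogisticRegression

np.random.seed(0)
X = np.random.rand(30, 4)                     # hypothetical feature matrix
y = (X[:, 0] + X[:, 1] > 1.0).astype(int)     # hypothetical binary outcome

clf = LogisticRegression().fit(X, y)
w, b = clf.coef_[0], clf.intercept_[0]        # weights and bias of the linear predictor

# Probability of the positive class for one sample, as in Eq. (1)
x = X[0]
p = 1.0 / (1.0 + np.exp(-(w @ x + b)))
assert np.isclose(p, clf.predict_proba(X[:1])[0, 1])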

Decision Trees

Decision trees [65] make use of divide-and-conquer approaches to recursively partition the training set according to a splitting criterion and appropriate stopping rules. Different algorithms exist, but a common underlying process can be identified. The first split, the root of the tree, divides data into branches according to the feature chosen by the splitting metric. For each branch, the splitting procedure is repeated and a testing attribute is assigned to each node. When data in a branch reach purity with respect to one class, a leaf is created: the branch does not grow any further and any test sample that reaches that leaf is assigned to its class. If a stopping rule (for example the maximum depth of the tree or the minimum number of samples in a node) forces the tree to stop growing before purity is reached, the class of the leaf is assigned by majority voting among the classes of its samples (Figure V).

The only parameters that need to be defined are the pruning strategy (pre- or post-pruning), to avoid overfitting, and the splitting criterion. The interpretability of the resulting model depends mostly on the depth of the tree and on the type of splitting that is allowed (binary or multivariate), but it is among the highest of all classifiers. However, since discrete variables are more easily handled, trees are usually applied to clinical data.
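A short sketch of a pre-pruned tree with scikit-learn, where the maximum depth and the minimum leaf size act as stopping rules as described above (all parameter values and data are illustrative assumptions):

import numpy as np
from sklearn.tree import DecisionTreeClassifier, export_text

np.random.seed(0)
X = np.random.rand(40, 3)                    # hypothetical samples
y = (X[:, 0] > 0.5).astype(int)              # hypothetical binary labels

tree = DecisionTreeClassifier(criterion="gini",       # splitting criterion
                              max_depth=3,            # pre-pruning: maximum depth
                              min_samples_leaf=5)     # pre-pruning: minimum leaf size
tree.fit(X, y)
print(export_text(tree))                     # human-readable splits, one per node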

Figure V – Decision tree with three nodes and splitting thresholds for binary classification [45].

Support Vector Machines

Support vector machines (SVMs) [17, 66] are classifiers with a strong theoretical foundation. In a linear SVM, the decision boundary is defined by the weights w as the hyperplane

w^{T}x = 0   (2)

that crosses the space between two classes and maximizes the distance d (i.e. the margin, Eq. (4)) between the hyperplane itself and the closest positive p and negative q samples.

2d = \frac{w^{T}}{\|w\|}(p - q) = \frac{w^{T}p - w^{T}q}{\|w\|} = \frac{1 - (-1)}{\|w\|}   (3)

d = \frac{1}{\|w\|}   (4)

Training samples that lie on the margins are called support vectors (Figure VI): they encode all the information needed to define the boundary function in the input space and determine the model complexity. Different algorithms, gradient-based methods among them, can solve this maximization problem.

Figure VI – Boundary, margins d and support vectors for a linearly separable problem [63].
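A minimal linear SVM sketch with scikit-learn; the support_vectors_ attribute exposes the training samples lying on the margins, and the margin half-width follows Eq. (4) (the data are hypothetical):

import numpy as np
from sklearn.svm import SVC

np.random.seed(0)
X = np.vstack([np.random.randn(20, 2) + 2,    # hypothetical positive class
               np.random.randn(20, 2) - 2])   # hypothetical negative class
y = np.array([1] * 20 + [-1] * 20)

clf = SVC(kernel="linear", C=1.0).fit(X, y)

w = clf.coef_[0]                              # normal vector of the separating hyperplane
margin = 1.0 / np.linalg.norm(w)              # half-width d of the margin, as in Eq. (4)
support_vectors = clf.support_vectors_        # samples lying on (or inside) the margins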
