DEGREE PROJECT IN MEDICAL ENGINEERING, SECOND CYCLE, 30 CREDITS
STOCKHOLM, SWEDEN 2020

Deep Learning Based Multi-Label Classification of Radiotherapy Target Volumes for Prostate Cancer

LINA WELANDER

KTH

Swedish title: Djupinlärningsbaserad fleretikettsklassificering av målvolymer för prostatacancer inom strålterapi

by Lina Welander

Master in Medical Engineering
Supervisors: Tommy Lofstedt, Eva Onjukka, Elinor Wislander
Reviewer: Chunliang Wang

Abstract

An initiative to standardize the radiotherapy nomenclature in Sweden started in 2016, along with the creation of the local database Medical Information Quality Archive (MIQA) and a national radiotherapy register on the Information Network for CAncercare (INCA). A problem of identifying the clinical target volume (CTV) structures and their prescribed dose arose because the consecutive number appended to the CTV name was assigned inconsistently in MIQA and INCA. Deep neural networks (DNNs) are promising tools for the multi-label classification of CTVs that would enable automatic labeling in the databases. Prostate cancer patients, who often have more than one type of organ in the same CTV structure, were chosen as a proof of concept. The DNNs were trained in a supervised 2D fashion, where the radiation therapy (RT) structures along with the CT image were fed, slice by slice, to AlexNet and VGGNet to label the CTV structures in MIQA and INCA. The study also includes three methods to derive a final label for the whole CTV structure, since the model makes its predictions on each slice: the maximum method, which takes the maximum prediction for each class; the minimum method, which takes the minimum prediction for each class; and the occurrence method, which chooses the maximum prediction if the network has predicted the class above 0.5 at least twice, and the minimum prediction otherwise. The DNNs and volume classification methods performed well, with the maximum and occurrence methods performing best, and can be used to interpret RT volumes in MIQA and INCA for prostate cancer patients. This novel study gives promising results for the future development of deep neural networks classifying RT structures for more than one type of cancer patient.


Nomenclature

CNN   Convolutional Neural Network
CT    Computed Tomography
CTV   Clinical Target Volume
INCA  Information Network for CAncercare
MICE  Medical Interactive Creative Environment
MIQA  Medical Information Quality Archive
MR    Magnetic Resonance
OIS   Oncological Information System
ROC   Receiver Operating Characteristics
RT    Radiation Therapy


Contents

1 Introduction
2 Methods
  2.1 Data
    2.1.1 Pre-processing
  2.2 Data Augmentation
  2.3 DNN Architectures
3 Results
4 Discussion
  4.1 Trained DNNs
  4.2 Structure Classification
5 Conclusion
A State of the Art
  A.1 MIQA & INCA
    A.1.1 MIQA
    A.1.2 INCA
  A.2 MICE
  A.3 Radiation Therapy
    A.3.1 Medical Imaging in Radiotherapy
    A.3.2 Standardization of labeling
  A.4 Classification
    A.4.1 Deep Learning
    A.4.2 Supervised, semi-supervised and unsupervised
    A.4.3 Convolutional Neural Network
    A.4.4 Transferred Learning
    A.4.5 Multi-Label Classification
    A.4.6 Multi-label Front-end networks
  A.5 Evaluation
    A.5.1 Evaluation Partitions


1 Introduction

An initiative started in Sweden in 2016 to standardize the nomenclature for anatomical structures defined during radiation therapy. This facilitated the development of a new national radiotherapy registry placed on a platform called the Information Network for CAncercare (INCA). The IT platform is managed by the Regional Cancer Centers, which receive their information from local databases called Medical Information Quality Archive (MIQA), located at each hospital and holding the patients from the surrounding area[1].

The annotation of RT structure names is typically performed manually. This produces patient data that are difficult to use for clinical revision, research, and development. In this work a deep neural network (DNN) is proposed, partly to take the human interaction out of the equation, but more importantly to make the radiotherapy (RT) data that are stored in the hospital databases more comprehensible.

The clinical target volume (CTV) is a type of anatomical structure that the physician draws in a 3D-rendered image to mark where the highest dose of the RT will be delivered. The CTV can contain three categories: tumor tissue, lymph nodes or metastases[1]. When there is more than one CTV containing the same category, the structures are distinguished with a consecutive number in MIQA. MIQA databases are located at each separate hospital and contain CT images in DICOM format[2] and treatment data (both planned and performed) for RT patients. The structure names are extracted and standardized along with diagnoses from the oncology information system MOSAIQ or ARIA[3]. When the data are uploaded from MIQA to INCA, the CTV receives a new consecutive number for the same purpose as before, to distinguish CTVs containing the same category. This consecutive number does not cohere with the consecutive number given in MIQA. Thus, the data uploaded to INCA contain no indication of which organ or organs are irradiated in the CTV structure identified as the target volume for RT.

CTV structures for prostate cancer patients include the following four anatomical structures: prostate, right and left vesicle, and adjuvant lymph nodes.

The medical images and RT structures, along with the CTV name, are stored in MIQA and later uploaded to INCA. This is done with non-specific labels due to the inconsistency of adding a consecutive number to the CTV name in MIQA and INCA. No previous solution has been attempted for the task at hand, meaning that the mapping between a CTV name and the actual anatomical structures it refers to is incomprehensible. The alternative to this would be to name the RT structure by its clinical name. RT structures for prostate cancer have the following clinical names: prostate, seminalVesicle_L, seminalVesicle_R, and adjuvant lymph nodes.

A project to create an automated solution that could classify the treatment target volume from the CTV mask and CT images was initiated to make the information in INCA distinct. The solution will be integrated into MIQA to classify the RT structure, add the label to the information about the CTV and connect it with the right dose level. Prostate cancer, where multiple CTVs are often created, will be used as proof of concept.

Classification is a wide subfield in machine and deep learning where text[4], biomedical[5] and image data[6] can be classified with a label depending on its content. In the medical environment, physicians could conceivably get help with interpreting medical images such as CT and MR images for diagnostics. An example of this is the study of Geert Litjens et al. on how to use deep learning as a tool to increase the accuracy and efficiency of histopathological diagnosis by looking at MR-guided biopsy results[7]. Deep learning can lighten the workload as well as contribute a supporting assessment/labeling, which is one of the underlying goals of this thesis.

This thesis uses multi-label classification to determine if the four different anatomical structures are present or not as they can appear in the same structure-/CT volume or slice.

Aim and contributions

The aim of this thesis is to develop and evaluate a deep learning based method for multi-label classification of CTV structures, in order to improve the storage and retrieval of information in the database systems MIQA and INCA.

2 Methods

The approach for multi-label classification that predicts four different classes in one label is presented in this section. The model is fed with CT images in a 2D fashion and an anatomical structure mask (derived from the RT structure set) extending through the 3D image. The DNN classifies each slice independently. The classification contains the probability, given by the DNN, that a particular RT structure consists of one up to four classes. The highest or lowest probability for each class across the DNN's slice-wise labels for an RT structure will be stored in one final label. This label will be used to add additional information to MIQA and INCA.

In this project, the ability of a convolutional neural network (CNN) to classify prostate cancer target structures using multi-class labels was studied. This is accompanied by studying the prediction of a whole structure volume from the CNN's 2D probabilities. This section will also describe the data management.

2.1 Data

The data used in this project were collected from the Karolinska University Hospital in Stockholm, Skåne University Hospital in Lund and the University Hospital of Umeå. CT images and RT structure sets in DICOM format[2] of one hundred prostate cancer patients were collected from each hospital. Karolinska and Skåne University Hospital had a slice thickness of 3 mm and the University Hospital of Umeå 2.5 mm. Five patients from Umeå were excluded because their RT structures were interrupted and hard to interpret in the sense of which label they belonged to. Patients with all of the RT structures (prostate, seminal vesicles, and lymph nodes) were primarily chosen to get as much training data as possible. The structure volume defined by the physician could include more than one of the classes (prostate/right vesicle/left vesicle/adjuvant lymph node) simultaneously. Patients with structures that comprise more than one class were therefore also prioritized, to represent the different combinations that an RT structure could embody.

2.1.1 Pre-processing

Figure 1: The first and last slice of CTV structures containing a single class: (a, b) prostate; (c, d) right vesicle; (e, f) left vesicle; (g, h) adjuvant lymph node.

Figure 2: The first and last slice of CTV structures containing combinations of classes: (a, b) right & left vesicle; (c, d) prostate, right & left vesicle; (e, f) adjuvant lymph node, right & left vesicle; (g, h) prostate, adjuvant lymph node, right & left vesicle.

The images were imported into Python and concatenated with their respective mask(s). Only one CTV mask was concatenated with the CT image at a time, creating two channels. Therefore, the same CT image was, in most cases, used for concatenation more than once.

The labels are created from label-sets of strings of the present classes (prostate, right vesicle, left vesicle, and adjuvant lymph node) that are made binary. The labels were represented as follows:

[1, 0, 0, 0] Left Vesicle
[0, 1, 0, 0] Adjuvant Lymph Node
[0, 0, 1, 0] Right Vesicle
[0, 0, 0, 1] Prostate

To avoid processing unnecessary data and to reduce the imbalance between classes, only slices where the maximum pixel value in the binary mask was larger than zero were extracted. Therefore, the CT image and CTV mask have at least one class present throughout the training and test data. In a multi-label classification problem where the different classes appear in combination, the label can also take a form such as [1, 0, 1, 0], as seen in Figure 2 (a and b). This is interpreted as both the left and right vesicle being present in the CT image and CTV mask.
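As a minimal sketch of this encoding (the function and class names are illustrative, not taken from the thesis code):

```python
import numpy as np

# Class order matching the label layout above.
CLASSES = ["left_vesicle", "adjuvant_lymph_node", "right_vesicle", "prostate"]

def encode_label(present_classes):
    """Turn a set of class-name strings into a binary multi-hot vector."""
    label = np.zeros(len(CLASSES), dtype=np.float32)
    for name in present_classes:
        label[CLASSES.index(name)] = 1.0
    return label

# Both vesicles present in the slice -> [1. 0. 1. 0.]
print(encode_label({"left_vesicle", "right_vesicle"}))
```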

The CT images' shape and pixel intensities were processed for better memory use on the GPU server and for the learning of the DNN. The image size was halved from 512×512 to 256×256, and the pixel intensity was clipped to [-1000, 1000] and then normalized to [-2, 2]. This is detailed in Table 1.

           Before          After          Type
Reshape    512×512         256×256        number of pixels
Clip       [-1000, 1024]   [-1000, 1000]  range of pixel values
Normalise  [-1000, 1000]   [-2, 2]        range of pixel values

Table 1: Pre-processing values for the CT image (the mask is binary and only needed reshaping). The datatype of the images was changed from 64-bit floating point to 32-bit for the same purpose: to reduce the memory usage on the GPU server.
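A plausible implementation of these steps (the resizing routine is an assumption; the thesis does not name one):

```python
import numpy as np
from skimage.transform import resize  # assumed resizing utility

def preprocess_ct(ct_slice):
    """Apply the steps in Table 1: reshape, clip, normalise, and cast."""
    x = resize(ct_slice, (256, 256), preserve_range=True)  # 512x512 -> 256x256
    x = np.clip(x, -1000, 1000)       # clip the pixel intensities
    x = (x / 1000.0) * 2.0            # map [-1000, 1000] linearly to [-2, 2]
    return x.astype(np.float32)       # 64-bit -> 32-bit floats
```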

The pre-processed and concatenated images, together with their respective labels, were split in advance into four folders for cross-validation during DNN training. The alternative combinations of anatomical structures outlined as a single CTV (only prostate, prostate and both vesicles, etc.) were distributed as equally as possible. This was done without splitting a patient with more than one CTV into different cross-validation folds, to keep the CT images unseen in the test data. The number of slices from each class and the number of structures placed in each folder are shown in Table 2.

Folder   Left Vesicle   Adjuvant Lymph Node   Right Vesicle   Prostate   Nr of Structures
1        660            1667                  663             1877       218
2        771            1936                  771             1776       240
3        743            1904                  743             1827       227
4        791            1952                  791             1890       231

Table 2: Number of slices of each class and number of structures in each of the four folders.

2.2 Data Augmentation

To create a robust DNN that focuses on the right attributes in the images, data augmentation is used to increase the variation of the appearance within the dataset. This can help the DNN to reduce over-fitting and to artificially create a larger training set since it "creates" more trainable images.


In this work, images were duplicated and edited with respect to object orientation and size. The augmentation parameters used were rotation by 8°, zoom in and out by 20%, and a shift in width and height by 5%. These parameters were chosen after some trial images were created to check that the images still looked realistic.
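A sketch of these settings; whether the thesis used Keras' ImageDataGenerator is an assumption:

```python
from tensorflow.keras.preprocessing.image import ImageDataGenerator

augmenter = ImageDataGenerator(
    rotation_range=8,         # rotate by up to 8 degrees
    zoom_range=0.2,           # zoom in/out by up to 20%
    width_shift_range=0.05,   # shift width by up to 5%
    height_shift_range=0.05,  # shift height by up to 5%
)
# augmenter.flow(x_train, y_train, batch_size=16) yields augmented
# batches in real time during training.
```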

2.3 DNN Architectures

Two DNN architectures were evaluated in this work. AlexNet[6] and VGGNet16[8] are well-known architectures; see Figure 3 for an illustration of the DNN architectures used in this thesis.

Figure 3: Illustrations of the two DNN architectures used in this thesis: (a) AlexNet and (b) VGGNet.

AlexNet

The model consisted of five convolutional layers with the rectified linear unit (ReLU) as the activation function and ended with four dense layers. The first two convolutional layers used 5 × 5 convolutions, where the first of the two had a 4 × 4 stride. The rest of the convolutional layers performed 3 × 3 convolutions.

The convolutional layers can be parted into blocks, where a max-pooling layer was added after each convolutional layer to reduce the spatial size of the feature maps. Each block ended with batch normalization to reduce the covariate shift. This created a target with a mean close to zero and a standard deviation close to one, creating an immobile learning target. After each dense layer, a dropout of 0.4 was added along with batch normalization. The last layer used the sigmoid activation function to compute the probability for each label individually on each image; binary cross-entropy was used as the loss function and ADAM as the optimizer with an initial learning rate of 10^-5, and the metric was set to accuracy.
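A Keras sketch of such a model, assuming filter counts and dense-layer sizes (which the thesis does not state):

```python
from tensorflow.keras import layers, models, optimizers

def build_alexnet_like(input_shape=(256, 256, 2), n_classes=4):
    """AlexNet-style model as described above; filter counts are assumed."""
    m = models.Sequential()
    m.add(layers.Conv2D(96, 5, strides=4, activation="relu",
                        input_shape=input_shape))     # 5x5 conv, 4x4 stride
    m.add(layers.MaxPooling2D())
    m.add(layers.BatchNormalization())
    m.add(layers.Conv2D(256, 5, activation="relu", padding="same"))
    m.add(layers.MaxPooling2D())
    m.add(layers.BatchNormalization())
    for filters in (384, 384, 256):                   # three 3x3 conv layers
        m.add(layers.Conv2D(filters, 3, activation="relu", padding="same"))
        m.add(layers.MaxPooling2D())
        m.add(layers.BatchNormalization())
    m.add(layers.Flatten())
    for units in (1024, 1024, 512):                   # dense + dropout + BN
        m.add(layers.Dense(units, activation="relu"))
        m.add(layers.Dropout(0.4))
        m.add(layers.BatchNormalization())
    m.add(layers.Dense(n_classes, activation="sigmoid"))  # one prob per class
    m.compile(optimizer=optimizers.Adam(learning_rate=1e-5),
              loss="binary_crossentropy", metrics=["accuracy"])
    return m
```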

VGGNet

The model performed 3 × 3 convolutions in all of the layers, where the first two blocks consisted of two convolutional layers with the rectified linear unit as activation, followed by a max-pooling layer. The two blocks after that consist of three convolutional layers. Three dense layers with a dropout of 0.3 conclude the model, where the last activation function was, once again, the sigmoid. The loss function was binary cross-entropy and the optimizer was ADAM with a learning rate of 10^-4.
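The corresponding sketch for the VGG-style model, again with assumed filter counts and dense-layer sizes:

```python
from tensorflow.keras import layers, models, optimizers

def build_vgg_like(input_shape=(256, 256, 2), n_classes=4):
    """VGG-style model as described above; filter counts are assumed."""
    m = models.Sequential([layers.InputLayer(input_shape=input_shape)])
    for n_convs, filters in [(2, 64), (2, 128), (3, 256), (3, 512)]:
        for _ in range(n_convs):              # 3x3 convolutions throughout
            m.add(layers.Conv2D(filters, 3, activation="relu", padding="same"))
        m.add(layers.MaxPooling2D())          # halve the spatial resolution
    m.add(layers.Flatten())
    for units in (1024, 1024):                # two hidden dense layers
        m.add(layers.Dense(units, activation="relu"))
        m.add(layers.Dropout(0.3))
    m.add(layers.Dense(n_classes, activation="sigmoid"))
    m.compile(optimizer=optimizers.Adam(learning_rate=1e-4),
              loss="binary_crossentropy", metrics=["accuracy"])
    return m
```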

CTV Classification Methods

To meet the aim of the thesis, one single label had to deliver all the present classes in the CTV structure. The DNN delivered a label for each CTV slice, so an approach to classify the whole structure had to be determined. Three methods were chosen, denoted the maximum, minimum and occurrence method.

The maximum-method identifies the highest probability of each class in any slice and returns them as the final label.

The minimum-method does the opposite of the maximum-method and identifies the lowest probability for each class in any slice and returns them as the final label.

The occurrence-method is an attempt to eliminate erroneous classifications by the DNN by making sure the prediction has occurred more than once. It first determines whether the probability of a class has exceeded 0.5 in at least two slices. If a class fulfills this condition, the highest probability is added to the final label; if not, the lowest probability is added. The limit of 0.5 was chosen because it means the DNN predicts the class to be more probably present than not. The second limit, to appear at least two times with probabilities over 0.5, was selected with regard to the vesicles, which in various cases appear in only two slices.

The three methods will be evaluated to see which one is the most suitable one to classify the complete CTV structure.
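A sketch of the three methods, collapsing the per-slice predictions into one structure-level label (the function name and array layout are illustrative):

```python
import numpy as np

def classify_structure(slice_probs, method="occurrence",
                       threshold=0.5, min_occurrences=2):
    """Collapse per-slice predictions (n_slices x n_classes) into one final
    label for the whole CTV structure."""
    slice_probs = np.asarray(slice_probs)
    if method == "maximum":          # highest probability per class
        return slice_probs.max(axis=0)
    if method == "minimum":          # lowest probability per class
        return slice_probs.min(axis=0)
    # Occurrence: take the maximum if the class exceeded the threshold in at
    # least `min_occurrences` slices, otherwise take the minimum.
    exceeded = (slice_probs > threshold).sum(axis=0) >= min_occurrences
    return np.where(exceeded, slice_probs.max(axis=0), slice_probs.min(axis=0))
```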

CTV Label Performance

The performance and ability to classify the multi-labels can be measured in terms of True Positives (TP), predicting the class as present when it is; False Positives (FP), predicting the class as present when it is not; True Negatives (TN), predicting the class as not present when it is not; and False Negatives (FN), predicting the class as not present when it actually is.

The receiver operating characteristic (ROC) gives a measure of how well the model performs on a binary task by comparing the true positive rate (TPR),

TPR = TP / (TP + FN),

to the true negative rate (TNR),

TNR = TN / (TN + FP).

The predictions will also be evaluated by calculating the accuracy,

A = (TP + TN) / (TP + FP + TN + FN),

the precision,

P = TP / (TP + FP),

and the recall (which is the same as the TPR),

R = TP / (TP + FN),

for further evaluation of the performance of the three structure prediction methods.
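A minimal sketch of these scores from the confusion-matrix counts (the function name is illustrative):

```python
def scores(tp, fp, tn, fn):
    """Accuracy, precision and recall from the confusion-matrix counts."""
    accuracy = (tp + tn) / (tp + fp + tn + fn)
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)   # identical to the TPR defined above
    return accuracy, precision, recall
```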

DNN Training & Evaluation

The DNNs were trained with randomly shuffled 2D slices from the 295 patients, with said augmentation applied in real time during training. Both DNNs used the ADAM optimizer, with the distinction that AlexNet had a learning rate of 10^-5 and VGGNet of 10^-4. The loss function used in the DNNs

was the binary cross-entropy. They were fed with a batch size of 16 slices at a time, where AlexNet needed 40 epochs and VGGNet 20 epochs for stabilization. The images were 256 × 256 pixels with two channels containing the CT and corresponding mask slice. The construction of the DNNs is visualized in Figure 3 and described earlier in the method section. Accuracy and loss graphs were used to check for overfitting, instability and overall performance during training. To enable a way to assess the model performance, a four-fold cross-validation setup was used.
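A hypothetical outline of that loop, reusing the sketches above (load_fold() is a placeholder; the thesis does not detail the data loading):

```python
# Four-fold cross-validation with real-time augmentation.
for fold in range(4):
    (x_train, y_train), (x_test, y_test) = load_fold(fold)  # hypothetical helper
    model = build_alexnet_like()   # the VGG-style model used 20 epochs instead
    model.fit(augmenter.flow(x_train, y_train, batch_size=16),
              epochs=40, validation_data=(x_test, y_test))
```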

To further check the performance of the DNNs, graphs of the Receiver Operating Characteristics (ROC)[9] were plotted for each class over each fold using each of the CTV classification methods (maximum/minimum/occurrence). This was accompanied by the individual scores for accuracy, precision, and recall. The scores were presented in tables for, once again, each class over each fold using each of the three CTV classification methods.
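One way to produce such a per-class ROC curve, sketched with scikit-learn (the input arrays below are placeholder values, not data from the thesis):

```python
import numpy as np
import matplotlib.pyplot as plt
from sklearn.metrics import roc_curve, auc

# Binary ground truth and predicted probabilities for one class (examples).
y_true = np.array([0, 0, 1, 1, 1, 0, 1])
y_score = np.array([0.1, 0.4, 0.8, 0.9, 0.6, 0.3, 0.7])

fpr, tpr, _ = roc_curve(y_true, y_score)
plt.plot(fpr, tpr, label="AUC = %.2f" % auc(fpr, tpr))
plt.xlabel("False positive rate")
plt.ylabel("True positive rate")
plt.legend()
plt.show()
```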

3 Results

The results from the performance of the DNN along with various experiments are provided in this section. The model was tested on previously unseen data which brings meaningful results to the proof of concept and the future usability of the DNN.

The results are compared in terms of the two DNNs' performance with the three different CTV structure classification methods. The training was performed on each class individually over the four cross-validation folds, assessed by ROC along with accuracy, precision and recall scores.

In Tables 3 and 4 we see the accuracy and loss scores during the last epoch on the test batch. The combined training accuracy reached 97.8625 % ± 0.5240 % for AlexNet and 98.550 % ± 0.468 % for VGGNet. In Figures 4 and 5 the accuracy during training is visualized for AlexNet and VGGNet. AlexNet needs the 40 epochs to level out and has a steady rise during training. The validation curve rises slightly faster but has an unsteadier path. VGGNet reaches a high accuracy relatively fast during training. The training and validation curves start to level out after only five epochs, and the accuracy improves only a little for the remaining time. Table 5 presents the results for the maximum, minimum and occurrence methods for classifying the CTV structure with one final label. The parameters calculated and presented are accuracy, precision, and recall.

The accuracy for classifying the prostate with the maximum and occurrence methods ranged between 99 % and 100 % over the four folds. Precision gave equal results, with the exception of the adjuvant lymph nodes, which ranged between 98 % and 100 %. Recall gave top scores for these two classes except during fold three for the adjuvant lymph node.

For the vesicles, the accuracy ranged between 89 % and 97 % and the precision between 79 % and 95 %. Recall showed higher or equal values for the maximum method, ranging between 95 % and 99 %.

The minimum method received lower scores for accuracy and recall by several percentage points. The precision always received a score of 100 %.

The corresponding scores for the second DNN, VGGNet, are presented in Table 6. The accuracy score for the left and right vesicle improved by two to three percentage points. In exchange, the accuracy decreased by less than a percentage point for the adjuvant lymph node and prostate using the maximum and occurrence methods. The third parameter, recall, improved by zero to four percentage points over all of the methods and classes. The minimum method also improved regarding the accuracy and recall parameters for each of the four classes, while the precision remained unchanged.

The ROC curve visualizes how well the DNN performs at predicting high when the class is present and low when it is not. AlexNet succeeds well and receives an area under the curve (AUC) score of 100 % for the adjuvant lymph node and prostate using either the maximum or the occurrence method, as seen in Figures 6 and 7. The vesicles' performance, with a 98 % AUC score, still indicates a well-performing DNN. The minimum method yields lower AUC scores for all four classes, with pronounced differences regarding the vesicles, seen in Figure 8. The adjuvant lymph node and prostate get AUC scores around 98 %, while the vesicles lie around 90 %. VGGNet's ROC performance in Figures 9 and 10 yields the same performance regarding the adjuvant lymph node and prostate as AlexNet for the occurrence and maximum methods. The vesicles' AUC score improved by 1 % for the maximum method, receiving an AUC score of 99 %. The minimum method once again receives lower scores, but with an improvement of the AUC score by two percentage points for the vesicles, seen in Figure 11.

Figure 4: Training and validation accuracy per epoch for AlexNet over folds 1-4.

AlexNet    Accuracy               Loss
Fold 1     97.99 %                0.080
Fold 2     98.17 %                0.050
Fold 3     98.33 %                0.062
Fold 4     96.96 %                0.102
Combined   97.8625 % ± 0.524 %   0.0735 ± 0.0192

Table 3: Performance for each respective fold in AlexNet after training.

VGGNet     Accuracy               Loss
Fold 1     98.23 %                0.0877
Fold 2     98.85 %                0.0476
Fold 3     99.16 %                0.0459
Fold 4     97.96 %                0.1293
Combined   98.55 % ± 0.468 %     0.07763 ± 0.03350

Table 4: Performance for each respective fold in the VGGNet after training.

Figure 5: Training and validation accuracy per epoch for VGGNet over folds 1-4.

AlexNet

Fold 1        Left Vesicle        Lymph Node          Right Vesicle       Prostate
Method        A      P      R     A      P      R     A      P      R     A      P      R
Maximum       95.65  92.31  96.55 99.57  98.18  100   96.52  93.40  97.70 99.57  99.12  100
Minimum       86.96  100    65.52 97.83  100    90.74 86.96  100    65.52 87.83  100    75.00
Occurrence    95.65  93.26  95.40 99.57  98.18  100   96.52  94.38  96.55 99.57  99.12  100

Fold 2        Left Vesicle        Lymph Node          Right Vesicle       Prostate
Method        A      P      R     A      P      R     A      P      R     A      P      R
Maximum       95.13  90.43  97.70 99.56  98.21  100   96.02  92.39  97.70 99.56  99.11  100
Minimum       84.96  100    60.92 98.23  100    92.73 85.40  100    62.07 86.73  100    72.97
Occurrence    96.46  94.38  96.55 99.56  98.21  100   96.46  94.38  96.55 99.56  99.11  100

Fold 3        Left Vesicle        Lymph Node          Right Vesicle       Prostate
Method        A      P      R     A      P      R     A      P      R     A      P      R
Maximum       94.56  89.47  96.59 99.58  100    98.25 94.14  87.76  97.73 99.58  99.20  100
Minimum       86.19  100    62.50 97.91  100    91.23 86.19  100    62.50 85.36  100    71.77
Occurrence    94.56  90.32  95.46 99.16  100    96.49 96.23  92.47  97.73 99.58  99.20  100

Fold 4        Left Vesicle        Lymph Node          Right Vesicle       Prostate
Method        A      P      R     A      P      R     A      P      R     A      P      R
Maximum       89.86  79.80  97.53 100    100    100   90.32  80.00  98.77 100    100    100
Minimum       84.79  100    59.26 99.08  100    95.75 83.41  100    55.56 81.11  100    64.35
Occurrence    90.78  81.44  97.53 100    100    100   90.78  81.44  97.53 100    100    100

Table 5: A = Accuracy / P = Precision / R = Recall scores for AlexNet on each class and each volume prediction method over each of the four folds.

VGGNet

Fold 1        Left Vesicle        Lymph Node          Right Vesicle       Prostate
Method        A      P      R     A      P      R     A      P      R     A      P      R
Maximum       97.83  94.57  100   98.70  94.74  100   97.83  94.57  100   99.13  98.25  100
Minimum       87.39  100    66.67 98.70  100    94.44 94.44  100    71.43 88.26  98.85  76.79
Occurrence    97.83  95.56  98.85 99.57  98.18  100   97.83  95.56  98.85 99.13  98.25  100

Fold 2        Left Vesicle        Lymph Node          Right Vesicle       Prostate
Method        A      P      R     A      P      R     A      P      R     A      P      R
Maximum       96.90  93.48  98.85 99.12  100    96.36 96.90  93.48  98.85 100    100    100
Minimum       85.84  100    63.22 98.23  100    92.73 90.97  95.83  65.71 86.73  100    72.97
Occurrence    97.35  95.51  97.70 99.12  100    96.36 97.35  95.51  97.70 100    100    100

Fold 3        Left Vesicle        Lymph Node          Right Vesicle       Prostate
Method        A      P      R     A      P      R     A      P      R     A      P      R
Maximum       97.07  93.55  98.86 99.58  100    98.25 97.07  93.55  98.86 100    100    100
Minimum       87.03  100    64.77 98.33  100    92.98 92.01  100    66.67 89.12  100    79.03
Occurrence    98.33  96.67  98.86 99.58  100    98.25 98.33  96.67  98.86 100    100    100

Fold 4        Left Vesicle        Lymph Node          Right Vesicle       Prostate
Method        A      P      R     A      P      R     A      P      R     A      P      R
Maximum       92.63  84.21  98.77 99.54  97.92  100   92.63  83.51  100   100    100    100
Minimum       84.79  100    59.26 99.08  100    95.75 88.89  100    62.79 82.49  100    66.96
Occurrence    93.55  86.81  97.53 100    100    100   94.01  86.96  98.77 100    100    100

Table 6: A = Accuracy / P = Precision / R = Recall scores for VGGNet on each class and each volume prediction method over each of the four folds.

Figure 6: Receiver Operating Characteristics for AlexNet on each class ((a) left vesicle, (b) adjuvant lymph node, (c) right vesicle, (d) prostate) using the occurrence method.

Figure 7: Receiver Operating Characteristics for AlexNet on each class using the maximum method.

Figure 8: Receiver Operating Characteristics for AlexNet on each class using the minimum method.

Figure 9: Receiver Operating Characteristics for VGGNet on each class using the occurrence method.

Figure 10: Precision Recall Curve for VGGNet on each class using the maximum method.

Figure 11: Receiver Operating Characteristics for VGGNet on each class using the minimum method.

4 Discussion

This section will discuss and evaluate the results and performances from both the DNNs: AlexNet and VGGNet, along with the structure classification methods.

4.1 Trained DNNs

The two DNNs, AlexNet and VGGNet, both perform well looking at the test scores for accuracy and loss, with a difference of a little less than one percentage point in accuracy. VGGNet reaches a high accuracy score a bit faster than AlexNet and keeps a stable performance throughout the training. Depending on the size of the dataset and its dimensions, a deeper DNN could cause overfitting. VGGNet has about eight times more parameters than AlexNet and could potentially overfit the training data. The reason why it does not overfit is regularisation, such as dropout and data augmentation. More parameters can handle more complicated features and make the model more flexible with respect to the desired mapping. The results from VGGNet show that it handles the data better than AlexNet. A more complex problem could have divided the results more in VGGNet's favor.

The accuracy plots had one noticeable aspect: the validation curve (orange) was higher than the training curve (blue), as seen in Figures 4 and 5. This could be explained by the augmentation being applied to the training data but not to the validation data. The augmentation was sparse and created images that looked realistic. Meanwhile, the dataset still became larger, and images that already were hard to classify were seen again from different perspectives, lowering the accuracy score. A second reason why the validation accuracy performed better in the beginning was the dropout layers. Dropout is used to prevent overfitting to the training data. It ignores a certain percentage of the weights from the previous layer while training. In the meantime, the validation data is assessed using all the weights. This forced the weights to specialize on the task at hand, when neighboring weights were canceled out, creating a well-trained DNN. AlexNet dropped 40 % of its weights while VGGNet dropped 30 %. Augmentation and dropout could explain why the validation curve performed better than the training curve in the beginning. After some training, only parts of the weights were needed to classify the data as well as all the weights did.

Studying Table 2 shows that the left and right vesicle were a bit underrepresented in folder 1. The same applied to the adjuvant lymph node. The three remaining folders had a good balance between the four classes. Folder 4 was used as test data in the training of cross-validation fold 1 for both DNNs, folder 3 for the training of cross-validation fold 2, etc. Tables 3 and 4 show the accuracy results for both AlexNet and VGGNet after training. The highest accuracy score was received during cross-validation fold 3 and the lowest during fold 4. Therefore, using folder 2 as test data yielded the highest accuracy score, while folder 1 yielded the lowest. With fewer cases in folder 1, structures that were hard to predict correctly had more impact on the result than in the remaining folders.

Slices where all of the classes were present were few and only represented in three out of four folders, since there were only three patients with this type of CTV structure. Therefore, such individual slices were hard for the DNN to classify.

Training based on 2D images from 3D structures was chosen in consideration of memory usage on the GPU-server and could have impacted the slightly worse results on the vesicles. The vesicles sometimes only appear in two to three slices. The alternative approach could have been to classify two to three slices at the same time, meaning that the input data would have four or six channels instead of two. Feeding the DNN a thicker slice could enhance the identification of the vesicles by giving the DNN more substance to work on when classifying. The DNN can then learn the different transitions from the prostate to adjuvant lymph node or right- and left vesicle, etc. Hopefully, this could achieve better performance for the classification.

4.2 Structure Classification

Method evaluation

Ideally, the model should have assigned a very low probability in parts of the structure where a class is absent and a high probability in others. Specifically selecting the lowest probability for a class, while the true label is positive, speaks against the minimum method and gives it lower credibility. The performance using the minimum method was still high for the prostate and adjuvant lymph node, since they had enough slices with only their class present. Another reason could be that their class was present throughout the structure. The vesicles were affected by choosing the lowest prediction due to the few slices in which they appeared in a structure. The result was a significantly lower performance compared with the other two methods.

Picking the maximum probability for each class in a structure was more intuitive when working with CTV structures that included more than one class. Each class was not consistently present throughout the whole structure. When the model predicted a high probability for one class somewhere in the structure the class should be in the final CTV label. The results showed almost perfect scores for all of the classes and indicated that the maximum method was reliable.

Lastly, we have the occurrence method, which was an attempt to see if the structure classification would perform better when imposing a threshold on how many times the model must identify the class as more likely to be present (prediction above 0.5). This was an attempt to prevent false classifications in individual slices from affecting the final structure classification. This will be evaluated from the ROC curves for each of the classes for each fold and method.

ROC

One noticeable aspect of the structure classification was that the prostate and adjuvant lymph node had the highest scores for all three methods. The ROC curves also showed, with their high AUC scores, that the model predicted either very high or very low for these two classes. The predictions for the right and left vesicles were not as accurate as for the previously mentioned classes. The reason could be that the transition from, for example, the prostate was hard to classify.

Comparing the three structure classification methods, one by one, resulted in the occurrence method performing the best. Therefore, it became the most reliable way to classify a CTV structure. The occurrence method succeeded in improving the performance for the vesicles regarding accuracy and precision. The predicted false positives were reduced, producing an increased precision score. The recall score was at times lower for the occurrence method compared to the maximum method; the trade-off for fewer false positive predictions was more false negatives. However, the accuracy score increased, meaning that the true positives and true negatives increased overall. The occurrence method therefore succeeded in eliminating single positive predictions for the vesicles when they were not present. The ROC curves showed the same performance for the prostate and adjuvant lymph node. The vesicles' AUC score was slightly improved by VGGNet, since it classified them more accurately. The minimum method was the worst at classifying the vesicles; its lower AUC score comes from the fact that the vesicles were not present throughout the structure, resulting in low predictions being added to the final label.

AlexNet vs. VGGNet

Future work

In combination with the mask, the CT image was included as extra help. However, sometimes the only available image was the structure mask, meaning that it would be preferable for future studies to use only the mask in the data. Quick training of each of the DNNs mentioned in this thesis indicates that the results would be just as good, but no extensive investigation was done to confirm this.

The next step in training a DNN would be to include more classes (more structure types) to be able to use the DNN for its intended application: adding the needed information about irradiated structures to MIQA. To create more cases where more than one class is present, an approach to artificially create combined structure masks from patients with separately drawn structures is suggested. The structures for, in this case, all the prostate classes can be combined in MICE when creating and extracting data from the databases.

5 Conclusion


References

[1] Den nationella referensgruppen. En standardiserad svensk nomenklatur för strålbehandling. Tech. rep. Swedish Radiation Therapy Safety, June 2016.

[2] Maria YY Law and Brent Liu. “DICOM-RT and its utilization in radiation therapy”. In: Radiographics 29.3 (2009), pp. 655–667.

[3] Tufve Nyholm et al. “A national approach for automated collection of standardized and population-based radiation therapy data in Sweden”. In: Radiotherapy and Oncology 119.2 (2016), pp. 344–350.

[4] Yoon Kim. “Convolutional neural networks for sentence classification”. In: arXiv preprint arXiv:1408.5882 (2014).

[5] Yonghong Peng, Zhiqing Wu, and Jianmin Jiang. “A novel feature selection approach for biomedical data classification”. In: Journal of Biomedical Informatics 43.1 (2010), pp. 15–23.

[6] Hoo-Chang Shin et al. “Deep convolutional neural networks for computer-aided detection: CNN architectures, dataset characteristics and transfer learning”. In: IEEE transactions on medical imaging 35.5 (2016), pp. 1285–1298.

[7] Geert Litjens et al. “Deep learning as a tool for increased accuracy and efficiency of histopathological diagnosis”. In: Scientific reports 6 (2016), p. 26286.

[8] Karen Simonyan and Andrew Zisserman. “Very deep convolutional networks for large-scale image recognition”. In: arXiv preprint arXiv:1409.1556 (2014).

[9] Elizabeth R DeLong, David M DeLong, and Daniel L Clarke-Pearson. “Comparing the areas under two or more correlated receiver operating characteristic curves: a nonparametric approach.” In: Biometrics 44.3 (1988), pp. 837–845.

[10] NONPI Medical AB. MICE Toolkit user manual. 1st ed. NONPI Medical AB. NORRA GIMONÄSVÄGEN 39 907 38 Umeå, 2017.

[11] Coen Rasch, Roel Steenbakkers, and Marcel van Herk. “Target definition in prostate, head, and neck”. In: Seminars in radiation oncology. Vol. 15. 3. Elsevier. 2005, pp. 136–145.

[12] E Onjukka et al. “Twenty Fraction Prostate Radiotherapy with Intra-prostatic Boost: Results of a Pilot Study”. In: Clinical Oncology 29.1 (2017), pp. 6–14.

[13] Valeria Landoni et al. “Predicting toxicity in radiotherapy for prostate cancer”. In: Physica Medica 32.3 (2016), pp. 521–532.

[14] Berkman Sahiner et al. “Deep learning in medical imaging and radiation therapy”. In: Medical physics 46.1 (2019), e1–e36.

[15] Geert Litjens et al. “A survey on deep learning in medical image analysis”. In: Medical image analysis 42 (2017), pp. 60–88.

[16] Timothy Rozario et al. “Towards automated patient data cleaning using deep learning: A feasibility study on the standardization of organ labeling”. In: arXiv preprint arXiv:1801.00096 (2017).

[17] Muhammad Imran Razzak, Saeeda Naz, and Ahmad Zaib. “Deep learning for medical image processing: Overview, challenges and the future”. In: Classification in BioApps. Springer, 2018, pp. 323–350.

[18] Md Zahangir Alom et al. “The history began from alexnet: A comprehensive survey on deep learning approaches”. In: arXiv preprint arXiv:1803.01164 (2018).

[19] Deepak Pathak et al. “Fully convolutional multi-class multiple instance learning”. In: arXiv preprint arXiv:1412.7144 (2014).

[20] Nima Tajbakhsh et al. “Convolutional neural networks for medical image analysis: Full training or fine tuning?” In: IEEE transactions on medical imaging 35.5 (2016), pp. 1299–1312.

[21] Ashnil Kumar et al. “Subfigure and Multi-Label Classification using a Fine-Tuned Convolutional Neural Network.” In: CLEF (Working Notes). 2016, pp. 318–321.


[23] Piotr Szymański and Tomasz Kajdanowicz. “scikit-multilearn: A Python library for Multi-Label Classification”. In: Journal of Machine Learning Research 20.6 (2019), pp. 1–22. url: http://jmlr.org/papers/v20/17-100.html.

[24] Grigorios Tsoumakas et al. “Mulan: A java library for multi-label learning”. In: Journal of Machine Learning Research 12.Jul (2011), pp. 2411–2414.

[25] Asma Aldrees and Azeddine Chikh. “Comparative evaluation of four multi-label classification algorithms in classifying learning objects”. In: Computer Applications in Engineering Education 24.4 (2016), pp. 651–660.

[26] Gjorgji Madjarov et al. “An extensive experimental comparison of methods for multi-label learning”. In: Pattern recognition 45.9 (2012), pp. 3084–3104.

[27] Jianqing Zhu et al. “Multi-label convolutional neural network based pedestrian attribute classification”. In: Image and Vision Computing 58 (2017), pp. 224–229.

[28] Joseph Redmon and Ali Farhadi. “YOLO9000: better, faster, stronger”. In: Proceedings of the IEEE conference on computer vision and pattern recognition. 2017, pp. 7263–7271.

[29] Jiang Wang et al. “CNN-RNN: A unified framework for multi-label image classification”. In: Proceedings of the IEEE conference on computer vision and pattern recognition. 2016, pp. 2285–2294.

[30] Caner Mercan et al. “Multi-instance multi-label learning for multi-class classification of whole slide breast histopathology images”. In: IEEE transactions on medical imaging 37.1 (2018), pp. 316–325.

[31] Feng Zhu et al. “Learning spatial regularization with image-level supervisions for multi-label image classification”. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. 2017, pp. 5513–5522.

[32] Kush Bhatia et al. “Sparse local embeddings for extreme multi-label classification”. In: Ad-vances in neural information processing systems. 2015, pp. 730–738.

[33] Mohammad S Sorower. “A literature survey on algorithms for multi-label learning”. In: Ore-gon State University, Corvallis 18 (2010), pp. 1–25.


A State of the Art

The appendix provides a background to the field of multi-label classification of medical images and deep learning. It also provides a background on modalities that are included in the project either as a tool or for the aim of the project.

This section aims to provide context to the master thesis project at hand.

A.1 MIQA & INCA

Images used for target and normal tissue delineation, typically CT, MR, and PET, are stored together with structures and dose distributions in the treatment planning system (TPS) and oncological information system (OIS). The systems can be separate databases depending on the vendor solution[3]. To make such data available for clinical review and for associating treatment parameters with clinical outcomes, a database for the information of interest was implemented. Data are transferred to MIQA from the TPS/OIS by DICOM export of images, structures and treatment records. The diagnosis classification according to ICD-10 and the intention to treat are not part of the DICOM-RT standard and are therefore extracted from the OIS database. Figure 12 shows a flow chart of the interaction of the user and the data from the TPS and OIS for MIQA and INCA.

Figure 12: An overview of the flow between the databases for MIQA and INCA[3].

A.1.1 MIQA


A.1.2 INCA

INCA is a national database with different platforms for cancer patients treated in Sweden. A national clinical registry platform for RT data, to which data from MIQA will be uploaded, is under construction[3]. This collection of data allows physicians to compare treatment methods throughout the country. The platform does not support the transfer of image information, meaning that only aggregated data will be stored[3].

An example of data that is uploaded to the RT registry on INCA is a roughly sampled dose-volume histogram (DVH)[3].

A.2 MICE

Medical Interactive Creative Environment (MICE) is a graphical programming application providing easy access to advanced image analysis tools, including RT-specific analyses. The program is mainly written in C#, with some parts in C++[10]. The filters used in MICE come from the Insight Segmentation and Registration Toolkit (ITK), a software library that sets the gold standard for medical image analysis[10]. Elastix, which is also based on ITK, contributes the registration functions in MICE.

Included in the program is a visualization module based on the Visualization Toolkit (VTK), an open-source software for 3D computer graphics. The functionality also allows the user to program their own functions using MATLAB or Python.

This program is used in the clinical environment to create masks, evaluate image data, etc. and can be used for medical research[10].

A.3 Radiation Therapy


Figure 13: Example of a CT image with delineated RT structures for the tumour and organs at risk[12] (Fig. 2).

The accuracy of the delineations is also important for the organs at risk (OAR) since all the dose distributions are optimized with prescribed dose to targets with as low dose as possible to the OAR to minimize potential side effects. Some of the OARs for prostate cancer are the bladder, the urethra, any small bowel loop, and the rectum[12].

The possible side effects of radiotherapy for prostate cancer are rectal bleeding, fecal incontinence, erectile dysfunction, hematological toxicity, and bowel toxicity[12, 13].

A.3.1 Medical Imaging in Radiotherapy

The usage of medical imaging in radiation therapy is extensive. The images provide a foundation for treatment planning and for patient positioning during treatment, to verify the treatment area. The amount of time spent on annotation and treatment planning can reach several hours for the experts[14]. Novel and future applications for deep learning in radiotherapy include assistance in structure delineation, treatment planning, assessing the response to treatment and providing automated adaptation of the treatment[14]. Therefore, this will not replace the medical staff but rather act as a clinical decision support system.


A.3.2 Standardization of labeling

For medical research projects, nearly 80% of the time is consumed by data cleaning[16]. This is due to inconsistent labeling at different, and sometimes within the same, hospital(s) during different periods.

In radiation therapy, labeling is affected by the annotators' preferences, the treatment planning system in use and the variations in labeling over time[16]. This can lead to misinterpretation of critical data, which in turn can lead to treatment errors and radiotherapy maladministration[16].

A.4 Classification

Classification has a wide range of applications: text, like movie descriptions, can be classified into genres, and objects in images can be identified. Even videos and real-time imaging, like surveillance cameras, can be the scope of classification tasks. Depending on how complex and how memory-dependent the data to be classified are, there are different approaches to consider.

A.4.1 Deep Learning

Deep learning is a fast-growing field that attracts investments of millions for the development of new and better algorithms. It is expected to take some time before the full potential of deep learning is realized within medical image processing[17]. The largest and most documented obstacle is the lack of annotated data for medical images. The resources needed to achieve the required amount of annotated data are too expensive, tedious and time-consuming[17]. The sharing of patient data is also a problem due to patient confidentiality and data protection legislation. This makes it a slow process to bring the benefits of deep learning into everyday medicine. With a universal learning approach that can be applied to almost any domain, deep learning is an interesting field of research. This is due to its robustness in learning the characteristics of the problem at hand, meaning that the same deep learning approach can be generalized to different types of datasets, but also due to its scalability, meaning that it can handle large amounts of data[18].

A.4.2 Supervised, semi-supervised and unsupervised

There are different approaches to take on a deep learning problem.

With supervised learning, the input data have a corresponding output for which the DNN can adapt its parameters to be able to give the most suitable output during tests[18].

In semi-supervised learning, the data is only partially labeled. This approach is often known as reinforcement learning which is when the loss function is not specified and is used in unknown learning environments[18]. The differences from supervised learning are that you do not have full access to what you are trying to optimize and that each state (each input) depends on previous actions[18].


Figure 14: A Venn diagram of the three different types of learning[18].

A.4.3 Convolutional Neural Network

Convolutional Neural Networks (CNNs) are a type of deep learning method with a close relation to how the brain interprets and recognizes visual input. Features in the image, like patterns, edges, etc., are extracted and used for the learning of the network. The output from a layer is also the input for the immediately following layer, allowing the network to learn deeper features for each layer. The network is constructed with three different types of layers: the convolutional layer, the max-pooling layer, and the classification layer. The convolutional layer uses learnable kernels whose output is filtered through an activation function (linear or non-linear) to form the output feature maps[18].

After the convolutional layer a max-pooling layer, also known as a down-sampling layer, is applied. This layer does exactly what the name implies and downsamples the data. An example would be to apply a 2x2 down-sampling meaning that you would output half of the input dimensions of the image.

The last block of layers is the classification layer, which most often consists of fully connected layers, where the output feature maps from the previous convolutional layer are flattened into scalar vectors and used as input to the fully connected layers. The number of fully connected layers is optional, but in the last layer an activation function like softmax or sigmoid is used[18]. The softmax function gives a probability over all the different classes evaluated together. The sigmoid function also gives the probability for the classes, but each one separately.
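A small numeric illustration of the difference (the scores are example values only):

```python
import numpy as np

logits = np.array([2.0, -1.0, 0.5, 1.0])          # example class scores

softmax = np.exp(logits) / np.exp(logits).sum()   # probabilities sum to 1
sigmoid = 1.0 / (1.0 + np.exp(-logits))           # independent per class

print(softmax.sum())  # 1.0 -> classes compete (single-label)
print(sigmoid)        # each in (0, 1) -> several classes can be present
```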

The architecture of different networks usually consists of the same build of stacking convolutional layers with down-sampling layers in between, ending with fully connected layers where a softmax or sigmoid function is included. Some examples of such networks are LeNet, AlexNet, VGGNet, ResNet and the fully convolutional network (FCN).


AlexNet consists of five convolutional layers, where a down-sampling layer and a normalization layer are placed in between the first three layers (two of each). The classification layer consists of two fully connected layers that end with a softmax function[18]. This network won the ImageNet Large Scale Visual Recognition Challenge (ILSVRC) in 2012 and was a breakthrough in the classification task of visual recognition, leading to an increased interest in deep learning[18]. VGGNet's greatest contribution is to show that the depth of the network is critical to accomplish better recognition or classification accuracy[18]. The network varies in depth but follows the pattern of two convolutional layers followed by a down-sampling layer until the classification layers are reached, consisting of three fully connected layers with a softmax activation at the end.

The ResNet is designed to manage ultra-deep neural networks without suffering from the vanishing gradient. The number of layers varies from 34 up to 1202 but the most popular one is the ResNet50 with 49 convolutional layers and one fully-connected layer.

The fully convolutional network handles spatial prediction problems using only convolutional layers. The network does not use fully connected layers at the end like the previously mentioned architectures. This type of network is usually used in segmentation[19]. Many networks like those above are pre-trained and can be used through transferred learning via the Keras API.

A.4.4 Transferred Learning

To train a CNN from scratch requires a large set of data to get the wanted results. This is, as mentioned above, hard to do with medical data since the amount of annotated data is small. This leads to a new approach via transferred learning where you use a pre-trained network as initialization for the weights and then fine-tune the parameters so the network applies to the current problem[20]. The fine-tuning consists of swapping the fully connected layer to something that better suits the problem at hand, like the number of classes and output activation. To fine-tune the layers in the CNN it is advised to start from the last layer and then proceed layer by layer[20].
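A sketch of that recipe in Keras, using the pre-trained VGG16 as an example backbone (the choice of backbone, head sizes and number of unfrozen layers are assumptions):

```python
from tensorflow.keras import layers, models
from tensorflow.keras.applications import VGG16

base = VGG16(weights="imagenet", include_top=False, input_shape=(224, 224, 3))
for layer in base.layers[:-4]:       # freeze all but the last few layers
    layer.trainable = False

model = models.Sequential([
    base,
    layers.Flatten(),
    layers.Dense(256, activation="relu"),   # new task-specific head
    layers.Dense(4, activation="sigmoid"),  # multi-label output
])
model.compile(optimizer="adam", loss="binary_crossentropy")
```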

Transferred learning using pre-trained weights from AlexNet, with fine-tuning such as back-propagation, dropout and augmentation, was used to adapt the network to the 30-class classification task on the subfigure classification training dataset described in the ImageCLEF 2016 overview papers[21]. The network performed well and can be used for several different problems.

A.4.5 Multi-Label Classification

Multi-label classification represents a setup of labels that each represent a different classification task, where these tasks are somewhat related. This means that, for instance, in this project an image can belong to several labels at the same time[22]. The two general approaches are method adaptation and problem transformation: either you adapt the algorithm to fit a multi-label problem, or you turn the problem into one or several single-label classification problems[23]. This can be executed in several different ways. One of these is the MULAN library, which implements several of the modalities explained below for multi-label classification and has its own API in Java[24]. Since more than one delineated volume can appear in the CT images for this project, a multi-label classification approach is called for. Depending on the number of labels, their correlations and how much computational power is available, different methods can be required/preferred.

Binary methods like the Binary Relevance (BR) approach look at the problem from Q binary classification angles. The Q binary classification angles are transformed into single-label datasets[25]. The different binary classifications are then added together and presented[22]. However, this method wrongly assumes independence, which is one of the reasons the binary relevance method has been criticized[22]. Then we have the pairwise methods, ranking via pairwise comparison (RPC) and Calibrated Label Ranking (CLR), which pair the labels.

RPC covers all of the combinations of label pairs[25]. At least one of the pairwise labels in the dataset is annotated, but never both[25]. For the evaluation, the labels are ranked according to the sum of votes received after the prediction between the pairs of labels[25].

CLR uses the RPC method but creates a label V which divides the labels into relevant and irrelevant ones. This is also taken into account for the predictions of votes[25]. This method is better in terms of prediction accuracy in comparison to the non-ensemble methods in multi-label learning[26]. But it has its flaws: the number of classifiers to evaluate grows with the square of the number of labels, making a single prediction very time-consuming for problems with a large number of labels.

Label-combination methods are another approach for multi-label classification that considers correlations between labels. Label Power-set (LP) considers each distinct label-set in the multi-label dataset as one class, which makes it a single-label classification problem. The most probable class label will be returned, which is the corresponding set of labels[25]. This method takes the dependence between labels into account, but with a large number of distinct label-sets the complexity of the method can become critical.

Pruned Set (PS) is similar to LP with the distinction that it prunes away the label-sets that do not occur as often as the user input threshold. The removed label-sets are replaced with disjoint label-sets that occur more often than the threshold[25].

Classifier Chains (CC) embraces the BR method with Q binary classifiers. To improve the method, it considers the label correlation task. All of the classifiers are linked along a chain, where each classifier is binary, meaning that it will classify by 0/1 depending on the previous link[25].

The ensemble methods build on the three different approaches of label-combination methods, with the difference that they break the label-sets into m models, making k label-sets for each model[25].

Random k-Label Sets (RAkEL) ensembles LP classifiers and breaks the large label-set, as explained previously, into k label-sets for the m models. This leads to higher accuracy in the result. Ensembles of Pruned Sets (EPS) ensemble the PS method, leading to less over-fitting, and allow for new label-sets at classification time[25].

Ensembles of Classifier Chains (ECC) use CC as the base classifier. Each model has a different chain order and a random version of the dataset. A threshold value singles out the most relevant labels, which form the predicted multi-label set[25].

Probabilistic Methods output a probability distribution over the labels instead of only the most probable labels. One way to do this is the Multi-Label Naive Bayes (ML-NB), which decomposes the multi-label problem into Q binary Naive Bayes classifiers. To select the probable labels it uses the maximum a posteriori (MAP) principle[25].

Another method is the Multi-Label k Nearest Neighbors (ML-kNN), which extends the k-nearest neighbors (kNN) algorithm with a Bayesian component. Like ML-NB it uses the MAP principle to select the relevant labels, but it can also rank the labels[25]. A simplified sketch is given below.
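The following NumPy sketch illustrates the neighbor-based prediction using a plain majority vote per label; ML-kNN proper replaces the vote with a Bayesian MAP estimate, and the toy data here is an assumption.

import numpy as np

def multilabel_knn_predict(X_train, Y_train, x, k=3):
    # Simplified multi-label kNN: a label is predicted when more than
    # half of the k nearest neighbours carry it.
    dists = np.linalg.norm(X_train - x, axis=1)
    nearest = np.argsort(dists)[:k]        # indices of the k nearest neighbours
    counts = Y_train[nearest].sum(axis=0)  # per-label neighbour counts
    return (counts > k / 2).astype(int)    # majority vote per label

# Toy data: 4 training samples, 2 features, 3 labels.
X_train = np.array([[0.0, 0.0], [0.1, 0.2], [1.0, 1.0], [0.9, 1.1]])
Y_train = np.array([[1, 0, 0], [1, 0, 1], [0, 1, 0], [0, 1, 1]])
print(multilabel_knn_predict(X_train, Y_train, np.array([0.05, 0.1])))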

Multi-Label Decision Tree (ML-DT) uses the well-known C4.5 algorithm to handle multi-label data, with the difference that the entropy calculation is modified for multi-label problems. Each node tests the one attribute that is most effective in dividing the samples into subsets, multiple labels can be found in each leaf, and the method is known for its computational efficiency[25].

Back-Propagation Multi-Label Learning (BPMLL) uses a neural network algorithm derived from back-propagation, which changes the weights in the network depending on the output in relation to the input and the desired output. The new error function accounts for there being several labels instead of one[25]; a sketch of such a pairwise error is given below.
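A minimal NumPy sketch of a BPMLL-style pairwise ranking error follows; the exact network architecture and optimizer of BPMLL are omitted, and the toy numbers are assumptions.

import numpy as np

def bpmll_error(outputs, y):
    # Pairwise ranking error of the form used in BPMLL: for every pair of a
    # relevant label k and an irrelevant label l, penalise exp(-(c_k - c_l)),
    # normalised by the number of such pairs. Assumes the example has at
    # least one relevant and one irrelevant label.
    relevant = outputs[y == 1]
    irrelevant = outputs[y == 0]
    diffs = relevant[:, None] - irrelevant[None, :]  # c_k - c_l for all pairs
    return np.exp(-diffs).mean()

# Example with 4 labels, two of them relevant; ranking the relevant
# labels higher lowers the error.
print(bpmll_error(np.array([0.9, 0.8, 0.1, 0.2]), np.array([1, 1, 0, 0])))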

A.4.6 Multi-label Front-end networks


A network called YOLO9000 can classify multiple labels in real time, which means that it needs to detect and classify simultaneously. The network can do this jointly because its dataset is organized in a WordTree, meaning that the total dataset is divided into multiple datasets that are connected in a suitable fashion[28]. This type of network is being, or will be, integrated into, for example, self-driving cars.

Wang, Jiang, et al.[29] proposed a CNN-RNN network to exploit the connection between semantic redundancy and co-occurrence dependency, both of which are important for the effectiveness of a multi-label classifier[29]. The neurons in the recurrent network improve the co-occurrence modeling and can predict smaller objects. The network performed well and had the highest accuracy for most of the labels in the MS-COCO dataset[29]. It offers the advantage of combining a joint image/label embedding with a label co-occurrence dependency model by using the CNN and the RNN together[29].

Mercan, Caner, et al.[30] used both slide-level classification and region-of-interest-level classification to predict lesions in breast tissue[30]. The whole-slide images were only weakly labeled, but the model still achieved 78% accuracy over five classes[30]. When the regions of interest were added to the training, an increase of 3% was observed[30]. The use of whole-slide images and weakly labeled slides makes the result more usable in a clinical environment, since it reflects clinical usage better.

Zhu, Feng, et al.[31] suggest a ResNet101 with two subnetworks, where one learns attention maps of the image and the other applies spatial regularization of the labels on the attention maps[31]. This network can visualize which pixels and regions result in the predicted label, and it outperforms state-of-the-art networks in multi-label classification[31].

For challenges with extremely many labels (XML), i.e. hundreds of thousands up to millions of labels, multi-label classification can become troublesome[32]. Bhatia, Kush, et al.[32] created Sparse Local Embeddings for Extreme Classification (SLEEC) for this kind of problem, building on state-of-the-art embedding-based approaches[32]. Embedding aims to reduce the effective number of labels, and the natural way of performing it is to project the label vectors onto a low-dimensional linear subspace in a global fashion[32]. Embedding approaches differ in how they compress and decompress the data, but they have in common that they are easy to implement, have a theoretical foundation, and can handle label correlations[32]. Their downside is that training and prediction are slow.

SLEEC combines multiple embedding methods: instead of globally projecting the labels onto one low-dimensional subspace, it preserves pairwise relationships only between labels and their near neighbors. In the prediction phase, SLEEC uses a kNN classifier, exploiting the fact that the nearest-neighbor relationships have been preserved during training[32]. To speed up the kNN prediction, the training data is divided into K clusters with separate embeddings. SLEEC has outperformed both the leading embedding approaches and the leading tree method for XML.
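As a toy illustration of the global embedding idea that SLEEC improves upon (this is not SLEEC itself, which learns local, cluster-wise embeddings), the label matrix can be projected onto a low-dimensional subspace with a truncated SVD and then reconstructed; the dimensions and data below are assumptions.

import numpy as np

rng = np.random.default_rng(0)
Y = (rng.random((100, 50)) < 0.05).astype(float)  # 100 examples, 50 sparse labels

U, s, Vt = np.linalg.svd(Y, full_matrices=False)
d = 10                           # embedding dimension << number of labels
Z = U[:, :d] * s[:d]             # d-dimensional label embeddings (compression)
Y_hat = Z @ Vt[:d]               # decompression back to label space
print(np.abs(Y - Y_hat).mean())  # reconstruction error of the embedding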

A.5 Evaluation

Accuracy is the most common performance measure for classification tasks in general, and specifically for single-label multi-class learning. For multi-label learning, however, the evaluation is less straightforward, since a prediction can be wrong, partially wrong, or right. Depending on the problem there are at least three types of approaches: evaluating partitions, evaluating rankings, and evaluating label hierarchies. Evaluation of partitions compares the predicted label set with the true label set, relevance ranking evaluates how well the labels are ranked in order of relevance, and label-hierarchy evaluation measures how successful the algorithm is in taking the hierarchical structure between labels into account[33].

A.5.1 Evaluation Partitions

Evaluation of partitions can be example-based or label-based. In example-based evaluation, the measure is computed for each example and then averaged over all of the examples. Label-based evaluation does not address the possible correlations between labels, but evaluates each label separately and then averages over all of the labels.

Example-based Evaluation takes different forms. The simplest is the Exact Match Ratio, which considers partially wrong label sets as completely wrong. This is a trivial and harsh way of evaluating multi-label classification. To take partially correct predictions into account, the accuracy, precision, recall, F1-measure, and Hamming loss have to be considered.

Accuracy is the number of correct labels over the total number of predicted and actual labels for an example; the overall accuracy is the average over all of the examples[33].

Precision is the number of true positive labels over the number of predicted positive labels; the overall precision is, as with the accuracy, the average over all examples.

Recall is the number of correctly predicted positive labels over all of the actual positive labels, averaged over all examples.

The F1-measure is the harmonic mean of precision and recall.

The Hamming loss is the average number of times a label is predicted wrongly. It considers both the prediction error, where a label is incorrectly predicted, and the missing error, where a relevant label is not predicted. These two errors are normalized over the total number of classes and the total number of examples[33].
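For reference, one standard formulation of these example-based measures from the multi-label literature, with $Y_i$ the true and $Z_i$ the predicted label set for example $i$ of $N$, and $Q$ the number of labels, is:

\begin{aligned}
\text{Accuracy}  &= \frac{1}{N}\sum_{i=1}^{N}\frac{|Y_i \cap Z_i|}{|Y_i \cup Z_i|}, &
\text{Precision} &= \frac{1}{N}\sum_{i=1}^{N}\frac{|Y_i \cap Z_i|}{|Z_i|}, \\
\text{Recall}    &= \frac{1}{N}\sum_{i=1}^{N}\frac{|Y_i \cap Z_i|}{|Y_i|}, &
F_1              &= \frac{2\,\text{Precision}\cdot\text{Recall}}{\text{Precision}+\text{Recall}}, \\
\text{Hamming loss} &= \frac{1}{NQ}\sum_{i=1}^{N}\left|Y_i \,\triangle\, Z_i\right|,
\end{aligned}

where $\triangle$ denotes the symmetric difference between the two label sets.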

High scores on accuracy, precision, recall, and F1-measure mean that the algorithm is performing well, while the Hamming loss should be as low as possible.

Label-based Evaluation uses the micro and macro approaches to precision, recall, and F1-measure. Micro-precision, for example, gives the precision over all example/label pairs, while macro-precision gives the average of the per-label precisions[26]. These measures can be used when the label predictions are binary (0/1).
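These measures are available off the shelf; a short sketch with scikit-learn's metrics module follows, where the toy indicator matrices are assumptions.

import numpy as np
from sklearn.metrics import (accuracy_score, f1_score, hamming_loss,
                             precision_score, recall_score)

# Toy predictions for 4 examples and 3 labels (binary indicator matrices).
Y_true = np.array([[1, 0, 1], [0, 1, 0], [1, 1, 0], [0, 0, 1]])
Y_pred = np.array([[1, 0, 0], [0, 1, 0], [1, 1, 1], [0, 0, 1]])

print(accuracy_score(Y_true, Y_pred))                    # exact match ratio
print(hamming_loss(Y_true, Y_pred))                      # Hamming loss
print(precision_score(Y_true, Y_pred, average='micro'))  # over all example/label pairs
print(precision_score(Y_true, Y_pred, average='macro'))  # average of per-label precision
print(recall_score(Y_true, Y_pred, average='samples'))   # example-based recall
print(f1_score(Y_true, Y_pred, average='samples'))       # example-based F1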

A.5.2 Ranking-Based Evaluation Measures


TRITA CBH-GRU-2019:107

References
