
Denoising Monte Carlo Dose Calculations Using a Deep Neural Network

HANNES FORNANDER

KTH ROYAL INSTITUTE OF TECHNOLOGY

SCHOOL OF ELECTRICAL ENGINEERING AND COMPUTER SCIENCE

Denoising Monte Carlo Dose Calculations Using a Deep Neural Network

HANNES FORNANDER

Master in Robotics and Autonomous Systems
Date: July 10, 2019

Supervisors: Josephine Sullivan and Peter Kimstrand
Examiner: Danica Kragic Jensfelt

School of Electrical Engineering and Computer Science
Host company: Elekta

Swedish title: Brusreducering av Monte Carlo-dosberäkningar med ett djupt neuralt nätverk


Abstract

This thesis explores the possibility of using a deep neural network (DNN) to denoise Monte Carlo dose calculations for external beam radiotherapy. The dose distributions considered here are for inhomogeneous materials such as those of the human body. The purpose of the project is to explore whether a DNN is able to preserve important features of the dose distributions, and to evaluate whether there is a potential performance gain in using a DNN compared to the traditional approach of running a full Monte Carlo simulation. The network architecture considered in this thesis is a 3D version of the U-net. The results of using the 3D U-net for denoising suggest that it preserves the features of the dose distributions rather well while having a low propagation time. This indicates that the proposed approach could be a feasible alternative for quickly predicting the final dose distribution.


Sammanfattning

This degree project explores the possibility of using a deep neural network (DNN) to denoise Monte Carlo dose calculations for external beam radiotherapy. The dose distributions considered here are for inhomogeneous materials such as the human body. The purpose of the project is to determine whether a DNN can preserve the important features of the dose distributions and whether a DNN can increase performance compared to computing the dose with a complete Monte Carlo simulation. The network architecture used in this project is a 3D version of the U-net. The results of using a 3D U-net for denoising indicate that the method preserves the features of the dose distributions relatively well while also having a low propagation time. This suggests that the proposed method could be a feasible alternative for quickly computing the final dose.


Acknowledgements

I would like to begin by expressing my gratitude to my supervisor at Elekta, Dr. Peter Kimstrand, for his valuable feedback and support during the course of the project. I would also like to thank my colleagues at Elekta for all the important discussions. Finally, I would like to thank my supervisor at KTH, Prof. Josephine Sullivan, for sharing her knowledge and ideas for the project.


Contents

1 Introduction
  1.1 Societal Impact
  1.2 Purpose
  1.3 Research Questions
  1.4 Outline

2 Background
  2.1 Related Work
  2.2 Dose Calculation
    2.2.1 Dose Distribution
    2.2.2 Monte Carlo Dose Calculation
    2.2.3 GPUMCD
  2.3 Deep Learning
    2.3.1 Deep Neural Networks
    2.3.2 Learning From Data
    2.3.3 Autoencoders
    2.3.4 Denoising Autoencoders
    2.3.5 Convolutional Neural Networks
    2.3.6 Activation Function
    2.3.7 Downsampling
    2.3.8 Upsampling
    2.3.9 Loss Function
    2.3.10 Optimization
    2.3.11 Skip Connections
  2.4 Evaluation
    2.4.1 Line Dose
    2.4.2 Gamma Index
    2.4.3 Median Filter

3 Methods
  3.1 Datasets
    3.1.1 Dataset I: Slab Phantoms
    3.1.2 Dataset II: Patient Geometry
  3.2 Network Architecture
  3.3 Model Setup
    3.3.1 Model Parameters
    3.3.2 Training the Model
    3.3.3 Software and Hardware
  3.4 Evaluation

4 Results
  4.1 Dataset I
    4.1.1 Training and Validation Loss
    4.1.2 Gamma Index
    4.1.3 Line Dose Comparison
  4.2 Dataset II
    4.2.1 Training and Validation Loss
    4.2.2 Gamma Index
    4.2.3 Line Dose Comparison
  4.3 Computation Time
  4.4 Noise in the Reference Data

5 Discussion
  5.1 Denoising Ability
  5.2 Comparison to a Full Monte Carlo Simulation
  5.3 Learning With Noise

6 Conclusions
  6.1 Further Work

Bibliography

A Network Specification


1 Introduction

The aim of radiotherapy is to control or kill tumors with ionizing radiation. External beam radiotherapy is the most common form of radiotherapy, in which the patient is treated with a radiation source external to the patient; it is thus a non-invasive method. The beam of radiation can be generated by, e.g., a linear accelerator. This is in contrast to brachytherapy, where the radiation source is inserted into the tumor (or as close to it as possible). Before performing radiotherapy, clinicians plan the treatment by simulating the beam setup for the specific case. This is done through treatment planning software, in which the beam setup is optimized to maximize the radiation delivered to the tumor while minimizing the harm done to surrounding tissue and sensitive organs.

As treatment methods have evolved over the years to today's state of the art of intensity-modulated beams, the demand for both fast and accurate dose calculations has also increased. Because these criteria are in conflict with each other, the optimization problems of today's treatment planning have to sacrifice accuracy for speed to get a solution in a reasonable time frame.

The next generation of external beam radiotherapy is likely to be real-time MR-guided tracking, where the linear accelerator is combined with a magnetic resonance imaging device (MR-linac), making it possible to treat and follow the tumor during ongoing treatment. However, this relies on the possibility of calculating the dose in real time. Furthermore, compared to conventional linacs the MR-linac introduces an additional problem, namely asymmetries in the dose distribution as a result of the magnetic field introduced by the MR. The current best method to deal with this is Monte Carlo (MC) based dose calculation. This method, however, is very time consuming and has inherent noise in the resulting distribution. A possible solution to speed up the dose calculation would then be to add a denoising post-processing step to shorter MC dose calculations (thus resulting in dose distributions with a higher level of noise), as visualized by the pipeline in fig. 1.1. Trying to denoise these dose distributions using analytical approaches results in important features of the distributions being lost in the process due to, e.g., smudging effects. An alternative is to approach the denoising problem using deep learning (DL), and more specifically a deep neural network (DNN), to handle the denoising part of the calculation, which is what is explored in this project.

Figure 1.1: A visualization of the dose calculation pipeline. The project explores the possibility of denoising shorter MC dose calculations with higher noise (the proposed approach marked in green).

1.1 Societal Impact

Using a DNN to post-process dose distributions could be a future alternative to today's state-of-the-art approach of solely running an MC dose calculation, provided that the DNN prediction can achieve comparable accuracy while also increasing performance in terms of computation speed. This exploration is therefore important for the future of external beam radiotherapy, where the end goal is for the dose calculations to run in real time to enable real-time MR-guided tracking for treating tumors. There are ethical aspects to this as well. It is important for the DNN to accurately and reliably preserve the features of the dose distributions, as the end goal would be to use this in a real radiotherapy system; inaccurate dose calculations could thus cause potential harm to the patient. Furthermore, as DNNs are data-dependent, it is important that the dataset used covers all important aspects of the treatment setting to ensure a robust prediction.


1.2 Purpose

Denoising dose distributions using a DNN has proven effective in a preceding degree project [1] for radiosurgery treatment plans using the Leksell Gamma Knife. This project focuses on denoising dose distributions for external beam radiotherapy using the MR-linac, which poses a different problem setting. As external beam radiotherapy can be performed anywhere in the body, as opposed to the Gamma Knife, which only operates on the head (which for dose calculation purposes can be considered a sphere of water), one has to consider the inhomogeneities of the materials of the body. The dose distribution will have features at the interfaces between these materials that are important to preserve. Furthermore, the magnetic field of the MR-linac also introduces asymmetries in the dose distribution. External beam radiotherapy uses a few (typically two or more) beam directions, as opposed to the Gamma Knife, which has 192 beams meeting at the isocenter. This results in a dose distribution with much more articulated edges in comparison to the Gamma Knife.

The purpose of this project is thus to investigate whether denoising using a DNN is able to preserve the true features of the dose distributions. As it is desirable for the dose calculation to run in real time, the project will also investigate whether there is any speed gain in using a DNN as a denoising post-processing step to the MC dose calculation, compared to running a "full" MC dose calculation. The overall accuracy and versatility of DNN denoising will also be investigated.

1.3 Research Questions

The thesis aims to answer the following two research questions:

• Can the features of the true dose distribution be preserved by the DNN dose prediction?

• Can a DNN implementation result in a performance gain compared to traditional dose calculation?

1.4 Outline

The report is organized as follows: Chapter 2 describes the background to the problem, related work and the theory used in the project. Chapter 3 describes the methods used for investigating the stated problem, including the generation of data and the network architecture, as well as how the results are evaluated. In chapter 4, the results of using the network for denoising are presented. These are further discussed in chapter 5. Lastly, chapter 6 concludes the thesis and suggests further work to pursue in the future.


2 Background

The background for the project is described as follows. Related work considers various DNNs for denoising. MC dose calculation is explained to give the background to the data generation. The theory related to deep learning and neural networks that is relevant for the specific network is presented. Background is also given for the methods used for evaluation.

2.1 Related Work

Denoising is a field that has been extensively explored for various noise types. As denoising is not limited to the medical field, there are many methods to consider, and most of the work has focused on images in general. Here, neural networks have seen much use and have the advantage of being able to handle any type of noise.

Block matching and 3D filtering (BM3D) [2] is one of the methods considered state of the art for image denoising that is not a deep learning approach. However, it has been shown that a plain neural network (a multi-layer perceptron) can compete with BM3D given a large enough dataset and network [3].

One of the key networks for denoising, originally introduced for unsupervised feature learning, is the denoising autoencoder (DAE) [4], which is trained to repair corrupted inputs to produce a clean output. The idea has been further explored with stacked DAEs (SDAEs) [5], where the benefit of stacking several DAEs into a deep architecture was shown for classification problems in terms of a significantly lower classification error. Another stacked version of the DAE is the stacked sparse denoising autoencoder (SSDA) [6], which provided a solution to the previously untackled problem of blind image inpainting. An adaptive multi-column variant called AMC-SSDA was shown to be more robust to different types of noise [7]. An SSDA has also been used for low-dose computed tomography (CT) image restoration (where the radiation dose has been lowered to spare the patient, which results in a noisy CT image), outperforming several state-of-the-art methods according to experimental results [8].

Convolutional neural networks (CNNs) have been proposed for natural image denoising [9], with results comparable to state-of-the-art methods. CNNs have also been combined with a stacked autoencoder (CAES) [10], further exploring the concept of stacking DAEs while also utilizing the 2D information with convolutional layers, which showed superior results on digit and object recognition benchmarks. Such a network, called CNNDAE, has been implemented for medical image denoising with good denoising performance even with a small training dataset [11].

A convolutional residual encoder-decoder neural network (RED-Net) has been used for image restoration [12], achieving better experimental performance than the state of the art for image denoising and super-resolution. Here, skip connections between several mirrored convolutional-deconvolutional layers were used to preserve details that would otherwise be lost in the convolutions, as well as to combat the vanishing gradient problem that can occur when training deep architectures. This was inspired by highway networks [13] and residual networks [14]. RED-Net also inspired the RED-CNN architecture [15] used for low-dose CT image denoising, which showed performance competitive with state-of-the-art methods and demonstrated the potential of neural networks to suppress noise and preserve structural detail while having a high computational speed. Another CNN used residual learning along with batch normalization for image denoising to boost the denoising performance on several denoising tasks with a single network [16].

The U-net architecture [17] was originally used for biomedical image segmentation. Here, skip connections were also utilized between the layers on the same level in the chain of down- and upsampling. However, it has also been used for denoising [18], showing favorable results compared to the state of the art while having a low propagation time. Furthermore, 3D versions of the U-net have also been implemented for segmentation [19, 20].

FlowNet [21] achieved good results for real-time optical flow estimation. A recurrent DAE has been used for image reconstruction [22], with similarities to both U-net and FlowNet. The proposed architecture has a low propagation time and shows significantly better results than methods that can run at a comparable speed. This architecture uses the same ideas of skip connections and an encoder-decoder structure, combined with recurrent connections in the encoder part of the network.

Another type of network is the generative adversarial network (GAN) [23]. Here, a generator and a discriminator network are trained simultaneously, where the generator is trained to generate outputs that the discriminator cannot distinguish from the reference data (thus resulting in a minimax problem). This kind of network has recently been used for low-dose CT [24, 25] with a perceptual loss, which showed promising results for noise reduction while preserving structural detail.

2.2 Dose Calculation

Dose calculation can be performed using different types of algorithms, with varying accuracy and calculation speed. MC methods are generally regarded as the most accurate. For external beam radiotherapy, MC dose calculation is performed by simulating the particle transport for the specific treatment setting and patient geometry, and the simulation can be run until a desired noise level is reached.

2.2.1 Dose Distribution

The dose distribution represents how the radiation dose absorbed by the tissue of the patient is distributed. The absorbed dose is measured in gray (Gy), where 1 Gy corresponds to 1 J/kg. For external beam radiotherapy, the radiation is delivered by a linear accelerator (linac) in the form of a photon beam in the MeV range. The linac is built into a gantry that can rotate around the couch on which the patient is lying (see fig. 2.1). The gantry and couch angles can be utilized to direct the beam from different directions. The linac generates the photon beam by producing electrons with an electron gun; the electrons are accelerated by a waveguide and produce photons when colliding with an X-ray target (see the simplified linac in fig. 2.2). Upon leaving the linac, the photon beam is shaped by the collimator. Particles hitting the collimator are attenuated in the collimator body and thus removed from the beam, as visualized in fig. 2.3. Furthermore, to be able to flexibly adapt the collimator to different shapes, the collimator opening can be customized with adjustable leaves, for which one previously had to use custom-made cut-outs. As the unused leaves cannot be closed completely, the collimator jaws are used to prevent the radiation from leaking through. A visualization of adapting the collimator opening to a target can be seen in fig. 2.4.

Figure 2.1: A simplified view of the treatment setting. The gantry and couch can be rotated as shown by their corresponding angle.

Figure 2.2: A simplified view of how the radiation beam is generated. The waveguide accelerates electrons from the electron gun, which produce photons when colliding with the X-ray target.

Figure 2.3: The collimator's effect on the direction of the radiation beam. Here, the arrows correspond to the direction of particles.

Figure 2.4: Visualization of how the collimator leaves and jaws can be used to shape the radiation beam to approximate the shape of the target, as well as how the jaws are used to prevent leakage.

To calculate the dose distribution, the patient geometry is modelled as a grid of voxels (a 3D matrix), where each voxel is defined to be of a certain size (e.g. 2 mm in each dimension) and contains a material definition and a density, e.g. derived from CT images. The dose distribution will then be a grid of the same size with the corresponding dose for each voxel.

2.2.2 Monte Carlo Dose Calculation

Monte Carlo (MC) methods can simulate processes by the use of random sampling and can thus solve problems that would not be possible using deterministic methods. Dose calculation using MC methods is done by simulating the transport of particles in the defined volume. Even though very useful, MC methods tend to be very slow and have inherent noise in the results.


Monte Carlo and noise

Looking at the simple example of solving an integral using Monte Carlo, consider an integral I expressed as

I = \int_a^b F(x)\, dx, \quad (2.1)

which can be written as an expectation value by introducing an arbitrary probability density function (PDF) p(x) for which p(x) > 0 for x ∈ [a, b] and p(x) = 0 otherwise. Furthermore, f(x) = F(x)/p(x). I can then be written as

I = \int f(x)\, p(x)\, dx. \quad (2.2)

It is then possible to evaluate the integral using the Monte Carlo method. This is done by sampling N random points x_i from the distribution p(x) and approximating the integral as

I \approx \bar{f} = \frac{1}{N} \sum_{i=1}^{N} f(x_i). \quad (2.3)

Furthermore, the law of large numbers says that

\lim_{N \to \infty} \bar{f} = I, \quad (2.4)

and thus, by selecting a large N, the integral I can be approximated with arbitrarily low uncertainty. Looking at the variance of the approximation, it is found as

\mathrm{Var}(\bar{f}) = \frac{1}{N^2} \sum_{i=1}^{N} \mathrm{Var}(f(x)) = \frac{\mathrm{Var}(f(x))}{N}, \quad (2.5)

and thus the standard deviation is found as

\sigma_{\bar{f}} = \sqrt{\mathrm{Var}(\bar{f})} = \sqrt{\frac{\mathrm{Var}(f(x))}{N}}, \quad (2.6)

which implies that the uncertainty decreases as 1/\sqrt{N}. In practice, this is one of the main drawbacks of using Monte Carlo methods, as to lower the uncertainty by a factor of 2, N has to be increased by a factor of 4. Thus, depending on what level of uncertainty is acceptable for the specific use case, the computation time can become very long.
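To make the 1/\sqrt{N} behavior concrete, the following is a minimal Python sketch (an illustration added here, not part of the thesis code; the integrand and sample sizes are arbitrary) that estimates a one-dimensional integral with plain Monte Carlo sampling and shows how the error shrinks as N grows.

    import numpy as np

    def mc_integrate(f, a, b, n, rng):
        """Plain Monte Carlo estimate of the integral of f over [a, b].

        Samples x_i uniformly on [a, b], i.e. p(x) = 1/(b - a), so the
        estimator is (b - a) * mean(f(x_i)) as in eq. (2.3).
        """
        x = rng.uniform(a, b, size=n)
        return (b - a) * np.mean(f(x))

    rng = np.random.default_rng(0)
    f = np.sin                      # integrand (arbitrary example)
    exact = 1.0 - np.cos(1.0)       # exact value of the integral of sin on [0, 1]

    for n in [10**3, 10**4, 10**5, 10**6]:
        # Average the absolute error over a few repetitions to smooth it out.
        errors = [abs(mc_integrate(f, 0.0, 1.0, n, rng) - exact) for _ in range(20)]
        print(f"N = {n:>7d}   mean |error| = {np.mean(errors):.2e}")
    # The mean error drops roughly by a factor of sqrt(10) per tenfold increase in N.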


Photon transport

The basic scheme for particle transport considered here is shown in fig. 2.5. Consider a photon with energy E, initial position \vec{x}_0 and direction \vec{\mu}_0. Furthermore, an energy threshold E_cut is set, for which a particle with E < E_cut is considered absorbed. Otherwise, if E > E_cut, the particle is transported according to the step size s as

\vec{x} = \vec{x}_0 + \vec{\mu}\, s. \quad (2.7)

For a homogeneous phantom, s is sampled as

s = -\frac{1}{\mu(E)} \log(1 - r), \quad (2.8)

where r is a random number uniformly distributed between 0 and 1, \mu(E) is the total interaction coefficient (not to be confused with the direction \vec{\mu}) and s is the distance to the interaction point. For a heterogeneous phantom, however, the geometry has to be considered. One way to circumvent this is to use the Woodcock ray tracing method, which introduces a fictitious attenuation coefficient and makes it possible to consider the volume as homogeneously attenuating [26]. The fictitious attenuation coefficient for each voxel \vec{x} can be expressed as

\mu_{\mathrm{fict}}(\vec{x}) = \mu_{\mathrm{max}} - \mu(\vec{x}), \quad (2.9)

where \mu_{\mathrm{max}} is the maximum attenuation coefficient over all voxels in the volume. \mu_{\mathrm{max}} is then used in place of \mu(E) when sampling the step size s. For photon transport in the considered energy range, there are three types of interactions that have to be considered, namely

• Pair production

• Compton interaction (incoherent scattering)

• Photoelectric interaction

Furthermore, \mu(E) is calculated as

\mu(E) = \frac{N_A}{A}\, \rho\, (\sigma_{pp} + \sigma_{inc} + \sigma_{ph}), \quad (2.10)

where \rho is the mass density, A is the atomic weight and N_A is Avogadro's number. The \sigma's correspond to the total cross section values for the respective interactions listed above. Consider the interaction coefficient \mu_i for interaction i, calculated as

\mu_i(E) = \frac{N_A}{A}\, \rho\, \sigma_i. \quad (2.11)

The probability P_i of an interaction i taking place is

P_i = \frac{\mu_i(E)}{\mu_{\mathrm{max}}}. \quad (2.12)

After transporting the photon to the point of interaction, if the photon has not left the volume of interest, the interaction is sampled and the photon is scattered. After scattering, energies and directions of the resultant particles are sampled.
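To make the sampling scheme concrete, the following is a small illustrative sketch (assumed for this text, not taken from GPUMCD; the grid contents, units and helper names are made up) of Woodcock tracking for a single photon in a voxel grid: the step length is sampled with the majorant \mu_{\mathrm{max}} as in eqs. (2.8) and (2.9), and at each candidate point either a real interaction type is chosen, as in eq. (2.12), or a fictitious interaction occurs and the photon simply continues.

    import numpy as np

    def woodcock_step(pos, direction, mu_grid, mu_i_grids, voxel_size, rng):
        """Transport one photon to its next real interaction with Woodcock tracking.

        mu_grid    : 3D array of total attenuation coefficients per voxel [1/mm]
        mu_i_grids : dict mapping interaction name -> 3D array of mu_i per voxel
        Returns the interaction position and sampled interaction type, or
        (None, None) if the photon leaves the grid.
        """
        mu_max = mu_grid.max()
        while True:
            # Sample the free path length with the majorant mu_max, eq. (2.8).
            s = -np.log(1.0 - rng.uniform()) / mu_max
            pos = pos + direction * s

            idx = tuple((pos / voxel_size).astype(int))
            if any(i < 0 or i >= n for i, n in zip(idx, mu_grid.shape)):
                return None, None  # photon left the volume

            # Real interaction with probability mu(x)/mu_max, else fictitious.
            if rng.uniform() < mu_grid[idx] / mu_max:
                # Given a real interaction, pick the type with probability mu_i/mu(x).
                names = list(mu_i_grids)
                probs = np.array([mu_i_grids[n][idx] for n in names]) / mu_grid[idx]
                return pos, rng.choice(names, p=probs)
            # Fictitious interaction: nothing happens, continue stepping.

    rng = np.random.default_rng(1)
    mu = rng.uniform(0.001, 0.01, size=(64, 64, 64))          # toy attenuation map
    mu_i = {"compton": 0.7 * mu, "photoelectric": 0.3 * mu}    # toy split per type
    print(woodcock_step(np.array([64.0, 64.0, 0.0]), np.array([0.0, 0.0, 1.0]),
                        mu, mu_i, voxel_size=2.0, rng=rng))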

Electron transport

As described above, photon transport results in various interactions, in which electrons are produced. The electrons are the main source of the deposited energy and have to be transported according to a similar, but more complicated, scheme than for photons, as shown in fig. 2.5. This is because, in addition to hard interactions (as for photons), electrons also undergo soft interactions (multiple scattering) between the hard interactions. The hard interactions considered for electrons are

• Delta scattering (Møller and Bhabha for electrons and positrons respectively)

• Bremsstrahlung

and the step size between each hard interaction is sampled just as for photons, so the Woodcock ray tracing method is used here as well. Furthermore, the soft interactions are modeled at substeps between the hard interactions, as shown in fig. 2.5. Thus, as opposed to photon transport, where the particle is transported directly to the next interaction, the electron transport takes a substep t to the next soft interaction, where the multiple scattering deflection angle is sampled and the electron's direction is adjusted. Furthermore, the soft interaction results in an energy loss for the electron. The energy loss is deposited in the current voxel, resulting in the dose distribution. Substeps are taken repeatedly until the electron reaches the next hard interaction, leaves the geometry, or its energy drops below a threshold similar to that for the photon energy.

When the electron reaches the point of the next hard interaction, the interaction is sampled similarly as for photons. Following the interaction, the energies and directions of the resultant particles are sampled here as well.

Figure 2.5: A simplified scheme for particle transport, based on the schemes by Bielajew [27].

2.2.3 GPUMCD

GPU Monte Carlo dose (GPUMCD) [26] is a fast method for calculating Monte Carlo dose distributions by parallelizing the particle transport on the GPU. The simulation is controlled by setting a maximum noise level as a threshold; the algorithm terminates when the threshold is reached. Thus, it is guaranteed that the resulting distribution will contain less noise than the set threshold. GPUMCD is used in this project to generate both the noisy input and the reference dose distributions used to train the network's denoising ability.

2.3 Deep Learning

Deep learning (DL) is a subfield of machine learning, which itself is a subfield of artificial intelligence. Many of the concepts within the field of deep learning are not new. However, the attention given to the field has varied greatly over the years, one obstacle being the computational heaviness of the field in general. Today, the technology and the research within DL have come a long way, and it has seen success in many areas, such as image classification, text generation and language translation.

2.3.1 Deep Neural Networks

Deep neural networks (DNNs) are one of the subfields of DL that has seen the most success. DNNs are based on artificial neural networks (ANNs), which are inspired by the biological neural networks of the human brain. By building architectures of nodes (artificial neurons) and connecting them in certain ways, various problems can be solved, such as classification, regression and clustering problems. Typically, these nodes are arranged in layers with feedforward connections as shown in fig. 2.6. There are, however, many other types of ANNs, such as recurrent neural networks.

Figure 2.6: Visualization of a simple feedforward neural network.

A DNN simply refers to the stacking of ANNs into a deep architecture consisting of several layers, which potentially increases the expressive power of the network and is one of the main reasons for the success of DL.


2.3.2 Learning From Data

For a neural network to give any sensible output, it has to be trained. As can be imagined, there are numerous ways to define the problem that the network should try to solve, which further determines what the network should actually output (e.g. a classification) and what data can be used as input for the network to process and eventually use to produce the desired output. The most common form of learning is supervised learning, where the desired output data (reference data) for a certain input is available. In supervised learning, the reference data is used to train the network by comparing the output of the network to the reference and minimizing the error between them. Mathematically, this can be expressed as follows. Given a dataset D = \{(x_1, y_1), \ldots, (x_n, y_n)\} with input data x_i and its respective desired output data y_i, we want to learn the function f(x; \theta) from D so that it predicts y. The general goal of supervised learning is then to learn the parameters \theta as

\underset{\theta}{\operatorname{argmin}} \sum_{(x_i, y_i) \in D} l\big(y_i, f(x_i; \theta)\big), \quad (2.13)

where l(y_i, f(x_i; \theta)) is the loss function that is minimized. This concept is visualized in fig. 2.7.

Figure 2.7: A simplified view of how the data is used to train the network.

As the network is trained using the available data, the network's performance is limited by how well the data at hand represents the underlying model. Thus, having a big enough dataset is crucial for the model to learn properly. Furthermore, a model that is too complex for the dataset can result in overfitting the training data, thus unintentionally modeling the noise in the training set, whereas a model that is too simple will instead result in underfitting, i.e. not modeling the true characteristics of the underlying model. However, instead of solving this problem by trying to find the actual optimal model complexity, a complex model combined with regularization can be used, where the regularization helps combat the problem of overfitting.

2.3.3 Autoencoders

An autoencoder is an encoder-decoder network that, by training it to have the same output as the input, can learn a more compact representation of the data.

This is done by using a bottleneck structure as visualized in fig. 2.8. By having fewer hidden nodes than the input, this structure forces the network to represent the data with fewer nodes and can thus perform dimensionality reduction.

Figure 2.8: The autoencoder uses an encoder-decoder structure to reduce the dimensionality of the data.

2.3.4 Denoising Autoencoders

The denoising autoencoder (DAE) was first introduced by Vincent et al. [4].

The idea of denoising autoencoders is to use a similar concept to that of a normal autoencoder, except with a corrupted version of the output as input.

The network is thus trained to remove the noise in the input, resulting in a clean output.

2.3.5 Convolutional Neural Networks

Convolutional neural networks (CNNs) have seen most of their success in the image domain, where this network type has been extensively explored. Unlike the multi-layer perceptron (MLP), CNNs utilize the spatial information of the data. Thus, through the use of trainable kernels, spatial features in the data can be found. Furthermore, this structure decreases the computational complexity compared to applying an MLP, as each pixel in the image is only weighted together with the adjacent pixels defined by the kernel size, as opposed to every pixel being weighted. Even though CNNs have seen most popularity for images, i.e. in 2D, the concept works the same in 3D, which is what is explored here, with 3D data as input and output.

Convolution

The convolution of a 3D feature map X (of arbitrary size) and a kernel F is performed according to

S = X * F + b, \quad (2.14)

where S is the output feature map and b is the bias. For a kernel of size 3 × 3 × 3, each element of S can be calculated as

S_{lmn} = \sum_{i,j,k \in \{0,1,2\}} X_{l+i-1,\, m+j-1,\, n+k-1}\, F_{ijk} + b. \quad (2.15)

To find features in the data, it is necessary to have several kernels that after training extract different features, resulting in one output feature map per kernel. Also, all layers except the first will have several input feature maps as well. For M input feature maps X = \{X_1, \ldots, X_M\} and N kernels F_1, \ldots, F_N of size f \times f \times f \times M, eq. (2.14) can be extended to

S_i = X * F_i + b_i \quad \text{for } i = 1, \ldots, N, \quad (2.16)

where the output feature maps are S = \{S_1, \ldots, S_N\}. Furthermore, for one kernel, say F = \{F_1, \ldots, F_M\}, and the input X = \{X_1, \ldots, X_M\}, the convolution is performed as

X * F = \sum_{i=1}^{M} X_i * F_i. \quad (2.17)
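As an illustration of eqs. (2.14) and (2.15), here is a small, deliberately naive sketch (added here, not from the thesis) of a single 3D convolution with a 3 × 3 × 3 kernel and zero padding, written with plain loops so that it mirrors the index arithmetic of eq. (2.15); a real network would of course use an optimized library layer instead.

    import numpy as np

    def conv3d_same(x, kernel, bias=0.0):
        """Naive 3D convolution with a 3x3x3 kernel and 'SAME' zero padding.

        Directly implements eq. (2.15): each output voxel is the weighted sum
        of its 3x3x3 neighborhood in the (zero padded) input plus a bias.
        """
        assert kernel.shape == (3, 3, 3)
        padded = np.pad(x, 1)                       # one-voxel zero frame
        out = np.zeros_like(x, dtype=float)
        L, M, N = x.shape
        for l in range(L):
            for m in range(M):
                for n in range(N):
                    region = padded[l:l + 3, m:m + 3, n:n + 3]
                    out[l, m, n] = np.sum(region * kernel) + bias
        return out

    x = np.random.default_rng(0).normal(size=(8, 8, 8))
    k = np.full((3, 3, 3), 1.0 / 27.0)              # simple box-blur kernel
    print(conv3d_same(x, k).shape)                   # (8, 8, 8): size is preserved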

Padding

When performing the convolution, the kernel is centered around an index in the input feature map corresponding to an index in the output feature map. This creates a problem for the edge cases, where the kernel cannot be applied directly as it extends outside the range of the feature map. This is most often solved in one of two ways. Either the problem is ignored, and the resulting output feature map will be of a slightly smaller size, losing the surrounding "frame", which for a 3 × 3 × 3 kernel corresponds to the outermost voxels of the whole cube. This is known as 'VALID' padding (i.e. no padding). If it is preferable to preserve the size of the feature map, a padding of zeros can instead be added around the input so that the kernel can be applied to these indices as well. This is known as 'SAME' padding. The difference is shown in fig. 2.9 (in 2D for the sake of simplicity).

Figure 2.9: Visualization of the difference between ’VALID’ padding (above) and ’SAME’ padding (below).

Stride

When moving the kernel, the number of indices it is moved in each dimension of the data is called the stride (see fig. 2.10). Thus, a stride of 2 in all dimensions corresponds to moving the kernel 2 indices in every dimension. This results in an output feature map of half the input size in all dimensions (when using 'SAME' padding).

Figure 2.10: Visualization of the effect of stride in 2D. Here, the stride is 2 in both dimensions.
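To see how padding and stride affect the output size in practice, the short sketch below (an assumed illustration using the tensorflow.keras API of a recent TensorFlow rather than the exact Keras 2.2.4 setup of section 3.3.3; the input size and filter count are arbitrary) builds single Conv3D layers and prints the resulting shapes.

    import numpy as np
    from tensorflow.keras import layers

    x = np.zeros((1, 16, 16, 16, 1), dtype="float32")  # one 16^3 volume, 1 channel

    for padding, strides in [("valid", 1), ("same", 1), ("same", 2)]:
        conv = layers.Conv3D(filters=8, kernel_size=3, strides=strides, padding=padding)
        y = conv(x)
        # 'valid' with a 3x3x3 kernel trims one voxel per side: 16 -> 14.
        # 'same' keeps the size; a stride of 2 then halves it: 16 -> 8.
        print(padding, strides, y.shape)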

2.3.6 Activation Function

The activation function is an essential part of neural networks as it helps introduce non-linearity. Inspired by biology, the activation function represents the firing rate of a neuron. For a CNN, it is applied to each entry of the feature maps after each convolutional layer. There are several different activation functions, and the sigmoid is one that has traditionally seen much use. It is defined as

\sigma(x) = \frac{1}{1 + e^{-x}}. \quad (2.18)

Thus, x is mapped to the interval [0, 1] and the function saturates. Among other activation functions, one of the most frequently used in modern neural networks is the Rectified Linear Unit (ReLU). Applying ReLU is done as

\mathrm{ReLU}(x) = \max(0, x). \quad (2.19)

ReLU has certain nice properties, e.g. that it does not saturate for large positive values (which is good for preventing vanishing gradients), as well as being easy to compute. Setting negative values to zero also results in a sparse representation. However, one risk of using ReLU is that parts of the network can die, i.e. many neurons output zero due to being stuck at a negative value. If this is a problem for the network, there are versions of ReLU such as leaky ReLU and parametric ReLU (PReLU), with a small slope for negative values. Furthermore, in practice, ReLU tends to show much better convergence than the saturating activation functions (e.g. the sigmoid or tanh) [28].

2.3.7 Downsampling

An important technique that is often used in CNNs is downsampling. This is important for increasing the computational efficiency as well as for allowing the network to find general patterns in the data. The most used technique for downsampling in CNNs is max pooling, which corresponds to taking the maximum value over a region of a specific size. This is shown in fig. 2.11 with a 2D example for the sake of simplicity. For a 2 × 2 × 2 region, max pooling can be expressed as

S_{lmn} = \max_{i,j,k \in \{0,1\}} X_{2l+i,\, 2m+j,\, 2n+k}, \quad (2.20)

where X is the input feature map and S is the downsampled output feature map. There are other downsampling techniques, such as average pooling. Furthermore, another possibility is using convolutions with stride, which was shown to achieve similar performance in the All Convolutional Net [29].

Figure 2.11: Visualization of 2 × 2 max pooling operation in 2D.
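The following is a small assumed example (not from the thesis) of 2 × 2 × 2 max pooling as in eq. (2.20), implemented by reshaping a volume into 2 × 2 × 2 blocks with numpy and taking the maximum over each block.

    import numpy as np

    def max_pool_2x2x2(x):
        """Non-overlapping 2x2x2 max pooling of a 3D array (eq. 2.20).

        Each output voxel S[l, m, n] is the maximum of the corresponding
        2x2x2 block X[2l:2l+2, 2m:2m+2, 2n:2n+2].
        """
        L, M, N = x.shape
        assert L % 2 == M % 2 == N % 2 == 0, "dimensions must be even"
        blocks = x.reshape(L // 2, 2, M // 2, 2, N // 2, 2)
        return blocks.max(axis=(1, 3, 5))

    x = np.arange(4 * 4 * 4).reshape(4, 4, 4)
    print(max_pool_2x2x2(x).shape)     # (2, 2, 2)
    print(max_pool_2x2x2(x)[0, 0, 0])  # max of the first 2x2x2 block = 21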

2.3.8 Upsampling

As opposed to downsampling, upsampling faces the problem of increasing the resolution, as visualized in fig. 2.12. There is no optimal way to do this, as the downsampling step is non-invertible. However, just as for downsampling, there are several different methods to perform upsampling, e.g. nearest-neighbor and bilinear interpolation, for which the performance depends on the specific situation. Another solution to the problem is to use transposed convolution. Transposed convolution is trainable and thus theoretically has the potential to perform better than other upsampling techniques by finding an optimal weighting, with the downside of adding more parameters for the network to train and thus slightly increasing the training time. The transposed convolution for the simplified problem of one feature map X and one 2 × 2 × 2 kernel F (with stride 2) can be performed as

S_{2l+i,\, 2m+j,\, 2n+k} = X_{lmn}\, F_{ijk} + b, \quad i, j, k \in \{0, 1\}, \quad (2.21)

where S is the upsampled feature map.

Figure 2.12: Visualization of the upsampling problem.
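As an assumed illustration of eq. (2.21) (not part of the thesis code), the sketch below upsamples a single 3D feature map by a factor of 2 with a 2 × 2 × 2 transposed convolution implemented directly in numpy: every input voxel is spread over a 2 × 2 × 2 output block weighted by the kernel.

    import numpy as np

    def transposed_conv_2x(x, kernel, bias=0.0):
        """Stride-2 transposed convolution with a 2x2x2 kernel (eq. 2.21).

        Each input voxel X[l, m, n] fills the output block
        S[2l:2l+2, 2m:2m+2, 2n:2n+2] with X[l, m, n] * F + b.
        """
        assert kernel.shape == (2, 2, 2)
        L, M, N = x.shape
        out = np.zeros((2 * L, 2 * M, 2 * N))
        for l in range(L):
            for m in range(M):
                for n in range(N):
                    out[2*l:2*l+2, 2*m:2*m+2, 2*n:2*n+2] = x[l, m, n] * kernel + bias
        return out

    x = np.ones((2, 2, 2))
    f = np.full((2, 2, 2), 0.125)          # a trainable kernel would replace this
    print(transposed_conv_2x(x, f).shape)  # (4, 4, 4)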

2.3.9 Loss Function

The loss function is the function that is minimized by the optimizer and thus has an important role in the training of the neural network. There are many to choose from; however, the two most common are the mean absolute error (MAE) and the mean squared error (MSE). MAE is calculated according to

\mathrm{MAE} = \frac{1}{n} \sum_{i=1}^{n} |y_i - \hat{y}_i|, \quad (2.22)

where y are the reference values and \hat{y} are the predictions from the network. Similarly, MSE is calculated as

\mathrm{MSE} = \frac{1}{n} \sum_{i=1}^{n} (y_i - \hat{y}_i)^2. \quad (2.23)

Which one works best depends on the data at hand. While MAE tends to be more robust to outliers, MSE has a better curve for learning. This can be interpreted from the curves in fig. 2.13. As MSE is influenced more by large errors, outliers will have a greater impact compared to using MAE. However, when approaching the minimum, the gradient for MAE stays the same, while for MSE it naturally decreases, which is preferable for learning. A combination of the two can also be used, e.g. the Huber loss, which is approximately MAE for large errors and MSE for small errors, i.e. when getting close to the minimum.

Figure 2.13: Curves for MSE and MAE loss.
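A minimal assumed sketch (not thesis code) of the three losses discussed above, written with numpy so that the behavior for outliers is easy to probe:

    import numpy as np

    def mae(y, y_hat):
        return np.mean(np.abs(y - y_hat))            # eq. (2.22)

    def mse(y, y_hat):
        return np.mean((y - y_hat) ** 2)             # eq. (2.23)

    def huber(y, y_hat, delta=1.0):
        # Quadratic (MSE-like) for small errors, linear (MAE-like) for large ones.
        err = np.abs(y - y_hat)
        quad = 0.5 * err ** 2
        lin = delta * (err - 0.5 * delta)
        return np.mean(np.where(err <= delta, quad, lin))

    y = np.array([1.0, 2.0, 3.0, 4.0])
    y_hat = np.array([1.1, 1.9, 3.0, 14.0])          # last prediction is an outlier
    print(mae(y, y_hat), mse(y, y_hat), huber(y, y_hat))
    # The outlier dominates MSE far more than MAE or the Huber loss.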

2.3.10 Optimization

To train the network parameters and minimize the error, an optimization algorithm is used. One of the most well-known algorithms is stochastic gradient descent (SGD), which uses the gradient of the loss function along with a learning rate to update the parameters. There are several versions of SGD developed to boost the performance, e.g. AdaGrad and RMSProp, both keeping a separate learning rate for each parameter, as opposed to SGD, for which the learning rate is the same for all parameters. Today, one of the most popular algorithms for DNNs is Adam (Adaptive Moment Estimation) [30], which combines the advantages of AdaGrad and RMSProp and, similarly to them, also uses a separate learning rate for each parameter.

2.3.11 Skip Connections

Skip connections can be used in neural networks for several reasons. By adding skip connections, one combats the problem of vanishing gradients during training. For the specific case of a denoising neural network, the skip connections are also helpful to preserve details in the data that would otherwise be lost from downsampling.

2.4 Evaluation

To evaluate the denoising ability of the neural network properly, several aspects have to be considered. The line dose and the gamma index will be used for evaluating the denoising ability. Furthermore, a median filter is introduced as a baseline denoising method for comparison.

2.4.1 Line Dose

As dose distributions are 3D entities, they are hard to evaluate visually. In particular, it can be hard to judge from the 3D data where the network might have trouble denoising the dose distributions. One simple method to get an idea of how well the network performs is to look at the line dose along the central axis of the beam with respect to the depth through the patient geometry (see fig. 2.14). By plotting the 1D line dose for the dose distributions, it is easier to inspect the dose at the interfaces of the patient materials.

Figure 2.14: Visualization of the line dose. Here, the line dose is extracted from the dose distribution along the yellow line (left), which intersects the interfaces of the different materials of the geometry (middle). The resulting line dose (right) is shown for the input and the corresponding reference distribution (in blue and orange respectively).
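As a simple assumed illustration (the array names, axis choice and voxel values are hypothetical, not from the thesis), extracting a line dose amounts to slicing the 3D dose grid along the beam's central axis:

    import numpy as np

    # Toy dose grids standing in for an input/reference pair.
    dose_input = np.random.default_rng(0).random((128, 128, 128))
    dose_reference = dose_input + 0.01

    # Assume the beam axis runs along the first array dimension through the
    # central voxel column; the line dose is then a 1D profile versus depth.
    cx, cy = 64, 64
    line_input = dose_input[:, cx, cy]
    line_reference = dose_reference[:, cx, cy]
    print(line_input.shape)  # (128,) dose values, one per voxel along the depth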

2.4.2 Gamma Index

The gamma index [31] is a metric for the similarity of two dose distributions (for points r_c and r_m in the calculated and measured dose distributions, respectively) which combines the measures of dose difference and distance-to-agreement. By setting thresholds for the dose difference ΔD_M and the distance-to-agreement Δd_M, a gamma index γ ≤ 1 means passing the criteria, while γ > 1 means failing the criteria. The gamma index is calculated as

\gamma(r_m) = \min\{\Gamma(r_m, r_c)\}\; \forall \{r_c\}, \quad (2.24)

where

\Gamma(r_m, r_c) = \sqrt{\frac{r^2(r_m, r_c)}{\Delta d_M^2} + \frac{\delta^2(r_m, r_c)}{\Delta D_M^2}}. \quad (2.25)

Furthermore,

r(r_m, r_c) = |r_c - r_m| \quad (2.26)

and

\delta(r_m, r_c) = D_c(r_c) - D_m(r_m), \quad (2.27)

where \delta(r_m, r_c) is the difference in dose.

Instead of evaluating a calculated dose in comparison to the measured dose, the gamma index will be used for evaluating the predicted dose from the network to the corresponding reference dose.
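The following is a small assumed sketch (not the evaluation code used in the thesis) of a brute-force gamma computation for a single 1D dose profile, directly following eqs. (2.24)-(2.27); real gamma implementations restrict the search to a neighborhood and work on the full 3D grid, but the idea is the same.

    import numpy as np

    def gamma_1d(d_ref, d_eval, spacing_mm, dose_tol, dist_tol_mm):
        """Brute-force gamma index (eqs. 2.24-2.27) for 1D dose profiles.

        d_ref plays the role of the 'measured' distribution D_m and d_eval the
        'calculated' distribution D_c. Returns one gamma value per reference point.
        """
        pos = np.arange(len(d_ref)) * spacing_mm
        gammas = np.empty(len(d_ref))
        for m in range(len(d_ref)):
            r = pos - pos[m]                    # spatial distance to every point
            delta = d_eval - d_ref[m]           # dose difference to every point
            capital_gamma = np.sqrt((r / dist_tol_mm) ** 2 + (delta / dose_tol) ** 2)
            gammas[m] = capital_gamma.min()     # minimum over all evaluated points
        return gammas

    ref = np.sin(np.linspace(0, np.pi, 100))
    eval_ = ref + 0.005 * np.random.default_rng(0).standard_normal(100)
    g = gamma_1d(ref, eval_, spacing_mm=2.0, dose_tol=0.01, dist_tol_mm=1.0)
    print(f"passing rate: {100 * np.mean(g <= 1):.1f}%")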

2.4.3 Median Filter

The median filter is a simple method that is frequently used for denoising. The method essentially applies a kernel of a certain size to the data, within which the median is calculated. A 3 × 3 × 3 kernel is applied to each element of the data as

S_{lmn} = \operatorname*{median}_{i,j,k \in \{0,1,2\}} \big( X_{l+i-1,\, m+j-1,\, n+k-1} \big), \quad (2.28)

where X is the input data and S is the median filtered output of the same size.
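In practice a library routine can be used; the sketch below (an assumed example, not the thesis code) applies a 3 × 3 × 3 median filter to a noisy volume with scipy.

    import numpy as np
    from scipy.ndimage import median_filter

    rng = np.random.default_rng(0)
    dose = rng.random((64, 64, 64))              # toy noisy 3D dose distribution

    # size=3 applies a 3x3x3 neighborhood, matching eq. (2.28); the borders are
    # handled by scipy's default 'reflect' mode rather than by shrinking the output.
    denoised = median_filter(dose, size=3)
    print(denoised.shape)                         # (64, 64, 64)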


3 Methods

This chapter explains the methods used in the project. It describes how the datasets were generated, presents the network architecture used and finally explains the setup for training as well as how the network will be evaluated.

3.1 Datasets

The data used for the project were 3D dose distributions, as described in section 2.2.1, with a size of 128 × 128 × 128 voxels. The dose distributions were generated with GPUMCD, where the noise level threshold was set to < 30% and < 2% for the input and reference data, respectively. Training with noise in the reference data has been shown to work well, with results similar to training with clean reference data [32]. As described in section 2.2.3, the actual noise level will be lower than the set threshold. In order to quantify the actual noise levels of the input and reference data, they have to be estimated from the noise of the generated dose distributions. Two datasets with different characteristics were used during the course of the project. These characteristics and the actual noise level of each dataset are described below.

3.1.1 Dataset I: Slab Phantoms

For the first dataset, the geometry was set to slabs of bone, water, solid water and dry air, where the thickness of each slab was randomized along with the order of the materials. The specific materials were chosen to mimic the densities of the human body as well as to see how the network manages to deal with both small and large differences in material density (see the respective densities in table 3.1). Furthermore, the voxel size was set to 2 mm in each dimension. As for the beam setup, the gantry angle was sampled from a uniform distribution between 0° and 360°, and the collimator opening was set to a rectangle with both sides sampled uniformly between 4 mm and 150 mm. A visualization of the beam and phantom setup is shown in fig. 3.1.

Table 3.1: Densities for the patient geometry.

Material          Bone    Water   Solid water   Dry air
Density (g/cm³)   1.819   1.000   1.046         0.010

Figure 3.1: Visualization of the beam and phantom setup. The gantry angle, collimator opening as well as the geometry slab thickness and material order are randomized.

The dataset consisted of 1000 input/reference dose distribution pairs. These were split up as follows: 800 of the pairs were used for training, 100 were used as validation data and the last 100 for testing. An example of an input and a reference distribution can be seen in fig. 3.2, where the middle slice of the dose distributions is plotted. Furthermore, the 100 test set reference dose distributions were generated with a noise level threshold of < 0.5% (referred to as the low noise reference data), as further described in section 3.4.


Figure 3.2: Example input (left) and reference (right) distribution pair for dataset I.

The actual noise levels of the input and reference data were estimated as the mean of the noise levels for the voxels with a dose above 50% of the maximum dose in each respective dose distribution. The input data, generated with a noise threshold of < 30%, had an average noise level of 11.2%, whereas the reference data, generated with a noise threshold of < 2%, had an average noise level of 1.1%. Furthermore, the low noise reference data, with a threshold of < 0.5%, had an approximate noise level of 0.38%. Histograms of the noise levels for the dose distributions are shown in fig. 3.3, fig. 3.4 and fig. 3.5 for the input, reference and low noise reference data, respectively.


Figure 3.3: Histogram of noise levels for the input dose distributions of dataset I.

Figure 3.4: Histogram of noise levels for the reference dose distributions of dataset I.


Figure 3.5: Histogram of noise levels for the low noise reference dose distri- butions of dataset I.

3.1.2 Dataset II: Patient Geometry

For the second dataset, the aim was to mimic a clinical setting by using a real patient geometry instead of a slab phantom. Thus, instead of simulating the geometry with slabs as for dataset I, two real patient geometries were used, one for training and one for testing, with a voxel size of 3 mm in each dimension. These geometries were both of lung patients and thus spanned the same types of body materials. Furthermore, collecting or generating 1000 clinical plans is very time consuming and could not be achieved within this project. In order to take one step closer to realistic clinical beam setups, the collimator opening was set to an ellipse with the height and width uniformly sampled between 4 mm and 150 mm and its angle uniformly sampled between 0° and 360°. The isocenter within the patient, i.e. the patient position within the treatment machine, was also sampled, from the set of the 50 × 50 × 50 central voxels. This set was selected so as not to limit the dataset to dose distributions aimed at the center of the patient geometry, while avoiding edge cases where e.g. half of the beam is outside the geometry. The gantry angle was sampled uniformly just as for dataset I, i.e. between 0° and 360°. The beam setup is shown in fig. 3.6. The same amount of data and the same split for training, validation and testing as for dataset I were used for this dataset as well. An example input and reference distribution is shown in fig. 3.7. Furthermore, just as for dataset I, the test set reference dose distributions were generated with a noise level threshold of < 0.5% as low noise reference data.

Figure 3.6: Visualization of the beam setup. The gantry angle, collimator opening as well as the isocenter are sampled uniformly, where the isocenter is sampled within the red square (cube in reality).

Figure 3.7: Example input (left) and reference (right) distribution pair for dataset II.

Just as for dataset I, the actual noise levels were also calculated for this dataset (generated with the same input and reference noise thresholds). The input data had an average noise level of 7.3% and the reference data had an average noise level of 1.0%. Furthermore, the average noise level for the low noise reference data was 0.26%. Histograms of the noise levels for the dose distributions are shown in fig. 3.8, fig. 3.9 and fig. 3.10 for the input, reference and low noise reference data, respectively.

Figure 3.8: Histogram of noise levels for the input dose distributions of dataset II.

Figure 3.9: Histogram of noise levels for the reference dose distributions of dataset II.


Figure 3.10: Histogram of noise levels for the low noise reference dose distri- butions of dataset II.

3.2 Network Architecture

The network architecture used for denoising in this project is a 3D version of the U-net [17] (see fig. 3.11 for a visualization and appendix A for the complete network specification). It has a similar architecture to the U-net, but with every operation done in 3D instead of 2D, that is, convolution, max pooling and up-convolution (i.e. transposed convolution followed by ReLU activation), in order to utilize the information in all three dimensions of the data. It has an encoder-decoder structure where the encoder, similar to a usual convolutional neural network, is a chain of convolutional layers and pooling layers, whereas the main difference from other CNNs lies in the decoder, which reconstructs the volume through a chain of convolutional layers and upsampling layers (corresponding to transposed convolutions) concatenated with the feature maps of the same size in the encoder part through the skip connections. The network consistently uses kernels of size 3 × 3 × 3 for the convolutional layers and zero padding to preserve the size. Each convolution is followed by a ReLU activation, with the only exception being the output layer. Furthermore, downsampling is done using 2 × 2 × 2 max pooling and the corresponding upsampling is done using transposed convolution. Every downsampling/upsampling step increases/decreases the number of feature maps by a factor of 2, respectively, starting with 32 feature maps at the highest level (see fig. 3.11).

Figure 3.11: A visualization of the U-net in 3D, which was the network used in this project. Here, the number above each box corresponds to the number of feature maps.
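To make the description above concrete, here is a heavily simplified sketch of a 3D U-net in Keras (an assumed illustration with only two resolution levels; apart from the 3 × 3 × 3 kernels, 2 × 2 × 2 pooling, ReLU activations and the initial 32 feature maps stated in the text, the details are made up, and the actual network is specified in appendix A). It uses the tensorflow.keras API of a recent TensorFlow rather than the exact versions listed in section 3.3.3.

    from tensorflow.keras import layers, models

    def conv_block(x, n_filters):
        # Two 3x3x3 convolutions with zero padding and ReLU, as in the text.
        x = layers.Conv3D(n_filters, 3, padding="same", activation="relu")(x)
        return layers.Conv3D(n_filters, 3, padding="same", activation="relu")(x)

    def tiny_unet3d(input_channels=2):
        # Two input channels: the noisy dose and (optionally) the material geometry.
        inputs = layers.Input(shape=(128, 128, 128, input_channels))

        # Encoder: convolutions followed by 2x2x2 max pooling.
        e1 = conv_block(inputs, 32)
        e2 = conv_block(layers.MaxPooling3D(2)(e1), 64)

        # Decoder: transposed convolution, skip connection, convolutions.
        u1 = layers.Conv3DTranspose(32, 2, strides=2, activation="relu")(e2)
        d1 = conv_block(layers.concatenate([u1, e1]), 32)

        # Linear output layer producing the denoised dose.
        outputs = layers.Conv3D(1, 1)(d1)
        return models.Model(inputs, outputs)

    model = tiny_unet3d()
    model.summary()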

3.3 Model Setup

Aside from a suitable network architecture, there are other important factors to consider for the network to learn properly. There are many parameters that can be tuned as well as loss functions and optimizers to choose from. Furthermore, the training and testing time is greatly affected by the specific hardware and software used.

3.3.1 Model Parameters

The model parameters are presented in table 3.2. The same parameters were used for both datasets. Due to the memory requirements of handling 3D data, memory issues were encountered with a batch size > 1, so a batch size of 1 was used. Furthermore, as Adam optimization sets an adaptive learning rate for each parameter, the learning rate given here is the maximum learning rate Adam will use. The Adam parameters β1 and β2 presented here are the exponential decay rates for the first and second moment estimates [30].

Table 3.2: Model parameters for the network.

Epochs   Batch size   Learning rate   β1    β2
300      1            1e-5            0.9   0.999


3.3.2 Training the Model

As described in section 2.3.2, the network is trained with input data and corresponding reference data. For this specific problem, the training is performed according to fig. 3.12, using 800 input and reference dose distributions for training and 100 for validation. By using both the high noise input dose distribution and possibly also the corresponding patient geometry (i.e. the grid of material densities of the same size as the dose distribution) as input to the network, the network is trained to weight these inputs so as to properly denoise the dose distribution and preserve the dose features that are typically located at the interfaces of the materials, which are given more clearly by the geometry. The network prediction is compared to the reference dose distribution using the MSE loss function, which is minimized (by adjusting the weights) with Adam optimization.

Figure 3.12: A simplified view of how the data is used to train the network.
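Under this setup, the training step in Keras could look roughly like the sketch below (an assumed, self-contained example: the data arrays are dummy placeholders, the single-layer model stands in for the 3D U-net, and recent Keras versions use learning_rate where Keras 2.2.4 used lr). It compiles with the MSE loss and the Adam parameters from table 3.2 and trains with a batch size of 1.

    import numpy as np
    from tensorflow.keras import layers, models
    from tensorflow.keras.optimizers import Adam

    # Dummy arrays standing in for the training/validation dose distributions:
    # inputs hold the noisy dose (plus, optionally, a geometry channel),
    # targets hold the corresponding reference dose distributions.
    x_train = np.zeros((2, 128, 128, 128, 2), dtype="float32")
    y_train = np.zeros((2, 128, 128, 128, 1), dtype="float32")
    x_val = np.zeros((1, 128, 128, 128, 2), dtype="float32")
    y_val = np.zeros((1, 128, 128, 128, 1), dtype="float32")

    # A single convolution as a stand-in for the 3D U-net of section 3.2.
    inputs = layers.Input(shape=(128, 128, 128, 2))
    outputs = layers.Conv3D(1, 3, padding="same")(inputs)
    model = models.Model(inputs, outputs)

    # MSE loss and the Adam parameters from table 3.2.
    model.compile(optimizer=Adam(learning_rate=1e-5, beta_1=0.9, beta_2=0.999),
                  loss="mse")
    model.fit(x_train, y_train,
              validation_data=(x_val, y_val),
              batch_size=1, epochs=300)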

3.3.3 Software and Hardware

The code is written in Python 3.6.5 and the deep learning library Keras 2.2.4 is used for the network implementation with Tensorflow (GPU) 1.11.0 as backend. The network is trained and tested on a desktop computer using a NVIDIA GTX 1080 Ti GPU. Furthermore, CUDA Toolkit 9.2 is used along with cuDNN 7.3.1. For the data generation, GPUMCD was run on two NVIDIA GTX 1080 Ti GPUs.


3.4 Evaluation

The results of using the network for denoising will be evaluated in several ways. The impact of using the geometry as an additional input to the network will be evaluated. Furthermore, throughout the results, the network's performance will be compared to the performance of a median filter as a baseline for denoising.

The evaluation will be a combination of visual inspection of the line dose as well as gamma index example plots and histograms. The gamma index criteria are set to 1% and 1 mm for the dose difference and distance-to-agreement, respectively. However, these are rather strict criteria, and a dose prediction is considered to pass the test if 95% of the voxels fulfill γ ≤ 1. Furthermore, as the gamma index is a value per voxel in each distribution, the evaluation will be done by looking at the percentage of passing voxels for a certain gamma index, as well as by looking at histograms of the gamma indices for all test set distributions.

The network is trained with reference data containing noise levels below a threshold of 2% (see section 3.1.1). While this is sufficient for training, it is not entirely fair to measure the similarity to these distributions, as the network possibly does not model the noise. Thus, the network prediction will be compared to the low noise reference data for an even more accurate evaluation.


4 Results

The results of using the 3D U-net for denoising of the dose distributions are presented below. First, the denoising results relative to the low noise reference data are presented for each dataset, following the same structure. The training and validation loss is presented first. Following that, the network's denoising ability is compared with a median filter. For this, histograms of the gamma indices for all of the test set distributions are shown, along with example plots of the gamma index for the middle slice of the dose distributions. Furthermore, example plots are shown of the line dose through the distributions with respect to depth. Following the results specific to each dataset, results regarding the computation time for the network propagation and for generating the input and reference data are presented, which relate to both datasets. Finally, a result of having noise in the reference data is shown, which is also related to both datasets.

4.1 Dataset I

The results on dataset I show the network's denoising performance in a simplified setting, both with and without the geometry as an additional input to the network, and compared to a median filter.

4.1.1 Training and Validation Loss

The training and validation loss for the network, both with and without the geometry as input, is shown in fig. 4.1. The training was done for 300 epochs, which took approximately 58 hours. From the loss it is also clear that the network (both with and without the geometry) does not overfit the training data, as the loss is continuously decreasing even on the validation set. Furthermore, the network converges more slowly with the geometry as input, which is expected when giving more information to the network. However, the network appears to approach a similar loss value in both settings.

Figure 4.1: MSE loss for training and validation data both with and without the geometry.

4.1.2 Gamma Index

The gamma index results for the predicted and the reference dose distributions are shown in two ways below. Histograms of the gamma indices and voxel passing rates are shown first, followed by example gamma index plots for the middle slice of the dose distributions. Recall for these results that a passing rate of at least 95% means that the dose prediction has passed the gamma test.

Gamma index histograms

The gamma index results of using the network for denoising are presented both with and without the geometry as input to the network and are also compared to using a median filter. The (reverse) cumulative gamma index histograms for the test set dose distributions are shown in fig. 4.2 for the respective network predictions and the median filter. These are also supported by table 4.1, showing the passing rate of the voxels. As seen from the histograms, most voxels of the network predictions have a low gamma index, both with and without the geometry as input. Furthermore, 98.86% and 98.92% of the voxels pass a gamma index of 1 with and without the geometry, respectively. Thus, the overall results are slightly better without the geometry as input. A histogram of the voxel passing rates with respect to each individual dose distribution for the network predictions is shown in fig. 4.3, where it is clear that all of the predicted dose distributions have a passing rate above 95%, most of them with a great margin, both with and without the geometry. For the median filter, on the other hand, 87.51% of the voxels pass a gamma index of 1, and as seen from the histograms in fig. 4.2, the median filter has a larger number of failing voxels for all gamma values.

Figure 4.2: Cumulative gamma index histogram for all gamma indices for the network predictions (blue with geometry, red without geometry) and the median filter output (green).

(49)

Figure 4.3: Histogram of the voxel passing rates for each dose distribution.

Table 4.1: Table of voxel passing rates with respect to gamma index for the network and median filter results.

                         γ ≤ 1 (%)   γ ≤ 2 (%)   γ ≤ 3 (%)   γ ≤ 4 (%)
Network (with geo.)        98.86       99.93       99.99      100.0
Network (without geo.)     98.92       99.91       99.98      100.0
Median filter              87.51       97.69       99.59       99.89

Gamma index plots

Four representative example gamma index plots are shown in fig. 4.4 for the middle slice of the dose distributions. The examples were chosen to include both wide and narrow beams as well as both small and large peaks, as can be seen in the line dose plots below. The plots show that most of the voxels of the network predictions pass the test, and the prediction with the geometry as input appears to better capture certain dose features, as can be seen by comparing the failing voxels of the two predictions (highlighted by the green circles). This, however, is not the case for the median filter output, for which the remaining noise causes a significant number of voxels to fail.


Figure 4.4 (panels a–d): Example gamma index plots for four different dose distributions. From left to right: the reference dose distribution, the network prediction with geometry, the network prediction without geometry and the median filter output. Voxels in red have failed a gamma index of 1.


4.1.3 Line Dose Comparison

The line dose with respect to depth for the input, reference and predicted dose distributions is shown in fig. 4.5 for the same four example distributions as in the gamma index plots above. From the line dose plots it is clear that the network preserves the features of the dose distributions, i.e. the peaks in the line dose, rather well, especially in comparison to the median filter, which does not preserve the features as well and whose overall denoising is weaker, as seen from the green curve fluctuating around the reference curve. With the geometry as input, the network also denoises the smaller peaks more consistently (highlighted by the red circles), which supports the results of the gamma index plots.
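The line dose curves are dose profiles along the depth direction of the volume. As an illustrative sketch, assuming the dose volume is indexed as (depth, height, width) and that the profile is taken through the lateral centre (this indexing convention is an assumption), the extraction amounts to:

    import numpy as np

    def central_line_dose(dose):
        # dose is assumed to be a 3D array indexed as (depth, height, width);
        # the returned 1D profile is the dose along depth at the lateral centre.
        _, height, width = dose.shape
        return dose[:, height // 2, width // 2]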


Figure 4.5 (panels a–d): Four line dose examples for the network predictions with geometry (left), without geometry (middle) and the median filter output (right) in green. The blue curve is the input and the orange curve is the reference dose distribution.


4.2 Dataset II

With dataset II, the limits of denoising dose distributions with the proposed network are further explored in a more realistic setting. Just as for dataset I, the results are compared with a median filter. As the results on dataset I suggest that using the geometry as input helps the network to find certain dose features, this is the version of the network used for dataset II.

4.2.1 Training and Validation Loss

The training and validation loss is shown in fig. 4.6. With the same 300 epochs and the same amount of data, the training took approximately 58 hours for this dataset as well. As for dataset I, there are no signs of overfitting.

Figure 4.6: MSE loss for training and validation data.

4.2.2 Gamma Index

The gamma index results are shown below, just as for dataset I, i.e. histograms of the gamma indices followed by example gamma index plots for the middle slice of the dose distributions, again comparing the network prediction to a median filter.


Gamma index histograms

The (reverse) cumulative histograms for the test set dose distributions are shown in fig. 4.7 for the network and the median filter. It is clear from the histograms that both the network and the median filter perform worse on this dataset than on the first one. From the voxel passing rates shown in table 4.2, a gamma index of 1 is passed by 94.84% of the voxels for the network and 85.88% of the voxels for the median filter. Furthermore, a histogram of the voxel passing rates for each individual dose distribution is shown in fig. 4.8 for the network predictions, where more than half of the dose distributions have a passing rate above 95%.

Figure 4.7: Cumulative gamma index histogram for all gamma indices for the network predictions (blue) and the median filter output (green).


Figure 4.8: Histogram of the voxel passing rates for each dose distribution.

Table 4.2: Table of voxel passing rates with respect to gamma index for the network and median filter results.

                γ ≤ 1 (%)   γ ≤ 2 (%)   γ ≤ 3 (%)   γ ≤ 4 (%)
Network           94.84       99.29       99.86       99.96
Median filter     85.88       95.70       98.56       99.28

Gamma index plots

Example gamma index plots are shown in fig. 4.9 for four representative example dose distributions. The network performs worse on this dataset than on dataset I, with an increased number of failing voxels. However, it appears to have no significant problem denoising the features of the dose distributions, as the gamma index is not noticeably higher in the regions containing the features of the reference dose distribution. The median filter, on the other hand, has problems preserving the peaks, and just as for dataset I many of its voxels do not pass the gamma index criterion.


Figure 4.9 (panels a–d): Four example gamma index plots showing, for each example, the reference dose distribution (left), the network prediction (middle) and the median filter output (right). Voxels in red have failed a gamma index of 1.

4.2.3 Line Dose Comparison

The line doses for the same four example dose distributions as in the gamma index plots above are shown in fig. 4.10 for the network prediction and the median filter respectively. The network prediction still performs rather well, with the exception of a few peaks, although not as well as for dataset I. The median filter removes not only the noise but also the peaks, as expected. With the introduction of the more complex geometry, and as a result an increasing number of peaks in the dose distributions, the median filter performs poorly.

Figure 4.10 (panels a–d): Four line dose examples for the network prediction (left) and the median filter output (right) in green. The blue curve is the input and the orange curve is the reference dose distribution.


4.3 Computation Time

To evaluate the potential performance gain of using the network for denoising, as opposed to the traditional approach of relying solely on the MC calculation, one important aspect is whether calculating the input dose distribution and propagating it through the network, while achieving acceptable denoising performance, is faster than the traditional approach. The average computation times for the network propagation and for the input and reference dose distributions are shown in table 4.3, using the software and hardware setup described in section 3.3.3. The combined average computation time for the input dose distribution and the network propagation is clearly much shorter than the average computation time for the reference dose distribution.

Table 4.3: Table of the average computation times for the network propagation and the input and low noise reference dose distributions.

                 Network   Input   Reference
Comp. time (s)     0.19     5.54    104.17
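In other words, the combined average time of 5.54 + 0.19 = 5.73 s for the input calculation and the network propagation corresponds to a speedup of roughly 104.17 / 5.73 ≈ 18 times compared to computing the low noise reference dose distribution directly.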

4.4 Noise in the Reference Data

The above results for the network's denoising ability are presented using the low noise reference data, as the noise in the reference data has a significant impact on the results. This can be seen from the histogram comparison shown in fig. 4.11 (for dataset II, but the same applies to dataset I), where the effect on the gamma index of lowering the noise is clear. That is, for a fair comparison, the noise in the reference data should be decreased as much as possible (which is limited by the computation time of the MC dose calculation).


Figure 4.11: Comparison of cumulative gamma index histograms for the net- work on the high and low noise reference data.
