UPTEC F 19011
Examensarbete 30 hp (Degree project, 30 credits)
March 2019

Learning Phantom Dose Distribution using Regression Artificial Neural Networks


Abstract

Learning Phantom Dose Distribution using Regression Artificial Neural Networks

Mattias Åkesson

Before radiation treatment of a cancer patient can be carried out, the treatment plan produced by the treatment planning system (TPS) needs to undergo quality assurance (QA). The QA includes a pre-treatment QA (PT-QA) performed on a synthetic phantom body. During the PT-QA, data is collected from the phantom detectors, from a set of monitors (transmission detectors) and from the angular state of the machine. This thesis project investigates whether it is possible to predict the radiation dose distribution in the phantom body from the transmission detector data and the angular state of the machine. The motivation is that an accurate prediction model could remove the PT-QA from most patient treatments. The prediction difficulties lie in reducing the contamination noise from the transmission detectors and in correctly mapping the transmission data onto the phantom. The task is solved with an artificial neural network (ANN) that uses a u-net architecture to reduce the noise and a novel model that maps the transmission values onto the phantom based on the angular state. The results show a median relative dose deviation of ∼1%.


Acknowledgments

I would like to thank my supervisor Prashant Singh for all the good support and feedback during the development of this project. I would also like to thank Erik Bängtsson from ScandiDos for his expertise in radiation therapy and for the data he provided to the project. Last but not least, I want to thank Simon Strömstedt Hallberg, who started this project with me, for all the good discussions and for introducing me to Tensorflow.


Populärvetenskaplig sammanfattning (Popular Science Summary)

Radiation therapy is a popular method for treating cancer patients today. The method is carried out with a linear accelerator (linac) that concentrates radiation on the tumour region while exposing the surrounding healthy tissue to as little radiation as possible. This is possible because the linac can shape and direct the beam from different angular states. Before a radiation treatment can be carried out on a patient, a so-called treatment plan must be computed and quality assured to ensure that the correct radiation dose is delivered to the patient. The quality assurance partly consists of a time-consuming pre-treatment QA on a synthetic cylindrical body (phantom) equipped with detectors. A transmission detector is mounted in front of the opening of the radiation source to monitor the treatment. This thesis investigates whether the dose distribution on the detectors of the cylindrical body can be computed by training an artificial neural network (ANN) to transform the signal from the transmission detector to the cylindrical body. The main difficulties of the transformation are reducing the noise from the transmission detector, and finding a correct relation between the transmission detector and all the detectors on the cylindrical body as a function of the angular state of the linac. The model developed uses an existing ANN architecture for noise reduction, combined with new solutions for transforming the signal from the transmission detector to the cylindrical body. The result shows a relative median error of around 1% for most of the treatment plans in the test set.


Contents

1 Introduction
  1.1 Radiation Treatment
  1.2 Treatment Plan
  1.3 Treatment Plan QA
  1.4 Pre-Treatment QA
  1.5 At Treatment QA
  1.6 Measurement Setup
  1.7 Problem Formulation
  1.8 Further known properties
2 Theory
  2.1 Artificial Neural Network
  2.2 Optimization and back propagation
  2.3 Mini-Batch Stochastic Gradient Descent
  2.4 Dense Layers
  2.5 Convolutional Layers
    2.5.1 2-Dimensional Convolutional Operation
    2.5.2 SAME Padding
    2.5.3 Max Pooling
    2.5.4 Transposed Convolution
  2.6 U-net Architecture
  2.7 Batch Normalization
3 Model
  3.1 Noise Reduction Filter
  3.2 Transformation Function
  3.3 Decay Function
  3.4 The Complete Model
  3.5 Training the Model
  3.6 Design Settings
    3.6.1 U-net hyper-parameters
    3.6.2 Abs. dense layer hyper-parameters
    3.6.3 Batch size
    3.6.4 Optimizer
    3.6.5 Batch norm
    3.6.6 Dropout filter
4 Additional functions
  4.1 Orthogonal Projection Function
  4.2 Absolute Dense Layer
5 Implementation
  5.1 The Data Set
  5.2 Pre-process and Normalization of the Data
  5.3 Software
6 Results
  6.1 Training Progress
  6.2 Dose Accuracy
  6.3 Sub Functions
7 Discussion
  7.1 Alternative Model
    7.1.1 Transformation Matrix
    7.1.2 Picture to Picture
    7.1.3 Modify the Current Model
8 Conclusions
References


Nomenclature

Acronyms

ANN — Artificial Neural Network
AT-QA — At-Treatment Quality Assurance
CPC — Charged Particle Contamination
linac — linear accelerator
MLC — Multileaf Collimator
MU — Monitor Units
PT-QA — Pre-Treatment Quality Assurance
QA — Quality Assurance
TPS — Treatment Planning System

Data

θ1 — gantry angle rotation
θ2 — collimator angle rotation
P̃ — predicted phantom data
P — target phantom data
T — transmission data

1 Introduction

Radiation therapy is a popular approach towards treating cancer patients. The treatment involves irradiating the affected part of the body using a radiation source. The radiation process must be performed very carefully so that healthy tissues are left untouched by the radiation. In order to achieve this, pre-treatment machine calibration is performed using a 'phantom' that takes the place of the patient and measures the shape of the radiation received. The machine parameters are then tuned in order to achieve the desired shape. This process is often time consuming, expensive, and leads to the machine being unavailable for treatment for extended periods of time. This project explores machine learning models as a replacement for the phantom-driven calibration process. The project is in collaboration with the company ScandiDos, which specializes in quality assurance and dosimetry for modern radiation therapy. The following section describes the problem in detail.

1.1 Radiation Treatment

The radiation treatment used for this project involves a linear accelerator (linac). It consists of a radiation source mounted on a gantry that can rotate around the patient. A collimator is also mounted on the gantry; it is used to shape the radiation field in a way that is compliant with the anatomy of the patient being treated. The collimator can rotate around an axis that spans from the radiation source to the linac isocenter, see Figure 1a. This axis is also known as the beam or field axis. These two rotational degrees of freedom are described by the gantry angle rotation and the collimator angle rotation. The collimator consists of one or two pairs of blocks (or jaws or diaphragms) and a multileaf collimator (MLC). The MLC consists of a number of metal (in general tungsten) leaves that can be moved in and out of the radiation field in order to create an appropriate aperture. A typical MLC consists of 60-80 leaf pairs that move independently of each other. The intensity of radiation is described in terms of monitor units (MU). The higher the MU, the higher the dose being delivered. The correlation between MU and the delivered dose is linear.

Figure 1: (a) A schematic description of the linear accelerator. (b) A treatment room view of a linear accelerator.

1.2 Treatment Plan

A treatment plan is a patient-specific set of machine parameters that describes how the radiation is delivered to the patient. A treatment plan consists of one or several fields, where each field consists of a number of control points. A control point is essentially a moment in time where the gantry and collimator angles are specified together with MLC and block positions, and a specific value in terms of MU. Typically, a treatment field consists of 100 to 200 control points. The treatment plan is generated by a treatment planning system (TPS). The TPS is in general designed to create the plan such that an optimal energy (or dose) is delivered to the tumour while the surrounding healthy tissue is spared. This is done via inverse planning, and it involves a number of models that describe the energy fluence through the linac head, the allowed range of motion of the MLC and blocks, the radiation transport in the patient, etc. In order to verify that these models are correct and that the treatment plan can be delivered by the linac as planned, it is necessary to perform treatment plan quality assurance (QA).

1.3 Treatment Plan QA

For a patient-specific treatment plan, the TPS is used to predict the dose that is delivered to a phantom. A phantom is a detector system or a measurement device that consists of a body (typically a plastic cylinder) enclosing a set of detectors at specific locations. The phantom dose distribution is sampled at the detector positions, which enables prediction of the discrete dose distribution that can be measured. The phantom is placed on the treatment couch and the treatment plan is delivered. During the treatment, the dose delivered to the phantom detectors is measured. This measured dose is compared to the predicted dose, and if the two dose distributions are similar enough, the plan is approved for patient treatment. Note that the dose distribution predicted in, and delivered to, the phantom is not the same dose distribution as in the patient. But as long as the predicted and delivered dose distributions in the phantom are equal, we can conclude that the models in the TPS are accurate enough and that the linac can deliver the dose in the way that is modelled.

1.4 Pre-Treatment QA

The procedure described above is often performed as pre-treatment QA (PT-QA). It is done once, and only once, before the start of the treatment. Note that the most common type of radiation treatment is based on treatment fractions. This means that the therapeutic dose is not delivered all at once. Instead the treatment is split into a number of fractions, often around 30. If there are n fractions, one nth of the total dose is delivered in each fraction. So a typical treatment plan involves a PT fraction where the plan is delivered to the phantom and approved for treatment, followed by a number of treatment fractions with the patient present.

1.5 At Treatment QA

When QA is also performed during the treatment fractions, the scenario is known as at-treatment QA (AT-QA) or online QA. This requires that the dose delivery be monitored by a device that can measure the dose while the patient is treated. ScandiDos have developed a detector called Delta4 Discover for this purpose: a transmission detector mounted on the linac head between the collimator and the patient. The transmission detector measures the energy fluence impinging on the patient. The Discover is designed such that the measured signal is mapped to a dose distribution in the phantom geometry. This is done with a plan-specific calibration performed during the pre-treatment QA fraction. This is known as PT-AT calibration, and it makes it possible to detect anomalies in the dose delivery that may occur during the different fractions.

1.6 Measurement Setup

The detector system considered in this study consists of one transmission detector (Delta4 Discover) and a cylindrical phantom (Delta4 Phantom+). These two units can operate together or independently. The Discover unit is mounted on the linac head and the Phantom+ is placed on the patient couch. When the Discover and the Phantom+ operate together in Synthesis mode, the phantom measurement acts as a calibration of the Discover signal. The Discover unit is equipped with 4040 detectors (semiconductor diodes) ordered in a regular rectangular grid. In the Phantom+ there are 1069 detectors, ordered in two crossing planes. In each plane the detectors are laid out in two grid configurations: a fine grid (5 mm spacing) close to the origin and a coarse grid (10 mm spacing) outside a 6 × 6 × 6 cm³ box around the origin.

Figure 2: Simultaneous measurement (pre-treatment QA) with the Delta4 Discover and the Delta4 Phantom+.

When the Discover and the Phantom are set up as in Figure 2, they collect data synchronously. Data packages consisting of the transmission detector signal and the phantom signal (together with the gantry and collimator angle information) are sampled every 25 ms. In this way it is possible to calibrate the signal from the Discover so that it can be interpreted as a discrete phantom dose distribution. In the existing product used for experiments in this work, Delta4 Synthesis, this calibration must be performed once for every treatment plan that is going to be delivered to a patient.

The reason behind the calibration is that the transmission detector signal is highly contaminated by charged particle radiation that originates in the linac head. The photon radiation used for the patient treatment is collimated in the linac head by different kinds of metal structures. When the photon radiation is obscured by this metal (and other kinds of materials as well), it is absorbed. In the absorption process high-energy electrons are produced. These electrons constitute the charged particle contamination (CPC) of the radiation that is measured by the transmission detector. These electrons, however, do not reach the patient/phantom surface and do not contribute to the delivered therapeutic dose.

1.7 Problem Formulation

The pre-treatment QA procedure is both time consuming and economically expensive for the clinic, since revenue is only generated while patients are treated on the linac, not while the phantom occupies the machine. The goal of this project is to investigate whether it is possible to predict the phantom dose of a treatment plan,

$$P^{tp} = \sum_{i=0}^{n} P^i, \tag{1}$$

based on the set of control points from the transmission detectors $\{T^i \mid i = 0, \dots, n\}$ and the states of the gantry angle rotation $\{\theta_1^i \mid i = 0, \dots, n\}$ and the collimator angle rotation $\{\theta_2^i \mid i = 0, \dots, n\}$. If the dose distribution on the phantom can be predicted accurately enough, the PT-QA procedure could be removed from the treatment plan. The task is solved by modeling an Artificial Neural Network (ANN) that takes the control point data $T^i$, $\theta_1^i$, $\theta_2^i$ as input and outputs a predicted control point phantom vector

$$\tilde{P}^i = f(T^i, \theta_1^i, \theta_2^i). \tag{2}$$

The predicted phantom dose of a treatment plan is the sum of its control point fractions,

$$\tilde{P}^{tp} = \sum_{i=0}^{n} \tilde{P}^i. \tag{3}$$

1.8 Further known properties

An extra obstacle to tackle while modeling the function in Eq. 2 is the decay of the signal while it propagates from the gantry to the detectors on the phantom. ScandiDos have performed measurements to find out how the radiation decays inside the plastic cylinder that encloses the phantom. The measurements were conducted in water tanks that have similar properties to the plastic. They found that different field sizes from the gantry result in slightly different characteristics in the decay of the signal.

Figure 3: Graph showing the decay of a linac signal propagating in water for different field sizes. The window between the dotted lines shows the plastic depth range for the signal before reaching the phantom detectors.

2 Theory

This project explores ANN models as a replacement for the phantom-driven QA process. ANNs have delivered promising results for problems involving denoising images. An ANN has the ability to accurately model highly non-linear patterns. In particular, the u-net architecture has delivered good results in previous work on denoising images. The following text explains artificial neural networks in detail.

2.1 Artificial Neural Network

An Artificial Neural Network (ANN) can be seen as a directed graph, where the input signals pass through a network of nodes to an output of one or more nodes. The nodes inside the ANN are referred to as neurons. The name originates from the idea of trying to mimic the biochemical process of the neurons in the human brain. A neuron takes a set of inputs, multiplies them by a set of trainable weights and sums the products with each other and an additional weight referred to as the bias,

$$f(\mathbf{x}; \mathbf{w}) = \sum_i x_i w_i + b = \mathbf{x} \cdot \mathbf{w} + b, \tag{4}$$

where $\mathbf{w} = \{w_1, \dots, w_n, b\}$. The output of each neuron can also be an input node to another neuron. The number of neurons an input signal has to pass on its way to the output is referred to as the depth of the model. The deeper an ANN is, the more complexity lies in the model, i.e. the higher the ability of the model to learn non-linear patterns [3]. If the neuron operation comprised simply a linear transformation of input nodes to output nodes, a multi-level system of neurons could just be rewritten as a single-level system. To accomplish a more complex model (a non-linear transformation), the linearity of the neuron has to be broken. To achieve this, an activation function is applied to the output of each neuron. The activation function can be any non-linear function with a well-defined derivative. Popular activation functions are the sigmoid function, $f(x) = \frac{1}{1 + e^{-x}}$, and the rectified linear unit (ReLU), $f(x) = \max(0, x)$. The neuron operation for a general activation function $a$ can be written as

$$f_n(\mathbf{x}; \mathbf{w}) = a(\mathbf{x} \cdot \mathbf{w} + b). \tag{5}$$

Figure 4 shows a schematic view of a neuron.


Figure 4: A schematic view of a neuron in an ANN.
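As a minimal illustration of the neuron operation in Eq. 5, the NumPy sketch below implements a single neuron with a ReLU activation; all names and values here are illustrative, not taken from the thesis code.

```python
import numpy as np

def neuron(x, w, b, activation=lambda z: np.maximum(0.0, z)):
    """Single neuron, Eq. 5: weighted sum plus bias through an activation."""
    return activation(np.dot(x, w) + b)

# Three inputs, ReLU activation (Eq. 4 is the expression inside the activation).
x = np.array([0.5, -1.2, 3.0])
w = np.array([0.1, 0.4, -0.2])
print(neuron(x, w, b=0.05))  # relu(-0.98) = 0.0
```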

2.2 Optimization and back propagation

ANN models are a powerful learning tool and enable modeling of complex data sets arising from diverse fields. The data sets typically comprise input-output relationships wherein certain variables form the input to the model, while the output is represented by certain target variables. For a classification task the output would be correct classification predictions; for a regression task the output would be predicted values similar to the target values. For the ANN to realize this, an objective function needs to be defined. An objective function is an optimization (minimization or maximization) of a scalar function referred to as the loss or cost function. The cost function can take different forms depending on the task. For a regression task a common cost function is the summed squared error, defined as

$$C = \frac{1}{m}\sum_{j=0}^{m} \left(\mathbf{P}^j - \tilde{\mathbf{P}}^j\right)^2 = \frac{1}{m}\sum_{j=0}^{m}\sum_{i=0}^{n} \left(P_i^j - \tilde{P}_i^j\right)^2, \tag{6}$$

where m is the number of records and n is the number of outputs of the model. To update the weights of the graph, the cost function is fed to an optimizer. In the context of ANNs, an optimizer typically refers to the gradient descent optimizer. It does three things:

1. It propagates a number of input records through the graph to the cost function and stores all of the neuron outputs,

2. it computes the gradient of the cost with respect to w, defined as

$$\nabla C = \left(\frac{\partial C}{\partial w_1}, \dots, \frac{\partial C}{\partial w_n}\right), \tag{7}$$

3. it updates the weights with a learning rate α,

$$\mathbf{w} := \mathbf{w} - \alpha \nabla C. \tag{8}$$

The gradient of the cost function, Eq. 7, is derived with a method called backpropagation. Because the neurons in the graph are just analytical functions with well-defined derivatives with respect to the weights, and the graph is just a network of neurons connected to a cost function that is also differentiable, it is possible to compute the gradients of the cost function with respect to all the weights. Before describing backpropagation it is convenient to define the following notation:

• $\mathbf{i}_i$ := input vector for neuron i
• $i_{i,j}$ := input j for neuron i
• $o_i$ := output of neuron i
• $\mathbf{w}_i$ := weight vector for neuron i
• $w_{i,j}$ := weight j for neuron i

The cost function Eq. 6 can be written as

$$C = \frac{1}{m}\sum_{j=0}^{m}\sum_{o_i \in \tilde{\mathbf{P}}^j} \left(P_i^j - o_i^j\right)^2, \tag{9}$$

where $o_i = \tilde{P}_i^j$. The derivative of C with respect to any weight $w_{k,l}$ in the graph can be written as

$$\frac{\partial C}{\partial w_{k,l}} = \frac{-2}{m}\sum_{j=0}^{m}\sum_{o_i \in \tilde{\mathbf{P}}^j} \left(P_i^j - o_i^j\right)\frac{\partial o_i^j}{\partial w_{k,l}}, \tag{10}$$

and $\frac{\partial o_i^j}{\partial w_{k,l}}$ is just the partial derivative of the neuron function with respect to $w_{k,l}$, derived as

$$\frac{\partial o_i^j}{\partial w_{k,l}} = \frac{\partial f_n(\mathbf{i}_i^j; \mathbf{w}_i)}{\partial w_{k,l}} = a'(\mathbf{i}_i \cdot \mathbf{w}_i + b_i)\,\frac{\partial\, \mathbf{i}_i^j \cdot \mathbf{w}_i}{\partial w_{k,l}}, \qquad \frac{\partial\, \mathbf{i}_i^j \cdot \mathbf{w}_i}{\partial w_{k,l}} = \begin{cases} i_{k,l}^j & \text{if } w_{k,l} \in \mathbf{w}_i \\ \sum_{o_{ii} \in \mathbf{i}_i^j} \frac{\partial o_{ii}^j}{\partial w_{k,l}} & \text{if } \mathbf{i}_i^j = f(w_{k,l}) \\ 0 & \text{otherwise,} \end{cases} \tag{11}$$

where $a'$ is the derivative of the activation function.
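The thesis implementation used TensorFlow (Section 5.3). As a hedged sketch of the three optimizer steps above, the loop below performs the forward pass, backpropagation and the weight update of Eq. 8 using TensorFlow's automatic differentiation; the eager `GradientTape` API shown here postdates the thesis, so treat it as illustrative rather than the original implementation.

```python
import tensorflow as tf

# Toy regression data (illustrative only).
X = tf.random.normal([128, 3])
P = tf.reduce_sum(X, axis=1, keepdims=True)          # targets

w = tf.Variable(tf.random.normal([3, 1]))
b = tf.Variable(tf.zeros([1]))
alpha = 0.01                                         # learning rate

for step in range(200):
    with tf.GradientTape() as tape:
        P_hat = tf.matmul(X, w) + b                  # forward pass, Eq. 4
        cost = tf.reduce_mean((P - P_hat) ** 2)      # squared-error cost, Eq. 6
    grad_w, grad_b = tape.gradient(cost, [w, b])     # backpropagation, Eq. 7
    w.assign_sub(alpha * grad_w)                     # weight update, Eq. 8
    b.assign_sub(alpha * grad_b)
```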

2.3 Mini-Batch Stochastic Gradient Descent

Performing gradient descent on a complete training set in one iteration can in many cases be computationally infeasible if the training set is too large; the computational cost increases linearly with the number of records it acts on. A solution to this problem is to split the training set into equal subsets and update the weights on each of them. In machine learning these subsets are referred to as batches or mini-batches, depending on their size. The extreme case of batch size one is called stochastic gradient descent (SGD). Given enough iterations SGD will improve the model, but it is very noisy. A compromise is to select a small batch size (10-1000), which is called mini-batch gradient descent, as sketched below.
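A minimal sketch of how a training set can be split into shuffled mini-batches for one epoch; array shapes and the batch size are illustrative.

```python
import numpy as np

def minibatches(X, P, batch_size=128, seed=0):
    """Yield shuffled (input, target) mini-batches covering one epoch."""
    idx = np.random.default_rng(seed).permutation(len(X))
    for start in range(0, len(X), batch_size):
        batch = idx[start:start + batch_size]
        yield X[batch], P[batch]

# Each yielded (X_b, P_b) pair feeds one gradient-descent update.
```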

2.4 Dense Layers

The simplest ANN model is the Dense Neural Network or Fully Connected Neural Network (FC). It consists of several layers of neurons, all fully connected to all neurons from the previous layer. The layers between the input and output are called hidden layers. The name comes from the fact that the output from the intermediate layers is hidden from the user.


Figure 5: A schematic view of an FC network with 3 input nodes, 3 layers of hidden fully connected neurons of various sizes and a single-valued output node.

2.5 Convolutional Layers

For input data that has spatial dimensions, e.g. sound files (1-dim) or images (2-dim), a Convolutional Neural Network (CNN) can be used. The concept is that kernels (trainable filters) perform convolutional operations on the input data and output a feature map. The feature map can be seen as a higher-level meta image, where the values correspond to patterns in the input layer. The idea of convolutional layers is to extract patterns from their input. A common use of the convolutional layer is to pass data through several levels of operation to extract more complex patterns. A big advantage over the classical dense layers is that convolutional layers share weights (kernel parameters) over the whole input space, whereas the connections in the FC network are unique. This saves a lot of computation and memory, and therefore also reduces overfitting compared with an FC network of similar scope. There are various dimensions of the convolutional operation, but in this context (image data) it refers to the 2-dimensional convolutional operation. A kernel is a 3-dimensional filter with trainable parameters. The sizes of the first two dimensions (x, y) of the kernel are hyper-parameters defining the spatial shape of the filter; the third dimension takes the shape of the number of channels of the incoming image, e.g. a gray-scale image has 1 channel, an RGB color image has 3 channels, and the output from a previous convolutional layer can have any number of channels (defined by the number of kernels in that previous layer).

2.5.1 2-Dimensional Convolutional Operation

Figure 6 visualizes a 2-dimensional convolutional operation. The kernel K slides across the width and height of the input image I, computing the dot product between the entries of the kernel and the active (red-shadowed) part of the image, resulting in a 2-dimensional feature map. A convolutional layer consisting of a set of k kernels will generate k 2-dimensional feature maps. These are concatenated into one 3-dimensional feature map where the third dimension acts as channel depth.

Figure 6: A convolutional operation between a 1-channel image I and a 3x3x1 kernel K.

The shape of the feature map depends on the stride length of the convolutional operation and the size of the kernel. With the padding set to VALID it is defined as

$$\text{feature map x-length}_{\text{VALID}} = \left\lceil \frac{\text{input data x-length} - \text{kernel x-length} + 1}{\text{stride length x}} \right\rceil, \qquad \text{feature map y-length}_{\text{VALID}} = \left\lceil \frac{\text{input data y-length} - \text{kernel y-length} + 1}{\text{stride length y}} \right\rceil. \tag{12}$$

The next section describes padding.

2.5.2 SAME Padding

A key to restoring the size of the output feature map after a convolutional operation (under the assumption that the stride is 1 in both directions) is to pad the input image with zeros at the borders. Adding enough padding to restore the spatial dimension is called SAME padding. The shape of the feature map with SAME padding is calculated as

$$\text{feature map x-length}_{\text{SAME}} = \left\lceil \frac{\text{input data x-length}}{\text{stride length x}} \right\rceil, \qquad \text{feature map y-length}_{\text{SAME}} = \left\lceil \frac{\text{input data y-length}}{\text{stride length y}} \right\rceil. \tag{13}$$
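The two shape formulas, Eqs. 12 and 13, can be captured in a small helper; the function below follows the TensorFlow convention for VALID and SAME padding.

```python
import math

def conv_output_length(input_len, kernel_len, stride, padding):
    """Feature-map length along one spatial axis, Eqs. 12 and 13."""
    if padding == "VALID":
        return math.ceil((input_len - kernel_len + 1) / stride)
    if padding == "SAME":
        return math.ceil(input_len / stride)
    raise ValueError(f"unknown padding: {padding}")

# 101-wide input, 3-wide kernel, stride 1:
print(conv_output_length(101, 3, 1, "VALID"))  # 99
print(conv_output_length(101, 3, 1, "SAME"))   # 101
```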


2.5.3 Max Pooling

A common companion to the convolutional layer is a pooling layer applied after the operation. A pooling layer downsamples the feature map, reducing its dimensionality and allowing assumptions to be made about features contained in the binned sub-regions. A pooling operation reduces the computational cost, avoids over-fitting and achieves a degree of spatial invariance of the input. The most common pooling layer is max pooling; it outputs a reduced feature map containing the local maximum values, see Figure 7.

Figure 7: A 2x2 max pooling filter with stride [2,2].

2.5.4 Transposed Convolution

The transposed convolution operation is a backward version of the convolution operation. Here the input is a feature map, and the kernels in the transposed convolution layer up-sample it to the output. The transposed convolution is primarily designed to decode information from features back to images. The typical design is a 2x2 kernel with stride 2, which doubles the spatial dimensions of the input and often halves the feature space.

Figure 8: A transposed convolution operation between a 1-channel feature map F and a 2x2x1 kernel K.

2.6 U-net Architecture

The U-net model was developed for biomedical image segmentation at the Computer Science Department of the University of Freiburg, Germany [5]. The u-net is an encoder/decoder-like model: it encodes the input down to a spatially low, feature-rich space, referred to as the bottleneck representation of the record. The bottleneck can be seen as a feature state of the incoming spatially organized data (image data). From the bottleneck it decodes back to a spatially high dimension. The encoder is composed of several levels, each consisting of 2 convolutional operations, where the first doubles the feature space. Each level except the last ends with a max pooling layer that halves the spatial dimension. The model builds up higher orders of features after each level, enabling more complex patterns to be learned. The decoding part is level-symmetric to the encoder. Each decode level begins with a transposed convolutional layer that doubles the spatial dimension and halves the feature dimension. The output is concatenated with the feature map from the corresponding encoding level (the second convolutional layer). The concatenated layer passes through a convolutional layer before entering the next decode level. See Figure 10 for a visualization of the u-net architecture used in this project. The name is self-explanatory when observing the figure: the shape of the graph is formed as a 'U'. Note: the original paper [5] used VALID padding in the convolutional operations, making the spatial dimensions different at the concatenation operation; this was solved by cropping the encoded feature map to fit the decoded feature map. By using SAME padding in this work, the spatial dimension is restored during each convolutional operation and no cropping is needed. Also, the u-net designed in this project only uses a single convolutional layer after the concatenation operation, in contrast to the design in [5], which uses 2 convolutional operations.
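The sketch below builds a condensed U-net in Keras to make the encoder/decoder structure concrete. It is not the eight-level network of Figure 10: to keep the skip connections shape-compatible it uses spatial dimensions divisible by 2^levels (48x96 instead of the padded 50x101 transmission image) and symmetric 2x2 pooling throughout.

```python
import tensorflow as tf
from tensorflow.keras import layers

def conv_block(x, filters):
    """3x3 convolution, ReLU, batch norm, SAME padding (as in Figure 10)."""
    x = layers.Conv2D(filters, 3, padding="same", activation="relu")(x)
    return layers.BatchNormalization()(x)

def build_unet(height=48, width=96, levels=3, base_filters=8):
    inputs = layers.Input((height, width, 1))
    x, skips = inputs, []
    for level in range(levels):                          # encoder
        x = conv_block(x, base_filters * 2 ** level)
        x = conv_block(x, base_filters * 2 ** level)
        skips.append(x)                                  # saved for the skip bridge
        x = layers.MaxPooling2D(2, padding="same")(x)    # halve spatial dims
    x = conv_block(x, base_filters * 2 ** levels)        # bottleneck
    for level in reversed(range(levels)):                # decoder
        x = layers.Conv2DTranspose(base_filters * 2 ** level, 2,
                                   strides=2, padding="same")(x)
        x = layers.Concatenate()([x, skips[level]])      # skip connection
        x = conv_block(x, base_filters * 2 ** level)     # single conv after concat
    outputs = layers.Conv2D(1, 3, padding="same")(x)     # back to a 1-channel image
    return tf.keras.Model(inputs, outputs)

print(build_unet().output_shape)  # (None, 48, 96, 1)
```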

2.7 Batch Normalization

Batch normalization [6] is an algorithm made primarily for speeding up training. During model training, the weight updates of earlier layers affect the input distribution of the layers next in line. This causes disturbance when fitting the weights of the "forward" layers to operate with the corresponding activation functions. Batch normalization solves this by first normalizing the outputs of a layer (before the activation function), subtracting the batch mean and dividing by the batch standard deviation, and then re-scaling by two trainable parameters γ and β, see Algorithm 1. The idea is that γ and β now control the output distribution that goes into the activation function (γ is the standard deviation and β is the mean), independently of distribution shifts from previous layers. The mini-batch mean and variance are only calculated during training; at inference the model uses a moving mean and variance, updated during training according to Eq. 14. Here k is a hyper-parameter between 0 and 1. The value of k determines how much the current mini-batch statistics (µ_B, σ_B) influence the update of the mean µ and variance σ used at inference. A typical choice is k = 0.99, which means that 1% of the mean and variance is affected by the current mini-batch statistics.

Input: values of x over a mini-batch, B = {x_1...m}; parameters to be learned: γ, β.
Output: {y_i = BN_{γ,β}(x_i)}

$$\mu_B \leftarrow \frac{1}{m}\sum_{i=1}^{m} x_i \qquad \text{(mini-batch mean)}$$

$$\sigma_B^2 \leftarrow \frac{1}{m}\sum_{i=1}^{m} (x_i - \mu_B)^2 \qquad \text{(mini-batch variance)}$$

$$\hat{x}_i \leftarrow \frac{x_i - \mu_B}{\sqrt{\sigma_B^2 + \epsilon}} \qquad \text{(normalize)}$$

$$y_i \leftarrow \gamma \hat{x}_i + \beta \equiv \mathrm{BN}_{\gamma,\beta}(x_i) \qquad \text{(scale and shift)}$$

Algorithm 1: Batch Normalizing Transform, applied to activation x over a mini-batch (from the original batch-norm paper [6]).

$$\mu \leftarrow k\mu + (1-k)\mu_B, \qquad \sigma \leftarrow k\sigma + (1-k)\sigma_B \tag{14}$$
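A minimal NumPy sketch of Algorithm 1 together with the moving-statistics update of Eq. 14; real frameworks track further details (and the batch renormalization variant of Section 3.6.5 differs slightly), so treat this as illustrative.

```python
import numpy as np

class BatchNorm:
    """Batch normalization (Algorithm 1) with moving statistics (Eq. 14)."""
    def __init__(self, dim, k=0.99, eps=1e-5):
        self.gamma, self.beta = np.ones(dim), np.zeros(dim)  # learned scale/shift
        self.mu, self.var = np.zeros(dim), np.ones(dim)      # moving statistics
        self.k, self.eps = k, eps

    def __call__(self, x, training):
        if training:
            mu_b, var_b = x.mean(axis=0), x.var(axis=0)          # batch statistics
            self.mu = self.k * self.mu + (1 - self.k) * mu_b     # Eq. 14
            self.var = self.k * self.var + (1 - self.k) * var_b
        else:
            mu_b, var_b = self.mu, self.var                      # inference mode
        x_hat = (x - mu_b) / np.sqrt(var_b + self.eps)           # normalize
        return self.gamma * x_hat + self.beta                    # scale and shift
```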

3 Model

The model is an ANN, i.e. a graph-based function with trainable parameters. Two assumptions were made on the physical dependency between the incoming data from the transmission image T and the angle state θ1, θ2, and the resulting phantom detector values P, i.e. {T, θ1, θ2} → P:

• The value of a phantom detector $P_i$ depends on only one point on the transmission image, and that point depends on the angular state, i.e. $P_i = f(T_{x_i,y_i})$, where $T_{x_i,y_i}$ is an interpolated point at position $(x_i, y_i)$ on the transmission image T and the point coordinate is determined by the angular state, $(x_i, y_i) = f_i(\theta_1, \theta_2)$.

• The value of a phantom detector $P_i$ can be predicted as the product of the angle-specified point on the transmission image and an additional factor that depends only on the angles, i.e. $P_i = T_{x_i,y_i} \cdot f_i(\theta_1, \theta_2)$.

The main function consists of three smaller task-specific functions. The first function is the noise reduction filter, a u-net convolutional network that reduces the contaminated signal from the transmission detectors discussed in Section 1.6. The second function is a transformation function that maps the values from the transmission detectors to the phantom detectors, i.e. a unique angle-dependent function for every detector in the phantom that interpolates a point from the noise-reduced transmission image to its corresponding phantom detector. The third function is the decay factor, also a unique angle-dependent function for every detector in the phantom, needed due to radiation absorption in the plastic of the body, which reduces the intensity of the signal, as discussed in Section 1.8.

3.1 Noise Reduction Filter

The transmission detectors are placed in an evenly distributed two-dimensional 40x101 grid; this can be seen as a one-channel image, the transmission image. The purpose of the reduction filter is to reduce the noise from the transmission detectors, i.e. reduce the noise from the transmission image. Previous works have shown good results in reducing noise from images using a u-net architecture [1]. The idea with the u-net is that the encoding convolutional layers will extract the noise pattern from the input image and subtract it in the decoding part.

Figure 9: A transmission image sample.

The u-net architecture for this project is shown in Figure 10. Due to the unbalanced relation between the height and width resolution, the transmission image is padded at the left and right with 5 additional zero-valued lines each (40x101 → 50x101). The first max pooling layer has kernel size 1x2 and stride (1,2), so that the following levels in the u-net hierarchy have a more quadratic shape (50x101 → 50x51). The following max pooling layers in the model have kernel size 2x2 and stride (2,2). All of the max pooling, convolution and transposed convolution layers use SAME padding, which avoids cropping of the feature maps at the concatenation phase of the model. The noise reduction filter can be described as the function

$$f_{n.r.}(T_{\text{raw}}; w_{n.r.}) = T_{\text{noise reduced}}. \tag{15}$$

Figure 10: The u-net architecture used in this project. Input: raw transmission image (50x101x1); output: noise-reduced transmission image (50x101x1). The encoder runs from 50x101x8 down to a 2x2x512 bottleneck and the decoder mirrors it back up. Legend: 3x3 convolution layers (ReLU, batch norm, SAME padding, stride (1,1)); 2x2 and 1x2 max pooling (SAME padding, strides (2,2) and (1,2)); 2x2 and 1x2 transposed convolution layers (ReLU, batch norm, SAME padding, strides (2,2) and (1,2)); concatenating skip connections.

3.2 Transformation Function

Figure 11: Visualization of orthogonal radiation propagation from the transmission plane to the phantom.

The transformation function is created under the assumption that the radiation beams propagate almost orthogonally to the transmission plane when hitting the phantom detectors. Under this assumption it is valid to assume that each detector in the phantom depends on one specific point on the transmission image, and that point is angle dependent, e.g. phantom detector i will map its value from a point $(x_i, y_i)$ on the transmission image, determined by a point function $f_p^{(i)}(\theta_1, \theta_2) = (x_i, y_i)$. The point function for all phantom detectors can be written as

$$f_p(\theta_1, \theta_2) = \{(x_i, y_i) \mid i = 1, \dots, 1069\}. \tag{16}$$

The coordinates output by the point function are then used to interpolate the values from the transmission image to the phantom detectors as

$$f_{t.f.}(T, \theta_1, \theta_2) = f_{BL}(T, f_p(\theta_1, \theta_2)) = \mathbf{v} = \{v_i \mid i = 1, \dots, 1069\}, \tag{17}$$

where $f_{BL}$ is a bi-linear interpolation function that takes the coordinates from $f_p(\theta_1, \theta_2)$, interpolates them on the transmission image T, and outputs the vector v with the interpolated value related to each phantom detector. The point function Eq. 16 can be rewritten as

$$f_p(\theta_1, \theta_2; w_{t.f.}) = f_{rot}(\theta_2, f_{proj.}(\theta_1; w_{t.f.})), \tag{18}$$

where $f_{rot}$ is the rotation function and $f_{proj.}$ is the projection function. The projection function can be written as

projection function can be written as

fproj.(θ1; wt.f.) = fo.p.(θ1) + fcorr.(θ1; wt.f.)

={(xproj,i, yproj,i)| i = 1, ..., 1069}

(19) It consists of two terms where the first term is the orthogonal projection function fo.p.. It takes the gantry angle rotation θ1as input and calculates the points in

the transmission image that would project a line orthogonal to the transmission plane to each detector in the phantom (if the collimator rotation, θ2 = 0), see

Figure 11 for a visual interpretation. The second term fcorr. is the correction

term, it takes the gantry angle rotation θ1 as input and outputs a tuple of

coordinates. wt.f. is the set of trainable weights in the Dense Neural Network

described in Figure 12. Note that only the correction term is trainable. The coordinates from Eq. 19 are then fed to the rotation function

$$f_{rot}(\theta_2, x, y) = \begin{pmatrix} \cos\theta_2 & \sin\theta_2 \\ -\sin\theta_2 & \cos\theta_2 \end{pmatrix}\begin{pmatrix} x \\ y \end{pmatrix} = \{(x_{rot,i}, y_{rot,i}) \mid i = 1, \dots, 1069\}, \tag{20}$$

where each coordinate tuple is rotated by the collimator rotation angle θ2.

The rotation function is simply an affine rotation transformation, i.e. a matrix multiplication between a rotation matrix and the coordinate tuples. With the point function of Eq. 18 as input, the transformation function Eq. 17 can now be written as

$$f_{t.f.}(T, \theta_1, \theta_2; w_{t.f.}) = f_{BL}\left(T, f_{rot}(\theta_2, f_{proj.}(\theta_1; w_{t.f.}))\right) = \mathbf{v} = \{v_i \mid i = 1, \dots, 1069\}. \tag{21}$$

Figure 12: The dense neural network used for the correction function. It takes a single-valued argument (θ1) as input and outputs 2138 values (1069 coordinate points).
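A sketch of the geometric core of the transformation function: the rotation of Eq. 20 (here about the coordinate origin) and a plain bi-linear interpolation standing in for $f_{BL}$ of Eq. 17. It assumes the points are given in pixel units inside the image bounds; the real model instead learns part of the coordinates through the correction term.

```python
import numpy as np

def rotate(theta2, xy):
    """Eq. 20: rotate (N, 2) projected coordinates by the collimator angle."""
    c, s = np.cos(theta2), np.sin(theta2)
    R = np.array([[c, s], [-s, c]])
    return xy @ R.T

def bilinear(T, xy):
    """Bi-linearly interpolate image T at (N, 2) pixel coordinates (x, y),
    assumed to lie inside the image (a stand-in for f_BL in Eq. 17)."""
    x, y = xy[:, 0], xy[:, 1]
    x0, y0 = np.floor(x).astype(int), np.floor(y).astype(int)
    dx, dy = x - x0, y - y0
    return (T[y0, x0] * (1 - dx) * (1 - dy) + T[y0, x0 + 1] * dx * (1 - dy)
            + T[y0 + 1, x0] * (1 - dx) * dy + T[y0 + 1, x0 + 1] * dx * dy)

T = np.random.rand(50, 101)                               # toy transmission image
points = rotate(0.1, np.array([[30.5, 20.25], [70.0, 10.75]]))
print(bilinear(T, points))                                # one value per point
```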

3.3 Decay Function

The decay function has the same input and output shape as the transformation function and is designed similarly to the correction term of the transformation function. It uses the same dense neural network, but the output shape is 1069 instead of 2138. The decay function can be written as

$$f_{d.y.}(\theta_1, \theta_2; w_{d.y.}) = \mathbf{d} = \{d_i \mid i = 1, \dots, 1069\}. \tag{22}$$

3.4 The Complete Model

The noise reduction filter Eq. 15, the transformation function Eq. 21 and the decay function Eq. 22 can now be put together as one complete function

$$f_{c.f.}(T, \theta_1, \theta_2; w) = f_{t.f.}(f_{n.r.}(T; w_{n.r.}), \theta_1, \theta_2; w_{t.f.}) \odot f_{d.y.}(\theta_1, \theta_2; w_{d.y.}) = \tilde{\mathbf{P}} = \{\tilde{P}_i \mid i = 1, \dots, 1069\}, \tag{23}$$

where T is the raw transmission image, $\tilde{\mathbf{P}}$ is the predicted phantom vector, $w = \{w_{n.r.}, w_{t.f.}, w_{d.y.}\}$ is the complete set of all the trainable weights, and $\odot$ is the Hadamard product. The cost function

$$C = (\mathbf{P} - \tilde{\mathbf{P}})^2 = (\mathbf{P} - f_{c.f.}(T, \theta_1, \theta_2; w))^2 \tag{24}$$

is defined as the summed squared difference between the target phantom vector P and the predicted phantom vector $\tilde{\mathbf{P}}$. The cost function is fed to an optimizer and the model is then ready to be trained.
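A structural sketch of Eq. 23: the complete model is a composition of the three sub-functions combined by an element-wise product. The stand-in lambdas below only demonstrate the plumbing, not trained behaviour.

```python
import numpy as np

def complete_model(T_raw, theta1, theta2, f_nr, f_tf, f_dy):
    """Eq. 23: noise reduction, then transformation, combined with the
    decay factors through an element-wise (Hadamard) product."""
    v = f_tf(f_nr(T_raw), theta1, theta2)   # interpolated transmission values, Eq. 21
    d = f_dy(theta1, theta2)                # per-detector decay factors, Eq. 22
    return v * d                            # predicted phantom vector

pred = complete_model(
    np.random.rand(50, 101), 0.3, 0.1,
    f_nr=lambda T: T,                                # identity "denoiser"
    f_tf=lambda T, t1, t2: np.full(1069, T.mean()),  # constant "interpolation"
    f_dy=lambda t1, t2: np.ones(1069))               # no decay
print(pred.shape)  # (1069,)
```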

3.5 Training the Model

During training, two data sets are available: a training set on which the model is trained, and a validation set on which the model is evaluated after each epoch of training. Each evaluation compares its result to the prevailing best result; if the new evaluation result is an improvement, the model is saved as the best result. The model is also saved after each epoch (regardless of the evaluation result) as the current model. The idea behind saving two versions of the model is that the best-result model can be used to predict new data, while the current version can be trained in parallel and be fed with new training data. The reason not to train only on the best-result model is that this would cause the model to overfit the validation set over time. There are many ways of scoring the result. The most trivial would be the mean cost value or mean error over all records in the validation set, but the objective is not to produce the lowest cost or error record-wise. The objective is to produce the smallest absolute relative dose deviation, i.e. the deviation between the summed prediction and the summed target, divided by the summed target, over a whole treatment plan (tp), defined element-wise over the 1069 detectors as

$$D^{tp} = \frac{\left|\sum_{j \in tp} \tilde{\mathbf{P}}^j - \sum_{j \in tp} \mathbf{P}^j\right|}{\left|\sum_{j \in tp} \mathbf{P}^j\right|} = \{D_i^{tp} \mid i = 1, \dots, 1069\}. \tag{25}$$

The relative dose deviation $D^{tp}$ consists of 1069 values (one for each phantom sensor). The mean value of $D^{tp}$ can give a misleading picture of the magnitude of the deviation, because some detectors report inflated means. A better way of representing the dose error is to first calculate the median relative dose deviation of every treatment plan in the validation set, and then take the mean over the treatment plan medians,

$$\text{result} = \frac{1}{\text{len}(V)}\sum_{k \in V} \text{median}(D^k), \tag{26}$$

where V is the set of treatment plans in the validation set and len(V) is the number of treatment plans in the validation set.
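A sketch of the scoring in Eqs. 25 and 26, assuming each treatment plan is stored as arrays of shape (number of control points, 1069).

```python
import numpy as np

def median_dose_deviation(P_pred, P_true):
    """Eq. 25: per-detector absolute relative dose deviation of one plan,
    summarized by its median."""
    dose_pred, dose_true = P_pred.sum(axis=0), P_true.sum(axis=0)
    D = np.abs(dose_pred - dose_true) / np.abs(dose_true)
    return np.median(D)

def validation_result(plans):
    """Eq. 26: mean over the plan-wise medians; plans is a list of
    (P_pred, P_true) pairs."""
    return np.mean([median_dose_deviation(p, t) for p, t in plans])
```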

3.6 Design Settings

The number of hyper-parameters in this model is large; due to time limitations, no systematic hyper-parameter optimization algorithm was explored. However, during the design process many different hyper-parameter versions were compared, and this section discusses how the current model was chosen.

3.6.1 U-net hyper-parameters

The shape of the u-net is crucial for the performance of the model. A less dense version, with a smaller number of channels (fewer kernels in the convolutional layers), is computationally cheaper but may not reduce the noise properly. The best performance was found with an eight-level-deep u-net, i.e. a bottleneck layer with spatial dimension 2x2, and with a first convolutional layer consisting of 8 kernels, giving a bottleneck with a feature dimension of 512. Denser models did not perform better.

3.6.2 Abs. dense layer hyper-parameters

The abs. dense layer is used both in the decay function and for the correction term in the transformation function. The shape of the abs. dense layer was not found to be critical for the performance of the model (computation- and precision-wise) as long as it was within the range 100-500. A minor improvement was detected when the angle parameters were initialized evenly between 0 and 2π.

3.6.3 Batch size

With computational efficiency in mind, only power-of-two batch sizes were tested; 2^7 = 128 was found to perform best.

3.6.4 Optimizer

The optimizer used in the model was the Adam optimizer. Due to the individual adaptive learning rates of the Adam optimizer [2], no decay of the learning rate was necessary. The default learning rate of 0.001 was the best of the tested values (0.01, 0.001, 0.0001).

3.6.5 Batch norm

The batch norm was used in its extended version, batch renormalization [4], allowing the training and inference modes to use the same moving statistics (µ and σ). The paper recommends starting the additional hyper-parameters r_max and d_max at 1 and 0 and then slowly increasing them to 3 and 5. The model training was found to be faster by leaving them unconstrained from the start (r_max = ∞ and d_max = ∞).

3.6.6 Dropout filter

The model was first integrated with dropout filters [7] after each convolution layer in the noise reduction filter. The dropout did not improve the performance on the validation set (or the training set), so it was removed from the model.

4 Additional functions

This section contains algorithms designed during the course of this project.

4.1 Orthogonal Projection Function

The orthogonal projection function is used to find the coordinates on the unrotated transmission plane from which each phantom detector would be irradiated, under the assumption that the radiation beam propagates orthogonally to the transmission plane. For this, the coordinates of each phantom detector are needed. This is the only part of the complete model where additional information from the phantom coordinates is used. The orthogonal projection function $f_{o.p.}$, used as the first term in the projection function Eq. 19, is defined as

$$f_{o.p.}(\theta_1) = (x_t, y_t), \qquad x_t = x_p \cos(\theta_1) - z_p \sin(\theta_1), \qquad y_t = y_p, \tag{27}$$

where $(x_p, y_p, z_p)$ are the coordinate tuples of the phantom detectors.
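A direct transcription of Eq. 27, vectorized over all 1069 phantom detectors; the detector coordinates here are random placeholders.

```python
import numpy as np

def orthogonal_projection(theta1, phantom_xyz):
    """Eq. 27: project each phantom detector onto the unrotated
    transmission plane for gantry angle theta1."""
    xp, yp, zp = phantom_xyz.T
    xt = xp * np.cos(theta1) - zp * np.sin(theta1)
    return np.stack([xt, yp], axis=1)   # one (x_t, y_t) point per detector

points = orthogonal_projection(0.5, np.random.rand(1069, 3))
print(points.shape)  # (1069, 2)
```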

4.2 Absolute Dense Layer

This layer was designed to transform single-valued inputs to multiple-valued outputs, where each output gives a measurement of how close the input is to some center value $\theta_{c,i}$ corresponding to output $o_i$. The layer architecture is similar to the classical dense neuron layer, but with abs. neurons instead. The abs. neuron acts as a hat function with 3 trainable parameters: $\theta_c$, $V_{max}$ and $k$. The abs. neuron is defined as

$$f_{abs}(\theta) = \mathrm{relu}\left(V_{max} - |\theta_c - \theta| \cdot |k|\right). \tag{28}$$

Figure 13: Abs. neuron function with parameters θc = 4, Vmax = 0.75 and k = 2.
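A direct transcription of the abs. neuron, Eq. 28, evaluated with the parameters of Figure 13.

```python
import numpy as np

def abs_neuron(theta, theta_c, v_max, k):
    """Eq. 28: a 'hat' function peaking at theta_c with height v_max;
    |k| controls the slope of the flanks."""
    return np.maximum(0.0, v_max - np.abs(theta_c - theta) * np.abs(k))

theta = np.array([3.5, 4.0, 4.5])
print(abs_neuron(theta, theta_c=4.0, v_max=0.75, k=2.0))  # [0.   0.75 0.  ]
```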

5 Implementation

5.1 The Data Set

ScandiDos provided this project with 37 treatment plans with a total of 126,878 control points. A control point contains 4040 transmission detector values, one gantry angle, one collimator angle and 1069 phantom detector values. The smallest treatment plan consists of 1564 control points and the largest plan consists of 7500 points. There are 5 different types of treatment plans in this data set:

• 2 Ani plans (tumors in the rectum and/or the colon),
• 3 Esof plans (esophagus; tumors in the food tract),
• 4 HoN plans (head and neck; tumors in the region above the lungs, but not in the brain),
• 4 Lung plans (lung tumors),
• 24 Prost plans (prostate tumors, in the pelvis region of male patients).

The 37 treatment plans are divided into 3 subsets: training data, test data and validation data. To minimize the dependency between the subsets, the plan types are divided evenly between them. The training data set got 1 Ani plan, 2 Esof plans, 3 HoN plans, 2 Lung plans and 19 Prost plans. The validation data set got 1 Esof plan, 1 Lung plan and 1 Prost plan. The test data set got 1 Ani plan, 1 HoN plan, 1 Lung plan and 4 Prost plans.

5.2 Pre-process and Normalization of the Data

A rational assumption is that no detector should have a negative value. Before normalizing the input transmission values and target phantom values, negative values are therefore set to zero. Two pre-processing goals were deemed critical when choosing the normalization algorithm: avoid negative values, and keep the input transmission data and the target phantom data at a similar magnitude, preferably around 1. A simple approach would be to scale the transmission data with a factor so that its mean value is of magnitude 1, and do the same with the phantom data; these constants could for example be the mean of all the transmission data and the mean of all the phantom data. Here we chose to normalize the input transmission values by multiplying them by 1e-4, and the target phantom values by multiplying them by 1e+4. The input angles were transformed from degrees to radians.
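The pre-processing described above amounts to a few lines; the sketch below assumes raw NumPy arrays and the two scale factors from the text.

```python
import numpy as np

def preprocess(T_raw, P_raw, angles_deg):
    """Clip negative detector values to zero, rescale both signals to
    roughly unit magnitude, and convert angles from degrees to radians."""
    T = np.maximum(T_raw, 0.0) * 1e-4   # input transmission data
    P = np.maximum(P_raw, 0.0) * 1e4    # target phantom data
    return T, P, np.deg2rad(angles_deg)
```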

5.3 Software

The project was implemented in the programming language Python using the framework Tensorflow, an open-source software library for dataflow programming across a range of tasks.

6 Results

The results are divided into three subsections: Training Progress, Dose Accuracy and Sub Functions. The training progress plots, Figures 14, 15 and 16, show the progress of the control points during training. During training, the validation data set was used for quality measurements after each training epoch. Measurements on the training data set were done while training over the epoch, so the training set result at epoch n is a mean result between epoch n−1 and epoch n. One could argue that the indexing on the training set should state epoch n−½ instead of n, but for the visualized plots over 51 epochs it does not differ much. The error measurement used for the control points is defined as

$$\text{error} = \frac{|\tilde{\mathbf{P}} - \mathbf{P}|}{|\mathbf{P}|}. \tag{29}$$

For the treatment plan accuracy, different statistics of the relative dose deviation are used instead. The argument for using different measurements lies in avoiding the divisions by zero that occur in the control point data.

The dose accuracy subsection shows the treatment plan accuracy on the test set for the training version with the highest accuracy with respect to the validation set. Figure 17 shows histograms of the relative dose deviation, defined as

$$\text{relative dose deviation} = D^{tp} = \frac{\sum_{j \in tp} \tilde{\mathbf{P}}^j - \sum_{j \in tp} \mathbf{P}^j}{\left|\sum_{j \in tp} \mathbf{P}^j\right|} = \{D_i^{tp} \mid i = 1, \dots, 1069\}, \tag{30}$$

for all the packages from the test set. The mean and median of the absolute relative dose deviation, Eq. 25, are indicated at the top of every histogram. Note the difference between the absolute relative dose deviation and the relative dose deviation: the visualization of the relative dose deviation gives a clearer picture of whether the prediction is over- or under-produced. A desired outcome is a deviation centered around 0%. Figure 18 compares the predicted and target phantom detector dose distributions on the two crossing phantom planes x = 0 and z = 0. As additional information, the error measurement used for the control point accuracy, Eq. 29, is given at the top of each deviation plane. The third subsection, Sub Functions, visualizes the behaviour of the sub-functions of the main model. Figure 19 visualizes the noise reduction function for two sample control points from the test set. Figure 20 visualizes the point function (from the transformation function) with the initial orthogonal projection points and with the added trainable correction term points for different angle states. Figure 21 shows the decay function for different angle states.

6.1 Training Progress

Figure 15: Mean error progress during training: training set and validation set.

Figure 16: Mean median dose error progress during training: validation set.

6.2 Dose Accuracy

Figure 17: Histograms showing the relative dose deviation between prediction and target over all detectors in the phantom.

Figure 18: Scatter plots of the predicted, target and deviation dose distributions visualized over the two crossing planes x=0 and z=0; the error, Eq. 29, on each plane is given in brackets at the top of the two deviation plots. Panels: (a) Ani1.1, (b) HoN1.1, (c) Lung1.1, (d) Prost1.1, (e) Prost4.1, (f) Prost6.1, (g) Prost9.2.

6.3 Sub Functions

Figure 19: Noise reduction filter: input sample image (left) and output sample image (right).

Figure 20: Scatter plot of the point function from the phantom detectors on the x = 0 plane hitting the transmission plane for different angles. The dotted region corresponds to the area of the transmission detectors.


Figure 21: Decay function acting on the two crossing planes: x=0, z=0, for various angles.

7 Discussion

The dose error on the test set shows satisfactory results, with a median deviation of ∼1% for almost every treatment plan. A substantial error contribution comes from the detectors at the border of the y-axis, as can be seen clearly in Figure 18 a, d and f. The cause of these errors could be that the transmission detector placement is too narrow. Figure 20 of the point function could confirm that theory: the structure of the points is broken outside the area of the transmission image. Another theory for the bad results at the border is that the point function cannot "walk" back into the active part of the image after it has reached the padding area. A point that has completely entered the zero-padding area (all four transmission detectors that the point is interpolated from are part of the zero-padding) has no cost gradient for movement in any direction. The decay function visualized in Figure 21 shows some pattern breaks at the detectors on the border of the y-axis, and also at some detectors next in line from the border. This confirms the previous statements. Elsewhere the decay function acts according to expectations: the decay of the signal is proportional to the distance the x-ray propagates inside the cylinder before reaching the detector. The noise reduction filter visualized in Figure 19 shows a smoother image after the signal passes the filter. A strange property revealed in the figure is the vertical stripes that appear in the darker part of the plot. The reason for this phenomenon is unclear. A theory could be that it has something to do with the uneven last transposed convolution stride the model has to take to fit the dimension of the transmission detector distribution.

7.1 Alternative Model

This subsection discusses alternative models that were abandoned during the course of the project for various reasons. For future work, these models may still be worth drawing inspiration from.

7.1.1 Transformation Matrix

A tempting approach was to design a transformation matrix M that would transform the transmission data T to the phantom plane P (here T acts as a vector instead of an image), with each element of M having its own angular function $M_{i,j} = f_{i,j}(\theta_1, \theta_2; w_{i,j})$ with trainable weights. This matrix would replace both the transformation function and the decay function of the current model. The noise-reduced image is flattened ([50,101] → [4040]) and then undergoes a matrix multiplication with the matrix M, see Eq. 32. This approach has several pros over the current model. The biggest challenge is to define the angular functions for each element correctly. The only model tested was a pre-rotated (θ2) abs. dense network similar to the ones used for the correction term in the transformation function and the decay function. The design is shown in Figure 12, where the output layer would be of size 4,318,760 (= 4040 × 1069), one output for every element of the transformation matrix. Even though the training of the model converged to something better than guessing, it performed far worse than the current model. The pros of having a transformation matrix compared to the current model are:

• A simpler model.
• More general (no need to assume orthogonality of the propagation, and no need for the first physical assumption that the current model relies on).
• No need for any spatial information from the phantom detectors (or any information at all from the phantom detectors).

The cons are, in addition to the difficulty of finding a good angular function, the computational cost of training a model with a matrix of that size.

$$\mathbf{P} = M(\theta_1, \theta_2; w_{t.m.}) \times \mathbf{T} \tag{31}$$

$$f_{alt.1}(\mathbf{T}, \theta_1, \theta_2; w) = M(\theta_1, \theta_2; w_{t.m.}) \times f_{n.r.}(\mathbf{T}; w_{n.r.}) = \tilde{\mathbf{P}} = \{\tilde{P}_i \mid i = 1, \dots, 1069\} \tag{32}$$

7.1.2 Picture to Picture

Another tempting approach was to build an encoder/decoder model similar to the u-net, but where the decoder transforms into the corresponding points in the phantom instead of a noise-reduced version of the incoming image. This approach would be even simpler than the transformation matrix approach discussed above, and it does not need any of the physical assumptions that the current model relies on. In contrast to the u-net, the decoder in this model cannot use concatenation bridges from the encoder; the reason is that the feature map from the encoder is locally connected to the spatial dimension that is to be transformed. The angular information is a crucial component for this model to work. A reasonable choice is to put it in the bottleneck of the model, so that the angular information is treated as a feature. This approach failed when tested on a model similar to the u-net used in this project but without the concatenation contribution.

7.1.3 Modify the Current Model

A modification of the current model that was never tested is, instead of the element-wise multiplication (the Hadamard product) between every pair of transformed transmission image and decay function output, to create a function that maps every output tuple independently. This modification would drop the second physical assumption from the current model.

8 Conclusions

This work presents a proof of concept for predicting the dose in the phantom geometry from the signal in the transmission detector, together with the angle state information from the gantry and collimator, using a trained ANN. 4 out of 7 treatment plans had a median absolute relative dose deviation below 1% (the worst treatment plan had 2.37%). This fulfilled the project goal specified by ScandiDos: "The error in the ANN dose in the phantom must be less than 1% compared to measurement or TPS dose". The u-net architecture for denoising the transmission detector image had shown good results in earlier works. What was new in this project was the additional task of transforming the information from the moving transmission plane on the gantry to the two crossing planes inside the phantom. The two physical assumptions made for the model proved to work sufficiently accurately for transforming the signal between the planes. The data set provided in this work is just a fraction of all the data ScandiDos can deliver to the model. A first realization could be to have the model running in the background while new PT-QA procedures are running. The new treatment plan data can then validate the model before being mixed into the training data set for continued training. When enough validations are done with proper accuracy, the pre-treatment part could be removed from the Quality Assurance.


References

[1] Dan Nguyen, Troy Long, Xun Jia, Weiguo Lu, Xuejun Gu, Zohaib Iqbal and Steve Jiang. A feasibility study for predicting optimal radiation therapy dose distributions of prostate cancer patients from patient anatomy using deep learning. arXiv:1709.09233 [physics.med-ph], 2017.

[2] Diederik P. Kingma and Jimmy Ba. Adam: A method for stochastic optimization. arXiv:1412.6980 [cs.LG], 2014.

[3] Guido F. Montufar, Razvan Pascanu, Kyunghyun Cho and Yoshua Bengio. On the number of linear regions of deep neural networks. Advances in Neural Information Processing Systems 27, 2014.

[4] Sergey Ioffe. Batch renormalization: Towards reducing minibatch dependence in batch-normalized models. arXiv:1702.03275 [cs.LG], 2017.

[5] Olaf Ronneberger, Philipp Fischer and Thomas Brox. U-net: Convolutional networks for biomedical image segmentation. arXiv:1505.04597 [cs.CV], 2015.

[6] Sergey Ioffe and Christian Szegedy. Batch normalization: Accelerating deep network training by reducing internal covariate shift. arXiv:1502.03167 [cs.LG], 2015.

[7] Nitish Srivastava, Geoffrey Hinton, Alex Krizhevsky, Ilya Sutskever and Ruslan Salakhutdinov. Dropout: A simple way to prevent neural networks from overfitting. Journal of Machine Learning Research, 15:1929–1958, 2014.
