
Dose planning from MRI using machine learning for automatic segmentation of skull and air

Jens Sjölund (jsjol@kth.se)
April 26, 2012

Supervisor: Håkan Nordström Examiner: Mats Danielsson

TRITA-FYS 2012:24

ISSN 0280-316X

ISRN KTH/FYS/–12:24–SE


Abstract

The superior soft tissue contrast and inherent patient safety of MRI make it preferable to CT for many imaging tasks. However, the electron density information provided by CT makes it useful for dose calculations in radiotherapy.

If these calculations could instead be based solely on MRI, the patient would be spared additional ionizing radiation and the health provider the time and cost of an additional examination.

In this thesis the possibility of achieving this is investigated using a machine learning algorithm called support vector machines to segment head MRI images into soft tissue, bone and air. To train the algorithm, a large set of registered MRI and CT images of the same patients was used.

The results were evaluated on five test patients using Monte Carlo simulations. An important finding was that the threshold value used to segment the bone in the CT images had a large influence on the prediction performance. Moreover, the results indicate that there are significant variations in bone density among patients, an aspect with important implications for the accuracy of the dose calculations.

Nevertheless, the dose calculations performed hold promise of drastically increased accuracy, with relative errors reduced from 2-3 % to less than 1 % in regions where few rays pass through air-filled cavities. Where this is not the case, the local prediction outcome in the target area has a large impact on the dose calculation accuracy.


Acknowledgements

I would like to thank my colleagues at Elekta for their interest and support when writing this thesis. In particular, my warmest gratitude goes to my supervisor Håkan Nordström for his enthusiasm and careful proofreading. Also, I would like to thank Andreas Eriksson Järliden and Anders Lundin for their patience when taking the time to help out with their expertise.

Contents

1 Introduction
  1.1 Imaging modalities
    1.1.1 Computed tomography
    1.1.2 Magnetic resonance imaging
  1.2 Problem specification
  1.3 Overview of the field
    1.3.1 Segmentation based on local information
    1.3.2 Segmentation based on a priori information

2 Support vector machines
  2.1 Theoretical background
    2.1.1 Linear separation
    2.1.2 Non-separable data
    2.1.3 Nonlinear separation
    2.1.4 Optimization procedure
    2.1.5 Practical aspects
  2.2 Implementation
    2.2.1 Feature selection
    2.2.2 Training data
    2.2.3 Evaluation of performance
    2.2.4 Parameter tuning

3 Dose calculations
  3.1 Energy deposition in photon beams
  3.2 Methods for dose calculation
  3.3 XVMC
  3.4 Bulk density assignment
    3.4.1 Bone
    3.4.2 Air

4 Results
  4.1 Results on the prediction
    4.1.1 Registration
    4.1.2 Parameter tuning
    4.1.3 Prediction performance
  4.2 Dosimetric evaluation
    4.2.1 Bone
    4.2.2 Air

5 Discussion
  5.1 Registration
  5.2 Feature selection
  5.3 Aspects of the SVM training and tuning
    5.3.1 Size of training set
  5.4 Prediction
    5.4.1 Bone
    5.4.2 Air
  5.5 Dosimetric evaluation
    5.5.1 Bone
    5.5.2 Air
  5.6 The major sources of error
  5.7 Alternative approaches

6 Conclusions

A Graphs on the result of the dose calculations

Bibliography

Chapter 1

Introduction

Since its introduction in the 1970s, magnetic resonance imaging (MRI) has developed into an indispensable tool for medical imaging. Its superior soft tissue contrast and inherent patient safety make MRI preferable to other modalities such as computed tomography (CT) for many imaging tasks. This is certainly true for the brain, whether for visualization of brain disorders or in radiotherapy planning. In this thesis we will focus on the latter.

In radiotherapy planning the MRI images are used to delineate the target and sensitive structures, referred to as organs-at-risk. Naturally, the overall goal of the radiotherapy is to deliver a specified radiation dose to the target while keeping the dose to the organs-at-risk at a minimum.

However, there is one major obstacle that prevents straightforward use of MRI images for the whole planning process: they do not provide information that directly relates to where dose is absorbed. This, on the other hand, is exactly what a CT scan does. To draw on the benefits of both it is therefore common to combine their information. This is done by matching image sequences from both modalities to each other in a process called image registration [1].

If resources for the extra CT scan are lacking, or if it is otherwise inadvisable, a different, simplified approach is also conceivable. The idea behind this approach is to approximate all tissue properties as equivalent to those of water. However, this may result in deviations between the planned dose and the actual dose delivered, in particular in areas close to the skull bone [2, 3].

An improvement of the dose planning based only on MRI, possibly making the additional CT scan redundant, is therefore desirable. The patient benefits would include reduced radiation exposure and better dose planning compared to the water approximation. It would also benefit the health provider, since the time and cost of the extra scan could be spent elsewhere.

Hence, the aim of this thesis is to develop and evaluate a method capable of achieving this.

Regarding the structure of the thesis, the rest of chapter 1 provides a more in-depth background on the problem setting as well as an outline of related work. Following this, chapter 2 describes a proposed solution based on machine learning. The methods used for evaluating it are described in chapter 3.

Thereafter, the results are presented in chapter 4 and discussed in chapter 5.


1.1 Imaging modalities

In order to properly understand the problem at hand this section is meant to provide an introduction to the imaging modalities in question, namely computed tomography (CT) and magnetic resonance imaging (MRI). A more exhaustive coverage can be found in for example [4, 5].

1.1.1 Computed tomography

The idea behind CT is to collect a large number of X-ray projection images from different angles and to combine them in order to image a thin slice of the patient’s 3D geometry. This is achieved by rotating an X-ray source around the patient while continuously imaging. The final image is thereafter reconstructed, often by a method known as filtered backprojection.

Just as for projection imaging, CT images show how much X-rays are attenuated in different parts of the body. For the energy ranges used in radiotherapy this attenuation is predominantly due to Compton scattering, which in turn depends mainly on the electron density. At the lower energies of CT imaging the photoelectric effect may also be important, depending on the type of tissue.

Still, it holds true that differences in electron density provide the main part of the image contrast [4]. For dose planning this is in fact exactly the information needed to calculate where interactions and subsequent energy deposition occur.

Although we defer the details of this calculation to chapter 3, the usefulness of CT in dose planning is evident.

1.1.2 Magnetic resonance imaging

Magnetic resonance imaging makes use of a phenomenon known as nuclear magnetic resonance. It is inherently a quantum mechanical phenomenon and as such a proper treatment of its physical background requires concepts from quantum mechanics. Before going into this, a brief overview is provided; readers unfamiliar with quantum mechanics may thereafter go ahead to the next section, which describes some of the different techniques used in MRI.

In short

Nuclear magnetic resonance is based on the fact that protons and neutrons, which make up every atomic nucleus, have an intrinsic quantum property called spin.

In hand-waving terms one may say that when applying an external magnetic field this spin will align either parallel or anti-parallel with the magnetic field.

This results in an energy difference between the two states corresponding to the energy of a radio frequency photon. Consequently, radio frequency emitters and receivers can be used to probe the image subject and measure the strength of the re-emitted signal from different areas. Among biologically relevant elements, this effect is by far easiest to make use of for hydrogen, since it is both the most abundant and the one with the strongest interaction with an externally applied magnetic field. Therefore, the strength of the signal is related to the distribution of hydrogen atoms.


Physical background

A suitable starting point for a quantum mechanical treatment of MRI is the relation between the total angular momentum J and the corresponding magnetic moment µ,

\vec{\mu} = \gamma \vec{J}    (1.1)

where the proportionality constant γ is known as the gyromagnetic ratio and depends on the particle. In the case of hydrogen, whose nucleus consists of a single proton, the gyromagnetic ratio is

\gamma = 2.675 \cdot 10^{8}\ \mathrm{rad/(s\,T)}    (1.2)

Moreover, the total angular momentum is quantized, and a measurement of, say, the z component of any atomic or nuclear angular momentum leads to integer or half-integer multiples of ℏ, which is Planck's constant divided by 2π. Thus,

J_z = m_j \hbar, \quad m_j = -j, -j+1, \ldots, j    (1.3)

Here m_j is called the magnetic quantum number, which in turn depends on the quantum number j that is associated with the magnitude of the total angular momentum according to

J^2 = j(j+1)\hbar^2    (1.4)

Note that the operators J_z and J^2 are simultaneously measurable since they commute.

When an external magnetic field \vec{B} = B\hat{z} is applied to a particle with gyromagnetic ratio γ, the potential energy is

E = -\vec{\mu} \cdot \vec{B} = -\gamma m_j \hbar B    (1.5)

When considering the special case of a single proton (spin 1/2) at rest in this external magnetic field it holds that \vec{J} = \vec{S}, and the corresponding Hamiltonian operator is

H = -\gamma B S_z = \begin{pmatrix} -\tfrac{1}{2}\hbar\omega_0 & 0 \\ 0 & \tfrac{1}{2}\hbar\omega_0 \end{pmatrix}    (1.6)

The quantity ω0 = γB is called the Larmor frequency, for reasons soon to be explained. It follows by inspection that the Hamiltonian (1.6) has the eigenstates

\psi_+ = \begin{pmatrix} 1 \\ 0 \end{pmatrix}    (1.7)

\psi_- = \begin{pmatrix} 0 \\ 1 \end{pmatrix}    (1.8)

with the corresponding eigenvalues

E_\pm = \mp \tfrac{1}{2}\hbar\omega_0    (1.9)

which shows that the energy is lowest when the magnetic moment is parallel to the external field.


Now, following [6], we turn our attention to the time evolution of the magnetic moment \vec{\mu} = \gamma\vec{S}. The solution to the time-dependent Schrödinger equation,

i\hbar \frac{\partial \psi}{\partial t} = H\psi    (1.10)

can be expressed using the eigenstates as

\psi(t) = c_+\psi_+ e^{-iE_+t/\hbar} + c_-\psi_- e^{-iE_-t/\hbar} = \begin{pmatrix} c_+ e^{i\omega_0 t/2} \\ c_- e^{-i\omega_0 t/2} \end{pmatrix}    (1.11)

where the constants c_+ and c_- are determined by the initial conditions. From the normalization requirement it follows that a natural way of rewriting these is as c_+ = cos(α/2) and c_- = sin(α/2) with a fixed angle α. This turns out to be helpful when calculating the expectation value of \vec{S} (and \vec{\mu}, since they are related by a constant). The result is that

\langle\mu_x\rangle = \gamma\,\langle\psi(t)|S_x|\psi(t)\rangle = \frac{\gamma\hbar}{2}\sin\alpha\cos(\omega_0 t)    (1.12)

\langle\mu_y\rangle = \gamma\,\langle\psi(t)|S_y|\psi(t)\rangle = -\frac{\gamma\hbar}{2}\sin\alpha\sin(\omega_0 t)    (1.13)

\langle\mu_z\rangle = \gamma\,\langle\psi(t)|S_z|\psi(t)\rangle = \frac{\gamma\hbar}{2}\cos\alpha    (1.14)

This shows that for a constant magnetic field the expectation value of the magnetic moment \vec{\mu} is tilted at a constant angle α to the direction of the magnetic field and precesses about it at the Larmor frequency ω0.

For the spin 1/2 particle under discussion, the energy difference between the states with parallel and anti-parallel spin (m_j = ±1/2) is given by equation (1.5) to be

\Delta E = \hbar\omega_0    (1.15)

According to the Maxwell-Boltzmann theory in statistical mechanics, the probability of finding a system (a proton in this case) with energy ε in contact with a reservoir at temperature T (body temperature) is given by

P(\epsilon) = \frac{e^{-\epsilon/k_B T}}{Z}    (1.16)

where the partition function Z serves as a normalization factor and k_B is Boltzmann's constant. The consequence of this is that there is a larger probability of finding a proton in the low energy state, i.e. with spin parallel to the magnetic field.

At physiological temperatures and with a magnetic field of 1 T, equation (1.16) gives that there are about 3 spins per million more in the low energy state.

This might not seem much, but considering that the number of protons in an image sample is on the order of Avogadro's constant (6.022 \cdot 10^{23}), it produces an observable magnetic moment when summed.
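In the limit ΔE ≪ k_B T, the Boltzmann factors in equation (1.16) give a relative excess of roughly ΔE/(2 k_B T) in the low energy state. As a sanity check on the quoted "3 per million", a minimal Python sketch with approximate constants (assuming B = 1 T and a body temperature of 310 K):

```python
# Rough numerical check of the spin excess at B = 1 T and T = 310 K.
hbar = 1.0546e-34      # J*s, reduced Planck constant
k_B = 1.381e-23        # J/K, Boltzmann constant
gamma = 2.675e8        # rad/(s*T), gyromagnetic ratio of hydrogen, eq. (1.2)
B, T = 1.0, 310.0

delta_E = gamma * hbar * B          # energy splitting, eq. (1.15)
excess = delta_E / (2 * k_B * T)    # relative excess in the low energy state
print(f"excess per million spins: {excess * 1e6:.1f}")   # prints ~3.3
```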

In equilibrium, this excess number of protons in the low energy state produces a net magnetization in the direction of the magnetic field, referred to as the longitudinal magnetization. The component of the magnetization orthogonal to the applied field is called the transverse component. In equilibrium it is zero, since the spins are randomly oriented in the xy-plane and therefore average out.


Imaging techniques

In order to actually image something, one makes use of the fact that a photon with an energy that exactly matches the energy difference can make the spin flip. Such photons are typically in the radio frequency range for magnetic fields on the order of 1 T.

By applying such a radio frequency pulse during a limited time the net magnetization can be changed. In order to see how, first consider a coordinate system that rotates around the z-axis at the Larmor frequency, i.e.

\hat{x}' = \hat{x}\cos(\omega_0 t) - \hat{y}\sin(\omega_0 t)    (1.17)

\hat{y}' = \hat{x}\sin(\omega_0 t) + \hat{y}\cos(\omega_0 t)    (1.18)

\hat{z}' = \hat{z}    (1.19)

Assuming that the radio frequency pulse is left circularly polarized and has a time-dependent magnetic field strength B_1(t), its associated magnetic field can be written simply as \vec{B}_1(t) = B_1(t)\hat{x}'. From a derivation similar to the one for the static case one may show [5] that the expectation values of the magnetic moment after applying the pulse for a time τ are

\langle\mu_{x'}(\tau)\rangle = \langle\mu_{x'}(0)\rangle    (1.20)

\langle\mu_{y'}(\tau)\rangle = \langle\mu_{y'}(0)\rangle\cos\Phi + \langle\mu_{z'}(0)\rangle\sin\Phi    (1.21)

\langle\mu_{z'}(\tau)\rangle = -\langle\mu_{y'}(0)\rangle\sin\Phi + \langle\mu_{z'}(0)\rangle\cos\Phi    (1.22)

where we have defined the flip angle

\Phi = \int_0^\tau \gamma B_1(t)\, dt    (1.23)

Physically, the flip angle is the angle with which the orientation of the net magnetization has changed due to the radio frequency pulse.

When the pulse is turned off the sample will start to relax to its equilibrium state. However, this happens on two different time scales for the transverse and longitudinal magnetization.

In the case of the longitudinal magnetization, the relaxation time is limited by how fast the excited spins release their energy to the surrounding lattice. This leads to an exponential regrowth of the longitudinal magnetization

M_z(t) = M_0\left(1 - e^{-t/T_1}\right)    (1.24)

where M_0 is the equilibrium magnetization and T_1 is a material dependent time constant. Scans that primarily achieve contrast from differences in T1, called T1-weighted scans, give a strong signal from fat, gray matter and white matter, whereas the signal from CSF, bone and air is weak.

The mechanism behind the behavior of the transverse magnetization is different. Due to variations in the local fields from its neighbours, the precessional frequencies of the individual spins spread out with time. This is commonly referred to as dephasing. Since the net transverse magnetization is the sum of all the individual transverse components, this causes an exponential decay of the transverse magnetization

M_{xy}(t) = M_{xy}(0)\, e^{-t/T_2}    (1.25)


Figure 1.1: Examples of MRI images acquired using (a) T1-weighted and (b) T2-weighted sequences on different patients. Note that the outermost bright areas are not bone but skin.

Tissue type            T1 (ms)   T2 (ms)
White matter           780       90
Gray matter            900       100
Cerebrospinal fluid    2400      160

Table 1.1: Comparison of approximate T1 and T2 relaxation constants at a magnetic field strength of 1.5 T. The values are taken from [4].

where T2 is also a material dependent time constant that characterizes the speed of the decay. However, the presence of inhomogeneities in the external magnetic field accelerates the dephasing. The resulting decay rate is characterized by a machine-dependent time constant T2*. Depending on the sequence in which the radio frequency field is turned on and off, the resulting signal depends on either T2 or T2*. Two common sequences are called spin echo (T2) and gradient recalled echo (T2*). The images produced from T2- or T2*-weighted scans display a strong response from CSF and an intermediate response from fat, gray matter and white matter. However, the signal from both bone and air is still weak. Figure 1.1 illustrates the different signal responses from T1- and T2-weighted scans.

The explanation for the differences in contrast is that the two types of magnetization relax on different time scales, where T1 is much longer than T2. This holds even though the magnetic field strength influences T1 relaxation but has an insignificant effect on T2 decay. Some example values of the relaxation times are shown in table 1.1. It is the differences in T1 and T2, along with proton density variations, that are the sources of the excellent soft tissue contrast in MRI.

One of the major challenges of dose planning using only MRI is the fact that bone has an extremely short T2 relaxation time, 0.4–0.5 ms [7], which makes it technically difficult to sample. Recently, the possibility of doing so using an MRI sequence called ultrashort echo time (UTE) has gained attention [7, 8, 9], but such sequences have still not been widely employed.

Because of the short relaxation time, conventional MRI sequences result in a very weak response from bone, thus making it difficult to distinguish from air.

This is particularly troublesome since bone is the material that attenuates radiation the most, whereas air barely attenuates it at all.

Image distortion

An additional complication when using MRI for treatment planning is image distortion. It may lead to incorrect tumour delineation as well as systematic errors during image registration [10]. In contrast, CT is considered to be geometrically accurate.

There are two main components responsible for the image distortion [11].

The first and typically dominant part is the machine-dependent one. It arises due to inhomogeneities in the static or gradient magnetic fields. The second component is the tissue-dependent part that stems from susceptibility differences and chemical shift.

In general the machine-dependent part can be measured and characterized for different systems, and methods to correct for it are commonly provided by the vendor. These corrections mitigate the problems with image distortion but do not remove them completely [12].

Tissue-type distortions, on the other hand, are patient dependent and thus harder to correct for.

Although image distortion is something that needs to be taken into account when evaluating MRI as the sole modality for treatment planning, it is not the main topic of this thesis. Consequently, while keeping its effects in mind, image distortion will be intentionally overlooked in what is to come.

1.2 Problem specification

After this introduction to the imaging modalities in question we are now able to elaborate on the aim of this thesis as well as the problems that need to be overcome.

First of all, the critical problem is that the MRI images do not provide electron density information, which is essential for accurate dose calculations.

As was mentioned before, a first approach is to approximate all tissue properties as equivalent to those of water. A refinement of this approach is to identify the tissue that deviates the most from this approximation, namely bone, and assign a different attenuation value to that tissue. This is commonly referred to as bulk density assignment and has been shown to give a significant reduction of the dose error [2, 13]. For the same reason it is valuable to identify air, which barely attenuates the radiation at all.

Performing the described segmentation has typically been done by having an expert delineate the bone, one image slice at a time. Due to the time required, this is not an appealing option in practice. Hence, the need for an automated segmentation method is evident.

Consequently, the aim of this thesis is to design a prototype algorithm capable of robustly segmenting MRI brain images into air, bone and soft tissue, and furthermore to evaluate its performance.


The next section is meant to provide an overview of previous work in this direction. It is interesting to note that although we are concerned with applications in radiation treatment planning, the same problem is also of interest for a number of other applications. In summary these are:

• PET/MRI attenuation correction. In order to compensate for attenuation inside the body when imaging with PET, density information is needed. However, it is difficult to also equip combined PET/MRI machines with CT imaging capabilities.

• EEG and MEG source modeling. To enable an accurate modeling of the electric sources when examining the brain it is important to distinguish the skull from soft tissue due to the significant differences in conductivity between the materials. This information is readily obtainable from CT images but not from MRI images, which are typically the ones used.

• Scatter correction in Cone-Beam CT (CBCT). When using CBCT for verification of patient positioning before radiotherapy treatment, photon scattering inside the patient leads to a degradation of image quality. This is remedied in a bootstrapping fashion, where an image is first reconstructed while neglecting scattering and then used for scatter correction in a second iteration. If the planning MRI could immediately be used for scatter correction it would reduce the time in which the treatment machine is occupied.

1.3 Overview of the field

Several of the algorithms described in this section are in fact intended for what is referred to as skull stripping, which is to segment brain from non-brain tissue without further considering the composition of the non-brain tissue. The reason for this is that many types of brain analysis are facilitated by considering only the brain. Yet others of the articles cited are concerned with this subsequent analysis and focus on the segmentation of the brain into different types of soft tissue.

Despite these differences many of the ideas presented may prove valuable for the problem at hand.

1.3.1 Segmentation based on local information

An image processing viewpoint

Since the segmentation problem is an image processing problem by nature, it seems reasonable to try to solve it using image processing methods. Especially popular in this respect are so-called morphological operations; see for example [14] for an introduction.

Based on the assumption that the skull varies smoothly in shape and topology it is possible to segment the skull using a combination of morphological operations and thresholding [15]. It has also been shown that under the same conditions the watershed transform can be used to perform skull stripping [16]. In fact, a hybrid algorithm that combines this watershed algorithm with a deformable surface model, as described below, forms the basis of a widely used skull stripping algorithm [17]. One of the main reasons for this is that it is included in the Freesurfer software package, which is freely available.

An objection against these methods is that the regularity assumption can only be guaranteed in the upper part of the skull, which makes their reliability for other areas questionable. Furthermore, the inability to distinguish air from bone based only on image intensity adds to their weaknesses.

The same criticism is valid for another proposed method based on first applying the Radon transform to the images, then performing image processing in the transform space, followed by an inverse transformation [18].

Gaussian mixture models

In earlier work it has been shown that Gaussian mixture models can be used for classification of soft tissue into white matter, gray matter and cerebrospinal fluid if one also considers the neighbourhood of a pixel [19]. Due to the aforementioned difficulty with air and bone being indistinguishable in most MRI studies, this cannot be straightforwardly used for bone segmentation. However, it has recently been shown that by using a sequence with ultrashort echo time, such that signal from bone is acquired, the Gaussian mixture model can be used to construct a substitute CT [7].

Machine learning

Machine learning is a branch of artificial intelligence concerned with algorithms that allow computers to learn from examples. The principal tasks in machine learning are classification and regression. Classification, also referred to as pattern recognition, aims at learning to distinguish between examples and to predict their class. In our case this could be to determine whether or not a pixel corresponds to bone. Regression, on the other hand, can be used to approximate an unknown function, which would be of interest when trying to construct a substitute CT based on an MRI image.

Since in our case there are both CT and MRI images of each patient available the CT images can be used to provide the correct output when the input is taken from the corresponding MRI image. In machine learning this is referred to as supervised learning.

There is a plethora of such machine learning algorithms, of which we will consider a few that have been used for segmentation of MRI images: artificial neural networks [20, 21], support vector machines [22, 23, 24] and AdaBoost [25].

Artificial neural networks are inspired by biological neural networks. They are commonly used to model complex relationships between input and output and to find patterns in data. They are versatile but their training and tuning is time-consuming.

In contrast, support vector machines have a narrower aim in that they have traditionally only been used for classification. However, they are typically much easier to train than artificial neural networks and tend to generalize very well to unseen data. These properties qualify support vector machines for a first attempt at solving the given problem. Hence, we will return to them in chapter 2 for a proper introduction.


The last algorithm, AdaBoost, is in fact a meta-algorithm that combines the results from several weak classifiers to make a stronger one. It has been successfully used to combine several support vector machines that take different features as inputs [25].

1.3.2 Segmentation based on a priori information

In the methods described above, global segmentation is done based on local image properties in what could be referred to as a low-level manner. In contrast, there are high-level methods that make use of a priori information on the expected appearance and then adapt this to the image analyzed. Due to this information they are more stable against local image artefacts than the low-level approaches [26]. The high-level approaches are usually divided into deformable models and atlas-based models, although hybrid versions are a common sight [27].

Deformable models

Deformable models perform segmentation by starting from a previously learned shape and then deforming it by minimizing an energy functional that balances local image information and general smoothness constraints [28, 29]. A popular approach is to use so-called level sets [30] in combination with an additional term that deforms the contour towards a previously learned shape [31, 32, 33]. One of the major advantages of using level sets is that they provide an efficient way of representing contours, especially during possible topological changes. Critics however point out that if training samples vary too much it may result in the learning of invalid shapes [26].

Other types of deformable models, known as active shape models, learn the mean shape as well as the most important modes of variation from a set of training examples [34, 35]. The mean model is then adapted to the actual image by adding linear combinations of the modes of variation. An extension of this, known as active appearance models [36], also takes the texture of the object into account. These types of deformable models impose stricter shape constraints than level sets, and proponents of the latter also like to point out that active appearance models suffer from possible numerical instability and difficulty in handling topological changes [37].

The software package FSL [38], which is freely available for non-commercial use, is capable of performing skull segmentation from T1 images, preferably in combination with T2 images. It is based on a deformable model but appears to be largely based on heuristics [39].

Atlas-based segmentation

Atlas-based segmentation starts off by having an expert label structures of interest in one or several images and combining them into an image referred to as the atlas. The atlas is then transformed to the image being analyzed in such a way that the similarity between them is maximized. Segmentation is then achieved by applying the same transformation to the labels. Depending on the types of transformations allowed the accuracy differs; see [40] for a comparison.


As was mentioned before, atlas-based segmentation and deformable models are frequently combined [37, 41, 42, 43]. Still, a problem with all such methods is that they require the existence of a smooth deformation or transformation between the expected and the analyzed shape. This requirement may not be fulfilled in the presence of tumors or other pathologies.


Chapter 2

Support vector machines

Support vector machines (SVMs) are among the most popular algorithms in modern machine learning. The main reason for this is that they often provide significantly better classification performance than other algorithms on datasets of moderate size [44]. However, computational complexity may limit their use on very large datasets.

The idea behind SVMs, which accounts for their excellent generalization to unseen data, is to not only find a way of separating data but to find the best way of doing it. This chapter intends to explain how this is done as well as demonstrate how it can be applied to the problem of segmenting bone.

2.1 Theoretical background

A central problem in the theory behind SVMs is how to find a generalized model based on a finite data set. In the tutorial [45], which underlies this presentation, the difficulties are explained as follows:

“Roughly speaking, for a given learning task, with a given finite amount of training data, the best generalization performance will be achieved if the right balance is struck between the accuracy attained on that particular training set, and the “capacity” of the machine, that is, the ability of the machine to learn any training set without error. A machine with too much capacity is like a botanist with a photographic memory who, when presented with a new tree, concludes that it is not a tree because it has a different number of leaves from anything she has seen before; a machine with too little capacity is like the botanist’s lazy brother, who declares that if it’s green, it’s a tree. Neither can generalize well.”

We will begin by considering the simplest case of linear SVMs trained on separable data and step by step progress to the most general case of nonlinear SVMs trained on non-separable data. As was mentioned in the introduction, SVMs have primarily been used for classification tasks. Consequently, that will be the focus of the treatment below.


Figure 2.1: Example of a two-dimensional classification task where the red dots are positive examples and the blue dots are negative examples. The black line is called the decision boundary. The distance between the red and blue lines is the margin, and the points that lie on either of these lines are the support vectors.

2.1.1 Linear separation

Assume that we are provided with n training data points, x_i, each belonging to either of two classes denoted by the class indicator y_i ∈ {−1, 1}. For now we will assume that the two classes are linearly separable, as illustrated in figure 2.1. For the sake of discussion, let us assume that the red points in the figure belong to the positive examples, i.e. have y_i = +1.

In cases where the data is linearly separable there is typically an infinite number of hyperplanes (straight lines in two dimensions) capable of performing the separation. After training, new data points are classified based on which side of the hyperplane they end up on, which explains why it is also referred to as the decision boundary. The question we have set out to address is therefore how to find the optimal separating hyperplane. In order to eventually be able to formulate this as an optimization problem, some definitions are needed.

Following [45], denote the shortest distance from the separating hyperplane to the closest positive and negative example by d_+ and d_- respectively. Moreover, we define the margin to be the sum of d_+ and d_-.

The margin is a central concept in the theory of SVMs since in the case of linearly separable data the optimal separating hyperplane is the one that maximizes the margin.

Another definition used is that the positive and negative examples that are at the minimum distance from the optimal separating hyperplane are called the support vectors. As we will see later, these are the only data points an SVM actually uses to make classifications.

To find the size of the margin we first note that points x on the separating hyperplane satisfy

w \cdot x + b = 0    (2.1)

where w is normal to the hyperplane and b is a constant. This means that classification of a new data point is done based on whether the left hand side of equation (2.1) turns out positive or negative. By an appropriate choice of scaling for w and b, points lying on or outside the margin can be written

x_i \cdot w + b \geq +1 \quad \text{for } \{i \mid y_i = +1\}    (2.2)

x_i \cdot w + b \leq -1 \quad \text{for } \{i \mid y_i = -1\}    (2.3)

which can equivalently be written

y_i(x_i \cdot w + b) - 1 \geq 0 \quad \forall i    (2.4)

By comparing points, x_\pm, on the lines through the support vectors to the corresponding nearest point, x, on the decision boundary we find that

w \cdot (x_\pm - x) = \pm 1    (2.5)

\implies d_+ = d_- = \frac{1}{\|w\|}    (2.6)

Consequently, the margin is maximal when \|w\| is minimized. To find the optimal separating hyperplane we may therefore solve

minimize_{w,b}   \frac{1}{2}\|w\|^2    (2.7)

subject to   y_i(x_i \cdot w + b) - 1 \geq 0, \quad i = 1, \ldots, n    (2.8)

This is referred to as the primal formulation of the optimization problem. It is a quadratic optimization problem and as such it could in principle be solved using standard techniques. However, the later generalization to nonlinear SVMs is greatly facilitated by converting it to the corresponding dual problem (see for example [46] for details). Hence, introduce Lagrangian multipliers α_i ≥ 0, one for each data point, giving the Lagrangian

L = \frac{1}{2}\|w\|^2 - \sum_{i=1}^{n} \alpha_i \left( y_i(x_i \cdot w + b) - 1 \right)    (2.9)

From the requirement that the gradient with respect to w and b is zero it follows that

0 = w - \sum_i \alpha_i y_i x_i    (2.10)

0 = \sum_i \alpha_i y_i    (2.11)


By substituting these expressions into the Lagrangian, the dual problem to (2.7)–(2.8) is found to be

maximize_{\alpha}   \sum_i \alpha_i - \frac{1}{2}\sum_{i,j} \alpha_i \alpha_j y_i y_j \, x_i \cdot x_j    (2.12)

subject to   \sum_i \alpha_i y_i = 0    (2.13)

\alpha_i \geq 0, \quad i = 1, \ldots, n    (2.14)

In the solution to this problem there are typically only a few non-zero α_i; these correspond to the support vectors. All the other training points are effectively thrown away.

Since the dual problem presented above is a convex quadratic optimization problem, the Karush-Kuhn-Tucker (KKT) conditions are both necessary and sufficient for its solution (again, see [46] for further details).

2.1.2 Non-separable data

In reality it is common that the input data is impossible to separate perfectly. Furthermore, outliers may single-handedly distort the decision boundary, thereby impairing the generalization. In order to accommodate such occurrences, the concept of soft margins, which allow for some degree of misclassification, is introduced. Mathematically, this is done by adding slack variables, ξ_i, to the primal problem, which then takes the form

minimize_{w,b,\xi}   \frac{1}{2}\|w\|^2 + C\sum_{i=1}^{n}\xi_i    (2.15)

subject to   y_i(x_i \cdot w + b) - 1 + \xi_i \geq 0    (2.16)

\xi_i \geq 0, \quad i = 1, \ldots, n    (2.17)

The constant C controls the degree of misclassification allowed. A large value means that only slight deviations are tolerated. This carries over nicely to the dual problem for which the only difference is that the Lagrangian multipliers become bounded from above by C, i.e.

0 \leq \alpha_i \leq C \quad \forall i    (2.18)

replaces condition (2.14).

2.1.3 Nonlinear separation

As the dimensionality increases it becomes increasingly easy to find a hyperplane that linearly separates the data. For high dimensional feature spaces the linear SVM described above may therefore perform well. However, the same cannot be expected in lower dimensions. The solution is therefore to transform the low dimensional data into a potentially very high dimensional space where it can be linearly separated, and then transform back. In the original feature space this has the effect of a nonlinear separation.

Using the kernel trick, the linear SVMs can be generalized to perform this nonlinear separation in a remarkably straightforward way. The first thing to note is that in the dual optimization problem above the training data only appears in the form of dot products. Next, consider a mapping Φ from the low dimensional feature space to a high or even infinite dimensional Hilbert space H.

For the transformed data the optimization problem would then be formulated in terms of dot products in H, i.e. as Φ(xi) · Φ(xj). If it is possible to find a kernel, K, such that

K(x_i, x_j) = \Phi(x_i) \cdot \Phi(x_j)    (2.19)

then the computationally expensive dot products in the transformed space can be replaced by computations of the kernel function using vectors in the original, low dimensional, space. Hence, the transformation and inverse transformation can be done without ever needing to evaluate the high dimensional vectors Φ(x_i); all that is needed is a suitable kernel function.

Popular examples of basis functions with corresponding kernels are

• polynomials of order up to k > 0, with kernel

  K(x_i, x_j) = (1 + x_i \cdot x_j)^k    (2.20)

• sigmoid functions with parameters κ and δ, with kernel

  K(x_i, x_j) = \tanh(\kappa\, x_i \cdot x_j - \delta)    (2.21)

• Gaussian radial basis function (RBF) expansions with a parameter σ that controls the smoothness of the boundary. The corresponding kernel is

  K(x_i, x_j) = e^{-\|x_i - x_j\|^2/(2\sigma^2)}    (2.22)

There is no general theory regarding when to use a particular kernel, but experience suggests that the RBF kernel often results in very good boundaries and should be used as the default choice [47]. In our case we note that since Gaussian mixture models have been applied with some success, it seems likely that the Gaussian RBF kernel would work well too.
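As a concrete illustration of a soft-margin SVM with the RBF kernel (2.22), a minimal sketch using scikit-learn is given below. Note that the thesis itself works with LIBSVM/SVMlight/MATLAB; the sketch merely assumes a feature matrix X and labels y in {−1, +1}, and maps the σ of kernel (2.22) to scikit-learn's gamma parametrization.

```python
# Minimal sketch of training a soft-margin SVM with a Gaussian RBF kernel.
from sklearn.svm import SVC

def train_rbf_svm(X, y, C=1.0, sigma=1.0):
    # scikit-learn writes the RBF kernel as exp(-gamma * ||xi - xj||^2),
    # so gamma = 1 / (2 * sigma^2) reproduces kernel (2.22).
    gamma = 1.0 / (2.0 * sigma ** 2)
    clf = SVC(kernel="rbf", C=C, gamma=gamma)
    clf.fit(X, y)
    return clf

# clf.predict(X_new) then evaluates the sign of the discriminant function (2.23).
```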

After training, classification of a new data point, x, is done by evaluating the sign of the discriminant function

\hat{y}(x) = w \cdot x + b = \sum_{i=1}^{n} \alpha_i y_i K(x_i, x) + b    (2.23)

The value of b is determined by averaging the value obtained from (2.4) over the support vectors, i.e.

b = \frac{1}{|N_s|}\sum_{i \in N_s} \left( y_i - w \cdot x_i \right) = \frac{1}{|N_s|}\sum_{i \in N_s} \left( y_i - \sum_j \alpha_j y_j K(x_i, x_j) \right)    (2.24)

where Ns is the set of indices corresponding to the support vectors.
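To make equation (2.23) concrete, the sketch below evaluates the discriminant function directly from the support vectors of a trained scikit-learn SVC (an assumption carried over from the previous sketch; dual_coef_ stores α_i y_i, intercept_ stores b, and gamma is assumed to equal 1/(2σ²)).

```python
# Sketch: evaluate the discriminant function (2.23) from the support vectors.
import numpy as np

def discriminant(clf, x, sigma):
    sv = clf.support_vectors_               # support vectors x_i
    alpha_y = clf.dual_coef_.ravel()        # alpha_i * y_i
    K = np.exp(-np.sum((sv - x) ** 2, axis=1) / (2 * sigma ** 2))  # kernel (2.22)
    return float(alpha_y @ K + clf.intercept_[0])

# np.sign(discriminant(clf, x, sigma)) should then agree with clf.predict(x[None, :]).
```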


2.1.4 Optimization procedure

At first glance the quadratic optimization problem we are faced with might appear to be a typical one that is tractable by standard algorithms. For small problems this is indeed true. However, as the problem size increases a distinguishing feature starts to emerge: the kernel matrix, with elements K_{ij} = K(x_i, x_j), is typically not sparse. This makes standard algorithms, which explicitly store the Hessian matrix, very inefficient or even inapplicable due to the associated memory requirements.

This complication has led to the development of an optimization routine called Sequential Minimal Optimization (SMO) [48]. The idea behind SMO is to decompose the problem into a series of smallest possible sub-problems, which are solved analytically. Compared to other methods it is less complex and does in fact not even require a numerical optimization solver. The general procedure can be outlined as follows. First, the pair of Lagrangian multipliers that violate the KKT conditions the most is selected [49]. These are then jointly optimized, and a new pair is selected until all of them satisfy the KKT conditions.

Research is still being done on how to improve the convergence rate by making an optimal selection of the pair of multipliers. Likewise, other types of decompositions have also been considered. Still, SMO as it has been described here is guaranteed to converge and has proven highly effective.

2.1.5 Practical aspects

Writing an efficient SVM trainer from scratch is not a trivial task. Fortunately, some high performing alternatives have been made publicly available. Two frequently employed by the scientific community are LIBSVM [50] and SVMlight [51]. Moreover, MATLAB provides a commercially available SVM trainer in its Bioinformatics Toolbox.

Even though these software packages handle the major part of the training there are some aspects that the user needs to be aware of [47].

Regarding the preprocessing of the data it is of key importance that each feature is scaled to a fixed range, e.g. [0, 1]. This ensures that the features are given the same importance and also avoids numerical difficulties. Note that the same scaling has to be used on both the training and the test data.

Another thing to consider is how to find the optimal parameters for the kernel and the soft margin. In fact, there is no analytic way of doing this. Instead, it is done by cross-validating the performance while varying the parameters over an exponential sequence. This process is easily expedited by parallelization.
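A minimal Python sketch of the two practical points above (feature scaling to [0, 1] reused for test data, and a cross-validated search over exponentially spaced parameter values) is given below. The grid ranges are illustrative, not the values used in the thesis.

```python
# Sketch: [0, 1] feature scaling plus cross-validated exponential grid search.
import numpy as np
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import MinMaxScaler
from sklearn.svm import SVC
from sklearn.model_selection import GridSearchCV

pipe = Pipeline([("scale", MinMaxScaler(feature_range=(0, 1))),
                 ("svm", SVC(kernel="rbf"))])
param_grid = {"svm__C": np.logspace(-2, 4, 7),      # exponential sequence for C
              "svm__gamma": np.logspace(-4, 2, 7)}  # and for the kernel width
search = GridSearchCV(pipe, param_grid, cv=5)
# search.fit(X_train, y_train); the scaler fitted on the training data is then
# reused automatically by search.predict(X_test), so both sets share one scaling.
```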

2.2 Implementation

This section concerns the implementation of a support vector machine for the problem at hand. More precisely, it describes the features selected and the processing steps carried out to extract these as well as the method used to evaluate the performance.


2.2.1 Feature selection

A key aspect of any accurate machine learning algorithm is the choice of features on which to base the prediction. In particular, the mere number of features is important to consider, since the amount of data needed increases quickly with the size of the feature space.

A possible solution to this would be to use the AdaBoost algorithm, which combines learners trained using different feature sets [25]. For simplicity we will however begin by using the feature set described in [22]. It is made up of 25 elements that will be further explained below; in short there are 1 probability value, 3 spherical coordinates, 9 intensity values in the gradient direction and 12 intensity values (offsets of up to ±2 voxels along each Cartesian coordinate axis).

Image processing

In order to reduce the overall computational burden an image mask containing the region of interest was created. The main purpose was to remove the stereotactic headframe that all patients in the study wore, as well as most of the outside air.

A stepwise description of the image processing done to remove the stereotactic headframe and extract the bone from the CT images is outlined in figure 2.2. The only difference when extracting air is that a different threshold is used. For the MRI images a similar procedure was used, with the exception that no high thresholding was done.

It is worthwhile pointing out that the values chosen for the thresholding that extracts bone have a major impact on the end result. However, a detailed treatment of this is deferred to section 3.4.

Coordinate system

Spherical coordinates are used since they are likely to give a more natural representation of the head than their Cartesian counterparts. From a practical point of view the conversion from Cartesian to spherical coordinates is straightforward, provided that an origin is specified. For that, the center of mass of the mask described above was chosen.

Image gradient

The purpose of selecting the 9 closest intensity values along the gradient direction was to recognize boundaries in the vicinity of the voxel in question. To determine an approximation of the image gradient a 3D Sobel filter was used.

It considers only the 26 nearest neighbours, performs smoothing in the directions perpendicular to the derivative direction and takes the central difference along it.

Since the images considered here are in general anisotropic with respect to voxel size, the central difference part needs to be slightly modified by dividing by the voxel size in the derivative direction. Figure 2.3 illustrates the absolute value of the gradient in each direction.

By projecting the total gradient onto the unit length vectors from the center voxel to each of its neighbours, the most suitable discrete representation of the gradient direction could be determined, as sketched below.
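A minimal Python sketch of these two steps follows, assuming a 3D intensity volume and a voxel_size triple; the helper names are illustrative rather than the thesis's actual implementation.

```python
# Sketch: anisotropy-corrected 3D Sobel gradient and its discrete direction.
import numpy as np
from itertools import product
from scipy import ndimage

def sobel_gradient(volume, voxel_size):
    # One Sobel derivative per axis; dividing by the voxel size along the
    # derivative direction accounts for anisotropic voxels.
    return np.stack([ndimage.sobel(volume, axis=ax) / voxel_size[ax]
                     for ax in range(3)], axis=-1)

# The 26 neighbour offsets of a voxel, normalized to unit length.
OFFSETS = np.array([o for o in product((-1, 0, 1), repeat=3) if any(o)])
UNIT_OFFSETS = OFFSETS / np.linalg.norm(OFFSETS, axis=1, keepdims=True)

def discrete_gradient_direction(gradient_at_voxel):
    # Pick the neighbour whose direction has the largest projection onto
    # the continuous gradient, i.e. the discrete step towards the boundary.
    projections = UNIT_OFFSETS @ gradient_at_voxel
    return OFFSETS[np.argmax(projections)]
```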



Figure 2.2: The steps taken to remove the stereotactic headframe and extract the bone. (a) A slice of the original 3D CT analyzed. (b) Result of applying a threshold determined by Otsu's method. (c) Result of performing a closing operation with a rectangular structuring element and subsequent hole filling. (d) Final mask after having applied a threshold at 90 %, an opening operation and finally selecting the largest contiguous volume. The image appears disjoint since it is only a cross-section of the actual 3D volume. (e) Result of applying the mask to the original CT image. (f) Bone extracted by selecting intensity values between 40 % and 85 %.
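A rough Python sketch of the pipeline in figure 2.2 is given below, using scikit-image and SciPy. It assumes, for illustration only, that the percentage thresholds refer to fractions of the maximum CT intensity and that ball-shaped structuring elements are an acceptable stand-in for the ones actually used.

```python
# Sketch of the masking and bone-extraction steps in figure 2.2 (assumed thresholds).
import numpy as np
from scipy import ndimage
from skimage import filters, morphology

def head_mask_and_bone(ct):
    fg = ct > filters.threshold_otsu(ct)                       # step (b)
    fg = morphology.binary_closing(fg, morphology.ball(3))     # step (c): closing
    fg = ndimage.binary_fill_holes(fg)                         # ... and hole filling
    hi = ct.max()
    mask = fg & (ct < 0.9 * hi)                                # drop the bright frame
    mask = morphology.binary_opening(mask, morphology.ball(2)) # step (d): opening
    labels, _ = ndimage.label(mask)
    sizes = np.bincount(labels.ravel()); sizes[0] = 0
    mask = labels == sizes.argmax()                            # largest contiguous volume
    bone = mask & (ct > 0.4 * hi) & (ct < 0.85 * hi)           # step (f): bone window
    return mask, bone
```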

Figure 2.3: Absolute values of the gradient components |Gx|, |Gy| and |Gz|, arrived at by applying a 3D Sobel filter to an MRI image and correcting for differences in voxel size. The seemingly strange naming of the two uppermost images is due to the fact that the coordinate system used has its origin in the top left corner, with the x-axis pointing down and the y-axis pointing right.


Figure 2.4: A slice of the probability distribution arrived at by registering CT images from 753 patients and averaging the bone extracted from them. The intensity of each pixel corresponds to the observed probability of there being bone at that pixel.

Probability atlases

The idea behind including a probability value is to capture the normal appearance, thereby ensuring that the prediction will not differ significantly from what can be expected. In order to acquire an estimate of this probability distribution, a data set consisting of CT images from 753 patients was used.

First, one of these patients was selected as a template to provide a common frame of reference throughout. Next, the stereotactic headframe was removed from the template according to the steps shown in figure 2.2, and at the same time the origin of the coordinate system was determined. Thereafter the other patients were registered, one at a time, to the template using the mutual information algorithm [1]. In order to properly register images of two different patients it is necessary to include the image scaling in all three dimensions as parameters in the optimization procedure. The degrees of freedom in the registration are thus three translations, three rotations and three scalings.

Lastly, the bone and air were extracted using the method described above and the probability distribution was estimated by averaging over all patients. Figure 2.4 shows a slice from the bone atlas.
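Once the masks are registered to the template frame, the atlas itself is simply a voxelwise average; a minimal sketch (function name illustrative):

```python
# Sketch: build a probability atlas by averaging registered binary bone masks.
import numpy as np

def build_atlas(registered_bone_masks):
    """registered_bone_masks: iterable of equally shaped boolean volumes."""
    atlas, n = None, 0
    for mask in registered_bone_masks:
        atlas = mask.astype(np.float32) if atlas is None else atlas + mask
        n += 1
    return atlas / n   # observed probability of bone at each voxel

# For training, the patient in question is left out of the average, as noted in the text.
```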

Figure 2.5: An overview of the registrations performed in the process of extracting features used for training, relating the training CT, template CT, template MRI and training MRI to each other and to the bone atlas. Each arrow corresponds to a registration, with the direction specifying the target. The dashed arrows indicate a registration in which scale transformations are allowed, whereas for the solid arrows they are not.

When acquiring training data the image in question was removed from the bone atlas so as to not include the true values in the input.

2.2.2 Training data

As always with machine learning algorithms, the performance of a support vector machine improves with the amount of data used for training it. Unfortunately, so does the time required for training. More precisely, for n support vectors the complexity is between n^2 and n^3, depending on the size of the constant C that penalizes misclassifications [49].

Since a whole CT or MRI study contains a vast number of voxels, and it is also desirable to use data from several patients, it is not feasible to use all the available data for training. Instead, a limited number of voxels is sampled from each patient.

A purely random sampling would contain relatively few examples of bone and many of the air outside the head. To counteract this, half of the training samples are drawn from voxels known to be bone, whereas the other half is randomly selected using the mask described previously. Hence, there will be a slight overweight of training examples from bone, but not by much. In contrast, the training data for air was sampled randomly.
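A minimal sketch of this sampling scheme for the bone classifier follows; the function and argument names are illustrative and the exact sample counts are not those of the thesis.

```python
# Sketch: half the training voxels from known bone, half uniformly from the head mask.
import numpy as np

def sample_training_voxels(bone_mask, head_mask, n_samples, seed=None):
    rng = np.random.default_rng(seed)
    bone_idx = np.flatnonzero(bone_mask.ravel())
    mask_idx = np.flatnonzero(head_mask.ravel())
    half = n_samples // 2
    chosen = np.concatenate([rng.choice(bone_idx, half, replace=False),
                             rng.choice(mask_idx, n_samples - half, replace=False)])
    return np.unravel_index(chosen, bone_mask.shape)   # voxel coordinates
```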

To be able to extract the selected features from a voxel the frame of reference defined by the template CT has to be imposed on the other image sequences.

This is done in three steps. First the MRIs of the training and template patients are rigidly registered and resampled to the corresponding CTs. Then the CT of the training patient is registered and resampled to the template CT while allowing for scale transformations. Finally, the same transformation is applied to the training MRI thereby implicitly relating it to the template MRI. The different registrations performed are depicted in figure 2.5.

                               Actual value
                               p                  n
Predicted value    p'          True Positive      False Positive
                   n'          False Negative     True Negative

Table 2.1: The definitions used in a confusion matrix to analyze the performance.

2.2.3 Evaluation of performance

To evaluate the performance of the algorithm for different choices of features and parameters, some kind of measure is needed. One alternative would be to simply measure the percentage of correct predictions made on a validation set. Due to the fact that there is an imbalance in the number of examples belonging to the different classes, this measure may yield misleading results.

An alternative description that allows more detailed analysis is the confusion matrix. As illustrated in table 2.1 it shows the number of false positives (FP), false negatives (FN), true positives (TP) and true negatives (TN). Based on this, the true positive rate (TPR) and false positive rate (FPR) are defined as

\mathrm{TPR} = \frac{\mathrm{TP}}{\mathrm{TP} + \mathrm{FN}}, \qquad \mathrm{FPR} = \frac{\mathrm{FP}}{\mathrm{FP} + \mathrm{TN}}

In words, the true positive rate represents the fraction of correctly predicted positive results out of the actually positive examples. Analogously, the false positive rate represents the fraction of erroneously predicted positive results out of the actually negative examples.
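A small Python sketch of these quantities for labels in {−1, +1}:

```python
# Sketch: confusion-matrix counts and the resulting TPR and FPR.
import numpy as np

def rates(y_true, y_pred):
    tp = np.sum((y_pred == 1) & (y_true == 1))
    fp = np.sum((y_pred == 1) & (y_true == -1))
    fn = np.sum((y_pred == -1) & (y_true == 1))
    tn = np.sum((y_pred == -1) & (y_true == -1))
    return tp / (tp + fn), fp / (fp + tn)   # (TPR, FPR)
```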

Indeed, these are reasonable measures in general but not necessarily the best suited for any given task. Since the goal here is ultimately to improve the dose planning accuracy of the Leksell Gamma Knife, a measure tailored for that specific purpose can be constructed. An illustration of how the Gamma Knife works is shown in figure 2.6.

During irradiation the main part of the photons enter through the top side of the head and hence the accuracy of the skull segmentation there has the greatest impact on the dose calculation. A measure that takes this into account while keeping the computational burden low is therefore desired. It is thus proposed to use a measure M (C, σ) defined as

M(C, \sigma) = \sum_i \left( t_{i,\mathrm{true}} - t_{i,\mathrm{pred}} \right)^2    (2.25)

where C as before controls the regularization in the training of the support vector machines and σ affects the shape of the radial basis functions used.

Moreover, t_i represents the thickness of the bone traversed by a ray i going from the center of the head in the direction specified by the polar angle θ_i and the azimuthal angle φ_i. It is proposed to choose these angles such that the hemisphere is uniformly sampled. This is done by taking φ_i to be linearly spaced on [0, 2π) and θ_i ∈ (0, π/2] such that cos θ_i is linearly spaced.
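A short Python sketch of this sampling and of the measure (2.25) follows; bone_thickness(volume, theta, phi) is a hypothetical ray-tracing helper, not part of the thesis.

```python
# Sketch: uniform hemisphere sampling and the measure M(C, sigma) of eq. (2.25).
import numpy as np

def hemisphere_directions(n_phi, n_theta):
    phi = np.linspace(0.0, 2.0 * np.pi, n_phi, endpoint=False)          # [0, 2*pi)
    cos_theta = np.linspace(0.0, 1.0, n_theta, endpoint=False)          # cos(theta) in [0, 1)
    theta = np.arccos(cos_theta)                                        # theta in (0, pi/2]
    return [(t, p) for t in theta for p in phi]

def measure_M(true_bone, pred_bone, directions, bone_thickness):
    # Sum of squared differences in traversed bone thickness over all rays.
    return sum((bone_thickness(true_bone, t, p) - bone_thickness(pred_bone, t, p)) ** 2
               for t, p in directions)
```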


Figure 2.6: The Gamma Knife is used for non-invasive treatment of brain disorders, typically performed on a single occasion. During the procedure, 192 Cobalt-60 sources converge with high accuracy on the target. Each individual beam has low intensity and therefore has no significant effect on the tissue passed on the way to the target. The beams converge in an isocenter where the cumulative radiation intensity becomes extremely high. By moving the patient's head in relation to the beams' isocenter, the radiation dose can be optimized in relation to the shape and size of the target.


2.2.4 Parameter tuning

As was mentioned before, there is no general way of finding the optimal choice of the parameters C and σ. However, by using a measure such as (2.25) it is possible to perform an automatic optimization like the one provided by MATLAB's fminsearch command. It uses a nonlinear optimization procedure called the Nelder-Mead simplex method, which does not require derivatives [52]. However, it is only well-behaved for twice differentiable and unimodal problems, neither of which is guaranteed in this case. To avoid converging to a non-global minimum it is thus advisable to ensure a good initial guess by trying out parameter values in an exponential sequence that covers a wide range of values.
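A Python sketch of this tuning procedure is given below, using SciPy's Nelder-Mead implementation as a stand-in for MATLAB's fminsearch. evaluate_M(C, sigma) is assumed to train the SVM and return the measure (2.25); the grid ranges are illustrative.

```python
# Sketch: coarse exponential grid for the initial guess, then Nelder-Mead refinement.
import numpy as np
from scipy.optimize import minimize

def tune_parameters(evaluate_M):
    grid = [(C, s) for C in np.logspace(-1, 3, 5) for s in np.logspace(-2, 2, 5)]
    C0, s0 = min(grid, key=lambda p: evaluate_M(*p))       # best point on the grid
    # Search in log-space so that both parameters stay positive.
    objective = lambda x: evaluate_M(np.exp(x[0]), np.exp(x[1]))
    result = minimize(objective, x0=np.log([C0, s0]), method="Nelder-Mead")
    return np.exp(result.x)                                 # (C, sigma)
```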


Chapter 3

Dose calculations

The overall goal in radiotherapy is to deliver large amounts of radiation to the target while sparing healthy tissue. This is quantified by the dose, which is defined as the mean energy imparted per unit mass. A complication in the context of radiotherapy is that the dose delivered is not amenable to direct measurement.

Therefore, dose distributions have to be estimated from calculation models.

Depending on the context, the requirements on these vary. In clinical applications, where dose calculations are often performed repeatedly as part of an optimization procedure, time considerations are important. This often comes at the cost of reduced accuracy. On the contrary, in the design and evaluation of radiotherapy equipment, such as in this work, the emphasis is on accuracy and time constraints are less of a problem. Naturally, this type of trade-off limits the range of algorithms conceivable for different applications.

This chapter briefly discusses the processes in which energy is deposited, then outlines the demands on the dose calculation and lastly presents the algorithm used henceforth.

3.1 Energy deposition in photon beams

In radiotherapy with conventional linear accelerators, as well as with the Gamma Knife, the radiation consists of photons in the MeV range. In the case of the Gamma Knife the radiation originates from Cobalt-60, which emits photons at 1.17 and 1.33 MeV with equal probability, according to the decay scheme shown in figure 3.1.

This means that pair production, requiring energies greater than 1.022 MeV, is possible. However, the cross section for pair production at these energies is by far exceeded by that of Compton scattering, and pair production will consequently be left out of the discussion in what follows. When considering linear accelerators, operating at higher energies, this might not be appropriate. The photoelectric effect, on the other hand, is initially negligible but becomes increasingly important as the photons scatter and lose energy. A comparison of the linear attenuation coefficients in water for different types of interactions is shown in figure 3.2.

Figure 3.1: Decay scheme for Cobalt-60 which is used as the radiation source in the Gamma Knife.

Figure 3.2: Linear attenuation coefficient in water for Rayleigh scattering (µRa), Compton scattering (µC), photoelectric effect (µPh) and pair production (µPP).

In short, Compton scattering is an interaction in which the incoming photon scatters inelastically against an electron. The energy imparted to the electron makes it recoil and it is thereby ejected from the atom. The remaining part of the energy is emitted as a scattered photon in a different direction than the electron, such that the overall momentum is conserved. The probability of scattering in different directions is given by the Klein-Nishina formula.
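For reference, the Klein-Nishina formula in its standard form (quoted here for completeness, with $r_e$ the classical electron radius, $E$ and $E'$ the incoming and scattered photon energies and $\theta$ the scattering angle) gives the differential cross section per electron as

\[
\frac{d\sigma}{d\Omega} = \frac{r_e^2}{2}\left(\frac{E'}{E}\right)^2
\left(\frac{E'}{E} + \frac{E}{E'} - \sin^2\theta\right),
\qquad
\frac{E'}{E} = \frac{1}{1 + \dfrac{E}{m_e c^2}\,(1 - \cos\theta)} .
\]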

The recoiling electron ionizes and excites atoms along its trajectory until its kinetic energy is lost. Since the mean free path of the electron is significantly shorter than that of the scattered photon, its energy is deposited far more locally. Therefore it is common to make a distinction between primary dose, deposited by the electron close to the site of interaction, and secondary dose due to the scattered photon, which also eventually gives off its energy through electron interactions.

3.2 Methods for dose calculation

This section aims to describe some of the requirements on dose calculations as well as an outline of algorithms to deal with them. It is to a large extent based on an excellent review article on the subject [53]. As is mentioned there, several recommendations on overall dose accuracy exist, but an accuracy of ±5% is commonly considered acceptable. This, however, includes inaccuracies such as patient positioning and machine errors, making a precise statement of the required dose calculation accuracy difficult. Still, it is indisputable that higher accuracy is better.

Analytically, a radiation field at a given time can be described by the phase space density. This is defined as the number of particles per six-dimensional volume made up of the three spatial coordinates and either the particle energy and direction or, equivalently, the three momentum components. Through some general continuity considerations the phase space evolution can be expressed by the well-known Boltzmann equation. Although the scattering and absorption processes themselves are well understood, the overwhelming number of such processes makes the Boltzmann equation difficult to solve analytically.
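For orientation, one common time-independent form of the linear Boltzmann transport equation for the particle fluence $\Psi(\mathbf{r}, E, \hat{\Omega})$ reads (the notation here is generic and not taken from any of the cited works):

\[
\hat{\Omega} \cdot \nabla \Psi(\mathbf{r}, E, \hat{\Omega})
+ \mu(\mathbf{r}, E)\, \Psi(\mathbf{r}, E, \hat{\Omega})
= \int_0^{\infty} \!\! \int_{4\pi} \mu_s(\mathbf{r}, E' \to E, \hat{\Omega}' \cdot \hat{\Omega})\,
\Psi(\mathbf{r}, E', \hat{\Omega}')\, d\Omega'\, dE'
+ Q(\mathbf{r}, E, \hat{\Omega}),
\]

where $\mu$ is the total attenuation coefficient, $\mu_s$ the differential scattering coefficient and $Q$ a source term. The collision integral on the right-hand side couples all energies and directions, which is what makes analytical solutions impractical.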

Early methods for dose calculation were instead based on semi-empirical formulas. Today these are mostly used as independent checks for quality assurance.

Other methods more commonly used in clinical applications draw from an idea in image formation. By viewing the energy deposition by secondary particles around a primary photon interaction site analogously to a point spread function in image formation, the overall dose distribution can be computed as a convolution. Strictly speaking this requires that the kernel describing the energy deposition is spatially invariant, which is only true in homogeneous media.

To calculate the kernels analytically is difficult and the standard method is therefore to do it using the statistical Monte Carlo method. Furthermore, to reduce the computational burden this is typically done only for water. The result is then rescaled according to the density of the actual tissue.
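A minimal sketch of the convolution view, for a homogeneous medium and ignoring the density-scaling step; the terma and kernel arrays are placeholders for the total energy released per unit mass and a precomputed water kernel, respectively:

```python
import numpy as np
from scipy.signal import fftconvolve

def convolve_dose(terma, kernel):
    """Dose as a convolution of TERMA with a point-deposition kernel.

    terma  : 3D array of total energy released per unit mass (placeholder).
    kernel : 3D energy-deposition kernel, precomputed (e.g. by Monte Carlo
             in water) and normalized to unit sum.
    Valid as written only for a homogeneous medium.
    """
    return fftconvolve(terma, kernel, mode="same")

# Toy usage: a single primary interaction site in a water-like phantom.
terma = np.zeros((64, 64, 64))
terma[32, 32, 32] = 1.0
kernel = np.ones((5, 5, 5)) / 125.0  # stand-in for a real dose kernel
dose = convolve_dose(terma, kernel)
```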

Methods based on convolution are popular since they manage to achieve good accuracy at a reasonable computational cost. On the other hand, in cases where the highest possible accuracy is desirable, Monte Carlo methods are considered the superior choice.

The general idea behind Monte Carlo methods is to use repeated random sampling of processes expressed as probability distributions in order to statistically evaluate mathematical or physical systems. They are typically used when analytical computations are impractical, as in the present case of the Boltzmann equation. However, alternative deterministic methods of solving the Boltzmann equation do exist.
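As a minimal illustration of the random-sampling idea (not the algorithm used in this work): the depth of a photon's first interaction in a homogeneous medium follows an exponential distribution with the linear attenuation coefficient µ as rate, and can be sampled by inverting the cumulative distribution. Scatter, secondary electrons and energy dependence are all ignored here, and the value of µ is only an order-of-magnitude assumption for ~1.25 MeV photons in water.

```python
import numpy as np

rng = np.random.default_rng(0)

def sample_first_interaction_depths(mu, n_histories):
    """Sample depths (cm) of first photon interactions in a homogeneous
    medium with linear attenuation coefficient mu (1/cm), using inverse
    transform sampling of the exponential free-path distribution."""
    xi = rng.random(n_histories)
    return -np.log(xi) / mu

depths = sample_first_interaction_depths(mu=0.06, n_histories=100_000)
counts, edges = np.histogram(depths, bins=50, range=(0.0, 50.0))
# The relative statistical uncertainty of each tally scales as 1/sqrt(N),
# which is why a very large number of histories is needed.
```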

In principle, Monte Carlo methods could model the explicit physical processes without introducing approximations. However, two major factors limit the accuracy. First, there is an uncertainty in the interaction cross sections used and, second, there is a statistical uncertainty which depends on the number of photons traced.

Due to the long mean free paths of the photons in the dose calculation, the energy is transferred in local fractions over large volumes. Consequently, a very large number of photon histories is required for accurate statistics. In fact, obtaining comparable statistics for electron transport requires only about a tenth of the number of histories [54].

Moreover, the Monte Carlo method allows for modeling of beam transport in the machine, a process which may contribute significantly to the accuracy. To avoid redundant calculations, this part can be performed in advance and the resulting phase space data stored.

3.3 XVMC

Since the evaluations to be performed place greater emphasis on accuracy than on execution speed, a Monte Carlo method is used. More precisely, the algorithm used is XVMC [54, 55], which is an improved version of a Monte Carlo algorithm for electron beam dose calculations [56, 57].

Since the photons only interact a limited number of times before either being absorbed or exiting the patient, their interactions can be modeled directly. In contrast, scattered electrons undergo a substantial number of Coulomb interactions before coming to a halt, and direct modeling of all of these is therefore impractical. Thus the collisions and the production of bremsstrahlung photons are treated in an approximate fashion [57]. This is first done in homogeneous water and then rescaled to the patient geometry using information from a CT study.

To obtain the electron density from the image intensity in CT images, a sequence of steps is carried out.

First, image intensity is converted to Hounsfield units according to a linear relation included in the CT study’s DICOM information.

Second, the Hounsfield units are related to density according to equation 22 in [56]

\[
\rho(h_0) =
\begin{cases}
-0.008 + 1.033\,h_0, & h_0 \le 0.895 \\
\phantom{-}0.108 + 0.904\,h_0, & 0.895 < h_0 \le 1.1 \\
\phantom{-}0.330 + 0.685\,h_0, & 1.1 < h_0 \le 2.381 \\
\phantom{-}0.580 + 0.580\,h_0, & 2.381 < h_0
\end{cases}
\tag{3.1}
\]

where $h_0 = 1 + h/1000$ and $h$ is the Hounsfield value.

Third, the density is connected to electron density by equation 10 in [54]

\[
\frac{n_e(\rho)}{n_e^w} =
\begin{cases}
\rho/\rho_w, & \rho \le \rho_w \\
0.85\,\rho/\rho_w + 0.15, & \rho \ge \rho_w
\end{cases}
\tag{3.2}
\]

where $n_e(\rho)$ is the electron density as a function of the mass density $\rho$, and $n_e^w$ and $\rho_w$ are the electron density and mass density of water, respectively.
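A sketch of the full conversion chain, assuming the pydicom library for the first step (RescaleSlope and RescaleIntercept are standard DICOM attributes); the piecewise functions are direct transcriptions of equations (3.1) and (3.2):

```python
import numpy as np
import pydicom

def ct_to_relative_electron_density(dicom_path):
    """Convert a CT slice to electron density relative to water,
    following the three steps described above."""
    ds = pydicom.dcmread(dicom_path)

    # Step 1: image intensity -> Hounsfield units (linear DICOM rescale).
    hu = ds.pixel_array.astype(np.float64) * float(ds.RescaleSlope) \
         + float(ds.RescaleIntercept)

    # Step 2: Hounsfield units -> mass density, equation (3.1).
    h0 = 1.0 + hu / 1000.0
    rho = np.select(
        [h0 <= 0.895,
         (h0 > 0.895) & (h0 <= 1.1),
         (h0 > 1.1) & (h0 <= 2.381),
         h0 > 2.381],
        [-0.008 + 1.033 * h0,
          0.108 + 0.904 * h0,
          0.330 + 0.685 * h0,
          0.580 + 0.580 * h0])

    # Step 3: mass density -> electron density relative to water, equation (3.2).
    rho_w = 1.0  # g/cm^3
    return np.where(rho <= rho_w,
                    rho / rho_w,
                    0.85 * rho / rho_w + 0.15)
```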

References
