
Degree project in

Automatic localization of bounding boxes for subcortical structures in MR images using regression forests

LARS LOWE SJÖSUND

Master's Thesis at CSC, CVAP
Supervisor at Elekta: Jens Sjölund
Supervisor at CVAP: Josephine Sullivan
Examiner: Stefan Carlsson

E-mail address: sjosund@kth.se
Subject: Computer Science

Svensk titel (Swedish title): Automatisk Lokalisering av Hjärnstrukturer i MR-bilder med hjälp av Regression Forests

Automatic Localization of Bounding Boxes for Subcortical Structures in MR Images Using

Regression Forests

Lars Lowe Sjösund

August 21, 2013

Abstract

Manual delineation of organs at risk in MR images is a very time consuming task for physicians, and being able to automate the process is therefore highly desirable. This thesis project aims to explore the possibility of using regression forests to find bounding boxes for general subcortical structures.

This is an important preprocessing step for later implementations of full segmentation, to improve the accuracy and to reduce the time consumption. An algorithm suggested by Criminisi et al. is implemented and extended to MR images. The extension also includes the use of a greater pool of feature types. The obtained results are very good, with an average Jaccard similarity coefficient as high as 0.696 and a center mean error distance as low as 3.14 mm. The algorithm is very fast and is able to predict the location of 43 bounding boxes within 14 seconds. These results indicate that regression forests are well suited as the method of choice for preprocessing before a full segmentation.

Automatisk lokalisering av hjärnstrukturer i MR-bilder med hjälp av Regression Forests

Sammanfattning (Summary in Swedish)

Manual segmentation of organs at risk in MR images is a very time-consuming task for physicians. Being able to automate this process would therefore be of great benefit. In this degree project we have investigated the possibility of using regression forests to find a minimal bounding box for different structures in the brain. This is an important step for speeding up and increasing the precision of a later complete segmentation. An algorithm developed by Criminisi et al. is extended to be used on MR images and to include a richer pool of possible features. The results obtained are very good, with an average Jaccard similarity coefficient of 0.696 and an average error in the estimate of the bounding-box center of 3.14 mm. The algorithm is also very fast, localizing the bounding boxes for 43 structures in 14 s.

These results clearly show that the algorithm can be used as a step before a complete segmentation.

Acknowledgements

I would like to thank my co-workers at Elekta for helping me out and showing interest in this master's thesis project. In particular I would like to thank Jens Sjölund for supervising me and giving me tons of great input on my work. I would also like to thank Josephine Sullivan at CVAP, KTH, for lending me her expertise and supervising me in this very exciting field of research. Finally I would like to thank all of my friends who helped me with the proofreading of the report.

Contents

1 Introduction
  1.1 Related Work
2 Background Theory
  2.1 MRI
  2.2 Decision Trees
  2.3 Entropy and Maximum Information Gain
3 Regression Forests
  3.1 Training of the Forest
    3.1.1 Choice of Parameters
  3.2 Testing
4 Implementation
  4.1 Hardware
  4.2 Sherwood
  4.3 Dataset
  4.4 Algorithm
    4.4.1 Preprocessing
    4.4.2 Feature Responses
    4.4.3 Intensity
    4.4.4 Entropy
    4.4.5 Gradients
    4.4.6 Tissue Segmentation
    4.4.7 Finding the Optimal Splits
  4.5 Testing
  4.6 Performance Measures
    4.6.1 Jaccard Similarity Coefficient
    4.6.2 Center Mean Error Distance
    4.6.3 Precision/Recall
5 Results
  5.1 Variation of Parameters
    5.1.1 Number of Trees
    5.1.2 Depth
    5.1.3 Number of Tested Features
    5.1.4 Number of Tested Thresholds
    5.1.5 Fraction of Voting Leaves
    5.1.6 Single Feature Types
  5.2 Precision and Recall
  5.3 Comparison with the Mean
6 Discussion
  6.1 The Dataset
  6.2 Prediction Results
  6.3 Testing Time
  6.4 Sources of Error
  6.5 Further Research
7 Conclusions
Bibliography
A Appendix
  A.1 Derivation of Differential Entropy for Gaussian Distribution
  A.2 Intensity Histograms
  A.3 The Results in Tables
  A.4 Results for Individual Structures


Chapter 1

Introduction

The Gamma Knife was developed in the 1950s by the Swedish neurosurgeon Lars Leksell as an apparatus for treating brain tumors. It uses gamma radiation from cobalt-60, emerging from several sources all aimed at the location of the tumor, see Fig. 1.1. A single gamma beam generates a very low radiation dose, and thus, as opposed to conventional radiotherapy, untargeted tissue will not be affected by the treatment. Compared to ordinary neurosurgery, operations performed using a Gamma Knife eliminate the need to open a patient's skull, thus reducing complications related to infection and surgeon error. Over 600 000 patients have been treated at the time of writing.

Before a Gamma Knife treatment, a treatment plan needs to be made, based on an MRI and/or a CT scan. During the planning, the planner decides the target areas and the durations for which the gamma rays converge on them. Complications arise if a tumor is located close to any of the more sensitive and vital structures of the brain (organs at risk), such as the optic nerve or the cochlear nerve. This gives rise to the need for the planner not only to delineate the tumor, but also the organs at risk. Performing this delineation takes a significant amount of time [11], and it would be desirable to automate this process.

In this thesis project, we investigate automating the task of finding bounding boxes for organs at risk. This is an important preprocessing step for later implementations of full segmentation, both to improve accuracy and to reduce the running time of the planning procedure. For example, the algebraic multigrid segmentation has a time complexity of O(n) [7], where n is the number of voxels in the volume. The number of voxels, in turn, is cubic in the side length of the volume. Reducing the volume on which the segmentation is performed to a fraction of its original size would therefore result in a significant reduction of processing time.

It is also desirable that the algorithm is general enough to be easily adapted to organs other than the ones it was originally trained for.

Figure 1.1: (a) The Gamma Knife Perfexion. (b) Illustration of the Gamma Knife.

1.1 Related Work

Due to the large size and complex nature of a CT- or MRI-scan, locating the bounding boxes of anatomical structures is an important preprocessing step before full segmentation is made. Hence, several research groups have made attempts to make this step as accurate and time-efficient as possible.

In [17], Wels et al. use the concept of Marginal Space Learning (henceforth MSL) to perform detection and segmentation of 8 structures in T1-weighted MR brain scans. MSL is a top-down approach where the segmentation search space is projected down into the subspaces of finding first the center position of each structure, then orientation, scale, and finally shape.

By doing this, Wels et al. avoid exhaustive search in a high-dimensional space, and the calculations take only 13.9 seconds for 8 subcortical gray matter structures on a Fujitsu Siemens notebook equipped with a 2.20 GHz Intel Core 2 Duo CPU and 3 GB of RAM. 3D generalized Haar wavelet features are used to find possible object centers, and steerable features are used to find orientation and scaling. This thesis does not investigate this approach, due to a possible conflict of interest between the associated patent owner (Siemens AG [19]) and Elekta, who have collaborated with the author of this thesis.

In [20], Zhou et al. get good results detecting bounding boxes for body organs in CT scans, using multiple 2D Haar-like detectors. These detectors are run over each slice along the axial, coronal, and sagittal directions, and their results are combined into a 3D voting space. The best results are observed when detecting the object centers, where a large number of the detected organs are within 10 mm of ground truth. The mean results for the Jaccard similarity coefficient (JSC, defined as |A ∩ B|/|A ∪ B|, where A is ground truth and B the detected volume) vary significantly, and for the spleen the mean is only 0.571 with a standard deviation of 0.236. The scaling of the algorithm is not optimal, because it needs to be run once for every organ to be detected. It also needs one trained classifier for each direction and structure. In the results reported in [20], it is stated that each organ takes about 15 seconds to localize on a computer equipped with an Intel Core 2 Duo 2.23 GHz CPU, which is lengthy compared to MSL [17]. Additionally, in this approach, each classifier assigns hard labels to each point in the voting space, which is arguably less robust than a more probabilistic approach, since there is no measure of how certain the estimate is. Also, by only using local 2D detectors, information is lost compared to looking at 3D features, as well as compared to combining local and more global information.

In [5], Criminisi et al. suggest using regression forests to find the bounding boxes of body organs in CT scans. The training is done by passing a subset of voxels down through each decision tree and maximizing the information gain with respect to certain splitting parameters at each node. During testing, voxels from unseen CT scans are passed down through each tree, ending up in leaf nodes where they vote for the positions of the bounding boxes. Their results are good, though not as good as those of Zhou et al. in [20]; the lower resolution of the scans used in [5] might play a part in this. The run-times of the algorithm are very good: 26 anatomical structures are localized in only 4 seconds on a conventional single-core machine. Their method uses only a single model, containing 4 trees, to predict the bounding boxes for all of the structures, and is thus very scalable. They also use a softer voting approach than the one proposed in [20]. Here, each regression tree makes a probabilistic vote in the voting space, and in addition to a prediction of the positions of the bounding boxes, measures of confidence in these predictions are obtained. In this thesis project, the same approach has been tested and extended for MR brain images.

Chapter 2

Background Theory

2.1 MRI

MRI is an abbreviation for Magnetic Resonance Imaging, and is a commonly used technique for detailed visualization of internal structures of the human body. MRI is particularly useful for creating images with high contrast between different kinds of soft tissues.

For atoms with an odd total number of nucleons, there will be a non-zero spin. The simplest example of this is a single proton or neutron. Single protons, i.e. hydrogen nuclei, are the most suitable particles to use for MRI detection, since they are abundant in the brain and also have the greatest magnetic moment of the atoms in our bodies.

When placed in a magnetic field, protons will line up with spins parallel and anti-parallel to the field, with probabilities in accordance with the Boltzmann distribution

$$ p(i) \propto e^{-E_i/(k_B T)}, \qquad (2.1) $$

where E_i denotes the energy of state i. For a magnetic field of 1 T and a normal physiological temperature, it can be shown that there will be 3 more protons per million in the lower energy state, with the spin parallel to the magnetic field [3]. The spin is not perfectly aligned with the field but will have an angle to it. This makes the magnetic field exert a torque on the proton, which will precess about the field direction with a frequency known as the Larmor frequency. This frequency is proportional not only to the strength of the magnetic field, but also to the gyromagnetic ratio, an attribute unique to each type of particle. The energy is proportional to the frequency, and thus the energy difference between the parallel and the anti-parallel states is also proportional to the field strength. To probe the protons, photons with energy exactly corresponding to this energy difference (radio frequency) are aimed at the subject. When a proton in the lower energy state is hit by a photon, it is excited to the higher energy state, and when another photon hits it, stimulated emission of two photons occurs. This makes the proton return to the lower energy state. This emission can be detected, and by superimposing magnetic non-uniformities, the spatial location of the emission can be determined. Combining this with the fact that different tissue types have different relaxation times, it is possible to create images with high contrast between tissues.

Interested readers are referred to chapters 14 and 15 in [3] for a more in-depth study.


2.2 Decision Trees

A decision tree is a commonly used machine learning method, invented by Breiman and Friedman in the 1970s [13]. Decision trees can be of either classification or regression type: a regression tree outputs either a point estimate or a probability distribution, whereas a classification tree outputs a class prediction. A decision tree partitions the feature space into subregions, often hyperrectangles, by applying split functions. For classification, each subregion is associated with a class, as shown in Fig. 2.1(a). When performing regression, each subregion is instead connected to a function approximating the behaviour within that region. The output f_T of a regression tree T given input v can be written as

$$ f_T(\mathbf{v}) = \sum_i f_i(\mathbf{v})\, I(\mathbf{v} \in R_i), \qquad (2.2) $$

where R_i is the i'th region, I(.) the indicator function, and f_i the function approximating the behaviour in region i, see Fig. 2.1(b). The split functions are often binary and, in the case of numerical features, often axis aligned. This makes decision trees very easy to interpret, since their decision process is often in line with how humans perform classification. Decision trees are also very fast at runtime, with a time complexity of O(log |R|), where |R| is the number of regions.

In practice, the input data is introduced at the root node (see Fig. 2.2) of the tree, and is then passed down along a path decided by the split tests performed in the non-terminal nodes, henceforth called split nodes. The input ends up in the terminal nodes, henceforth called leaf nodes, which make a prediction about the output value according to Eq. (2.2).
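As a minimal illustration of this pass-down procedure (a sketch only, not the Sherwood-based implementation used in this thesis; the node layout and field names are hypothetical), the following C++ snippet traverses an axis-aligned binary regression tree stored as an array of nodes and returns the leaf's stored point estimate.

```cpp
#include <cstdio>
#include <cstddef>
#include <vector>

// Hypothetical node layout for an axis-aligned binary regression tree.
struct Node {
    bool  is_leaf;
    int   feature;     // index of the feature tested in this split node
    float threshold;   // split threshold
    int   left, right; // child indices (unused for leaves)
    float prediction;  // stored leaf statistic (here: a point estimate)
};

// Pass one data point down from the root to a leaf and return its prediction.
float predict(const std::vector<Node>& tree, const std::vector<float>& x) {
    std::size_t i = 0;                      // start at the root node
    while (!tree[i].is_leaf) {
        const Node& n = tree[i];
        i = (x[n.feature] < n.threshold) ? n.left : n.right;
    }
    return tree[i].prediction;              // Eq. (2.2): the leaf's output
}

int main() {
    // A two-level tree: split on feature 0 at 0.5; leaves predict 1.0 and 2.0.
    std::vector<Node> tree = {
        {false, 0, 0.5f, 1, 2, 0.0f},
        {true,  0, 0.0f, -1, -1, 1.0f},
        {true,  0, 0.0f, -1, -1, 2.0f}};
    std::printf("%f\n", predict(tree, {0.7f}));  // falls into the right leaf
}
```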

A problem with decision trees is that they have high variance. A single decision tree can easily grow too complex and overfit the training data, thus generalizing poorly to unseen data. The high variance is due to the greedy learning of the hierarchical structure, where an error early in the tree will propagate down to the lower levels [8]. To address this problem, it is possible to average the outputs of several trees, as described in chapter 3.

(a) Partitioning of the feature space, and prediction by a classification tree. (b) Partitioning of the feature space, and prediction by a regression tree.

Figure 2.1: Illustrations of the learnt partitions and the predictions made by a classification and a regression tree. Figure (b) from [8].

(a) Decision tree; circular nodes are split nodes, square nodes are leaf nodes. (b) In every split node, a decision is based on the feature response f(.).

Figure 2.2: Illustrations of a decision tree.

2.3 Entropy and Maximum Information Gain

For every level of a decision tree, it is desirable to gain some information compared to the previous level. To understand this increase in information, one needs some formal definitions from information theory. In [1], Bishop describes information as a "degree of surprise": if a sample x of a random variable X is very unlikely, it contains more information than a very likely data sample. Mathematically, we want a function h(x) that is monotonically decreasing in the probability p(x). Also, if two samples are statistically independent, their information should add: p(x, y) = p(x)p(y) ⇒ h(x, y) = h(x) + h(y). Both of these criteria are fulfilled by defining the information as

$$ h(x) = -\log p(x). \qquad (2.3) $$

From this, the average information from sampling a distribution can be calculated as

$$ H(x) = \mathbb{E}[h(x)] = -\sum_i p(x_i) \log p(x_i). \qquad (2.4) $$

This is also known as the entropy and is a measure of the uncertainty of the random variable X, see Fig. 2.3. For a continuous distribution one instead speaks of the differential entropy, which is defined as

$$ H(x) = -\int p(x) \log p(x)\, dx. \qquad (2.5) $$

When training a regression tree, one wants to maximize the information gain at every level. This is equivalent to maximizing the difference in entropy between a node and its children. The distributions will later be modelled as multivariate Gaussians, and the differential entropy of a multivariate Gaussian is

$$ H(x) = \frac{1}{2} \log\!\left( (2\pi e)^d\, |\Sigma| \right), \qquad (2.6) $$

where Σ is the covariance matrix of the distribution, d is the number of dimensions, and e is the base of the natural logarithm. For a complete derivation, see appendix A.1.
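As a small numerical check of Eq. (2.6) (a sketch, not part of the thesis code), the following C++ snippet evaluates the differential entropy of a Gaussian with a diagonal covariance matrix, which is the case used later when the bounding-box sides are assumed uncorrelated; the determinant |Σ| is then just the product of the per-dimension variances. The example variances are made up.

```cpp
#include <cmath>
#include <cstdio>
#include <vector>

// Differential entropy of a d-dimensional Gaussian with diagonal covariance:
// H = 0.5 * log((2*pi*e)^d * |Sigma|), with |Sigma| = product of the variances.
double gaussian_entropy(const std::vector<double>& variances) {
    const double pi = 3.141592653589793;
    const double d  = static_cast<double>(variances.size());
    double log_det = 0.0;
    for (double v : variances) log_det += std::log(v);
    return 0.5 * (d * std::log(2.0 * pi * std::exp(1.0)) + log_det);
}

int main() {
    // Six variances, one per bounding-box side (illustrative values only).
    std::vector<double> var = {4.0, 4.0, 9.0, 9.0, 2.25, 2.25};
    std::printf("H = %.3f nats\n", gaussian_entropy(var));
}
```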

Figure 2.3: Illustration to build intuition about entropy. (a) Histogram of uniformly distributed random numbers. (b) Histogram of Gaussian distributed random numbers. The uniform samples in (a) are more spread out and thus have a higher entropy than the Gaussian samples in (b).

Chapter 3

Regression Forests

A regression forest is a collection of, ideally, decorrelated regression trees, where the output is given as the average over the separate trees,

$$ f_{rf}(\mathbf{v}) = \frac{1}{N} \sum_{i=1}^{N} f_T^{(i)}(\mathbf{v}), $$

where N is the number of trees in the forest and f_T^(i)(.) is the output function of the i'th regression tree, as described in Eq. (2.2). By performing this averaging, it is possible to avoid much of the high variance otherwise associated with decision trees.

The idea of combining several decision trees into a decision forest was first suggested by T.K. Ho in 1995 [9]. To get a collection of decorrelated trees, Ho randomly selects the features to be used for the splits. This gives trees that generalize in different yet complementary ways. Her experiments on classifying handwritten digits show better generalization with an increasing number of trees, while maintaining 100 percent accuracy on the training set. This should be compared to previous methods for overcoming the variance problem, such as pruning, where training-set accuracy is sacrificed in favor of generalization.

In [2], L. Breiman showed that when more trees are added to a forest, its generalization error converges to the generalization error of the expected tree. He also proved that the upper bound of the generalization error can be written in terms of the strength of the individual trees and the correlation between the trees. One should aim for low correlation to get more accurate forests.

If the outputs of the N trees were independent and identically distributed random variables with variance σ², the variance of their average would be

$$ \sigma^2_{av} = \frac{1}{N}\sigma^2. $$

If we were able to obtain totally uncorrelated trees, the problem with high variance would thus be solved. In practice, however, there is a bias towards selecting trees that make use of relevant information. This introduces some correlation between the trees, and it is possible to show that the variance of the average of N trees will be

$$ \sigma^2_{av} = \rho\sigma^2 + \frac{\sigma^2(1 - \rho)}{N}, $$

where ρ is the average correlation between two trees [8]. The degree of correlation between the trees depends on the choice of parameters, something that will be discussed in the next section.
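As a hypothetical numerical illustration of this formula (the numbers are not measurements from this thesis): with σ² = 1, N = 10 trees and an average pairwise correlation of ρ = 0.3, the variance of the forest average is σ²_av = 0.3 · 1 + 1 · (1 − 0.3)/10 = 0.37, compared to 0.1 for fully decorrelated trees and 1 for a single tree. Once N is moderately large, further gains therefore come mainly from lowering the correlation ρ rather than from adding more trees.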

In [10], Ho goes into more detail about the feature selection when constructing the trees, and gives a measure for comparing the similarity between different trees. Using this measure, Ho is able to compare the random feature selection to other methods for constructing forests, such as bootstrapping and boosting. The results show that for some datasets the trees generated with the random-feature method are significantly less correlated, while for other sets the methods give similar results. Another advantage compared to bootstrapping, where the training data is sampled with replacement, is that one can train every tree on the complete training set. This becomes an important issue when dealing with small datasets.

3.1 Training of the Forest

The purpose of the training is to generate a forest that performs an optimal split of the set of incoming data points. An optimal split is one that minimizes the test error. A commonly used surrogate for error minimization is information-gain maximization. All input data points are sent into all the root nodes at the same time. At every split node, the set of data points is split in two, and this continues until each subset has ended up in a leaf node, where it gives rise to a specific output function.

When using leaf nodes that output a probability distribution, the smaller the variance of the distribution the better. To obtain this, it is desirable that data points with similar ground-truth output are passed down along the same path in the trees. This implies that for each split of a set S of data points into two subsets S_L and S_R, the subsets should be chosen so that each of them has as high internal similarity as possible, see Fig. 3.1.

Figure 3.1: An example of a probable split of a set S, when using a scalar threshold on the x dimension. The split is subject to the criterion that the data points should, internally for each subset, be as similar as possible. In this case the feature response for a data point v is simply f_i(v) = v_x.

The way to split the set at node i is controlled by varying the splitting parameters ξ_i and the feature response f_i, and picking the combination that gives the best split. These can potentially live in a high-dimensional space, and a complete search over that space may be very time consuming. Instead, the parameters are sampled from a subset of possible values. More formally, one wants to maximize the information gain, which is the same as choosing the split parameters that reduce the entropy as much as possible when S is split into S_L and S_R. Let x = (x_1, x_2, ..., x_N) be the attributes to predict, and p(x; S_i) the estimated probability distribution of these given the set of data points S_i that have reached node i. Finally, let ω_k be the fraction of S_i that reaches node k, one level down in the tree. The selection of the split parameters and feature response for node i, ξ_i and f_i, can then be written as

$$ (\xi_i, f_i) \leftarrow \operatorname*{argmax}_{\xi_i, f_i} \Big[ H(\mathbf{x}\,|\,S_i) - \sum_{k \in \{L,R\}} \omega_k H(\mathbf{x}\,|\,S_k) \Big]. \qquad (3.1) $$

By using the definition of differential entropy, we get

$$ (\xi_i, f_i) \leftarrow \operatorname*{argmax}_{\xi_i, f_i} \Big[ -\int_{\mathbf{x}} p(\mathbf{x}\,|\,S_i) \log p(\mathbf{x}\,|\,S_i)\, d\mathbf{x} + \sum_{k \in \{L,R\}} \omega_k \int_{\mathbf{x}} p(\mathbf{x}\,|\,S_k) \log p(\mathbf{x}\,|\,S_k)\, d\mathbf{x} \Big], \qquad (3.2) $$

where p(x|S) depends on the chosen model. Since the first term is independent of the split, the equation can be rewritten as a maximization of the second term only,

$$ (\xi_i, f_i) \leftarrow \operatorname*{argmax}_{\xi_i, f_i} \sum_{k \in \{L,R\}} \omega_k \int_{\mathbf{x}} p(\mathbf{x}\,|\,S_k) \log p(\mathbf{x}\,|\,S_k)\, d\mathbf{x}. \qquad (3.3) $$

The tree stops growing either when it has reached a maximum tree depth, when the entropy decreases at a rate lower than a specified threshold, or when the number of voxels in a node falls below a specified threshold. The nodes where the growing stops become the leaf nodes, and these store the relevant statistics used for future predictions, see Fig. 3.2.

Figure 3.2: Illustration of distributions for different nodes. The distributions stored in the leaf nodes are the ones used for the regression. Figure from [4].
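To make the training objective concrete, the sketch below (an assumption-laden simplification, not the Sherwood implementation used in this thesis) samples a few candidate (feature, threshold) pairs at one node, splits the node's samples, and keeps the candidate with the largest information gain; the entropy of a set is modelled, as above, by a fitted Gaussian, here reduced to a single scalar target for brevity.

```cpp
#include <cmath>
#include <cstdio>
#include <limits>
#include <random>
#include <vector>

struct Sample { std::vector<double> features; double target; };

// Differential entropy of a 1-D Gaussian fitted to the targets of a set.
double set_entropy(const std::vector<Sample>& s) {
    if (s.size() < 2) return 0.0;
    double mean = 0.0, var = 0.0;
    for (const auto& p : s) mean += p.target;
    mean /= s.size();
    for (const auto& p : s) var += (p.target - mean) * (p.target - mean);
    var = var / s.size() + 1e-9;                       // avoid log(0)
    const double pi = 3.141592653589793;
    return 0.5 * std::log(2.0 * pi * std::exp(1.0) * var);
}

// Pick the sampled (feature, threshold) pair with maximum information gain, Eq. (3.1).
void best_split(const std::vector<Sample>& S, int n_candidates,
                int& best_feature, double& best_threshold) {
    std::mt19937 rng(0);
    std::uniform_int_distribution<int> pick_f(0, (int)S[0].features.size() - 1);
    std::uniform_int_distribution<int> pick_s(0, (int)S.size() - 1);
    const double H_parent = set_entropy(S);
    double best_gain = -std::numeric_limits<double>::infinity();
    for (int c = 0; c < n_candidates; ++c) {
        const int f = pick_f(rng);
        const double t = S[pick_s(rng)].features[f];   // threshold sampled from the data
        std::vector<Sample> L, R;
        for (const auto& p : S) (p.features[f] < t ? L : R).push_back(p);
        if (L.empty() || R.empty()) continue;
        const double wL = double(L.size()) / S.size(), wR = 1.0 - wL;
        const double gain = H_parent - wL * set_entropy(L) - wR * set_entropy(R);
        if (gain > best_gain) { best_gain = gain; best_feature = f; best_threshold = t; }
    }
}

int main() {
    // Toy data: the target depends on feature 0 only.
    std::vector<Sample> S;
    for (int i = 0; i < 100; ++i)
        S.push_back({{i / 100.0, 0.5}, i < 50 ? 1.0 : 5.0});
    int f = -1; double t = 0.0;
    best_split(S, 20, f, t);
    std::printf("best feature %d, threshold %.2f\n", f, t);
}
```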


3.1.1 Choice of Parameters

The parameters to choose for the forest are the depth of each tree, the number of trees in the forest, the parameters for the binary splits, and how large a fraction of the leaf nodes should be considered relevant. In [4], Criminisi et al. tested the performance of forests with different tree depths, and found that shallow trees tend to underfit the data, while deep trees tend to overfit it. If we decide to use a fairly simple regressor function at the leaf nodes, we can understand the underfitting by imagining a degenerate tree with only one leaf node. This would result in an attempt to fit a single function to possibly very complex data, hence the underfit. At the other extreme, a very deep tree with a huge number of leaf nodes, each one suited to a specific configuration of the input data, is likely to fit to the noise, which leads to overfitting. In the case of MR images, the input data is quite complex, and thus it can be assumed that a relatively deep tree is needed to get an accurate regressor.

As mentioned above, the generalization error is bounded as the number of trees is increased. On the other hand, a large number of trees will allocate more memory and will slow down training and runtime. Implementing the algorithm on a GPU would solve the latter problem [15], but that is outside the scope of this thesis project.

In [10], Ho suggests that differences between the trees are introduced by randomly choosing a subset of all features on which to perform the splits. By sampling only a small number of possible features and thresholds, more randomness is introduced in the forest and the trees become more decorrelated. When working with very complex input data, the choice of features might not be obvious, making Ho's suggestion an ideal approach in the case of MR images. In [18], Yaqub et al. perform an offline feature selection to find the K best available features. This is done by exhaustive testing of all possible features, and then storing the ones that provide the largest improvements for the splits. This procedure speeds up the training and improves the final results. Although it is an interesting approach, it lies outside the scope of this thesis project and is left as a topic for future research.

3.2 Testing

During testing, a previously unseen set V of data points is passed down to the leaf nodes of all the trees in the forest. In contrast to the training, both the splitting parameters ξ and the response functions f_i are now held constant. The maximum a posteriori estimate of x, x_MAP, can then be calculated from the stored statistics in the following way,

$$ \mathbf{x}_{MAP} = \operatorname*{argmax}_{\mathbf{x}} \sum_{t \in T} \sum_{l \in L_t} p(\mathbf{x}, l \,|\, \mathcal{V}) = \operatorname*{argmax}_{\mathbf{x}} \sum_{t \in T} \sum_{l \in L_t} p(\mathbf{x} \,|\, l, \mathcal{V})\, p(l \,|\, \mathcal{V}), \qquad (3.4) $$

where L_t is the set of leaves in tree t, and p(l|V) is the probability for voxels from the set V to end up in leaf node l. Note that both p(x|l, V) and p(l|V) are functions of the data points in the set V; more specifically, p(x|l, V) is a function of the data points that end up in that specific leaf.

The probability distribution p(x|l, V) can be either parametric or non-parametric. For the non-parametric method, a histogram is aggregated for each leaf node, based on the voxels ending up in it. While this method offers more flexibility than a parametric method, it suffers when there is not enough training data, due to sparse sampling. A parametric method, such as fitting a Gaussian distribution to the voxels, does not have this problem.

It also seems reasonable to assume that the probability distributions of the position of subcortical structures are unimodal. Therefore, it is suitable to use a parametric method as a first approach.


Chapter 4

Implementation

The main implementation of the algorithm was done in C++. Most of the pre- and post-processing was done in MATLAB, including intensity normalization, scaling of images, and visualization of results. The extraction of the ground-truth bounding boxes for the subcortical structures was also implemented in MATLAB, due to the availability of an easy-to-use library for working with NIfTI images [16].

4.1 Hardware

The algorithm was run on a computer with a 2.60 GHz Intel Core i7-3720QM processor and 8 GB of RAM, running Windows 7.

4.2 Sherwood

For the implementation of the regression forest, a library called Sherwood, developed by Criminisi and Shotton, was used [6]. This library has been developed specifically to serve as a flexible framework for solving problems using any kind of decision forest.

4.3 Dataset

The MR brain-data sets and their manual segmentations were provided by the Center for Morphometric Analysis at Massachusetts General Hospital and are available at http://www.cma.mgh.harvard.edu/ibsr/. We obtained 17 T1-weighted volumetric images, manually normalized into the Talairach orientation (rotation only). These data have been processed by the CMA autoseg biasfield correction. However, due to corrupted files, 2 of the included volumes were unusable. For each of the volumes a segmentation was provided, containing manual delineations of 43 gray and white matter structures. From these, the axis-aligned bounding boxes were extracted to be used as ground truth. In the performed experiments, 9 medium-sized structures were chosen as test cases. The volumes had resolutions ranging between 0.837 mm/voxel and 1 mm/voxel in the anteroposterior and left-right directions, and a spacing of 1.5 mm/voxel in the dorsoventral direction. The source subjects for these volumes were both males and females, with ages ranging from 7 to 71 years (including 4 subjects denoted as "juveniles").

4.4 Algorithm

The implementation of the algorithm follows in the footsteps of Criminisi et al. [5], with the extensions of the preprocessing steps, the use of several different feature types, and the speed-up using integral images.

4.4.1 Preprocessing

Rescaling

To remove the differences in resolution between scans, all scans were downsampled to 1 mm/voxel in both the left-right and anteroposterior directions. To preserve the segmentation labels, nearest-neighbour interpolation was used for the volumes containing the segmentations. For the MR scans, bicubic interpolation was used.

Intensity Normalization

The intensity value does not have a fixed meaning in MRI, not even if the same imaging protocol, MR scanner, and patient are used for obtaining the images [14]. To still be able to use the intensity information, all the MR scans are normalized to have intensities in the same range. Since all the images are read into memory at the same time, it is important to keep the sizes of the individual images as small as possible; therefore an intensity range of 0-255 is chosen. This enables the use of the C++ data type unsigned char to store the images. To prevent the creation of artefacts due to noise, all voxels with intensities falling below a certain threshold were set to zero before the equalization, to match the background value.

The scans were then normalized using histogram equalization, where the background was excluded from the calculations. The histogram equalization algorithm takes the intensity histogram of an image as input and calculates a mapping

$$ I' = T(I) = \frac{I_{max}}{N} \sum_{i=0}^{I} n_i, $$

where I is the intensity, N is the total number of pixels, and n_i is the number of pixels with intensity equal to i. The mapping is stored as a lookup table and is used to output a normalized image. Without the initial thresholding, the histogram equalization created a large number of artefacts in the background. This is highly undesirable, since it reduces the possibility to make use of the knowledge that voxels are close to the edge of the brain, by falsely indicating that there is structure outside the brain.
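The following C++ sketch shows this mapping for an 8-bit volume, with the background (intensity 0) excluded from the histogram as described above; it is a simplified stand-in, not the MATLAB code actually used in the thesis.

```cpp
#include <array>
#include <cstdint>
#include <vector>

// Histogram-equalize an 8-bit volume in place, ignoring background voxels (value 0).
void equalize(std::vector<std::uint8_t>& volume) {
    std::array<std::size_t, 256> hist{};            // intensity histogram n_i
    std::size_t n_foreground = 0;
    for (std::uint8_t v : volume)
        if (v != 0) { ++hist[v]; ++n_foreground; }
    if (n_foreground == 0) return;

    // Lookup table: I' = T(I) = (I_max / N) * sum_{i<=I} n_i, over foreground only.
    std::array<std::uint8_t, 256> lut{};
    std::size_t cumulative = 0;
    for (int i = 1; i < 256; ++i) {
        cumulative += hist[i];
        lut[i] = static_cast<std::uint8_t>(255.0 * cumulative / n_foreground);
    }
    for (std::uint8_t& v : volume)
        if (v != 0) v = lut[v];                     // background stays 0
}

int main() {
    std::vector<std::uint8_t> vol = {0, 10, 10, 20, 200};
    equalize(vol);   // vol now spans the full 0-255 range over its foreground
}
```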

Integral Images of Gradient Magnitude

Just-in-time calculation of gradients was found to be a major bottleneck in the algorithm. To speed up the calculations, the integral image of the gradient magnitude was precomputed and stored. An integral image in three dimensions is defined by

$$ I(x, y, z) = \sum_{x' \le x} \sum_{y' \le y} \sum_{z' \le z} i(x', y', z'), $$

where i is the voxel value in the original image. Using an integral image, the sum over all voxels in a rectangular cuboid can be calculated using only 7 additions and subtractions. The approximate improvement of using integral images compared to the naïve approach can be calculated in the following way. Denote the depth of a tree by D, the number of trees by T, the number of tested features per node by F, the fraction of the features that are of the type for which the integral image is calculated by f, the number of voxels by V, the number of operations needed to calculate a specific characteristic for a voxel by C, the sizes of the image by S_x, S_y, and S_z, the subsampling factors by s_x, s_y, s_z, and the average size of a feature-response box side by B (for details about the feature responses, see section 4.4.2). Assuming that the training of the trees always reaches the maximum depth, each voxel will pass through DFf tests of the specified type per tree. If all feature-response calculations are made online, the number of calculations needed for the specified type will be of the order

$$ N_{\text{naïve}} \approx \frac{S_x S_y S_z}{s_x s_y s_z}\, T D f F B^3 C. $$

By calculating the integral image prior to passing the voxels down through the trees, the total number of calculations will instead be of the order

$$ N_{II} = N_{\text{pre-calculations}} + N_{\text{pass down trees}} \approx S_x S_y S_z C + 7 V T D f F = S_x S_y S_z C + \frac{7 S_x S_y S_z T D f F}{s_x s_y s_z}. $$

Taking the ratio of these results in

$$ \frac{N_{II}}{N_{\text{naïve}}} \approx \frac{S_x S_y S_z C + 7 S_x S_y S_z T D f F / (s_x s_y s_z)}{S_x S_y S_z T D f F B^3 C / (s_x s_y s_z)} = \frac{s_x s_y s_z}{T D f F B^3} + \frac{7}{B^3 C}. $$

Plugging in typical values used in the implementation for the gradient magnitude, T = 4, D = 10, f = 1/4, F = 10, B = 5, C = 53, and s_x = s_y = s_z = 3, gives

$$ \frac{N_{II}}{N_{\text{naïve}}} \approx 0.0032. $$

During testing there is no search for the best feature response, so F = 1; the corresponding ratio for calculating the gradient magnitude responses is then approximately 0.0227.

While this was necessary to implement for the gradient magnitude responses in order to get acceptable training speeds, it was not for the other types of responses. Although it reduces the time for both training and testing, it allocates more memory. Not only does one have to store an additional image for each integral image calculated, but it is also no longer feasible to store voxel values with the unsigned char data type: the integral images have a much wider range of values, so a bigger data type like float or double is needed. However, if one were to acquire a bigger dataset, it would be worth considering precomputing integral images for every feature-response type, to speed up the training.
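The sketch below illustrates the two operations discussed in this section, building a 3D integral image and summing a rectangular cuboid with 7 additions and subtractions; the indexing conventions and names are illustrative and not taken from the thesis code.

```cpp
#include <cstdio>
#include <vector>

struct Volume3D {
    int sx, sy, sz;
    std::vector<double> data;                   // sx*sy*sz voxel values
    double& at(int x, int y, int z)       { return data[(z * sy + y) * sx + x]; }
    double  at(int x, int y, int z) const { return data[(z * sy + y) * sx + x]; }
};

// Integral image: I(x,y,z) = sum of all voxels with x'<=x, y'<=y, z'<=z.
Volume3D integral_image(const Volume3D& v) {
    Volume3D I{v.sx, v.sy, v.sz, std::vector<double>(v.data.size(), 0.0)};
    for (int z = 0; z < v.sz; ++z)
        for (int y = 0; y < v.sy; ++y)
            for (int x = 0; x < v.sx; ++x) {
                double s = v.at(x, y, z);
                if (x > 0) s += I.at(x - 1, y, z);
                if (y > 0) s += I.at(x, y - 1, z);
                if (z > 0) s += I.at(x, y, z - 1);
                if (x > 0 && y > 0) s -= I.at(x - 1, y - 1, z);
                if (x > 0 && z > 0) s -= I.at(x - 1, y, z - 1);
                if (y > 0 && z > 0) s -= I.at(x, y - 1, z - 1);
                if (x > 0 && y > 0 && z > 0) s += I.at(x - 1, y - 1, z - 1);
                I.at(x, y, z) = s;
            }
    return I;
}

// Sum over the cuboid [x0,x1] x [y0,y1] x [z0,z1] (inclusive; x0,y0,z0 >= 1)
// using 7 additions/subtractions on the integral image.
double box_sum(const Volume3D& I, int x0, int y0, int z0, int x1, int y1, int z1) {
    return I.at(x1, y1, z1)
         - I.at(x0 - 1, y1, z1) - I.at(x1, y0 - 1, z1) - I.at(x1, y1, z0 - 1)
         + I.at(x0 - 1, y0 - 1, z1) + I.at(x0 - 1, y1, z0 - 1) + I.at(x1, y0 - 1, z0 - 1)
         - I.at(x0 - 1, y0 - 1, z0 - 1);
}

int main() {
    Volume3D v{4, 4, 4, std::vector<double>(64, 1.0)};    // a 4x4x4 volume of ones
    Volume3D I = integral_image(v);
    std::printf("%.0f\n", box_sum(I, 1, 1, 1, 2, 2, 2));  // prints 8 (a 2x2x2 box)
}
```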

4.4.2 Feature Responses

We want the feature responses to capture relevant information about where a chosen voxel is located in the brain. We also want different tree nodes to exploit different pieces of information, which results in a finer separation further down in the tree. The feature responses have the form

$$ f(\mathbf{v}, \theta_i) = |B_1|^{-1} \sum_{\mathbf{p} \in B_1} R(\mathbf{p}) \; - \; |B_2|^{-1} \sum_{\mathbf{p} \in B_2} R(\mathbf{p}), \qquad (4.1) $$

where B_1 and B_2 are boxes of random size, located at random offsets a limited distance away from the input voxel v, see Fig. 4.1. R(p) is the response at point p, selected randomly from a set of different response types, which are described below. The box sizes, offsets, and response type are all captured by the parameter θ_i. By using off-centered responses, the space from which the features are drawn becomes larger, and it also becomes possible to take advantage of more global information, as opposed to only using features centered at the input voxel. We follow the procedure of Ho [10] by sampling from the feature space, using random box sizes, offsets, and response types, thus introducing differences between the trees of the forest. This leads to smaller correlations between the trees and should give a more accurate forest, in accordance with Breiman's proof [2]. To exploit the different kinds of information embedded in the MR scans, we have used several different types of feature responses: intensity, gradient magnitude, entropy, and a basic tissue-type segmentation.

Figure 4.1: 2D illustration of the general structure of the used feature responses. v is the voxel for which the response is calculated. B_1 and B_2 are rectangular cuboids of random size, at random displacements b_1 and b_2.

Figure 4.2: Examples of responses for voxels in a plane parallel to the sagittal plane. (a) Intensity. (b) Entropy response. (c) Gradient response. (d) Tissue segmentation.
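A stripped-down version of the response in Eq. (4.1) is sketched below; it computes the mean response in two offset boxes around a voxel and returns their difference, using a naive box sum instead of the integral-image speed-up. All names are chosen for illustration only.

```cpp
#include <algorithm>
#include <cstdio>
#include <vector>

struct Box { int x0, y0, z0, x1, y1, z1; };   // inclusive corner coordinates

struct Volume3D {
    int sx, sy, sz;
    std::vector<float> data;
    float at(int x, int y, int z) const { return data[(z * sy + y) * sx + x]; }
};

// Mean response inside a box, clamped to the volume borders.
double box_mean(const Volume3D& v, Box b) {
    b.x0 = std::max(b.x0, 0); b.y0 = std::max(b.y0, 0); b.z0 = std::max(b.z0, 0);
    b.x1 = std::min(b.x1, v.sx - 1); b.y1 = std::min(b.y1, v.sy - 1); b.z1 = std::min(b.z1, v.sz - 1);
    double sum = 0.0; long n = 0;
    for (int z = b.z0; z <= b.z1; ++z)
        for (int y = b.y0; y <= b.y1; ++y)
            for (int x = b.x0; x <= b.x1; ++x) { sum += v.at(x, y, z); ++n; }
    return n > 0 ? sum / n : 0.0;
}

// Eq. (4.1): difference of mean responses in two boxes offset from voxel (vx, vy, vz).
double feature_response(const Volume3D& vol, int vx, int vy, int vz,
                        const Box& o1, const Box& o2) {
    Box b1{vx + o1.x0, vy + o1.y0, vz + o1.z0, vx + o1.x1, vy + o1.y1, vz + o1.z1};
    Box b2{vx + o2.x0, vy + o2.y0, vz + o2.z0, vx + o2.x1, vy + o2.y1, vz + o2.z1};
    return box_mean(vol, b1) - box_mean(vol, b2);
}

int main() {
    Volume3D vol{8, 8, 8, std::vector<float>(512, 1.0f)};
    Box off1{-2, -2, -2, 2, 2, 2}, off2{3, 3, 3, 5, 5, 5};
    std::printf("%f\n", feature_response(vol, 4, 4, 4, off1, off2));  // 0 on a constant volume
}
```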

4.4.3 Intensity

The intensity feature response is the most basic response type. Except for the rotation- and intensity normalisations, it is what the MRI scanner outputs. Due to this simplicity, it is worth considering if it can be used as a stand-alone feature type. This is the type used by Criminisi et al. in [5].

4.4.4 Entropy

An entropy descriptor is used to measure how unordered a local volume is, see Fig. 4.2(b). The definition (2.4) of entropy is used, with the relative frequencies of the intensities as the probability distribution,

$$ p(i) = |B|^{-1} \sum_{\mathbf{b} \in B} \mathbf{1}_i(y(\mathbf{b})), $$

where 1_i(y) is the indicator function for y being equal to i, and y(b) is the voxel value at point b. It is worth noting that the response function for entropy is not of the same form as (4.1), but is instead the difference between two entropies, f(v, θ) = H_{B_1}(i) − H_{B_2}(i).

4.4.5 Gradients

Gradient magnitude responses are captured to measure how many and how sharp edges a volume contains, see Fig. 4.2(c). For a point b the gradient magnitude is given by

$$ R(\mathbf{b}) = \|\nabla I(\mathbf{b})\| = \sqrt{\left(\frac{\partial I(\mathbf{b})}{\partial x}\right)^2 + \left(\frac{\partial I(\mathbf{b})}{\partial y}\right)^2 + \left(\frac{\partial I(\mathbf{b})}{\partial z}\right)^2}. \qquad (4.2) $$

To speed up the calculations, it is common to approximate each of the derivatives with the Sobel operator, a discrete difference operator. The operator convolves a 3 × 3 × 3 kernel with the image volume, centered at the voxel of interest. The kernel in a specific direction can be seen as a Gaussian blur followed by a derivative approximation in that direction. The Sobel kernel in the x-direction can be written as

$$ d(1,:,:) = \begin{pmatrix} 1 & 2 & 1 \\ 2 & 4 & 2 \\ 1 & 2 & 1 \end{pmatrix}, \quad d(0,:,:) = \begin{pmatrix} 0 & 0 & 0 \\ 0 & 0 & 0 \\ 0 & 0 & 0 \end{pmatrix}, \quad d(-1,:,:) = \begin{pmatrix} -1 & -2 & -1 \\ -2 & -4 & -2 \\ -1 & -2 & -1 \end{pmatrix}. $$

Because the distance between voxels in the z-direction differs from that in the x- and y-directions by a factor of 1.5, the Sobel kernel is scaled down by the same factor in the z-direction.
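The sketch below applies the 3 × 3 × 3 Sobel approximation at one voxel and combines the three directional derivatives into the magnitude of Eq. (4.2); the anisotropy is handled, as described above, by scaling the z-derivative by 1/1.5. The data layout and normalization are illustrative assumptions, not the thesis implementation.

```cpp
#include <cmath>
#include <cstdio>
#include <vector>

struct Volume3D {
    int sx, sy, sz;
    std::vector<float> data;
    float at(int x, int y, int z) const { return data[(z * sy + y) * sx + x]; }
};

// Sobel-based gradient magnitude at an interior voxel (1 <= x <= sx-2, etc.).
// The z-derivative is divided by 1.5 to compensate for the larger voxel
// spacing in the dorsoventral direction.
double gradient_magnitude(const Volume3D& v, int x, int y, int z) {
    const int smooth[3] = {1, 2, 1};   // Gaussian-like smoothing weights
    const int deriv[3]  = {-1, 0, 1};  // central-difference weights
    double gx = 0.0, gy = 0.0, gz = 0.0;
    for (int dz = -1; dz <= 1; ++dz)
        for (int dy = -1; dy <= 1; ++dy)
            for (int dx = -1; dx <= 1; ++dx) {
                const double val = v.at(x + dx, y + dy, z + dz);
                gx += val * deriv[dx + 1] * smooth[dy + 1] * smooth[dz + 1];
                gy += val * smooth[dx + 1] * deriv[dy + 1] * smooth[dz + 1];
                gz += val * smooth[dx + 1] * smooth[dy + 1] * deriv[dz + 1];
            }
    gz /= 1.5;                          // account for the 1.5 mm slice spacing
    return std::sqrt(gx * gx + gy * gy + gz * gz);   // Eq. (4.2)
}

int main() {
    // A ramp along x: the gradient magnitude is constant in the interior.
    Volume3D v{5, 5, 5, std::vector<float>(125)};
    for (int z = 0; z < 5; ++z)
        for (int y = 0; y < 5; ++y)
            for (int x = 0; x < 5; ++x) v.data[(z * 5 + y) * 5 + x] = float(x);
    std::printf("%.1f\n", gradient_magnitude(v, 2, 2, 2));
}
```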

4.4.6 Tissue Segmentation

In addition to the T1-weighted volumetric images and the structure segmentations, the IBSR dataset contains tissue segmentations, where background, cerebrospinal fluid, white matter, and gray matter have been separated and mapped onto the set S = {0, 1, 2, 3}, see Fig. 4.2(d).

4.4.7 Finding the Optimal Splits

It is desirable to cluster together voxels that produce similar predictions about the bounding-box locations. First, a bounding-box vector b_c is defined as the six-dimensional vector containing the maximum and minimum bounding-box coordinates for class c,

$$ \mathbf{b}_c = (b_c^{right}, b_c^{left}, b_c^{anterior}, b_c^{posterior}, b_c^{dorsal}, b_c^{ventral}). \qquad (4.3) $$

Figure 4.3: Illustration of the displacement between a voxel and a bounding box. The image is taken from a sagittal slice and shows the bounding box of the left part of the putamen.

The displacement d_c from a voxel v to each of the sides of a bounding box b_c of class c is then defined as

$$ \mathbf{d}_c = (b_c^{right}, b_c^{left}, b_c^{anterior}, b_c^{posterior}, b_c^{dorsal}, b_c^{ventral}) - (v_x, v_x, v_y, v_y, v_z, v_z), \qquad (4.4) $$

see Fig. 4.3.

For a set S of voxels in the same node, we model d as a 6-dimensional Gaussian distribution, where each dimension corresponds to a bounding-box side and the mean and variance are obtained by maximum-likelihood estimation. To reduce model complexity, we assume that both the sides of the bounding boxes and the locations of the different subcortical structures are uncorrelated. Using the correlations between all sides and boxes might be interesting if one has access to a bigger dataset; it would be especially interesting to introduce the correlation between opposite sides of the bounding boxes. This can be assumed to increase the accuracy of the estimation of the size of the boxes and would, given an accurate estimate of the center of the bounding box, increase the JSC. We use Eqs. (2.6) and (3.1) to derive the information gain for each node, given that we are using a Gaussian distribution. To account for the fact that a structure might not be present in a segmentation, either due to surgery or because it was not delineated, we use the joint probability p(d, c; S_i). The information gain over this probability can be expressed as

$$ IG = H(\mathbf{d}, c\,|\,S_i) - \sum_{k \in \{L,R\}} \omega_k H(\mathbf{d}, c\,|\,S_k). \qquad (4.5) $$

H(d, c|S) is now derived as an expression of known variables:

$$ \begin{aligned} H(\mathbf{d}, c\,|\,S) &= -\sum_c \int_{\mathbf{d}} p(\mathbf{d}, c\,|\,S) \log p(\mathbf{d}, c\,|\,S)\, d\mathbf{d} \\ &= -\sum_c \int_{\mathbf{d}} p(\mathbf{d}\,|\,c, S)\, p(c\,|\,S) \log\big[ p(\mathbf{d}\,|\,c, S)\, p(c\,|\,S) \big]\, d\mathbf{d} \\ &= -\sum_c p(c\,|\,S) \int_{\mathbf{d}} p(\mathbf{d}\,|\,c, S) \big[ \log p(\mathbf{d}\,|\,c, S) + \log p(c\,|\,S) \big]\, d\mathbf{d} \\ &= -\sum_c p(c\,|\,S) \Big[ -H(\mathbf{d}\,|\,c, S) + \log p(c\,|\,S) \int_{\mathbf{d}} p(\mathbf{d}\,|\,c, S)\, d\mathbf{d} \Big]. \end{aligned} $$

The integral is 1 due to the fundamental laws of probability, and the expression can be rewritten as

$$ H(\mathbf{d}, c\,|\,S) = H(c\,|\,S) + \sum_c p(c\,|\,S)\, H(\mathbf{d}\,|\,c, S), \qquad (4.6) $$

where p(c|S) = n(c|S)/Z, n(c|S) is the number of volumes in which the structure c can be found, and Z is a normalization constant. The result from Eq. (2.6) is inserted into Eq. (4.6):

$$ H(\mathbf{d}, c\,|\,S) = H(c\,|\,S) + \sum_c p(c\,|\,S)\, \frac{1}{2} \log\!\big( (2\pi e)^N |\Sigma_c(S)| \big). \qquad (4.7) $$

This expression is inserted into (3.1), and the final equation for the information gain IG_i of node i is obtained as

$$ IG_i = \Big[ H(c\,|\,S_i) - \sum_{k \in \{L,R\}} \omega_k H(c\,|\,S_k) \Big] + \frac{1}{2} \Big[ \sum_c p(c\,|\,S_i) \log\!\big( (2\pi e)^N |\Sigma_c(S_i)| \big) - \sum_{k \in \{L,R\}} \omega_k \sum_c p(c\,|\,S_k) \log\!\big( (2\pi e)^N |\Sigma_c(S_k)| \big) \Big]. \qquad (4.8) $$

Using this equation, the threshold and feature response for each node are selected as

$$ (\xi_i, f_i) = \operatorname*{argmax}_{\xi \in \Xi,\, f \in F} IG_i, \qquad (4.9) $$

where Ξ and F are the sets of sampled thresholds and features, respectively. The thresholds are sampled within the range of the specific feature tested, on the subset of the training data reaching the node in question.
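To make Eqs. (4.6)-(4.9) more concrete, the sketch below evaluates H(d, c|S) for one set of voxels from per-class counts and per-class covariances, and the gain of a candidate split then follows from Eq. (4.5). Everything here, including the diagonal-covariance shortcut so that |Σ_c| is a product of variances, is an illustrative assumption rather than the thesis implementation.

```cpp
#include <cmath>
#include <vector>

struct ClassStats {
    std::size_t count;                 // number of samples with this class in the node
    std::vector<double> variances;     // per-dimension variances of the displacement d
};

// H(d, c | S) = H(c|S) + sum_c p(c|S) * 0.5 * log((2*pi*e)^N * |Sigma_c(S)|),
// with diagonal covariances, so |Sigma_c| is the product of the variances.
double joint_entropy(const std::vector<ClassStats>& stats) {
    const double pi = 3.141592653589793;
    std::size_t total = 0;
    for (const auto& s : stats) total += s.count;
    double H = 0.0;
    for (const auto& s : stats) {
        if (s.count == 0) continue;
        const double p = double(s.count) / total;
        H -= p * std::log(p);                              // class-entropy term H(c|S)
        const double N = double(s.variances.size());
        double log_det = 0.0;
        for (double v : s.variances) log_det += std::log(v + 1e-9);
        H += p * 0.5 * (N * std::log(2.0 * pi * std::exp(1.0)) + log_det);
    }
    return H;
}

// Information gain of splitting S into L and R, Eq. (4.5).
double information_gain(const std::vector<ClassStats>& S, double nS,
                        const std::vector<ClassStats>& L, double nL,
                        const std::vector<ClassStats>& R, double nR) {
    return joint_entropy(S) - (nL / nS) * joint_entropy(L) - (nR / nS) * joint_entropy(R);
}
```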

4.5 Testing

During testing, voxels sampled at a regular interval of three voxels in each direction are sent down each trained tree in the forest, along a path determined by the binary tests. When reaching a leaf node, each voxel votes on the absolute location of all of the 43 bounding boxes, using the voxel position and the statistics stored in the leaf node. As mentioned in section 3.2, the probability of finding the bounding box for class c at b_c is

$$ p(\mathbf{b}_c\,|\,\mathcal{V}) = \frac{1}{T} \sum_{t \in T} \sum_{l \in L_t} p(\mathbf{b}_c\,|\,l, \mathcal{V})\, p(l\,|\,\mathcal{V}), \qquad (4.10) $$

where T is the number of trees in the forest. We choose L_t as the set of leaves from tree t whose distributions have the smallest variance and which together contain a specified fraction of all voxels. We approximate p(b_c|l, V) p(l|V) by aggregating non-parametric distributions over the set V_l of voxels v that end up in leaf l,

$$ p(\mathbf{b}_c\,|\,l, \mathcal{V})\, p(l\,|\,\mathcal{V}) = p(\mathbf{b}_c\,|\,\mathcal{V}_l, \mu_{l,c}, \Sigma_{l,c})\, p(l\,|\,\mathcal{V}) \approx \frac{1}{Z_l} \sum_{\mathbf{v} \in \mathcal{V}_l} p(\mathbf{b}_c - \mathbf{v}\,|\,\mu_{l,c}, \Sigma_{l,c}), \qquad (4.11) $$

where Z_l is a normalizing constant. Since, according to Eq. (4.4), d_c = b_c − v, and d_c is modelled as a Gaussian distribution, we have

$$ p(\mathbf{b}_c - \mathbf{v}\,|\,\mu_{l,c}, \Sigma_{l,c}) = \mathcal{N}(\mathbf{b}_c - \mathbf{v};\, \mu_{l,c}, \Sigma_{l,c}). $$

The prediction for b_c is chosen as the vector that maximizes the posterior probability (4.10). A naive search in the six-dimensional space where b_c lives has complexity O(s^6), where s is the side length of the MR scan. However, since the locations of the sides of the bounding boxes are assumed to be uncorrelated, it is possible to perform the search in each dimension separately, giving a complexity of O(6s). It is also worth noticing that, since the bounding-box sides are assumed to be uncorrelated, the probability on the right-hand side of Eq. (4.11) can be written as

$$ p(\mathbf{b}_c - \mathbf{v}\,|\,\mu_{l,c}, \Sigma_{l,c}) = \prod_{d=1}^{D} p(b_c^d - v^d\,|\,\mu_{l,c}^d, \sigma_{l,c}^d), \qquad (4.12) $$

where d runs over all dimensions. We can now write down a very explicit formula for the estimate of b_c that can easily be implemented:

$$ \hat{\mathbf{b}}_c = \operatorname*{argmax}_{\mathbf{b}_c} \sum_{t \in T} \sum_{l \in L_t} \frac{1}{Z_l} \sum_{\mathbf{v} \in \mathcal{V}_l} \prod_{d=1}^{D} p(b_c^d - v^d\,|\,\mu_{l,c}^d, \sigma_{l,c}^{2,d}). \qquad (4.13) $$
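Because the sides are treated as independent, the maximization in Eq. (4.13) reduces to six one-dimensional searches. The sketch below scans one coordinate axis and accumulates the Gaussian votes from all leaf/voxel pairs for one bounding-box side; the flat "vote" representation is an assumption made for illustration, not the data structure used in the thesis.

```cpp
#include <cmath>
#include <vector>

// One vote for a single bounding-box side: the voxel coordinate along that
// axis plus the 1-D Gaussian displacement statistics stored in its leaf.
struct Vote {
    double voxel_coord;   // v^d
    double mu, sigma;     // mu_{l,c}^d and sigma_{l,c}^d
    double weight;        // 1/Z_l, the leaf's normalizing weight
};

// Scan candidate positions b in [lo, hi] with unit steps and return the one
// maximizing the accumulated vote density (one factor of Eq. (4.13)).
double map_side(const std::vector<Vote>& votes, double lo, double hi) {
    const double pi = 3.141592653589793;
    double best_b = lo, best_score = -1.0;
    for (double b = lo; b <= hi; b += 1.0) {
        double score = 0.0;
        for (const Vote& v : votes) {
            const double x = b - v.voxel_coord - v.mu;   // (b^d - v^d) - mu
            score += v.weight / (v.sigma * std::sqrt(2.0 * pi))
                     * std::exp(-0.5 * x * x / (v.sigma * v.sigma));
        }
        if (score > best_score) { best_score = score; best_b = b; }
    }
    return best_b;
}
```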

Due to the small dataset, leave-one-out cross-validation (LOOCV) was performed. We varied the number of trees, the depth of the trees, the fraction of voxels allowed to vote on the final bounding-box positions, the feature selection, the number of possible split thresholds at every test, and the number of different split features at every test. By using LOOCV, we get a very small variance in the training set, while at the same time getting a large variance in the test data.

4.6 Performance Measures

4.6.1 Jaccard Similarity Coefficient

The Jaccard similarity coefficient (JSC) is commonly used for measuring the similarity between two sets. It is defined as

$$ J = \frac{|A \cap B|}{|A \cup B|}, $$

where A and B are the compared sets. In the case of bounding-box estimation, A is taken to be the ground-truth bounding box and B the estimate. The JSC takes values between 0 and 1, where 0 corresponds to non-overlapping boxes and 1 corresponds to completely overlapping boxes. One weakness of the JSC when finding bounding boxes for full segmentation is that it does not take into account what the overlap between ground truth and prediction looks like. This is important because many algorithms set the volume outside the predicted bounding box as background. We would thus rather slightly overestimate than underestimate the bounding box. However, since the JSC is commonly used in this context, it gives a good basis for comparison with other work.

4.6.2 Center Mean Error Distance

Center mean error distance is the average Euclidean distance between the ground-truth bounding-box center and the center of the prediction. Since the distance between voxels is different along the z-axis, the distance formula is modified according to

$$ d(\mathbf{p}^{(2)}, \mathbf{p}^{(1)}) = \sqrt{ \big(p_x^{(2)} - p_x^{(1)}\big)^2 + \big(p_y^{(2)} - p_y^{(1)}\big)^2 + 1.5\,\big(p_z^{(2)} - p_z^{(1)}\big)^2 }. $$

4.6.3 Precision/Recall

The JSC does not give any information about whether the whole ground-truth bounding box is covered by a prediction or not. This might be of interest when deciding on later segmentation algorithms: if the prediction always covers the whole ground-truth bounding box, it is possible to regard anything outside the prediction as belonging to the background. A commonly used tool for this is the precision/recall plot. Precision P is defined as

$$ P = \frac{|A \cap B|}{|B|}, $$

where A is the ground-truth bounding box and B is the predicted one. Recall R is defined as

$$ R = \frac{|A \cap B|}{|A|}. $$

It is desirable to have R close to 1 to ensure that no part of the ground truth bounding box is mistakenly discarded as background. This could be accomplished by simply setting the prediction to be the whole image. Since this would lose the whole point of finding a bounding box, we also demand that P should be as close to 1 as possible. This leads to a trade-off between precision and recall.
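All three measures of this section, together with the anisotropic center distance of section 4.6.2, are simple to compute for axis-aligned boxes; a sketch with an illustrative box representation is given below.

```cpp
#include <algorithm>
#include <cmath>

struct Box3D { double x0, y0, z0, x1, y1, z1; };   // axis-aligned, x0 <= x1, etc.

double volume(const Box3D& b) {
    return (b.x1 - b.x0) * (b.y1 - b.y0) * (b.z1 - b.z0);
}

double intersection_volume(const Box3D& a, const Box3D& b) {
    const double dx = std::min(a.x1, b.x1) - std::max(a.x0, b.x0);
    const double dy = std::min(a.y1, b.y1) - std::max(a.y0, b.y0);
    const double dz = std::min(a.z1, b.z1) - std::max(a.z0, b.z0);
    return (dx > 0 && dy > 0 && dz > 0) ? dx * dy * dz : 0.0;
}

// A = ground-truth box, B = predicted box.
double jaccard(const Box3D& A, const Box3D& B) {
    const double inter = intersection_volume(A, B);
    return inter / (volume(A) + volume(B) - inter);
}
double precision(const Box3D& A, const Box3D& B) { return intersection_volume(A, B) / volume(B); }
double recall(const Box3D& A, const Box3D& B)    { return intersection_volume(A, B) / volume(A); }

// Center distance with the 1.5 factor on the squared z-difference (section 4.6.2).
double center_distance(const Box3D& A, const Box3D& B) {
    const double dx = (A.x0 + A.x1 - B.x0 - B.x1) / 2.0;
    const double dy = (A.y0 + A.y1 - B.y0 - B.y1) / 2.0;
    const double dz = (A.z0 + A.z1 - B.z0 - B.z1) / 2.0;
    return std::sqrt(dx * dx + dy * dy + 1.5 * dz * dz);
}
```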


Chapter 5

Results

All results were obtained by training on 14 different MR-images and testing on one image, not contained in the training set. The averaging was made over the 9 different brain structures shown in table 5.1.

List of used structures

Left Caudate
Left Putamen
Left Pallidum
3rd Ventricle
4th Ventricle
Brain Stem
Left Hippocampus
Left Amygdala
Left Thalamus Proper

Table 5.1: List of structures that were used in the study.

Due to the long training time, and the fact that the performance evaluation was made using LOOCV, it was infeasible to do an extensive search over the entire parameter space. Instead the parameters were varied one by one, while the others were set to standard values and held constant. The standard values were chosen such that a forest using them would get good results without taking excessive time to train. For every parameter set, the average JSC and the center mean error distance were calculated. The averaging was done over all test volumes to obtain per-structure results. See appendix A.4 for the results for individual structures. The results for the separate structures were then combined into a single result for a specific parameter set. Assuming that the results for each structure are drawn from different distributions, the total mean of the results is calculated as

$$ \mu = \frac{1}{\sum_i N_i} \sum_i N_i \mu_i, $$

where i runs over the different structures, N_i is the number of test images containing structure i, and µ_i is the mean for structure i. The combined standard deviation was then estimated as

$$ \sigma = \sqrt{ \frac{1}{\sum_i N_i - 1} \Big[ \sum_i \big( (N_i - 1)\sigma_i^2 + N_i \mu_i^2 \big) - \mu^2 \sum_i N_i \Big] }. $$

To verify that the predictions from the trees are reasonable, the output probability p(b_c) is plotted, and it is clear that the distributions are unimodal, see Fig. 5.1.

Figure 5.1: Estimated probability p(b_c) for the positions of the different bounding-box sides of the left thalamus proper.

5.1 Variation of parameters

When a parameter was not the one being varied, it was set to its value from Table 5.2.

Standard values of parameters

Depth 10

Trees 4

Tested features 10

Tested thresholds 10

Fraction of voting leaves 0.75

Table 5.2: Standard values of parameters during training of the regression forests.

5.1.1 Number of Trees

According to the theory in chapter 3, increasing the number of trees in a regression forest should lead to a reduction in the variance of the predictions, and thus to better generalization properties. In Fig. 5.2 we see a slight improvement in both the center mean error distance and the JSC, as well as a decrease in the standard deviation of the center mean error distance, when increasing the number of trees.

Figure 5.2: Results when varying the number of trees in the forest. (a) Center mean error distance. (b) Average JSC.

5.1.2 Depth

By increasing the maximum allowed depth of the trees, we obtain a model capable of making use of more detailed information. The results in Fig. 5.3 show significant improvements with an increase in maximum depth.

Figure 5.3: Results when varying the maximum depth of the trees. (a) Center mean error distance. (b) Average JSC.

5.1.3 Number of Tested Features

The results obtained by varying the number of tested features for each split node during training show a rapid increase in accuracy and a decrease in variance, see Fig. 5.4. These are expected results and are in line with the theory. By only testing one feature, one just performs a randomization of what feature to assign to a node, something that should naturally produce worse results compared to an active selection. The increase in accuracy stabilizes quickly as the number of tested features is increased.

Figure 5.4: Results when varying the number of features tested in the training of the split nodes. (a) Center mean error distance. (b) Average JSC.

5.1.4 Number of Tested Thresholds

The number of thresholds tested at each split node during training was varied. However, there was no significant improvement in the JSC, and only a minor improvement in center mean error distance, see Fig. 5.5.

Figure 5.5: Results when varying the number of sampled thresholds tested in the training of the split nodes. (a) Center mean error distance. (b) Average JSC.

5.1.5 Fraction of Voting Leaves

It was tested whether or not the results could be improved by allowing only a fraction of the leaves to vote. The lower the standard deviation of the Gaussian approximation of a leaf's result, the sooner that leaf was picked; i.e. if 10% of the leaves were allowed to vote, it was the 10% with the most certain predictions. The results showed no interesting trends when varying this percentage, see Fig. 5.6(a). However, the testing time was linear in the percentage of voting leaves, see Fig. 5.6(c). The training time is not affected by this parameter.

Figure 5.6: Results when varying the percentage of voting leaves. (a) Center mean error distance. (b) Average JSC. (c) Average time for prediction of all structures, as a function of the percentage of voting leaves.

5.1.6 Single Feature Types

The algorithm was then run with one feature type at a time, in order to compare our method to that of Criminisi, who uses only a single feature type. Interestingly, when using only the tissue-segmentation type, the results were similar to those observed when using all features together. The other three feature types gave significantly worse results, and of these, the entropy feature was the worst. Predicting with all features except the tissue segmentation gave results similar to those observed when using only the gradient magnitude.

Figure 5.7: Results when only a single feature type is used; the result for running the algorithm with the combination of all features is included for reference. (a) Average JSC. (b) Center mean error distance.

Figure 5.8: Plot showing the relative frequencies of how often a specific feature type is used.

When plotting the relative use of the different feature types, it is clear that the tissue type segmentation response is the one that is most frequently selected, see Fig. 5.8.

5.2 Precision and Recall

Fig. 5.9 shows the precision and recall for the predictions of all structures in several images. It can be seen that the ground truths are in general not fully covered by the predictions. Since some algorithms need the whole object of interest to be covered by the predicted bounding box, we investigated what happens with the recall as the box sizes are increased. In Fig. 5.10, the precision and recall are shown when all bounding-box sides are enlarged, one voxel at a time in each direction. The results show a fast convergence towards a recall of one, that is, a fully covered ground-truth bounding box.

Figure 5.9: Precision/recall data points taken from all the studied structures over all images.

Figure 5.10: Precision/recall for structures from one volume, when the bounding-box sizes are systematically enlarged. Each trajectory corresponds to one structure. For each dot, all sides are expanded outwards by one voxel. The original predictions correspond to the dots with the lowest recall in each trajectory.

5.3 Comparison with the Mean

A very rudimentary method is used as a benchmark: the mean position of each bounding-box class in the training set is used as the predictor. A translation is added to the test image to simulate noise in the patient position during the scan. For the unperturbed image, the results of the mean prediction are on par with the regression forest. When the perturbation is introduced, the results for the mean prediction decline, while the results for the regression forests stay constant, independent of the perturbation.

Figure 5.11: Plots comparing the regression forests with a mean prediction, when the test image is perturbed.

Chapter 6

Discussion

In this chapter the suggested approach and the assumptions made are examined together with the achieved results. The results are summarized and discussed in the light of the problem specification. Possible sources of error and suggestions for further work are covered briefly.

6.1 The Dataset

The dataset used, IBSR, consists of 17 volumes, complete with segmentations of subcortical structures. Of these 17 volumes, 2 could not be used due to file corruption. This leaves a dataset of 15 volumes, which should be compared with the dataset used by Criminisi et al. [5], where 400 volumes were used, of which 318 were used for training. Even though they were able to train with over 20 times as many images, the results achieved in this thesis project are far superior, with center mean error distances of around 3.5 mm, compared to 13.5 mm in Criminisi's article [5]. It is not, however, completely fair to compare the two works, since Criminisi looked at a different dataset, containing body organs in CT images. Even so, it gives some indication that the IBSR dataset is fairly well behaved in comparison to the one used by Criminisi. A possible reason for the difference in accuracy is that the IBSR dataset is normalized with respect to rotation, whereas Criminisi states that the images in his dataset are not normalized in any way. This, in combination with the fact that body organs might have more positional variation than subcortical structures, could explain the observed difference in accuracy.

6.2 Prediction Results

The predictions gave results that were significantly better than expected. Since no previous studies have detected bounding boxes of subcortical structures using the same dataset, it is slightly difficult to make comparisons. The results are superior to studies that examined different datasets, such as Criminisi et al. [5]. The results are also superior to those achieved by Zhou et al. [20], where an average JSC of around 0.66 was obtained, compared to 0.696 in this study. Additionally, the standard deviation of the JSC in this study was lower, with values around 0.07, compared to values between 0.084 and 0.236 for Zhou, and 0.130 for Criminisi.

It is not straightforward to compare our results with the ones achieved by Wels et al. since they perform a complete segmentation.

Each parameter was varied independently of the others. Due to the 5-dimensional parameter space, an exhaustive search would have been infeasible.

The results obtained while varying the depth are as expected, with increasingly better results for deeper trees. However, there was no clear indication of overfitting as the depth of the trees increased. The experiments tried values up to 15, with increases in accuracy until a depth of 14, see Fig. 5.3. One might note that with a depth of 15, and assuming that the tree is everywhere fully grown, an average of V/2^15 voxels will end up in each leaf node, where V is the number of voxels that are put into the root node. Given that we have 14 training images, an image width of around 130, and sample every third voxel in all directions, this number is around 33 voxels per leaf node.

The results obtained when varying the number of trees showed a slight improvement with the number of trees. The JSC and its standard deviation are both almost constant as a function of the number of trees. The center mean error distance improves with the number of trees, with an improvement in the average of approximately 10% when going from one tree to ten trees. This increase is, however, relatively small compared to the one obtained by Criminisi, which was around 16%. The fact that the results are very good when using only a single tree is another indication that the dataset used is quite well behaved. Usually, the number of trees is increased to reduce the impact of noise on the generalization ability of the forest. When the generalization improvement is relatively small, we can assume that the noise levels are not too severe. It is also an indication that the relative positions of the subcortical structures do not vary much between subjects.

References
