
Bone segmentation and extrapolation in Cone-Beam Computed Tomography

ZAINEB AMOR

KTH ROYAL INSTITUTE OF TECHNOLOGY


Bone segmentation and extrapolation in Cone-Beam Computed Tomography

ZAINEB AMOR

Master in Computer Science
Date: June 22, 2020

Supervisor: Kevin Smith
Examiner: Mårten Björkman

School of Electrical Engineering and Computer Science
Host company: General Electric Medical Systems


Abstract

This work was done within the French R&D center of GE Medical Systems and focused on two main tasks: skull bone segmentation on 3D Cone-Beam Computed Tomography (CBCT) data and skull volumetric shape extrapolation on 3D CBCT data, both using deep learning approaches. The motivation behind the first task is that it would allow interventional radiologists to visualize the vessels directly without adding workflow to their procedures or exposing the patients to extra radiation. The motivation behind the second task is that it would help understand and eventually correct some artifacts related to partial volumes. The skull segmentation labels were prepared while taking into account imaging-modality-related and anatomy-related considerations. The architecture for the segmentation task was chosen after experimenting with three different networks, and the hyperparameters were also optimized. The second task explored the feasibility of extrapolating the volumetric shape of the skull outside of the field of view with limited data. At first, a simple convolutional autoencoder architecture was explored; then, adversarial training was added. Adversarial training did not improve the performance considerably.


Sammanfattning

In this work, two main tasks were investigated: skull bone segmentation on 3D CBCT data and extrapolation of the volumetric skull shape on 3D CBCT data. Deep learning methods were used for both tasks.

The first task is useful because it would allow interventional radiologists to visualize only the vessels directly without adding workflow to their procedures. To prepare the data, imaging-modality-related and anatomy-related factors were taken into account. The architecture for this task was chosen after experiments with three different networks, and the hyperparameters were also optimized.

The second task investigated the possibility of extrapolating the volumetric shape of the skull outside the field of view with limited data. This task is important because it enables the correction of specific artifacts linked to partial volumes. At first a simple autoencoder architecture was investigated, then adversarial training was added, which did not considerably improve the performance.


All documents presented in this report are strictly confidential and the intellectual property of General Electric Healthcare. The documents are presented here for evaluation purposes only. They cannot be used by anyone by any means other than the ones previously mentioned.


Contents

1 Introduction
  1.1 General context and problem statement
    1.1.1 Skull segmentation context and problem statement
    1.1.2 Skull volumetric shape extrapolation context and problem statement
  1.2 Research Question
  1.3 Delimitations
  1.4 Principal interest
2 Background
  2.1 Clinical context
  2.2 CBCT data and the related problems
  2.3 Related work
    2.3.1 Convolutional neural networks for medical imaging
    2.3.2 Different types of convolutions and their utilities
    2.3.3 Residual blocks
    2.3.4 Semantic segmentation
    2.3.5 Three-dimensional shape extrapolation
    2.3.6 Generative models and generative adversarial networks
  2.4 Database
    2.4.1 The database for the skull segmentation task
    2.4.2 The database for the skull shape extrapolation task
3 Methods
  3.1 Skull segmentation on 3D CBCT data
    3.1.1 Ground truth generation and data preparation
    3.1.2 Deep learning approach for skull segmentation
  3.2 Skull volumetric shape extrapolation on 3D CBCT data
    3.2.1 Data preparation
    3.2.2 Deep learning approach for volumetric skull extrapolation
4 Experiments and results
  4.1 Skull segmentation on 3D CBCT data
    4.1.1 Experimental setup
  4.2 Results
  4.3 Skull volumetric shape extrapolation on 3D CBCT data
    4.3.1 Experimental setup
    4.3.2 Results
5 Discussion
  5.1 Skull segmentation on 3D CBCT data
  5.2 Skull volumetric shape extrapolation on 3D CBCT data
6 Conclusions
  6.1 Sustainability and ethics
  6.2 Societal aspects
Bibliography


Introduction

In this chapter, the context and the problem statement are presented, then the research question is stated before the delimitations are discussed. The chapter ends with a section about the principal interest.

1.1 General context and problem statement

This degree project is done in the context of applying deep learning methods to medical data. Today, the modalities of acquiring medical data are multiplying and, given the shortage of medical experts which is becoming general all over the world, it becomes increasingly difficult for doctors to review all the data they must investigate in a reasonable amount of time. This situation leads to delays in making diagnoses and therefore in delivering treatments, which for some people can be lethal. In addition to this, medical data is usually complex, and the risk of subjective interpretation and human error is high, especially when little time is allocated to reviewing the data. For these reasons, automated methods for medical data processing have been studied. Far from aiming at replacing doctors, most of these solutions are intended to help the medical staff in their everyday tasks.

More specifically, this master thesis was done within the context of interventional neuroradiology (INR). Interventional radiology (IR), a broader discipline, aims at the use of medical images to carry out minimally invasive procedures for diagnosis and treatment purposes. These medical procedures are performed by directly reaching the location of the operation and administering the treatment locally. In order to do so, interventional radiologists can insert devices into the vascular system of the patients; they therefore need to have a global idea about the topology of the vascular system, and this is where interventional imaging becomes crucial to their work. The range of treatments goes from cancer therapies to the treatment of blood circulation problems. Interventional neuroradiology is the branch that focuses on the head/brain, neck and spine. When it comes to INR, interventional radiologists are assisted by 2D or 3D acquisitions acquired with different modalities. In this work, we focus mainly on cone-beam computed tomography (CBCT) data.

The CBCT system is a C-arm system composed of an emitter and a detector. While C-arm systems were historically used to produce only 2D acquisitions, they can be used to produce volumes too. Figure 1.1-a) displays one of GE's C-arm systems, the Discovery IGS 730. The emitter is an X-ray tube and the detector is a flat panel. In INR, biplane systems, which have one lateral and one frontal C-arm as shown in Figure 1.1-b), are preferred.


Figure 1.1: a) GE Healthcare, Discovery IGS 730. b) GE Healthcare, Innova IGS 630. Source: GE Healthcare website.

The acquisitions are done by injecting a contrast agent (iodine) into the vessels so that they become radiopaque and therefore visible on the scans; X-rays are then emitted and cross the objects to be scanned. As the X-rays pass through tissues, bones or other materials, they lose some of their energy, i.e. they are attenuated according to Beer's law, and this attenuation is then used to compute the reconstruction from the signal received by the detector. While the C-arm rotates, it acquires a collection of 2D projections for a limited number of angles around the patient. These projections are filtered using a ramp filter, and the reconstruction is then computed as the integral along all the projection lines of these filtered projections. The second chapter in [1] develops the mathematical grounds and theory behind the reconstruction algorithm.
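As a rough, self-contained illustration of the filtered back-projection principle described above, the sketch below reconstructs a 2D slice from parallel-beam projections using numpy and scipy. It is a deliberate simplification: real CBCT reconstruction handles a cone-beam geometry (typically with the FDK algorithm), and all names and sizes here are illustrative assumptions rather than GE's actual implementation.

```python
import numpy as np
from scipy.ndimage import rotate

def ramp_filter(sinogram):
    """Filter each 1D projection with a ramp filter in the Fourier domain."""
    n_det = sinogram.shape[1]
    ramp = np.abs(np.fft.fftfreq(n_det))  # |frequency| response of the ramp filter
    return np.real(np.fft.ifft(np.fft.fft(sinogram, axis=1) * ramp, axis=1))

def filtered_back_projection(sinogram, angles_deg):
    """Reconstruct an image from a (n_angles, n_detectors) parallel-beam sinogram."""
    n_det = sinogram.shape[1]
    filtered = ramp_filter(sinogram)
    recon = np.zeros((n_det, n_det))
    for projection, theta in zip(filtered, angles_deg):
        # Smear the filtered 1D projection across the image, then rotate it into place.
        smeared = np.tile(projection, (n_det, 1))
        recon += rotate(smeared, theta, reshape=False, order=1)
    return recon * np.pi / (2 * len(angles_deg))
```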


1.1.1 Skull segmentation context and problem statement

In interventional neuroradiology, doctors are interested in visualizing the vessels. Bones, which are radiopaque and hence always present on the scans, may then interfere with their analysis. To remove the skull, one mode of CBCT acquisition consists of a first scan called mask (Figure 1.2, top) acquired without injecting the contrast agent and a second one called contrast (Figure 1.2, middle) acquired while injecting the contrast agent, thus making the vessels radiopaque too. On the mask, only the bones are visible, while both the bones and the vessels are visible on the contrast. Then a 3D-subtracted CBCT (Figure 1.2, bottom), called sub, is computed by subtracting the mask from the contrast. The problem with this procedure is that it exposes the patient to extra radiation and adds workflow, which justifies the utility of an automatic way to segment the skull bones on CBCT scans. Injected vessels' intensities are usually slightly higher than bone intensities but, as will be explained later, the voxel values are not predictable enough on CBCT data and vessels and bones might have comparable intensities on some volumes, which is why traditional segmentation methods often fail on CBCT data. A deep learning approach then seems appropriate since it will learn a global representation of the skull volumetric shape. It can also learn the bone texture.

1.1.2 Skull volumetric shape extrapolation context and problem statement

Most CBCT acquisitions are done with a maximum source-to-image distance (SID) of 1.20 m and a detector field of view of 30 cm; given the conic shape of the beams, it is impossible to acquire the whole head with such a setup. Radiologists need to focus on a specific region of the brain most of the time and therefore do not mind visualizing partial skulls, but this creates so-called truncation artifacts and prevents the correction of another type of artifact called beam hardening, as will be explained later. To correct beam-hardening and truncation artifacts, information about the whole skull is needed. This can be obtained either by making sure that the whole skull is within the field of view, which is impossible in most cases, or by learning to extrapolate the volumetric shape of the whole skull outside of the field of view and consequently infer information about the real width crossed by the X-rays. Again, a deep learning approach seems suitable to grasp the shape of the skull.


1.2 Research Question

The main research question answered in this thesis is: "What architectures and configurations are suitable for a convolutional network to perform volumetric skull bone segmentation and 3D skull shape extrapolation, and how beneficial would adversarial training be for the latter task given a limited amount of data?". We also consider the subquestion: "Which imaging-modality-related and anatomy-related considerations need to be taken into account to annotate data for skull segmentation on CBCT?"

1.3 Delimitations

The main delimitation for the first task is the inability to compare the segmentation results with segmentations done manually by radiologists: such data does not exist, and GE does not currently have any software that would allow radiologists to produce it. Long Short-Term Memory (LSTM) models combined with 2D Convolutional Neural Networks (CNNs) could also be used to model the link between the slices of the 3D volumes, but this is not explored because of time constraints.

For the extrapolation task, the variational autoencoder approach is not thoroughly explored because of memory constraints.

1.4 Principal interest

The master thesis is done in the form of a six-month internship at GE Medical Systems in Buc (France), more specifically within the R&D department of the interventional imaging team and under the industrial supervision of Vincent Jugnon. The Buc site is one of GE Healthcare's biggest R&D centers in Europe and its activities are focused on interventional imaging and mammography as well as software development for medical image review.

GE Healthcare is a branch of the General Electric group, founded in New York, and it is present in more than 160 countries around the world. It manufactures and commercializes medical systems and life science products. The medical systems include imaging systems, ultrasound systems, and clinical monitoring systems for cardiology, anesthesia and perinatal care. GE develops software related to its medical systems but also software solutions aimed at improving access to healthcare services worldwide. In recent years, its software solutions have become more and more based on AI, particularly for disease detection and relevant anatomy information extraction. Its life science products include instruments, software and pharmaceutical products focused on precision medicine. It also develops pharmaceutical products related to its own imaging systems, like contrast media. GE Healthcare specialties range from women's health to oncology, neurology, cardiology and orthopedics.

More particularly, my master thesis topic within GE Medical Systems revolves around segmentation and extrapolation of skull bone in cone-beam computed tomography using deep learning techniques. The main missions are the preparation of the databases for both the segmentation task and the extrapolation (outside of the field of view) task, and the investigation of architectures that achieve these goals. This project is of interest to GE because it was done within its R&D activities and would reduce the workflow and radiation exposure in some situations; it would also increase the quality of acquisitions that suffer from partial volume artifacts.


Figure 1.2: Top: Example of a mask viewed according to the axial (left), coronal (center) and sagittal (right) views.
Middle: Example of a contrast viewed according to the axial (left), coronal (center) and sagittal (right) views.
Bottom: Example of a sub viewed according to the axial (left), coronal (center) and sagittal (right) views.

A mask is an acquisition done without injecting a contrast agent and shows only the bones.

A contrast is an acquisition done while injecting a contrast agent and shows both the bones and the vessels.

A sub is the subtraction between the contrast and the mask and it shows only the vessels.


Background

This chapter covers the relevant background to understand the work done within this thesis. An insight into the clinical context and the CBCT modality is given, followed by the related work in deep learning. Finally, the databases used are discussed.

2.1 Clinical context

Interventional neuroradiology (INR) deals with head/brain, neck and spine diseases which affect mainly the vessels and therefore blood circulation. The main pathologies treated in INR are ischemic strokes (blood clots), aneurysms (a bulge in the wall of a blood vessel that risks rupturing, hence creating an internal hemorrhage) and arteriovenous malformations (AVMs), regions of the vascular system with "abnormal connections between the arteries and the veins", as explained in the first chapter of [1].

Ischemic strokes can be treated chemically by thrombolysis (injecting a chemical agent into the vessels to dissolve blood clots) or mechanically using a stent retriever, as shown in Figure 2.1. Stenotic vessels (vessels that are narrowed or even obstructed) are also treated mechanically by balloon angioplasty, where a balloon is inserted and inflated in order to recover normal blood flow. AVMs are treated chemically by embolization, where an embolization agent is injected into the malformed regions and acts like a cement that blocks blood flow there. Aneurysms are treated by placing coils inside them so that the blood flow inside is blocked.

The medical images that assist doctors in making the diagnosis, planning the procedure and operating, if need be, can be produced through several modalities. Pre-operative imaging relies on modalities like magnetic resonance


Figure 2.1: Illustration of a mechanical thrombectomy using a stent retriever from [2].

imaging (MRI) and X-ray computed tomography (CT). Per-operative imaging relies on X-ray cone-beam computed tomography (CBCT) but also on 2D real-time acquisitions made via fluoroscopy or digital subtraction angiography (DSA), which are useful for real-time guidance.

CT differs from CBCT in the shape of the X-ray beams emitted: while CT uses fan-shaped beams and the acquisitions are interpolated from several detectors that form an arc around the patient, as displayed in Figure 2.2, CBCT uses cone-shaped beams and can scan large volumes in one rotation. Figure 2.3 illustrates how CBCT acquisitions are generated.

Figure 2.2: CT fan beam, Source: U.S Food and Drug Administration website.

2.2 CBCT data and the related problems

CBCT digitally generated reconstructions are produced via filtered back-projection, as explained in the introduction. The images and volumes produced are grey scale; more specifically, voxel values are expressed in Hounsfield Units (HU). The HU scale, shown in Figure 2.4, is calculated according to equation 2.1.


Figure 2.3: CBCT acquisition principle as illustrated in [3].

2D projections are acquired for several angles and then used to reconstruct the 3D volume by filtered back-projection.

In equation 2.1, μ_water is the linear attenuation coefficient of water and μ_air is the linear attenuation coefficient of air:

HU = 1000 × (μ − μ_water) / (μ_water − μ_air)    (2.1)

Both CT and CBCT reconstruction algorithms use the HU scale, but while HU intensities on CT scans are reliable and repeatable, i.e. dependent on the materials, HU intensities on CBCT scans are, for a single material, fluctuating and are therefore called pseudo HU. This is mainly due to the fact that the energy spectra of the X-rays emitted by the CBCT system are less controlled than those emitted by the CT system, but other physical artifacts such as scatter also contribute to this fluctuation. Pseudo HU do not depend only on the materials and therefore change from one acquisition to another, which means that segmentation tasks are harder on CBCT acquisitions than on CT acquisitions.

Figure 2.4: Hounsfield units scale for CT acquisitions from [4]

Moreover, CBCT acquisitions are affected by several artifacts that can prevent the correct interpretation of the data. Artifacts may obscure some regions but also create patterns that mimic pathologies and therefore skew the diagnosis. The main artifacts one can observe on a CBCT scan are angular sampling streaks, scatter artifacts, truncation artifacts and beam hardening.

Noise is inherently present: typically, photonic noise, which results from fluctuations in the number of photons counted by the detector, is modeled by a Poisson distribution. In the usual reconstruction method, i.e. filtered back-projection, the projections are filtered and then back-projected. The filtering stage induces a sharpening of the edges, which means that Poisson noise is, most of the time, not the main issue. In other reconstruction methods, like iterative ones, a statistical model of the noise is used to improve the image quality at each iteration.

Figure 2.5 displays an example of undersampling streaks. This type of artifact is explained by a low angular resolution, which means that the projections are acquired along a very limited number of angles.

Figure 2.5: Undersampling streaks on a 2D slice of a brain CBCT volume.

Scatter artifacts, and more precisely Compton scatter, happen when, after being diffracted by some materials, photons change direction and, therefore, are either not counted or end up counted on the wrong detector pixel, and hence contribute to the pixels of an organ/anatomic region/material that was not crossed by them. This results in low frequency artifacts. The HU intensities of the corresponding organ/anatomic region/material become incorrect and unreliable.

Since X-ray sources are inherently polychromatic, when the rays cross a given material the low-energy photons are more attenuated than the high-energy photons; the attenuation can therefore no longer be modeled by the exponential Beer's law, which leads to an incorrect computed reconstruction and to the artifacts called beam hardening. According to [5], as the material crossed by the rays becomes denser and its atomic number higher, this phenomenon becomes more predominant and induces black streaks at the level of highly absorbent materials. A correction of beam hardening can be computed, but it requires that the complete scanned object is within the scanned region, otherwise the correction cannot be computed. When X-rays cross only soft tissue, beam hardening is insignificant, and the total attenuation can be estimated by a polynomial according to equation 2.2, n being the order of the polynomial, I_0 the initial X-ray intensity, I the output X-ray intensity and the α_i depending on the attenuation properties of the material. Now, if we assume that there are two different materials with different densities and that the ray crosses the first one then the second one, let I_0' be the X-ray intensity after the first material and the β_i depend on the attenuation properties of the second material; the attenuation can then be estimated via equation 2.3.

The projections consider the whole object as made of a single material; the attenuation approximation error is then the difference between equation 2.2 and equation 2.3, as expressed in equation 2.4. To estimate this error, it is necessary to estimate ln(I_0'/I), which translates the contribution of the second material only (for us it will be the bones). Truncation (when only a portion of the object is visible on the scan) may lead to an incorrect estimation of this value on the reconstructed volume and therefore to an incorrect correction of beam hardening. In fact, if a beam crosses a width W of bones but the field of view of the reconstruction contains a part of the skull that only explains half of that width, ln(I_0'/I) will be misestimated. Therefore, we need information about the whole skull to correct beam hardening. An example of beam hardening is shown in Figure 2.6-a): beam hardening creates low frequency artifacts like the pattern shown by the white arrows.

P = Σ_{i=0}^{n} α_i · ln^i(I_0 / I)    (2.2)

P = [ Σ_{i=0}^{n} α_i · ln^i(I_0 / I_0') ] + [ Σ_{i=0}^{n} β_i · ln^i(I_0' / I) ]    (2.3)

ΔP = Σ_{i=0}^{n} (α_i − β_i) · ln^i(I_0' / I)    (2.4)
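As a small numerical illustration of equations 2.2 to 2.4, the sketch below evaluates the polynomial attenuation model for a single material and for two materials, and then the error term of equation 2.4. All coefficients and intensities are made up for the example; in practice they would come from a calibration of the system.

```python
import numpy as np

def poly_attenuation(coeffs, log_ratio):
    """Evaluate sum_i coeffs[i] * log_ratio**i, the polynomial of equations 2.2/2.3."""
    return sum(c * log_ratio ** i for i, c in enumerate(coeffs))

alpha = np.array([0.0, 1.0, 0.02])   # illustrative soft-tissue coefficients
beta = np.array([0.0, 2.1, 0.15])    # illustrative bone coefficients
I0, I0_prime, I = 1.0, 0.6, 0.3      # intensities before, between and after the two materials

p_single = poly_attenuation(alpha, np.log(I0 / I))              # single-material model (2.2)
p_two = (poly_attenuation(alpha, np.log(I0 / I0_prime))
         + poly_attenuation(beta, np.log(I0_prime / I)))        # two-material model (2.3)
delta_p = poly_attenuation(alpha - beta, np.log(I0_prime / I))  # error term of equation 2.4
print(p_single, p_two, delta_p)
```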


Another type of artifact is the metallic artifact, shown in Figure 2.6-b). It happens because photons cannot pass through metal, which means that the detectors get no information about the photons whose paths contain a metallic object, and therefore the reconstruction algorithm lacks necessary information. Metal artifacts are usually combined with beam hardening and scatter. One way to avoid them is to simply eliminate metallic objects from the scan, but this is not always possible (coils, braces, etc.). Software corrections using interpolation to compute the real pixel values are also used.


Figure 2.6: a) Beam-hardening example from [6]. b) Illustration of a metallic artifact taken from [7].

Finally, truncation artifacts (Figure 3.2-a) happen when big parts of the object are not within the scanning field of view (FOV). They are low frequency artifacts. They are due to the fact that some of the detected photons have passed through parts of the object that are not within the detector's field of view. The reconstruction algorithm is therefore incapable of calculating the correct pixel values, as it is based on the hypothesis that all photons counted by the detector have interacted with objects within its FOV. In other words, the structures inside the field of view cannot fully explain the attenuation values, which makes the reconstruction algorithm's computations incorrect.

These are the most important artifacts. More thorough reviews of the CBCT technique and its related problems can be found in [5], [7] and [8].


2.3 Related work

2.3.1 Convolutional neural networks for medical imaging

With the emergence of deep learning, convolutional neural networks (CNNs) proved to be very interesting candidates for image related tasks as they estimate the optimal set of filters to achieve specific tasks.

CNNs were used for image classification tasks like anatomy and body part classification in [9] and [10] on CT data. While [9] classifies the data into 5 classes (liver, pelvis, legs, neck and lungs), [10] divides the human body into 12 continuous regions (nose, chin/teeth, neck, shoulder, clavicle/lung apex, sternal, aorta arch, cardiac, liver upper, liver middle, abdomen/kidney, ilium/femur head). [10] presents its method as a possible preliminary step before other tasks such as segmentation. CNNs were also used for computer aided detection (CAD) tasks, particularly when it comes to cancer and disease detection. In [11], CNNs are applied to CT data in order to detect sclerotic bone metastases, lymph nodes and colonic polyps. In addition to this, [12] uses a CNN-based approach to detect polyps in colonoscopy videos. Instead of training the network from scratch, they studied how fine-tuning a pre-trained network can yield good results while overcoming the need for huge amounts of data.

2D CNNs were used on image data acquired via one modality but also via several ones, fed to the networks as multi-channel 2D input. 2.5D CNNs use orthogonal 2D patches as input, usually three patches belonging to the XY, YZ and XZ planes, while the kernels are kept bidimensional. 2.5D CNNs have two main advantages in comparison with 3D CNNs: a relatively low computational cost and the use of 2D data, which is easier to label than 3D data. On the other hand, 2.5D CNNs cannot use three-dimensional kernels, which can be problematic when segmenting anisotropic 3D data. Also, exploiting only three orthogonal views of volumetric data is arguably not an optimal way to take advantage of all the information contained in the volume. 3D CNNs, despite being more computationally expensive, allow a better representation of volumetric data. A more complete review of the deep learning architectures used for medical data is given in [13].
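To make the 2.5D versus 3D distinction concrete, the snippet below shows the input shapes each style of network operates on in PyTorch; the channel counts and spatial sizes are arbitrary examples, not values used in this thesis.

```python
import torch
import torch.nn as nn

# 2.5D: three orthogonal 2D patches (XY, YZ, XZ) stacked as the channels of a 2D convolution.
patches_2p5d = torch.randn(1, 3, 64, 64)
conv2d = nn.Conv2d(in_channels=3, out_channels=16, kernel_size=3, padding=1)
print(conv2d(patches_2p5d).shape)   # torch.Size([1, 16, 64, 64])

# 3D: the volume (or a sub-volume) convolved directly with a three-dimensional kernel.
volume_3d = torch.randn(1, 1, 64, 64, 64)
conv3d = nn.Conv3d(in_channels=1, out_channels=16, kernel_size=3, padding=1)
print(conv3d(volume_3d).shape)      # torch.Size([1, 16, 64, 64, 64])
```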

2.3.2 Different types of convolutions and their utilities

[14] is a thorough guide through the different types of convolutions and their arithmetic. The main types, i.e. convolutions with non-unit strides, transposed convolutions and dilated convolutions, are presented in this subsection.

As [15] states, one way of performing down-sampling is to use convolutions with non-unit strides, also called strided convolutions. Unlike pooling methods, which are predefined and fixed functions, using a strided convolution allows the network to learn the optimal way of down-sampling. In the same spirit, up-sampling operations can be replaced by transposed convolutions, also known as fractionally strided convolutions, so that the network learns the optimal up-sampling.

Dilated convolutions were inspired by the "algorithme à trous" used for wavelet decomposition, in which they play an essential role. If a convolution between a real function A and a real kernel K is computed (in the discrete domain) according to equation 2.5, then equation 2.6 illustrates how dilated convolutions are computed (cf. [16]). We denote the convolution operation ∗ and the dilated convolution with factor l as ∗_l. In [16], dilated convolutions are used as multi-scale context aggregators: dilated convolutions stretch the receptive field while maintaining the resolution and might help the network grasp features that cannot be contained in the receptive fields induced by normal convolutions.

(A ∗ K)(x) = Σ_{y+z=x} A(y) × K(z)    (2.5)

(A ∗_l K)(x) = Σ_{y+l·z=x} A(y) × K(z)    (2.6)

2.3.3 Residual blocks

Residual blocks were introduced in [17] as a solution to the performance drop encountered when networks are made deeper. The first state-of-the-art models in deep learning were built by stacking more and more layers, but it was later observed that, from a certain depth, deep neural networks' performances start decreasing. This can firstly be explained by the vanishing gradient problem, which can be overcome by smart normalization schemes. As the problem persists even when normalizing the layers, an interpretation is that the performances start saturating from a certain depth and sometimes degrading. According to [17], this can be explained by the fact that deeper networks are not as easily optimized as shallower ones. A new block is therefore proposed, featuring shortcut connections that skip one or more layers and are then added to the output of the stacked layers. This can be viewed as adding a simple identity mapping without increasing the model's complexity or its number of weights. Their work is based on the assumption that fitting the residual mapping is easier than fitting the desired underlying mapping. A similar idea can be seen in [18] with gated shortcut connections which, unlike simple shortcut connections, are parametrized on the data and therefore regulate, via this parametrization, the information flow that goes through the shortcut. Long Short-Term Memory (LSTM) cells are also based on the same principle. Following the residual block idea, [19] presents a network where each layer is linked via shortcut connections to all the preceding layers. In the original paper [19], the layers' outputs are aggregated via concatenation, unlike [17] where the aggregation is done via a summation. In [20], a ResNet network is viewed as an ensemble of shallower networks.
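A minimal 3D residual block in the spirit of [17], with an identity shortcut summed to the output of two stacked convolutions; the layer sizes are illustrative and this is not the exact block used later in this thesis.

```python
import torch
import torch.nn as nn

class ResidualBlock3D(nn.Module):
    """Two 3D convolutions whose output is added to an identity shortcut."""
    def __init__(self, channels):
        super().__init__()
        self.conv1 = nn.Conv3d(channels, channels, kernel_size=3, padding=1)
        self.conv2 = nn.Conv3d(channels, channels, kernel_size=3, padding=1)
        self.relu = nn.ReLU(inplace=True)

    def forward(self, x):
        out = self.relu(self.conv1(x))
        out = self.conv2(out)
        return self.relu(out + x)  # shortcut connection: add the input back

block = ResidualBlock3D(16)
print(block(torch.randn(1, 16, 32, 32, 32)).shape)  # torch.Size([1, 16, 32, 32, 32])
```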

2.3.4 Semantic segmentation

CNNs producing a per-image prediction are efficient for classification tasks, but they cannot be used for semantic segmentation since the last layer(s) induce a loss of spatial information, which is important for dense segmentation. Fully convolutional networks (FCNs), introduced in [21], are based on the idea that semantic segmentation requires two important pieces of information: the semantics (what?) and their location (where?). FCNs use transposed convolutional layers to up-sample low-resolution feature maps and recover the information about the location. FCNs also include skip connections between the final layer and lower layers, combining in this way information about the semantics and the localization. In [21], the method was applied to natural images. Unlike natural images, annotated medical images are not abundant, and one of the inconveniences of an FCN as presented in [21] is that it still needs quite a lot of training data.

[22] set a cornerstone when it comes to medical image segmentation with its U-Net architecture. The U-Net is an encoder-decoder model that combines a contracting path able to grasp the feature maps corresponding to the semantics and a symmetrical expanding path able to extract information related to the localization. The U-Net architecture was inspired by FCNs but uses more filters in the expanding path, making the architecture symmetrical; it also uses skip connections in a different way as it combines features from the same resolution levels in the encoder and the decoder. The U-Net architecture and its different variants are still to this day state-of-the-art when it comes to medical image segmentation. Building upon the U-Net architecture, [23] presents a nested U-Net architecture where the encoder and the decoder are linked via several nested and dense skip connections. They claim that this way they combine semantically similar features, which makes the learning process easier. Their model was used on both medical and biomedical data and yields better results than U-Net on all the datasets used in the article. A similar idea is studied in [24], where the different resolution levels of the encoder and the decoder are residual blocks inspired by the "Inception" architecture [25] and normal skip connections are replaced by residual paths made of extra convolutional layers between the encoder and the decoder branches, in order to decrease the semantic disparity between the aggregated feature maps. In fact, if we consider for example the first block of the encoder and the last block of the decoder (which should be of the same resolution), it is clear that their outputs did not go through the same total number of convolutions; combining their feature maps directly is equivalent to combining information that went through different levels of processing and hence information that does not translate the same semantics. The target structures in medical data can vary a lot in size and shape. [26] presents attention gate blocks to learn how to focus on the relevant features despite the variations observed on the structures.

Lately, more attention has been given to 3D medical data in deep learning, and the networks proposed in the literature are U-Net variants transposed to three-dimensional data. [27] proposes a 3D architecture suitable for learning dense segmentation of 3D data from sparsely annotated training data. As explained in the article, the network can work in two modes: in the first one it learns, from sparsely annotated data, the dense segmentation of the same training data; in the second one it learns, from sparsely annotated data, the dense segmentation of new test data. The network in [15] is trained on MRI data and uses the dice loss because of the great imbalance between the numbers of foreground and background pixels. The use of convolutions with non-unit strides instead of down-sampling layers is also studied. Even though these approaches yield quite good results, the main limitation is the memory required to process large 3D data. A recent work, [28], suggested a recurrent convolutional neural network (RCNN) that uses long short-term memory cells to process the data sequentially. The network was used on CT data. Other articles like [29] and [30] study the effect of incorporating anatomical constraints such as shape or location on improving segmentation results. This approach can be a solution for organs whose size varies a lot, but also when the data is very noisy and has a lot of artifacts.
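Because the dice loss mentioned above also plays a role later in this work, here is one common soft formulation for binary 3D segmentation in PyTorch; the smoothing constant is a usual numerical-stability trick and not prescribed by the cited papers.

```python
import torch

def soft_dice_loss(probs, target, eps=1e-6):
    """Soft Dice loss for binary volumes of shape (batch, 1, D, H, W)."""
    dims = (1, 2, 3, 4)
    intersection = (probs * target).sum(dims)
    denominator = probs.sum(dims) + target.sum(dims)
    dice = (2.0 * intersection + eps) / (denominator + eps)
    return 1.0 - dice.mean()

pred = torch.rand(2, 1, 16, 16, 16)                   # predicted foreground probabilities
truth = (torch.rand(2, 1, 16, 16, 16) > 0.7).float()  # binary ground truth
print(soft_dice_loss(pred, truth))
```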


2.3.5 Three-dimensional shape extrapolation

General 3D reconstruction tasks are important for medical imaging, but they are also key knowledge when it comes to robotics and autonomous navigation, for example. 3D reconstruction mainly concerns the reconstruction of complete objects from incomplete, noisy or sparsely sampled data, and therefore it can be applied to occupancy grids or point clouds. One talks about completion when the missing data is within the field of view or the known data points, which is also called "inpainting". Extrapolation, on the other hand, is about extrapolating the data outside of the field of view, which is also called "outpainting".

To the best of our knowledge, neither inpainting nor outpainting have been studied on 3D medical data. Deep learning approaches for 3D medical data reconstruction mainly deal with the reconstruction of 3D volumes from 2D projections. When it comes to general 3D scanned objects, there is extensive research about 3D completion or inpainting from partial data but also about reconstruction of complete 3D data from very noisy data. The approaches used for 3D data completion can be divided into geometric techniques, template-based techniques and deep learning techniques. The geometric techniques are based on two main assumptions: the object's shape is a combination of primitives (spheres, planes, etc.) and the object's shape has symmetries. These two assumptions can help estimate the complete shape. Template-based approaches are about retrieving similar shapes from a database and deforming them in order to fit them to the data to be inpainted. Usually one template is retrieved from the database, but it is also possible to retrieve a collection of templates and use them as a combination of shapes.

When it comes to deep learning based approaches, as far as we know, no research work has been done on shape completion and extrapolation for 3D medical data. We therefore reviewed articles using a variety of objects and shapes but not medical data. As the inpainting and outpainting problems are quite similar, research work about both tasks is interesting to review. [31] is a review of three deep learning approaches, namely shallow autoencoders, convolutional autoencoders and Generative Adversarial Networks (GANs), for 3D shape completion. Only one class of the ShapeNet dataset is used for training and testing: the chair class, which includes many models. They conclude that convolutional autoencoders yield encouraging results and that GANs can be used to solve this problem too. An original and interesting use of shape inpainting is the completion of archeological objects, as done in [32]. In [32] the overall network is a GAN with an improved Wasserstein loss to stabilize the training process. Their generator is an encoder-decoder model with skip connections and squeeze-and-excite (SE) blocks. Their discriminator is a per-image classification network with SE blocks between the convolutional layers. SE blocks help model the interdependencies between the different channels and hence weight the response channel-wise. As explained in [33], an SE block first aggregates spatial information channel-wise and then estimates a weight for each channel. In [32], the best performances were found when using skip connections and SE blocks. The main problem is again the limitation in GPU memory and therefore the low output resolution. In [34] the idea to overcome this is to combine a 3D encoder-decoder GAN (3D-ED-GAN) with a recurrent convolutional network that uses long short-term memory cells [35]. The 3D-ED-GAN predicts and inpaints the missing parts in low resolution, while the recurrent network's mission is to learn how to minimize GPU memory usage while predicting the missing data in high resolution. It does this by treating the output of the 3D-ED-GAN as a sequence of 2D slices. For the same purpose of producing high-resolution outputs, [36] combines a deep learning network (3D-Encoder-Predictor CNN) that learns a representation of both the missing and the known regions of the data with a shape synthesizer that yields a high-resolution reconstruction by using shapes retrieved from a database that correlate well with the representation produced by the 3D-Encoder-Predictor CNN.

[37] is the only paper we found that explores 3D extrapolation, more particularly 3D scene outpainting. Their network takes two inputs: the 3D partial data and the corresponding 2D top view. From the first input, it estimates the three-dimensional missing part of the scene; from the second, it predicts the missing 2D top view, from which it again estimates a three-dimensional extrapolation of the scene. The two three-dimensional outputs are then aggregated to yield the final result. [38], [39] and [40] explore outpainting GAN-based networks for 2D natural images. In [38], a deep convolutional generative adversarial network (DCGAN) is used on a subset of the MIT database places365. In order to stabilize the GAN's training, they apply a three-phase training strategy where the generator is first trained alone until convergence, then the discriminator is trained alone until convergence, and finally they are both trained together with an adversarial loss. The discriminator is a combination of a global discriminator that is fed with the whole image and a local one that focuses on the outpainted part of the image. The same idea of having a global and a local discriminator is used in [40]. [39] proposes a GAN architecture where the generator is composed of a two-stage model that includes a "Feature Expansion" autoencoder module and a "Context Prediction" autoencoder module. Other tasks that use approaches that may be helpful and inspiring for our case are semantic segmentation using GANs [41], denoising and deblurring with GANs [42] and super-resolution with GANs [43].

2.3.6 Generative models and generative adversarial networks

The extrapolation problem is an ill-posed problem where more than one extrapolation can explain the same truncated volume. We also need to generate new data and not just learn a discriminative function. In deep learning, the two main generative frameworks are generative adversarial networks (GANs) [44] and variational autoencoders (VAEs) [45]. GANs were introduced in [44], where the authors propose a novel way to estimate a generative model by training a generator (which learns the data distribution) and a discriminator (which learns to distinguish between real data and fake data) in an adversarial mode: the generator wants to fool the discriminator by generating plausible samples, and the discriminator needs to detect the fake samples. This is done via the minimax two-player (generator and discriminator) game of equation 2.7, which should theoretically lead to a unique solution where G generates data similar to the real data and the discriminator can no longer distinguish between real and fake data: we talk about a Nash equilibrium. z is a sample from a prior noise distribution p_z. Given equation 2.7, G needs to minimize log(1 − D(G(z))), but as stated in [44], this usually leads to weak gradients for training G. Instead of trying to minimize log(1 − D(G(z))), we can maximize log(D(G(z))) and consequently minimize −log(D(G(z))).

min_G max_D V(D, G) = E_{x∼p_data(x)}[log D(x)] + E_{z∼p_z(z)}[log(1 − D(G(z)))]    (2.7)
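In practice, replacing the minimisation of log(1 − D(G(z))) by the minimisation of −log(D(G(z))) gives the so-called non-saturating generator loss. The sketch below writes both objectives explicitly in PyTorch for discriminator outputs in (0, 1); the small epsilon is only there for numerical stability.

```python
import torch

def discriminator_loss(d_real, d_fake, eps=1e-8):
    """Discriminator maximises E[log D(x)] + E[log(1 - D(G(z)))]; we minimise the negative."""
    return -(torch.log(d_real + eps) + torch.log(1.0 - d_fake + eps)).mean()

def generator_loss_saturating(d_fake, eps=1e-8):
    """Original objective of equation 2.7: minimise E[log(1 - D(G(z)))] (weak early gradients)."""
    return torch.log(1.0 - d_fake + eps).mean()

def generator_loss_non_saturating(d_fake, eps=1e-8):
    """Practical alternative: minimise E[-log D(G(z))] for stronger gradients."""
    return -torch.log(d_fake + eps).mean()

d_real = torch.rand(4, 1)   # discriminator scores on real samples (illustrative)
d_fake = torch.rand(4, 1)   # discriminator scores on generated samples (illustrative)
print(discriminator_loss(d_real, d_fake), generator_loss_non_saturating(d_fake))
```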

Even though the theory of GANs predicts the model's convergence to the global optimum, the training process of GANs is very unstable and some problems may occur because the optimization is done with gradient descent, which may not be the best way to find the Nash equilibrium, as explained in [46]. For example, an imbalance between the performances of the generator and the discriminator can lead to the generator learning which exact samples fool the discriminator best and overfitting its model to those samples; this is called mode collapse and it causes the generator's sample distribution to be limited to a subset of the true variety of samples we should have. Another problem is when the model does not converge and keeps oscillating. Moreover, the discriminator can be too powerful and therefore not backpropagate any loss, which hinders the training process a lot. A lot of work has been done on studying how to improve the training process of GANs ([46], [47], [48], [49], [50]). In [46], the aim is to improve the performance of GANs in a semi-supervised training mode. Against the training instability of GANs, they suggest a new objective that matches the feature statistics between the generated data and the real data, and against mode collapse they suggest minibatch discrimination. [47] proposes a convolutional GAN topology with specific features and properties such as the use of strided convolutions and deconvolutions, or the use of leaky ReLU in the discriminator. [48] and [49] are theory-oriented papers that study regularizations for GANs. [48] investigates the sources of instabilities in GAN training and how to overcome them; more specifically, they prove how unregularized GANs are prone to instabilities and study the Wasserstein loss [51] and instance noise [52] as remedies.

2.4 Database

As a diagnostic modality, CT is used frequently; consequently, a lot of CT data is available. The CBCT modality, on the other hand, is less used, which makes CBCT data more challenging to find.

2.4.1 The database for the skull segmentation task

GE Healthcare has its own CBCT head data collected from its clinical partners from inside and outside France. The dimensions of these volumes in X and Y are equal to 512 pixels, while the Z dimension is variable but does not exceed 512 pixels. The Z dimension corresponds to the number of slices and varies according to the horizontal collimation level chosen by the radiologists, which reduces the field of view in Z and therefore decreases the number of slices. Doctors choose to reduce the field of view in Z when the region they are interested in is narrow.

The data is stored as DICOM (Digital Imaging and Communications in Medicine) files, which allow transmission, storage, processing and display of medical imagery. The DICOM format conforms to international standards when it comes to the data's safety and the patients' anonymity. A DICOM object can be described as pixel data and its corresponding metadata about the patient and the procedure. The metadata contains information about the device such as the last calibration date or its serial number, information about the acquisition such as the modality, the setup of the machine and some numerical tags to identify the acquisitions, and information about the patient such as their age, sex and anonymized numerical tags to identify their scans.
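As an aside, such DICOM objects are typically read in Python with the pydicom library; the file path below is a placeholder and only standard DICOM attributes are accessed.

```python
import pydicom

ds = pydicom.dcmread("example_acquisition.dcm")   # placeholder path
pixels = ds.pixel_array                           # the pixel data as a numpy array
# Acquisition and (anonymized) patient metadata live alongside the pixel data.
print(ds.Modality, ds.PatientID, pixels.shape)
```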

2.4.2 The database for the skull shape extrapolation task

Most CBCT acquisitions are made with a maximum SID (Source-to-Image Distance), but in some rare cases a minimum SID is used, which allows doctors to have the whole skull, or almost, in the field of view. These acquisitions are made when the doctors do not know where the medical anomaly is coming from and want to investigate the case. This typically happens when they suspect an internal hemorrhage or a blocked vessel in the brain but do not know exactly where. Also, after performing an intervention on the brain, they usually want to check that no problem occurred because of their intervention and therefore look at the brain in its entirety. GE has some of these volumes in its databases, but quite a lot are still truncated in the Z direction (either the top or the base of the skull is missing). As a result, the number of relevant CBCT samples does not exceed 65.

For this particular task, and because of choices we will explain later, we can use both CBCT and CT volumes to create our input/label collection. The CT data used is a database used earlier by GE Healthcare for brain ventricle segmentation. It includes truncated and untruncated skull scans as well as some scans of the whole upper body. The truncated skulls were removed and the scans including more than just the head were pre-processed to keep only the head.


Methods

In this chapter the methodology for the two tasks is explained. For each task, a section is dedicated to the data preparation and another one to the deep learning approach.

3.1 Skull segmentation on 3D CBCT data

3.1.1 Ground truth generation and data preparation

The goal is to segment the skulls on contrast reconstructions; therefore, we can use GE's database of contrasts and the corresponding masks (cf. 2.4.1). The input data to the network will be the contrasts and the output (also the labels) should be skull segmentations. Since GE does not have a pre-existing database of skull segmentations, i.e. the labels, they must be created. Segmenting the bones in the masks should be easy. However, the masks are noisy, and some preprocessing is needed to extract the annotations. The labels were created from the masks through a succession of filtering steps; two preliminary operations were the carotid arteries mask generation and the uniformity correction applied to the masks.

Carotid mask generation

The carotid arteries are two large vessels that cross through the skull bones near the neck. They can appear as very intense vessels on the contrast. Despite them not being visible on the masks (as said earlier, the masks will be used to create the labels), the fact that they cross the skull bones means there is a risk they get segmented by mistake when creating the labels. Since the doctors are interested in the vessels, removing the carotids through the skull segmentation process would make its results useless. A convenient way to make sure that this does not happen is to set the pixels corresponding to the carotids to zero in the labels.

A relevant observation is that the carotid intensities on the contrast are higher than the bone intensities on the masks. In fact, at the iodine concentrations used in neuroradiology, iodine contrast medium is more radiopaque than bone. To get the carotids, for each mask, we start by estimating the bone maximum intensity. Let cdf(i) be the cumulative distribution function of the volume's intensities; we then consider the bone maximum intensity to be the largest intensity i such that cdf(i) < 0.999. This way, we are sure that if the maximum intensity value of the volume is induced by some sort of anomaly, it won't be considered. The value of i is then subtracted from the corresponding contrast, which allows us to create a binary mask of only the carotids and some other very intense vessels. This binary mask is then simply used to remove the carotids (and some other vessels) from the labels. Figure 3.1 shows an example of a carotid mask (in pink) and the corresponding contrast.

Figure 3.1: Axial (left), coronal (center) and sagittal (right) views. A carotid mask (in pink) superposed with its corresponding contrast.
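A minimal numpy version of this thresholding step, assuming mask_volume and contrast_volume are already loaded as arrays of the same shape; the 0.999 quantile implements the cdf criterion described above.

```python
import numpy as np

def carotid_mask(mask_volume, contrast_volume, quantile=0.999):
    """Binary mask of contrast voxels brighter than the estimated bone maximum of the mask."""
    bone_max = np.quantile(mask_volume, quantile)  # intensity i at which cdf(i) reaches 0.999
    # Subtracting bone_max from the contrast and keeping positive voxels is equivalent to this test.
    return contrast_volume > bone_max
```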

Uniformity correction applied to the masks

As explained in chapter 2 (cf. 2.2), a lot of phenomena lead to low frequency artifacts, such as scatter, truncation or beam hardening. Figure 3.2-a shows an example of a truncation artifact. One can see, on the left side of the axial and coronal views and on the right side of the sagittal view, very intense pixels: these pixels belong to soft tissue regions and therefore are not supposed to have such intensities. As this makes segmenting the skulls correctly difficult, a correction was applied to the masks before the segmentation. The uniformity correction is based on [53] and is done through algorithm 2 in Appendix A, which estimates the uniformity correction from the histograms of the full resolution volume. This algorithm was developed at GE. The carotid masks were created from the non-corrected masks because such a correction alters the pixel values, which may make the relationship between the pixel values of a mask and its corresponding contrast even less regular.


Figure 3.2: a) Truncation artifact on the left part of the coronal (left) view and on the right part of the sagittal view (right). b) Effect of the uniformity correction on the truncation artifact.

Skull segmentation in the mask acquisitions

The histograms of the corrected masks show gaussian-like distributions with a heavier right tail (Figure 3.3) caused by high intensity noise pixels. The histograms show that the background is the main contributor to the volume intensities; the small peaks at higher intensity values represent the skull pixels. The segmentation is done on each volume separately and is a 3-step approach that starts by eliminating the background, then removes the intensities which are not part of the skull, and finally cleans the volume by removing the noise that is left.

The first step is done by estimating the background intensity and then subtracting it from the volume; all pixels with negative intensities then have their intensities set to zero. As the right tail of the gaussians seems heavier than the left one, the median over the non-null values is used, since it is a more robust estimation of the background than the mean. The null values are not considered because they belong to the air and not to the human anatomy; we also know that there are no negative values. During the reconstruction phase, intensities are shifted by 1000 units. Thus on CT data, air has null intensities, water has intensities around 1000 HU and bones have intensities around 2000 HU (see Figure 2.4 for comparison). When it comes to CBCT data, this shift means that all intensities are positive, but as said earlier they are not as predictable as CT intensities.


To keep only the high intensity pixels, the median M and the standard deviation STD of the background-free volume are computed, and intensities under a threshold empirically chosen to be equal to M + 0.7·STD are set to zero. Finally, a morphological opening operation using a three-dimensional kernel is used to clean the volume. The labels are binary volumes where pixels belonging to the skull have a value of 1 and pixels not part of the skull have a value of 0. This approach was approved by GE's experts after visualizing its results on the different volumes. Since we cannot have segmentations done manually by doctors, we cannot compute an accuracy metric.
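A compact numpy/scipy sketch of this three-step labelling procedure, assuming corrected_mask is the uniformity-corrected mask volume; the 0.7 factor and the kernel size reflect the empirical choices described above, and the exact statistics used at GE may differ in detail.

```python
import numpy as np
from scipy import ndimage

def skull_label(corrected_mask, factor=0.7, opening_size=3):
    """Generate a binary skull label from a uniformity-corrected mask volume."""
    vol = corrected_mask.astype(np.float32)
    # Step 1: subtract the background, estimated as the median of the non-null intensities.
    vol = np.clip(vol - np.median(vol[vol > 0]), 0.0, None)
    # Step 2: zero out everything below median + 0.7 * standard deviation of the remaining voxels.
    remaining = vol[vol > 0]
    binary = vol > (np.median(remaining) + factor * remaining.std())
    # Step 3: clean residual noise with a 3D morphological opening.
    structure = np.ones((opening_size,) * 3, dtype=bool)
    return ndimage.binary_opening(binary, structure=structure)
```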

The carotid pixels are then removed from the labels. For memory and training time reasons, we chose to down-sample the volumes (both the contrasts and the masks) in order to have (64, 64, 64) volumes. Before being fed to the network, the contrasts were equalized: each non-null pixel was centered and normalized by, respectively, the mean and the standard deviation computed over all non-null pixels of the training database.

Figure 3.3: Examples of histograms computed over the non-null pixels of corrected masks.
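The down-sampling to (64, 64, 64) and the equalisation of the contrasts described above can be sketched as follows with scipy; train_mean and train_std are assumed to have been computed once over all non-null voxels of the training database.

```python
import numpy as np
from scipy.ndimage import zoom

def prepare_contrast(contrast, train_mean, train_std, target_shape=(64, 64, 64)):
    """Down-sample a contrast volume and normalise its non-null voxels."""
    factors = [t / s for t, s in zip(target_shape, contrast.shape)]
    small = zoom(contrast.astype(np.float32), factors, order=1)
    out = np.zeros_like(small)
    nonzero = small > 0
    out[nonzero] = (small[nonzero] - train_mean) / train_std  # centre and scale non-null voxels
    return out
```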


3.1.2 Deep learning approach for skull segmentation

The starting point for the segmentation task was the V-Net architecture [15], but we consider an architecture with fewer weights than the original V-Net. This is motivated by the limitation in GPU memory and by execution time considerations; we also think that the skull would still be captured accurately enough at the resolution used. Several experiments were done and led to the selection of the baseline architecture shown in Figure 3.4, from which three different versions were derived and compared. Unlike the V-Net, each stage of our baseline architecture is a residual-like block composed of two convolutions with a shortcut connection (cf. Figure 3.5).

Figure 3.4: Schematic representation of the baseline architecture for the segmentation.


The three different networks derived from the baseline architecture (cf. Figure 3.4) differ in the types of convolutions they exploit. The first one, which will be called architecture or network A, uses simple max-pooling layers for down-sampling; the second one, which will be called architecture B, uses convolutions with non-unit strides for this purpose; and the third one, architecture C, uses dilated convolutions in the bottleneck of the architecture. Skip connections between the compression path and the expansion path are implemented as summations in these three architectures, and all convolutional and deconvolutional layers use ReLU activations except for the output layer. The details of architectures A and B are given in tables A.2 and A.3 respectively, shown in the Appendix. In order to keep the same number of weights between the three architectures, the third residual block of the compression path was eliminated in architecture C and replaced by two consecutive dilated convolutions with dilation factors equal to 2 and 4, as shown in Figure 3.6. The dilated convolutions also use ReLU activations. In architecture C, down-sampling and up-sampling are done using max-pooling layers and transposed convolutions respectively.
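The distinguishing blocks of the three variants can be written down directly in PyTorch; the snippet below only illustrates those specific choices with an arbitrary channel count, not the full networks of Figure 3.4.

```python
import torch.nn as nn

channels = 32

# Architecture A: fixed max-pooling for down-sampling.
down_a = nn.MaxPool3d(kernel_size=2, stride=2)

# Architecture B: learned down-sampling with a strided convolution.
down_b = nn.Conv3d(channels, channels, kernel_size=2, stride=2)

# Architecture C: two consecutive dilated convolutions (dilation 2 then 4) in the bottleneck,
# with max-pooling and transposed convolutions used elsewhere in the network.
bottleneck_c = nn.Sequential(
    nn.Conv3d(channels, channels, kernel_size=3, padding=2, dilation=2),
    nn.ReLU(inplace=True),
    nn.Conv3d(channels, channels, kernel_size=3, padding=4, dilation=4),
    nn.ReLU(inplace=True),
)
```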


3.2 Skull volumetric shape extrapolation on 3D CBCT data

3.2.1 Data preparation

We decided to have a network that works on a segmentation-to-segmentation basis, i.e. the input data would be a binary volume where the partial skull's pixels are set to 1 and all other pixels are set to 0, and the output data should be a binary volume where the complete skull's pixels are set to 1 and all the rest are set to 0. Therefore, it is not necessary to only use CBCT to create the training data, which is why we could use CT data as well. This choice is motivated particularly by the fact that most beam-hardening correction algorithms estimate the attenuation due to the dense structure from its segmentation on the 3D reconstructed image, as is done in [54]. Since the data available consists of full-skull volumes only and it is impossible to find real-world data of truncated skulls (scanned using a maximum SID) and the corresponding complete skulls (scanned using a minimum SID), we decided to create the truncated skulls from the complete skulls.

Regarding CT data, we first had to select the relevant volumes, i.e. the ones containing at least the whole skull. A preliminary analysis of the data showed that the volumes containing the whole skull and nothing but the whole skull covered 10 cm to 20 cm in the Z direction, so we filtered our data according to this condition. A limited number of CT volumes covering the whole upper body were also found; only the cubes showing the head were extracted from these volumes, and the extraction was done individually for each volume.

The segmentations of the complete skulls were produced using the network studied and chosen for the skull segmentation task. One problem we had not encountered in the segmentation database is that, since the current data contains the whole skull, some of the volumes also contain the head support, whose shape is globally similar to the skull's.

Head support removal

None of the volumes used for training the segmentation network contained a head support, which therefore slightly degraded the segmentation results of our model. To overcome this, we decided to pre-process the volumes and remove the head support before they are fed to the segmentation network. This was done on both the CBCT and the CT data.


The head support removal approach was inspired by [55] and can be run on all the volumes without altering the volumes that contain no head support and without altering the skull pixels. The basic idea is that the head support, just like the table in [55], always appears as straight vertical lines on the sagittal slices. We therefore detect those vertical lines on each sagittal view with the Hough transform and set the corresponding pixels to 0 on the full-resolution volume, cf. Algorithm 1. We apply this method to the full-resolution volumes because we observed that it is easier to detect the correct vertical lines on full-resolution data. The steps of this approach are detailed in Algorithm 1; the values of accmin and ntrials were chosen empirically. After this step, the volumes can be down-sampled to (64,64,64) and segmented by our network without any problems. We use (64,64,64) volumes for the same reasons stated in the segmentation part. Figure 3.7 shows, in white, the segmentation of the full-resolution volume containing the skull and the head support, on which we superpose, in red, the segmentation of the corresponding full-resolution head support-free volume. We can observe that the whole head support is removed without altering the skull.

Figure 3.7: In white, the segmentation of the full-resolution volume containing the skull and the head support; in red, the segmentation of the corresponding full-resolution head support-free volume.
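
The following sketch illustrates the per-slice vertical-line detection, assuming SciPy's Sobel filter and scikit-image's Hough transform are used (the thesis does not name the libraries); the angle window around 0 degrees and the accumulator threshold acc_min are illustrative assumptions.

```python
import numpy as np
from scipy import ndimage
from skimage.transform import hough_line, hough_line_peaks

def head_support_pixels(sagittal_slice, acc_min):
    """Return a mask of near-vertical line pixels (head support candidates)
    detected on a coarsely segmented sagittal slice."""
    # Vertical and horizontal edges with Sobel filters.
    v_edges = np.abs(ndimage.sobel(sagittal_slice, axis=1))
    h_edges = np.abs(ndimage.sobel(sagittal_slice, axis=0))
    v_edges[h_edges > 0] = 0  # suppress pixels belonging to horizontal edges

    # Hough transform restricted to angles close to 0 degrees (vertical lines).
    angles = np.deg2rad(np.linspace(-2, 2, 21))
    accumulator, thetas, dists = hough_line(v_edges > 0, theta=angles)

    support_mask = np.zeros_like(sagittal_slice, dtype=bool)
    rows = np.arange(sagittal_slice.shape[0])
    for _, theta, dist in zip(*hough_line_peaks(accumulator, thetas, dists,
                                                threshold=acc_min)):
        # Line model: col*cos(theta) + row*sin(theta) = dist; solve for the column.
        cols = np.round((dist - rows * np.sin(theta)) / np.cos(theta)).astype(int)
        valid = (cols >= 0) & (cols < sagittal_slice.shape[1])
        support_mask[rows[valid], cols[valid]] = True
    return support_mask
```

The returned mask can then be used to set the corresponding pixels to 0 on the unsegmented full-resolution sagittal view, as described in Algorithm 1.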

Truncated/complete skull pair preparation

To create the truncated/complete skull pairs, we use the segmentations of the head support-free (64,64,64) volumes produced by the segmentation network. The labels are then the complete skull segmentations, and the input to the network consists of cropped skulls. To create these cropped volumes, we mimicked how acquisitions are performed with the system. In general, most neuro-interventional CBCT volumes cover a cylinder with a diameter of 12 cm,


Algorithm 1: Head support removal

for each full-resolution volume do
    Coarsely segment the full-resolution volume in order to keep the skull, the head support if it exists, and some noise.
    for each sagittal slice do
        Set keepon to True and ntrials to 0.
        while (keepon == True) and (ntrials < 5) do
            ntrials += 1
            Detect vertical and horizontal edges using a vertical and a horizontal Sobel filter on the coarsely segmented slice.
            Set the pixels corresponding to the horizontal edges to 0 on the vertically Sobel-filtered slice, to minimize the number of pixels belonging to anything other than vertical edges.
            Detect the vertical lines with a Hough transform (angles ≈ 0°).
            if any accumulator value is higher than accmin then
                Find the lines corresponding to the peaks of the accumulator and set the pixels of these lines to 0 on the unsegmented full-resolution sagittal view of the volume.
            else
                Set keepon to False.


but some may have diameters of 17.5 cm or 10 cm. We therefore decided to use the same geometry to create the cropped data, i.e. the known regions.

We start by choosing the center (on the axial slices) of the region to be considered as known. To make sure that the known parts of the skulls are randomly chosen, we choose a random angle between 0° and 360° and a random offset between 1 pixel and 10 pixels, and compute the center according to Equation 3.1, where θ and offset denote, respectively, the randomly chosen angle and offset, and E(·) returns the integer part.

X_center, Y_center = 32 + E(cos(θ) × offset), 32 + E(sin(θ) × offset)    (3.1)

We then create masks of the parts that will be considered as known, using the randomly chosen center and a diameter. The offset range considered ensures that the known regions always contain a minimum number of skull pixels. For all our data, we generated a 12 cm truncation; then, with a probability of 0.2, we generated an additional 10 cm or 17.5 cm truncation. Since we choose a new center for each diameter, the cropped region differs between the three diameters. The masks are binary masks, where known pixels have the value 1 and unknown pixels the value 0; we will call these masks M. We also create the masks of the unknown regions, denoted M̄. When reading the DICOM files, we have access to their spatial resolution; we use this information, together with the shape of the volumes, to convert the diameters from cm to pixels. By multiplying the volumes by M (the known-region masks), we create the truncated volumes.

Given our limited amount of data, generating data according to three diameters acts as data augmentation. We also rotated the skulls in the axial plane, with angles randomly drawn from [20°, 30°] or [-30°, -20°] and with a probability of 0.1; this is done before extracting the known cylinder (i.e. the known region), because this way a new random center can be chosen each time. The final data augmentation technique was to translate some of the skulls within the volumes, where possible, so that they lie in a corner of the volume rather than at its center. This augmentation was done to obtain data that resembles what we could find in reality: real-world acquisitions may show skulls that are both truncated and centered, but they will most certainly also show skulls that are truncated but not centered. To create the known cylinder on the rotated and translated data, we only use the 12 cm diameter, but a new center is always randomly chosen, which again ensures that the known/unknown regions are diverse throughout the data.

In the end, for each volume we have (an illustration can be seen in Figure 4.3):

1. a (64,64,64) binary volume containing the complete skull, where every pixel of the skull is set to 1 and the rest of the pixels are set to 0;

2. a (64,64,64) binary volume containing the cropped skull, where every known pixel of the skull is set to 1 and the rest of the pixels are set to 0;

3. a (64,64,64) mask of the known region, where all pixels inside the known cylinder are set to 1 and the rest of the pixels are set to 0;

4. a (64,64,64) mask of the unknown region, where all pixels outside the known cylinder are set to 1 and the rest of the pixels are set to 0.

3.2.2 Deep learning approach for volumetric skull extrapolation

Our task is a generative one, but extrapolating the skull outside of the field of view amounts roughly to deciding which pixels outside of the field of view belong to the skull. Following our review of generative models, we decided to resort to a GAN-based approach. We experimented briefly with VAEs and did not obtain good results; moreover, as the probabilistic latent space of a VAE gets larger, the number of its trainable weights increases considerably, which is a problem given our limited GPU memory. The spirit in which this task was conducted was first to study its feasibility, i.e. to establish a proof of concept, and secondly to investigate whether adversarial training improves the performance obtained with a more typical convolutional autoencoder. Unlike a typical GAN, our GAN's generator is trained in a supervised mode.

The generator

To choose the generator's architecture, we compared two generators similar to the ones used in [38] and [39], which are encoder-decoder networks. While [38] and [39] use 4 consecutive dilated convolutions, we decided to use only 3, because we observed that this yielded better results. We also added skip connections between the different resolution levels of the encoder-decoder architecture, similar to the ones used in [39]. Moreover, unlike [39], our networks are fed with the cropped skulls and the masks M̄ and M; the masks are used in the loss and metric computations. We will call the generator inspired by [39] G1 and the one inspired by [38] G2. Figure 3.8 (top and bottom) shows the evolution of the metrics on the training and validation data during the training of G1 and G2. Early stopping with a patience of 50 epochs was used, together with the Adam optimizer (learning rate 0.001, beta1 = 0.9, beta2 = 0.999) and a batch size of 8.

In Figure 3.8 (top), we can see that DiceIn (the Dice coefficient computed on the known parts, developed in 4.3.1) increases faster than DiceOut (the Dice coefficient computed on the extrapolated parts, developed in 4.3.1). Even though both DiceOut and DiceIn keep increasing on the training data, we seem to overfit, as DiceOut on the validation data slightly exceeds 0.6 and then drops. In Figure 3.8 (bottom), we can observe that DiceIn increases slowly; the convergence speeds of DiceIn and DiceOut are more consistent with each other for G2 than for G1. Given enough time, DiceIn would eventually reach the same performance with G1 and G2.
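
For illustration, a region-restricted Dice coefficient of this kind can be computed as sketched below, assuming TensorFlow tensors and binary volumes; the smoothing constant and function name are assumptions, not the thesis's exact implementation.

```python
import tensorflow as tf

def masked_dice(y_true, y_pred, region_mask, smooth=1e-6):
    """Dice coefficient restricted to a region: DiceIn uses the known-region
    mask M, DiceOut uses the unknown-region mask M_bar."""
    y_true_r = y_true * region_mask
    y_pred_r = y_pred * region_mask
    intersection = tf.reduce_sum(y_true_r * y_pred_r)
    return (2.0 * intersection + smooth) / (
        tf.reduce_sum(y_true_r) + tf.reduce_sum(y_pred_r) + smooth)
```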

In the light of these results, we decided to opt for a hybrid architecture inspired by both G1 and G2, so as to obtain both a good DiceOut and a good DiceIn while taking advantage of the convergence speed of G1.

Several experiments led to the baseline architecture shown in Figure 3.9. Its main features are: skip connections that use extra convolutions with a kernel size of one, in order to make sure that we add semantically similar feature maps; the fact that it down-samples the data only once (at the level of the third convolutional layer), so that only one deconvolutional layer is used in the decoder; and the absence of batch normalization, which we found to degrade the performance in this case. G2 uses only one transposed convolution while G1 uses two; given the results analyzed earlier, we chose to limit the number of transposed convolutions. This choice was also motivated by papers such as [56], which states that transposed convolutions can create checkerboard artifacts that can harm the performance of the network.
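
The following hypothetical Keras snippet illustrates such a projected skip connection; the channel count and activation are illustrative assumptions, the exact configuration being the one given in Figure 3.9.

```python
from tensorflow.keras import layers

def projected_skip(encoder_feat, decoder_feat, channels):
    """Project the encoder feature map with a 1x1x1 convolution before the
    summation, so that semantically similar feature maps are added."""
    projected = layers.Conv3D(channels, kernel_size=1, padding="same",
                              activation="relu")(encoder_feat)
    return layers.Add()([projected, decoder_feat])
```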

We tested three different versions of this baseline architecture: the first one uses max-pooling to down-sample, the second uses convolutions with non-unit strides, and the third has squeeze-and-excite blocks after each convolution.


Layer              Channels/Units   Kernel/Pooling size   Strides    Dilation rate   Dropout rate   Padding   Activation
Input (64,64,64)   1
3D Convolution     32               (5,5,5)               (1,1,1)    (1,1,1)         0.1            Valid     Leaky ReLU(0.2)
3D Convolution     64               (3,3,3)               (1,1,1)    (1,1,1)         0.1            Same      Leaky ReLU(0.2)
3D Convolution     64               (3,3,3)               (2,2,2)    (1,1,1)         0.1            Same      Leaky ReLU(0.2)
Flatten layer
Dense layer        64                                                                                         Leaky ReLU(0.2)
Dense layer        1                                                                                          Sigmoid

Table 3.1: Details of the discriminator architecture.

We use the same SE blocks as in [57], shown in Figure 3.10. We also tried replacing the dilated convolutions with normal convolutions and saw that this degraded DiceOut considerably (around 0.59 on both the training and the validation data), so we decided to keep them.

The discriminator

Our discriminator model is detailed in Table 3.1. We decided to use only a global discriminator: after experimenting with a discriminator composed of a local and a global discriminator ([38], [39]), we concluded that such a setup adds important memory constraints and does not necessarily yield better results. Following the guidelines of [50], we use leaky ReLU with α = 0.2 for all the layers except the last one, which uses a sigmoid activation function. We found that dropout works better than batch normalization in the discriminator and therefore use a dropout rate of 0.1 after each convolutional layer. We believe that dropout acts like noise added to the data and therefore helps the discriminator remain robust in distinguishing between complete skulls and skulls completed by the generator.
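
Assuming a Keras implementation (consistent with the layer names and padding modes reported in Table 3.1), the discriminator could be sketched as follows; the ordering of dropout relative to the activation is an assumption.

```python
from tensorflow.keras import layers, models

def build_discriminator(input_shape=(64, 64, 64, 1)):
    """Global discriminator following Table 3.1: three 3D convolutions with
    LeakyReLU(0.2) and dropout, then two dense layers ending in a sigmoid."""
    inputs = layers.Input(shape=input_shape)
    x = layers.Conv3D(32, (5, 5, 5), strides=(1, 1, 1), padding="valid")(inputs)
    x = layers.LeakyReLU(0.2)(x)
    x = layers.Dropout(0.1)(x)
    x = layers.Conv3D(64, (3, 3, 3), strides=(1, 1, 1), padding="same")(x)
    x = layers.LeakyReLU(0.2)(x)
    x = layers.Dropout(0.1)(x)
    x = layers.Conv3D(64, (3, 3, 3), strides=(2, 2, 2), padding="same")(x)
    x = layers.LeakyReLU(0.2)(x)
    x = layers.Dropout(0.1)(x)
    x = layers.Flatten()(x)
    x = layers.Dense(64)(x)
    x = layers.LeakyReLU(0.2)(x)
    outputs = layers.Dense(1, activation="sigmoid")(x)
    return models.Model(inputs, outputs)
```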


Figure 3.8: Top: DiceOut (blue) and DiceIn (orange) evolution during training with G1 on the training data (left) and the validation data (right). Bottom: DiceOut (blue) and DiceIn (orange) evolution during training with G2 on the training data (left) and the validation data (right).


Figure 3.9: Generator baseline architecture.
