
Linköpings universitet SE–581 83 Linköping

Master thesis, 30 ECTS | Computer Science

2019 | LIU-IDA/LITH-EX-A--2019/058--SE

Fully Convolutional Networks for Mammogram Segmentation

Neurala Faltningsnät för Segmentering av Mammogram

Hampus Carlsson

Supervisor: Anders Eklund
Examiner: Jose M. Peña


Copyright

The publishers will keep this document online on the Internet - or its possible replacement - for a period of 25 years starting from the date of publication barring exceptional circumstances.

The online availability of the document implies permanent permission for anyone to read, to download, or to print out single copies for his/her own use and to use it unchanged for non-commercial research and educational purpose. Subsequent transfers of copyright cannot revoke this permission. All other uses of the document are conditional upon the consent of the copyright owner. The publisher has taken technical and administrative measures to assure authenticity, security and accessibility. According to intellectual property law the author has the right to be mentioned when his/her work is accessed as described above and to be protected against infringement.

For additional information about the Linköping University Electronic Press and its procedures for publication and for assurance of document integrity, please refer to its www home page: http://www.ep.liu.se/.

Abstract

Segmentation of mammograms pertains to assigning a meaningful label to each pixel found in the image. The segmented mammogram facilitates both the function of Computer Aided Diagnosis systems and the development of tools used by radiologists during examination. Over the years many approaches to this problem have been presented. A surge in the popularity of new approaches to image processing involving Deep Neural Networks presents new possibilities in this domain, and this thesis evaluates mammogram segmentation as an application of a specialized Neural Network architecture, U-net. Results are produced on the publicly available datasets mini-MIAS and CBIS-DDSM. Using these two datasets together with mammograms from Hologic and FUJI, instances of U-net are trained and evaluated within and across the different datasets. A total of 10 experiments are conducted using 4 different models. Averaged over the classes Pectoral, Breast and Background, the best Dice scores are: 0.987 for Hologic, 0.978 for FUJI, 0.967 for mini-MIAS and 0.971 for CBIS-DDSM.

Acknowledgments

My special thanks go out to the people who have helped me complete this thesis. Thanks to Magnus Qvigstad for taking the role as supervisor at Sectra, and to Anders Eklund for supervision and guidance through this thesis; your help is very much appreciated. Jose M. Peña, thanks for taking on the role as examiner and for your interest in my work. To my sponsors at Sectra, Kristin Lundgren and Anna Naus, thanks for all your help along the way and for giving me insights into the practical aspects which relate to this work. To Kristian Köpsén, Lukas Berglin and Daniel Forsberg, thanks for your help in providing domain expertise, finding data and academic guidance.

Contents

Abstract
Acknowledgments
Contents
List of Figures
List of Tables

1 Introduction
1.1 Aim
1.2 Research Questions
1.3 Segmentation of Mammograms
1.4 Ethical Considerations

2 Background

3 Theory
3.1 Classification and Neural Networks
3.1.1 Logistic Regression
3.1.2 Neural Networks
3.1.3 Generalization to Unseen Data
3.2 Convolutional Neural Networks
3.2.1 The Three Stages of a Convolutional Layer
3.2.2 Case Study: LeNet-5
3.3 Semantic Segmentation
3.3.1 From Fully Connected to Fully Convolutional
3.3.2 Transpose Convolution
3.3.3 Case Study: U-Net
3.4 Existing Approaches to Mammogram Segmentation

4 Method
4.1 System Overview
4.1.1 Tools, Frameworks and Hardware
4.2 Data Sources
4.2.1 Preprocessing
4.2.2 Post Processing
4.3 Model
4.3.1 Training
4.4 Evaluation
4.4.1 Metrics
4.4.2 Experiment Setup

5 Results
5.2 Comparative Results

6 Discussion
6.1 Experiment Results
6.1.1 Summary
6.1.2 Experiment Elaboration
6.1.3 Comparative Analysis
6.2 Classical vs. Machine Learning Paradigms
6.3 Improvements and Future Work

7 Conclusion

List of Figures

1.1 a, b) Two different types of mammogram, SFM and FFDM; c) an example segmentation mask (green denotes breast, and blue denotes pectoral muscle); d) derivable landmarks (green denotes region of interest, red denotes skin line and blue denotes pectoral angle).
1.2 Two examples of CC view mammograms.
3.1 Toy logistic regression example. Classifying data points into {blue, yellow}.
3.2 Canonical representation of a feed-forward Neural Network. The three-dimensional input space fed through the network to produce a single output.
3.3 Fitting three different polynomials to the same set of data points.
3.4 Graphical depiction of how a 3 × 3 feature map is created from convolving a 3D kernel on a 3-channel input image.
3.5 Convolving a Sobel filter on a mammogram image to yield a feature map for the edge response.
3.6 Applying max pooling to a 4 × 4 image yielding a 2 × 2 output containing only the maximum found at each grid position.
3.7 LeNet architecture.
3.8 Connecting two feature maps as a vector input to a fully connected Neural Network.
3.9 Example of image segmentation where the source image is pixel-wise labeled according to the identified classes.
3.10 Showing the effect of applying an image of increased size to a Fully Convolutional Neural Network. In both figures, convolutional kernels have the same depth as the number of channels in the feature map on which they are applied.
3.11 Up-convolution of a 2 × 2 input with a 3 × 3 kernel of stride equal to 2 and pad equal to 1.
3.12 The U-Net architecture.
4.1 Artifacts depending on data source. Left: mini-MIAS. Middle: CBIS-DDSM. Right: Hologic.
4.2 An example input x^(i) and the corresponding y^(i). The three feature channels ℓ1, ℓ2 and ℓ3 can be seen in the right braces. An artifact of the labeling process can be seen in ℓ3.
4.3 The effect on loss for different values of the steps per epoch parameter in Keras.
4.4 Two training sessions, with and without class weighting.
4.5 Hausdorff distance between triangles A and B. The distance d is depicted with a dashed line.
5.1 Loss and accuracy on test partitions captured during training of the models. Models 1-4 were trained on Hologic, Hologic ∪ mini-MIAS, mini-MIAS and CBIS-DDSM respectively.

List of Tables

4.1 Summarizing statistics of the data sources.
4.2 The layers of the network divided into the contracting encoder-part and expanding decoder-part. Cnv, mx, and up denote Convolution, Max Pooling and Up-Convolution respectively.
4.3 Models used in the experiments. The Train column denotes which data sources were used in training of that particular model. The Validation column denotes, for the same model, which data sources were used in order to determine the early stopping criterion.
4.4 Experiments.
5.1 Experiment 1. Training: Hologic. Prediction: Hologic.
5.2 Experiment 2. Training: Hologic. Prediction: mini-MIAS.
5.3 Experiment 3. Training: Hologic. Prediction: FUJI.
5.4 Experiment 4. Training: mini-MIAS. Prediction: mini-MIAS.
5.5 Experiment 5. Training: mini-MIAS. Prediction: CBIS-DDSM.
5.6 Experiment 6. Training: CBIS-DDSM. Prediction: CBIS-DDSM.
5.7 Experiment 7. Training: Hologic ∪ mini-MIAS. Prediction: Hologic.
5.8 Experiment 8. Training: Hologic ∪ mini-MIAS. Prediction: mini-MIAS.
5.9 Experiment 9. Training: Hologic ∪ mini-MIAS. Prediction: CBIS-DDSM.
5.10 Experiment 10. Training: Hologic ∪ mini-MIAS. Prediction: FUJI.
5.11 Results produced by Model 2 and Casti et al. on two datasets D. FFDM in the case of Model 2 corresponds to images from Hologic.
5.12 Results produced by Model 2 and Rampun et al. on two datasets D. FFDM in the case of Model 2 corresponds to images from Hologic.
5.13 Results produced by Model 1 and Dubrovina et al. on two datasets D. FFDM in the case of Model 1 corresponds to images from Hologic.

1 Introduction

Breast cancer is among the most common cancers diagnosed in women worldwide [5]. Screening programs, where subjects undergo examination prior to any revealed symptoms, reduce the risk of mortality associated with breast cancer and are conducted for an age-determined subset of the female population [19]. The inescapable side effect of screening programs is the increased workload for radiologists and other medical personnel.

Computer aided diagnosis (CAD) systems are designed to provide automated medical expertise and thereby support the work of medical professionals. Studies have shown how CAD systems can improve the quality of diagnosis, making their improvement of the highest interest [5].

An X-ray image of the breast, a mammogram, can be segmented into different components depending on the type of tissue present at a given location. This information, the class associated with each type of tissue, is relevant both for radiologists and for automated computer systems. The radiologist will examine mammograms at a high rate, often keeping images of the left and right breast open for examination side by side. Due to this high rate, any actions performed by the radiologist, such as alignment and zooming of the images, negatively impact their rate of examination. These two types of actions, zooming and alignment, are examples of actions which can be automated if the breast and pectoral muscle tissue can be correctly classified. For CAD systems, the main preprocessing technique is definition of the region of interest [29]. In addition to definition of the region of interest, CAD systems also benefit from the information gained by segmentation of the pectoral muscle [29].

There have been many published works on mammogram segmentation over the years; see Mustra et al. [29] for a comprehensive review. The review, published in 2016, captures a paradigm in which the vast majority of segmentation algorithms were designed around methods such as thresholding, morphology, region growing, active contours and wavelet filtering. In none of the reviewed articles was machine learning the critical component.

In 2012, results presented by Krizhevsky et al. [18] sparked much research interest in Convolutional Neural Networks (CNN), and CNNs have since come to dominate much of image classification under the umbrella term deep learning. Since then, deep learning has seen ever increasing popularity in many domains, not least medical image processing. Litjens et al. reviewed the state of the literature on deep learning in this domain and noted the surge in popularity: of their 308 reviewed papers, 242 were published in 2016 or later [24].

In mammography there are two publicly available datasets which have seen much use in the literature: MIAS [40] and DDSM [15]. These two datasets consist of analog images from scanned film (SFM). The images are noisy and have a high presence of artifacts of different types, which are challenging aspects of the problem of mammogram segmentation. Whereas DDSM contains images of both mediolateral oblique (MLO) and craniocaudal (CC) projections, MIAS contains only MLO images. More common in modern systems is full-field digital mammography (FFDM), but currently there is no publicly available FFDM dataset [29]. A prevailing challenge when applying deep learning to medical images is the dependence on data labeled by experts for the models to learn from [24]. In the case of mammogram segmentation, neither DDSM nor MIAS has the relevant masks publicly available. Scarcity of labeled data requires effective use of whatever data is available.

One enabling approach for applying deep learning in this domain is to train the network on patches, instead of feeding the whole image to the network. Training on patches allows many data points to be sampled from a small corpus of data, and this method was successfully used by Dubrovina et al. to classify different types of tissue in digital MLO mammograms [10]. The results by Dubrovina et al. [10] are promising and demonstrate the viability of CNNs for segmenting the mammogram. Training on patches is a compromise which achieves many new data samples at the cost of spatial context. There exist, however, other methods for increasing the size of a small corpus of data through various augmentation techniques. These techniques allow for training on full mammograms, preserving the entire context for classification. It is the aim of this thesis to further evaluate the applicability of CNNs in this domain by training a network to segment mammograms from different data sources, projection types and capturing modalities.

1.1 Aim

Explore and evaluate the viability of mammogram segmentation as an application of CNNs, by training a network to segment images of different projection type, capturing modality and data source.

1.2 Research Questions

1. Can a Deep Convolutional Neural Network be trained on full mammograms to segment the breast boundary and pectoral muscle?

2. Can a high performing network generalize across unseen data sources of different character?

3. To the extent that results are comparable, how does a CNN approach to mammogram segmentation compare against other published works in this domain which are not based on machine learning?

1.3 Segmentation of Mammograms

The two predominant views for mammography are mediolateral oblique (MLO) and craniocaudal (CC). The images in Figure 1.1 are of view type MLO. Two examples of CC mammograms are depicted in Figure 1.2. The two capturing modalities, SFM and FFDM, produce mammograms different in character.

[Figure 1.1: (a) mini-MIAS, (b) Hologic, (c) Mask, (d) Landmarks. a, b) Two different types of mammogram, SFM and FFDM; c) an example segmentation mask (green denotes breast, and blue denotes pectoral muscle); d) derivable landmarks (green denotes region of interest, red denotes skin line and blue denotes pectoral angle).]

Typically, SFM mammograms (see Figure 1.1a) contain more artifacts than their digital counterparts, as seen in Figure 1.1b. The SFM mammogram in Figure 1.1a contains, in addition to the label, a white scanning artifact. The desired output is a segmentation mask: an image of the same size as the input mammogram, with a semantic class implied by the color of each pixel. In Figure 1.1c such a mask has been produced to signify the presence of three classes: breast region (green), pectoral region (blue) and background (red).

From the segmentation mask, landmarks and other features of interest can be extracted. Some examples have been illustrated in Figure 1.1d. The dashed green line denotes the region of interest, which can be utilized for automatic zooming in examination tools used by radiologists. The red curve marks the skin-line, which can be used as a reference point when documenting potential abnormalities found in the breast tissue. The final example is the pectoral angle, which has been marked in blue. Sometimes a radiologist may benefit from examining the mammogram in a piecewise fashion; the pectoral angle can be used to present slices of the mammogram normal to the border between the breast and pectoral regions, facilitating this piecewise examination.

[Figure 1.2: Two examples of CC view mammograms. (a) CBIS-DDSM, (b) Hologic.]

1.4 Ethical Considerations

Doing work of this character in the medical domain demands a certain carefulness. The work in this thesis is heavily dependent on image data which has been captured from real patients in real world scenarios. All data used in this work has been anonymized so as to protect the integrity of each and every patient. The author at no point had access to personal information which could be linked to any specific person, as such information was not scientifically relevant.

2 Background

This thesis work was conducted in collaboration with Sectra Medical Systems. Sectra is a company located in Mjärdevi, Linköping, and they develop tools to support the work of medical professionals. For some of their products, Sectra relies on automated region of interest detection in mammograms.

One doctor may examine up to 120 mammograms per hour. To facilitate examination at this rate, the software application in use has to be designed to reduce any unnecessary actions required by the radiologist. For Sectra, one of the actions being automated is zooming to the region of interest. Zooming manually exemplifies an action which has little to no cost in the case of a single mammogram, but has significant impact when the volumes become increasingly large.

Knowing what constitutes the region of interest is what enables automatic zooming, and zooming improves the workflow of the radiologist. Errors in detecting the region of interest may, in the best case, cause misalignment of mammograms, and in the worst case, lead to some breast tissue not being examined. Unexamined tissue may cause a cancer to go undetected by the radiologist, which stresses the need for reliable ROI-detection algorithms.

The algorithm which Sectra employs today is one which relies on traditional image processing and explicit rules. Their algorithm takes an X-ray image and outputs four points defining a bounding box around the relevant breast tissue. Algorithms for skin-line detection of this kind, such as the one by Casti et al. [6], will be presented in the following chapter.

3 Theory

This chapter will present those aspects of machine learning which are relevant for this thesis. Basic concepts are introduced using simple models like Logistic Regression. These basic concepts will then be used to understand more complex models like the Neural Network, which can be viewed as a network of logistic regression models. From basic fully connected Neural Networks, a more specialized network known as a Convolutional Neural Network will be introduced.

The chapter will also elaborate on related works in mammogram segmentation and methods from both classical and modern image processing paradigms will be presented.

3.1 Classification and Neural Networks

Neural Networks (NNs) belong to the category of supervised machine learning models. Supervised models learn from examples. Examples come in pairs (x^(i), y^(i)). The input x^(i) and the target y^(i) are both vectors, and the task of a supervised model is to learn to predict y^(i) for some unseen x^(i). In order to attain good predictions for unseen inputs, a supervised model will usually need to train on a larger set of inputs X and targets Y, hence the superscript notation, where x^(i) should be read as the ith input vector in X. Subscripts denote components of vectors, e.g. x^(i)_j is the jth component of x^(i).

3.1.1 Logistic Regression

For a given input vector x^(i), a Logistic Regression model outputs a value between 0 and 1. Given a specified threshold T, a Logistic Regression model can be used as a binary classifier, assigning class 0 to values less than or equal to T, and class 1 to remaining values. A supervised learning problem is called classification when the domain of Y consists of a finite set of values. In the binary setting, which is the simplest configuration, each target vector y^(i) is labeled either zero or one. Using a training set {X, Y}, a logistic regression model can be trained to assign the most probable target label given some input x^(i).

Due to its relative simplicity, this model will be used as a starting point for presenting some of the concepts and notions which occur across many supervised machine learning models.

[Figure 3.1: Toy logistic regression example, classifying data points into {blue, yellow}. (a) Top-down view with decision boundary visible. (b) The intersection of the S-shaped sigmoid planes.]

A basic understanding of Logistic Regression will hopefully facilitate better understanding of models of increased complexity, such as the Neural Network and later also the Convolutional Neural Network.

In the toy example presented in Figure 3.1, a Logistic Regression model has been trained to assign the classes yellow and blue. The data in this example has been generated from two separate normal distributions, and the task of the model is to make the most probable class assignment for each data point. From the top-down view in Figure 3.1a, the decision boundary between yellow and blue is clearly visible. Further, it can be seen from Figure 3.1b how this boundary arises from the characteristic S-shaped planes which model the true probabilities. The following paragraphs will present some theoretical notions regarding Logistic Regression; a more thorough treatment can be found in both Bishop [3] and Ng [30].

A Logistic Regression model (see Equation 3.1) is a supervised learning model h_w based on a weight vector w. The weights are sought such that the value of h_w(x) is close to one when the label corresponding to x is one; conversely, the value of h_w(x) should be close to zero when the label of x is zero.

h_w(x) = \sigma(w^T x) = \frac{1}{1 + e^{-w^T x}}    (3.1)

A vector w which gives rise to the property described above can be found by applying the Gradient Ascent optimization algorithm to the likelihood function L(w). The first step in applying this algorithm is to find the derivatives of L(w) with respect to each component w_j of w. Given a dataset of N examples and assuming that p(y = 1|x; w) = h_w(x) and p(y = 0|x; w) = 1 - h_w(x), written compactly in Equation 3.2, the likelihood function L(w) can be formulated as shown in Equation 3.3.

p(y|x; w) = h_w(x)^y (1 - h_w(x))^{1-y}    (3.2)

L(w) = p(Y|X; w) = \prod_{i=1}^{N} p(y^{(i)}|x^{(i)}; w)    (3.3)

Typically, instead of maximizing L(w) directly, its logarithm l(w) is maximized, which will yield the same result with a simpler derivative [30]. The logarithm of the likelihood function, l(w), is stated in Equation 3.4.

l(w) = \ln L(w) = \sum_{i=1}^{N} y^{(i)} \ln h_w(x^{(i)}) + (1 - y^{(i)}) \ln(1 - h_w(x^{(i)}))    (3.4)

Simplifying slightly and considering only one example (x, y), i.e. setting N = 1, computing the derivatives of l(w) will yield the expression in Equation 3.5.

\frac{\partial}{\partial w_j} l(w) = (y - h_w(x)) x_j    (3.5)

Finally, when the derivatives of l(w) are attained, the Gradient Ascent update rule can be applied as shown in Equation 3.6. The gradient \nabla l(w) points in the direction of the steepest increase of l(w). Taking steps scaled by the parameter \alpha therefore incrementally maximizes the likelihood.

w = w + \alpha \nabla l(w) = w + \alpha \begin{pmatrix} \frac{\partial}{\partial w_0} l(w) \\ \vdots \\ \frac{\partial}{\partial w_j} l(w) \end{pmatrix}    (3.6)

The incremental process of refining the weights, as guided by the gradient, is what constitutes training, and it is a central aspect of much of supervised machine learning.
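To make Equations 3.1 to 3.6 concrete, the following is a minimal NumPy sketch of logistic regression trained with gradient ascent on a toy dataset in the spirit of Figure 3.1. The data, learning rate and epoch count are illustrative assumptions, not values from the thesis.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def train_logistic_regression(X, y, alpha=0.5, epochs=2000):
    """Gradient ascent on the log likelihood l(w) (Equations 3.4-3.6).

    X: (N, d) inputs with a leading column of ones for the bias term.
    y: (N,) binary labels in {0, 1}.
    """
    w = np.zeros(X.shape[1])
    for _ in range(epochs):
        h = sigmoid(X @ w)                  # h_w(x) for all N examples (Eq. 3.1)
        gradient = X.T @ (y - h)            # sum of (y - h_w(x)) x (Eq. 3.5)
        w = w + alpha * gradient / len(y)   # ascent step (Eq. 3.6), averaged
    return w

# Toy data in the spirit of Figure 3.1: two Gaussian blobs.
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(-1.0, 0.7, size=(100, 2)),
               rng.normal(+1.0, 0.7, size=(100, 2))])
X = np.hstack([np.ones((200, 1)), X])       # prepend the bias feature
y = np.concatenate([np.zeros(100), np.ones(100)])

w = train_logistic_regression(X, y)
accuracy = np.mean((sigmoid(X @ w) > 0.5) == y)  # threshold T = 0.5
print(f"training accuracy: {accuracy:.2f}")
```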

3.1.2 Neural Networks

The previous section presented Logistic Regression as a linear classifier producing decision boundaries in the form of straight lines. Bishop [3] explains that this type of model is applicable even if the input space is not linearly separable. If a nonlinear transformation φ can be found such that the resulting feature space is linearly separable, linear classifiers can still be applied.

[Figure 3.2: Canonical representation of a feed-forward Neural Network. The three-dimensional input space is fed through the network to produce a single output.]

The rest of this subsection will use the superscript notation to denote layers of a Neural Network.

A Neural Network, as Bishop describes, comprises a fixed number of basis functions (the neurons of each layer). The basis functions are adaptive, meaning that they will change during training in order to improve the performance of the model [3].

Using the architecture depicted in Figure 3.2 as an example, the input space is in R^{3+1}, meaning that each input vector has three dimensions plus a bias term. An input vector x is propagated through the network as a series of feature mappings φ(x): R^n → R^m. Each neuron in the network is associated with a weight vector w and an activation function σ. For a single neuron, its activation is computed according to Equation 3.7.

a = \sigma\left(\sum_i w_i x_i + w_0\right)    (3.7)

For notational convenience it is common to include a one as the first element in each input vector, such that x = (1, x_0, x_1, ..., x_n)^T; this enables a simplification of the expression in Equation 3.7. Equation 3.8 shows the simplified expression for neuron activation. As in Equation 3.7, here too σ is a given activation function. The choice of σ in modern architectures is the Rectified Linear Unit (ReLU) [13]. The ReLU activation function will be described in a later section.

a = \sigma(z) = \sigma(w^T x)    (3.8)

Each of these mappings, occurring between the input and output layers, makes up the hidden layers of the Neural Network.

Letting the weight vector w of each neuron in a given layer compose the rows of a matrix W, forward propagation from one layer to the next can be written in the form expressed in Equation 3.9.

a^{l+1} = \sigma(W^{l+1} a^l)    (3.9)

The activation a for the next layer l + 1 is computed from the activation function σ applied to the product of the corresponding weight matrix W^{l+1} and the vector a^l. The activation vector a^l is composed from the activations of all neurons in the previous layer. The activation of the first layer is simply the input vector x.

According to the same principle as described for Logistic Regression earlier, the parameters which maximize the likelihood of the training data can be found for Neural Networks using gradient-based optimization methods; the example from before was the Gradient Ascent algorithm. More recent optimization algorithms like Adam [17] are also based on finding the gradient. Finding all the derivatives of the network parameters, which make up the gradient, can be done effectively using the Backpropagation algorithm [35].

As it is an enabling method for effective training of Neural Networks, the following set of equations will describe how backpropagation can be derived, and how it gives a recursive formula for finding the derivatives of all the network parameters. At a given time, the algorithm depends on the state of three layers in the network. These layers will be denoted l - 1, l and l + 1, referring to the previous, current and next layer, containing j, k and m neurons respectively.

The function J(w) depends on the input sum z of all the neurons in the last layer. Typically, J(w) is chosen such that it maximizes the log likelihood. The input sum in turn depends on the activation a_k in the layer before it, which in turn depends on its input sum z^l_k, which finally depends on the weights w_{kj} of the current layer. This dependence of J with regards to w^l_{kj} has been outlined mathematically in Equations 3.10 to 3.12.

\frac{\partial J}{\partial w^l_{kj}} = \frac{\partial J}{\partial z^l_k} \frac{\partial z^l_k}{\partial w^l_{kj}} = \frac{\partial J}{\partial a^l_k} \frac{\partial a^l_k}{\partial z^l_k} \frac{\partial z^l_k}{\partial w^l_{kj}}    (3.10)

= \left( \sum_m \frac{\partial J}{\partial z^{l+1}_m} \frac{\partial z^{l+1}_m}{\partial a^l_k} \right) \frac{\partial a^l_k}{\partial z^l_k} \frac{\partial z^l_k}{\partial w^l_{kj}}    (3.11)

= \left( \sum_m \frac{\partial J}{\partial z^{l+1}_m} w^{l+1}_{mk} \right) \sigma'(z^l_k) a^{l-1}_j    (3.12)

The error signal δ is defined to denote how the input sum of a particular neuron affects the loss J.

\delta^l_k \equiv \frac{\partial J}{\partial z^l_k}    (3.13)

The right-hand side of the definition in Equation 3.13 has already been expanded in Equation 3.10, giving Equation 3.14.

\delta^l_k = \left( \sum_m \frac{\partial J}{\partial z^{l+1}_m} w^{l+1}_{mk} \right) \sigma'(z^l_k)    (3.14)

Using the previous definition, Equation 3.15 shows how the error signal of the current layer l can be computed using the error signal of the next layer l+1.

\delta^l_k = \left( \sum_m \delta^{l+1}_m w^{l+1}_{mk} \right) \sigma'(z^l_k)    (3.15)

[Figure 3.3: Fitting three different polynomials to the same set of data points. (a) Linear fit. (b) Quadratic fit. (c) 9th degree fit.]

Once the error signal of the last layer in the network has been computed, going backwards according to Equation 3.15 enables computation of the error signal for each node in each layer of the network, finally arriving at the expression in Equation 3.16.

\frac{\partial J}{\partial w^l_{kj}} = \delta^l_k a^{l-1}_j    (3.16)
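As an illustration of the recursion in Equations 3.13 to 3.16, the following is a minimal NumPy sketch of backpropagation in a tiny two-layer network. The layer sizes, learning rate and the use of a cross-entropy loss (minimized, which is equivalent to maximizing the log likelihood) are illustrative assumptions.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

# A tiny 3 -> 4 -> 1 network trained on one example with manual
# backpropagation, mirroring Equations 3.13-3.16.
rng = np.random.default_rng(0)
W1 = rng.normal(size=(4, 3)); b1 = np.zeros(4)
W2 = rng.normal(size=(1, 4)); b2 = np.zeros(1)
x = np.array([0.5, -1.2, 0.3])
y = np.array([1.0])

for _ in range(100):
    # Forward pass: z is the input sum, a the activation of each layer.
    z1 = W1 @ x + b1; a1 = sigmoid(z1)
    z2 = W2 @ a1 + b2; a2 = sigmoid(z2)

    # Output-layer error signal (Eq. 3.13): for cross-entropy with a
    # sigmoid output, dJ/dz simplifies to (a - y).
    delta2 = a2 - y
    # Recursion (Eq. 3.15): propagate the error signal backwards;
    # for the sigmoid, sigma'(z) = a(1 - a).
    delta1 = (W2.T @ delta2) * a1 * (1 - a1)

    # Weight gradients (Eq. 3.16): dJ/dw = delta * a_prev; descent step.
    W2 -= 0.5 * np.outer(delta2, a1); b2 -= 0.5 * delta2
    W1 -= 0.5 * np.outer(delta1, x);  b1 -= 0.5 * delta1

print("prediction after training:", a2[0])
```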

The feasibility of successfully training a Neural Network depends to a large degree on the total number of parameters to be learned [13]. As there is one parameter for every connection, the more connections there are, the more parameters have to be trained. The network in Figure 3.2 is said to be fully connected, as each node in every layer is connected to every node in the subsequent layer.

A sparse network is one in which not all connections in the model are present [3]. The family of architectures known as Convolutional Neural Networks are examples of sparse networks, and will be the topic of the next section.

3.1.3 Generalization to Unseen Data

When training a supervised model like a Neural Network, the desired outcome is that the model will generalize such that good predictions can be attained for unseen data points.

In Figure 3.3, three polynomial models of the form in Equation 3.17, for M = 1, 2 and 9, are presented.

y(x, w) = \sum_{j=0}^{M} w_j x^j    (3.17)

In Figure 3.3c the model fits the training data perfectly. However, it is not likely that it would generalize to unseen data. This phenomenon is what is called overfitting [3].

In order to estimate the generalization error of a model, the training set can be randomly partitioned such that a smaller subset (e.g. 20%) is not used during training; this held-out subset is typically referred to as a validation set. The model with the least error on the validation subset is selected, and evaluating the error on this held-out 20% gives an estimate of the generalization error. This method is called simple cross-validation [30].

Models which have many trainable parameters, such as Neural Networks, are capable of capturing complex patterns in the data [3]. When overfitting occurs, the model has been fitted to particularities which are only present in the selected training sample [30]. Regularization is a method for preventing overfitting and an example technique is to penalize parameters of large magnitude by a factor λ [3].

Another regularization method which has been shown to decrease generalization error in deep Neural Networks is Dropout [39]. The idea is that if many different networks could be trained on different subsets of the data, better generalization could be achieved by averaging the predictions of all these models [39]. Training multiple models is seldom practical. Dropout achieves an approximation of having trained multiple models in the following way:

1. For each neuron n in some subset of the network at training time, assign a probability p of that neuron being present or not.

2. At testing time, use the same probability of dropout p as was used during training and scale the weights of n by this factor.

If a network has n neurons, all of which are subject to dropout, there are 2^n possible configurations. The authors of this method describe how this is analogous to training a set of these 2^n networks simultaneously, with the caveat that these networks implicitly share a large portion of their weights [39].
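In Keras, which is used later in this thesis, dropout is available as a layer. A minimal sketch follows; the layer sizes and rate are illustrative, and note that Keras applies the compensating scaling at training time rather than at test time, which is functionally equivalent to the description above.

```python
from tensorflow.keras import layers, models

# Dropout as a Keras layer: each unit in the preceding layer is dropped
# with probability 0.5 during training; no units are dropped at test time.
model = models.Sequential([
    layers.Dense(128, activation="relu", input_shape=(64,)),
    layers.Dropout(0.5),
    layers.Dense(1, activation="sigmoid"),
])
```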

Widely adopted is the regularization method referred to as Early Stopping [13]. Early stopping limits the number of parameter updates available during training. Because deep Neural Networks have the potential to model the distribution of any particular training partition, unbounded training may often cause overfitting as described earlier.

This effect can be prevented if, for each epoch, the loss on the validation partition of the data is evaluated. If the training and validation loss start to diverge, and continue to diverge for a set number of epochs, training is terminated. The state of the parameters is then reset to the state corresponding to the point of divergence, yielding the final model.
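This procedure maps directly onto the EarlyStopping callback in Keras; a sketch, with a hypothetical patience of 10 epochs:

```python
from tensorflow.keras.callbacks import EarlyStopping

# Stop when the validation loss has not improved for 10 epochs, and
# restore the weights from the best epoch, as described above.
early_stopping = EarlyStopping(
    monitor="val_loss",
    patience=10,
    restore_best_weights=True,
)
# model.fit(x_train, y_train, validation_data=(x_val, y_val),
#           epochs=200, callbacks=[early_stopping])
```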

Perhaps the most straightforward way to decrease generalization error is to increase the number of training examples [3, 30, 22].

3.2 Convolutional Neural Networks

In some domains the training set {X, Y} is composed in such a way that each input x^(i) is an image and each target vector y^(i) is some descriptive label for that image. Regular fully connected Neural Networks are applicable as long as the input can be formulated as a one-dimensional numerical vector. In principle, an image could be arranged in such a way, making every pixel a unique input neuron in the network, thus enabling training from individual pixels to prediction. In practice, however, image data imposes too much computational overhead, making training impractical. As an illustrative example, consider a 256 × 256 pixel input image. The image has 65 536 pixels, which implies that a Neural Network with a single fully connected hidden layer with the same number of units would have 4 294 967 296 weights to learn, plus a bias.

Convolutional Neural Networks (CNN) replace the general matrix multiplication with convolution [13]. CNNs use kernels applied as filters, extracting local features across their entire input image. Convolution is the process by which this occurs, and the operation yields what will be referred to as a feature map. The influence of the input image on a given pixel or unit in a feature map depends on the size of the kernel. As is typically the case, filter kernels will be of significantly smaller dimension than their input, radically decreasing the number of learnable weights [13].

Some properties of CNNs make them particularly effective on grid-like data structures such as images. First, they utilize how a given pixel is more correlated to its neighbours than to those far away [3]; second, they are inherently translation invariant, lowering the demands on large training sets [22]; third, their composition is naturally built around hierarchically combining lower level features into features of increasing complexity [22].

CNNs were developed in the late 1980s and successfully applied as classifiers of handwritten digits [21]. They initially stood out as learning systems which could be trained end-to-end [22]. Previous approaches to classification using image data relied on hand-crafted features which were used as inputs to classifiers such as fully connected Neural Networks. In CNN architectures, both the classifier and the feature extractor are the subject of training. Progress in image classification in recent years has been driven by the improving performance of these specialized networks [18, 38, 42, 14].

Starting from the convolution operation, this section will present the three common stages found in CNN architectures, namely the convolution, activation and pooling stages [13]. Then follows a brief case study of the well-known CNN architecture LeNet-5 [22], which shows how these stages are combined into one system.

3.2.1 The Three Stages of a Convolutional Layer

The following subsection will present the three most common stages in a convolutional layer.

Stage 1: Convolution

Convolution is an operator on two functions [13]. The discrete case is of most interest in this domain, images being grids of discrete values. 3D convolution is defined according to Equation 3.18; 3D convolution is used even for 2D images, since the different channels of an image (e.g. red, green and blue) make up the third dimension, and in deeper layers the different feature maps make the data 3D.

y(i, j, k) = \sum_m \sum_n \sum_p x(i - m, j - n, k - p) w(m, n, p)    (3.18)

[Figure 3.4: Graphical depiction of how a 3 × 3 feature map is created from convolving a 3D kernel on a 3-channel input image. A 2 × 2 × 3 filter kernel slides over the three channels (RGB) of the input image, and the channel-wise products are summed into the resulting 3 × 3 feature map.]

The equation describes the convolution of the discrete input function x and the discrete kernel w. The result, y, is what will be referred to as a feature map. The two relevant parameters for convolution in this domain are stride and padding. Stride denotes how the kernel moves inside the volume; in Figure 3.4 the stride is one in both the horizontal and vertical directions. Padding is the other parameter, and it defines by how much the kernel is allowed to breach the borders of the original volume. Padding is used to control the dimensions of the feature map, and the simplest form is zero-padding, where all outside values are set to zero.

Assuming a square input x of width i and likewise a square kernel w of width k, with stride s and padding p, a number of relationships between i, k, s, p and the output dimension (o, o) can be observed.

Given any i and k, with s = 1 and p = 0, the width of the output o is given by Equation 3.19.

o = (i - k) + 1    (3.19)

Maintaining a unit stride s = 1 and allowing any i, k and padding p, the relationship in Equation 3.20 can be defined.

o = (i - k) + 2p + 1    (3.20)

[Figure 3.5: Convolving a Sobel filter on a mammogram image to yield a feature map for the edge response.]

When the desired outcome is to have o = i, padding can be applied according to the relationship defined in Equation 3.21, assuming an odd k = 2n + 1, n ∈ N, and letting p = ⌊k/2⌋.

o = i + 2\lfloor k/2 \rfloor - (k - 1)    (3.21)

In the most general case, when any value for i, k, p and s is considered, the relationship in Equation 3.22 can be defined.

o = \left\lfloor \frac{i + 2p - k}{s} \right\rfloor + 1    (3.22)
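Equation 3.22 subsumes the special cases above, which a few lines of Python can verify; a sketch with illustrative example values:

```python
import math

def conv_output_size(i, k, s=1, p=0):
    """Output width of a square convolution (Equation 3.22).

    i: input width, k: kernel width, s: stride, p: padding.
    """
    return math.floor((i + 2 * p - k) / s) + 1

# Sanity checks against the special cases (Equations 3.19-3.21):
assert conv_output_size(4, 3) == 2              # Eq. 3.19: o = (i - k) + 1
assert conv_output_size(4, 3, p=1) == 4         # Eq. 3.20 with s = 1
assert conv_output_size(28, 5, p=5 // 2) == 28  # Eq. 3.21: "same" output size
```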

The mechanics of the convolution operator have been depicted graphically in Figure 3.4.¹ The figure illustrates what is expressed mathematically in Equation 3.18. The feature map is produced by means of convolving a 3D kernel on the input image. The convolution depicted in this case is one where the kernel never breaches the boundary of the input image. This yields a 3 × 3 feature map, as there are in total 9 ways of fitting the kernel inside the input, 3 for each row and column.

Under convolution, the kernel extracts particular features of the input, leaving others out. The Sobel operator, which is a handcrafted feature extractor for computing the gradient of an image, can be used to introduce some concrete notion of what it may mean to extract features using a kernel. The Sobel operator approximates the gradient of an image by convolving a kernel like the one seen in Figure 3.5. The figure depicts the input, in this case a mammogram, and the corresponding feature map. The nature of the extracted features depends on the particular values of the kernel, and in a CNN these values are trainable parameters.

Stage 2: Activation

As for regular Neural Networks, CNNs too employ a non-linear activation function. Where Neural Networks apply the non-linearity to each node in the network, a CNN applies the non-linearity to each unit in the feature maps [13].

¹Technically, the figure describes cross-correlation, which is more commonly used in practice and is the same as convolution with a flipped kernel.

[Figure 3.6: Applying max pooling to a 4 × 4 image, yielding a 2 × 2 output containing only the maximum found at each grid position.]

[Figure 3.7: LeNet architecture.]

It is common for modern CNN architectures to use the Rectified Linear Unit (ReLU) activation function [18, 13]. Using the ReLU activation function (Equation 3.23) over the previously common hyperbolic tangent (Equation 3.24) can significantly reduce the training time [18].

f(x) = \max(0, x)    (3.23)

f(x) = \tanh(x)    (3.24)

Stage 3: Subsampling, Pooling

Typically, an activation stage will be followed by a subsampling stage, commonly referred to as pooling. Pooling can be applied to make a CNN insensitive to small spatial translations, Goodfellow et al. [13] explain. Different variants of pooling exist; among the common ones is max pooling [20].

Max pooling, as depicted in Figure 3.6, works by reading a rectangular subregion of an input image and returning the maximum value found in that region. By striding the grid marking the subregion under consideration by more than one column and/or row at a time, max pooling achieves both translation invariance and a meaningful subsampling of the input [20]. Progressive subsampling reduces the number of parameters needed in the network, which improves training.
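A minimal NumPy sketch of 2 × 2 max pooling with stride 2, using the values shown in Figure 3.6:

```python
import numpy as np

def max_pool_2x2(image):
    """2 x 2 max pooling with stride 2, as depicted in Figure 3.6.

    Assumes a 2D input with even width and height; a minimal sketch,
    not an optimized implementation.
    """
    h, w = image.shape
    out = np.zeros((h // 2, w // 2))
    for i in range(0, h, 2):
        for j in range(0, w, 2):
            out[i // 2, j // 2] = image[i:i + 2, j:j + 2].max()
    return out

x = np.array([[122,  90, 180, 111],
              [ 50,  69, 222,  77],
              [ 34,  28,  34,  44],
              [ 31,  25,  10,   4]])
print(max_pool_2x2(x))  # [[122. 222.] [ 34.  44.]]
```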

3.2.2 Case Study: LeNet-5

Up until this point, this section has presented the different components of a Convolutional Neural Network independently of each other. In order to facilitate a better understanding of their interaction, and of CNNs as one integrated learning system, this subsection will consider the LeNet-5 architecture [22].

Composed of seven layers in total, LeNet-5 is a CNN architecture initially designed for recognition of handwritten characters [22]. It is represented graphically in Figure 3.7, where classification of the letter 'A' is occurring. As depicted in the figure, convolutions as described in Section 3.2.1 are followed by subsampling stages similar to what was presented in Section 3.2.1.

The two convolutional layers comprise 6 and 16 filters respectively. Each convolutional layer will produce more than one feature map, one per filter. How these feature maps can be combined as one input, going from one convolutional layer to the next, is perhaps easiest understood when considering the input to convolutional layers as volumes.

The input image of the network is a volume of 32 × 32 × 1; the image can be thought of as having one channel. The first set of convolutions transforms the input into another, 28 × 28 × 6 volume, which has 6 channels. The filters which make up the second set of convolutions (C3 in Figure 3.7) have the same depth as the number of channels in the input volume. Convolutions over multiple channels were described earlier in Section 3.2.1. After subsampling of the feature maps in the second convolutional layer, the units of the feature maps in layer S4 are set as input to a fully connected Neural Network of the kind described in Section 3.1.2. The reader may find Figure 3.8 clarifying as to what that procedure entails. After propagation through this part of the network, a class prediction of the character is attained.

Backpropagation as described in a previous section is applicable to CNNs as well. LeNet-5 was trained using the Gradient Descent algorithm [22].
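For reference, a LeNet-5-style model is only a few lines in Keras. This is a sketch rather than a faithful reproduction: the original used trainable subsampling layers (average pooling stands in for them here) and an RBF output layer.

```python
from tensorflow.keras import layers, models

# A minimal LeNet-5-style model; layer dimensions follow Figure 3.7.
model = models.Sequential([
    layers.Conv2D(6, (5, 5), activation="tanh",
                  input_shape=(32, 32, 1)),       # C1: 28 x 28 x 6
    layers.AveragePooling2D((2, 2)),              # S2: 14 x 14 x 6
    layers.Conv2D(16, (5, 5), activation="tanh"), # C3: 10 x 10 x 16
    layers.AveragePooling2D((2, 2)),              # S4: 5 x 5 x 16
    layers.Flatten(),                             # as in Figure 3.8
    layers.Dense(120, activation="tanh"),
    layers.Dense(84, activation="tanh"),
    layers.Dense(10, activation="softmax"),
])
model.compile(optimizer="sgd", loss="categorical_crossentropy")
```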

[Figure 3.8: Connecting two feature maps as a vector input to a fully connected Neural Network.]

3.3 Semantic Segmentation

As exemplified in the work by Krizhevsky et al. [18], CNNs achieve state-of-the-art performance in predicting the class of images. Further, CNNs have been shown to produce good results on more complex problems, such as localization and pixel-wise prediction.

[Figure 3.9: Example of image segmentation where the source image is pixel-wise labeled according to the identified classes. (a) Source image before segmentation. (b) Segmented image.]

A natural step from classifying the dominant object in an image is to also make a prediction as to the location of that object. This task is called localization and an example of a CNN architecture designed to this end is OverFeat by Sermanet et al. [36].

OverFeat was trained to classify images downsampled to 256 × 256. During training, the classifier had a series of convolutional layers as feature extractors, and used these extracted features as input to a fully connected network, an architecture reminiscent of the LeNet-5 presented earlier. The authors then converted the network at testing time, such that the fully connected layers of the network were replaced with equivalent convolutional layers. This seemingly redundant design change is what enables a classifier to be applied to inputs of arbitrary shape. In the case of OverFeat, this enabled their classifier to be applied in a sliding window fashion across images larger than 256 × 256 while keeping redundant computations at a minimum.

A Fully Convolutional Neural Network is a Convolutional Neural Network which is made up of only convolutional layers. When a trained classifier like AlexNet [18] is converted to a Fully Convolutional Network (FCN) and applied as a sliding window over images, both spatial and class information can be attained simultaneously. This property, which was used in OverFeat [36] to produce bounding boxes, can be used to produce dense predictions, i.e. a prediction for each pixel. Long et al. [25] showed how already trained classifiers could be converted in this way, enabling pixel-wise prediction, more commonly known as Semantic Segmentation. Semantic segmentation of an example image can be seen in Figure 3.9. The output, as seen in the right subfigure, is color coded such that each color represents a particular class found in the source image to the left.

Fully convolutional conversion of classifiers produces a coarse prediction of where an object of a given class is located. Long et al. [25] showed how CNNs, trainable end-to-end, could be used to learn the non-linear upsampling required to produce a pixel-wise prediction from this coarse output.

Formally, the task of a Neural Network for Semantic Segmentation is the following: given a network H, an image I ∈ R^{n×m} and a set of semantic classes C = {0, 1, ..., K}, let each pixel I_{x,y} belong to a class c ∈ C. For each image I, let the ground truth be encoded using another image G ∈ R^{n×m×K}. The network for Semantic Segmentation is the function H: I ↦ G. The following subsection will elaborate on some of the details around FCNs and also learnable upsampling. The section will conclude with a case study of a Fully Convolutional Neural Network which has seen much use in segmentation of medical images, namely U-Net [34].

[Figure 3.10: Showing the effect of applying an image of increased size to a Fully Convolutional Neural Network. In both subfigures, convolutional kernels have the same depth as the number of channels in the feature map on which they are applied. (a) A network taking inputs of size 14 × 14 × 3, making a classification c ∈ {0, 1, 2, 3}: 5 × 5 convolutions give 10 × 10 × 16, 2 × 2 max pooling gives 5 × 5 × 16, and three convolutional 'FC' layers give 1 × 1 × 400, 1 × 1 × 400 and 1 × 1 × 4. (b) The same classifier applied to a larger 16 × 16 × 3 image: the corresponding volumes are 12 × 12 × 16, 6 × 6 × 16, 2 × 2 × 400, 2 × 2 × 400 and 2 × 2 × 4; the output now contains spatial information.]

3.3.1 From Fully Connected to Fully Convolutional

As described by Sermanet et al. [36], a CNN can be altered in such a way that it only uses convolutional layers front to end, while remaining functionally equivalent to one which uses fully connected layers. One benefit of this transformation is that it enables inputs of any size. Another benefit is that it is much faster than running a fully connected network separately for each pixel in the image.

In Figure 3.10a a network has been changed to only use convolutional layers. After the max pooling layer there remains a feature map of 16 channels. Where the 400 constituent units would have been connected to a fully connected layer (see Figure 3.8), there are now 400 5 × 5 convolutions. The result of this convolution is a 1 × 1 × 400 volume. The network applies three such operations, corresponding to three fully connected layers.

An FCN is not constrained by the fixed number of connections imposed by a fully connected layer. When the shape of the input increases, the shape of the output changes accordingly. This can be seen depicted in Figure 3.10b. Using a stride of 2, a 14 × 14 × 3 volume can fit in 4 different positions of the new input, which is of size 16 × 16 × 3. The effect is analogous to applying the network in Figure 3.10a as a sliding window on the larger input [36]. The new output is instead 2 × 2 × 4, such that each spatial position of the output corresponds to a unique position of the sliding window. As an additional example, consider a Fully Convolutional Neural Network which is designed to classify objects in images, e.g. plane, boat and car. The network in this example is trained on images of resolution 256 × 256. Being a classifier, the output of this network is a set of N nodes, one for each class. In an FCN context this means a 1 × 1 × N volume.

[Figure 3.11: Up-convolution of a 2 × 2 input with a 3 × 3 kernel of stride equal to 2 and pad equal to 1.]

After training, the network is applied to a larger 512 × 512 image. The desire is to apply the now trained classifier to the larger image, in order to scan it in a sliding window fashion for these objects of interest. The naive implementation of this procedure is to take every sub-slice of size 256 × 256 from the larger 512 × 512 image and run each slice through the classifier. The analogous procedure is done implicitly using the FCN method: for each sub-slice of the large image, a classification is produced.
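The dimensions in Figure 3.10 can be reproduced with a small Keras sketch; the layer sizes follow the figure, and everything else is illustrative:

```python
from tensorflow.keras import layers, models

# Sketch of the fully convolutional classifier in Figure 3.10: the
# 'fully connected' layers are expressed as convolutions, so the same
# network accepts both 14 x 14 and larger inputs.
def build_fcn(input_shape):
    return models.Sequential([
        layers.Conv2D(16, (5, 5), activation="relu", input_shape=input_shape),
        layers.MaxPooling2D((2, 2)),
        layers.Conv2D(400, (5, 5), activation="relu"),  # replaces the first FC layer
        layers.Conv2D(400, (1, 1), activation="relu"),  # 1x1 convs act per position
        layers.Conv2D(4, (1, 1), activation="softmax"),
    ])

print(build_fcn((14, 14, 3)).output_shape)  # (None, 1, 1, 4)
print(build_fcn((16, 16, 3)).output_shape)  # (None, 2, 2, 4)
```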

Thus, the output now includes spatial information. It is this fact that was used by Long et al. [25] and that enabled classifiers to be used in Semantic Segmentation: classifiers trained on small images could be applied to big ones, and the resulting output could then be taken through a sequence of upsampling phases, yielding a dense prediction.

Long et al. [25] showed this upsampling could be learned using the transpose convolution operation, which will be the topic of the next subsection.

3.3.2 Transpose Convolution

Efficient networks progressively downsample the input in between the convolutional layers; therefore, architectures designed for dense predictions have to perform a sequence of upsampling operations before the output is produced. These networks will often include transpose convolutions, or up-convolutions as they are sometimes called, as a means of achieving this. Both Long et al. [25] and Ronneberger et al. [34] include this operation in their architectures. Regular convolution can be defined as a matrix multiplication acting on a vector. As an illustrative example, consider a 3 × 3 kernel w acting on a 4 × 4 input. Assuming stride s = 1 and padding p = 0, the convolution w ∗ x can be defined according to Equation 3.25.

w * x = \begin{pmatrix}
w_{0,0} & w_{0,1} & w_{0,2} & 0 & w_{1,0} & w_{1,1} & w_{1,2} & 0 & w_{2,0} & w_{2,1} & w_{2,2} & 0 & 0 & 0 & 0 & 0 \\
0 & w_{0,0} & w_{0,1} & w_{0,2} & 0 & w_{1,0} & w_{1,1} & w_{1,2} & 0 & w_{2,0} & w_{2,1} & w_{2,2} & 0 & 0 & 0 & 0 \\
0 & 0 & 0 & 0 & w_{0,0} & w_{0,1} & w_{0,2} & 0 & w_{1,0} & w_{1,1} & w_{1,2} & 0 & w_{2,0} & w_{2,1} & w_{2,2} & 0 \\
0 & 0 & 0 & 0 & 0 & w_{0,0} & w_{0,1} & w_{0,2} & 0 & w_{1,0} & w_{1,1} & w_{1,2} & 0 & w_{2,0} & w_{2,1} & w_{2,2}
\end{pmatrix} \cdot \begin{pmatrix} x_0 \\ x_1 \\ \vdots \\ x_{15} \end{pmatrix}    (3.25)

Here the 2-dimensional 4 × 4 input image has been transformed into a vector of length 16 (left to right, top to bottom). Let C denote the matrix which defines the convolution operation with the kernel w, such that Cx: R^{4×4} ↦ R^{2×2}. Given this definition of the convolution operation via the matrix C, there is a corresponding transpose convolution with the same connectivity pattern, C^T x: R^{2×2} ↦ R^{4×4} [11].

An alternative explanation of this operation is presented in Figure 3.11, which is a graphical depiction of the transpose convolution operation described above. Each of the units in the input will act as a weight upon the kernel. The shape of the output is determined by the stride of the convolution; in this example, the kernel moves two units for every unit in the input, i.e. the stride equals 2. Where the kernel overlaps in the output, the result is accumulated.

The kernels for upsampling consist of learnable weights, just like the ones used for subsampling. This enables learnable non-linear upsampling, making these types of networks trainable end-to-end [25].
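The correspondence between Equation 3.25 and the transpose convolution can be checked numerically. The following NumPy sketch builds the matrix C for the 3 × 3-kernel-on-4 × 4-input example and applies C^T to map a 2 × 2 input back to a 4 × 4 output; the random data is illustrative.

```python
import numpy as np

# Build the matrix C from Equation 3.25 for a 3x3 kernel w on a 4x4
# input (stride 1, no padding), then use C.T as the transpose convolution.
rng = np.random.default_rng(0)
w = rng.normal(size=(3, 3))

C = np.zeros((4, 16))
for row, (di, dj) in enumerate([(0, 0), (0, 1), (1, 0), (1, 1)]):
    for m in range(3):
        for n in range(3):
            # Flattened (row-major) index of input pixel (di + m, dj + n).
            C[row, (di + m) * 4 + (dj + n)] = w[m, n]

x = rng.normal(size=16)   # a 4x4 image, flattened left to right, top to bottom
y = C @ x                 # convolution: R^16 -> R^4, i.e. a 2x2 output
x_up = C.T @ y            # transpose convolution: R^4 -> R^16, i.e. 4x4
print(y.reshape(2, 2))
print(x_up.reshape(4, 4))
```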

3.3.3 Case Study: U-Net

U-Net is the most well-known CNN architecture used for segmentation of medical images [24]. It was proposed by Ronneberger et al. [34] as an FCN architecture for performing pixel-wise labeling of microscopy images of cells. The architecture, which was given the name U-Net for its U-like shape, can be seen in Figure 3.12. The network design closely resembles what was proposed by Long et al. [25], but includes more feature channels in the upsampling part of the network.

U-Net can be seen as being composed of two parts, the upsampling and the subsampling part. The subsampling part is a convolutional network along the same principles as have been seen earlier in this chapter. It makes up the left half of Figure 3.12 and its task is to learn a dense representation of the contents of the input image.

The right side of the network performs upsampling, relying on the transpose convolution as described in Section 3.3.2. The feature channels of the layers on this side of the network are composed of both upsampled feature maps coming from the preceding layer and feature maps taken directly from the subsampling half of the network; marked with green arrows in Figure 3.12, these feature maps allow the gradients to flow more directly between layers of the same spatial dimensions.
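One decoder step of such a network might look as follows in Keras. This is a sketch, not the exact U-Net: the original crops the copied encoder maps instead of using 'same' padding, and the function name is hypothetical.

```python
from tensorflow.keras import layers

def decoder_step(x, skip, filters):
    """One up-sampling step of a U-Net-style decoder (cf. Figure 3.12)."""
    # 2x2 up-convolution doubles the spatial resolution.
    x = layers.Conv2DTranspose(filters, (2, 2), strides=2, padding="same")(x)
    # Concatenate the encoder feature map of the same spatial size
    # (the green "copy and crop" arrows in Figure 3.12).
    x = layers.Concatenate()([x, skip])
    x = layers.Conv2D(filters, (3, 3), activation="relu", padding="same")(x)
    x = layers.Conv2D(filters, (3, 3), activation="relu", padding="same")(x)
    return x
```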

The architects of U-Net describe two complicating aspects when applying segmentation networks of this kind in the domain of medical image processing. First is the lack of available labeled training examples, which stems from the specific domain knowledge needed to create them. Second, although perhaps not as general, is the necessity of getting some pixel labels correct over others; when segmenting images of cells, the value of correctly labeling the border pixels, which make up only a small part of a typical image, is disproportionately important. In order to enable training of a model of this kind on a small dataset, the authors use data augmentation. Data augmentation is a general term for creating more samples from a corpus of data by application of certain transformations. These transformations are applied to a given data point to generate new ones. From the point of view of the network, a data point and its transformed counterpart will appear distinct. Simple data augmentation involves applying random scalings and rotations to an image to generate more training data. Applying smooth non-linear deformations to an image is an example of slightly more advanced data augmentation.
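In Keras, simple augmentation of this kind is available out of the box. A sketch with illustrative parameter values; the variable names are hypothetical:

```python
from tensorflow.keras.preprocessing.image import ImageDataGenerator

# Simple data augmentation of the kind described above: random
# rotations, shifts, scalings and flips generate new samples from a
# small corpus of training data.
augmenter = ImageDataGenerator(
    rotation_range=15,        # random rotations up to +/- 15 degrees
    width_shift_range=0.1,    # random horizontal shifts
    height_shift_range=0.1,   # random vertical shifts
    zoom_range=0.1,           # random scalings
    horizontal_flip=True,
)
# images: a (N, H, W, 1) array of training images (hypothetical).
# batches = augmenter.flow(images, masks, batch_size=8)
```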

The domain in which U-Net was first applied required correct segmentation of the thin border which separated two cells from each other. The authors computed, for each training sample, a weight-map which could be used to guide the model during training.

Equations 3.26 and 3.27 describe how loss was computed for the model in the initial publica-tion. Each pixel position x, y is associated with a label l P t1, ..., Ku. In the output layer, U-Net has K feature channels (recall that this type of network produces an imageG P RmˆnˆK) and ak(x, y) denotes the activation in G(m = x, n = y, k = k). Equation 3.26 is the

soft-max function. It describes the probability of a pixel x, y belonging to class k. Equation 3.27 is the weighted cross entropy function. It has two main components, w(x, y) is the weight

(31)

57 2* 57 2 1 56 8* 56 8 64 28 2* 28 4 28 2* 28 2 128 28 0* 28 0 128 57 0* 57 0 64 14 0* 14 0 13 8* 13 8 13 6* 13 6 256 256 68 *6 8 66 *6 6 64 *6 4 1024 512 512 32 *3 2 30 *3 0 56 *5 6 54 *5 4 52 *5 2 512 10 4* 10 4 10 2* 10 2 10 0* 10 0 20 0* 20 0 19 8* 19 8 128 19 6* 19 6 128 39 2* 39 2 38 8* 38 8 39 0* 39 0 1024 28 *2 8 1024 512 256 256 64 64 38 8* 38 8 1 I n p u t o u t p u t 3x3 Convolution 2x2 max pool 1x1 Convolution 2x2 Up-Convolution Copy and Crop

Figure 3.12: The U-Net architecture


$$p_k(x) = \frac{\exp(a_k(x))}{\sum_{k'=1}^{K} \exp(a_{k'}(x))} \qquad (3.26)$$

$$J = \sum_{x \in \Omega} w(x)\,\log\big(p_{l(x)}(x)\big) \qquad (3.27)$$
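As an illustration of the two equations, the following NumPy sketch computes the pixelwise soft-max and the weight-map-scaled cross entropy; the array shapes and the sign convention (returning a loss to be minimized) are assumptions made for the example.

```python
# A minimal NumPy sketch of Equations 3.26 and 3.27: a pixelwise softmax
# over K feature channels followed by a weight-map-scaled cross entropy.
import numpy as np

def pixel_softmax(a):
    # a: activations of shape (m, n, K); softmax over the channel axis.
    e = np.exp(a - a.max(axis=-1, keepdims=True))  # stabilized exponentials
    return e / e.sum(axis=-1, keepdims=True)

def weighted_cross_entropy(a, labels, w):
    # labels: (m, n) integer map with values in {0, ..., K-1};
    # w: (m, n) per-pixel weight map as in Equation 3.27.
    p = pixel_softmax(a)
    m, n = labels.shape
    # Probability each pixel assigns to its true class l(x).
    p_true = p[np.arange(m)[:, None], np.arange(n)[None, :], labels]
    return -(w * np.log(p_true + 1e-12)).sum()
```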

Related Architectures

SegNet [2] by Badrinarayanan et al. is a network architecture that was designed for image segmentation, particularly in the domain of traffic images. Its design is centered around keeping memory requirements at a minimum while preserving as much boundary information as possible. Where U-Net stores entire feature maps from the encoder and concatenates them to the corresponding decoder layers, SegNet stores only the max-pooling indices. The purpose of both approaches is the same, maintaining boundary information from input to output, but the authors of SegNet argued that their approach is more memory efficient.

The DeepLab system [7], published by Chen et al., is an architecture that can produce image segmentations with highly detailed contours, and has scale invariance explicitly incorporated into the network design. Invariance to scale is a desired property in a segmentation network; an object of a given class should ideally be detectable regardless of its scale in a given image. A Neural Network can learn scale invariance, although this implies the need for a larger training set. Chen et al. took a transfer-learning approach where only the final layers of an already trained network are tuned for the task at hand. The authors performed both training and evaluation on images from the PASCAL VOC 2012 segmentation benchmark [12].


3.4 Existing Approaches to Mammogram Segmentation

The following subsection will present other published works in the domain of breast and pectoral segmentation of mammograms. Published results in these works will later serve as reference when the method presented in this thesis is evaluated.

The related works are based on the classical image processing paradigm, as well as the machine learning paradigm which has emerged in recent years. The label Classical in this particular context refers to systems centered around methods such as thresholding, morphology, region growing, and active contours.

Yapa and Harada proposed a method for segmenting breast tissue in mammograms using fast marching [43]. Fast marching is a method for determining how a closed surface evolves over time. In the two-dimensional case, the curve is described as the 0-level set of a function $\varphi$, i.e. $\varphi(x, y, t) = 0$. The curve at a given point $x, y$ evolves with a velocity $F$ in the normal direction of the curve. Equation 3.28 describes this evolution of the curve through time.

$$\varphi_t + F|\nabla \varphi| = 0 \qquad (3.28)$$

It is the velocity function $F$ that defines what constitutes the desired object boundary, in this case the skin-line.

$$F(x, y) = e^{-\alpha|\nabla(G_\sigma * I(x, y))|} \qquad (3.29)$$

$$P(x, y) = \frac{1}{\sqrt{2\pi}\,\sigma}\, e^{-\frac{(I(x,y)-\mu)^2}{2\sigma^2}} \qquad (3.30)$$

Equation 3.29 describes the first component of the velocity function used by Yapa and Harada. The expression $G_\sigma * I(x, y)$ denotes convolution with a Gaussian kernel $G_\sigma$ over the intensity function $I(x, y)$, which is defined as the grayscale intensity of the mammogram at position $x, y$.

$$F_{yh}(x, y) = F(x, y)\,P(x, y) \qquad (3.31)$$

Yapa and Harada defined their velocity function as described in Equation 3.31. Their velocity function includes the likelihood $P(x, y)$, intended to prohibit what they call boundary leakage. The parameters of the likelihood are computed from samples around the seed point; the manual placement of such a seed is required for their system to function. The contour search terminates when it hits a termination point, which is also manually specified. In terms of reported accuracy, at 97.9%, the method by Yapa and Harada is one of the best listed for breast tissue segmentation in the survey by Mustra et al. [29].
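A rough sketch of how the velocity terms in Equations 3.29-3.31 could be computed with SciPy is given below; the choices of $\alpha$ and $\sigma$, and the window from which the likelihood parameters $\mu$ and $\sigma$ are estimated, are illustrative assumptions rather than the values used by Yapa and Harada.

```python
# A minimal sketch of the velocity terms in Equations 3.29-3.31.
import numpy as np
from scipy import ndimage

def velocity_map(I, alpha=1.0, sigma=2.0, seed=None, patch=15):
    # Equation 3.29: F decays with the gradient magnitude of the
    # Gaussian-smoothed image, so the front slows near edges.
    smoothed = ndimage.gaussian_filter(I.astype(float), sigma)
    gy, gx = np.gradient(smoothed)
    F = np.exp(-alpha * np.hypot(gx, gy))

    if seed is not None:
        # Equation 3.30: a Gaussian likelihood whose parameters are
        # estimated from intensities around the manually placed seed.
        y, x = seed
        window = I[max(y - patch, 0):y + patch, max(x - patch, 0):x + patch]
        mu, s = window.mean(), window.std() + 1e-6
        P = np.exp(-(I - mu) ** 2 / (2 * s ** 2)) / (np.sqrt(2 * np.pi) * s)
        F = F * P  # Equation 3.31: the combined velocity
    return F
```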

In 2007, Sun et al. showed how domain knowledge could be utilized for segmentation of the mammogram via skin-line detection [41]. One of the difficulties in correctly identifying the skin-line in a mammogram is the low density of the tissue in the outermost regions of the breast, which causes partial transparency in the mammogram. While the skin-line resides in this low-density area, the stroma edge, another feature of the mammogram, resides behind the skin-line and is easier to detect. The authors exploit two facts: that the stroma edge is easy to detect, and that there is an anatomical dependency between the stroma edge and the breast skin-line. Their methodology relied on first finding and confirming a small part of the breast skin-line using adaptive thresholding. The remainder of the skin-line was then inferred using the anatomical dependency between the skin-line and the stroma edge.


The authors used the Polyline Distance Metric to compare their contours to the ground truth examples from the mini-MIAS database. Their results indicated that the methodology outperformed the existing state of the art. However, they also noted that the success of their method depended heavily on the quality of the initially confirmed skin-line.

Silva et al. proposed an approach to breast skin-line detection where the mammogram is scanned row by row [37]. Instead of looking for changes in grayscale intensity going from pixel to pixel, Silva et al. use the Laplacian operator to find local maxima for each row of pixels. From each row, a set of candidate pixels was extracted, each representing a local maximum in grayscale intensity for that particular row. For each set of pixels there was one which corresponded to the ground truth skin-line, and the resulting skin-line was a selection of one candidate pixel from each set. As there were many possible selections, the authors defined a cost function over the chosen set of skin-line pixels and then searched for the best solution using a dynamic programming approach. The work was evaluated against 82 images from the mini-MIAS database using the Polyline Distance Metric.

Casti et al. presented an approach to breast skin-line detection which relied on Gabor filters to segment the mammogram [6]. Their method consisted of a sequence of five steps: enhancement using an adaptive values-of-interest (VOI) transform, computation of Gabor filter responses at multiple orientations, initial breast boundary estimation using thresholding, a second round of Gabor filtering limited to the area of the approximated skin-line, and a final step in which false edge points were suppressed.

Their VOI transformation specified a remapping of the grayscale intensity values such that the points along the skin-line would appear more prominently in the mammogram. The transformed image $\tilde{I}(x, y)$ was computed from the original image $I(x, y)$ via Equation 3.32.

$$\tilde{I}(x, y) = \frac{1}{1 + \exp\left[-4\,\frac{I(x, y) - w_c}{w_w}\right]} \qquad (3.32)$$

The parameters $w_c$ and $w_w$ denote window center and window width respectively. In addition to $w_c$ and $w_w$, the Gabor filters, which were used for their edge detection capabilities, required additional parameters to be set, namely the period of the cosine modulation $\tau$ and the variances in the x- and y-directions, $\sigma_x$ and $\sigma_y$. For the orientation $\pi/2$, the Gabor kernel $g(x, y)$ is defined according to Equation 3.33.

$$g(x, y) = \frac{1}{2\pi\sigma_x\sigma_y} \exp\left[-\frac{1}{2}\left(\frac{x^2}{\sigma_x} + \frac{y^2}{\sigma_y}\right)\right] \cos\left(\frac{x}{\tau}\right) \qquad (3.33)$$
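The following sketch evaluates the VOI remapping of Equation 3.32 and the $\pi/2$ Gabor kernel of Equation 3.33 in NumPy; the kernel support and all parameter values are illustrative assumptions, not the tuned values of Casti et al.

```python
# A minimal sketch of Equations 3.32 and 3.33.
import numpy as np

def voi_transform(I, w_c, w_w):
    # Sigmoidal remapping emphasizing intensities around the window
    # center w_c within a window of width w_w (Equation 3.32).
    return 1.0 / (1.0 + np.exp(-4.0 * (I - w_c) / w_w))

def gabor_kernel(sigma_x, sigma_y, tau, half=15):
    # Equation 3.33 evaluated on a (2*half + 1)^2 grid; half is an
    # assumed support size.
    xs = np.arange(-half, half + 1)
    x, y = np.meshgrid(xs, xs)
    envelope = np.exp(-0.5 * (x**2 / sigma_x + y**2 / sigma_y))
    return envelope * np.cos(x / tau) / (2.0 * np.pi * sigma_x * sigma_y)
```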

In total, Casti et al. had to tune eight parameters for their skin-line detection algorithm. These parameters were set using a test partition of images from the mini-MIAS database. In order to further ensure unbiased results, they had two different radiologists annotate the train and test partitions of their data.

The authors demonstrated the performance of their algorithm on both scanned film (SFM) and digital (FFDM) mammograms. They measured the performance using a suite of metrics including precision, recall and Hausdorff distance. Their results were compared against those of Sun et al. and Silva et al., with the conclusion that the developed algorithm outperformed the cited methods in terms of their published Polyline Distance.

In 2017, Rampun et al. proposed a fully automated approach to mammogram segmentation for both breast and pectoral regions [33]. Their method first relies on creating a 2D model of a typical MLO-view mammogram. The actual mammogram is then made into a binary image using Otsu's thresholding [32], and the localized entropy of the binary image is computed.


The entropy image could then be compared against the initial MLO-view model in order to determine the orientation of the mammogram (left or right). From the thresholded entropy image, the authors marked the largest region as breast and gave all remaining regions the label "artifact". Multiplication of this artifact-mask and the original image removed unwanted labels and noise from the mammogram.

With noise and artifacts removed, Rampun et al. could compute an initial breast boundary estimate which could be fed to an Active Contours model. Active contours optimize a specialized energy function to evolve an initial curve toward a target contour. Along the same lines as Casti et al., who used an adaptive VOI transformation to enhance the prevalence of the skin-air boundary, Rampun et al. employed entropy filtering to the same effect. For segmentation of the pectoral region, Rampun et al. amended their breast model to include the pectoral muscle. Following a set of specified criteria, the longest detected edge in the mammogram was used as an initial estimate to parameterize a contour growing process. The segmentation method by Rampun et al. was evaluated using both digital (FFDM) and analog (SFM) mammograms; the SFM mammograms came from the mini-MIAS database.
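The first steps of this pipeline can be sketched with scikit-image as below; the entropy footprint size and the simplified masking logic are assumptions for illustration, not the exact configuration used by Rampun et al.

```python
# A minimal sketch of Otsu thresholding, localized entropy filtering,
# and largest-region artifact removal, in the style of Rampun et al.
import numpy as np
from skimage.filters import threshold_otsu
from skimage.filters.rank import entropy
from skimage.morphology import disk
from skimage.measure import label, regionprops

def breast_mask(mammogram):
    binary = mammogram > threshold_otsu(mammogram)   # Otsu's thresholding
    ent = entropy(binary.astype(np.uint8), disk(5))  # localized entropy

    # Keep only the largest connected region as "breast"; everything else
    # is treated as artifact and removed by masking the original image.
    labeled = label(binary)
    largest = max(regionprops(labeled), key=lambda r: r.area)
    mask = labeled == largest.label
    return mammogram * mask, ent
```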

Dubrovina et al. proposed a deep learning based framework for classifying four different types of tissue found in a mammogram [10]. The four classes were: pectoral muscle, fibroglandular tissue, nipple, and general breast tissue. Their dataset consisted of 40 expert-labeled digital mammograms (MLO). In order to overcome the lack of data, the authors sampled these 40 images for patches of 61 by 61 pixels. For each patch, the class label was set according to the class label of the center pixel. From the initial set of 40 images, 80 000 training examples could be created.
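A sketch of center-pixel-labeled patch extraction of this kind is given below; the uniform random sampling of patch centers is an assumption, as the exact sampling scheme of Dubrovina et al. is not detailed here.

```python
# A minimal sketch of 61x61 patch extraction where each patch is labeled
# by the class of its center pixel.
import numpy as np

def sample_patches(image, label_map, n_patches, size=61, rng=None):
    if rng is None:
        rng = np.random.default_rng()
    half = size // 2
    h, w = image.shape
    patches, labels = [], []
    for _ in range(n_patches):
        # Draw a center such that the full patch fits inside the image.
        y = rng.integers(half, h - half)
        x = rng.integers(half, w - half)
        patches.append(image[y - half:y + half + 1, x - half:x + half + 1])
        labels.append(label_map[y, x])  # class of the center pixel
    return np.stack(patches), np.array(labels)
```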

Dubrovina et al. employed a Neural Network composed of 6 convolutional layers and a dense output layer. After the network had been trained on patches, it was converted to a fully convolutional architecture using the same method described in a previous section. With a fully convolutional architecture, Dubrovina et al. [10] could produce dense predictions using a shift-and-stitch based method [25]. Their model produced a prediction in 1.8 seconds, which included a post-processing step. This post-processing step was deemed necessary due to the large amount of false positives produced by the network.

The work by Dubrovina et al. was included in Deep Learning and Convolutional Neural Networks for Medical Image Computing [5], where it was presented as state of the art in this domain of breast tissue segmentation. The book was published in 2017.


This chapter will describe the development and evaluation of a mammogram segmentation system based on a Fully Convolutional Neural Network. The term model is used throughout this chapter and refers to a particular Neural Network with a particular weight configuration, i.e. model $h_1$ is equal to $h_2$ if they are of the same architecture and for each weight $w_{h_1}$ the corresponding weight $w_{h_2}$ has the same value.

From this point onward, reference will be made to foreign and native predictions. A prediction is called native if the prediction (computing the segmentation mask for a given mammogram) is performed on a mammogram from the same data source used in training, i.e. training a model on Hologic mammograms and then predicting on another Hologic mammogram is a native prediction. Conversely, a foreign prediction is when a model trained on a particular data source is used for prediction on mammograms from sources other than those on which it had been trained.

4.1 System Overview

The model was trained on grayscale volumes of dimension $256 \times 256 \times 1$. For each input image $x^{(i)}$ there was a corresponding target, an image marking what was to be considered as ground truth. The target $y^{(i)}$ was structured such that it had the same spatial dimensions as $x^{(i)}$ and the same number of feature channels as the number of possible label assignments available for each pixel, i.e. $256 \times 256 \times 3$. As there is one channel per class in the output volume, only a single model has to be trained to segment the three classes.

The task of the model was to differentiate the three classes: breast region $B_r$, pectoral region $P_r$, and background $B_g$, for each pixel $x^{(i)}_{j,k}$. Let $\ell_1$, $\ell_2$ and $\ell_3$ denote the three feature images of some $y^{(i)} \in \mathbb{R}^{256 \times 256 \times 3}$. For any choice of $y \in Y$, $\ell_1$, $\ell_2$ and $\ell_3$ would always be completely disjoint. In other words, there were no images in the dataset in which there were overlapping classes. A reader familiar with one-hot encoding for categorical variables may find it useful to think of the target volume $y^{(i)}$ in the following way: for some fixed pair of spatial indices $j, k$, there is a one-hot encoded vector along the label-dimension.
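A target volume of this form can be constructed from an integer label mask as in the following sketch; the mapping of class indices 0, 1, 2 to background, breast and pectoral is an assumed ordering for illustration.

```python
# A minimal sketch of building a 256 x 256 x 3 one-hot target volume
# from an integer label mask.
import numpy as np

def to_one_hot(label_mask, n_classes=3):
    # label_mask: (256, 256) integer array with one class index per pixel.
    y = np.zeros(label_mask.shape + (n_classes,), dtype=np.float32)
    for k in range(n_classes):
        y[..., k] = (label_mask == k)
    # Along the label dimension each pixel now holds a one-hot vector,
    # so the three feature images are disjoint by construction.
    return y
```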
