
Adapting multiple datasets for better mammography tumor detection

TAO WANG

KTH ROYAL INSTITUTE OF TECHNOLOGY

Adapting multiple datasets for better mammography tumor detection

TAO WANG

Master in Computer Science
Date: July 3, 2018
Supervisor: Hossein Azizpour
Examiner: Mårten Björkman
Swedish title: Anpassa flera dataset för bättre mammografi-tumördetektion


Abstract


Sammanfattning


Contents

1 Introduction
  1.1 Motivation
  1.2 Research question
  1.3 Outline
  1.4 Ethical consideration
  1.5 Societal aspects
  1.6 Acknowledgements
2 Background
  2.1 Knowledge transfer
  2.2 Domain adaptation
  2.3 Semantic segmentation
  2.4 Object detection
3 Related work
  3.1 Adversarial domain adaptation
    3.1.1 Feature-level adaptation
    3.1.2 Pixel-level adaptation
  3.2 Semantic segmentation
    3.2.1 Deep semantic segmentation
    3.2.2 Deep semantic segmentation in mammography
  3.3 Deep object detection
4 Methods
  4.1 Overview
  4.2 Deep generative model
    4.2.1 Variational autoencoders
    4.2.2 Generative adversarial network
  4.3 Semantic segmentation
    4.3.1 Data Preparation
    4.3.2 Sampling
    4.3.3 U-Net
    4.3.4 Loss function
  4.4 Domain transfer
    4.4.1 Data Preparation
    4.4.2 UNIT
    4.4.3 Loss function
    4.4.4 Least squares GAN
    4.4.5 Generated image pool
    4.4.6 Segmentation loss
    4.4.7 A mask branch
  4.5 Other techniques
    4.5.1 Normalization layer
    4.5.2 Activation function
5 Experiment & Result
  5.1 Datasets
    5.1.1 CBIS-DDSM Dataset
    5.1.2 INBreast Dataset
  5.2 Evaluation metrics
    5.2.1 Pixel level metrics
    5.2.2 Instance level metrics
  5.3 U-Net for semantic segmentation
    5.3.1 Basic network
    5.3.2 Experiments on different parameters
    5.3.3 Discussion
  5.4 UNIT for domain transfer
    5.4.1 Basic network
    5.4.2 Unsupervised training UNIT
    5.4.3 Supervised training UNIT
    5.4.4 Discussion
6 Conclusion
  6.1 Future work
  A.1.1 Segmentation results where tumors are correctly located
  A.1.2 Segmentation results where tumors are correctly located with some false positive judgments
  A.1.3 Segmentation results where tumors are not correctly located

1 Introduction

1.1 Motivation

Breast cancer is the most common cancer in women worldwide and the fifth most common cause of cancer death in women. Nowadays, early diagnosis is the key to the treatment of breast tumors, but many women with breast cancer show no symptoms, making it hard to diagnose, so screening is necessary for tumor diagnosis. Mammography, an x-ray examination of the breast, is the most important screening test for breast tumors. However, screening mammography also brings a series of problems. First of all, the subsequent analysis of the mammography images requires specialized radiology knowledge, and the training costs of an experienced radiologist are enormous, leading to a significant lack of radiologists in hospitals. Secondly, mammography is not entirely accurate, so even an experienced radiologist can face overdiagnosis/underdiagnosis problems. What's more, different hospitals are equipped with different facilities, making the mammography images look different and increasing the difficulty of correct diagnosis.

A relatively cost-effective option is computer-aided diagnosis. By automatically annotating regions of interest, the workload of radiologists can be significantly reduced. With the growing popularity of deep learning methods, computer-aided diagnosis has the potential to become stronger. Our work aims to build a widely adaptable deep neural network that can work on mammography images from different facilities.


1.2 Research question

In our work, we face different mammography databases. Some of these databases are annotated by radiologists, using a binary mask to point out where the tumor is. However, other databases contain only mammography images, or are only partly annotated: only a few images in the database have a mask. Moreover, training a supervised deep localization neural network usually requires both mammography images and the annotation masks. For datasets without any annotation (binary mask) or with only partial annotation, how to correctly localize the tumor is a problem. The research questions (RQ) can be summarized as follows:

• RQ1: Can we build a localization network on a mammography dataset?

• RQ2: How can we adapt the localization network to datasets with different styles, especially when some datasets have no annotation or only partial annotation?

1.3 Outline

Chapter 2 provides a theoretical background for the techniques used in this work. We first describe knowledge transfer and domain adaptation by giving a mathematical formulation and discuss some common assumptions in domain adaptation. Then we formally define two specific families of deep generative models: variational autoencoders and generative adversarial networks. Finally, we introduce some essential tasks in computer vision: semantic segmentation and object detection.

In Chapter 3, first of all, we go through related work on adversarial domain adaptation, looking at feature-level and pixel-level approaches. Then we turn to deep semantic segmentation frameworks and their application in mammography. At last, we look at deep frameworks for object detection.

Chapter 4 presents the methods used for semantic segmentation and domain adaptation separately. Also, in the last section, we introduce some fundamental techniques used in both tasks.

Chapter 5 contains the details of the experimental setup. The dataset introduction comes first. Then we present the design of the semantic segmentation experiments, discussing different parameter setups and their influence on the final performance. The next section covers the domain adaptation experiments, where we combine domain adaptation and segmentation to see whether domain adaptation can improve the segmentation performance.

Finally, in Chapter 6, we conclude this work and discuss possible future work.

1.4 Ethical consideration

The mammography datasets explored in our work [19, 39] originate from widely used public databases. Data masking has been implemented to ensure that patient information is not leaked. Another potential ethical problem is overdiagnosis/underdiagnosis by such a computer-aided diagnosis system. We cannot avoid false positive/negative judgments in such a system, and the psychological pressure caused to a patient by an overdiagnosis/underdiagnosis is inestimable. We consider our work an auxiliary diagnosis tool for reducing the workload of radiologists; it is still the judgment of the radiologists that determines the diagnosis results.

1.5 Societal aspects

From the perspective of product application, this work could provide a new tool for auxiliary diagnosis, which will benefit many women. By applying this model, we could reduce the burden on radiologists and improve diagnostic accuracy. For people or enterprises trying to build an automatic mammography diagnosis system, this work could at least provide a solution to the multi-source training problem.


1.6 Acknowledgements

Firstly I would like to express my very great appreciation to my supervisor Hossein Azizpour for his valuable and constructive suggestions during the planning and development of this research work. His willingness to give his time so generously has been very much appreciated.


2 Background

2.1 Knowledge transfer

Machine learning has been quite successful in recent years. However, many machine learning methods are based on a common assumption: the training data and the testing data are sampled from the same distribution. When they come from different distributions, or the distribution changes, the trained model's performance will decrease dramatically. This creates a demand for transferring the knowledge learned from the original distribution to the new one. Let us first define the transfer in notation [40]:

Firstly, let us define the domain. A domain $\mathcal{D}$ contains two parts: a feature space $\mathcal{X}$ and a marginal probability distribution $P(X)$, where $X = \{x_1, x_2, x_3, ..., x_n\}$ and $x_i \in \mathcal{X}$.

Then, given a domain $\mathcal{D} = \{\mathcal{X}, P(X)\}$, a task $\mathcal{T}$ is defined as $\mathcal{T} = \{\mathcal{Y}, P(Y|X)\}$, where $\mathcal{Y}$ is the label space, $Y = \{y_1, y_2, y_3, ..., y_n\}$ and $y_i \in \mathcal{Y}$. $P(Y|X)$ is the conditional probability, which acts as an objective predictive function: given features $X$, our aim is to find the most probable labels $Y$. In practice, we need to find an approximate representation of $P(Y|X)$.

Finally, we can define knowledge transfer: given source and target domains $\mathcal{D}_S$ and $\mathcal{D}_T$, and source and target tasks $\mathcal{T}_S$ and $\mathcal{T}_T$, knowledge transfer aims to improve the performance of $P_T(Y|X)$ in $\mathcal{T}_T$ by using the knowledge learned from $\mathcal{D}_S$ and $\mathcal{T}_S$, where $\mathcal{D}_S \neq \mathcal{D}_T$ or $\mathcal{T}_S \neq \mathcal{T}_T$.

As mentioned, in knowledge transfer $\mathcal{D}_S \neq \mathcal{D}_T$ or $\mathcal{T}_S \neq \mathcal{T}_T$, so the source and target conditions can vary in the following four ways:

• $P_T(X) \neq P_S(X)$: The marginal probability distributions of the source and target domains are different; the source and target features are sampled from different distributions. This scenario is called domain adaptation, which is the main focus of our paper.

• $\mathcal{Y}_T \neq \mathcal{Y}_S$: The label spaces are different. This case is called transfer learning; for example, in deep learning, the fine-tuning method is based on this case.

• $\mathcal{X}_S \neq \mathcal{X}_T$: The target and source domains have different feature spaces; for example, two documents written in different languages have different feature spaces.

• $P_T(Y|X) \neq P_S(Y|X)$: The conditional probability distributions of the source and target are different.

In the next section, we will define the domain adaptation in detail and introduce some important concepts related to domain adaptation problem.

2.2 Domain adaptation

Domain adaptation is one of the sub-problems of knowledge transfer. Domain adaptation tries to build a model which is suitable for both source and target domains. The notational definition is: given source and target domains $\mathcal{D}_S$ and $\mathcal{D}_T$, and source and target tasks $\mathcal{T}_S$ and $\mathcal{T}_T$, domain adaptation aims to improve the performance of $P_T(Y|X)$ in $\mathcal{T}_T$ by using the knowledge learned from $\mathcal{D}_S$ and $\mathcal{T}_S$, where $P_S(X) \neq P_T(X)$ and $P_S(Y|X) \approx P_T(Y|X)$.

Covariate shift

Shimodaira et al. [48] first proposed the concept of covariate shift, which is an important concept in domain adaptation. As we already know, the source and target tasks' conditional probability distributions are the same; the difference between the two domains is called covariate shift. Shimodaira et al. [48] define misspecified models to indicate the influence of covariate shift on model training. Although $P_S(Y|X) \approx P_T(Y|X)$ sounds compelling, in the real world we typically fit a parametric model $P(Y|X, \theta)$ and choose $\theta$ to minimize the expected classification error. It is hard to find a parameter $\theta^*$ for which $P(Y|X, \theta^*) = P(Y|X)$ for all $x \in \mathcal{X}$, so the model $P(Y|X, \theta^*)$ we find is called a misspecified model. The model's parameter $\theta^*$ depends on $P(X)$. The difference between $P_S(X)$ and $P_T(X)$ will therefore lead to a difference between the model trained on the source domain and the one trained on the target domain.

2.3 Semantic segmentation

Semantic segmentation aims to understand an image at the pixel level: we want to assign each pixel in the image an object class. Figure 2.1 (c) is an example of semantic segmentation.

Figure 2.1: Examples of different kinds of computer vision tasks [31]. (a) An example of image classification, pointing out the objects contained in the image. (b) An example of object detection, using bounding boxes to locate different kinds of objects. (c) An example of semantic segmentation, making a classification at the pixel level. (d) An example of instance segmentation, which not only classifies at the pixel level but also marks out different instances.

Different from image classification, we need dense pixel-wise predictions from our models. Here is a formulation of semantic segmentation: given an image $X \in \mathbb{R}^{w \times h \times c}$, where $w$ is the image width, $h$ is the image height, and $c$ is the number of image channels, semantic segmentation divides the set $X$ into several non-empty subsets $\{X_1, X_2, ..., X_N\}$ with $\cup_{i=1}^{N} X_i = X$ and $X_i \cap X_j = \emptyset$ ($i \neq j$).

In our mammogram case, semantic segmentation is used to distinguish tumor and non-tumor areas. We divide an image into two parts, making the segmentation result a binary mask.

2.4 Object detection

Object detection is the process of finding target objects; usually we draw bounding boxes around them. Figure 2.1 (b) is an example of object detection. To formulate this kind of problem, let us first define an image $X \in \mathbb{R}^{w \times h \times c}$, where $w$ is the image width, $h$ is the image height, and $c$ is the number of image channels, and target objects $\{X_T^{(1)}, X_T^{(2)}, ..., X_T^{(M)}\}$. The aim of object detection is to extract several subsets $\{X_O^{(1)}, X_O^{(2)}, ..., X_O^{(N)}\}$ from $X$ such that for every $X_T^{(i)}$ there exists an $X_O^{(j)}$ with $X_T^{(i)} \subseteq X_O^{(j)}$. Ideally, a perfect object detection will make $M = N$ and $X_T^{(i)} = X_O^{(j)}$.


3 Related work

This chapter reviews recent works in adversarial domain adaptation and introduces some recent works in deep semantic segmentation.

3.1 Adversarial domain adaptation

The advent of GAN [15] has provided domain adaptation with a new direction: adversarial domain adaptation. Hu et al.'s work [21] indicates that GAN is a particular case of adversarial domain adaptation with a degenerate source domain. In the following part, we will show some recent studies of adversarial domain adaptation.

3.1.1 Feature-level adaptation

In 2015, [9, 10] proposed a model for feature-level domain adaptation using the adversarial method [15]: the Domain-Adversarial Neural Network (DANN), which can be viewed as the birth of adversarial domain adaptation.

DANN This model contains three parts: a feature extractor used to extract features from the input images, a label classifier used to predict the class label, and a domain classifier used to classify the domain label. On the one hand, the domain classifier's loss needs to be minimized so that it correctly distinguishes the source and target domains. On the other hand, to find domain-invariant features, the feature extractor needs to maximize that same loss. By formulating such a min-max game, DANN finds a common feature space between the source and target.


Several variants based on DANN have since appeared. Tzeng et al. [54] consider the similarity between classes and propose a soft label loss. They also use untied weight mappings in the feature extractor, which reduces the difficulty of optimization [53].

Usually, domain adaptation aims to find a mapping function from one domain to another, or a function to extract invariant features from each domain. Bousmalis et al. [3] suggest that finding an invariant representation is not enough, since it ignores the individual characteristics of each domain, and they propose the Domain Separation Network (DSN).

DSN This network not only extracts the domain-invariant features but also extracts the identity features of the source and target domains, which can be viewed as low-level features. Moreover, from the authors' point of view, the invariant feature subspace should be orthogonal to the identity subspace. The paper also adopts the MMD method [16] and the gradient reversal layer [9, 10] to train the invariant feature extractor. However, all these approaches operate on features and thus do not enforce any semantic consistency [20].

3.1.2 Pixel-level adaptation

Pixel-level domain adaptation is usually combined with image generation or style transfer. The task of pixel-level adaptation does not only focus on a correct classification but also on generating a picture with a clear semantic meaning. There have been several works on pixel-level adaptation in recent years.

DRCN The Deep Reconstruction-Classification Network [12] (DRCN) is a network structure used in domain adaptation. DRCN is made of a traditional classification pipeline (a feature extractor and a feature classifier) and a reconstruction pipeline (from extracted features back to the image). DRCN is trained by minimizing the reconstruction error and the classification error on the source domain. The authors argue that once these two errors are minimized, the extracted features form an adaptive representation for both the source and target domains. To improve the performance and generalization ability of DRCN, the authors adopt data augmentation and denoising autoencoders.


PixelDA Bousmalis et al., using GAN, propose a network called PixelDA. Besides the domain loss and task loss used in some feature-level domain adaptation methods [9, 10, 53], they also propose a content-similarity loss at the pixel level. Different from style transfer, PixelDA aims to learn the style of the whole source domain. By learning the style of the source at the pixel level, the model can be decoupled from the task-specific architecture and has higher training stability.

These previous methods mainly focus on classification tasks, which does not suit our task: we want to generate style-transferred images while keeping the semantic meaning. Noticing that PixelDA takes its inspiration from style transfer, we turn to some recent studies on domain transfer.

Generative Adversarial Networks (GAN) [15] have shown impressive performance on image generation in recent years. Can we take advantage of GAN for style transfer? The following are some recent studies on pixel-level domain transfer using GAN. Pix2pix is one of the important domain transfer networks in recent years.

pix2pix Isola et al. [23] study the image-translation problem. They build an image-to-image translation framework (pix2pix) based on conditional adversarial networks [37]. They add an L1 term to the loss function, making the generator's task not only to fool the discriminator but also to be near the ground-truth output in an L1 sense; the authors argue that compared with an L2 term, L1 encourages less blurring. Different from the traditional encoder-decoder structure, the generator is based on U-Net [45], an encoder-decoder structure with skip connections between mirrored layers in the encoder and decoder stacks. In the discriminator, they focus on both high- and low-level correctness: the authors propose a PatchGAN method, splitting the image into several N × N patches, and the final discriminator result is the average of the patch results.


CycleGAN Zhu et al. [60] use a cycle-consistent adversarial network (CycleGAN) for unpaired image translation at the pixel level. When doing adversarial domain adaptation, it is usually hard to find the corresponding low-level relationships. CycleGAN adopts another generator, which tries to convert the generated picture back into the source picture, and adds such a cycle-consistency loss to the loss function. Similarly, DiscoGAN [26] and DualGAN [56] are based on the same method. The differences are that DualGAN changes the traditional GAN into a Wasserstein GAN [1] to improve stability, and that DiscoGAN chooses L2 for cycle consistency while DualGAN and CycleGAN choose L1.

Hoffman et al. [20] make an improvement named cycle-consistent adversarial domain adaptation (CyCADA) based on [60].

CyCADA The authors suggest that there are three kinds of losses during domain adaptation: pixel loss, feature loss and semantic loss. They also adopt cycle consistency like [60] to learn an invertible mapping function. Different from style transfer [11, 24], they define a semantic loss instead of a content loss. The content loss is calculated from pixel to pixel, while the semantic loss in this paper is represented by the sum of the classification losses on both the fake picture and the inverted picture; the authors argue that these two representations are analogous.

From Taigman et al.'s view, when a GAN reaches a balance, the generator has learned knowledge from the source domain and its generated results are indistinguishable for any discriminator, which means the GAN has learned an invariant representation of the source domain. Based on this theory, they build a network called the domain transfer network [52] (DTN).


Coupled Generative Adversarial Networks [33] (CoGAN) is a variant of the GAN network. By connecting two GAN networks, it can be used to learn the joint distribution of multi-domain images.

CoGAN The method of CoGAN is simple: two GANs are trained at the same time, with some layers shared between them. In the generators, the first several layers share their weights, and in the discriminators, the last several layers share their weights. From the paper's view, CoGAN can extract invariant features, for example object contours, through the first shared layers, while the private layers capture detailed information such as texture. The final loss function is the combination of these identical GANs' loss functions, and the training method is the same as DCGAN's [41].

Apart from generative adversarial networks, encoder-decoder structures are also used for generating images.

PixelDT Yoo et al. [57] study the domain transfer problem. They try to achieve a domain transfer of clothes on the semantic level by building a three-part network. The first part is called the converter, which is built from an encoder and a decoder: the encoder first extracts the low-level semantic information of a source picture, and the decoder decodes this information into another picture. The encoder is a convolutional neural network, and the second and third parts of the network are used to train the decoder. The second part is a real/fake discriminator, and the third part is a domain discriminator. The decoder tries to confuse the real/fake discriminator, making its results look natural, and it tries to confuse the domain discriminator, making its results appear to come from the target domain.

Now, can we combine the encoder-decoder structure and the GAN structure to achieve better performance? The answer is yes; UNIT is one choice.


UNIT In UNIT [32], each domain is handled by a VAE whose encoder maps images into a shared latent space, and a KL divergence term is used to penalize the deviation of the distribution of the latent code from the prior distribution. Then, similar to CoGAN, for each domain they use a partly weight-shared generator and an identical discriminator. To make the generators learn the domain-specific features, they use a cycle-consistency loss, comparing the generated pictures with the original pictures. The final method is therefore to use a VAE to get the latent code and CoGAN to re-generate the pictures from the latent code; one can simply change the combination of generator and discriminator to achieve a domain transfer. A similar structure and method are also used in XGAN [46]. Compared with DANN, DTN and CoGAN, UNIT achieves the best performance and is the state-of-the-art two-domain transfer network.

However, there is still a big problem with CycleGAN and its variants (those using a cycle-consistency loss): they are only suitable for two-domain problems. When facing a multi-domain problem, building a model for every pair of domains would cost a lot. Here comes StarGAN.

StarGAN Choi et al. [6] find an approach to solve the multi-domain problem: StarGAN. The method behind StarGAN is simple: instead of finding a mapping function between every two domains, StarGAN tries to find a mapping function between each domain and a common domain. They also use a mask vector to build a uniform domain label space. To stabilize the training process and generate higher quality images, the authors choose Wasserstein GAN [1] instead of GAN.

3.2 Semantic segmentation

Semantic segmentation is a classical computer vision problem. With the development of deep learning technology, semantic segmentation can achieve more accurate results. In the following part, we will show some recent studies of deep semantic segmentation.

3.2.1 Deep semantic segmentation

FCN The Fully Convolutional Network (FCN) replaces the fully connected layers of a classification network with convolution layers. It also uses an upsampling method to bring these convolution layers back to the original size. Moreover, a skip architecture is used in FCN to refine the result.

SegNet [2] is a network based on FCN, first designed to solve the segmentation problem in autonomous driving.

SegNet The structure of SegNet is also an encoder-decoder structure; the encoder part is a pre-trained VGG-16 [50] network. The difference between FCN and SegNet is the way SegNet does the up-sampling. In SegNet, the pooling indices are stored during pooling, and the decoder up-samples based on these pooling indices, while in FCN the up-sampling method learns a deconvolution function and sums the up-sampled result with the encoded feature map.

FCN relies on millions of training images, which is unsuitable for a biomedical task. Ronneberger et al. [45] build on FCN and propose a network called U-Net.

U-Net The structure of U-Net is similar to FCN, but unlike FCN, U-Net does not use a pre-trained CNN model [50], because U-Net is usually used for binary classification. In the lower-level feature fusion part, U-Net concatenates the original features with the up-sampled features. Moreover, the authors mention that they use data augmentation to address the limited training data problem.

By combining the deep learning method with some traditional computer vision algorithms, the segmentation results can be extremely accurate.


Another segmentation method, Mask R-CNN, combines R-CNN and FCN.

Mask R-CNN Mask R-CNN [17] is a network used to solve the instance segmentation task. It is a combination of Faster R-CNN and FCN. We will introduce Faster R-CNN in the next section. Faster R-CNN [43] first gets a series of regions of interest from a region proposal network. Based on these regions of interest, Faster R-CNN finds bounding boxes, while Mask R-CNN opens a new branch to generate the mask of the image. To avoid misalignment in ROI pooling, the paper proposes a bilinear interpolation method, ROIAlign, based on Fast R-CNN's ROI pooling technique [13].

3.2.2 Deep semantic segmentation in mammography

Moor et al. [38] use U-Net for mammography segmentation, but different from the original U-Net structure, they double the convolution part and decrease the number of channels in the network to avoid over-fitting. As U-Net is agnostic to the size of the input, they train a patch-level U-Net and test on whole images. From the results, we find that Moor et al. achieve a relatively high sensitivity but the precision is poor.

Zhang et al. [59] did a similar job to Moor et al. The difference is that they train and test on patch-level images. During training, they emphasize the importance of the negative patches: the U-Net should not only learn from positive patches but also from patches which do not contain any tumor. However, a mini-batch without any positive patch will lead the network to not converge, so they make each mini-batch contain at least one positive patch.

3.3 Deep object detection

R-CNN follows a four-step pipeline: (1) using selective search to generate about 2000 region proposals; (2) using AlexNet [29] to extract the features of each region; (3) using an SVM to classify each region; (4) using non-maximum suppression (NMS) to refine the proposed regions during testing. There are many tricks when implementing R-CNN: to fit the feature extractor, R-CNN needs to re-size the regions to a fixed size, and during training the authors also mention a bounding box regression method to further refine the predicted bounding boxes. However, the speed of R-CNN is far from acceptable; during testing, each image takes more than 10 seconds, so many faster algorithms based on R-CNN have followed.

Fast R-CNN During R-CNN training, when extracting the feature vector for each region, R-CNN needs to re-size all the regions to the same size and put them into AlexNet one by one, which costs a huge amount of time. Fast R-CNN [13] uses fully connected layers together with a multi-task loss instead of an SVM; in this multi-task loss, Fast R-CNN combines the classification loss and the bounding box regression loss. Since the region sizes differ, Fast R-CNN uses the method in [18] and introduces ROI pooling to get a fixed-length feature from regions of different sizes.

Compared with R-CNN, Fast R-CNN combines feature extraction and classification, significantly improving the testing speed, but the selective search still wastes much time; that is what Faster R-CNN mainly aims to solve.

Faster R-CNN Faster R-CNN [43] can simply be viewed as a combination of a region proposal network and Fast R-CNN. How to use a deep network to replace the selective search is the main point of this paper. First of all, Faster R-CNN uses a CNN together with one convolution layer and one ReLU layer to directly extract a group of feature maps; the size of the feature map in this paper is 51 × 39 × 256. For each position in this feature map, the authors consider nine anchors of different sizes, and for each anchor, Faster R-CNN predicts the probability of being foreground or background. A bounding box regression is done in parallel. The resulting box is a coarse bounding box with a similar function to selective search. Then the region proposal network provides a series of regions of interest, and a Fast R-CNN is used to do the object detection.


4 Methods

4.1 Overview

At the beginning of this chapter, we introduce two deep generative models: variational autoencoders (VAE) and generative adversarial networks (GAN). Then we divide our task into two parts:

• Firstly, we deal with a semantic segmentation task and try to build a perfect segmentation network on a single dataset.

• Secondly, as we need to adapt this segmentation method to multiple datasets, we choose a domain adaptation method: by learning the styles of the different datasets, we can do a style transfer operation, making all the images have a similar style. We then test the transferred images using the segmentation network trained in the first task and compare with the segmentation results on the images without any style transfer.

4.2 Deep generative model

The goal of a generative model is to find a function that approximates the distribution of the original data. If we use $f(X; \Theta)$ to represent such a function, finding the parameters $\Theta$ becomes a maximum likelihood estimation process. The problem is that when the data distribution is complex, our $f$ will also be complex, and a deep neural network can be used to represent such a complex function. There are two successful frameworks used to build a generative model: variational autoencoders (VAE) and generative adversarial networks (GAN).

4.2.1 Variational autoencoders

Variational Autoencoders (VAE) [27, 8] are based on an encoder-decoder structure. Usually we have observed data $x$, for example an image, and this observed data is generated from a latent code $z$. Training an encoder amounts to finding $q_\phi(z|x)$ and training a decoder amounts to finding $p_\theta(x|z)$. The training goal of the VAE is to maximize the following likelihood function:

$$\log p_\theta(x^{(1)}, x^{(2)}, ..., x^{(N)}) = \sum_{i=1}^{N} \log p_\theta(x^{(i)}) \tag{4.1}$$

Also, notice that

$$p_\theta(z|x^{(i)}) = \frac{p_\theta(x^{(i)}|z)\, p_\theta(z)}{p_\theta(x^{(i)})} \tag{4.2}$$

The true posterior $p_\theta(z|x^{(i)})$ is intractable, so we use $q_\phi(z|x^{(i)})$ to approximate it. To estimate the similarity between these two distributions, we use the Kullback-Leibler divergence:

$$\begin{aligned}
KL(q_\phi(z|x^{(i)})\,||\,p_\theta(z|x^{(i)})) &= \mathbb{E}_{q_\phi(z|x^{(i)})} \log \frac{q_\phi(z|x^{(i)})}{p_\theta(z|x^{(i)})} \\
&= \mathbb{E}_{q_\phi(z|x^{(i)})} \log \frac{q_\phi(z|x^{(i)})\, p_\theta(x^{(i)})}{p_\theta(z, x^{(i)})} \\
&= \mathbb{E}_{q_\phi(z|x^{(i)})} \log \frac{q_\phi(z|x^{(i)})}{p_\theta(z, x^{(i)})} + \log p_\theta(x^{(i)}) \\
&= \mathbb{E}_{q_\phi(z|x^{(i)})} \log \frac{q_\phi(z|x^{(i)})}{p_\theta(z)} - \mathbb{E}_{q_\phi(z|x^{(i)})} \log p_\theta(x^{(i)}|z) + \log p_\theta(x^{(i)}) \\
&= KL(q_\phi(z|x^{(i)})\,||\,p_\theta(z)) - \mathbb{E}_{q_\phi(z|x^{(i)})} \log p_\theta(x^{(i)}|z) + \log p_\theta(x^{(i)})
\end{aligned} \tag{4.3}$$

Rearranging, we get a basic equation of the VAE:

$$\log p_\theta(x^{(i)}) = \mathbb{E}_{q_\phi(z|x^{(i)})} \log p_\theta(x^{(i)}|z) + KL(q_\phi(z|x^{(i)})\,||\,p_\theta(z|x^{(i)})) - KL(q_\phi(z|x^{(i)})\,||\,p_\theta(z)) \tag{4.4}$$

$KL(q_\phi(z|x^{(i)})\,||\,p_\theta(z|x^{(i)}))$ is non-negative, and it is zero when $q_\phi(z|x^{(i)})$ and $p_\theta(z|x^{(i)})$ are equal, so we can turn to optimizing

$$\sum_{i=1}^{N} \Big[ \mathbb{E}_{q_\phi(z|x^{(i)})} \log p_\theta(x^{(i)}|z) - KL(q_\phi(z|x^{(i)})\,||\,p_\theta(z)) \Big] \tag{4.5}$$

instead of $\sum_{i=1}^{N} \log p_\theta(x^{(i)})$. Equation (4.5) is also known as the evidence lower bound objective (ELBO); maximizing $\sum_{i=1}^{N} \log p_\theta(x^{(i)})$ turns into maximizing the ELBO.

The first term in Equation (4.5) can be optimized by stochastic gradient descent using mini-batch training samples; it can be viewed as a reconstruction error, implemented as an L2 loss when we assume $p_\theta(x|z)$ is a normal distribution. In the second term, the prior $p_\theta(z)$ is a normal distribution $N(0, I)$, and the encoder $q_\phi(z|x)$ outputs the mean and variance of the approximate posterior. The training process can also be carried out with the backpropagation algorithm.

Variational autoencoders are quite efficient at generating samples, but usually tend to result in blurry images [8] because of the pixel-wise reconstruction losses.
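As a concrete illustration of Equation (4.5), the following is a minimal NumPy sketch of the negative ELBO for a single image, assuming a Gaussian decoder (so the reconstruction term reduces to an L2 loss) and a diagonal Gaussian encoder. The function and variable names are illustrative, not taken from the thesis implementation.

```python
import numpy as np

def vae_loss(x, x_recon, mu, log_var):
    """Negative ELBO for a VAE with a diagonal Gaussian encoder q(z|x)
    and a unit Gaussian prior p(z) = N(0, I).

    x, x_recon : arrays of the same shape (original and reconstructed image)
    mu, log_var: encoder outputs, the mean and log-variance of q(z|x)
    """
    # Reconstruction term: an L2 loss, corresponding to a Gaussian p(x|z)
    recon = np.sum((x - x_recon) ** 2)
    # KL(q(z|x) || N(0, I)) in closed form for diagonal Gaussians
    kl = -0.5 * np.sum(1.0 + log_var - mu ** 2 - np.exp(log_var))
    return recon + kl
```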

4.2.2 Generative adversarial network


Figure 4.1: GAN’s structure used in computer vision [25]. Generative model is used to simulate the real world data’s distribu-tion, and the discriminative model is a binary classifier, used to classify the fake and the real data. A generator receives a random noise and generates a fake picture, and a discriminator tries to judge whether this picture is fake or not. The goal of the generator is to confuse the dis-criminator, while the discriminator aims to resist this confusion. Equa-tion (4.6) is the representaEqua-tion of this min-max game:

min

G maxD V (D, G) = Ex⇠Pdata[log(D(x)] +Ez⇠Pzlog(1 D(G(z)) (4.6)

In this function, Pdata is the real data’s distribution, and the Pz is the

fake distribution simulated by a generator. z is the input noise, or we could say it is a sample from the latent space Z. Usually, the training method for GAN is mini-batch gradient descent. The loss function for each m size batch on discriminator D and generator G is:

LD = 1 m m X i=1

[ log(D(x(i)) log[1 (D(G(z(i)))] (4.7)

LG = 1 m m X i=1 log[1 (D(G(z(i)))] (4.8) However, the GAN method still faces some practical problem: • Non-convergence. GAN has a good performance on the Nash

(33)

• Mode collapse problem [47]. GAN is a min-max game, so there does not exist a strict loss function. It is hard to distinguish whether the generator gets improved or not. Sometimes the gen-erator will always generate the same point or points from a sin-gle mode, making the training stuck at a local minimum, then we call it mode collapse.
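For reference, the mini-batch losses of Equations (4.7) and (4.8) can be written down directly; this is a small NumPy sketch with illustrative names (the thesis does not provide an implementation here).

```python
import numpy as np

def gan_losses(d_real, d_fake, eps=1e-8):
    """Mini-batch GAN losses from Equations (4.7) and (4.8).

    d_real: discriminator outputs D(x) on real images, values in (0, 1)
    d_fake: discriminator outputs D(G(z)) on generated images
    """
    loss_d = -np.mean(np.log(d_real + eps) + np.log(1.0 - d_fake + eps))
    loss_g = np.mean(np.log(1.0 - d_fake + eps))  # minimized by the generator
    return loss_d, loss_g
```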

4.3 Semantic segmentation

4.3.1 Data Preparation

Plentiful high-quality data is the key to building great machine learning models. However, in our task we only have small and limited datasets, making augmentation necessary in our project. Data augmentation is a technique widely used in image pre-processing when the dataset is too small to support a deep learning task. Smart approaches to programmatic data augmentation can increase the size of the training set 10-fold or more. Even better, applying augmentation techniques makes our model more robust and helps prevent overfitting. There are many approaches to augmenting the data; in our task, we mainly adopt the following operations (a small sketch of them is given after the list).

• Random Rotation We rotate the image by a random angle and pad the rest of the picture with 0. We also adopt different interpolation methods for the training images and the label masks: bilinear interpolation for the training images and nearest neighbor interpolation for the label masks.

• Flipping Each mammography image has a left or right heading. To unify the dataset, we flip all the images horizontally and use them together with the original dataset. Notice that the combination of random rotation and horizontal flipping makes vertical flipping unnecessary.

• Random Brightness and Contrast We adjust the image's brightness and contrast with a random ratio.
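A hedged sketch of these augmentation operations, using SciPy's ndimage.rotate so that the image and the mask get different interpolation orders. The rotation angle range, the brightness/contrast ranges, and the assumption that intensities lie in [0, 1] are illustrative, not values taken from the thesis.

```python
import numpy as np
from scipy import ndimage

def augment(image, mask, rng):
    """One example augmentation pass; image is a 2-D float mammogram in [0, 1],
    mask is the 2-D binary tumor annotation."""
    # Random rotation: bilinear interpolation (order=1) for the image,
    # nearest neighbour (order=0) for the label mask, zero padding elsewhere.
    angle = rng.uniform(-30, 30)          # rotation range is an assumption
    image = ndimage.rotate(image, angle, reshape=False, order=1,
                           mode='constant', cval=0.0)
    mask = ndimage.rotate(mask, angle, reshape=False, order=0,
                          mode='constant', cval=0)

    # Horizontal flipping to unify left/right-heading breasts.
    if rng.random() < 0.5:
        image, mask = image[:, ::-1], mask[:, ::-1]

    # Random brightness and contrast (applied to the image only).
    contrast = rng.uniform(0.8, 1.2)      # illustrative ranges
    brightness = rng.uniform(-0.05, 0.05)
    image = np.clip(image * contrast + brightness, 0.0, 1.0)
    return image, mask

# usage: img_aug, msk_aug = augment(img, msk, np.random.default_rng(0))
```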

4.3.2 Sampling

As we have mentioned before, because of limited computing resources it is impossible to feed a whole image into the network during training, so we sample patches from the image to train the network. In semantic segmentation we classify each pixel as tumor tissue or not, and in the training set the ratio between positive and negative pixels is extremely unbalanced. Since a CNN is sensitive to the ratio of positive and negative samples, our sampling is also responsible for balancing them.

Here is a detailed description of how we implement the sampling. First of all, let us define some terminology.

• Positive Pixel: Our semantic segmentation aims to classify tumor and non-tumor pixels. A positive pixel is a pixel which belongs to tumor tissue.

• Negative Pixel: A negative pixel is a pixel which belongs to non-tumor tissue.

• Positive Patch: When we crop patches from the original image, a patch which satisfies one of the following rules is regarded as a positive patch: (1) the patch covers over X% of the total tumor area; (2) more than Y% of the pixels in the patch are positive pixels. In our experiments, X is 50 and Y is 2.

• Negative Patch: If all the pixels in a patch are negative pixels, then this patch is a negative patch.

We sample positive patches and negative patches from the original dataset first and then combine them into a mini-batch at a fixed ratio, as sketched below.
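The patch rules above can be expressed compactly. The sketch below is an assumed implementation: the handling of patches that satisfy neither rule ('ignored') and the helper names are ours, not the thesis code.

```python
import numpy as np

def is_positive_patch(mask_patch, tumor_area, x_pct=50, y_pct=2):
    """Positive-patch rule: the patch covers more than X% of the total tumor
    area, or more than Y% of its pixels are positive."""
    covered = mask_patch.sum()
    return (covered > x_pct / 100.0 * tumor_area) or \
           (covered > y_pct / 100.0 * mask_patch.size)

def sample_patch(image, mask, size, rng):
    """Crop one random patch and classify it as positive/negative/ignored."""
    h, w = image.shape
    top = rng.integers(0, h - size)
    left = rng.integers(0, w - size)
    img_p = image[top:top + size, left:left + size]
    msk_p = mask[top:top + size, left:left + size]
    if is_positive_patch(msk_p, mask.sum()):
        label = 'positive'
    elif msk_p.sum() == 0:
        label = 'negative'          # all pixels are negative
    else:
        label = 'ignored'           # neither rule applies (assumed behaviour)
    return img_p, msk_p, label
```

Positive and negative patches sampled this way can then be combined into each mini-batch at the fixed ratio described above.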

4.3.3 U-Net

U-Net [45] is designed for biomedical image segmentation and does not need a huge amount of training data, which makes it quite suitable for building a segmentation network on a single dataset. A U-like architecture is the reason why it is called U-Net. Also, the combination of the convolution and deconvolution results makes U-Net a skip architecture. Because the original mammography images have a high resolution, it is impossible to directly train the network at the original size, so we design a patch-level U-Net. In the following part, we describe the network architecture in detail.

Network architecture

Figure 4.2 is a graph including the critical parts of U-Net:

Figure 4.2: Basic architecture of U-Net.

U-Net concatenates the encoder features with the up-sampled features instead of summing them, which preserves more low-level properties compared with FCN. By four up-sampling steps, we get back to the input dimensions. Finally, a 1 × 1 convolution layer together with a sigmoid activation function is applied to obtain a probability map for each class. Each value in the probability map represents the probability of the corresponding pixel belonging to that class. In our case, we need a binary classification, so a fixed threshold is used to convert the probability map into a binary mask.

By using such a fully convolutional network, we get an end-to-end segmentation network: the input image and the output mask have the same dimensions.
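To make the architecture concrete, here is a compact PyTorch sketch of a patch-level U-Net with four down-sampling and four up-sampling steps, concatenation skip connections, and a final 1 × 1 convolution with a sigmoid. The channel counts, batch normalization, and PyTorch itself are illustrative assumptions; this does not reproduce the exact configuration of Figure 4.2 (the thesis pipeline appears to be TensorFlow-based).

```python
import torch
import torch.nn as nn

def conv_block(in_ch, out_ch):
    # two 3x3 convolutions, each followed by batch normalization and ReLU
    return nn.Sequential(
        nn.Conv2d(in_ch, out_ch, 3, padding=1), nn.BatchNorm2d(out_ch), nn.ReLU(inplace=True),
        nn.Conv2d(out_ch, out_ch, 3, padding=1), nn.BatchNorm2d(out_ch), nn.ReLU(inplace=True),
    )

class SmallUNet(nn.Module):
    """Patch-level U-Net: 4 down-sampling steps, 4 up-sampling steps,
    concatenation skip connections, 1x1 convolution + sigmoid at the end."""
    def __init__(self, in_ch=1, base=32):
        super().__init__()
        chs = [base, base * 2, base * 4, base * 8]
        self.enc = nn.ModuleList()
        prev = in_ch
        for c in chs:
            self.enc.append(conv_block(prev, c))
            prev = c
        self.pool = nn.MaxPool2d(2)
        self.bottleneck = conv_block(chs[-1], chs[-1] * 2)
        self.up, self.dec = nn.ModuleList(), nn.ModuleList()
        prev = chs[-1] * 2
        for c in reversed(chs):
            self.up.append(nn.ConvTranspose2d(prev, c, 2, stride=2))
            self.dec.append(conv_block(c * 2, c))
            prev = c
        self.head = nn.Conv2d(chs[0], 1, 1)   # 1x1 convolution -> probability map

    def forward(self, x):
        skips = []
        for block in self.enc:
            x = block(x)
            skips.append(x)
            x = self.pool(x)
        x = self.bottleneck(x)
        for up, dec, skip in zip(self.up, self.dec, reversed(skips)):
            x = up(x)
            x = dec(torch.cat([x, skip], dim=1))
        return torch.sigmoid(self.head(x))    # per-pixel tumor probability

# e.g. SmallUNet()(torch.zeros(1, 1, 128, 128)) has shape (1, 1, 128, 128)
```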

4.3.4 Loss function

The original design of U-Net [45] uses a 1 × 1 convolution layer at the end together with a binary cross entropy loss function. Cross entropy is a metric between two given distributions over a discrete variable $x$, where $q(x)$ is the estimate for the true distribution $p(x)$. The formula is given by

$$H(p, q) = -\sum_{\forall x} p(x) \log(q(x)) \tag{4.9}$$

In a neural network, the true distribution for the variable $x$ is actually the label $y$, and the estimated distribution is the predicted label $\hat{y}$. The cross entropy loss can be presented as

$$J = -\frac{1}{N} \sum_{i=1}^{N} y_i \cdot \log(\hat{y}_i) \tag{4.10}$$

where $\cdot$ is the vector dot product and, in our study, $\hat{y}$ is the output of a logistic function, representing the probability of being a positive or negative pixel.

The second loss function we use is given in Equation (4.11), where $\epsilon$ is used to avoid the denominator or numerator becoming 0 and is set to 1e-5 in our experiments. We define $R = \{r_1, ..., r_n\}$ as the ground truth foreground segmentation over $N$ images and $P = \{p_1, ..., p_n\}$ as the predicted probabilistic map for the foreground label over $N$ images. The background label probabilistic map is then $1 - P = \{(1 - p_1), ..., (1 - p_n)\}$, and the background label segmentation is $1 - R = \{(1 - r_1), ..., (1 - r_n)\}$.
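The two losses can be sketched in NumPy as follows. The binary cross entropy follows Equation (4.10) in its standard binary form; since Equation (4.11) itself is not reproduced above, the overlap loss below uses a common soft Dice formulation purely as an assumed stand-in.

```python
import numpy as np

def bce_loss(y, y_hat, eps=1e-5):
    """Binary cross entropy (Equation (4.10), standard binary form).
    y is the binary label mask, y_hat the predicted probability map."""
    return -np.mean(y * np.log(y_hat + eps) + (1 - y) * np.log(1 - y_hat + eps))

def soft_dice_loss(r, p, eps=1e-5):
    """A common soft Dice formulation (an assumption; the thesis' exact
    Equation (4.11) is not reproduced here). r is the ground-truth foreground
    mask, p the predicted foreground probability map."""
    inter = np.sum(r * p)
    return 1.0 - (2.0 * inter + eps) / (np.sum(r) + np.sum(p) + eps)
```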

4.4 Domain transfer

4.4.1 Data Preparation

Similar augmentation tricks to those in Section 4.3.1 are also applied here, but we feed a whole augmented image into the domain transfer network without random sampling.

4.4.2 UNIT

From the related work study, the Unsupervised Image-to-Image Translation network [32], also known as UNIT, shows impressive performance on domain adaptation and image style transfer; it is the state-of-the-art unsupervised domain transfer network. UNIT is based on coupled GAN [33]. The network contains two domain image encoders $E_1$, $E_2$, two domain image generators $G_1$, $G_2$, and two domain adversarial discriminators $D_1$, $D_2$. The two encoders share their last layers and the two generators share their first layers, so both domains are mapped into a shared latent space.

Figure 4.3: Basic architecture of UNIT. UNIT contains 2 groups of VAE-GAN. $x_1$ and $x_2$ are the input images from different domains/datasets. Encoder 1 and encoder 2 share part of their layers, so $x_1$ and $x_2$ can be encoded into a shared latent space $z$. Generator 1 and generator 2 also share part of their layers. $x_1^{1\to2}$ and $x_2^{2\to1}$ are the images after domain transfer, while $x_1^{1\to1}$ and $x_2^{2\to2}$ are the reconstructed images. Discriminator 1 accepts the domain-transferred images $x_2^{2\to1}$ (fake) and domain 1 images $x_1$ (real) as input. Similarly, discriminator 2 accepts the domain-transferred images $x_1^{1\to2}$ (fake) and domain 2 images $x_2$ (real) as input.

Encoder

Layer   Encoders                         Shared?
1       CONV-(N64, K7, S2), LeakyReLU    No
2       CONV-(N128, K3, S2), LeakyReLU   No
3       CONV-(N256, K3, S2), LeakyReLU   No
4       RESNLK-(N512, K1, S1)            No
5       RESNLK-(N512, K1, S1)            No
6       RESNLK-(N512, K1, S1)            No
7       RESNLK-(N512, K1, S1)            Yes

Table 4.1: Network structure for the encoders in the basic UNIT.

Table 4.1 is a brief introduction of the encoder used in our basic UNIT. We will make several improvements based on this structure later. In this table, CONV-(N64, K7, S2), LeakyReLU means a convolution layer with 64 output channels, a kernel size of 7 × 7 and a stride of 2; after the convolution layer, we use a LeakyReLU as the activation function. RESNLK-(N512, K1, S1) means a residual block with 512 output channels, a kernel size of 1 × 1 and a stride of 1. Figure 4.4 shows the detailed structure of the residual block used in our basic UNIT.
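Read as code, Table 4.1 might look like the following PyTorch sketch, where only the last residual block is shared between the two domain encoders. This is an assumption-laden illustration: the residual-block internals of Figure 4.4 are not reproduced (a 1 × 1 projection shortcut is assumed where the channel count changes), the 3-channel input is assumed to match the generator's 3-channel output, and PyTorch is not the thesis framework.

```python
import torch.nn as nn

class ResBlock(nn.Module):
    """Residual block; a 1x1 projection on the shortcut is assumed when the
    channel count changes (the exact block in Figure 4.4 may differ)."""
    def __init__(self, in_ch, out_ch):
        super().__init__()
        self.body = nn.Sequential(
            nn.Conv2d(in_ch, out_ch, 1), nn.LeakyReLU(0.2, inplace=True),
            nn.Conv2d(out_ch, out_ch, 1),
        )
        self.skip = nn.Conv2d(in_ch, out_ch, 1) if in_ch != out_ch else nn.Identity()

    def forward(self, x):
        return self.body(x) + self.skip(x)

def private_encoder():
    # Layers 1-6 of Table 4.1 (not shared between the two domains)
    return nn.Sequential(
        nn.Conv2d(3, 64, 7, stride=2, padding=3), nn.LeakyReLU(0.2, inplace=True),
        nn.Conv2d(64, 128, 3, stride=2, padding=1), nn.LeakyReLU(0.2, inplace=True),
        nn.Conv2d(128, 256, 3, stride=2, padding=1), nn.LeakyReLU(0.2, inplace=True),
        ResBlock(256, 512), ResBlock(512, 512), ResBlock(512, 512),
    )

# Two domain encoders E1, E2 that share only the last residual block (layer 7),
# mapping both domains into the same latent space.
shared_block = ResBlock(512, 512)
E1 = nn.Sequential(private_encoder(), shared_block)
E2 = nn.Sequential(private_encoder(), shared_block)
```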

Figure 4.4: A brief description of the residual block used in UNIT.

Generator

The encoder and generator are symmetrical. The generator starts with four residual blocks, followed by four de-convolution layers. The two generators $G_1$, $G_2$ share the first residual block. Notice that the last layer uses a TanH activation function.

Layer   Generators                          Shared?
1       RESNLK-(N512, K1, S1)               Yes
2       RESNLK-(N512, K1, S1)               No
3       RESNLK-(N512, K1, S1)               No
4       RESNLK-(N512, K1, S1)               No
5       DCONV-(N256, K3, S2), LeakyReLU     No
6       DCONV-(N128, K3, S2), LeakyReLU     No
7       DCONV-(N64, K3, S2), LeakyReLU      No
8       DCONV-(N3, K1, S1), TanH            No

Table 4.2: Network structure for the generators in the basic UNIT.

Table 4.2 is a brief introduction of the generator used in our basic UNIT. We will make several improvements based on this structure. In this table, DCONV-(N64, K3, S2), LeakyReLU means a deconvolution layer with 64 output channels, a kernel size of 3 × 3 and a stride of 2.

Discriminator

The discriminator has a traditional CNN architecture, made of 6 convolution blocks; the activation function of the last convolution block is a sigmoid. We do not share any layers between the two discriminators. Table 4.3 gives the detailed structure of the discriminator.

Layer   Discriminators                      Shared?
1       CONV-(N64, K3, S2), LeakyReLU       No
2       CONV-(N128, K3, S2), LeakyReLU      No
3       CONV-(N256, K3, S2), LeakyReLU      No
4       CONV-(N512, K3, S2), LeakyReLU      No
5       CONV-(N1024, K3, S2), LeakyReLU     No
6       CONV-(N1, K2, S1), Sigmoid          No

Table 4.3: Network structure for the discriminators in the basic UNIT.

We can get different results using different combinations of encoder $E$ and generator $G$:

• $x^{1\to1}$, generated by the combination of $\{E_1, G_1\}$

• $x^{1\to2}$, generated by the combination of $\{E_1, G_2\}$

• $x^{2\to1}$, generated by the combination of $\{E_2, G_1\}$

• $x^{2\to2}$, generated by the combination of $\{E_2, G_2\}$

$x^{1\to2}$ and $x^{2\to1}$ are the results of style transfer. Each discriminator needs to distinguish the style-transferred images from the images of the original domain: $D_1$ discriminates between the image $x_1$ and the style-transferred image $x^{2\to1}$, and $D_2$ discriminates between the image $x_2$ and the style-transferred image $x^{1\to2}$.

4.4.3 Loss function

When considering the loss of UNIT, we combine several kinds of loss functions. A pair of encoder and generator $\{E_1, G_1\}$ can be viewed as a VAE. The encoder maps an image $x_1$ from domain $\mathcal{X}_1$ to a latent space $\mathcal{Z}$, which in formulation can be written as $q_1(z_1|x_1)$. We assume that the components of the latent space $\mathcal{Z}$ are independent and have unit variance, so $q_1(z_1|x_1) \equiv N(E_{\mu,1}(x_1), I)$, where $E_{\mu,1}(x_1)$ is the mean vector of the encoder outputs and $I$ is the unit variance. The generator $G_1$ is used to reconstruct the image from the encoder result; the reconstructed image is $x^{1\to1} = G_1(z_1 \sim q_1(z_1|x_1))$. From Equation (4.5), we can formulate the VAE losses as:

$$\mathcal{L}_{VAE_1}(E_1, G_1) = \lambda_1 KL(q_1(z_1|x_1)\,||\,p_\eta(z)) - \lambda_2 \mathbb{E}_{z_1 \sim q_1(z_1|x_1)}[\log p_{G_1}(x_1|z_1)] \tag{4.12}$$

$$\mathcal{L}_{VAE_2}(E_2, G_2) = \lambda_1 KL(q_2(z_2|x_2)\,||\,p_\eta(z)) - \lambda_2 \mathbb{E}_{z_2 \sim q_2(z_2|x_2)}[\log p_{G_2}(x_2|z_2)] \tag{4.13}$$

where we regard the prior distribution $p_\eta(z)$ as a zero-mean normal distribution $N(0, I)$. $KL(q_1(z_1|x_1)\,||\,p_\eta(z))$ represents the divergence between the conditional distribution given the image and the prior distribution. From the paper's [32] view, $p_{G_1}$ is a Laplacian distribution, so the second term minimizes the absolute distance between the input and the reconstructed image.

Then, if we only consider a generator and a discriminator $\{G_1, D_1\}$, we get a generative adversarial network, $GAN_1$. The generator can generate two kinds of images, $x^{1\to1}$ and $x^{2\to1}$. The $x^{1\to1}$ stream can be trained in a supervised way, so we only utilize $x^{2\to1}$ together with the source domain image $x_1$ to train the network $GAN_1$. From Equation (4.6), the GAN losses are:

$$\mathcal{L}_{GAN_1}(E_1, D_1, G_1) = \lambda_0 \mathbb{E}_{x_1 \sim p_{x_1}}[\log D_1(x_1)] + \lambda_0 \mathbb{E}_{z_2 \sim q_2(z_2|x_2)}[\log(1 - D_1(G_1(z_2)))] \tag{4.14}$$

$$\mathcal{L}_{GAN_2}(E_2, D_2, G_2) = \lambda_0 \mathbb{E}_{x_2 \sim p_{x_2}}[\log D_2(x_2)] + \lambda_0 \mathbb{E}_{z_1 \sim q_1(z_1|x_1)}[\log(1 - D_2(G_2(z_1)))] \tag{4.15}$$

[32] argues that the cycle-consistency constraint is a natural consequence of the proposed shared-latent-space assumption, so in addition to the VAE loss and the GAN loss, we also introduce a cycle-consistency constraint loss:

$$\mathcal{L}_{CC_1}(E_1, G_1, E_2, G_2) = \lambda_3 KL(q_1(z_1|x_1)\,||\,p_\eta(z)) + \lambda_3 KL(q_2(z_2|x_1^{1\to2})\,||\,p_\eta(z)) - \lambda_4 \mathbb{E}_{z_2 \sim q_2(z_2|x_1^{1\to2})}[\log p_{G_1}(x_1|z_2)] \tag{4.16}$$

$$\mathcal{L}_{CC_2}(E_2, G_2, E_1, G_1) = \lambda_3 KL(q_2(z_2|x_2)\,||\,p_\eta(z)) + \lambda_3 KL(q_1(z_1|x_2^{2\to1})\,||\,p_\eta(z)) - \lambda_4 \mathbb{E}_{z_1 \sim q_1(z_1|x_2^{2\to1})}[\log p_{G_2}(x_2|z_1)] \tag{4.17}$$

Then, by combining the previous loss functions, the objective function of UNIT is:

$$\min_{E_1, E_2, G_1, G_2}\; \max_{D_1, D_2}\; \mathcal{L}_{VAE_1}(E_1, G_1) + \mathcal{L}_{GAN_1}(E_1, D_1, G_1) + \mathcal{L}_{CC_1}(E_1, G_1, E_2, G_2) + \mathcal{L}_{VAE_2}(E_2, G_2) + \mathcal{L}_{GAN_2}(E_2, D_2, G_2) + \mathcal{L}_{CC_2}(E_2, G_2, E_1, G_1) \tag{4.18}$$

The training method is similar to GAN's: we first do a gradient ascent step by fixing $\{E_1, E_2, G_1, G_2\}$ and updating $\{D_1, D_2\}$; then we fix $\{D_1, D_2\}$ and do a gradient descent step, updating $\{E_1, E_2, G_1, G_2\}$.

4.4.4 Least squares GAN

Mao et al. propose the least squares GAN [35], which modifies the target loss function of GAN. When using the traditional loss function in Equation (4.6), they find that there is still a vanishing gradient while training a GAN. The reason is that with a sigmoid function we ignore the distance of a point to the decision boundary: in such a log loss, we only care about the sign and sometimes fail to penalize outlier points. The least squares GAN instead adopts an L2 loss, which punishes outlier points more. The training objective of least squares GAN is modified accordingly.

By combining this training objective with Equation (4.18), Equation (4.21) gives the new objective function of UNIT.

$$\min_{E_1, E_2, G_1, G_2}\; \mathcal{L}_{VAE_1}(E_1, G_1) + \mathcal{L}_{CC_1}(E_1, G_1, E_2, G_2) + V_{LSGAN}(E_1, G_1) + \mathcal{L}_{VAE_2}(E_2, G_2) + \mathcal{L}_{CC_2}(E_2, G_2, E_1, G_1) + V_{LSGAN}(E_2, G_2) \tag{4.21}$$

$$\min_{D_1, D_2}\; V_{LSGAN}(D_1) + V_{LSGAN}(D_2) \tag{4.22}$$

We adopt the suggestion in [35] and set $a = 0$, $b = 1$, and $c = 1$.
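Equations (4.19) and (4.20) are not reproduced above; the sketch below follows the standard least-squares GAN objective from [35] with the a = 0, b = 1, c = 1 choice, as an assumed stand-in for the thesis implementation.

```python
import numpy as np

def lsgan_losses(d_real, d_fake, a=0.0, b=1.0, c=1.0):
    """Least-squares GAN losses with labels a (fake), b (real) and c
    (the value the generator wants the discriminator to output)."""
    loss_d = 0.5 * np.mean((d_real - b) ** 2) + 0.5 * np.mean((d_fake - a) ** 2)
    loss_g = 0.5 * np.mean((d_fake - c) ** 2)
    return loss_d, loss_g
```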

4.4.5 Generated image pool

The generated image pool is a trick used in several GAN training schemes [60, 49]. When updating the discriminator, instead of directly calculating the loss from the currently generated image, we randomly pick an image from an image pool containing previously generated images, and this randomly selected image is used to calculate the discriminator loss. To update the image pool, we then replace the picked image with the currently generated one. From [49]'s view, an image pool can improve the stability of adversarial training.
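A minimal sketch of such a pool as described above; the pool size of 50 and the class name are assumptions.

```python
import random

class ImagePool:
    """History pool of generated images: the discriminator is updated with a
    randomly drawn past generation, and the pool is refreshed with the
    current one."""
    def __init__(self, size=50):
        self.size = size
        self.images = []

    def query(self, image):
        if len(self.images) < self.size:     # pool not full yet: store and use current
            self.images.append(image)
            return image
        idx = random.randrange(self.size)    # pick a stored image ...
        old = self.images[idx]
        self.images[idx] = image             # ... and replace it with the current one
        return old
```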

Consider that we may meet a partly annotated dataset, where only a few images are annotated. In the following two sub-sections, we introduce methods which are based on supervised training and try to improve the quality of the generated images.

4.4.6 Segmentation loss

The goal of UNIT is to do a domain transfer. In the real-world case, the target domain is usually a large dataset with fully annotated data, and the source domain is a dataset without any annotation or with only partly annotated data. In this section, we introduce an improved method based on a dataset with partial annotations.

For the annotated images, the domain-transferred image is fed into a pre-trained segmentation network together with its ground truth mask, so the loss of the segmentation network indicates the quality of the domain-transferred images. We name this kind of loss the segmentation loss $\mathcal{L}_{seg}$, and Equation (4.21) is changed into:

$$\min_{E_1, E_2, G_1, G_2}\; \mathcal{L}_{VAE_1}(E_1, G_1) + \mathcal{L}_{CC_1}(E_1, G_1, E_2, G_2) + V_{LSGAN}(E_1, G_1) + \mathcal{L}_{VAE_2}(E_2, G_2) + \mathcal{L}_{CC_2}(E_2, G_2, E_1, G_1) + V_{LSGAN}(E_2, G_2) + \mathcal{L}_{seg}(E_2, G_1) + \mathcal{L}_{seg}(E_1, G_2) \tag{4.23}$$

$\mathcal{L}_{seg}(E_2, G_1)$ is generated by the U-Net trained on the data from domain 1, and $\mathcal{L}_{seg}(E_1, G_2)$ is generated by the U-Net trained on the data from domain 2.

4.4.7 A mask branch

To improve the accuracy of the encoders, we add a new branch parallel to the generator. The goal of the generator is to reconstruct the images from the latent code produced by the encoders; the purpose of this new branch is to generate a tumor mask from that latent code. Comparing with the ground truth mask, we get a new loss, called $\mathcal{L}_{label}$.

Figure 4.5: A brief description of UNIT with a mask branch. The blue line connects the new mask branch. The structure of this new branch (mask generator) is similar to the generator.

The new loss, based on Equation (4.21), becomes:

$$\min_{E_1, E_2, G_1, G_2}\; \mathcal{L}_{VAE_1}(E_1, G_1) + \mathcal{L}_{CC_1}(E_1, G_1, E_2, G_2) + V_{LSGAN}(G_1) + \mathcal{L}_{VAE_2}(E_2, G_2) + \mathcal{L}_{CC_2}(E_2, G_2, E_1, G_1) + V_{LSGAN}(G_2) + \mathcal{L}_{label} \tag{4.24}$$


4.5 Other techniques

In this section, we will introduce some common tricks used in both segmentation and domain transfer.

4.5.1 Normalization layer

The normalization layer is an important technique used in building the network. In the following part, we describe different normalization techniques in detail.

Batch normalization

Batch normalization [22] is a standardizing operation on the layer input that provides a zero-mean and unit-variance output. In the U-Net design, batch normalization is applied between the convolution layer and the activation layer. For each batch $B = \{x_{1...m}\}$, we first calculate the mean and variance of the batch:

$$\mu_B = \frac{1}{m} \sum_{i=1}^{m} x_i \tag{4.25}$$

$$\sigma_B^2 = \frac{1}{m} \sum_{i=1}^{m} (x_i - \mu_B)^2 \tag{4.26}$$

To standardize this batch,

$$\hat{x}_i = \frac{x_i - \mu_B}{\sqrt{\sigma_B^2 + \epsilon}} \tag{4.27}$$

However, the normalization operation could make the activation layer use only its linear part if we use a sigmoid as the activation function, so batch normalization uses another two parameters to perform a scale and shift:

$$y_i = \gamma \hat{x}_i + \beta \equiv BN_{\gamma,\beta}(x_i) \tag{4.28}$$


Instance normalization

Instance normalization [7], also known as contrast normalization, has shown good performance on the style transfer problem. Different from the batch normalization operation, instance normalization normalizes each image instead of each batch. For each image $x_i$ with height $H$ and width $W$, Equations (4.29)-(4.31) show how we do the instance normalization:

$$\mu_i = \frac{1}{HW} \sum_{l=1}^{W} \sum_{m=1}^{H} x_{ilm} \tag{4.29}$$

$$\sigma_i^2 = \frac{1}{HW} \sum_{l=1}^{W} \sum_{m=1}^{H} (x_{ilm} - \mu_i)^2 \tag{4.30}$$

The normalized result is then:

$$y_{ilm} = \frac{x_{ilm} - \mu_i}{\sqrt{\sigma_i^2 + \epsilon}} \tag{4.31}$$
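A NumPy sketch contrasting the two normalization layers on an (N, H, W, C) tensor; the per-channel treatment and the NHWC layout are assumptions for illustration.

```python
import numpy as np

def batch_norm(x, gamma=1.0, beta=0.0, eps=1e-5):
    """x has shape (N, H, W, C); statistics are shared across the whole batch,
    per channel, as in Equations (4.25)-(4.28)."""
    mu = x.mean(axis=(0, 1, 2), keepdims=True)
    var = x.var(axis=(0, 1, 2), keepdims=True)
    return gamma * (x - mu) / np.sqrt(var + eps) + beta

def instance_norm(x, eps=1e-5):
    """Statistics are computed per image and per channel (Equations
    (4.29)-(4.31)), so each sample is normalized independently."""
    mu = x.mean(axis=(1, 2), keepdims=True)
    var = x.var(axis=(1, 2), keepdims=True)
    return (x - mu) / np.sqrt(var + eps)
```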

4.5.2 Activation function

The activation function is an essential part of network design; it introduces non-linearity into the network. In this section, we introduce different activation functions.

Sigmoid

The sigmoid function is one of the most popular activation functions used in neural networks. It can be written as:

$$f(x) = \frac{1}{1 + e^{-x}}$$

The non-linear form of the sigmoid function allows us to stack more layers. Notice that if $x$ is within $[-2, 2]$, the result $f(x)$ changes significantly with a small change of $x$; outside this range, the function tends to push $f(x)$ towards either 1 or 0.

Tanh

Tanh is also a popular activation function. It can be written as:

$$f(x) = \frac{2}{1 + e^{-2x}} - 1$$

In fact, it is a scaled sigmoid function:

$$f(x) = 2\,\mathrm{sigmoid}(2x) - 1$$

Rectifier Linear Unit

The Rectifier Linear Unit (ReLU) is a widely used activation function. If we use the sigmoid function or its variants, we usually meet a vanishing gradient problem: if $x$ is large, $f(x)$ changes very little, so during backpropagation the gradient tends towards zero. By using ReLU, we can avoid this kind of problem. ReLU can be written as:

$$f(x) = \max(0, x)$$

We can also change the ReLU into a leaky version, called leaky ReLU:

$$f(x) = \begin{cases} x, & x > 0 \\ \alpha x, & x \le 0 \end{cases}$$

5 Experiment & Result

5.1 Datasets

5.1.1 CBIS-DDSM Dataset

The Digital Database for Screening Mammography (DDSM) [19] is one of the largest available mammogram databases. Based on this database, we use a dataset called the Curated Breast Imaging Subset of DDSM (CBIS-DDSM) [30], which is an updated and standardized version of DDSM. CBIS-DDSM includes a subset of DDSM and is curated by a trained mammographer. All images are in DICOM format. We split 70% of the data as the training set, 10% as the validation set, and the last 20% as the testing set. The final numbers of mammography images are:

• Normal: Train 50, Validation 8, Test 12

• Cancer: Train 795, Validation 110, Test 228

5.1.2 INBreast Dataset

INBreast [39] is a mammographic dataset with images acquired at a breast centre located in a university hospital (Hospital de São João, Breast Centre, Porto, Portugal). INBreast has 374 images with mask annotations indicating the positions of tumors. We split 70% of the data as the training set, 10% as the validation set, and the last 20% as the testing set. The final numbers of mammography images are:

• Normal: Train 154, Validation 22, Test 42

• Cancer: Train 107, Validation 15, Test 30

5.2 Evaluation metrics

To describe the quality of the segmentation results, we adopt several metrics, which we divide into two parts: pixel level and instance level.

5.2.1 Pixel level metrics

In this part, the metrics are mainly used to describe the performance of segmentation at the pixel level: we primarily focus on whether each pixel is correctly classified or not. Notice that the output of our segmentation is a probability map, where each value represents the probability of being a positive pixel, so we need a fixed threshold to convert the probability map into a binary mask. Once we have a binary mask, we can compare it with the ground truth mask. We then have the following quantities.

• True Positive (TP) is the number of pixels which are correctly classified as positive pixels.

• False Positive (FP), also known as $\alpha$ or type I error, is the number of negative pixels which are classified as positive pixels.

• False Negative (FN), also known as $\beta$ or type II error, is the number of positive pixels which are classified as negative pixels.

• True Negative (TN) is the number of negative pixels which are correctly classified.

Based on these concepts, we calculate precision and sensitivity (also known as recall) in the following manner:

\mathrm{Precision} = \frac{TP}{TP + FP}    (5.1)

\mathrm{Sensitivity} = \frac{TP}{TP + FN}    (5.2)

We use the Matthews correlation coefficient (MCC) to estimate the quality of our binary classifier, which could be calculated as:

MCC = \frac{TP \times TN - FP \times FN}{\sqrt{(TP + FP)(TP + FN)(TN + FP)(TN + FN)}}    (5.3)

Additionally, we use the dice score coefficient (DSC), which is

DSC = \frac{2TP}{2TP + FP + FN}    (5.4)

Intersection over union (IoU) is also a popular metric used in binary semantic segmentation, which is written as

IoU = \frac{\mathrm{Intersection}}{\mathrm{Union}} = \frac{TP}{TP + FP + FN}    (5.5)
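As an illustration of how these pixel level metrics are computed, the following sketch evaluates a probability map against a ground truth mask with NumPy; the function name and the fixed threshold of 0.5 are illustrative assumptions rather than our exact implementation.

```python
import numpy as np

def pixel_metrics(prob_map, gt_mask, threshold=0.5, eps=1e-9):
    # Binarize the probability map with a fixed threshold (Section 5.2.1).
    pred = prob_map >= threshold
    gt = gt_mask.astype(bool)
    tp = float(np.sum(pred & gt))
    fp = float(np.sum(pred & ~gt))
    fn = float(np.sum(~pred & gt))
    tn = float(np.sum(~pred & ~gt))
    precision = tp / (tp + fp + eps)                                   # Eq. (5.1)
    sensitivity = tp / (tp + fn + eps)                                 # Eq. (5.2)
    mcc = (tp * tn - fp * fn) / (np.sqrt(
        (tp + fp) * (tp + fn) * (tn + fp) * (tn + fn)) + eps)          # Eq. (5.3)
    dice = 2 * tp / (2 * tp + fp + fn + eps)                           # Eq. (5.4)
    iou = tp / (tp + fp + fn + eps)                                    # Eq. (5.5)
    return {"precision": precision, "sensitivity": sensitivity,
            "mcc": mcc, "dice": dice, "iou": iou}
```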

5.2.2 Instance level metrics

The segmentation network mainly focuses on locating where the tumor is, and evaluation on the pixel level alone is not enough, so we propose some instance level metrics.

First of all, considering that each mask might contain several tumors and that the tumors could overlap each other, we use the watershed algorithm to extract these tumors and their positions from the mask. Figure 5.1 indicates how the watershed algorithm works. With the help of U-Net, we get a probability map, like Figure 5.1.b, and then we set a fixed threshold, transforming the probability map into a binary mask, like Figure 5.1.c. The watershed algorithm then splits the binary mask on the instance level and marks out the different tumors. The result is shown in Figure 5.1.d.
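A minimal sketch of this instance-splitting step, using the standard distance-transform watershed recipe from scikit-image, is given below; the min_distance value and the helper name are illustrative assumptions, and our pipeline's marker selection may differ in detail.

```python
import numpy as np
from scipy import ndimage as ndi
from skimage.feature import peak_local_max
from skimage.segmentation import watershed

def split_instances(binary_mask, min_distance=20):
    """Split a boolean tumor mask into labelled instances (0 = background)."""
    # Distance from each foreground pixel to the background; peaks sit at tumor centres.
    distance = ndi.distance_transform_edt(binary_mask)
    # Local maxima of the distance map become the watershed markers.
    coords = peak_local_max(distance, min_distance=min_distance, labels=binary_mask)
    markers = np.zeros(distance.shape, dtype=int)
    markers[tuple(coords.T)] = np.arange(1, len(coords) + 1)
    # Flood the inverted distance map; every basin is one tumor instance.
    return watershed(-distance, markers, mask=binary_mask)
```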

Now, for both the ground truth and the predicted mask, we obtain the location and the area of each tumor. Then, for each pair of predicted and ground truth tumors, we calculate the IoU value and compare it against a threshold. If the IoU value is over the threshold, we consider this tumor found.

Once we have the numbers of ground truth tumors, predicted tumors, and truly found tumors, we set


Figure 5.1: (a) is the input image; (b) is the output probability map after feeding (a) into U-Net; (c) is a binary mask obtained by setting a fixed threshold on (b); (d) is the result of the watershed algorithm using (c) as input. The watershed algorithm marks out the different parts and implements an instance level segmentation. In picture (d), we have two predicted tumors, marked with different colors.

• True Positive (TP) is the number of the predicted tumors which correctly match a ground truth tumor.

• False Positive (FP), also known as α or type I error, is the number of the predicted tumors which are actually not tumors.

• False Negative (FN), also known as β or type II error, is the number of the ground truth tumors which are not correctly found.

Based on these concepts, we calculate precision and sensitivity (also known as recall) in the following manner:

\mathrm{Precision} = \frac{TP}{TP + FP}    (5.6)

\mathrm{Sensitivity} = \frac{TP}{TP + FN}    (5.7)
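The instance level counting can be sketched as a greedy IoU matching between the label maps of the predicted and ground truth tumors (e.g. the outputs of the watershed step above); the greedy assignment and the helper name are assumptions made for illustration, not necessarily the exact procedure used in our evaluation.

```python
import numpy as np

def instance_precision_recall(pred_labels, gt_labels, iou_threshold=0.5):
    # pred_labels / gt_labels: integer label maps where 0 is background.
    pred_ids = [i for i in np.unique(pred_labels) if i != 0]
    gt_ids = [i for i in np.unique(gt_labels) if i != 0]
    matched, tp = set(), 0
    for p in pred_ids:
        p_mask = pred_labels == p
        best_iou, best_gt = 0.0, None
        for g in gt_ids:
            if g in matched:
                continue
            g_mask = gt_labels == g
            union = np.sum(p_mask | g_mask)
            iou = np.sum(p_mask & g_mask) / union if union else 0.0
            if iou > best_iou:
                best_iou, best_gt = iou, g
        if best_iou >= iou_threshold:        # this predicted tumor counts as found
            tp += 1
            matched.add(best_gt)
    fp = len(pred_ids) - tp                  # predicted tumors matching nothing
    fn = len(gt_ids) - tp                    # ground truth tumors never found
    precision = tp / (tp + fp) if pred_ids else 0.0      # Eq. (5.6)
    sensitivity = tp / (tp + fn) if gt_ids else 0.0      # Eq. (5.7)
    return precision, sensitivity
```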


5.3 U-Net for semantic segmentation

In this section, we use the CBIS-DDSM Dataset for training and testing. We first get a performance baseline by defining a basic network.

5.3.1 Basic network

The basic network is trained on CBIS-DDSM images resized to 1/8 of the original size. We then horizontally flip these images and, together with the original images, write them into a TFRecord file to accelerate I/O processing.

Then we apply random rotation and random resizing on these images. The random resize ratio is from 0.75 to 1.25. Because we are going to face a domain adaptation problem in the following part, in the basic network structure we do not apply the contrast augmentation operation. Based on these resized and rotated images, we randomly crop 128 × 128 patches from each image and randomly select the same amount of positive and negative patches as a mini-batch.
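The balanced patch sampling can be sketched as follows; this is a simplified rejection-sampling version written for illustration, and the helper name and the number of retries are assumptions rather than our exact implementation.

```python
import numpy as np

def sample_patch(image, mask, patch_size=128, positive=True, max_tries=50):
    # Draw a random 128 x 128 crop; 'positive' asks for a crop containing tumor pixels.
    h, w = image.shape
    for _ in range(max_tries):
        y = np.random.randint(0, h - patch_size + 1)
        x = np.random.randint(0, w - patch_size + 1)
        m = mask[y:y + patch_size, x:x + patch_size]
        if bool(m.any()) == positive:
            return image[y:y + patch_size, x:x + patch_size], m
    # Fall back to the last crop if no suitable patch was found.
    return image[y:y + patch_size, x:x + patch_size], m
```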

The basic network structure is shown in Figure 4.2. We adopt the Adam optimizer and train the network for 80k iterations. The learning rate for this experiment starts from 5e-4, and after every 20k iterations we divide the learning rate by 2.
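A minimal sketch of this optimization setup in TensorFlow 1.x (the API style suggested by our TFRecord pipeline) is shown below; the function name is illustrative.

```python
import tensorflow as tf

def build_train_op(loss):
    # `loss` is the segmentation loss defined elsewhere in the pipeline.
    global_step = tf.train.get_or_create_global_step()
    # Start from 5e-4 and halve the learning rate every 20k iterations (staircase decay).
    learning_rate = tf.train.exponential_decay(
        learning_rate=5e-4, global_step=global_step,
        decay_steps=20000, decay_rate=0.5, staircase=True)
    return tf.train.AdamOptimizer(learning_rate).minimize(loss, global_step=global_step)
```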


Figure 5.2: An example of a successful segmentation result, where (a) is the original mammography image; (b) is the segmentation result represented as a heatmap; (c) is the binary mask obtained by setting a fixed threshold on the heatmap; (d) is the ground truth mask of the original mammography image.

However, not all test cases could be correctly segmented. In the following examples, we can see that there exist false positive and false negative judgments on our test cases.


Figure 5.4: An example of a segmentation result with false negative cases. Comparing (b) and (d), we can see that several tumor tissues are misclassified as non-tumor.

5.3.2 Experiments on different parameters

To find the best parameter setting, we use the controlled variable method, changing one parameter at a time and testing on the test dataset. In the following part, we design a series of experiments based on the basic U-Net structure described in Section 5.3.1, to study the best setting for each kind of parameter. We design the experiments in the following directions:

Channel numbers In this part, we study the influence of the channel numbers on the segmentation performance. We design the following three experiments:

• Double Channels In the basic network, as the spatial dimensions decrease, the channel numbers range from 4 to 64. In this double channels setup, the channel numbers range from 8 to 128.

• Double Left Channels Inspired by [38], we double the channel numbers of the convolution/left part of U-Net, while the de-convolution/right part keeps the same channel numbers as the basic network.

• Four times Channels In this setup, the channel numbers are four times those of the basic network.


Loss function In this part, we will do one experiment:

• Dice Loss In the basic network, we use cross-entropy loss; in this Dice Loss setting, we use the dice loss described in Section 4.3.4 instead of cross-entropy loss (see the sketch below).
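For reference, one common soft formulation of the dice loss is sketched below; the exact variant described in Section 4.3.4 may differ, for example in smoothing or per-class weighting.

```python
import tensorflow as tf

def soft_dice_loss(probs, labels, eps=1e-6):
    # probs: predicted probabilities in [0, 1]; labels: binary ground truth, same shape.
    intersection = tf.reduce_sum(probs * labels)
    union = tf.reduce_sum(probs) + tf.reduce_sum(labels)
    dice = (2.0 * intersection + eps) / (union + eps)
    # Minimizing (1 - DSC) maximizes the overlap with the ground truth mask.
    return 1.0 - dice
```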

Data preparation The data preparation covers the data augmentation and the shrinkage of the image resolution. In this part, we implement the following three experiments:

• Random Contrast&Brightness In the basic network, we only apply rotation and scaling to augment the images. In this case, besides rotation and scaling, we add random contrast and brightness to augment the images (see the sketch after this list).

• Increase Input Resolution In this experiment, we do not change the network structure, but we increase the resolution of the input images. The training images' height and width in this architecture are resized to 1/4 of the original size, whereas in the basic network setting they are resized to 1/8 of the original size.

• Stronger Scaling In this experiment, we use the basic network setting, except that during the resize augmentation we change the resize ratio from [0.75, 1.25] to [0.5, 1.5], which means that we have a stronger augmentation in this network.
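A minimal sketch of the random contrast and brightness augmentation mentioned in the first bullet, using standard tf.image operations, is shown below; the ranges are illustrative assumptions since the thesis does not fix them here.

```python
import tensorflow as tf

def random_contrast_brightness(image):
    # image: single-channel float image with values in [0, 1].
    image = tf.image.random_brightness(image, max_delta=0.1)
    image = tf.image.random_contrast(image, lower=0.8, upper=1.2)
    # Keep the augmented intensities in the valid range.
    return tf.clip_by_value(image, 0.0, 1.0)
```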

Normalization In this part, we mainly study the influence of the normalization layers. In the basic network, we do not use any normalization between the convolution layer and the activation function. In the following two experiments, we add normalization layers between the convolution layers and the activation layers (a sketch of this ordering follows the list below).

• Batch Normalization In this architecture, we add a batch normalization layer between the convolution layer and the activation layer (ReLU).

• Instance Normalization In this architecture, we add an instance normalization layer between the convolution layer and the activation layer (ReLU).
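The conv → normalization → activation ordering used in these experiments can be sketched as follows for the batch normalization case (TensorFlow 1.x layers API; the helper name is illustrative).

```python
import tensorflow as tf

def conv_bn_relu(x, filters, training):
    # Convolution without activation, then batch normalization, then ReLU.
    x = tf.layers.conv2d(x, filters, kernel_size=3, padding='same', activation=None)
    x = tf.layers.batch_normalization(x, training=training)
    return tf.nn.relu(x)
```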


Other training techniques In this part, we apply some techniques which have proved useful in data competitions, to check whether they are also useful in our experiments.

• Change Last 5k Iterations Inspired by HikVision's experience in ILSVRC 2016, where they also trained a patch-level network and used data augmentation. The idea is that we should train in the same way as we test, so in the last several epochs the network is trained on images without augmentation. In this architecture, we use the basic network setting, except that we train the last 5k iterations on the data without augmentation.

In Table 5.1, we describe the previous 11 experiments' results using the metrics discussed in Section 5.2. For pixel level metrics, we calculate the average pixel-level dice, precision, recall, MCC, and IoU on the test set. For instance level metrics, we vary the threshold for finding a tumor from 0.25 to 0.75 with a step of 0.05. We choose to present the precision and recall when the thresholds are 0.25 and 0.5. We also present the average precision (AP) and average recall (AR).

From the results we can get the following conclusions:

• Batch normalization can greatly improve the network's performance. However, instance normalization cannot; it makes the result even worse.

• Comparing dice loss and cross-entropy loss, in this case cross-entropy loss seems to perform better than dice loss.

• It seems that the training trick of changing the last 5k iterations to images without augmentation does not work well on our dataset.

• A too large channel number makes the network overfit. According to our experiments, double channels is the best channel setting so far.

• Stronger augmentation does not improve the performance. Sometimes it makes the result even worse.


Table 5.1: Experiment Results on Different Parameters

                              Pixel Level Metrics                         Instance Level Metrics
                              Dice    Precision Recall  MCC     IoU       P0.25   R0.25   P0.5    R0.5     AP      AR
CBIS-DDSM                     0.4298  0.4379    0.5671  0.4916  0.3237    0.4520  0.6134  0.3622  0.49159  0.3357  0.45569
INBreast                      0.2503  0.3239    0.4026  0.3965  0.2703    0.4667  0.5600  0.4000  0.4800   0.3848  0.4618
Double Channels               0.4203  0.3739    0.6233  0.4655  0.2862    0.3767  0.7215  0.3128  0.5991   0.2917  0.5589
Double Left Channels          0.4066  0.4026    0.5572  0.4523  0.2882    0.3802  0.6962  0.3087  0.5654   0.2926  0.5359
Four times Channels           0.4153  0.3604    0.6444  0.4650  0.2730    0.3486  0.7046  0.2965  0.5992   0.2720  0.5497
Dice Loss                     0.3606  0.2992    0.6440  0.4165  0.2306    0.3243  0.6962  0.2692  0.5780   0.2477  0.5320
Random Contrast&Brightness    0.3657  0.4519    0.4012  0.4213  0.2793    0.3905  0.6245  0.3113  0.4979   0.3034  0.4852
Increase Input Resolution     0.4034  0.4340    0.5020  0.4434  0.2920    0.3977  0.5991  0.3221  0.4852   0.3065  0.4618
Stronger Scaling              0.3900  0.4526    0.4460  0.4370  0.2869    0.3705  0.6034  0.3109  0.5063   0.2901  0.4726
Batch Normalization           0.4353  0.4431    0.5530  0.4825  0.3107    0.4155  0.7257  0.3454  0.6034   0.3223  0.5631
Instance Normalization        0.1723  0.1137    0.6757  0.2394  0.0623    0.0990  0.2743  0.0706  0.1941   0.0646  0.1776
Change Last 5k Iterations     0.4116  0.3845    0.5850  0.4569  0.2859    0.3755  0.7130  0.3222  0.6118   0.2967  0.5634


5.3.3 Discussion

In this section, we explore different kinds of U-Net settings. First of all, we mainly focus on finding the best channel number setting. If the channel number is too small, the network cannot hold all the information, which might cause under-fitting. On the other hand, if the channel number is too big, the performance is still bad: the network is now too strong, which means that it over-fits. In Table 5.1, we compare different channel number settings. In the end we choose the double channels setting, as its recall is relatively higher while the other metrics remain almost stable. When we use four times the channel numbers, the test results indicate that the network tends to over-fit.

We then compare dice loss and cross-entropy loss. According to related work [38, 59], many medical deep learning systems prefer dice loss. However, in our work, dice loss seems to perform a bit worse than cross-entropy loss. We think this is because of the extremely unbalanced dataset.

A stronger augmentation also hurts performance. After comparing the INBreast dataset and the CBIS-DDSM dataset, we find that contrast and brightness matter a lot in mammography. As we only have one channel (grey scale), if we change the brightness and contrast, some tumors will even look like normal tissue. If we add random contrast and brightness during training, the network gets confused, especially since our dataset is not so big. Stronger scaling has the same problem: the main restriction of our work is that the dataset is not big enough, and too strong scaling will confuse the network and cover too many unnecessary cases.
