
IT 19 017

Degree project 45 credits, May 2019

Image Based Flow Path Recognition for Chromatography Equipment

Ankur Shukla


Faculty of Science and Technology, UTH unit

Visiting address: Ångströmlaboratoriet, Lägerhyddsvägen 1, Hus 4, Plan 0

Postal address: Box 536, 751 21 Uppsala

Telephone: 018 – 471 30 03

Fax: 018 – 471 30 00

Website: http://www.teknat.uu.se/student

Abstract

Image Based Flow Path Recognition for Chromatography Equipment

Ankur Shukla

The advancement of the computer vision field with the help of deep learning methods has been significant. The increase in computational resources has led researchers to develop solutions that achieve high accuracy in image segmentation tasks. We performed segmentation of different types of objects in the chromatography instruments used at GE Healthcare, Uppsala.

In this thesis project, we investigated methods in computer vision and deep learning to segment the different types of objects in instrument images. For a machine to automatically learn the features directly from instrument images, a deep convolutional neural network was implemented based on a recently developed existing architecture.

The dataset was collected and preprocessed before being used with the neural network model. The model was trained with two different architectures developed for image segmentation, Unet and Segnet. Both architectures are efficient and suitable for semantic segmentation tasks. Among the different components to segment in the instrument were thin pipes. Unet was able to achieve good results when segmenting thin pipes even with little data. The results show that Unet can act as a suitable architecture for segmenting different objects in an instrument even with only 100 images.

Further work could improve the performance of the model by generating better masks and by finding ways to collect more data for training the model.

Subject reviewer: Robin Strand. Supervisor: Petra Bängtsson.


Contents

Acknowledgements

1 Introduction

2 Background
  2.1 Edge Detection
    2.1.1 Classical Operators
  2.2 Deep Learning
  2.3 Neural Network
    2.3.1 Training the Model
    2.3.2 Activation Function
    2.3.3 Loss Function
    2.3.4 Convolutional Neural Networks
    2.3.5 Transposed Convolution
    2.3.6 Batch Normalization
    2.3.7 Optimizers
    2.3.8 Segmentation using Neural Network
    2.3.9 Dice Loss
    2.3.10 Unet Model
    2.3.11 Segnet

3 Dataset
  3.1 Creating mask of dataset
  3.2 Data Augmentation
    3.2.1 Data Augmentation Methods
    3.2.2 Hardware & Software

4 Experiments & Results
  4.1 Edge based Methods
  4.2 Neural Networks results
    4.2.1 First Experiment
    4.2.2 Second Experiment
    4.2.3 Third Experiment

5 Conclusion


List of Figures

1 Results from the deep CNN model named Alexnet in [1].
2 ÄKTA pure instrument used in chromatography.
3 Image without (left) and with (right) noise [2].
4 An ideal edge pixel and corresponding gradient vector. The magnitude of the gradient indicates the strength of the edge [3].
5 Main components in neurons [4].
6 An ANN which receives the inputs z1, z2, ..., zI and connects them to the neuron through the weighted edges v1, v2, ..., vI. The input values are multiplied by their respective weights and subtracted from a threshold (θ). The result is passed to an activation function which non-linearly transforms the weighted sum [5].
7 Simple example of a neural network.
8 Division of steps in training of an ANN model.
9 Bias effect on the equation's output.
10 Our goal is to reach the lowest point of the bowl-shaped surface by minimizing the loss, which is possible by taking the derivative of the loss with respect to the weights [6].
11 Variation in weight update due to a high learning rate [6].
12 Graphical output of the sigmoid function.
13 Graphical output of the TanH function.
14 Graphical plots for the ReLU and Leaky ReLU activation functions.
15 Variation between the loss function and the weights. w denotes the weights of the ANN and J(w) is the loss function.
16 Basic structure of an NN model with CNNs [7]. In this model convolutional layers are stacked between ReLUs before being passed through the pooling layer and, in the end, through one or more fully connected layers.
17 A 5x5 kernel convolving over the whole image; the result is passed to the first hidden layer [6]. Figure 18 shows the actual multiplication of the kernel with the image.
18 A visual representation of a computation in a convolutional layer. The middle element of the kernel is placed over a pixel of the input image, which is then replaced with a weighted sum of itself and nearby pixels [8].
19 Activation map produced by a convolutional layer on the MNIST [9] dataset; some blurry edges of various digits are visible [10].
20 Convolution of a 3x3 kernel over a 5x5 image with stride and padding of 1 [11].
21 Result of applying MAX pooling on an input image.
22 The transpose of convolving a 3x3 kernel over a 4x4 input using unit stride [11].
23 Sparse matrix C.
24 Illustration of applying batch normalization in deep neural networks.
25 Fluctuations in SGD [12].
26 Representation of an image as different labels denoting different objects [13].
27 Output of an ANN model as a segmentation map [13].
28 One-hot encoding of the Color column.
29 Neural network model for segmentation preserving the dimensions of an image through the network [10].
30 Neural network model (Alexnet) used in classification of an image [10].
31 Converting all FC layers into convolutional layers [14].
32 Dense prediction of per-pixel values of an image, resulting in segmentation by an FCN model [15].
33 Each matrix represents a different label in the image; the Dice coefficient is calculated for the mask of each label [13].
34 U-net architecture (example for 32x32 pixels in the lowest resolution). Each blue box corresponds to a multi-channel feature map. The number of channels is denoted on top of the box. The x-y-size is provided at the lower left edge of the box. White boxes represent copied feature maps. The arrows denote the different operations [16].
35 An illustration of the SegNet architecture. Only convolutional layers and no fully connected layers [17].
36 Architecture of Segnet basic, having only 4 encoders and decoders.
37 Sample image of ÄKTA pure with red pipes.
38 Image from a different angle consisting of both blue and red pipes.
39 The silver parts are the pumps of an ÄKTA pure instrument.
40 Expected input-output of ANNs used in segmentation of an image.
41 Example of how pumps in a sample image were colored for creating the mask.
42 Mask of the left image using inverted binary thresholding.
43 Sample from the training dataset for valves.
44 Sample of editing the image on an iPad.
45 RGB color space.
46 HSV color space.
47 Representation of how the mask is generated for instrument images.
48 Training data used for blue pipes. Only blue pipes are present in the generated mask set.
49 Training data used for red pipes. Only red pipes are present in the generated mask set.
50 Training data used for green pipes. Only green pipes are present in the generated mask set.
51 Example of data augmentation using rotation and shifting.
52 Example of data augmentation using horizontal flipping.
53 Example of data augmentation using shearing.
54 Sample image used with edge based methods.
55 Edges detected in original image 54b using classical operators.
56 Sample image from the training set for model1 with the corresponding mask generated using binary inverse thresholding in RGB color space.
57 Sample image from the training set for model2 with the corresponding mask generated using binary inverse thresholding in HSV color space.
58 Result generated by the Unet model1 trained on the RGB color space mask, as illustrated in 56.
59 Result generated by the Unet model2 trained on the HSV color space mask, as illustrated in 57.
60 Sample image from the training set for models where valves are trained with the mask shown in 60b and pipes with the mask in 60c.
61 Sample image from the training set for models where valves are trained with the mask shown in 61b and pipes with the mask in 61c.
62 Results from the Unet model for figure 62a, segmenting valves and pipes separately and then adding them together.
63 Results from the Unet model for figure 63b and from the Segnet model for just the pipes of 63a, segmenting valves and pipes separately and then adding them together.
64 Difference between the results from the Segnet and Unet models for segmenting out only the pipes in the image.
65 Unet model prediction for pipes in the image.
66 Blue pipes training set. Sample image and mask.
67 Red pipes training set. Sample image and mask.
68 Green pipes training set. Sample image and mask.
69 Dice loss curves for all three pipe models with their corresponding learning rate values.
70 Figure 70a is one of the sample images from the test set. Its corresponding prediction for blue pipes is shown in 70b, red in 70c and green in 70d. 70e is the addition of all three pipe models and figure 70f is the final output after adding the output from valves and 70e.
71 71a is one of the test set images used in earlier experiments. 71b is the prediction from the model trained in the second experiment and 71c is the prediction from the model trained on only blue pipes.
72 Results from the Unet model for blue pipes in 72b and for red pipes in 72d.
73 Another example of a test set image where the results improved in segmenting pipes, as can be noticed by looking at 73e and 73f.


Acknowledgements

Throughout my dissertation, I have received a great deal of support and assistance. I would first like to thank my supervisor, Petra Bängtsson, and my reviewer, Robin Strand, for formulating together an appropriate research topic. You provided me with the tools that I needed to choose the right direction and guided me during the whole period of my thesis dissertation, in order to achieve the best results and gain a deep understanding of the topic.

I would like to offer my special thanks to the section manager, Frederik Isacsson, for his excellent cooperation and for all of the opportunities I was given to conduct my dissertation at GE Healthcare. I would also like to thank my colleague at GE Healthcare, Roger Lundqvist, for the constant discussions about the topic and his special guidance while writing the report.

In addition, I would like to thank my parents and siblings for their wise counsel and sympathetic ear. You are always there for me. Finally, I thank my friends, who were of great support in deliberating over my problems and findings, as well as providing a happy distraction to rest my mind outside of my dissertation.

Thank you for all your encouragement.


Chapter 1

Introduction

Chromatography is a technique for the separation of components in a mixture. "The mixture is dissolved in a fluid called the mobile phase (can be either liquid or gas), which carries it through a structure holding another material called the stationary phase (either solid or liquid phase). Various different components in the mixture travel at different speeds, fostering them to separate. The separation is based on differential partitioning between the mobile and stationary phases. Subtle differences in a compound's partition coefficient result in differential retention on the stationary phase and thus affect the separation" [18].

Chromatography is one of the common separation techniques used in the pharmaceutical industry. Instruments used in this method tend to be complicated, making it hard for the person using them to understand how the liquid flows through the different tubings and how components are connected to one another. In order to find the connections, we need to find the different components in the machine. Components used in the machine are of varying sizes, lengths, and shapes, which makes them hard to identify. If the identification of the components can be done, then it opens up the possibility to build a number of applications to help the user understand the instrument.

With the rapid development in image-based technology, performing image classification or segmentation has never been easier. Image classification involves classifying images into one of many possible labels, and segmentation involves the identification of regions of interest, which generally are an object or a part of a digital image [19]. Many industries, such as textile, farming, wood and paper, are trying to improve their products with the help of machine learning and computer vision (CV). Various machine vision methods have been developed [20, 21]. They have helped in improving inspection tasks and address several key concepts of the processing chain, from the original image to the final classification task. Detecting defects using segmentation and classifying them into different types is being done using machine learning (ML) methods [20], for example, defect detection in weld beads [22], which is quite important for high-quality welding. These methods are also being used for the automatic inspection of fruits and vegetables. Some of the applications of such systems include grading, quality estimation from external parameters or internal features, monitoring of fruit processes during storage, and evaluation of experimental treatments [21]. Such methods can also detect events that take place outside the visible electromagnetic spectrum, using ultraviolet or near-infrared spectra. These inspection methods have helped in detecting defects in Nonel tubes, which are filled with explosives, where missed defects would lead to potential danger. Methods developed using CV and ML not only improve efficiency but also reduce labor cost and save social resources.

Figure 1: Results from the deep CNN model named Alexnet in [1].

Advancements in deep learning have outperformed the state of the art in many visual recognition tasks [1, 23], as can be seen in Figure 1. A breakthrough came when Krizhevsky et al. developed a model named Alexnet [1], which uses convolutional neural networks (CNNs). Before CNNs, features were extracted from images manually and then a model was trained using machine learning methods. Although convolutional networks had already existed for a long time [24], their success was restricted by the size of the available training sets, the computational power for training, and the size of the considered networks. The Alexnet network consisted of 8 layers with millions of parameters, trained on the ImageNet dataset [25] with 1 million training images. Since then, many larger and deeper networks have been developed to improve performance [26].

Typical usage of a convolutional neural network consists of performing a classification task where the output for an image can be a single label or multiple labels. Deep CNNs (DCNNs) have also achieved remarkable results for object detection [23, 27, 28], where they are trained in an end-to-end manner, delivering strikingly better results than systems relying on hand-crafted features [29]. However, with increasing data and computational power, the next step in the progression from coarse to fine inference is to be able to make a prediction at each pixel.

Segmentation of an image differs from a classification task because it requires predicting a class for each pixel of the input image, instead of only one class for the whole input. In [15], an FCN trained end-to-end, pixels-to-pixels, on semantic segmentation exceeded the state of the art without further machinery. Unet [16] was developed by modifying and extending the FCN architecture such that it works with few images while yielding more precise segmentation results. Many other models subsequent to Unet [17, 29, 30] have been developed. Segnet, one of them, performs well as it maps low-resolution features to input resolution for pixel-wise classification by reusing the max pooling indices from the encoder part of its architecture.

The aim of this project is to find the best approach for flow path recognition in chromatography equipment used by GE Healthcare in a standard lab environment and, if a feasible approach exists, to develop a prototype model for the recognition (not the final software). Finding the flow path means we need to identify the different types of objects typically present in chromatography instruments. Figure 2 shows one of the ÄKTA™ instruments used for chromatography. The very first step is to perform segmentation of the different types of objects in the instrument.

Figure 2: ÄKTA pure instrument used in chromatography.

The deep learning models mentioned above have achieved some really good results in image segmentation, which is one of the main reasons they are a potential solution for achieving the aim of the project.

The remainder of this report is organized as follows. In Chapter 2 we review the relevant background; we describe the Unet and Segnet architectures and the different layers used in them. Chapter 3 consists of a description of the data and how it is preprocessed before training the model. In Chapter 4 we evaluate the performance of edge-based and deep learning methods. This is followed by a conclusion in Chapter 5.


Chapter 2

Background

2.1 Edge Detection

Edge detection is a type of image segmentation technique where we segment out different objects on the basis of their intensities. Images consist of pixels of varying intensity, together with noise. The discontinuities in the image, when processed with image processing filters, help in identifying boundaries by locating sharp edges [2]. Such detection filters are very useful for extracting important features such as lines and curves, which can be helpful in image recognition or computer vision algorithms. However, we need to maintain the structural properties of an image while extracting edges, which makes it an onerous job.

It becomes even more strenuous to extract edges when there is noise in an image, because noise also signifies swift changes in the intensity values of the image [31]. In the computer vision field, edge detection is one of the fundamental steps [32]. Many important features like corners, lines and curves can be segmented from the image. These basic features are then used by higher-level computer vision algorithms [32].

Figure 3: Image without (left) and with (right) noise [2].

Figure 4 represents an edge pixel and the corresponding gradient vector. At the pixel, there is a change in intensity from 0 to 255 in the direction of the gradient. Calculating the gradient in uniform regions results in a zero vector, which signifies that there is no edge pixel. Natural images usually do not have the ideal discontinuity or the uniform regions that can be seen in Figure 4. For detecting the edge pixels we process the magnitude of the gradient. The elementary processing applies a threshold: if the magnitude of the gradient is larger than the threshold, we decide that the pixel corresponds to an edge pixel. The edge detection methods discussed below each have their own filter kernel, which is convolved with the image to compute its derivative. Convolving an image with these filter kernels essentially computes the difference between the values of neighbouring pixels. For example, the derivative in the x-direction at pixel x1 is taken as the difference between the pixel values to its left and right, x0 and x2 (first derivative). The image derivative is really useful for locating edges, since the derivative gives the rate of change, and due to the discontinuity in pixel intensities an edge constitutes a high rate of change [33].

Figure 4: An ideal edge pixel and corresponding gradient vector. The magnitude of the gradient indicates the strength of the edge [3].

Edge detection techniques can be grouped into two categories, based either on gradients or on the Laplacian. The former methods usually detect edges by taking into account the maximum and minimum values in the first derivative of the image [2]. Laplacian methods look for zero crossings in the second derivative of the image to find edges.

2.1.1 Classical Operators

The Sobel, Prewitt and Roberts filters fall into the classical operators used in edge detection. The primary advantage of these filters is their simplicity. A simple approximation to the gradient magnitude is given by the Roberts filter. Due to this approximation of the gradient magnitude, they can easily detect edges and their orientation in an image. The disadvantage of these filters is that they cannot handle noise in the image; increasing noise degrades their accuracy in detecting edges [2]. The masks used by each operator are given in Tables 2.1-2.6.

 1  0
 0 -1

Table 2.1: Horizontal mask (Gx) used in the Roberts operator

 0 -1
 1  0

Table 2.2: Vertical mask (Gy) used in the Roberts operator

-1  0  1
-2  0  2
-1  0  1

Table 2.3: Horizontal mask (Sx) used in the Sobel operator

 1  2  1
 0  0  0
-1 -2 -1

Table 2.4: Vertical mask (Sy) used in the Sobel operator

-1  0  1
-1  0  1
-1  0  1

Table 2.5: Horizontal mask (Px) used in the Prewitt operator

 1  1  1
 0  0  0
-1 -1 -1

Table 2.6: Vertical mask (Py) used in the Prewitt operator
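To make the use of these masks concrete, the sketch below convolves a grayscale image (assumed to be a NumPy array) with the Sobel masks from Tables 2.3 and 2.4 and thresholds the gradient magnitude, as described above. This is a minimal illustration, not the implementation used in the thesis; the threshold value is an arbitrary example.

import numpy as np
from scipy.signal import convolve2d

# Sobel masks from Tables 2.3 and 2.4.
SOBEL_X = np.array([[-1, 0, 1],
                    [-2, 0, 2],
                    [-1, 0, 1]])
SOBEL_Y = np.array([[ 1,  2,  1],
                    [ 0,  0,  0],
                    [-1, -2, -1]])

def gradient_magnitude(image):
    """Convolve a grayscale image with the horizontal and vertical
    masks and combine the responses into a gradient magnitude map."""
    gx = convolve2d(image, SOBEL_X, mode="same", boundary="symm")
    gy = convolve2d(image, SOBEL_Y, mode="same", boundary="symm")
    return np.sqrt(gx**2 + gy**2)

def edge_map(image, threshold=100.0):
    """Mark a pixel as an edge when the gradient magnitude exceeds
    a manually chosen threshold (example value, tuned per image)."""
    return gradient_magnitude(image.astype(float)) > threshold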

Canny Edge Detection Algorithm

The Canny edge detection algorithm has the following steps:

1. The input image is converted into grayscale, and the image is blurred to remove noise. Usually a Gaussian filter is used for removing the noise by smoothing the image [34].

2. The gradient of the image is computed for identifying sharp changes in pixel values. The gradient is a vector which points in the direction of maximum change in intensity [32]. The gradient magnitude and direction are computed using the horizontal and vertical components of the gradient.

3. In this step the image is scanned along the edge direction, and any pixel value that cannot be considered an edge is discarded, resulting in thin lines in the output image [32].

4. A double thresholding algorithm is used to detect and link edges [2]. In this method we have two threshold values, T1 = high threshold and T2 = low threshold. Pixel values greater than T1 are considered strong edges, and values below T2 are considered non-edges. Pixels between T1 and T2 are decided on the basis of neighbouring pixels.

5. Weak edges that connect to strong edges are included in the final output image; otherwise they are discarded [32].

The Canny algorithm requires more computational power and is time consuming, but compared to the classical operators it gives much better results. Due to the double thresholding method it is better at detecting edges in noisy images [2].
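The whole pipeline is available as a single call in libraries such as OpenCV. A minimal sketch follows; the file name and the two thresholds (T2 = 50 low, T1 = 150 high) are example values, not taken from the thesis:

import cv2

# Hypothetical input file; any instrument photograph would do.
image = cv2.imread("instrument.png")

# Step 1: grayscale conversion and Gaussian smoothing to suppress noise.
gray = cv2.cvtColor(image, cv2.COLOR_BGR2GRAY)
blurred = cv2.GaussianBlur(gray, (5, 5), 0)

# Steps 2-5: cv2.Canny computes the gradient, thins the edges along
# the edge direction, and applies double thresholding with edge linking.
edges = cv2.Canny(blurred, 50, 150)

cv2.imwrite("edges.png", edges)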


2.2 Deep Learning

Deep learning (DL) algorithms automatically extract the high- and low-level features from the data that are essential for predicting the result. Features such as edges are referred to here as low-level features, and representations that build on those low-level features, such as combining edges to detect whether something is a face or not, are referred to here as high-level features.

The goal is to make these high-level features more abstract, with their individual features more invariant to most of the variation that is usually present in the training data, while also preserving as much of the information in the input as possible. This automatic feature extraction and learning is the key to the success of DL, as it eliminates the need for manually extracted features, which can be very time consuming [35]. Research in the field of DL has developed rapidly and has evolved from tricky pretraining routines to a highly modular, customisable framework for building machine learning systems for various problems, spanning from image recognition, voice recognition and synthesis to complex AI systems.

2.3 Neural Network

The structure of the human brain was the source of inspiration for artificial neural networks (ANNs). Our brains consist of an estimated 100 billion neurons. Figure 5 is one example of how neurons are connected and of the flow of input and output through neurons. These neurons communicate via short-lived electronic signals. The interneuron connections are mediated by electrochemical junctions called synapses, which are located on branches of the cell referred to as dendrites. Each neuron receives thousands of responses from other neurons, which eventually reach the cell body. All these incoming signals are summed together in some way; if the resulting signal exceeds a threshold, then the neuron will generate an impulse in response, or we can say it will 'fire' that particular response to other neurons via a branching fibre known as the axon [4]. Some of the incoming signals tend to prevent firing while others promote it.


Figure 5: Main components in neurons [4]

The model architecture of an ANN tries to adapt this structure and style of processing during training. The artificial network consists of nodes in place of the biological neurons, as you can see in Figure 6. An artificial neural network (ANN) consists of many simple, connected processors called neurons. Each neuron produces a sequence of real-valued activations. A neuron can be activated from an input or from another neuron's activation through its weighted connections from a previous layer [36]. Training the model means finding weights that make the ANN output the desired results, such as correctly classifying different animals or different types of objects.

Figure 6 shows how an ANN receives a number of inputs which are connected to an activation function through weighted edges. An ANN can have any number of inputs, hidden layers, or outputs.

Figure 6: An ANN which receives the inputs z1, z2, ..., zI and connects them to the neuron through the weighted edges v1, v2, ..., vI. The input values are multiplied by their respective weights and subtracted from a threshold (θ). The result is passed to an activation function which non-linearly transforms the weighted sum [5].

An ANN model can consist of anything from a single node to a large collection of nodes in which each one is connected to every other node, as in Figure 7. In Figure 7 each node is shown as a circle and the weights are implicit in all the connections. The network is arranged into a layered structure in which the signal enters at the input and goes through three layers of nodes before reaching the output. This type of structure is often referred to as a feedforward NN (FFNN) and can be used for classification of an image. If we need to classify images of digits from 0 to 9 (binary images), the output layer of the FFNN will consist of 10 nodes (one for each digit), and the node corresponding to a class fires whenever a pattern of that class is supplied at the input.

Figure 7: Simple example of neural network

Training of ANN models for supervised learning (where a desired target is present) is done by presenting the network with input patterns and comparing its response with the target output.

Figure 8 illustrates the 4 major steps involved in training an ANN model. After backpropagation, the cycle is repeated for the next batch of inputs. Inputs are divided into batches, speeding up the learning process.

Figure 8: Division of steps in training of an ANN model: input (batch division), forward pass, comparison with the target, error calculation, and backpropagation (weight update).

When the model has learned the underlying structure of the problem, it should also give correct results for unseen inputs. This is often referred to as generalization of the model. If the model does not have this property, then it is of very little use other than as a lookup table for the classification of the training set. Therefore, good generalization is one of the most important properties of an ANN.

So, what are the different components that an ANN consists of, and how does each of them work to help the ANN get better results? Each node in an ANN has weighted connections, which are used for summing up and making the decision whether to fire that neuron or not. Other than the weights of each node, all the layers in an ANN also have a bias value. We will discuss the importance of the bias shortly. Other than these two important components we have the activation function, discussed in Section 2.3.2, and the loss function, which helps in training the model.

Y = Σ (weight × input) + bias    (1)

An artificial neuron calculates a weighted sum of its inputs, adds a bias, and then makes a decision whether it should be 'fired' or not with the help of the activation function. Equation 1 shows how the output is calculated at each layer, but why do models add a bias to all the layers of the network?

If considered carefully, the bias is very helpful, as it allows the ANN model to shift the activation function from left to right, which can be very important for successful learning. For example, consider Equation 1 without bias, which calculates the model output by multiplying the input with its corresponding weight and passing the result through an activation function (e.g. sigmoid). Figure 9a shows the function computed by Equation 1 without bias for various values of the weights. As you can see in Figure 9a, changing the value of the weights changes the steepness of the activation function, which is useful, but what if we want the model to output 0 when x = 2? Changing the steepness of the activation function cannot achieve this; for that we need to shift the entire curve to the right. Now consider the same equation with bias. Figure 9b shows the output of the model using a bias for various values of the weights. If we want the model to output 0 when x = 2, then just by having weight = −5 for the bias we are able to shift the curve to the right [37].

Figure 9: Bias effect on the equation's output. (a) Output without bias. (b) Output with bias.
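A quick numeric check of this shift (a minimal sketch assuming a sigmoid activation; the values match Figure 9 only in spirit):

import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

x, w = 2.0, 1.0

# Without a bias, scaling the weight only changes the steepness:
# sigmoid(w * x) stays well above 0 at x = 2.
print(sigmoid(w * x))                 # ~0.88

# With a bias input of 1 weighted by -5, the curve shifts right,
# so the output at x = 2 drops toward 0, as in Figure 9b.
print(sigmoid(w * x + (-5.0) * 1.0))  # ~0.047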

2.3.1 Training the Model

Training the model is one of the most important aspects, as learning the features and updating the kernel weights happens here. ANNs were invented as a model of the structure of the human brain, so we can relate the training of the model to it in the same way. During childhood our minds don't have any prior knowledge, and learning starts by looking at examples of what a cat, a dog or a car is. A similar sort of thing happens with an ANN: before starting the training of the model, the weights of the kernels are initialized randomly. These filters have no notion of how to detect low-level features, nor of how to use them to detect high-level features. As we grow, we keep on going through many different examples with their corresponding labels, deciding what each thing is. The same methodology of being given an input image with the corresponding output or label is the training process that ANNs go through. Back-propagation can be divided into 4 sections, i.e. the forward pass, the loss function, the backward pass and the weight update. Let's assume the dimensions of an image are 32 × 32 × 3 (3 denotes the depth, or number of channels).

Section 2.3.3 discusses different types of loss functions; let's assume we are using MSE (mean squared error), which is computed as shown in Equation 2.

MSE = (1/n) Σᵢ₌₁ⁿ (yᵢ − ŷᵢ)²    (2)

For the first few examples, the loss will be extremely high, as the weights are completely random. Ultimately, the end goal for our model is to make the prediction as similar as possible to the target output, meaning minimizing the loss. For simplicity, let's suppose we only have two weights to update, as seen in Figure 10, so our task is to minimize the loss by adjusting the weights and reach the lowest point.

Figure 10: Our goal is to reach the lowest point of the bowl-shaped surface by minimizing the loss, which is possible by taking the derivative of the loss with respect to the weights [6].

The mathematical equivalent is finding the derivative of the loss with respect to the weights, dL/dW, where L is the computed loss and W are the weights at a particular layer. After computing the derivative of the loss, we determine which weights contribute most to the loss and adjust them so that the loss decreases, using the backward pass. After we are done with this derivative, we move on to the last step, which is the weight update. In this step all the weights of the filters are updated in the opposite direction of the gradient, to minimize the loss of the network. This cycle is repeated until we reach the minimum value of the loss function.

w = wᵢ − η (dL/dW)    (3)


Figure 11: Variation in weight update due to a high learning rate [6].

Equation 3 has one parameter called the learning rate, which plays a vital role in training the model. The value of the learning rate is chosen manually before training the model. A high learning rate makes the model take bigger steps while updating the weights, making them converge faster, while a low learning rate can take more time to converge. Figure 11 shows the effect of choosing a high learning rate on the loss and weights. Together, these four steps make one training iteration. This cycle is repeated for a chosen number of iterations for each set of training images.
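As a minimal numeric sketch of this four-step cycle with Equations 2 and 3 (a single weight and a hand-derived MSE gradient, for illustration only):

import numpy as np

# Toy data: the target relation is y = 3x.
x = np.array([1.0, 2.0, 3.0])
y = np.array([3.0, 6.0, 9.0])

w = 0.0      # arbitrarily initialized weight
eta = 0.05   # learning rate, chosen manually

for _ in range(100):
    y_hat = w * x                          # forward pass
    loss = np.mean((y - y_hat) ** 2)       # MSE loss (Equation 2)
    dL_dw = np.mean(-2 * x * (y - y_hat))  # backward pass: dL/dW
    w = w - eta * dL_dw                    # weight update (Equation 3)

print(w)   # converges toward 3.0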

2.3.2 Activation Function

Equation 1 shows how a neuron calculates the value of Y (the output). Now, Y can take any value ranging from −∞ to +∞, and the neuron does not know the bounds of the value. Therefore, we use an activation function for checking the value of Y produced by the neuron and deciding whether it should be fired or not. We use this firing pattern because that is the way the human brain works, and ANNs are based on it.

f(x) = 1 / (1 + e⁻ˣ)    (4)

Figure 12: Graphical output of the sigmoid function.


The sigmoid function is a popular choice as an activation function. It takes a real-valued input varying from −∞ to +∞ and maps it to a range between 0 and 1 [38]. Equation 4 shows the computation used for the sigmoid. It is non-linear in nature, so a combination of these functions is also non-linear. The output of the sigmoid function is always between 0 and 1, which can be used to represent the output class for binary classification. One problem with this function is that towards either end of the graph, the output value responds very little to changes in the value of x. This problem is often referred to as 'vanishing gradients', where the model stops learning, or learns very slowly.

f(x) = 2 / (1 + e⁻²ˣ) − 1    (5)

f(x) = 2 · sigmoid(2x) − 1    (6)

Figure 13: Graphical output of the TanH function.

Another choice for an activation function is TanH. Figure 13 shows its behaviour, which looks like the sigmoid function. Equation 5 shows the actual computation of the tanh function, and Equation 6 shows how the two functions are related. The output value is in the range (−1, 1), and tanh is sometimes referred to as a scaled sigmoid function. It also suffers from the vanishing gradient problem [39].

While sigmoid and tanh have been the commonly used activation functions, there is a newer and widely popular activation function called the Rectified Linear Unit (ReLU). This function simply thresholds the input value at 0. In simpler terms, it outputs 0 when x ≤ 0 and, conversely, a linear function when x > 0, as shown in Figure 14a.

f(x) = 0 if x ≤ 0;  x if x > 0    (7)


Figure 14: Graphical plots for the ReLU and Leaky ReLU activation functions. (a) Graphical output of the ReLU function [40]. (b) Graphical output of the Leaky ReLU function.

The advantage of using ReLU compared to sigmoid and tanh is that it involves simpler mathematical operations, and it does not suffer from vanishing gradients because ReLU does not saturate on the positive side of the graph, making it a proper choice for deep ANNs. But due to the horizontal line in ReLU (Figure 14a) for negative values of x, it suffers from the dying ReLU problem: a unit can go into a state where it stops learning and responding to any variation in error or input. We can apply Leaky ReLU instead, which allows a small negative slope when a unit is not active. Equation 8 shows the computation used in Leaky ReLU.

f(x) = 0.01x if x ≤ 0;  x if x > 0    (8)

Among the different types of activation functions discussed above, we choose the one which best helps in approximating the desired function, leading to faster learning of the model. For example, a sigmoid works well for a classifier problem because approximating a classifier function as a combination of sigmoids is easier than with ReLU. It all depends on the type of problem you want to solve; choose the function accordingly.
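For reference, the four activation functions above are one-liners in NumPy (a sketch, vectorized over arrays):

import numpy as np

def sigmoid(x):
    # Equation 4: squashes any real input into (0, 1).
    return 1.0 / (1.0 + np.exp(-x))

def tanh(x):
    # Equation 6: a scaled and shifted sigmoid with range (-1, 1).
    return 2.0 * sigmoid(2.0 * x) - 1.0

def relu(x):
    # Equation 7: thresholds the input at 0.
    return np.where(x > 0, x, 0.0)

def leaky_relu(x, slope=0.01):
    # Equation 8: keeps a small negative slope to avoid dying units.
    return np.where(x > 0, x, slope * x)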

2.3.3 Loss Function

The error is computed as the difference between the predicted and target outputs [5] when training the model. The function used to compute this error is often referred to as the loss function. The loss function is used by the ANN for updating the weights after one full iteration over the data; this process of correcting the weights is called back-propagation. The error is propagated backwards from the output layer through the hidden layers to the input layer, and the weights and biases are modified in such a way that the error for the most recent input is made smaller. Minimizing the loss function by updating the weights helps them achieve convergence.


Figure 15: Variation between the loss function and the weights. w denotes the weights of the ANN and J(w) is the loss function.

There are many loss functions available, such as absolute error, binary cross entropy, negative log likelihood and mean squared error. The use of any particular loss function depends on the type of problem we want to solve. For example, if we want to predict a continuous numerical value as the output, then we may use the mean squared error loss function. Equation 9 shows that MSE calculates the error by finding the average squared difference between the predicted output (ŷ) and the target output (y).

MSE = (1/n) Σᵢ₌₁ⁿ (yᵢ − ŷᵢ)²    (9)

Likewise, for a problem where the output is binary (0 or 1), binary cross entropy (BCE) will be very helpful as a loss function. It all depends on what you are trying to solve with the ANN model; choose the loss function carefully, as it helps the weights converge better and faster.

BCE = −(y log(ŷ) + (1 − y) log(1 − ŷ))    (10)
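Both losses are similarly short in NumPy (a sketch; y and y_hat are arrays of targets and predictions, and a small epsilon guards the logarithms):

import numpy as np

def mse(y, y_hat):
    # Equation 9: average squared difference.
    return np.mean((y - y_hat) ** 2)

def bce(y, y_hat, eps=1e-7):
    # Equation 10, averaged over samples; eps avoids log(0).
    y_hat = np.clip(y_hat, eps, 1.0 - eps)
    return -np.mean(y * np.log(y_hat) + (1 - y) * np.log(1 - y_hat))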

2.3.4 Convolutional Neural Networks

ANNs are now being used in many different fields, such as text, audio and images. As discussed above, learning through labelled outputs, which act as ground truth when compared with predicted outputs, is called supervised learning. For each training sample there will be a set of inputs and one or more corresponding output values. The goal of this type of learning is to reduce the NN model's overall error by calculating correct output values for the provided training examples. Convolutional neural networks (CNNs) are similar to ANNs, as they are comprised of neurons and improve their learning by training on examples. The difference between CNNs and traditional ANNs is that CNNs are used in the field of pattern recognition within images. CNNs help to encode features into the model architecture, which makes the model well suited for image-related tasks while also reducing the number of parameters required to set up the model [7]. [1] was among the first breakthroughs, in 2012, where an ANN model comprised of CNNs was able to achieve considerably better results than the previous state-of-the-art result. Some other famous architectures which owe their success to the use of CNNs are [26], [41] and [42]. [42] achieved a state-of-the-art result in 2014 by improved utilization of the resources inside the network.

Figure 16: Basic Structure of NN model with CNNs [7]. In this model convolutional layers are stacked between ReLus continuously before being passed through the pooling layer and in the end goes between one or many fully connected layer.

When CNNs are used for images, the layers are comprised of neurons organised in three dimensions: the spatial dimensionality of the input (height and width) and the depth. A CNN architecture consists of stacking three kinds of layers, i.e. convolutional layers, pooling layers and fully connected layers. The basic characteristics of the CNN architecture shown in Figure 16 can be divided into four categories:

1. The input layer holds the pixel values of the images, the same as in other forms of NN.

2. The convolutional layer determines the outputs of neurons connected to local regions of the input, calculated as the scalar product between their weights and the region connected to the input volume. ReLU is usually applied with this layer as an elementwise activation function.

3. The pooling layer is used for reducing the number of parameters within the activation, as it simply performs downsampling along the spatial dimensions of the input.

4. Fully connected layers have the same functionality as in standard NNs, as they produce class probabilities from the activations, which can be used for deciding the output result [7].

Convolutional Layer

The convolutional layer is the most important layer and plays a vital role in how CNNs work. The learnable kernel is the main focus of this layer's parameters. These kernels are smaller in size (height and width) than the image but move over the whole image.


Figure 17: A 5x5 kernel convolving over the whole image; the result is passed to the first hidden layer [6]. Figure 18 shows the actual multiplication of the kernel with the image.

When the filter slides, or convolves, over the input image, it multiplies the pixel values of the image with the values in the filter. These multiplications are summed up to give a single number, as seen in Figure 18. This same process is repeated for every location on the input volume by moving the filter right by 1 unit, and so on. A different number is produced at every unique location.

Figure 18: A visual representation of a computation in a convolutional layer. The middle element of the kernel is placed over a pixel of the input image, which is then replaced with a weighted sum of itself and nearby pixels [8].

After the kernel has convolved over the whole image, it produces a 2D activation map, which can be visualised as in Figure 19. The full output volume of the convolutional layer is formed by stacking the activation maps produced by every kernel along the depth dimension.


Figure 19: Activation map produced by a convolutional layer on the MNIST [9] dataset; some blurry edges of various digits are visible [10].

As these filters are connected to a small region of the input volume at a time and slide across the whole image, instead of being mapped to the whole image at once, the number of parameters and the complexity are reduced substantially compared to ANNs. Convolutional layers use two important concepts, sparse interactions and parameter sharing, to achieve this parametrization. With sparse interactions, a hidden unit in a convolutional layer only depends on the pixels in a small region of the image, as the kernel convolves over a part of the image instead of the whole. The position of the kernel (typically 3×3 or 5×5) is associated with the position of the hidden unit in its matrix topology. As we move to the hidden unit one step to the right, the corresponding kernel in the image also moves one step to the right. For the hidden units on the border of the image, the corresponding kernel is partly located outside the image. For these border cases, zero padding is used, filling the missing pixel values in the image with zeros.

As the name suggests, with parameter sharing convolutional layers share the same kernel parameters across multiple places in the network. In simpler terms, the sets of parameters for the different hidden units in this layer are all the same. Unlike fully connected NNs, the convolutional layer thus only learns one set of a few parameters, instead of learning a separate set of parameters for every position, and uses it as the link between the input layer and the hidden units. The CNN got its name because the convolution between the input variables and the kernel can be interpreted as the mapping between the input variables and the hidden units. The sparse interactions and parameter sharing in convolutional layers make the CNN robust to translation of objects in an image. It is suggested to use more than one kernel in a CNN layer to capture different properties of the image, such as one for detecting corners, one for edges, etc. If a kernel is sensitive to one property, the hidden unit will react to that detail regardless of where in the image it is present [43].

The three most important hyperparameters for these layers are the depth, the stride and the padding. The depth decides how many neurons the layer has; reducing it will significantly decrease the number of neurons of the model, but it also affects the feature detection capabilities. The stride decides by how much the kernel shifts across the image. Figure 20 shows how the filter moves along the image; in this example it convolves with a stride of 1, as it shifts by 1 unit to the right. A higher stride value reduces the overlapping and produces an output of lower spatial dimensions. Padding simply adds a border to the input for controlling the dimensionality of the output volumes and also to avoid losing information at the corners of the image. Figure 20 shows a 5×5 image padded by 1, convolved with a 3×3 kernel.

Figure 20: Convolution of a 3x3 kernel over a 5x5 image with stride and padding of 1 [11].
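A naive NumPy sketch of this sliding-window computation (single-channel input; real frameworks add batching, multiple kernels and heavily optimized implementations):

import numpy as np

def conv2d(image, kernel, stride=1, padding=1):
    """Naive 2D convolution as described above: pad with zeros,
    slide the kernel, and sum the elementwise products."""
    image = np.pad(image, padding)
    kh, kw = kernel.shape
    out_h = (image.shape[0] - kh) // stride + 1
    out_w = (image.shape[1] - kw) // stride + 1
    out = np.zeros((out_h, out_w))
    for i in range(out_h):
        for j in range(out_w):
            region = image[i*stride:i*stride+kh, j*stride:j*stride+kw]
            out[i, j] = np.sum(region * kernel)
    return out

# A 5x5 image and a 3x3 kernel with stride and padding of 1,
# matching the setup in Figure 20, give a 5x5 output.
img = np.arange(25, dtype=float).reshape(5, 5)
k = np.ones((3, 3))
print(conv2d(img, k).shape)   # (5, 5)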

Pooling layer

As described above, the pooling layer aims to reduce the dimensionality of the given input while retaining the important information. By reducing dimensionality it reduces the number of parameters and the computational complexity of the model. There are many ways in which it can scale down the dimensionality of the input, such as taking the MAX, average or SUM of the input. MAX is the most common function used with the pooling layer; it chooses the largest element from the feature map, as seen in Figure 21.


Figure 21: Result of applying MAX pooling on an input image.
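A sketch of the common 2×2 MAX pooling with stride 2 (input dimensions assumed divisible by 2):

import numpy as np

def max_pool_2x2(feature_map):
    """Downsample by taking the maximum of each non-overlapping
    2x2 window, halving both spatial dimensions."""
    h, w = feature_map.shape
    return feature_map.reshape(h // 2, 2, w // 2, 2).max(axis=(1, 3))

fm = np.array([[1, 3, 2, 4],
               [5, 6, 7, 8],
               [3, 2, 1, 0],
               [1, 2, 3, 4]], dtype=float)
print(max_pool_2x2(fm))
# [[6. 8.]
#  [3. 4.]]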

Fully-connected layer

The fully connected (FC) layers consist of neurons which are directly connected to the neurons in the two adjacent layers, with no connections within a layer. This layer is quite similar to the traditional ANN shown in Figure 7. These layers are always at the end of the CNN architecture, where high-level features are detected. An FC layer takes an input vector and outputs an N-dimensional vector, where N is the number of classes that the model has to choose from. For the MNIST [9] classification problem the output vector has 10 entries, as the digits to output range from 0 to 9. The FC layer looks at the output of the previous layer and determines which features correlate most with a particular class. For example, if the model is predicting that an image is a cat, it will have high values in the activation maps that represent high-level features like a paw or 4 legs. Similarly, if the program is predicting that some image is a bird, it will have high values in the activation maps that represent high-level features like wings or a beak [6].

2.3.5 Transposed Convolution

In Section 2.3.4 we discussed how CNNs are used for extracting features from the input. Image segmentation requires CNNs to use a transformation in the opposite direction of a normal convolution operation, i.e. recovering the shape of the input from the shape of the output while maintaining a connectivity pattern that is compatible with said convolution [11]. This layer appeared for the first time in [44] but did not have a specific name. It was widely used and referred to as a deconvolution layer in semantic segmentation [15]. It has many names, such as fractional convolutional layer [45] and transposed convolutional layer [11, 46]; we will use the term transposed convolution here.


Figure 22: The transpose of convolving a 3x3 kernel over a 4x4 input using unit stride [11].

Let's assume we have a 3×3 convolution. If we unroll the input and output into vectors from left to right, top to bottom, the same 3×3 kernel can be represented as a sparse matrix C where the non-zero elements are the elements wᵢ,ⱼ of the kernel (i and j being the row and column of the kernel respectively).

Figure 23: Sparse matrix C

This linear operation takes the flattened input matrix as a 16-dimensional vector and produces a 4-dimensional vector, which is later reshaped as the 2 × 2 output matrix [11]. Using this representation, the error is backpropagated by multiplying the loss with the transpose of matrix C (the transpose of Figure 23). One of the most important points of such a convolutional operation is the positional connectivity that exists between the input and output values.

Now, suppose we want to go the other way around, as mentioned above, i.e. map from a 4-dimensional space to a 16-dimensional space while maintaining the connectivity pattern of the convolution, as depicted in Figure 22. This operation is known as transposed convolution. This layer works by swapping the forward and backward passes of a convolution; to determine whether it is a normal convolution or a transposed convolution, we check how the forward and backward passes are computed. For example, even though the kernel w defines a convolution whose forward and backward passes are computed by multiplying with matrix C and Cᵀ respectively, it can also define a transposed convolution whose forward and backward passes are computed by multiplying with Cᵀ and (Cᵀ)ᵀ = C respectively.

Consider the convolution of a 3 × 3 kernel on a 4 × 4 input image producing a 2 × 2 output. As seen in Figure 22, the transpose of this convolution operation will have an output of shape 4 × 4 applied on a 2 × 2 input. The dotted line in the 2 × 2 input denotes padding, which is necessary in order to maintain the connectivity pattern of the transposed convolution, as the bottom-left pixel of the input of the normal convolution only contributes to the bottom-left pixel of the output, the top-left pixel is only connected to the top-left output pixel, and so on [11].
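The matrix view above can be made concrete in a few lines of NumPy (a sketch for the 4×4 input, 3×3 kernel case of Figure 22; the helper name conv_matrix is ours, not from [11]):

import numpy as np

def conv_matrix(kernel, in_size=4):
    """Build the sparse matrix C that implements a valid convolution
    over an in_size x in_size input as C @ x (x is the flattened input)."""
    k = kernel.shape[0]
    out_size = in_size - k + 1
    C = np.zeros((out_size * out_size, in_size * in_size))
    for oi in range(out_size):
        for oj in range(out_size):
            row = oi * out_size + oj
            for ki in range(k):
                for kj in range(k):
                    col = (oi + ki) * in_size + (oj + kj)
                    C[row, col] = kernel[ki, kj]
    return C

kernel = np.arange(1, 10, dtype=float).reshape(3, 3)
C = conv_matrix(kernel)    # shape (4, 16)

x = np.random.rand(16)     # flattened 4x4 input
y = C @ x                  # convolution: 16-dim -> 4-dim (the 2x2 output)

# Transposed convolution: map 4-dim back to 16-dim (4x4) with the
# same connectivity pattern, simply by multiplying with C transpose.
x_up = C.T @ y
print(C.shape, y.shape, x_up.shape)   # (4, 16) (4,) (16,)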


2.3.6 Batch Normalization

Batch normalization (BN) is among the most popular techniques for faster and more stable training of deep ANN models [47]. It helps the model avoid overfitting and learn general patterns from the training dataset. BN is a mechanism that stabilizes the distribution of the inputs to a given network layer during training. It achieves this by adding additional layers to the model which set the mean and variance of the distribution of each activation to zero and one respectively.

Figure 24: Illustration of applying batch normalization in deep neural networks.

We normalize every feature of the input layer, since some are in the range 0 to 1 and some in the range 1 to 1000, to speed up the learning process. Applying batch normalization to the hidden layers can reduce the training time by a factor of 10 or more. Batch normalization reduces the amount by which the hidden unit values shift around (covariate shift). For example, suppose we have an ANN model for dog detection. If our training data only had brown dog images, then when we apply this model to data with black dogs there is a high probability it will not perform well. Even though both datasets consist of dog images, the results differ due to the difference in the distribution of the data.
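At training time the core of BN is only a few lines (a sketch for one mini-batch of activations; the learned scale γ and shift β, and the running statistics used at inference, are omitted for brevity):

import numpy as np

def batch_norm(x, eps=1e-5):
    """Normalize each feature over the batch dimension to
    zero mean and unit variance, as described above."""
    mean = x.mean(axis=0)
    var = x.var(axis=0)
    return (x - mean) / np.sqrt(var + eps)

batch = np.random.rand(32, 64) * 1000   # 32 samples, 64 features
normed = batch_norm(batch)
print(normed.mean(axis=0)[:3], normed.std(axis=0)[:3])  # ~0 and ~1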

2.3.7 Optimizers

In Section 2.3.1 we discussed how the weight parameters are updated in an ANN model by minimizing the loss function and moving them in the opposite direction of the gradient of the loss function J(θ). After computing the gradient of the loss function, it is continuously evaluated and a parameter update is performed using gradient descent (GD). GD is the most common way of optimizing an ANN loss function.

w = wᵢ − η (dL/dW)    (11)


The above equation is used to update the weight parameters in the model, and the learning rate (η) determines the size of the step towards a local minimum; that is, we follow the direction of the slope of the surface created by the loss function downhill until we reach a minimum [48]. There are three types of gradient descent, differing in the amount of data used for computing the gradient of the loss function.

1. Batch gradient descent - Batch GD computes the gradient of the loss function with respect to the weight parameters for the entire training dataset. θ denotes the model weight parameters.

θ = θ − η · ∇θ J(θ)    (12)

2. Stochastic gradient descent - Stochastic gradient descent (SGD) updates the model parameters for each training example x⁽ⁱ⁾ and its corresponding label or mask y⁽ⁱ⁾.

θ = θ − η · ∇θ J(θ; x⁽ⁱ⁾; y⁽ⁱ⁾)    (13)

3. Mini-batch gradient descent - Mini-batch gradient descent combines both SGD and batch GD by updating the model parameters for every mini-batch of n training examples.

θ = θ − η · ∇θ J(θ; x⁽ⁱ:ⁱ⁺ⁿ⁾; y⁽ⁱ:ⁱ⁺ⁿ⁾)    (14)

During this project we used the SGD, RMSProp and Adam optimizers in our segmentation NN models.

SGD

SGD updates the model parameters (θ) in the negative gradient direction for each training example x⁽ⁱ⁾ and its corresponding mask y⁽ⁱ⁾. Equation 13 shows the mathematical update used in SGD. Batch GD computes the gradient over the whole dataset, making the computation redundant, as it recomputes the gradients of similar examples before each parameter update. SGD overcomes this redundancy by computing one update at a time, making it much faster. Figure 25 shows the fluctuations in SGD due to frequent updates with high variance [48]. These fluctuations enable the method to jump to new and potentially better local minima.

Figure 25: Fluctuations in SGD [12].


RMSprop

RMSProp was proposed by Geoff Hinton in his Coursera class [49]. The central idea of RMSProp (root mean square propagation) is to keep a moving average of the squared gradients for each weight and divide the gradient by the square root of this mean square. In Equation 15, E[g²] is the moving average of squared gradients, δC/δw is the gradient of the cost function with respect to the weight, and β is the moving average parameter, with a default value of 0.9.

E[g²]ₜ = β E[g²]ₜ₋₁ + (1 − β)(δC/δw)²
wₜ = wₜ₋₁ − (η / √E[g²]ₜ)(δC/δw)    (15)

As can be seen in the above equation, it adapts the learning rate η by dividing by the root of the average squared gradient.

Adam

Adaptive moment estimation (Adam) [50] is another very commonly used method to compute adaptive learning rates for each parameter. Adam is a widely popular optimizer in NN models nowadays. It makes use of averages of the first and second moments of the gradient: it calculates an exponential moving average of the gradient and of the squared gradient, and the parameters β₁ and β₂ control the decay rates of these moving averages. An advantage of using Adam is that the magnitude of the parameter updates is invariant to rescaling of the gradient, and it works well with sparse gradients.

mₜ = β₁ mₜ₋₁ + (1 − β₁)(δC/δw)    (16)

vₜ = β₂ vₜ₋₁ + (1 − β₂)(δC/δw)²    (17)

In Equation 16, mₜ is the estimate of the first moment (mean), and vₜ in Equation 17 is the estimate of the second moment of the gradients. Both mₜ and vₜ are biased towards zero, as they are initialized as vectors of zeros, so Adam counteracts this by computing bias-corrected first and second moment estimates:

m̂ₜ = mₜ / (1 − β₁ᵗ),  v̂ₜ = vₜ / (1 − β₂ᵗ)    (18)

These estimates yield the Adam update rule in Equation 19. The default values proposed by the authors in [50] are 0.9 for β₁, 0.999 for β₂ and 10⁻⁸ for ε.

θₜ₊₁ = θₜ − (η / (√v̂ₜ + ε)) m̂ₜ    (19)
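Equations 16-19 translate almost directly into code (a sketch for a single parameter vector; the toy objective used to exercise it is ours):

import numpy as np

def adam_step(theta, grad, m, v, t, eta=0.001,
              beta1=0.9, beta2=0.999, eps=1e-8):
    """One Adam update (Equations 16-19) for parameter vector theta."""
    m = beta1 * m + (1 - beta1) * grad           # Eq. 16
    v = beta2 * v + (1 - beta2) * grad ** 2      # Eq. 17
    m_hat = m / (1 - beta1 ** t)                 # Eq. 18
    v_hat = v / (1 - beta2 ** t)
    theta = theta - eta * m_hat / (np.sqrt(v_hat) + eps)  # Eq. 19
    return theta, m, v

# Minimize f(theta) = sum(theta^2); its gradient is 2*theta.
theta = np.array([1.0, -2.0])
m = np.zeros_like(theta)
v = np.zeros_like(theta)
for t in range(1, 2001):
    grad = 2 * theta
    theta, m, v = adam_step(theta, grad, m, v, t)
print(theta)   # both entries approach 0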


2.3.8 Segmentation using Neural Network

Figure 26: Representation of an image as different labels denoting different objects [13].

As mentioned earlier, a segmentation task requires predicting a class for each pixel of the input image, unlike classification, where we predict one class for the whole image. Classification of an image mainly deals with what is in the input. However, to predict a class for each pixel of the input image we need to know not only what but also where. There are two types of segmentation: semantic segmentation and instance segmentation. In semantic segmentation we only segment the category of each pixel, meaning that if there are two objects of the same category in an input image, the segmentation map does not differentiate between them as separate objects. Instance segmentation, however, distinguishes between separate objects of the same class. In segmenting an image, our goal is to take an RGB color image (height × width × 3) and output a segmentation map where each pixel contains a class label represented as an integer (height × width × 1).


Figure 27: Output of an ANN model as a segmentation map [13].

Figure 28: One hot encoding of Color column.

One-hot encoding is the most widely used encoding scheme for categorical variables. It transforms a single variable with n observations and d distinct values into d binary variables with n observations each, where each observation indicates the presence (1) or absence (0) of the dichotomous binary variable [51]. Figure 28 represents the one-hot encoding of a variable color with 5 observations and 3 distinct values. A similar method is used for encoding the class labels, creating one output channel for each possible label. A segmentation map as in figure 27 can then be extracted by taking the argmax of each depth-wise pixel vector.

When we overlay only a single channel of our output labels, it will illuminate the regions of an image where that specific class is present which we can call a mask.
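As a concrete illustration, the following NumPy sketch one-hot encodes a toy 4 × 4 label map with three hypothetical classes and recovers the segmentation map with a depth-wise argmax.

```python
import numpy as np

num_classes = 3                                          # hypothetical labels
labels = np.random.randint(0, num_classes, size=(4, 4))  # toy H x W label map

# One-hot encode: one binary channel per class -> shape (H, W, num_classes)
one_hot = np.eye(num_classes, dtype=np.uint8)[labels]

# Recover the segmentation map by taking the depth-wise argmax
seg_map = one_hot.argmax(axis=-1)
assert (seg_map == labels).all()

# The mask for a single class, e.g. class 1, is just that channel
mask_class1 = one_hot[..., 1]
```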


Figure 29: Neural Network model for segmentation preserving dimension of an image through the network [10].

One way of constructing an architecture for a neural network model is to stack a number of convolutional layers with same padding to preserve dimensions and output a segmentation map. This will directly learn the mapping from the input image to its corresponding segmentation through the successive transformation of feature mappings, but it will be computationally expensive if we preserve the full resolution throughout the network.

Figure 30: Neural Network model (Alexnet) used in classification of an image [10].

In 2.3.1 we saw that the first layers in a CNN model learn low-level features and deeper layers learn more specialized high-level features. For the best segmentation results we need to increase the number of channels as we go deeper in the network. In classification, an input image is resized and passes through the convolutional layers and fully connected (FC) layers, predicting a label for the input image as in fig 30.


Figure 31: Converting all FC layers into convolutional layers [14].

Let us suppose we turn the FC layers into 1 × 1 convolutional layers. If the input image is then not resized, the output of the model will not be a single label; instead, it will be a map smaller than the input due to the max pooling. We can obtain a pixelwise output by upsampling this prediction.

Figure 32: Dense prediction of per-pixel values for an image, resulting in segmentation by an FCN model [15].

Due to their architecture, as seen in figure 32, these models were named Fully Convolutional Networks (FCNs) [15]. The structure of an FCN is an encoder/decoder, which is a very common and helpful architecture for segmenting out objects in an image. First, we downsample the spatial resolution of the input, producing lower-resolution feature mappings that are highly efficient at discriminating between target labels, since they extract and interpret the context (the what); we then upsample this feature representation into a full-resolution segmentation map by recovering the spatial information [52]. Upsampling helps in precise localization of the object in the image; it is performed when we need to make the output larger and is called transposed convolution (sec 2.3.5). The fully convolutional network was introduced in [15], adapted from the Alexnet CNN architecture of [1] by changing all FC layers into convolutional layers. These layers help by increasing the resolution of the output. High-resolution features from the encoder part of the network are combined with the upsampled output for better localization. Based on this information, a successive convolution layer can then learn to assemble a more precise output [16].
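The sketch below illustrates this FCN idea in Keras with a toy encoder; the layer sizes and the 1 × 1 classifier are illustrative assumptions, not the exact configuration of [15].

```python
from tensorflow.keras import layers, models

num_classes = 2  # e.g. background vs. foreground
inputs = layers.Input(shape=(None, None, 3))          # RGB input of any size
x = layers.Conv2D(64, 3, padding='same', activation='relu')(inputs)
x = layers.MaxPooling2D(2)(x)                         # 1/2 resolution
x = layers.Conv2D(128, 3, padding='same', activation='relu')(x)
x = layers.MaxPooling2D(2)(x)                         # 1/4 resolution
x = layers.Conv2D(num_classes, 1)(x)                  # 1x1 conv replaces FC
x = layers.Conv2DTranspose(num_classes, kernel_size=4,
                           strides=4, padding='same')(x)  # upsample x4
outputs = layers.Softmax()(x)                         # per-pixel class scores
model = models.Model(inputs, outputs)
```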

2.3.9 Dice Loss

The segmentation model outputs, for each pixel of the original image, the probability that the pixel belongs to the foreground or the background. There is a high probability that the object of interest occupies only a small region of the whole image [53]. This can cause the learning process to get stuck in a local minimum of the loss function whose predictions are strongly biased towards the background, partially or completely missing the foreground. Because the learning is biased towards the background, the model will still have high accuracy even when it does not detect the foreground object. To deal with this issue, the dice coefficient was proposed in [54]. Dice loss handles the class imbalance problem in segmentation efficiently.

The dice coefficient can be written as in equation 20, where the sum runs over N voxels, since it was first derived for dealing with volumes [54]; $p_i$ is the predicted binary segmentation from the NN model and $g_i$ is the ground truth binary segmentation.

\[ D = \frac{2\sum_{i}^{N} p_i g_i}{\sum_{i}^{N} p_i^2 + \sum_{i}^{N} g_i^2} \tag{20} \]

Figure 33: Each matrix represents a different label in the image; the dice coefficient is calculated for the mask of each label [13].

We used dice loss to compute the loss function in our Unet and Segnet models to get a better estimate of the performance of our model. The dice coefficient is simply a measure of the overlap between two samples, ranging from 0 to 1, where 1 denotes complete overlap, as in equation 21. We use $1 - D$ as the loss function; it is commonly referred to as soft dice loss because, instead of thresholding the predictions into a binary mask, we use the predicted probabilities directly.

\[ D = \frac{2|A \cap B|}{|A| + |B|} \tag{21} \]
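A sketch of how this soft dice loss can be implemented, here using TensorFlow; the `eps` smoothing term is an implementation detail to avoid division by zero and is not part of equation 20.

```python
import tensorflow as tf

def soft_dice_loss(y_true, y_pred, eps=1e-7):
    """Soft dice loss (1 - D, with D as in equation 20).

    y_true: one-hot ground truth, float tensor of shape (batch, H, W, classes).
    y_pred: predicted probabilities, same shape.
    """
    axes = (1, 2)  # sum over the spatial dimensions of each image
    numerator = 2.0 * tf.reduce_sum(y_true * y_pred, axis=axes)
    denominator = (tf.reduce_sum(tf.square(y_pred), axis=axes)
                   + tf.reduce_sum(tf.square(y_true), axis=axes))
    dice = (numerator + eps) / (denominator + eps)
    return 1.0 - tf.reduce_mean(dice)  # average over batch and classes
```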

2.3.10 Unet Model

The U-net model was proposed in [16] as an extension of the architecture used in [15]. Unet works with few training images when applied together with heavy data augmentation, yielding more precise segmentation results. Collecting images for a dataset can be a daunting task in the medical field, but the ability of Unet to work with a small number of images has achieved good results in [16, 55].

Figure 34: U-net architecture (example for 32×32 pixels in the lowest resolution). Each blue box corresponds to a multi-channel feature map; the number of channels is denoted on top of the box and the x-y-size at the lower left edge. White boxes represent copied feature maps and the arrows denote the different operations [16].

One of the important modifications in the Unet architecture was the increase in the number of feature channels in the upsampling path, helping the network to propagate context information to higher resolution layers. Due to this change, it has a U-shaped architecture, as seen in figure 34. The Unet model does not have any fully connected layers, so images of different sizes can be used as input. The architecture can be separated into three different parts: the contracting path, the bottleneck and the upsampling path. The contracting path is composed of 4 blocks, and each block is composed of two 3 × 3 convolutions, each followed by a ReLU (sec 2.3.2), and a 2 × 2 max pooling operation with a stride of 2 for downsampling. The number of feature channels is doubled at each block of the contracting path. The sole purpose of this contracting path is to capture the context of the input image in order to be able to do segmentation. This contextual information is then transferred to the upsampling path. The bottleneck is the part between the contracting and upsampling paths, connecting them at the bottom of the U-shaped architecture, and consists of 2 convolutional layers.

The upsampling or expanding path is used to enable precise localization when combined with the contextual information from the contracting path. It consists of 4 blocks, each composed of a 2 × 2 transposed convolution that halves the number of feature channels, a concatenation with the correspondingly cropped feature map from the contracting path, and two 3 × 3 convolutional layers, each followed by a ReLU [16]. This concatenation helps in recovering the full spatial resolution information in the upsampling path, which gets lost along the contracting path, making the architecture suitable for semantic segmentation [56].
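The following Keras sketch shows this pattern with a single contracting block, a bottleneck and a single expanding block with its skip concatenation; a full Unet as in [16] repeats the down/up blocks four times, the filter counts and input size here are illustrative, and we assume same padding so that no cropping is needed, unlike the valid convolutions in [16].

```python
from tensorflow.keras import layers, models

def conv_block(x, filters):
    """Two 3x3 convolutions, each followed by a ReLU."""
    x = layers.Conv2D(filters, 3, padding='same', activation='relu')(x)
    x = layers.Conv2D(filters, 3, padding='same', activation='relu')(x)
    return x

inputs = layers.Input(shape=(256, 256, 3))
c1 = conv_block(inputs, 64)                        # contracting block
p1 = layers.MaxPooling2D(2)(c1)                    # 2x2 max pool, stride 2

b = conv_block(p1, 128)                            # bottleneck

u1 = layers.Conv2DTranspose(64, 2, strides=2)(b)   # 2x2 up-convolution
u1 = layers.concatenate([u1, c1])                  # skip connection from c1
c2 = conv_block(u1, 64)                            # expanding block

outputs = layers.Conv2D(1, 1, activation='sigmoid')(c2)  # binary mask
model = models.Model(inputs, outputs)
```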

2.3.11 Segnet

Segnet, proposed in [17], is another neural network architecture for semantic segmentation. It also consists of an encoder network and a corresponding decoder network, followed by a pixel-wise classification layer. The encoder network architecture in Segnet is topologically identical to the 13 convolutional layers of VGG16 [26]. Unlike VGG16 there are no fully connected layers, which makes the encoder network significantly smaller and easier to train, as it has fewer parameters.

Another key component in Segnet is the decoder network, which contains a corresponding decoder for each encoder. The significant difference in Segnet is the way the decoder network reuses the max-pooling indices received from the corresponding encoder to perform non-linear upsampling of its input feature maps. Because the max pooling indices are reused in the decoding process, the number of parameters in the model architecture is reduced significantly, enabling end-to-end training.

Each of the encoders in the encoder network performs convolutions with a number of filters to produce a set of feature maps, followed by batch normalization and a ReLU layer. After the ReLU there is a max pooling layer with a 2 × 2 filter and stride 2, resulting in an output sub-sampled by a factor of 2.


Figure 35: An illustration of the SegNet architecture: only convolutional layers and no fully connected layers [17].

Max pooling helps in achieving translation invariance over small spatial shifts in the input image, but at the cost of spatial resolution in the feature maps. To overcome this limitation, Segnet stores the indices of max pooling (the locations of the maximum feature value in each pooling window), which is also memory efficient. The decoder network upsamples its feature maps using the stored max pooling indices from the corresponding encoder feature maps, resulting in sparse feature maps. Convolving these feature maps with the filters in the decoder network produces dense feature maps, each followed by batch normalization. The final layer in Segnet is a trainable softmax classifier, classifying each pixel in the input image independently. In our case the output of the softmax is a 2-channel image of probabilities, classifying each pixel as either background or foreground.
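A toy NumPy sketch of this index-storing pooling and the corresponding sparse unpooling is given below; a real implementation would operate on 4-D tensors, e.g. via TensorFlow's tf.nn.max_pool_with_argmax.

```python
import numpy as np

def pool_with_indices(fmap):
    """2x2 max pooling that also records, for each window, the flat index
    of the maximum; a toy version of what Segnet's encoder stores."""
    h, w = fmap.shape
    pooled = np.zeros((h // 2, w // 2))
    indices = np.zeros((h // 2, w // 2), dtype=int)
    for i in range(0, h, 2):
        for j in range(0, w, 2):
            window = fmap[i:i+2, j:j+2]
            k = window.argmax()                     # position within window
            pooled[i // 2, j // 2] = window.flat[k]
            indices[i // 2, j // 2] = (i + k // 2) * w + (j + k % 2)
    return pooled, indices

def unpool(pooled, indices, shape):
    """Segnet-style unpooling: scatter each pooled value back to its stored
    location, producing a sparse (mostly zero) upsampled feature map."""
    out = np.zeros(shape)
    out.flat[indices.ravel()] = pooled.ravel()
    return out

fmap = np.random.rand(4, 4)
pooled, idx = pool_with_indices(fmap)
sparse = unpool(pooled, idx, fmap.shape)  # non-zeros only at stored indices
```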

Figure 36: Architecture of segnet-basic, having only 4 encoders and 4 decoders.

Unlike Segnet, Unet does not reuse max pooling indices; instead it transfers the entire feature maps to the corresponding decoder and concatenates them for upsampling. Also, there is no fifth convolutional layer in Unet. During our experiments we used the segnet-basic variant, consisting of 4 encoders and 4 decoders; figure 36 illustrates this architecture.

References
