
Classifying Microscopic Images for Acute Lymphoblastic Leukemia (ALL) using Bayesian Convolutional Neural Networks

Author: Mohammad Ehtasham Billah

Semester and year: Spring 2018
Degree Project: 2nd Cycle, 15 Credits

Subject: Applied Statistics, Independent Project (ST-401A)
Örebro University, Örebro, Sweden

Supervisor: Farrukh Javed, Assistant Professor, Department of Statistics
Examiner: Stepan Mazur, Assistant Professor, Department of Statistics


Acknowledgment

I would like to express my gratitude to my supervisor, Assistant Professor Dr. Farrukh Javed, whose guidance and encouragement have greatly helped me to accomplish this thesis. I have always had a fascination with Machine Learning and Deep Learning, and I thank my department for giving me the opportunity to work in that direction. I would like to thank my examiner, Dr. Stepan Mazur, for scrutinizing my thesis and providing helpful recommendations. I am also grateful to all of my teachers and classmates at the statistics department for their cordial co-operation from the very first day.

Last but not least, I am indebted to my parents for their unconditional love and care, which have brought me this far. They have always motivated me to become a better person. Finally, I thank my wife, Farjana, for her encouragement and support through thick and thin.


Abstract

Statistics has a wide spectrum of applications in many fields. Although the idea of estimating predictive uncertainty in Convolutional Neural Networks (CNN) has existed for some time, it has rarely been implemented in practical fields, including medical science.

In this thesis, we employ Bayesian Convolutional Neural Networks (BCNN), proposed by Gal and Ghahramani [2015a], to classify microscopic images of blood samples (lymphocyte cells). The data contain 260 microscopic images of cancerous and non-cancerous lymphocyte cells. We experiment with different network structures to obtain the model that returns the lowest error rate in classifying the images. We estimate the uncertainty for the predictions made by the models, which in turn can assist a doctor in better decision making. The Stochastic Regularization Technique (SRT) popularly known as dropout is utilized in the BCNN structure to obtain the Bayesian interpretation.

Our developed model not only produces very good accuracy in classifying cancerous and non-cancerous lymphocyte cells but also provides useful uncertainty information regarding the predictions. When Statistics and Artificial Intelligence (AI) are applied in fields such as medical science, probabilistic modeling through the Bayesian approach is the most efficient means of ensuring AI safety, since it provides information about uncertainty. Considering this importance, we think our result contributes in a meaningful direction.


Contents

Acknowledgment
Abstract
List of Figures
List of Tables
Nomenclature

1 Introduction
1.1 Introduction
1.2 Uncertainty Estimation in Predictive Modeling
1.3 Background
1.4 Thesis Structure

2 Literature Review
2.1 Neural Networks (NN)
2.1.1 Weights
2.1.2 Nodes/Units
2.1.3 Activation Functions
2.1.4 Loss Function
2.1.5 Back-Propagation
2.1.6 Optimization and Stochastic Gradient Descent
2.1.7 Summing Up
2.2 Bayesian Neural Networks (BNN)
2.2.1 Bayesian Learning of Weights
2.2.2 Predictive Distribution
2.3 Convolutional Neural Networks (CNN)
2.3.1 Convolution Operation
2.3.2 Feature Detector/Kernel/Filter
2.3.3 Feature Maps and Convolutional Layer
2.3.4 Pooling
2.3.5 Flattening and Fully Connected Layers
2.3.6 Summing Up

3 Dropout - A Stochastic Regularization Technique
3.1 Implementation of Dropout
3.2 Dropout Objective

4 Bayesian Convolutional Neural Networks
4.1 Bayesian Approach in Convolutional Neural Networks
4.2 Approximation of Posterior Distributions
4.2.1 Normal Approximation
4.2.2 Variational Approximation
4.3 Approximate Variational Inference
4.4 Interpreting Dropout as Approximate Variational Inference

5 Experiments and Results
5.1 Experimental Setup
5.1.1 Software and Hardware
5.2 Data
5.2.1 Data Description
5.2.2 Data Source
5.3 Developing Models
5.3.1 Training Phase
5.3.2 Testing Phase
5.4 Experimental Outcomes
5.4.1 Model Performance
5.4.2 Predictive Uncertainty
5.4.3 Overfitting

6 Conclusion and Future Works
6.1 Conclusion
6.2 Future Works


List of Figures

1.1 Development of Blood Stem Cell
2.1 Biological Neural Networks
2.2 Simple Artificial Neural Networks
2.3 Stochastic Gradient Descent in 2D and 3D
2.4 Convolution Operation Between Input Images and Kernels
2.5 Convolutional Layers and Max Pooling Operation
2.6 Max Pooling and Flattening Operation
2.7 End-to-End Convolutional Neural Networks
3.1 Dropout
3.2 Dropout in Training and Testing Time
5.1 Sample Images from the ALL-IDB1 & ALL-IDB2 Datasets
5.2 Error Rate Returned by Two Different Models from Three Convolutional Layers
5.3 Error Rate Returned by Models from Four & Five Convolutional Layers
5.4 Predictive Uncertainty Estimated by First Four Models from Three Convolutional Layers & Five Hidden Layers
5.5 Predictive Uncertainty Estimated by First Four Models from Four Convolutional Layers & Five Hidden Layers
5.6 Checking for Overfitting of Models from Different Networks


List of Tables

5.1 Mean and Standard Deviation of Accuracy Rate of Ten Models for Each of Six Different Network Structures
5.2 Mean and Standard Deviation of Sensitivity and Specificity of Ten Models for Each of Six Different Network Structures


Nomenclature

Symbols

D  data
L  loss function
ω  weights
ωih  weights on connections between input units and hidden units
ωhk  weights on connections between hidden units and output units
Wi  weight matrix at the i-th layer
xi  independent variable with N observations
yi  dependent variable with N observations
x∗  new observation of the independent variable xi
y∗  new observation of the dependent variable yi
ŷi  estimated yi given xi
N  normal distribution
⊛  convolution operation
φ  activation function
bh  biases in hidden units
bk  biases in output units
bi  biases in the i-th layer
pi  dropout probability in the i-th layer
Ki  units at the i-th layer
L  layer
g  gradient
ln  natural logarithm
log  logarithm

Abbreviations

NN  Neural Networks
BNN  Bayesian Neural Networks
CNN  Convolutional Neural Networks
BCNN  Bayesian Convolutional Neural Networks
KL  Kullback-Leibler Divergence
ELBO  Evidence Lower Bound
VI  Variational Inference with optimization objective KL
VIE  Variational Inference with optimization objective ELBO
DO  Dropout
MC  Monte Carlo
SRT  Stochastic Regularization Technique
SGD  Stochastic Gradient Descent
GP  Gaussian Process
ReLU  Rectified Linear Units
ALL  Acute Lymphoblastic Leukemia
ALL-IDB  Acute Lymphoblastic Leukemia Image Database
AI  Artificial Intelligence
SVM  Support Vector Machine
SSVM  Smooth Support Vector Machine
kNN  k-Nearest Neighbor
MLP  Multi-Layer Perceptron
RBFN  Radial Basis Function Network (NN with radial basis activation function)
ANFIS  Adaptive Neuro-Fuzzy Inference System
e.g.  exempli gratia
i.e.  id est


Chapter 1

Introduction

In this chapter, we introduce the objective of this thesis project. We discuss the importance of acquiring knowledge regarding uncertainty in predictive modeling. The method we follow in this project is also mentioned in brief.

1.1 Introduction

Among all types of pediatric cancer, Acute Lymphoblastic Leukemia (ALL) is one of the most frequently observed. Four types of leukemia cancer can be distinguished, namely Acute Lymphoblastic Leukemia (ALL), Acute Myeloid Leukemia (AML), Chronic Myeloid Leukemia (CML), and Chronic Lymphocytic Leukemia (CLL). Around 876,000 ALL patients were reported worldwide in 2015, and the disease caused 111,000 deaths [Vos et al., 2016]; about two-thirds of the patients are observed to be aged below 5. According to the National Cancer Institute (U.S.), ALL is a disease in which the bone marrow produces more than the required amount of immature lymphocytes [NCI, 2018]. The excessive number of lymphocytes hampers the activities of the other blood components, i.e. red blood cells, platelets, as well as white blood cells. If not treated in time, ALL can cause death within weeks or months [Marino and Fine, 2013].

In the diagnosis of ALL, one method of initially determining the symptoms of cancerous lymphocyte cells is a complete blood count [NCI, 2018]. In this technique, the numbers of platelets, red blood cells, and white blood cells are counted from blood samples of the patient. A significant number of lymphoblasts, B lymphocytes, and T lymphocytes in the peripheral blood is a probable indication of leukemia cancer [Hunger and Mullighan, 2015]. If too many lymphocytes are observed, a morphological bone marrow smear analysis is performed by a pathologist using a microscope to determine the manifestation of cancer. Determining whether a lymphocyte in the blood sample is associated with leukemia cancer is an important part of the diagnosis, and thus the proper identification of the cancerous lymphocytes would assist in a better prognosis. Screening reports on microscopic blood samples produced by a human can be subjective, depending on many factors, e.g. the responsible person's experience, age, mental state, exhaustion, etc. The assistance of an automated system could be a vital tool to avoid errors in such critical situations. In classifying lymphocyte cells, Bayesian Convolutional Neural Networks (BCNN) could be an alternative to other methods utilized


Figure 1.1: In an ALL patient, the lymphoid stem cell produces excessive lymphoblasts as well as defective B lymphocytes and T lymphocytes. These defective lymphocytes, also referred to as ALL cancer cells, have less capability of fighting infections. Image adopted from NCI [2018].

in recent times. The advantage of such an approach would be not only the effective classification of the images but also the estimation of the predictive uncertainty of each prediction made by the BCNN model. Thus, the development of a model that can effectively classify the microscopic images with a confidence bound would assist greatly in the better diagnosis of ALL cancer.

The observed percentage of error in classifying the microscopic images by pathologists is 30% to 40%, depending on the pathologist's experience and the complications involved in classifying the cells [Reta et al., 2010]. Our objective in this thesis is to develop a BCNN model that can classify with a lower error rate and high certainty. As we already know, some tasks can be performed better by a human than by a computer, while the opposite holds for other tasks. Nonetheless, the intention is not to imply that a Deep Learning model is better than a pathologist but to combine the experience and skill set of a doctor with the potential of AI for better ALL diagnosis. Since we are dealing with microscopic images, which can be complex even for an expert, the convolution operation performed by a computer may produce better classification accuracy.

Previous methods used in classifying the cancerous and non-cancerous lymphocyte cells include Support Vector Machines (SVM) [Amin et al., 2015; Putzu et al., 2014], k-Nearest Neighbors (kNN) [Madhloom et al., 2012], Neural Networks [Singh et al., 2016], and ensembles of different classifiers (SVM, MLP-SVM, RBFN-SVM, kNN & Naive Bayes) [Mohapatra et al., 2014]. Experimentation with different methods, including PCA-kNN, PCA-PNN, PCA-SSVM, PCA-ANFIS, PCA-SVM, and the best combination of classifiers, referred to as a Hybrid Hierarchical Classifier, is discussed in Rawat et al. [2017].

However, experimentation with CNN has not been performed before in classifying the microscopic images of lymphocyte cells. Our method not only predicts the class of the images but also provides valuable uncertainty information that has not been attempted in previous works.

1.2 Uncertainty Estimation in Predictive Modeling

When we take a decision, we want to be confident about the resulting outcome. In other words, having knowledge about the range of probable outcomes beforehand would assist us in better decision making. The knowledge about the future outcomes can also be expressed in terms of uncertainty. From the statistical point of view, for a classification problem, uncertainty is a bound formed by probability values about an outcome.

Statistics has a wide spectrum of applications in many fields. Recent years have been the juvenile period of integrating statistics and artificial intelligence to solve real-world problems ranging from finance, medicine, business, and life science to environmental science. Important decisions must be taken in these fields to achieve the intended goals. Deep Learning has become a vital tool for making predictions about future outcomes in order to make better decisions. In a classification problem, a deterministic Deep Learning model produces a single probability, also known as a point estimate, given a single observation of the independent variable(s). In contrast, a Bayesian Deep Learning model generates a distribution of outcomes given the input data. This probabilistic modeling in supervised Machine Learning or Deep Learning produces a confidence bound for each outcome of the dependent variable, which can be utilized to make realistic and effective decisions. The point estimate based prediction in deterministic Deep Learning models does not give any indication of the possible variation in outcomes that could happen in reality. This lack of information about uncertainty in predictive modeling can lead to unwanted situations. For instance, in cancer diagnosis, the frequentist approach in a Deep Learning model will generate a single probability value about whether a person has cancer or not, and if a doctor prescribes medicine based on that single probability value, which could be associated with a type-I error, the consequences could be devastating. In a similar situation, a Bayesian Deep Learning model will generate a probability distribution over the possible outcomes, from which it is possible to estimate the mean and variance. In general, in a classification problem, we treat the mean of the predicted probabilities as the predicted outcome, and the variance tells us about the uncertainty of the outcomes in any given situation. In other words, the distribution illustrates the confidence bound about the outcome given the input data. From that confidence bound, we can observe how confident the model is in making each of the predictions.

Deep Learning models can be defined by several architectures, from Feed Forward Neural Networks [Rumelhart et al., 1985] to Convolutional Neural Networks [Rumelhart et al., 1985; LeCun et al., 1989] and Recurrent Neural Networks (RNN) [Rumelhart et al., 1985; Werbos, 1988]. Feed Forward Neural Networks (NN) are mostly utilized for solving regression and classification problems. CNN classify images based on pixel values [Hecht-Nielsen, 1992] to discover visual patterns, e.g. in image recognition, facial recognition or human action recognition, natural language processing, drug discovery [Wallach et al., 2015], disease diagnosis [Chen et al., 2016], and many others. The Recurrent Neural Networks (RNN) focus on time-sequence-based modeling and are mostly utilized in time series prediction, speech recognition, human action recognition, machine translation systems, natural language processing, language modeling, video processing, etc.

In general, probabilistic modeling requires theoretical efficiency and high computational effort. For these reasons, in practical fields, most of the implemented Deep Learning models are usually deterministic models. However, in recent times it has become possible to increase the computational capability thanks to high-performance Graphics Processing Units (GPUs). The high computation power of modern computers has created a new horizon for obtaining results from Deep Learning models in a more effective way. The Bayesian approach to Deep Learning produces the model uncertainty as well as the predictive uncertainty by explaining the data in an efficient way. The information regarding uncertainty in predictions made by a model tells us how confident the model is in returning predictions, which in turn helps a practitioner to take sensible decisions as well as to fine-tune the Deep Learning model to obtain better predictions through iterative processes.

Note that all the methods mentioned in section 1.1 for classifying ALL images are point estimate based, except the Naive Bayes in the ensemble of different classifiers. That is, during test time, a single prediction was generated for each new observation. The vital information that we are missing in these point estimate based predictions is how confident the model is in generating each prediction. Our implemented Bayesian approach in CNN provides this vitally important information that was missed before. The predictive uncertainty generated by our models in the BCNN setup extracts the information regarding the confidence bound with respect to each prediction. This information is of vital importance since it tells the practitioner about the probable variation in outcomes that could happen in reality.

We are deploying artificial intelligence in many fields, and probabilistic modeling through the Bayesian approach is the most efficient means of ensuring AI safety since it provides information about uncertainty. For this reason, it is of utmost importance to estimate the predictive uncertainty in supervised Deep Learning.

1.3 Background

The concept of estimating predictive uncertainty in Deep Learning first came to light during the 1990s. The pioneering works in this field were ventured by Denker et al. [1987], Buntine and Weigend [1991], MacKay [1992], and Neal [1992], and were later expanded on by Bishop [1995], MacKay [1995], and Neal [1996]. Denker et al. [1987] first proposed to set a uniform distribution over the network weights as the prior distribution. To approximate the posterior distribution, Denker and Lecun [1991] proposed to utilize Laplace's method. First, they optimized the network weights to find a mode and then fitted a Gaussian to that mode. The width of the fitted Gaussian was set by the Hessian of the likelihood [Gal, 2016].

MacKay [1992] suggested the Bayesian evidence framework for model selection and model comparison, as well as for obtaining the weight decay coefficients in an efficient way. To approximate the true posterior distribution in NN, Hinton and Van Camp [1993] implemented a variational method known as ensemble learning. A Monte Carlo based technique, well known as Hamiltonian Monte Carlo, was presented by Neal [1996] to obtain the posterior inference in NN. His work also extended to experimentation with different prior distributions. The Variational Inference technique in the NN setup was first introduced by Peterson [1987], followed by Parisi [1988], Saul et al. [1996], Ghahramani and Jordan [1996], Jaakkola and Jordan [1997], and Jordan et al. [1999].

The concept of estimating predictive uncertainty in the CNN setup may have existed before, but it has rarely been implemented in practical fields. This is probably due to the complexity of achieving proper Bayesian inference in the CNN structure. The practical implementation of computer vision tasks has only been achieved in recent times [Krizhevsky et al., 2012], and the concern about ensuring AI safety has generated a discussion of gaining uncertainty information not only from CNN but also from the other Deep Learning models as well. So far, the pioneering work on estimating uncertainty in Bayesian Convolutional Neural Networks (BCNN) has been done by Gal and Ghahramani [2015a]. In their paper, to estimate the predictive uncertainty in the BCNN, they utilized dropout, which is a stochastic regularization technique most commonly used in traditional NN to avoid overfitting. We discuss their approach in detail in chapter 4.

1.4 Thesis Structure

In chapter 2, we briefly study the basics of Neural Networks (NN), Bayesian Neural Networks (BNN), and Convolutional Neural Networks (CNN) to understand the later part of this thesis. A stochastic regularization technique commonly used in NN, dropout, is introduced in chapter 3. In chapter 4, we discuss how dropout can be associated with approximate Variational Inference. We deploy the BCNN on an Acute Lymphoblastic Leukemia (ALL) dataset as an experiment and discuss the results in chapter 5. Finally, in chapter 6, we draw conclusions from our work and give recommendations for future works.


Chapter 2

Literature Review

In this chapter, we focus on understanding the basics of different Deep Learning network structures to make our journey smoother at the later stage of the thesis project. We start with Neural Networks (NN) and gradually move towards the Bayesian Neural Networks (BNN), Convolutional Neural Networks (CNN) and finally Bayesian Convolutional Neural Networks (BCNN), which is the principal area of exploration of this thesis work.

2.1 Neural Networks (NN)

The NN are the building blocks of Deep Learning models. From a statistical point of view, supervised Deep Learning models are a nonparametric method of classification and regression. The well-established concept of NN came from the Biological Neural Networks of our nervous system, even though there are some disagreements about this analogy. Regardless of that, it is easy to understand the fundamental concept of NN from the point of view of Biological Neural Networks. Our purpose in this section is to explore the NN on a broad spectrum.

The Biological Neural Networks in the brain consist of millions of interconnected neurons. Each neuron contains a nucleus, dendrites, and an axon, as depicted in Fig. 2.1. Neurons are connected to each other by dendrites and axons. In the network, the dendrites receive the signals and pass them to the nucleus, where the signals are processed, and the axon transmits the signal from the neuron.

The artificial NN in Deep Learning are based on an almost similar architecture. In Deep Learning, NN consist of three fundamental components: an input layer, an output layer, and one or multiple hidden layers that lie between the input and output layers, as in Fig. 2.2. The input layer receives the data (independent variables), which are passed through the hidden layers for a better representation of each observation, and in the end, the output layer represents the final outcome (dependent variable) for each of those observations. For instance, in a regression problem in NN, the independent variable xi enters the network through the input layer, and the output layer generates the value of the dependent variable yi. Similarly, in a classification problem, the value of the dependent variable yi could be the probability that the observation belongs to the k-th class. The number of hidden layers between the input and output layers depends on the complexity of the data and the problem at hand. Basically, in Deep Learning, the number of hidden layers is more than one, and the intention of preferring so is to represent each of the data points better. The word "Deep" in Deep Learning came from this concept, in the sense that we want to represent the initially inputted data in a gradually more purposeful way by passing them through each of these hidden layers. In this context, we can say that the number of hidden layers portrays the depth of a Deep Learning model.

Figure 2.1: A neuron containing the major components dendrites, nucleus, and axon. Image collected from Wikipedia.

The representation of each observation can be done through many layers, even hundreds of them, based on the problem at hand. During this consecutive representation of the data, the Deep Learning model tries to learn from the characteristics of each data point. NN are formed as a combination of all of these layers, i.e. the input layer, the hidden layers, and the output layer.

In Deep Learning, we train our model by exposing the association between the independent variables and the dependent variable. This exposition of the association is done by a series of consecutive data modifications through multiple layers. The change made in each transformation is depicted by the respective weights. The network learns from its errors by adjusting the weights in each iteration for each observation upon realizing its "mistakes". Thus, for any specific problem at hand, the network's ultimate purpose is to find the "perfect" combination of values for all of these weights in all of the respective layers, such that the association between the independent variables and the dependent variable can be precisely measured. This is the core concept of supervised Machine Learning and Deep Learning, where the network gains the capability of learning and implements it in the future without receiving explicit instructions from a human.

Any regression or classification model is represented in two stages in NN [Friedman et al., 2001]. At the first stage, a linear combination of the inputs xi is formed. An activation function is then applied to transform this linear combination into a non-linear expression. At the second stage, a linear combination of these non-linear functions is formed to express the output yi. In a regression problem, there is only one output unit for each observation,


Figure 2.2: Feed Forward Neural Networks with an Input Layer, three Hidden Layers, and an Output Layer.

whereas in a classification problem, the number of output units is equal to the number of classes, say k, and each of the output units represents the probability of belonging to the k-th class. An NN with one hidden layer and k outputs can be expressed mathematically as the following [Ripley, 2007],

$$y_k = \phi_0\left\{ b_k + \sum_{h=1}^{H} \omega_{hk}\, \phi_h\!\left( b_h + \sum_{i=1}^{p} \omega_{ih}\, x_i \right) \right\}.$$

Here xi is the independent variable and yk is the output. The numbers of input units, hidden units, and output units are p, H, and k, respectively. ωih are the weights on the connections from the input units to the hidden units, and ωhk are the weights on the connections from the hidden units to the output units. The biases in the hidden units and the output units are bh and bk, respectively. φh is the activation function, mostly a sigmoid function, hyperbolic tangent function, or Rectified Linear Unit (ReLU), applied to the linear combinations of the inputs at the hidden layer. φ0 is the activation function (sigmoid function or softmax function) implemented on the output layer for classification, whereas in regression we utilize the identity function as the activation function at the output layer. Thus, the parameters or weights in a network are bk, bh, ωhk, and ωih, and we collectively define them as ω in this paper.
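To make the notation concrete, the following is a minimal NumPy sketch of the forward pass defined by the equation above, assuming a ReLU activation for φh and a softmax for φ0; the dimensions and randomly initialized weights are illustrative placeholders, not values used in this thesis.

```python
import numpy as np

def forward_pass(x, W_ih, b_h, W_hk, b_k):
    """One-hidden-layer forward pass: y = phi_0(b_k + W_hk @ phi_h(b_h + W_ih @ x))."""
    hidden = np.maximum(0.0, b_h + W_ih @ x)        # phi_h: ReLU at the hidden layer
    scores = b_k + W_hk @ hidden                    # linear combination at the output layer
    exp_scores = np.exp(scores - scores.max())      # phi_0: softmax (numerically stabilized)
    return exp_scores / exp_scores.sum()

# Example with p = 4 inputs, H = 8 hidden units, and k = 2 classes (placeholder values).
rng = np.random.default_rng(0)
p, H, k = 4, 8, 2
x = rng.normal(size=p)
y_hat = forward_pass(x, rng.normal(size=(H, p)), np.zeros(H),
                     rng.normal(size=(k, H)), np.zeros(k))
print(y_hat)  # k class probabilities summing to 1
```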

Having discussed the basic architecture of NN, we now briefly explore the key components and mechanisms of the NN in the following.

2.1.1 Weights

The architecture of NN starts with the input layer, which takes the data as input and passes it to the hidden layers through a mechanism of interconnected nodes/units. Initially, in a fully connected NN, all the nodes/units in a layer are connected with the units of the next layer, and a weight (ω) is assigned to each of these connections. These weights represent how much the information has been modified through each unit. A set of properly estimated weights assists the network in better understanding the relationship between the independent variables and the dependent variable, and hence it is crucial to find this set of proper weights for a model. The set of such weights constitutes the whole "knowledge" storage of the NN that the model has learned through trial and error.

2.1.2 Nodes/Units

In Biological Neural Networks, the neuron is the center of the nervous system, and a nucleus is present in each neuron. In the context of Deep Learning, these nuclei are recognized as nodes/units. Each input layer and hidden layer in the network is basically a set of units. These units in the layers are interconnected, and over each connection we assign a weight. Hence, each of these connections is actually a weighted connection. The exchange of information among the layers is done by these units. The intention of choosing multiple hidden layers is to modify the information acquired from the units of the previous layer and pass that information to the next layer for further modification. By modification, we refer to applying a non-linear activation function to a linear input. The process continues this way, depending on the number of layers we prefer for answering a specific question.

2.1.3 Activation Functions

Each unit in NN contains an activation function. The activation function in a specific unit decides how the input information passed from the previous unit will be processed. It basically transforms the linear input into a non-linear output. Usually, we add a bias term to each input and then apply the activation function before forwarding the information to the next layer. The activation function of a unit in a network is sensitive to a specific threshold value, and it gets triggered if the network input surpasses that threshold. Thus, the activation function controls the activation of a unit depending on the input it receives from the previous unit and the threshold value [Kriesel, 2007]. In general, all the nodes or units in a network are defined by the same activation function, except the output layer. In the modern literature, several types of activation functions can be found, such as:

• Binary Threshold Function
• Sigmoid Function
• Rectified Linear Unit (ReLU)
• Hyperbolic Tangent
• Softmax Function
• Inverse Square Root Unit (ISRU), etc.

However, we will briefly discuss only the first four of the above-mentioned activation functions, since these are the most commonly used in Deep Learning.


• Binary Threshold Function: This is the simplest form of activation function, also known as the Heaviside Step Function or Unit Step Function, and it takes only 0 and 1 as values. If the value of x is less than zero, the threshold function is zero. In contrast, if the value of x is equal to or greater than zero, the threshold function is one. That is, if the input information x exceeds the threshold value of zero, the function changes its value from 0 to 1.
$$\phi(x) = \begin{cases} 0 & \text{if } x < 0 \\ 1 & \text{if } x \geq 0 \end{cases}$$

• Sigmoid Function: The sigmoid function is simple and frequently used, especially on the output layer. It can be expressed as
$$\phi(x) = \frac{1}{1 + e^{-x}}.$$

• Rectified Linear Unit (ReLU): The Rectified Linear Unit is frequently applied to the units of the hidden layers as an activation function, due to the fact that it improves the classification rate. It is the most commonly applied activation function in hidden layers because it assists the NN in learning the non-linear properties of the data [Nair and Hinton, 2010]. This activation function passes only the positive output of a unit, and for this reason the network functions much faster than with other activation functions.
$$\phi(x) = \max(0, x).$$

• Hyperbolic Tangent Function: The hyperbolic tangent function maps the output between -1 and 1, and it exhibits some properties similar to the sigmoid function. However, the hyperbolic tangent function returns output over a much wider range than the sigmoid function, which maps the output between 0 and 1. The hyperbolic tangent function is most commonly utilized to model complex non-linear relationships. This function can be formulated as
$$\phi(x) = \frac{e^{x} - e^{-x}}{e^{x} + e^{-x}}.$$
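As a quick reference, the four activation functions above can be written directly in NumPy; this is only an illustrative sketch, not code taken from the thesis.

```python
import numpy as np

def binary_threshold(x):
    # Heaviside step function: 0 for x < 0, 1 for x >= 0
    return np.where(x < 0, 0.0, 1.0)

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def relu(x):
    return np.maximum(0.0, x)

def hyperbolic_tangent(x):
    # equivalent to np.tanh(x)
    return (np.exp(x) - np.exp(-x)) / (np.exp(x) + np.exp(-x))

x = np.array([-2.0, -0.5, 0.0, 0.5, 2.0])
for f in (binary_threshold, sigmoid, relu, hyperbolic_tangent):
    print(f.__name__, f(x))
```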

2.1.4 Loss Function

In Deep Learning, the networks try to learn the relationship between the independent variables and the dependent variable. Through iteration, the network attempts to get closer to the actual value of the dependent variable by learning from its previous errors. Loss functions measure the error that a network makes during its learning process. That is, a loss function is a measurement of the distance between the actual value y and the estimated value ŷ of the dependent variable, and hence demonstrates the learning progress or regress of the NN. If the loss function gets minimized after each iteration, we can say that the NN is gradually learning from its previous errors. An NN with the minimum loss returns output close to the actual value. The amount of loss a network makes during its training is associated with the weights it assigns itself on each node/unit. Depending on multiple outputs, NN can contain multiple loss functions.


Different choices of the loss function L, summarized below, can also yield a probabilistic explanation regarding the misclassification error.

• Mean Squared Error:
$$L = \frac{1}{N}\sum_{i=1}^{N}(y_i - \hat{y}_i)^2.$$

• Mean Absolute Error:
$$L = \frac{1}{N}\sum_{i=1}^{N}|y_i - \hat{y}_i|.$$

• L1-Norm Loss / Least Absolute Deviations:
$$L = \sum_{i=1}^{N}|y_i - \hat{y}_i|.$$

• L2-Norm Loss / Least Squared Error:
$$L = \sum_{i=1}^{N}(y_i - \hat{y}_i)^2.$$

• Kullback-Leibler (KL) Divergence:
$$L = \frac{1}{N}\sum_{i=1}^{N} \mathrm{KL}(y_i \,\|\, \hat{y}_i) = \frac{1}{N}\sum_{i=1}^{N} \log\!\left(\frac{y_i}{\hat{y}_i}\right)\cdot y_i = \frac{1}{N}\sum_{i=1}^{N} \log(y_i)\cdot y_i - \frac{1}{N}\sum_{i=1}^{N} \log(\hat{y}_i)\cdot y_i.$$

• Cross Entropy:
$$L = -\frac{1}{N}\sum_{i=1}^{N}\left[\, y_i \ln(\hat{y}_i) + (1 - y_i)\ln(1 - \hat{y}_i)\,\right].$$

Here N is the number of observations.
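The loss functions above translate directly into NumPy, as in the illustrative sketch below; the small clipping constant is an implementation detail added here to avoid log(0) and is not part of the formulas.

```python
import numpy as np

def mean_squared_error(y, y_hat):
    return np.mean((y - y_hat) ** 2)

def mean_absolute_error(y, y_hat):
    return np.mean(np.abs(y - y_hat))

def kl_divergence(y, y_hat, eps=1e-12):
    y, y_hat = np.clip(y, eps, 1.0), np.clip(y_hat, eps, 1.0)
    return np.mean(y * (np.log(y) - np.log(y_hat)))

def cross_entropy(y, y_hat, eps=1e-12):
    y_hat = np.clip(y_hat, eps, 1.0 - eps)
    return -np.mean(y * np.log(y_hat) + (1 - y) * np.log(1 - y_hat))

y     = np.array([1.0, 0.0, 1.0, 1.0])   # actual values of the dependent variable
y_hat = np.array([0.9, 0.2, 0.7, 0.6])   # estimated values (predicted probabilities)
print(cross_entropy(y, y_hat), mean_squared_error(y, y_hat))
```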

2.1.5 Back-Propagation

Upon estimating the error by the loss function, it is important to feed this information back to the initial stage of the NN so that the weights can be updated and the error minimized in the next iteration. This information about the error is propagated backwards through the network to adjust the weights by utilizing a technique called "backpropagation" [LeCun, Bottou, Orr and Müller, 1998]. For backpropagation to work, the loss function has to be differentiable.


The backpropagation algorithm estimates the derivative, or gradient, of the change in the error of the NN. That is, it estimates the contribution of each weight to the error value by taking the derivative of the change in the loss with respect to the change in the weights. By finding the global minimum with respect to the set of weights, it can minimize the loss function. In reality, however, it can be very expensive to find the global minimum with respect to all of these weights, due to the fact that NN can contain thousands or even millions of parameters/weights, and we want to find a global minimum of the loss not with respect to a single weight at a time but with respect to all of these weights at the same time. In such a situation, to minimize the loss function, we are aided by an alternative method, popularly known as stochastic gradient descent.

2.1.6 Optimization and Stochastic Gradient Descent

To optimize an n-dimensional function, the gradient descent technique is most commonly used in Deep Learning. In deep NN, we want to minimize the n-dimensional loss function. The gradient is the generalization of the derivative to an n-dimensional function. A negative gradient, -g, indicates that we are steeply descending, i.e. we are moving downwards. The gradient g of a two-dimensional loss function L at point (x, y) can be conveyed as g(x, y) = ∇L(x, y). For an n-dimensional loss function L, at point (x1, x2, ..., xn), the gradient g can be expressed as g(x1, x2, ..., xn) = ∇L(x1, x2, ..., xn). Thus, here the gradient g is a vector with n components. From any point, say p, of a loss function L, the gradient g takes us in the direction of steepest ascent from that point p, and the degree of ascent is the absolute value of g, i.e. |g|.

With gradient descent, we can think of moving downwards step by step from any starting point of our loss function L, and the speed or step of the descent is proportional to |g|. That is, the steeper the descent, the faster we move (see Fig. 2.3). Thus, with gradient descent, we always move from any starting point of the loss function L in the opposite direction of the gradient g [Kriesel, 2007].

Figure 2.3: An illustration of Stochastic Gradient Descent in 3D (a) and in 2D (b). The opposite direction of the gradient g takes us to the minimum value of the loss function.

With the stochastic gradient descent method, we take a single observation, i.e. one row at a time, feed it to the NN, estimate the loss function, and adjust the weights by first estimating the gradient of the loss function with respect to the weights and then pushing the weights in the opposite direction of the gradient. We follow this procedure to update the weights by iterating over all of the observations. Another way of updating the weights through stochastic gradient descent is by taking a few observations as a "mini-batch" instead of taking observations individually. This technique is known as "Mini-Batch Stochastic Gradient Descent". By repeating this process, we try to find the set of weights that minimizes the total loss function.
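A minimal sketch of the mini-batch stochastic gradient descent update described above, written for a linear model with squared-error loss so that the gradient has a simple closed form; the learning rate, batch size, and number of epochs are illustrative choices, not values used in the thesis.

```python
import numpy as np

def minibatch_sgd(X, y, lr=0.01, batch_size=8, epochs=50, seed=0):
    """Mini-batch SGD for a linear model y ≈ X @ w, minimizing the squared error."""
    rng = np.random.default_rng(seed)
    n, p = X.shape
    w = np.zeros(p)                                          # initial weights
    for _ in range(epochs):
        idx = rng.permutation(n)                             # shuffle observations each epoch
        for start in range(0, n, batch_size):
            batch = idx[start:start + batch_size]
            residual = X[batch] @ w - y[batch]
            grad = 2.0 * X[batch].T @ residual / len(batch)  # gradient of the batch loss
            w -= lr * grad                                   # step opposite to the gradient
    return w

# Toy data: the recovered weights should be close to the true ones.
rng = np.random.default_rng(1)
X = rng.normal(size=(200, 3))
true_w = np.array([1.5, -2.0, 0.5])
y = X @ true_w + 0.1 * rng.normal(size=200)
print(minibatch_sgd(X, y))
```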

2.1.7 Summing Up

So far, we have briefly discussed the key components of NN. By summing up all of these key components step by step, we can figure out how NN function and learn.

1. Data Input: We provide the data to the network through the input layer. The data we provide to the input layer can be two-dimensional array vector data (observations & variables), three-dimensional array time series or sequence data (observations, variables & timestamp), four-dimensional array image data (observations, image height, image width & color channel), or five-dimensional array video data (observations, frame, color channel, frame height & frame width) [Chollet, 2018].

2. Initialization of NN: The network is initialized by assigning weights to all of the input data. These assigned weights are random, since the network has not yet acquired any knowledge about the data, but the weights are required to initialize the network. At this stage, the weights are set close to zero. The conclusive aim of the NN is to find the proper set of values for these weights.

3. Feed-forwarding the data: In Feed Forward Neural Networks, the units in one layer can only pass data/information to the units of the next layer. Thus, at this stage, the network feeds the data from the input layer to the units/nodes of the hidden layer. Each unit has an activation function through which the data are processed for modification. The activation function controls the activation of a unit depending on the input it receives and the threshold value. After modification of the data, the unit passes the information to the unit(s) of the next layer for further modification. The process continues this way until it reaches the output layer.

4. Measurement of Error: The output layer returns an outcome (ŷ) for a specific observation. The network then measures the difference between the actual value (y) and the output value (ŷ) with a loss function L.

5. Back-Propagation: The network sends the information about the error back to the input layer by backpropagation to adjust the weights, such that the error is minimized in the next iteration. The error at the first iteration is bound to be large, since the weights were assigned at random. The backpropagation algorithm estimates the derivative or gradient of the error with respect to the weights by the stochastic gradient descent method.

6. Adjustment of Weights: The contribution of each weight to the loss function is observed by stochastic gradient descent, and the network adjusts the weights accordingly. After adjusting the weights, the data are fed to the network again through the hidden layers, and the output layer returns an output value for each observation. The network estimates the error and adjusts the weights again by backpropagating the error information. The word "Learning" in Deep Learning came from this concept that during each iteration the NN learn from their errors. By repeating the same process for a specific number of iterations, we obtain a combination of weight values that minimizes the loss function.
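These six steps are what a high-level Deep Learning library carries out internally. The following hedged Keras sketch (assuming TensorFlow is installed; the toy data, layer sizes, and optimizer settings are illustrative placeholders, not the configurations used later in this thesis) shows how they map onto a few lines of code.

```python
import numpy as np
import tensorflow as tf

# Step 1: data input (toy two-dimensional array data: 100 observations, 10 variables).
X = np.random.rand(100, 10).astype("float32")
y = np.random.randint(0, 2, size=(100,))

# Steps 2-3: the network is initialized with near-random weights and feeds the data forward.
model = tf.keras.Sequential([
    tf.keras.layers.Dense(16, activation="relu", input_shape=(10,)),
    tf.keras.layers.Dense(16, activation="relu"),
    tf.keras.layers.Dense(1, activation="sigmoid"),
])

# Step 4: the loss function (cross-entropy) measures the error.
# Steps 5-6: backpropagation with mini-batch SGD adjusts the weights at each iteration.
model.compile(optimizer=tf.keras.optimizers.SGD(learning_rate=0.01),
              loss="binary_crossentropy", metrics=["accuracy"])
model.fit(X, y, epochs=5, batch_size=8, verbose=0)
```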


2.2 Bayesian Neural Networks (BNN)

In Bayesian analysis, uncertain quantities are expressed in terms of posterior probability distributions. The purpose of using Bayesian Neural Networks (BNN) is to integrate the traditional NN with probabilistic modeling, since probabilistic modeling takes the uncertainty into account. The NN we discussed in the previous section focused on maximization of the likelihood, which is similar to minimization of the loss function, i.e. we find a single value of the weight vector ω that maximizes the likelihood function. The Bayesian approach in NN emphasizes finding the distribution of the weights rather than a single weight value. This distribution reflects the uncertainty of the parameters/weights. Moreover, in contrast to the traditional NN, BNN predict the outcomes with a confidence bound, i.e. the networks predict each outcome with its probable variation instead of producing a point estimate for each outcome. Besides that, the Bayesian approach also provides the following advantages [Bishop, 1995]:

• We can explain regularization, as opposed to traditional NN.

• We can compare different models (e.g. models with different layers or prior probabilities) and choose the best one without using the test/validation data.

2.2.1 Bayesian Learning of Weights

The frequentist approach to NN is associated with assigning a single weight to each connection between the nodes/units. In the Bayesian approach, however, we assign a distribution over each weight, and this probabilistic modeling returns estimates of uncertainty. Initially, we assign a prior distribution over each weight, and based on the posterior inference of the weights we come to a decision about the weights. That is, Bayesian learning of the weights means how we rationalize our belief regarding the weights/parameters from the prior knowledge to the posterior after observing the data.

Let X be i.i.d. random variables with N observations, which we denote as D. Also, let us consider the weights in the NN as ω, with prior distribution p(ω). The distribution p(ω) reflects our prior knowledge about the weights ω before witnessing the data. Now, we denote the posterior distribution of the weights as p(ω|D). We obtain the posterior distribution by combining the prior distribution and the likelihood according to Bayes' theorem as follows,

$$p(\omega|\mathcal{D}) = \frac{p(\mathcal{D}|\omega)\, p(\omega)}{p(\mathcal{D})}.$$

Here, p(ω|D) is the posterior probability of the weights given the data, p(D|ω) is the joint probability distribution, also known as the likelihood function, and p(ω) is the prior probability of the weights ω. The denominator is the normalizing constant, which we can skip for now for simplicity. Then the equation becomes

$$p(\omega|\mathcal{D}) \propto p(\mathcal{D}|\omega)\, p(\omega),$$

which, since the N observations are i.i.d., can be expressed explicitly as

$$p(\omega|\mathcal{D}) \propto p(\omega) \prod_{i=1}^{N} p(x_i|\omega).$$


This demonstrates how our prior knowledge about the weights in the NN is updated after observing the data.

2.2.2 Predictive Distribution

We utilize the posterior distribution to obtain the predictive distribution. For a new input x∗, we can estimate the predictive distribution of the new output y∗ as

$$p(y^{*}|x^{*}, \mathcal{D}) = \int p(y^{*}|x^{*}, \omega)\, p(\omega|\mathcal{D})\, d\omega.$$

That is, the predictive distribution is expressed as the average of the conditional predictions of y∗ over the posterior distribution of the weights ω.

In a classification problem, for a new input x∗, the predictive distribution of an output class C is

$$p(y^{*} = C|x^{*}, \mathcal{D}) = \int p(y^{*} = C|x^{*}, \omega)\, p(\omega|\mathcal{D})\, d\omega.$$

We can think of the predictive distribution as a collection of predictions from NN with different weights. We then estimate the uncertainty for the predicted values of the dependent variable from the variation among these predictions.
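In practice this integral is intractable and is approximated by Monte Carlo sampling: draw T sets of weights from (an approximation to) the posterior, predict with each, and summarize the resulting collection by its mean and variance. The sketch below assumes a generic `predict(x, weights)` function and a list of posterior weight samples; both are hypothetical placeholders used only for illustration.

```python
import numpy as np

def mc_predictive(x_new, weight_samples, predict):
    """Monte Carlo estimate of p(y*|x*, D) ≈ (1/T) Σ_t p(y*|x*, ω_t).

    weight_samples : list of T weight sets drawn from (an approximation of) p(ω|D)
    predict        : function mapping (x, ω) to a vector of class probabilities
    """
    probs = np.stack([predict(x_new, w) for w in weight_samples])  # shape (T, n_classes)
    return probs.mean(axis=0), probs.var(axis=0)   # predictive mean and its uncertainty

# Toy illustration with a logistic "network" and random stand-in posterior samples.
def toy_predict(x, w):
    p1 = 1.0 / (1.0 + np.exp(-(x @ w)))            # probability of class 1
    return np.array([1.0 - p1, p1])

rng = np.random.default_rng(0)
samples = [rng.normal(size=3) for _ in range(100)]
mean, var = mc_predictive(np.array([0.5, -1.0, 2.0]), samples, toy_predict)
print(mean, var)
```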


2.3 Convolutional Neural Networks (CNN)

Modern Convolutional Neural Networks (CNN) were first introduced by Yann LeCun in 1990 [LeCun et al., 1990]. They were named CNN because we apply a unique linear operation called convolution to the NN, an operation that is different from the matrix multiplication conducted in a traditional NN. In short, an NN with a convolution operation in at least one of its layers can be introduced as a Convolutional Neural Network [Goodfellow et al., 2016]. The convolution operation assists in portraying the spatial information of the input data, and a CNN can provide very efficient results on such tasks [Krizhevsky et al., 2012]. The architecture of CNN is based on three different layers, namely the convolutional layer, the pooling layer, and the fully connected layer. CNN have proven to be an excellent tool in solving many complex Deep Learning and computer vision tasks like image classification, object detection [LeCun, Bottou, Bengio and Haffner, 1998], text detection, speech processing, and natural language processing. Because of their architecture and functionality, they are especially useful in solving problems that are difficult to solve with other Machine Learning and Deep Learning models. They work well when a dataset contains special features. For instance, a time series contains seasonality, and a colorful image accommodates different pixel values across all the pixels, where each of the pixels represents a certain characteristic. Due to this fact, CNN are applicable to unravel questions that are related to time series, image, video, and text data. For instance, we can apply a CNN to MRI scan image data to classify patients with a brain tumor.

In the following, we briefly go through the key components that are required to understand the functionality of CNN.

2.3.1 Convolution Operation

A convolution is simply an operation on two functions where one function modifies the other. Suppose that we have a 2D tensor (a two-dimensional array of numbers, e.g. a matrix) X that represents the input data and another 2D tensor, say S, that represents a certain feature of the 2D tensor X. Then a convolution operation C of X and S can be defined as [Goodfellow et al., 2016]

$$C(i,j) = (X \circledast S)(i,j) = \sum_{m}\sum_{n} X(m,n)\, S(i-m, j-n). \tag{2.1}$$

Assume that the input data X is a 7 × 7 matrix and the feature detector/kernel S is a 3 × 3 matrix. Here it is not possible to apply matrix multiplication, since the number of columns of X is not equal to the number of rows of S, but we can apply the convolution operation. As depicted in Fig. 2.4, we project the kernel S (the blue coloured 3 × 3 matrix in the middle) over the top left corner of X (in this case the orange coloured 3 × 3 matrix on the left), take the product of the numbers at the same position, and then sum them together, i.e. 0 ∗ 1 + 1 ∗ 1 + 0 ∗ 0 + 0 ∗ 0 + 0 ∗ 1 + 1 ∗ 0 + 0 ∗ 0 + 0 ∗ 0 + 1 ∗ 1 = 2. Next, we project the kernel S over X by moving one stride to the right and perform the same operation, i.e. 1 ∗ 1 + 0 ∗ 1 + 0 ∗ 0 + 0 ∗ 0 + 1 ∗ 1 + 1 ∗ 0 + 0 ∗ 0 + 1 ∗ 0 + 0 ∗ 1 = 2. We continue this convolution operation at every possible location of X until the matrix S reaches the bottom right corner of X. Note that the number of strides can be greater than one depending on the data and/or the practitioner. Fig. 2.4 visualizes how this convolution operation works.


Figure 2.4: The convolution operation of the pixel values of the input image (on the left) with the feature detector/filter/kernel (in the middle) returns a feature map (on the right). Different feature maps will be generated by convolving several feature detectors/filters/kernels with the input image data.

Thus, we can see that convolution is a local operation. As we mentioned earlier, S modifies X and creates a new set of data points, plotted on the right. The convolution operation is commutative [Goodfellow et al., 2016], i.e. it can also be written as

$$C(i,j) = (X \circledast S)(i,j) = \sum_{m}\sum_{n} X(i-m, j-n)\, S(m,n). \tag{2.2}$$

Another variation of the convolution operation is

$$C(i,j) = (X \circledast S)(i,j) = \sum_{m}\sum_{n} X(i+m, j+n)\, S(m,n), \tag{2.3}$$

which is also referred to as cross-correlation in practical Deep Learning [Goodfellow et al., 2016].
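A small NumPy sketch of the sliding-window operation in Eq. (2.3) (the cross-correlation form that Deep Learning libraries typically implement), reproducing the stride-1 scan over a 7 × 7 input described above; the randomly generated matrices are placeholders, not the actual values from Fig. 2.4.

```python
import numpy as np

def conv2d(X, S, stride=1):
    """'Valid' cross-correlation of input X with kernel S, as in Eq. (2.3)."""
    kh, kw = S.shape
    out_h = (X.shape[0] - kh) // stride + 1
    out_w = (X.shape[1] - kw) // stride + 1
    C = np.zeros((out_h, out_w))
    for i in range(out_h):
        for j in range(out_w):
            patch = X[i * stride:i * stride + kh, j * stride:j * stride + kw]
            C[i, j] = np.sum(patch * S)   # elementwise product of the window, then sum
    return C

rng = np.random.default_rng(0)
X = rng.integers(0, 2, size=(7, 7))   # 7x7 binary "image" (placeholder values)
S = rng.integers(0, 2, size=(3, 3))   # 3x3 feature detector/kernel
print(conv2d(X, S))                    # 5x5 feature map
```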

2.3.2 Feature Detector/Kernel/Filter

In the Deep Learning literature, the matrix S that we project over X to perform the convolution operation is called a kernel, filter, or feature detector. As the term indicates, a feature detector matrix is applied to detect a feature in the data. In CNN, according to the most common approach, the dimension of the feature detector can take the form of a 3 × 3, 5 × 5, or 7 × 7 matrix. When applied to image data, a feature detector/kernel extracts information about a certain characteristic. For example, a feature detector can identify a specific shape that is associated with a tumor. Depending on the data, we deploy many kernels/filters/feature detectors to recognize all the important features that a dataset can contain. One useful side of a feature detector is that it reduces the size of the data, as shown in Fig. 2.4, and hence decreases the number of parameters required to estimate, without losing the key attributes of the initial input.

2.3.3 Feature Maps and Convolutional Layer

A feature map can be obtained by convolving the input data X with a feature detector or kernel S. Different kernels return different feature maps, and each of these maps contains a key feature of the subject. Thus, each of these feature maps holds a specific attribute of the input data. A neuron of a feature map is connected to an adjacent area of neurons in the previous layer. All feature maps together form the convolutional layer (see Fig. 2.5). At this stage, the convolutional layer is linear, since the maps are the outcome of a linear operation called convolution. Since images are mostly non-linear, we need to apply an activation function such as the Rectified Linear Unit (ReLU) [Nair and Hinton, 2010], sigmoid, or tanh [LeCun, Bottou, Orr and Müller, 1998] to introduce non-linearity into the CNN for the purpose of a better representation of the data. Depending on the dataset and the problem at hand, we may obtain multiple convolutional layers.

Figure 2.5: Convolutional layers formed from the input image through the convolution operation. A convolutional layer is a set of feature maps.

2.3.4 Pooling

We deploy pooling, which is a down-sampling technique, to capture the spatial invariance properties of the data. Suppose the shape or position of a brain tumor differs across images; the network can get confused or miss some key information about that tumor in such situations. The pooling operation tries to ensure that the NN does not miss any important information about the data. There are several types of pooling operations, such as mean pooling [Wang et al., 2012], max pooling [Boureau et al., 2010], sum pooling, etc. We apply the pooling operation on the convolutional layer(s) and in return benefit in three ways: first, we still hold the key features; second, the size of the data is reduced significantly; and third, overfitting can be prevented since the number of parameters has been reduced. These feature maps in a convolutional layer are called pooled feature maps.
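A sketch of 2 × 2 max pooling over a single feature map, the down-sampling step described above; the window size and stride are the common defaults, not necessarily those used in the experiments of this thesis.

```python
import numpy as np

def max_pool2d(feature_map, size=2, stride=2):
    """Max pooling: keep the largest value in each (size x size) window."""
    h, w = feature_map.shape
    out_h = (h - size) // stride + 1
    out_w = (w - size) // stride + 1
    pooled = np.zeros((out_h, out_w))
    for i in range(out_h):
        for j in range(out_w):
            window = feature_map[i * stride:i * stride + size,
                                 j * stride:j * stride + size]
            pooled[i, j] = window.max()
    return pooled

fm = np.arange(16).reshape(4, 4)   # toy 4x4 feature map
print(max_pool2d(fm))              # 2x2 pooled feature map
```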

2.3.5 Flattening and Fully Connected Layers

A pooled feature map is a multi-dimensional array, and we need to convert it into a one-dimensional array (see Fig. 2.6), since these are the data points that proceed to the input layer of a fully connected NN. From this point on, the CNN functions like the NN we discussed in section 2.1.7.

Figure 2.6: Multi-dimensional convolutional layers are transformed into one-dimensional vectors. These flattened layers are the input data of fully connected Neural Networks.

2.3.6 Summing Up

Having discussed the key components of CNN and their functionality, we can sum these up step by step altogether.

• Convolution Operation: We perform the convolution operation of input data with feature detector/kernel.

• Formation of Feature Maps and Convolutional Layers: Each feature detector returns a feature map, and each of these maps contains a key feature of a specific data point. Altogether, the feature maps form a convolutional layer. We implement the Rectified Linear Unit (ReLU) to capture the non-linear characteristics of the data.


• Pooling: By a down-sampling technique or pooling, we reduce the size of the data and hence the number of parameters while still grasping the key features of the data. These feature maps are known as pooled feature maps.

• Flattening: We convert the multi-dimensional pooled feature maps into one-dimensional arrays. These flattened layers are the input of the NN, i.e. we now enter the fully connected NN.

• Estimation of Error, Backpropagation, and Adjustment of Weights: We have obtained the input data to feed to the fully connected NN by flattening the pooled layers. From the beginning to this stage, we have transformed an image into a combination of vectors of numbers. Now, we initialize the NN by assigning random weights over each connection. Next, the data are fed to the succeeding layers for more modification. The network continues this procedure until it reaches the output layer and makes a prediction. After that, it estimates the error and backpropagates the error information. During the backpropagation, the network measures the contribution of the kernels and weights to the loss function by stochastic gradient descent. Finally, the network adjusts the kernels as well as the weights and repeats these steps for a specific number of iterations, and in each iteration it minimizes the error function L.

Figure 2.7: Architecture of Convolutional Neural Networks. The input images are first transformed into convolutional layers. Each feature map in the convolutional layers is then flattened and forwarded to the NN as the values of the independent variable xi. The output layer returns the estimated value of yi, i.e. ŷi. We then estimate the error, backpropagate the error information through the network, and adjust the weights as well as the kernels based on their contribution to the loss function L. The contribution to the loss is determined by mini-batch stochastic gradient descent. We repeat this process for a specific number of epochs/iterations until the loss function L is minimized.
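The pipeline of Fig. 2.7 can be written compactly with a high-level library. The following Keras sketch is purely illustrative: the layer counts, filter sizes, and input shape are placeholders, not the network configurations evaluated in chapter 5.

```python
import tensorflow as tf

# Convolution -> ReLU -> pooling (twice), then flattening and a fully connected classifier.
cnn = tf.keras.Sequential([
    tf.keras.layers.Conv2D(32, (3, 3), activation="relu", input_shape=(64, 64, 3)),
    tf.keras.layers.MaxPooling2D((2, 2)),
    tf.keras.layers.Conv2D(64, (3, 3), activation="relu"),
    tf.keras.layers.MaxPooling2D((2, 2)),
    tf.keras.layers.Flatten(),                        # pooled feature maps -> 1D vector
    tf.keras.layers.Dense(128, activation="relu"),    # fully connected hidden layer
    tf.keras.layers.Dense(1, activation="sigmoid"),   # binary output (e.g. ALL vs. non-ALL)
])

# Cross-entropy loss minimized by mini-batch stochastic gradient descent via backpropagation.
cnn.compile(optimizer=tf.keras.optimizers.SGD(learning_rate=0.01),
            loss="binary_crossentropy", metrics=["accuracy"])
cnn.summary()
```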


Chapter 3

Dropout - A Stochastic Regularization Technique

In this chapter, we discuss a frequently used regularization technique in Deep Learning, known as dropout. We also mention two different approaches that have been proposed for the test period. Note that in our experiments we deploy the approach suggested by Gal and Ghahramani [2015a], which is also discussed in chapter 4.

3.1 Implementation of Dropout

Dropout [Hinton et al., 2012; Srivastava et al., 2014] is a stochastic regularization technique commonly used in NN to prevent overfitting. NN possess a large number of parameters to explain the relationship between the independent variables and the dependent variable. Typically, during training, a network with a sufficiently large number of parameters trained on a small data set gives a low error rate. However, when we evaluate the model performance on test data, the performance of the model may diminish, i.e. the error rate is higher than it was during training. This is due to the fact that a network with a high number of parameters can easily adjust itself to the small data set. This phenomenon is commonly known as overfitting, which hinders the learning capability of a Deep Learning model. In practical fields, several techniques can be implemented to prevent overfitting in a Deep Learning model, including early stopping [Smale and Zhou, 2007], dropout [Hinton et al., 2012], data augmentation [Hinton et al., 2012], multiplicative Gaussian noise [Srivastava et al., 2014], dropconnect [Wan et al., 2013], stochastic pooling [Zeiler and Fergus, 2013], etc. With the dropout technique, during training, we randomly drop units from the network. As a consequence, the number of parameters is reduced, and the network has fewer parameters available to fit the data.

An efficient technique for decreasing the error rate could be to train several networks and then obtain the mean prediction by passing the test data through each of them. However, this approach comes at a high computational expense, especially for NN with a large number of units. An alternative in such a situation is to create different network structures with a reduced number of units by randomly dropping units with a specific probability, which is the core concept of dropout. With this technique, a unit is kept in the network with probability p or omitted with probability 1 − p (see Fig. 3.1).

Figure 3.1: On the left, fully connected Neural Networks with two hidden layers. On the right, the Neural Networks after dropping out units in the input layer as well as the hidden layers (red circled units). The networks now have fewer parameters to adapt to the dataset, which in turn forces them to learn the relationship between the independent variables and the dependent variable more appropriately. This "appropriate" learning reduces the risk of overfitting.

The value of p is usually set to 0.5 for hidden units and close to 1 for input units [Srivastava et al., 2014]. When dropout is implemented, a trained NN with K units can be regarded as the combination of $2^K$ trained "thinned" Neural Networks [Srivastava et al., 2014]. The input $x_i$ is passed through the modified network and, upon estimation of the error, backpropagation is performed with respect to the modified network. We continue this process many times, with different units dropped at random each time, and estimate the contribution of the weights to the error by stochastic gradient descent. In this way, a different set of weights is learned by the network in each iteration. Hence, this approach is equivalent to training different NN with non-identical sets of units. Since with p = 0.5 only about half of the units are active during training, we multiply each weight by the probability p at test time so that the expected output of each unit stays the same [Srivastava et al., 2014].

During testing, Gal and Ghahramani [2015b] suggested obtaining predictions from a number of sampled "thinned" networks (Monte Carlo samples from the $2^K$ possible networks) and then estimating the mean prediction, which they refer to as Monte Carlo Dropout. Note that this technique can be computationally expensive, since the test data must be passed through the network many times. On the other hand, Srivastava et al. [2014] suggested approximating the mean prediction by passing the test data once through the full network, with the parameters/weights multiplied by the probability p (see Fig. 3.2).
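The two test-time alternatives can be sketched as follows, assuming a Keras-style model that contains Dropout layers; the names `model`, `x_test` and the number of passes T are illustrative assumptions.

```python
# Monte Carlo dropout at test time: keep dropout active and average T stochastic passes.
import numpy as np

T = 100                                    # number of stochastic forward passes (illustrative)
mc_preds = np.stack([model(x_test, training=True).numpy() for _ in range(T)])

mean_pred = mc_preds.mean(axis=0)          # MC dropout prediction (Gal and Ghahramani [2015b])
pred_std  = mc_preds.std(axis=0)           # spread across passes, usable as an uncertainty estimate

# Weight-scaling alternative (Srivastava et al. [2014]): a single deterministic pass
# with dropout switched off (Keras handles the rescaling internally).
det_pred = model(x_test, training=False).numpy()
```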


3.2. DROPOUT OBJECTIVE 24

Figure 3.2: On the left, during training, a unit is retained with probability $p_i$ in the $i$-th layer and the corresponding weights are ω. At test time (right), we consider all units but multiply the corresponding weights by the probability $p_i$ (the approach suggested by Srivastava et al. [2014]). This yields the same expected output at test time as during training. The red lines in the right-hand image depict the connections that were dropped out before. Gal and Ghahramani [2015b] instead suggested keeping dropout active during testing and taking the average of the MC dropout samples.

3.2 Dropout Objective

Let us consider a NN with L layers and cross-entropy as the loss function. At the $l$-th layer the weight matrix $W_l$ has dimension $K_l \times K_{l-1}$ and the bias vector $b_l$ has dimension $K_l$. Also, we consider $x_i$ as the input (independent variable) and $y_i$ as the output (dependent variable) for $i = 1, 2, \ldots, N$ observations. The cross-entropy function estimates the error, i.e. the difference between $y_i$ and $\hat{y}_i$. In general, to prevent the NN from overfitting due to the large number of parameters, we add a regularization term to the loss function. Here, we consider ridge regression [Tikhonov, 1963], i.e. an L2 regularization term over all layers, with the regularization parameter λ dictating the magnitude of regularization. The optimization objective of a NN with dropout then takes the form
$$\mathcal{L}_{DO} = -\frac{1}{N}\sum_{i=1}^{N}\left[y_i \ln(\hat{y}_i) + (1 - y_i)\ln(1 - \hat{y}_i)\right] + \lambda \sum_{l=1}^{L}\left(\|W_l\|_2^2 + \|b_l\|_2^2\right) \tag{3.1}$$
$$= \text{Loss function} + \text{L2 Regularization}.$$
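A sketch of how the objective in equation (3.1) can be evaluated for a batch of predictions (plain NumPy with hypothetical inputs; in practice the L2 term is usually added through a weight-decay or kernel-regularizer option of the framework):

```python
# Evaluating the dropout objective L_DO = cross-entropy + L2 penalty (illustrative).
import numpy as np

def dropout_objective(y, y_hat, weights, biases, lam=1e-4, eps=1e-12):
    """y, y_hat: arrays of length N; weights, biases: lists of per-layer arrays."""
    y_hat = np.clip(y_hat, eps, 1 - eps)    # numerical safety for the logarithms
    cross_entropy = -np.mean(y * np.log(y_hat) + (1 - y) * np.log(1 - y_hat))
    l2_penalty = lam * (sum(np.sum(W ** 2) for W in weights)
                        + sum(np.sum(b ** 2) for b in biases))
    return cross_entropy + l2_penalty
```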

In the next chapter, we discuss dropout as approximate Variational Inference (VI) in the BNN as well as the BCNN setup.


Chapter 4

Bayesian Convolutional Neural Networks

In this chapter, we first discuss the theoretical approach to the Bayesian interpretation of CNN. Then, we associate dropout with approximate Variational Inference in both the Bayesian Convolutional Neural Networks (BCNN) and Bayesian Neural Networks (BNN) setup.

4.1 Bayesian Approach in Convolutional Neural Networks

Traditional CNN are prone to overfitting unless we have a sufficiently large dataset. One advantage of implementing the Bayesian approach in the CNN setup is that it reduces overfitting even on small data while also providing uncertainty estimates for the CNN.

In order to obtain uncertainty estimates in CNN, we place a prior distribution over the kernels. This approach had not been applied until recently, by Gal and Ghahramani [2015a], probably due to the complexity involved in the CNN structure.

After placing the prior distribution over the kernels, the posterior distribution becomes
$$p(\omega|X, Y) = \frac{p(Y|\omega, X)\, p(\omega)}{p(Y|X)}. \tag{4.1}$$

The presence of the normalizing constant $p(Y|X)$ in equation (4.1) makes it difficult to estimate the posterior distribution of the kernels. To approximate the posterior, we use the Variational Inference (VI) technique, which we discuss in section 4.3. With the VI technique, we approximate the true posterior distribution by a rather simpler variational distribution, minimizing the Kullback-Leibler (KL) divergence between them, which we briefly discuss in section 4.4. In general, in BNN models the most commonly used variational distribution is the normal distribution, which makes the estimation process computationally expensive due to the large number of parameters while not contributing to an improvement in model performance [Blundell et al., 2015]. Gal and Ghahramani [2015a] proposed using the Bernoulli distribution as the variational distribution when approximating the true posterior in the BCNN. This, in turn, makes the estimation of the posterior distribution less computationally expensive and produces coherent outcomes. Gal and Ghahramani [2016] proved that a NN with dropout applied after each input and hidden layer has the same optimization objective as a Gaussian process, namely minimizing the KL divergence or, equivalently, maximizing the Evidence Lower Bound (ELBO). That is, they showed that dropout minimizes the KL divergence. In this thesis, we utilize this stochastic regularization technique, dropout [Hinton et al., 2012; Srivastava et al., 2014], to approximate Variational Inference in the BCNN setup, which we discuss in the remainder of this chapter.
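To illustrate the Bernoulli variational distribution over the kernels, the following NumPy sketch (with a hypothetical kernel shape and keep-probability) draws one sample of a convolution kernel from q(ω|θ) by multiplying the variational parameters with a Bernoulli mask. In Gal and Ghahramani's construction the mask is applied to the inputs of each layer (equivalently, to rows of the weight matrix); the element-wise mask here is a simplification for illustration only.

```python
# Sampling a convolution kernel from a Bernoulli variational distribution (illustrative).
import numpy as np

rng = np.random.default_rng(0)
M = rng.normal(size=(3, 3, 3, 32))        # variational parameters theta (one per kernel entry)
p = 0.5                                   # Bernoulli keep-probability

def sample_kernel():
    z = rng.binomial(1, p, size=M.shape)  # z ~ Bernoulli(p)
    return M * z                          # one draw omega_t ~ q(omega | theta)

# Each stochastic forward pass with dropout corresponds to using a different draw,
# so averaging many passes approximates integration over q(omega | theta).
```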

4.2 Approximation of Posterior Distributions

Theoretically, the posterior distribution in equation (4.1) can sometimes be relatively easy to compute analytically, and we can obtain it in closed form. For instance, with a conjugate prior, we can compute the posterior distribution with ease. However, a BCNN holds thousands or even millions of parameters. In such cases, the models are complex and the dimensionality is extremely high. Furthermore, the normalizing constant $p(Y|X)$ makes it almost impossible to compute the posterior distribution exactly, so we need to look for alternatives that approximate the posterior distribution in equation (4.1). Theoretically, there are several methods to approximate the posterior distribution, including the following.

4.2.1 Normal Approximation

When the data are sufficiently large, we can approximate the posterior distribution by a normal distribution as
$$p(\omega|X, Y) \approx N\!\left(\hat{\omega},\, [I(\hat{\omega})]^{-1}\right), \quad \text{where } I(\hat{\omega}) = -\frac{d^2}{d\omega^2} \log p(\omega|X, Y)\Big|_{\omega = \hat{\omega}}.$$
First, we find the mode $\hat{\omega}$ by setting the first derivative of the log posterior with respect to ω equal to zero; this mode is the mean of the approximating normal distribution. Then, the negative second derivative evaluated at the mode $\omega = \hat{\omega}$ gives $I(\hat{\omega})$, whose inverse is the variance of the normal distribution.
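As a small one-dimensional illustration of this normal (Laplace-type) approximation, the sketch below uses a hypothetical Beta(8, 4) posterior for a single parameter; the posterior, optimizer settings, and step size are assumptions for the example only.

```python
# Normal approximation of a 1-D posterior at its mode (illustrative).
import numpy as np
from scipy import optimize, stats

log_post = lambda w: stats.beta(8, 4).logpdf(w)      # stands in for log p(omega | X, Y)

# 1) Find the mode omega_hat by maximizing the log posterior.
res = optimize.minimize_scalar(lambda w: -log_post(w),
                               bounds=(1e-4, 1 - 1e-4), method="bounded")
omega_hat = res.x

# 2) Negative second derivative at the mode (numerical central difference).
h = 1e-5
second = (log_post(omega_hat + h) - 2 * log_post(omega_hat) + log_post(omega_hat - h)) / h ** 2
I_hat = -second

# p(omega | X, Y) is approximated by N(omega_hat, 1 / I_hat).
print(f"mean = {omega_hat:.4f}, variance = {1 / I_hat:.5f}")
```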

4.2.2 Variational Approximation

In this method, we approximate the posterior distribution $p(\omega|X, Y)$ by a class of relatively simple parameterized distributions $q(\omega|\theta)$, also known as the variational distribution. $q(\omega|\theta)$ is variational in the sense that, for every value of the parameter θ, the density of $q(\omega|\theta)$ changes. By minimizing the KL divergence [Kullback and Leibler, 1951], we make the variational distribution $q(\omega|\theta)$ approximate the true posterior distribution $p(\omega|X, Y)$. We start with a random value of θ and update it in each iteration in such a way that the KL divergence is reduced. When θ no longer produces any change in the KL divergence, we take $q(\omega|\theta)$ for that value of θ as the approximation of the posterior distribution $p(\omega|X, Y)$ [Gelman et al., 2014]. The KL divergence can be defined as

$$\mathrm{KL}\left[q(\omega|\theta)\,\|\,p(\omega|X, Y)\right] = -\mathbb{E}_{q(\omega|\theta)}\!\left[\log\!\left(\frac{p(\omega|X, Y)}{q(\omega|\theta)}\right)\right] \tag{4.2}$$
$$= -\int \log\!\left(\frac{p(\omega|X, Y)}{q(\omega|\theta)}\right) q(\omega|\theta)\, d\omega = \int \log\!\left(\frac{q(\omega|\theta)}{p(\omega|X, Y)}\right) q(\omega|\theta)\, d\omega. \tag{4.3}$$
In short, we pull the variational distribution $q(\omega|\theta)$ towards the true posterior distribution $p(\omega|X, Y)$ by minimizing the difference between them.

Upon obtaining the approximate posterior distribution, i.e. $p(\omega|X, Y) \approx q(\omega|\theta)$, it can be utilized to estimate the posterior predictive distribution as
$$p(y|x, X, Y) \approx \int p(y|x, \omega)\, q(\omega|\theta)\, d\omega. \tag{4.4}$$
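In a BCNN this integral has no closed form. A standard Monte Carlo approximation (stated here for completeness, not numbered in the original) draws $T$ samples $\hat{\omega}_t \sim q(\omega|\theta)$, for example via stochastic forward passes with dropout kept active, and averages:
$$p(y|x, X, Y) \approx \frac{1}{T}\sum_{t=1}^{T} p(y|x, \hat{\omega}_t), \qquad \hat{\omega}_t \sim q(\omega|\theta).$$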

4.3 Approximate Variational Inference

The objective of Variational Inference (VI) is to minimize the KL divergence. Equation (4.2) can be rewritten as
$$\begin{aligned}
\mathrm{KL}\left[q(\omega|\theta)\,\|\,p(\omega|X, Y)\right] &= -\mathbb{E}_{q(\omega|\theta)}\!\left[\log\!\left(\frac{p(\omega|X, Y)}{q(\omega|\theta)}\right)\right] \\
&= \mathbb{E}_{q(\omega|\theta)}[\log q(\omega|\theta)] - \mathbb{E}_{q(\omega|\theta)}[\log p(\omega|X, Y)] \\
&= \mathbb{E}_{q(\omega|\theta)}[\log q(\omega|\theta)] - \mathbb{E}_{q(\omega|\theta)}\!\left[\log\!\left(\frac{p(\omega, Y)}{p(Y|X)}\right)\right] \\
&= \mathbb{E}_{q(\omega|\theta)}[\log q(\omega|\theta)] - \mathbb{E}_{q(\omega|\theta)}[\log p(\omega, Y) - \log p(Y|X)] \\
&= \mathbb{E}_{q(\omega|\theta)}[\log q(\omega|\theta)] - \mathbb{E}_{q(\omega|\theta)}[\log p(\omega, Y)] + \log p(Y|X).
\end{aligned} \tag{4.5}$$

In practice, minimization of the KL divergence is difficult to achieve directly due to its dependence on the evidence term $\log p(Y|X)$ in equation (4.5). In such conditions, the objective can also be expressed in terms of the Evidence Lower Bound (ELBO), since maximization of the ELBO with respect to θ is equivalent to minimization of the KL divergence [Bishop, 2006; Blei et al., 2017]. Since $\mathrm{KL}\left[q(\omega|\theta)\,\|\,p(\omega|X, Y)\right] \geq 0$ [Kullback and Leibler, 1951; Jordan et al., 1999], equation (4.5) implies
$$\mathbb{E}_{q(\omega|\theta)}[\log q(\omega|\theta)] - \mathbb{E}_{q(\omega|\theta)}[\log p(\omega, Y)] + \log p(Y|X) \geq 0,$$
that is,
$$\log p(Y|X) \geq \mathbb{E}_{q(\omega|\theta)}[\log p(\omega, Y)] - \mathbb{E}_{q(\omega|\theta)}[\log q(\omega|\theta)].$$

We can obtain the ELBO by taking the negative of the KL divergence and then adding the evidence term $\log p(Y|X)$, which is constant with respect to $q(\omega|\theta)$ [Blei et al., 2017]. Then the ELBO can be defined as,
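$$\mathrm{ELBO}(\theta) = \mathbb{E}_{q(\omega|\theta)}[\log p(\omega, Y)] - \mathbb{E}_{q(\omega|\theta)}[\log q(\omega|\theta)],$$
which follows directly from equation (4.5): $\log p(Y|X) = \mathrm{ELBO}(\theta) + \mathrm{KL}\left[q(\omega|\theta)\,\|\,p(\omega|X, Y)\right] \geq \mathrm{ELBO}(\theta)$, so maximizing the ELBO with respect to θ is equivalent to minimizing the KL divergence.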
