
DEGREE PROJECT IN INFORMATION AND COMMUNICATION TECHNOLOGY,
SECOND CYCLE, 30 CREDITS
STOCKHOLM, SWEDEN 2018

Generative adversarial networks for single image super resolution in microscopy images

SAURABH GAWANDE


Generative adversarial networks for single image super resolution in microscopy images

SAURABH GAWANDE

Master’s Thesis at KTH Information and Communication Technology
Supervisor: Mihhail Matskin
Examiner: Dr. Anne Håkansson
Industrial Supervisor: Dr. Kevin Smith

TRITA-EECS-EX-2018:10


Abstract

Image super-resolution is a widely studied problem in computer vision, where the objective is to convert a low-resolution image into a high-resolution image. Conventional methods for achieving super-resolution, such as image priors, interpolation, and sparse coding, require a lot of pre/post-processing and optimization. Recently, deep learning methods such as convolutional neural networks and generative adversarial networks have been used to perform super-resolution with results competitive with the state of the art, but none of them have been applied to microscopy images. In this thesis, a generative adversarial network, mSRGAN, is proposed for super-resolution with a perceptual loss function consisting of an adversarial loss, a mean squared error, and a content loss. The objective of our implementation is to learn an end-to-end mapping between the low/high-resolution images and to optimize the upscaled image for quantitative metrics as well as perceptual quality. We then compare our results with the current state-of-the-art methods in super-resolution, conduct a proof-of-concept segmentation study to show that super-resolved images can be used as an effective pre-processing step before segmentation, and validate the findings statistically.

Keywords: Deep Learning, Generative adversarial networks, Super resolution, High content screening microscopy



Acknowledgements

First of all, I would like to express my sincerest gratitude to Dr. Kevin Smith for giving me the opportunity to work on this exciting topic and without whom this work would not have materialized. Having a guide like Kevin was truly a blessing, and I could not have wished for a better mentor. Thank you, Kevin, for always being patient with me, pushing me to go the extra mile, and always making time for me despite your hectic schedule. I would like to thank Dr. Hossein Azizpour for his continuous feedback and ideas, for always being available to clear my doubts no matter how naive, and for serving as a beacon of inspiration. I am also thankful to my examiner Dr. Anne Håkansson for providing me the support I needed to stay on track and for helping me maintain the scientific quality of this work.

Last but not least, no amount of thanks will ever be enough for my parents, who have loved, supported and cared for me unconditionally throughout my tumultuous and protracted journey.

May all your minima always be local!

Tack!


Contents

Abbreviations

1 Introduction
   1.1 Image Super-resolution
   1.2 Background
   1.3 Problem
   1.4 Purpose and Goal
   1.5 Ethics and Sustainability
   1.6 Methodology
   1.7 Delimitations
   1.8 Outline
   1.9 Contributions

2 Relevant Theory
   2.1 Background knowledge
      2.1.1 Definitions
      2.1.2 Strategies to increase image resolution
      2.1.3 Evaluation metric for Super-Resolution
   2.2 Neural Networks
      2.2.1 Convolutional Neural Networks
      2.2.2 Generative Adversarial Networks
   2.3 Literature Study
      2.3.1 Traditional Single Image super resolution
      2.3.2 Deep Learning based Single Image super resolution

3 Motivation
   3.1 HCS microscopy problems in image acquisition
      3.1.1 Photo bleaching
      3.1.2 Bleed through/ Crosstalk
      3.1.3 Phototoxicity
      3.1.4 Uneven illumination
      3.1.5 Color and contrast errors
   3.2 Inefficiency of pixel wise M.S.E
   3.3 Feature transferability issues in CNN's for distant source and target domains

4 Methods
   4.1 Research Methods
   4.2 Mathematical formulation
   4.3 Generative Adversarial Network architecture
   4.4 Loss functions
      4.4.1 Perceptual Loss
      4.4.2 Pixel wise Mean squared error
      4.4.3 Content Loss
      4.4.4 Adversarial Loss
      4.4.5 Flowchart
   4.5 Data Acquisition
      4.5.1 Data processing

5 Experiments and Results
   5.1 mSRGAN
   5.2 mSRGAN - VGG2
   5.3 mSRGAN - VGG5
   5.4 SRRESNET
   5.5 mSRGAN - CL (Only content loss)
   5.6 SRGAN
   5.7 Nuclei Segmentation
      5.7.1 Statistical Validation

6 Discussion
   6.1 mSRGAN vs SRGAN (Evaluating Hypothesis 1)
   6.2 Content loss vs M.S.E (Evaluating hypothesis 2)
      6.2.1 Effect of Different VGG layers
      6.2.2 PSNR variation for mSRGAN variants
   6.3 Segmentation results
   6.4 GAN failures
   6.5 Checkerboard Artifacts
   6.6 Lack of training data

7 Conclusion
   7.1 Future Work

Bibliography

Appendices

A More Results


Abbreviations

α     weight coefficient for M.S.E
β     weight coefficient for content loss
CNN   Convolutional neural network
CT    Computed tomography
DL    Deep Learning
GAN   Generative adversarial network
HCS   High content screening
HR    High resolution
HVS   Human visual system
LR    Low resolution
MRI   Magnetic resonance imaging
MSE   Mean squared error
psnr  Peak signal-to-noise ratio
SC    Sparse coding
SGD   Stochastic gradient descent
SISR  Single image super resolution
SR    Super resolution


Chapter 1

Introduction

In this thesis project, we explore the use of generative adversarial networks for performing single image super resolution on high content screening microscopy images.

The project was carried out within the Bioimage Informatics Facility at the Science for Life Laboratory, Sweden.

1.1 Image Super-resolution

In most digital imaging applications, high-resolution images are preferred and often required to accomplish tasks. Image super-resolution (SR) is a widely studied problem in computer vision, where the objective is to generate one or more high-resolution images from one or more low-resolution images. An SR algorithm aims to produce details finer than the sampling grid of a given imaging device by increasing the number of pixels per unit area in an image. SR is a well-known ill-posed inverse problem, where a high-resolution image is restored from a low-resolution image (usually corrupted by noise, motion blur, aliasing, optical distortion, etc.) [1] [2].

SR techniques can be applied in many scenarios where multiple frames of a single scene can be obtained, e.g., multiple images of the same object captured by a single camera, or various images of a scene available from numerous sources (numerous cameras capturing a single scene from various locations).

SR has applications in varied fields such as satellite imaging (e.g., remote sensing), where several images of a single area are available; in security and surveillance, where it may be required to enlarge a particular point of interest in a scene (such as zooming in on the face of a criminal or the numbers of a license plate); and in computer vision, where it can improve the performance of pattern recognition and other areas such as facial image analysis, text image analysis, biometric identification, fingerprint image enhancement, etc. [1].

SR is of particular importance in medical imaging, where more detailed images are required on demand, and high-resolution medical images can aid doctors in making a correct diagnosis, e.g., in Computed tomography (CT) and Magnetic resonance imaging (MRI), where the acquisition of multiple images is possible albeit with limited resolution.

1.2 Background

Convolutional neural networks (CNNs) have been in existence for a long time [2], and recently deep CNNs have shown an upsurge in popularity due to their various successes in image classification tasks, one of them being the ImageNet Large Scale Visual Recognition Challenge, a benchmark in object classification and detection consisting of millions of images and thousands of classes [3]. CNNs have also been applied to other sub-problems of computer vision such as object detection [4], face recognition [5] and pedestrian detection [6]. Various factors are instrumental in the progress and effectiveness of CNNs, such as a) the advent of more powerful Graphics Processing Units [3], which make it easier to train complex models on large datasets, b) the exponential increase in the amount of big data, which helps in training large models and getting more accurate results, and c) the proposal in the machine learning community of various activation functions such as ReLU and LeakyReLU, which help a CNN model converge faster, maintain high accuracy and avoid overfitting [7].

Generative models, particularly GANs, have shown remarkable results in image generation applications such as super-resolution [8], generating art, image-to-image translation [9], etc. One advantage of GANs is the adversarial loss component, which allows them to work well with multi-modal outputs, e.g., in image generation tasks where an input can have multiple acceptable correct answers. Traditional machine learning methods use pixel-wise mean squared error as the optimization objective and are hence not able to produce multiple correct outputs. GANs excel in tasks which require generating samples resembling a particular distribution. One such task is super-resolution, where a high-resolution equivalent has to be estimated from a low-resolution image, and multiple high-resolution images corresponding to a single low-resolution image are possible [9].

Image restoration and denoising techniques deal with accounting for noise and other disturbances to recover a less degraded image from the original image. Super-resolution and image restoration are theoretically quite similar, differing in that super-resolution produces an upscaled, noise-free image of the original one. There has been considerable work on image restoration using deep learning methods to achieve image denoising. Burger et al. have applied the multi-layer perceptron to natural image denoising and post-blurring denoising [10]. Jain et al. have used CNNs for natural image denoising and for removing noise patterns such as rain/dirt [11]. Cui et al. [12] have proposed including auto-encoders in their super-resolution pipeline based on an internal example-based approach. Christian et al. [8] most recently reported state-of-the-art results in super-resolution using a generative adversarial network.

These recent developments show that there is a lot of potential in applying deep learning (especially deep CNNs and GANs) to image SR and achieving results competitive with or better than the state-of-the-art methods in image SR.

1.3 Problem

The problem of generating a high-resolution (HR) image from a low-resolution (LR) image is an underdetermined inverse problem which does not have a unique solution. This is made worse by the fact that a variety of different solutions exist for any given low-resolution pixel.

While capturing a digital image, there is a significant loss of spatial resolution caused by optical distortions, motion blur due to limited shutter speed, and noise that occurs within the sensor or during transmission, resulting in significant differences between the original scene and the captured scene. So, apart from scaling the low-resolution image, an SR algorithm also needs to account for these factors. Standard image acquisition errors aside, microscopy images are more susceptible to problems like photobleaching, crosstalk, phototoxicity, etc. (discussed in more detail in section 3.1).

The most common diagnostic errors in biomedical imaging are missed diagnoses, compared to those that were late or incorrect. Many patients are misdiagnosed from CT scans, mammograms, MRIs, etc. These misdiagnoses partly occur on account of observer error (due to unclear images and multiple psychophysiological factors, including the level of observer alertness, observer fatigue, and the duration of the observation task) and perceptual errors (failure to detect an abnormality in images).

1.4 Purpose and Goal

This thesis will investigate whether deep learning (generative adversarial networks, convolutional neural networks) can be used to achieve better results for super-resolution in high content screening microscopy images compared to conventional methods.

The end goals of this project are to -

• Propose and implement a deep learning method for single microscopic image super-resolution (SR) that directly learns an end-to-end mapping between the low/high-resolution images, takes a low-resolution microscopic image as the input and outputs the high-resolution one.

• Evaluate the HR images produced by the method against the original LR images, compare the obtained results with the state-of-the-art methods, and discuss the suitability of deep learning for achieving SR in microscopy images.

1.5 Ethics and Sustainability

This work will benefit the biomedical research community in general and patients in particular, since this technique might help doctors/clinicians reduce the number of misdiagnoses. Our use of publicly available datasets assuages any privacy concerns there might be. The results are reported as they were obtained, without any manipulation, and appropriate sources are credited wherever possible to avoid plagiarism.

The biggest risks we foresee at the moment are -

• Acquisition of medical image data - Medical data, especially acquired images, might be filled with noise and artifacts owing to instrument acquisition errors, faulty human handling, etc. This factor should be taken into account while performing the data analysis and interpreting the results, as faulty conclusions might lead to incorrect diagnoses, causing harm to patients.

• Adoption of algorithms in real environments - Clinicians prefer to use the raw data/images produced from experiments without any processing, and hence there is a possibility that they might not adopt our technique in practice.

1.6 Methodology

Since this thesis targets research and no known deep learning technique has been applied to microscopic images, the use of generative adversarial networks (which have reported state-of-the-art results for super-resolution on natural images) is explored for super-resolution on microscopic images. Moreover, acquiring microscopic images comes with a host of challenges (described in detail in section 3.1) leading to noisy captured images, emphasizing the need for super-resolution techniques. I propose a generative adversarial network, mSRGAN, for performing super-resolution exclusively on microscopic images, inspired by Christian et al.'s SRGAN [8], which performs super-resolution on natural images optimized for perceptual quality.

First, an extensive literature survey was conducted to investigate the classical and the more recent deep learning based approaches for super-resolution. The objective of this study was to get acquainted with the algorithms used for super-resolution, make use of the latest state-of-the-art techniques, and learn about the main challenges in the research area. Through the study, it was found that deep learning based approaches for super-resolution have shown promise, with some of them breaking the state-of-the-art results. One drawback, however, is that the resulting super-resolved images are not visually pleasing to a human observer, and the main reason for this is the use of pixel-wise mean squared error as the optimization function to generate the super-resolved image.

To offset the lack of visual quality, and building on the work by Christian et al. [8], I propose a novel perceptual loss function which makes use of a pixel-wise M.S.E and a content loss (minimizing the distance between feature representations of images stored in a mini VGG19 network). An adversarial network similar to SRGAN by Christian et al. [8] is used, with the exception that a mini VGG19 network trained from scratch on microscopic images is proposed instead of a pre-trained VGG19 trained on ImageNet images. Using a weighted combination of pixel-wise M.S.E and content loss makes sure that the generated images benefit from the strong points of both losses. The generated image quality is measured by psnr and compared with bicubic interpolation and SRGAN. Finally, to test the applicability of our mSRGAN model in real-life applications, a nuclei segmentation study is performed, and the segmentation performance is measured by the Dice coefficient and validated further using statistical tests.

1.7 Delimitations

In this project, due to time and resource constraints, we do not conduct a comprehensive qualitative study on the quality of the images generated by super-resolution. The visual quality of the images was evaluated only by the author. How good an image looks is a very subjective matter: some images may look pleasing to one person while not being so pleasing to others. Thus the observations in the report about the visual quality of the images are prone to the author's biases.

1.8 Outline

The rest of the report is organized as follows - In chapter 2 we give an overview of the theory and concepts that are essential to understanding the work done in the rest of the project; this chapter also includes a section on related work. The motivation for conducting this work is presented in chapter 3. We introduce the model and methods used in this project in chapter 4. Then we present our experiments, results, and evaluation in chapter 5, followed by a discussion of the methods, experiments, and results in chapter 6. Finally, we conclude the report in chapter 7 and give some ideas for future work. The contributions of this work are highlighted next, in section 1.9.

1.9 Contributions

The main contributions of this work are as follows -

• We propose the first generative adversarial network, mSRGAN, to perform SR on microscopic images optimized for visual quality. We integrate the traditional pixel-wise M.S.E with a loss calculated on the feature representations of a mini VGG19 network trained from scratch on fluorescent microscopic images.

• We then conduct a proof-of-concept nuclei segmentation study on the super-resolved, bicubic interpolated and ground truth images. Using the Dice coefficient and statistical validation, we demonstrate that super-resolution by mSRGAN improves segmentation performance compared to bicubic interpolated SR images and can be used as an effective pre-processing step for performing nuclei segmentation.


Chapter 2

Relevant Theory

This chapter begins with subchapters 2.1 and 2.2, aimed at providing the reader with sufficient background knowledge to understand the various technical concepts used in the thesis. Then a review of related work on traditional (non-deep-learning) as well as deep learning methods for single image super-resolution is given in subchapter 2.3.

2.1 Background knowledge

2.1.1 Definitions

Image Resolution - The term resolution in image processing corresponds to the amount of information contained in an image and can be used to judge the quality of the image and of image acquisition/processing devices. Resolution can be classified into several categories such as pixel or spatial resolution, spectral resolution, temporal resolution, and radiometric resolution. In this project we deal with spatial resolution, and the term resolution used henceforth implies spatial resolution. Spatial resolution is the number of pixels used to construct the image, measured as the number of pixel columns (width) × the number of pixel rows (height), e.g., 800 × 600.


Figure 2.1: The image on the left (L) has a higher spatial resolution than the one on the right (R).

Pixels - Pixels are the smallest addressable parts of an image. Each image can be considered as a matrix of pixel values. Every pixel stores a value proportional to the light intensity at a particular location; for an 8-bit grayscale image, a pixel can take values from 0 to 255.

Low resolution - A low-resolution image implies that the pixel density of the image is small, thereby giving fewer details.

High resolution - A high-resolution image implies that the pixel density of the image is high, leading to more details.

Super-resolution - SR is the construction of an HR image from a single or multiple LR images.

Super-resolution methods fall into two categories based on the number of images involved: a) multiframe super-resolution and b) single image super-resolution.

Multiframe super-resolution - This method utilizes multiple LR images to reconstruct an HR image. These images can come from various cameras at separate locations capturing the scene, or from several pictures of the same scene. The multiple input LR images contain more or less the same information; the information of interest is the subpixel shifts that occur due to movement of objects, scene shifts, and motion in imaging systems (e.g., satellites). If the different LR input images have different subpixel shifts, then this unique information contained in each LR image can be leveraged to reconstruct a good HR image [13].


Single image super-resolution (SISR) - In SISR, the super-resolving algorithm is applied to only one input image. Since in most cases there is no underlying ground truth, the significant issue is to create an acceptable image. The majority of SISR algorithms employ learning algorithms to hallucinate the missing details of the output HR image, utilizing the relationship between LR and HR images from a training database.

The SR reconstruction problem can be formulated in terms of an observation model [1], as shown in Figure 2.2, which relates the HR image to the input LR images.

Figure 2.2: Observation model between an LR and HR image for a real imaging system. First, the desired HR image is produced by sampling a continuous signal; it is then subjected to translation and rotation, and to blurring caused by optics, motion, imaging system movement, etc. Finally, the LR observation images are obtained by downsampling the blurred image.
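For illustration, the observation model of Figure 2.2 can be simulated in a few lines of code. The NumPy/SciPy sketch below assumes a Gaussian blur kernel, plain decimation for downsampling, and additive Gaussian noise; the kernel width, scale factor, and noise level are arbitrary choices for the example, not values from the thesis.

```python
import numpy as np
from scipy.ndimage import gaussian_filter

def degrade(hr, scale=4, blur_sigma=1.5, noise_sigma=2.0, rng=None):
    """Simulate the LR observation model: blur the HR image,
    downsample it by decimation, and add sensor noise."""
    rng = np.random.default_rng(0) if rng is None else rng
    blurred = gaussian_filter(hr.astype(np.float64), sigma=blur_sigma)
    lr = blurred[::scale, ::scale]                      # downsampling
    lr = lr + rng.normal(0.0, noise_sigma, lr.shape)    # additive noise
    return np.clip(lr, 0, 255).astype(np.uint8)

# Example: a synthetic 256x256 HR image becomes a noisy 64x64 LR observation.
hr = (np.indices((256, 256)).sum(axis=0) % 256).astype(np.uint8)
lr = degrade(hr)
print(hr.shape, lr.shape)   # (256, 256) (64, 64)
```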

2.1.2 Strategies to increase image resolution

The resolution of an image can be increased either by improving the hardware capabilities of imaging devices or by using a software/algorithmic approach.

• Hardware Approach - One direct way to increase spatial resolution is to increase the number of pixels per unit area by reducing the pixel size through sensor manufacturing techniques [14]. But reducing the pixel size beyond a threshold (which is already reached by current technologies) leads to shot noise, as less light is available to each of the smaller pixels, degrading image quality severely. Another way to enhance spatial resolution is to increase the sensor chip size, but this increases the capacitance, which adversely affects the charge transfer rate [1]. Also, the high cost of high-precision optics and sensors hinders the adoption of these approaches in commercial solutions.

• Software Approach - To avoid the disadvantages of the hardware-based approaches mentioned above, software and algorithmic methods (i.e., SR algorithms) are preferred. Techniques such as image interpolation, restoration, and rendering are widely used to enhance spatial resolution. Image interpolation approximates the color and intensity of a pixel based on the neighboring pixel values, but fails to reconstruct high-frequency details, as noise is introduced into the HR image. Image restoration works by deblurring, sharpening, and removing sources of corruption such as motion blur, noise, camera misfocus, etc., keeping the size of the input and output images the same. In image rendering, a model of an HR scene with imaging parameters is given, which is used to predict the HR observation of the camera. Image super-resolution is a signal processing technique which uses single/multiple LR images to construct an HR image [15] [16]. Apart from costing less than the hardware-based approaches, SR techniques can be applied to existing imaging systems.

2.1.3 Evaluation metric for Super-Resolution

Peak signal to noise ratio (psnr) - psnr is a metric used to measure the quality of a reconstructed/restored image with respect to its reference or ground truth image.

For a given noise-free m × n monochrome image I and its noisy approximation K, the mean squared error is given by

$$\mathrm{MSE} = \frac{1}{mn}\sum_{i=0}^{m-1}\sum_{j=0}^{n-1}\left[I(i,j) - K(i,j)\right]^{2} \qquad (2.1)$$

and the psnr is given by

$$\mathrm{psnr} = 10\log_{10}\!\left(\frac{\mathrm{MAX}_{I}^{2}}{\mathrm{MSE}}\right) \qquad (2.2)$$

where $\mathrm{MAX}_{I}$ is the maximum possible pixel value of the image.
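Both metrics translate directly into a few lines of NumPy; the following sketch assumes 8-bit images ($\mathrm{MAX}_I$ = 255) by default.

```python
import numpy as np

def mse(ref, test):
    """Pixel-wise mean squared error between two same-sized images (Eq. 2.1)."""
    ref, test = ref.astype(np.float64), test.astype(np.float64)
    return np.mean((ref - test) ** 2)

def psnr(ref, test, max_val=255.0):
    """Peak signal-to-noise ratio in dB (Eq. 2.2); max_val is MAX_I,
    e.g. 255 for 8-bit images."""
    err = mse(ref, test)
    if err == 0:
        return float("inf")   # identical images
    return 10.0 * np.log10(max_val ** 2 / err)
```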

2.2 Neural Networks

A neural network consists of an input layer, an output layer, and at least one intermediate layer called a hidden layer. The layers consist of units called neurons, which are connected to the neurons of the preceding layer in a directed acyclic graph, i.e., the neuron outputs from the previous layer become the neuron inputs in the next layer. The most common layer type used in neural networks is the fully connected layer, wherein all the neurons in adjacent layers are pairwise connected with each other and connections between neurons of the same layer are prohibited [17].

Figure 2.3: A graphical representation of a neuron with three inputs (Input1, Input2, Input3), their corresponding weights (Weight1, Weight2, Weight3), an activation function, and the resulting output.

As shown in Fig 2.3, a neuron computes the weighted sum of its inputs and a bias, to which a linear/nonlinear activation function is then applied.

To represent the process formally: for given inputs $x_i$ with respective weights $w_{ij}$, a neuron $y_j$ computes the weighted sum of its inputs along with the bias $b_j$ and applies an activation function $f$ to the whole sum, as shown below -

$$y_j = f\left(\sum_i w_{ij} x_i + b_j\right) \qquad (2.3)$$

The activation function $f$ introduces nonlinearity into the output of neuron $y_j$. This comes in handy since we want the network to account for nonlinear patterns in the data, and most real-world data has a nonlinear structure.

Some of the commonly used activation functions are -

• Sigmoid - The sigmoid function is given by

$$\sigma(z) = \frac{1}{1 + e^{-z}} \qquad (2.4)$$


• Tanh - The tanh function is given by

$$\tanh(z) = \frac{\sinh(z)}{\cosh(z)} = \frac{e^{z} - e^{-z}}{e^{z} + e^{-z}} \qquad (2.5)$$

Figure 2.4: Visual representation of the tanh activation function

• ReLU - The ReLU function is given by

$$\mathrm{ReLU}(z) = \max(0, z) \qquad (2.6)$$

Figure 2.5: Visual representation of the ReLU activation function
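The definitions above translate directly into code. The following NumPy sketch implements Equations 2.3-2.6; the example inputs and weights are arbitrary illustrative values.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))   # Eq. 2.4

def tanh(z):
    return np.tanh(z)                  # Eq. 2.5

def relu(z):
    return np.maximum(0.0, z)          # Eq. 2.6

def neuron(x, w, b, f=relu):
    """One neuron (Eq. 2.3): weighted sum of inputs plus bias,
    passed through an activation function f."""
    return f(np.dot(w, x) + b)

x = np.array([0.5, -1.0, 2.0])   # Input1..Input3
w = np.array([0.1, 0.4, -0.2])   # Weight1..Weight3
print(neuron(x, w, b=0.05))
```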


2.2.1 Convolutional Neural Networks

Convolutional neural networks are a category of neural networks that have proven to be remarkably effective in computer vision and classification applications such as object detection, self-driving cars, super-resolution, etc. [3] [4] [5] [6]. A convolutional layer consists of two-dimensional filters/kernels. The idea is to organize the neurons into units whose inputs come from local neighborhoods in the image, which gives rise to these filters. The filters are learned during the training of the algorithm, unlike the handcrafted features used in conventional machine learning algorithms. This operation is similar to the standard mathematical concept of convolution and is named after it. The learned filters are convolved with the input image, and the resulting feature responses are passed to the next processing layer as input. Neural networks that have such convolutional layers as cascaded stacks are called deep convolutional neural networks [17]. Some well-known architectures are AlexNet, which uses five convolutional layers and won the best recognition performance at ILSVRC 2012, and ResNet, a 152-layer deep residual network and winner of the best performance at ILSVRC 2015, which consists almost entirely of convolutional layers [3].

Figure 2.6: Visual representation of a convolutional neural network successively creating two feature maps, first with a filter of size 5 and then one of size 3.
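To make the convolution operation concrete, below is a small NumPy sketch of a single-channel, valid-mode convolution. The random filters merely stand in for trained ones, and the 28 × 28 input with filter sizes 5 and 3 echoes Figure 2.6.

```python
import numpy as np

def conv2d(image, kernel):
    """Valid-mode 2D convolution (strictly, cross-correlation, as used
    in CNNs): slide the kernel over the image and take weighted sums of
    local neighborhoods, producing one feature map."""
    kh, kw = kernel.shape
    oh, ow = image.shape[0] - kh + 1, image.shape[1] - kw + 1
    out = np.empty((oh, ow))
    for i in range(oh):
        for j in range(ow):
            out[i, j] = np.sum(image[i:i + kh, j:j + kw] * kernel)
    return out

image = np.random.rand(28, 28)
first = conv2d(image, np.random.rand(5, 5))    # 24x24 feature map (filter size 5)
second = conv2d(first, np.random.rand(3, 3))   # 22x22 feature map (filter size 3)
print(first.shape, second.shape)
```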

2.2.2 Generative Adversarial Networks

Generative adversarial networks (GANs) are a class of generative models used in unsupervised machine learning, consisting of two networks (the generator and the discriminator) competing against each other in a zero-sum game framework. A GAN uses a latent code that describes everything that is generated later. GANs are asymptotically consistent, meaning that if one can find the equilibrium point of the game defining a GAN, it is guaranteed that the real distribution that generates the data is recovered; given an infinite amount of training data, the correct distribution is eventually recovered [18][19].

To describe the working of the GAN framework: we have two competing models in the sense of game theory, where there is a game with defined payoff functions and each player tries to maximize its payoff.

Within this game, one of the networks is the generator, our primary model of interest, which produces samples (generated samples/fake samples) with the aim of mimicking those from the real training distribution (real samples). The other competing model is the discriminator, which inspects a sample and determines whether it is real or fake. During training, images or other samples are fed to the discriminator. The discriminator can be any differentiable function (usually a deep neural network) whose parameters can be learned by gradient descent. When the discriminator is applied to samples/images that come from the training set (real samples), its objective is to yield a value close to one, representing a high probability that the input was real rather than fake.

The discriminator is also applied to samples generated by the generator (fake samples), and its goal in this scenario is to make the output as close to zero as possible, implying the sample was fake. The generator is a differentiable function (usually a deep neural network) whose parameters can be learned by gradient descent. The generator function is applied to a sampled latent vector 'z', which is nothing but noise at the start, acting as a source of randomness that helps the generator produce a wide range of outputs. The images produced by the generator are then fed to the discriminator, and the generator tries to make the discriminator output one, fooling it into thinking the generated image is real when it is not. The reader can find more detailed technical information on GANs in [18].

On a higher level, the generator can be viewed as a counterfeiter trying to create fake currency, while the discriminator can be viewed as the police trying to ban fake currency while allowing real currency. As these two adversaries are forced to compete against each other, the counterfeiter must create ever more realistic currency samples, with the ultimate objective of fooling the police into believing that the generated fake currency is real.


Figure 2.7: Visual representation of a generative adversarial network in action.

Technical formulation - As mentioned above, a generative adversarial network consists of two competing networks, the generator and the discriminator, usually differentiable multi-layer networks. The generator network G learns a mapping from a representation (latent) space to the space of the training data. This is done by first defining a prior on the input noise variables $p_z(z)$, then representing a mapping to the data space as $G(z; \theta_g)$, where $\theta_g$ are the parameters of the generator [18][19]. Expressing this more formally,

$$G : G(z) \rightarrow \mathbb{R}^{|x|}$$

where $z \in \mathbb{R}^{|z|}$ is a sample from the latent space and $x \in \mathbb{R}^{|x|}$ is a sample from the training data.

A second differentiable multi-layer network, the discriminator network $D(x; \theta_d)$, is defined as a function that returns a scalar, mapping the image data to a probability and effectively telling whether the image comes from the real distribution (training images) or the fake distribution (images generated by the generator) [18][19]. Expressing this more formally,

$$D : D(x) \rightarrow (0, 1)$$

where $D(x)$ is the probability that x comes from the true data distribution $p_{data}$ rather than the generator distribution $p_g$, and $\theta_d$ are the parameters of the discriminator D.

The generator G is trained to minimize $\log(1 - D(G(z)))$ in order to find the parameters which confuse the discriminator the most, while the discriminator D is trained to maximize the probability of assigning the correct label to both the training examples and the generated samples from G [18].

The training cost is captured by the value function V(G, D) [18]:

$$\min_{\theta_g}\max_{\theta_d}\; \mathbb{E}_{x \sim p_{data}(x)}[\log D(x)] + \mathbb{E}_{z \sim p_z(z)}[\log(1 - D(G(z)))] \qquad (2.7)$$

Training is done in an alternating fashion, with the parameters of one model being updated while the parameters of the other are fixed. The training process is described in detail in Algorithm 1 [18], and for a fixed generator G there is an optimal discriminator $D^*_G$ such that

$$D^*_G(x) = \frac{p_{data}(x)}{p_{data}(x) + p_g(x)} \qquad (2.8)$$


Algorithm 1 - Minibatch stochastic gradient descent training of generative adversarial networks. The number of steps to apply to the discriminator, k, is a hyperparameter.

for number of training iterations do
    for k steps do
        Sample a minibatch of m noise samples {z^(1), ..., z^(m)} from the noise prior p_g(z).
        Sample a minibatch of m examples {x^(1), ..., x^(m)} from the data generating distribution p_data(x).
        Update the discriminator by ascending its stochastic gradient:
            $\nabla_{\theta_d} \frac{1}{m}\sum_{i=1}^{m}\left[\log D(x^{(i)}) + \log(1 - D(G(z^{(i)})))\right]$
    end for
    Sample a minibatch of m noise samples {z^(1), ..., z^(m)} from the noise prior p_g(z).
    Update the generator by descending its stochastic gradient:
        $\nabla_{\theta_g} \frac{1}{m}\sum_{i=1}^{m}\log(1 - D(G(z^{(i)})))$
end for

Goodfellow et al. [20] show that the optimal generator G is reached when $p_g(x) = p_{data}(x)$, i.e., the optimal discriminator predicts 0.5 for all samples drawn from x and is unable to distinguish between real and fake samples.

Goodfellow et al. [20] further show the convergence of Algorithm 1 (i.e., $p_g$ converges to $p_{data}$) under the condition that the generator and discriminator are individually strong enough and the discriminator is permitted to reach its optimum for a given generator G, with $p_g$ updated so as to improve the criterion:

$$\mathbb{E}_{x \sim p_{data}(x)}[\log D^*_G(x)] + \mathbb{E}_{x \sim p_g}[\log(1 - D^*_G(x))] \qquad (2.9)$$
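As a concrete illustration of Algorithm 1, here is a minimal, self-contained PyTorch sketch that trains a toy GAN to mimic a one-dimensional Gaussian. The network sizes, learning rates, iteration count, target distribution, and k = 1 are illustrative assumptions rather than settings from the thesis (which trains on images); a small constant is added inside the logarithms for numerical stability.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
z_dim, x_dim, m, eps = 8, 1, 64, 1e-8   # latent size, data size, minibatch size

# Generator G(z; theta_g) and discriminator D(x; theta_d) as small MLPs.
G = nn.Sequential(nn.Linear(z_dim, 32), nn.ReLU(), nn.Linear(32, x_dim))
D = nn.Sequential(nn.Linear(x_dim, 32), nn.ReLU(), nn.Linear(32, 1), nn.Sigmoid())
opt_d = torch.optim.SGD(D.parameters(), lr=0.05)
opt_g = torch.optim.SGD(G.parameters(), lr=0.05)

def real_batch():
    """Minibatch from p_data: here a toy N(3, 0.5) distribution."""
    return 3.0 + 0.5 * torch.randn(m, x_dim)

for step in range(2000):
    # k = 1 discriminator step: ascend Eq. 2.7 w.r.t. theta_d
    x, z = real_batch(), torch.randn(m, z_dim)
    loss_d = -(torch.log(D(x) + eps)
               + torch.log(1 - D(G(z).detach()) + eps)).mean()
    opt_d.zero_grad(); loss_d.backward(); opt_d.step()

    # Generator step: descend log(1 - D(G(z))) w.r.t. theta_g
    z = torch.randn(m, z_dim)
    loss_g = torch.log(1 - D(G(z)) + eps).mean()
    opt_g.zero_grad(); loss_g.backward(); opt_g.step()

# After training, generated samples should drift toward the data mean (3.0).
print(G(torch.randn(1000, z_dim)).mean().item())
```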

2.3 Literature Study

In this section, we review the literature on single image super-resolution reconstruction techniques, covering both traditional methods and recent deep learning methods.

2.3.1 Traditional Single Image super resolution

Traditional single image super-resolution methods are categorized into learning based, reconstruction based, and interpolation based approaches.

Learning Based

These methods usually involve a training step in which the relationship between HR images belonging to a specific class (such as face images, fingerprints, etc.) and their LR counterparts is learned, and this knowledge is incorporated into the a priori term of the reconstruction. For obvious reasons, the training data set should be good enough (regarding sufficiency and predictability) to generalize to the test set and avoid overfitting. Different learning based single image SR algorithms are discussed below.

Feature pyramids - In this method by Baker and Kanade [21], HR images are downsampled and blurred to produce a Gaussian resolution pyramid, which is then used for the generation of Laplacian and feature pyramids. After training the system, for a given LR test image, the LR image most similar to it is found from all the available pyramids; the work uses the nearest neighbor method for detecting the most similar images/patches. The authors also tried a different approach, arranging the patches/images in a tree structure; in particular, the LR image and its higher-resolution counterparts are arranged in a child/parent structure. The relationship between them is learned and used as an a priori term in MAP algorithms [2].

Belief Network - Freeman et al. proposed the use of a belief network such as a Markov network [22]. The LR image and its corresponding HR image are divided into patches. The corresponding patches from the LR and HR images are associated through an observation function, which represents how significantly two patches are related to each other. The neighboring patches in the HR image are assumed to be associated with each other and are represented by a transition function. After training the model, the LR image is reconstructed into an HR image, and the missing details of the HR image are estimated (learned) using a belief propagation algorithm, generating a MAP super-resolved image [2].

Neural Nets - These are similar to belief nets but cover diverse types of neural networks: probabilistic neural networks, integrated recurrent neural networks, multilayer perceptrons, feed-forward neural networks, Hopfield NNs, linear associative memories with single and dual associative learning, RBF networks, etc. [2].


Manifold based methods - This technique involves two steps. The goal of the first step is to add a global constraint over the super-resolved image; this is achieved by integrating the manifold based methods with a MAP method or a Markov based learning method. In the next step, a local constraint is added to the super-resolved image by finding the transformations between the LR and HR residual patches; this is achieved using methods such as kernel ridge regression, graph embedding, radial basis functions, and partial least squares regression. Manifold based methods use multiple nearest neighbor patches of the LR image, as opposed to most learning based techniques, in which only the single nearest patch of the LR image and the corresponding HR patch from the training set are used [2].

Reconstruction Based

These methods address the aliasing artifacts that might be present in the input LR image and are classified into the following three groups.

Primal Sketches - The a priori used by other algorithms applied only to one class of image (e.g., faces). This is extended to generic priors, with primal sketches used as the a priori. The hallucination algorithm is applied only to primitives (edges, ridges, corners, terminations, etc.) but not to the non-primitive parts of the image, since an a priori can be learned for primitives but not for non-primitives [2]. Then, based on the primal sketch prior and using Markov chain inference, the corresponding HR patch for every LR patch is substituted [23]. This step hallucinates the high-frequency counterparts of the primitives. The hallucinated image is then used as the starting point for the IBP algorithm to produce an HR image [2].

Gradient profile - The shape statistics of gradient profiles in natural images are robust against changes in image resolution, which motivates a gradient profile prior. Gradient profile based methods learn the similarity between the shape statistics of the low and high-resolution images, and the learned information is used to apply a gradient-based constraint to the reconstruction process [2].

Fields of experts - Fields of experts is an a priori for learning the heavily non-Gaussian statistics of natural images. Here, contrastive divergence is usually used to learn a set of filters from a training database [2].

Interpolation Based

Interpolation based approaches utilize sampling theory to approximate the high-resolution image from a low-resolution image. The disadvantage of these methods is the introduction of aliasing artifacts along edges [2]. Bicubic interpolation is one example of an interpolation based technique, and it will be used in this thesis as the baseline against which the SR images generated by mSRGAN are compared.
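For reference, the bicubic baseline is a one-line resize. The sketch below degrades a ground-truth image, upscales it back by a factor of 4, and scores the result with psnr; the file name "ground_truth.png" is a hypothetical placeholder, not a file from the thesis.

```python
import numpy as np
from PIL import Image

def bicubic_sr(lr_img, scale=4):
    """Bicubic baseline: upscale an LR image by the given factor."""
    w, h = lr_img.size
    return lr_img.resize((w * scale, h * scale), resample=Image.BICUBIC)

# "ground_truth.png" is a hypothetical file name used for illustration.
hr = Image.open("ground_truth.png").convert("L")
hr = hr.crop((0, 0, hr.width // 4 * 4, hr.height // 4 * 4))  # multiple of 4
lr = hr.resize((hr.width // 4, hr.height // 4), resample=Image.BICUBIC)
sr = bicubic_sr(lr, scale=4)

a, b = np.asarray(hr, dtype=float), np.asarray(sr, dtype=float)
print(10 * np.log10(255.0 ** 2 / np.mean((a - b) ** 2)))  # psnr vs. ground truth
```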

2.3.2 Deep Learning based Single Image super resolution

Dong et al. [24] were the first to demonstrate that DL can be utilized to solve the classical computer vision problem of SR, introducing their deep learning method (SRCNN) to perform super-resolution. They draw their inspiration from the traditional sparse coding (SC) based SR method and establish a relationship between their proposed method and SC. This relationship serves as a guideline in designing their network structure, which is an entirely convolutional neural network that learns an end-to-end mapping between the low and high-resolution images. Their unified framework requires very little pre/post-processing beyond the optimization, as opposed to SC, where the steps in the pipeline have rarely been optimized or considered in a unified optimization framework. The SRCNN network consists of 3 layers. Given a low-resolution image which is first upscaled by bicubic interpolation, layer 1 is responsible for extracting overlapping patches from the image and representing each patch as a high-dimensional vector; together these comprise a set of feature maps. Each of these high-dimensional vectors is further mapped into another high-dimensional vector by layer 2. These vectors, which make up another set of feature maps, conceptually represent the patches of a high-resolution image. The final layer aggregates the previously generated high-resolution patch-wise representations to create a final HR image which is expected to match the ground truth. The authors show that SRCNN, which has a lightweight structure, demonstrates state-of-the-art restoration quality, achieves fast speed for practical online usage, operates on three channels simultaneously, and performs better than the state-of-the-art methods.

Kim et al. [25] propose a very deep convolutional neural network for SR (VDSR) inspired by the VGG-net used for ImageNet classification. Their primary motivation stems from the typical drawbacks of existing SR methods, especially SRCNN, which are: a) SRCNN relies on the context of small image regions (it has only three layers with a receptive field of 13 × 13), and other methods use even smaller regions; this is inauspicious since the information contained in small patches is not sufficient for detailed recovery, especially for larger scale factors. b) Training in these methods converges too slowly; SRCNN, which uses a learning rate of $10^{-5}$, takes several days to converge. c) Most of the existing techniques handle different scale factors independently, so an SRCNN model trained for a scale factor of 3 would not work for a scale factor of, say, 4, and a separate model would have to be trained for it. Keeping in mind these drawbacks, the authors design a deep CNN architecture which 1) utilizes more contextual information spread over extensive image regions using larger receptive fields (41 × 41 vs. the 13 × 13 used by SRCNN), ultimately taking a larger image context into consideration; 2) converges faster due to residual learning and extremely high learning rates (their initial learning rate is $10^{4}$ times higher than SRCNN's); boosting convergence rates can potentially lead to the problem of vanishing/exploding gradients, which is handled by residual learning and gradient clipping, leading to more stable training; and 3) is capable of learning and processing different scale factors without training additional models. The network structure of VDSR consists of 20 cascaded layers (convolutional and nonlinear) with a 41 × 41 receptive field and 3 × 3 filters in each layer. An interpolated low-resolution image is fed through these layers and transformed into an HR image. The network estimates a residual image, and the addition of the interpolated low-resolution image and the residual gives the desired output. VDSR outperforms every other state-of-the-art method, including SRCNN, by a large margin regarding accuracy, speed, and visual quality.

Tang et al. [26] proposed a compact hourglass-shaped CNN structure (FSRCNN) for accelerating and producing better results than SRCNN, which can be used in practical scenarios demanding real-time performance (24 fps). They identify two inherent limitations that serve as bottlenecks in the runtime of SRCNN: 1) The original LR image must first be upsampled to the desired HR size by bicubic interpolation to serve as the input. This causes the computational complexity to grow quadratically with the spatial size of the HR image; for an upscaling factor of n, the computational cost of convolution with the interpolated LR image will be n² times that of the original non-interpolated LR image. 2) SRCNN has a costly nonlinear mapping step wherein input image patches are projected onto a high-dimensional feature space, which is then followed by another complex mapping to a high-dimensional HR space, all at the cost of running time.

To address the first limitation, they take the non-interpolated LR image as the input to the network and introduce a deconvolution layer at the end of the network, which is responsible for upsampling the LR image. Due to this, the computational complexity is now proportional to the spatial size of the original non-interpolated LR image, as the mapping is learned directly from the (non-interpolated) LR image to an HR image. For the second limitation, they add a shrinking layer and an expanding layer at the beginning and end of the mapping layer respectively to restrict the mapping to a low-dimensional feature space, leading to their model using smaller filter sizes and thus saving computational cost. Their experiments showed that FSRCNN clocks speed-ups of more than 40×, achieving even superior quality to SRCNN. The authors also present a small FSRCNN network that produces image restoration quality similar to SRCNN but is 17× faster and can be run for real-time applications on a generic CPU.

Kim et al. [27] propose a deeply-recursive convolutional network (DRCN) to perform image super-resolution. Their network has up to 16 very deep recursive layers, and they hypothesize that increasing the recursion depth can improve reconstruction performance without the need to introduce new parameters for performing convolutions. In previous approaches such as SRCNN, increasing network depth can present problems such as overfitting and the model becoming too big to be stored and retrieved. They solve these issues with DRCN, which repeatedly applies the same convolutional layer as many times as required (the network efficiently reuses the weight parameters while exploiting a large image context), ensuring no additional parameters are introduced while more recursions are performed. DRCN optimized with the stochastic gradient descent method (SGD) does not converge efficiently because of vanishing/exploding gradients; to make the model converge more quickly, they introduce recursion supervision and skip connections. In recursion supervision, the feature maps produced after each recursion are used to reconstruct the corresponding desired HR image, and all the different predictions from each of the recursion layers are aggregated to generate a more accurate final HR image prediction. Using skip connections, the authors explicitly connect the input to the output layers for image reconstruction. This is particularly helpful since in SR an input LR image is highly correlated with the output HR image, and an exact copy of the input image is likely to be diminished during the feedforward passes. DRCN outperforms existing state-of-the-art methods by a large margin on benchmark datasets.

Shi et al. [28] proposed a sub-pixel convolutional neural network which can perform real-time SR of 1080p videos on a single K2 GPU. They do this by introducing an efficient sub-pixel convolution layer at the end of the network, which learns upscaling filters that turn the final LR feature maps produced by the network into HR image feature maps, generating the output HR image. By doing this, they eliminate the need to upscale the input LR image by bicubic interpolation in the first step of the SR pipeline, ultimately reducing the computational complexity of the overall SR algorithm. They evaluate their proposed approach on publicly available images and videos and show that it performs significantly better and is an order of magnitude faster than existing CNN-based SR methods.

Johnson et al. [29] propose the use of perceptual loss functions for image transformation problems such as style transfer and SR, where an input image is transformed into an output image. Methods such as CNNs typically use a per-pixel loss between the output and ground truth images, while other recent work produces high-quality images by optimizing perceptual loss functions based on high-level features extracted from pre-trained networks. The authors combine the benefits of both approaches, proposing the use of perceptual loss functions for training feed-forward networks for image transformation tasks, and finally experiment with single-image super-resolution, where replacing a per-pixel loss with a perceptual loss gave them visually pleasing results.

Christian et al. [8] present a generative adversarial network, SRGAN, which is the first framework capable of inferring photo-realistic natural images for 4× upscaling factors. They propose a perceptual loss function which consists of an adversarial loss and a content loss. The adversarial loss pushes their solution towards the natural image manifold, and they use a content loss motivated by perceptual similarity instead of similarity in pixel space. Their deep residual network can recover photo-realistic textures from heavily downsampled images on public benchmarks. An extensive mean-opinion-score test showed vast gains in perceptual quality. They also report state-of-the-art psnr values on benchmark SR datasets.


Chapter 3

Motivation

This chapter discusses in detail the motivations behind proposing a GAN utilizing a perceptual loss for high content screening (HCS) microscopy images, which are -

1. Common problems in HCS that cause captured images to have artifacts/noise (acquisition errors)

2. Inefficiency of the traditional pixel-wise mean squared error (MSE)

3. Feature transferability issues in CNNs

3.1 HCS microscopy problems in image acquisition

Apart from suffering from the usual challenges in image acquisition mentioned in section 1.3, microscopy images are prone to a host of domain-specific challenges which might further degrade the quality of the images acquired. In this chapter we attempt to highlight these common problems faced while acquiring microscopic images, which create a need for denoising and super-resolution approaches.


Figure 3.1: An overview of common problems encountered while acquiring microscopy images. Image Source [30].

3.1.1 Photo bleaching

Photobleaching (also called fading) occurs when, due to extended illumination periods, fluorophores suffer from a diminished excitation response, losing their ability to fluoresce due to photon-induced chemical damage and covalent modification. In simple terms, when we say a fluorophore is photobleaching, it means that it has lost its ability to fluoresce, i.e., to absorb and emit light. During the photobleaching process, the imaged sample gradually loses the amount of fluorescence observed, ultimately leading to a loss of image quality. Loss of fluorescence caused by photobleaching is crucial to take into account while performing image quantification studies, as it can alter the quantitative data, leading to false and misleading results.


Figure 3.2: Photobleaching over time (seconds) in quantum dot labels (shown in red) and organic dye molecules (shown in green). Image Source [31].

Figure 3.3: Images captured (a-f) at 2-minute intervals for multiply stained specimens. Image Source [32].


3.1.2 Bleed through/ Crosstalk

Bleed-through/crosstalk artifacts appear when two or more fluorescent markers are excited simultaneously, and the channel of interest displays fluorescence from a neighboring channel.

Figure 3.4: A sample containing two non-overlapping objects dyed in red and green. As the crosstalk factor increases, the more yellowish the red object looks, since its signal is recorded in the green channel in addition to the red channel. Image Source [33].

Figure 3.5: Another instance of crosstalk, where two distinct fluorophores appear in the same channel: a fluorophore observed in the TRITC filter is also observed in the FITC filter. Image Source [34].


3.1.3 Phototoxicity

In live cell imaging, overexposing the cells to light (of both low and high wavelength) for a prolonged time eventually damages them, causing phototoxicity. One of the reasons for phototoxicity is that most cells used in a typical imaging experiment are not used to the sheer number of photons aimed at them. Fig 3.6 illustrates phototoxicity.

Figure 3.6: The cell at the top shows disastrous protrusion of the plasma membrane (also known as blebbing), indicating phototoxicity, while the neighboring cells are relatively healthier. Image Source [35].


3.1.4 Uneven illumination

There are instances when a sample is not evenly illuminated across the field of view, giving rise to uneven illumination in the image, with darker, unclear regions and more brightly illuminated areas occurring together.

Figure 3.7: Cells stained with a nucleic acid dye; uneven illumination is observed. Image Source [36].


3.1.5 Color and contrast errors

Color errors occur for many reasons, such as color degradation from autofluorescence, improper filtration, etc. Contrast errors occur due to misconfiguration of the optical train or use of the wrong filter combinations.

Figure 3.8: Visual demonstration of color error across slides. Image Source [30].

Figure 3.9: Visual demonstration of contrast error across slides. Image Source [30]


3.2 Inefficiency of pixel-wise M.S.E

L2 loss, or mean squared error (M.S.E), is widely used in machine learning applications such as regression, pattern recognition, signal processing, and image processing, and is the de facto error metric where the pixel-wise distance between generated and ground truth images is measured. M.S.E is used in a wide variety of image applications such as super-resolution, segmentation, colorization, and depth and surface normal prediction. Several factors make it a popular choice: its convexity, symmetry, and differentiability (favorable for optimization problems), its simplicity (parameter-free and inexpensive to compute), and its additivity for independent sources of distortion [37].

Another catalyst for this widespread adoption is that standard software packages such as Caffe, TensorFlow, and Keras facilitate using M.S.E but few other loss functions for regression, discouraging practitioners from experimenting with different loss functions. More detailed advantages of M.S.E can be found in [37].

However, M.S.E has many flaws for generating images, and images produced by optimizing M.S.E do not correlate well with the image quality perceived by a human observer. One of the reasons for this lies in the assumptions made while using M.S.E: that the impact of noise does not depend on the local characteristics of an image, and that the noise follows a Gaussian model, which is not the case in many settings. Contrary to these assumptions, the sensitivity of the human visual system (HVS) to noise depends on local luminance, contrast, and structure. M.S.E overly penalizes larger errors while being more forgiving of small errors, ignoring the underlying structure of the image. M.S.E also tends to have more local minima, which makes it challenging to converge towards a better local minimum. Consequently, the most common metric for quantitatively measuring image quality, psnr, corresponds poorly to a human's perception of image quality. As can be observed in equation 3.1 below, M.S.E and psnr share an inverse relationship, with minimizing M.S.E leading to a high psnr. Thus, psnr by itself cannot be an indicator of how good an image looks perceptually, and there is a need to adopt other loss metrics that capture the intricate details affecting the HVS.

\[ \mathrm{psnr} = 10 \log_{10}\left(\frac{L^2}{\mathrm{MSE}}\right) \tag{3.1} \]

where L is the maximum possible pixel value of the image.
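To make equation 3.1 concrete, the following is a minimal sketch of the psnr computation in Python using NumPy; the function name and the default of L = 255 (8-bit images) are our illustrative assumptions, not code from any particular package.

import numpy as np

def psnr(ground_truth, generated, L=255.0):
    # Pixel-wise mean squared error between the two images.
    mse = np.mean((ground_truth.astype(np.float64)
                   - generated.astype(np.float64)) ** 2)
    if mse == 0:
        return float("inf")  # identical images
    # Equation 3.1: psnr rises as M.S.E falls.
    return 10.0 * np.log10((L ** 2) / mse)

Note how minimizing M.S.E directly maximizes psnr, which is why psnr alone cannot distinguish a perceptually sharp image from a blurry one with a small average pixel error.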

Recently, it has been shown that visually pleasing, high-quality images are generated by optimizing a perceptual loss function, where distances between feature representations extracted from pre-trained convolutional neural networks are minimized instead of pixel differences. This approach has been applied to invert feature representations, to visualize image features learned by a deep CNN, and to perform style transfer between content and style images. More recently, a perceptual loss has been used for super-resolution by Christian et al. in SRGAN [8], with great success in generating photorealistic SR images.
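As a minimal sketch of such a perceptual (content) loss, assuming Keras with a VGG19 pre-trained on ImageNet, one could compute the M.S.E between feature maps rather than between pixels; the choice of the 'block5_conv4' layer is an illustrative assumption, not a prescription from the literature.

import tensorflow as tf
from tensorflow.keras.applications import VGG19
from tensorflow.keras.models import Model

# Frozen feature extractor built from an intermediate VGG19 layer.
vgg = VGG19(weights="imagenet", include_top=False)
feature_extractor = Model(inputs=vgg.input,
                          outputs=vgg.get_layer("block5_conv4").output)
feature_extractor.trainable = False

def content_loss(hr_images, sr_images):
    # Distance measured in feature space, not pixel space.
    hr_features = feature_extractor(hr_images)
    sr_features = feature_extractor(sr_images)
    return tf.reduce_mean(tf.square(hr_features - sr_features))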

Figure 3.10: The image on the left is the SR image generated by optimizing M.S.E, achieving state-of-the-art psnr results; the image on the right is the SR image generated by perceptual optimization. The visual differences are apparent, with the perceptually optimized image (right) appearing sharper and more realistic despite having a lower psnr value. Image source [38].

We hypothesize that using perceptual optimization, as opposed to pixel-wise optimization alone, for generating microscopic SR images has the potential to make the generated images look visually pleasing and closer to the ground truth HR images. Despite its shortcomings, M.S.E can still be an asset in accounting for pixel-wise changes in the images which a perceptual loss function might otherwise miss, so we will not completely discard M.S.E as the authors of SRGAN do. Hence, we will use a weighted combination of the pixel-wise M.S.E with the perceptual loss to combine the benefits of both approaches; we believe the generated images will look visually pleasing (something not possible with M.S.E alone) and have a respectable psnr (something not likely using the perceptual loss alone).
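A minimal sketch of this weighted combination, reusing the content_loss function from the earlier sketch; the weights ALPHA and BETA are placeholders we would tune experimentally, not values taken from the literature.

import tensorflow as tf

ALPHA = 1.0    # weight on the pixel-wise M.S.E term (hypothetical)
BETA = 0.006   # weight on the perceptual content loss term (hypothetical)

def combined_loss(hr_images, sr_images):
    # Pixel-wise M.S.E anchors the reconstruction to exact intensities.
    pixel_mse = tf.reduce_mean(tf.square(hr_images - sr_images))
    # Perceptual term rewards images whose feature representations
    # match those of the ground truth (see the content_loss sketch above).
    return ALPHA * pixel_mse + BETA * content_loss(hr_images, sr_images)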


3.3 Feature transferability issues in CNNs for distant source and target domains

There is compelling evidence that for visual recognition tasks such as object detection and classification, deep convolutional neural networks are the most powerful way to learn feature representations, as their deep architecture makes it possible to extract several critical distinguishing features at multiple layers of abstraction. Azizpour et al. [39] show that the features obtained from training a deep convolutional neural network should be the first choice in visual recognition tasks.

Deep networks trained on a large labeled dataset such as ImageNet yield the best results by a substantial margin by learning useful, generic image feature representations. Apart from learning the representations, one essential aspect of CNNs is the "transferability" of these representations, which can be used off the shelf for solving many visual recognition tasks with remarkable performance. This transferability, however, is influenced by several factors, one of them being the distance between the source and target tasks. Bengio et al. [40] present evidence that the transferability of features is inversely proportional to the distance between the source and target tasks, i.e., transferability of features decreases as the distance between source and target tasks increases. Extensive studies conducted by Azizpour et al. [41] further cement the fact that there is an inverse relation between the performance achieved at target tasks and their respective distance from the source task.
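To make the notion of off-the-shelf transfer concrete, the following sketch freezes an ImageNet-pre-trained VGG19 and attaches a new classification head for a target task; the input size and the ten-class head are illustrative assumptions.

from tensorflow.keras.applications import VGG19
from tensorflow.keras import layers, Model

base = VGG19(weights="imagenet", include_top=False,
             input_shape=(224, 224, 3))
base.trainable = False  # reuse the source-task representations unchanged

x = layers.GlobalAveragePooling2D()(base.output)
outputs = layers.Dense(10, activation="softmax")(x)  # hypothetical target task
transfer_model = Model(base.input, outputs)
transfer_model.compile(optimizer="adam", loss="categorical_crossentropy")

The closer the target images are to ImageNet, the better such frozen features work; for distant domains, fine-tuning or training from scratch becomes necessary.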

This transferability is one of the prime reasons that the SRGAN model proposed by Christian et al. [8] generates impressive photo-realistic upscaled images. They incorporate a content loss that minimizes distances between feature representations in different layers of a VGG19 model pre-trained on ImageNet data, and evaluate the model on widely benchmarked SR datasets such as Set5, Set14, and BSD100, which are visually very similar to the labeled images in ImageNet.

Even though SRGAN beats every other architecture and achieves state-of-the-art results, it might be ill-suited for direct application to the domain of microscopic images, as the authors themselves acknowledge. Considering the problems of transferability of features between distant domains mentioned above, we foresee two significant issues hindering the direct application of SRGAN to high content screening microscopy images -

1. There is a vast difference between high content screening microscopy images and the ImageNet images. Owing to this distance between the domains, an SRGAN model trained on ImageNet might not yield ideal results for the content loss, as it will not be able to leverage the feature representations stored in the VGG19 layers for the target task of upscaling high content screening microscopy images. One of the guidelines for learning representations between distant source and target tasks is to train the network from scratch using target data [39].


2. SRGAN is trained on ImageNet images to optimize the visual appearance of images. The primary goal of the algorithm is to improve the perceptual quality of the SR images, so in a microscopy or medical setting a damaged cell in the image could be reconstructed as a healthy cell just because it looks perceptually pleasing. This can be a catastrophic situation in medical settings. Also, the VGG network the authors use in their paper is not trained on images belonging to the microscopic domain, hence the SRGAN algorithm might suffer from convergence issues and fail to generalize appropriately for microscopic images.

Drawing inspiration from the original SRGAN architecture [8], and to mitigate the challenges mentioned earlier, we propose a new, extended version of SRGAN specialized for super-resolving microscopic images. In this proposed extended architecture, we train a mini VGG19 network from scratch within the original SRGAN network, for the sole task of recognizing and classifying microscopic images, and subsequently use the learned feature reconstructions of these images for minimizing the content loss. We hypothesize that this modified architecture will result in better super-resolved images than the plain SRGAN architecture alone, as it can leverage the feature reconstructions of microscopic images for perceptual optimization. We will also train the generator and discriminator of the network only on microscopic images, leading to faster convergence and plausible reconstruction of images which represent the true distribution of microscopic images.
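As a sketch of what such a mini VGG19 classifier could look like, assuming Keras; the depth, filter counts, input size, and the number of microscopy classes are illustrative assumptions, not the final design.

from tensorflow.keras import layers, models

N_CLASSES = 5  # placeholder for the number of microscopy image classes

def conv_block(filters, n_convs):
    # A VGG-style block: stacked 3x3 convolutions followed by pooling.
    block = [layers.Conv2D(filters, 3, padding="same", activation="relu")
             for _ in range(n_convs)]
    block.append(layers.MaxPooling2D(2))
    return block

mini_vgg = models.Sequential(
    [layers.Input(shape=(96, 96, 3))]
    + conv_block(64, 2)
    + conv_block(128, 2)
    + conv_block(256, 3)
    + [layers.GlobalAveragePooling2D(),
       layers.Dense(N_CLASSES, activation="softmax")]
)
mini_vgg.compile(optimizer="adam", loss="categorical_crossentropy")
# After training on labeled microscopy images, an intermediate layer of
# mini_vgg would replace the ImageNet VGG19 as the source of the content loss.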

Summing up the motivations as hypotheses -

1. SRGAN will not perform optimally for SR on microscopic images compared to mSRGAN, since it utilizes feature representations from a domain (natural images) distant from the target domain (microscopy images).

2. Pixel-wise M.S.E will be inefficient in generating photorealistic microscopic images compared to the perceptual loss.
