Ensembles of Single Image Super-Resolution Generative Adversarial Networks

VICTOR CASTILLO ARAUJO

KTH Royal Institute of Technology
School of Electrical Engineering and Computer Science

Ensembles of single image super-resolution generative adversarial networks / Ensembler av generative adversarial networks för superupplösning av bilder

© 2021 Victor Castillo Araujo


Abstract

Generative Adversarial Networks have been used to obtain state-of-the-art results for low-level computer vision tasks like single image super-resolution; however, they are notoriously difficult to train due to the instability of the competing minimax framework.

Additionally, traditional ensembling mechanisms cannot be effectively applied to these types of networks due to the resources they require at inference time and the complexity of their architectures.

In this thesis, an alternative method of creating ensembles of individual models that are more stable and easier to train, by using interpolations in the parameter space of the models, is found to produce better results than the initial individual models when evaluated using perceptual metrics as a proxy for human judges. This method can be used as a framework to train GANs with competitive perceptual results in comparison to state-of-the-art alternatives.

Keywords

Generative Adversarial Networks; Single Image Super-Resolution; Computer Vision; Convolutional Neural Networks; Ensemble Learning


Sammanfattning

Generative Adversarial Networks (GANs) have been used to achieve state-of-the-art results for fundamental image analysis tasks, such as generating high-resolution images from low-resolution ones, but they are notoriously difficult to train due to the instability of the competing minimax framework.

Moreover, traditional ensembling mechanisms cannot be applied effectively to these types of networks because of the resources they require at inference time and the complexity of their architectures.

In this project, an alternative method of combining individual models that are more stable and easier to train, through interpolation in the parameter space of the models, has been shown to give better perceptual results than the original individual models, and this method can be used as a framework for training GANs with perceptual performance that is competitive with the state of the art.

Keywords

Generative Adversarial Networks; Super-Resolution; Computer Vision; Image Analysis; Convolutional Neural Networks; Ensembles


Table of Contents

List of acronyms and abbreviations
1 Introduction
  1.1 Background
  1.2 Problem
  1.3 Purpose
  1.4 Objectives
    1.4.1 Ethics and Sustainability
  1.5 Research Methodology
  1.6 Delimitations
  1.7 Outline
2 Background
  2.1 Image Super-Resolution
  2.2 Evaluation Metrics for Computer Vision
  2.3 Convolutional Neural Networks for Image Super-Resolution
  2.4 Generative Adversarial Neural Networks (GAN)
  2.5 Ensemble Learning
  2.6 Model Combination Strategies
    2.6.1 Gradient Descent Optimization
    2.6.2 Search Algorithms
      2.6.2.1 Random Search
      2.6.2.2 Grid Search
      2.6.2.3 Bayesian Optimization
  2.7 Related Work
    2.7.1 Classic ensemble for regression or classification tasks
    2.7.2 Image self-ensemble
    2.7.3 Gradient updates
    2.7.4 Weights Ensemble
  2.8 Summary
3 Methodologies and Methods
  3.1 Research Process
  3.2 Research Paradigm
  3.3 Data Collection
    3.3.1 Dataset Processing
    3.3.2 Sampling Independent Sets
    3.3.3 Validation Datasets
  3.4 Experimental Design and Planned Measurements
    3.4.1 Model Combination Strategies
      3.4.1.1 Gradient Descent Optimization
      3.4.1.2 Search Algorithms
      3.4.1.3 Random Search
      3.4.1.4 Grid Search
      3.4.1.5 Bayesian Optimization
    3.4.2 Test Environment
    3.4.3 Hardware and Software to be used
  3.5 Assessing Reliability and Validity of the Data Collected
    3.5.1 Reliability
    3.5.2 Validity
  3.6 Planned Data Analysis
    3.6.1 Data Analysis Technique
      3.6.1.1 Quantitative evaluation
      3.6.1.2 Qualitative evaluation
    3.6.2 Software Tools
  3.7 Evaluation framework
  3.8 Summary
4 Models Training and Testing of the Combination Strategies
  4.1 Training Phases
  4.2 Combinations Implementation
    4.2.1 Gradient Descent Optimization
      4.2.1.1 Directly weighting each model by the corresponding coefficient
      4.2.1.2 Weighting the parameters by the scalar coefficient during model loading
      4.2.1.3 Modifying the ESRGAN network architecture
    4.2.2 Search Algorithms
      4.2.2.1 Random Search
      4.2.2.2 Grid Search
      4.2.2.3 Bayesian Optimization
  4.3 Summary of the combination methods tested
5 Results and Analysis
  5.1 Major Results
    5.1.1 Quantitative Results
      5.1.1.1 Global Aggregation Results
      5.1.1.2 Results Aggregated by Dataset
      5.1.1.3 Evaluations with Images Subset
      5.1.1.4 Interaction between the Performance Metrics
    5.1.2 Exploratory Results to Understand Underlying Structure: Clustering
    5.1.3 Explaining Features: Regression and Feature Importance
      5.1.3.1 Evaluating Naïve Combinations
    5.1.4 Qualitative Results
  5.2 Reliability Analysis
  5.3 Validity Analysis
  5.4 Discussion
6 Concluding Remarks
  6.1 Conclusions
  6.2 Limitations
  6.3 Future work
  6.4 Federated Learning
  6.5 Reflections
References


List of Figures

Figure 1: Example of downscaling and upscaling with basic interpolator algorithms (bicubic)
Figure 2: Quantitative comparison of evaluation metrics
Figure 3: Conditional GAN framework for SISR
Figure 4: ESRGAN Architecture
Figure 5: Grid and Random search of nine trials
Figure 6: Relevant CNN ensembling strategies
Figure 7: General phases of the project
Figure 8: An example of the cropping and downscaling process. From left to right: original image, 916x916 crop, 128x128 random crop HR, 32x32 downscaled LR
Figure 9: Representation of the dataset binning process
Figure 10: Comparison between the original CNN architecture (left) and a modification to combine two such architectures in a single one (right)
Figure 11: The meta-learning process
Figure 12: Example of interpolations between the perceptual ESRGAN model (α = 1) and the PSNR-oriented model (α = 0) to reduce the noise artifacts
Figure 13: Proposed modified ESRGAN architecture
Figure 14: Metric relationship considering all models
Figure 15: Metric relationship considering only the perceptual models
Figure 16: Elbow method for clustering
Figure 17: Comparison between best LPIPS combination and best PSNR/SSIM combination
Figure 18: PIRMt 248 results
Figure 19: PIRMt 223 results
Figure 20: BSD100 108005 results
Figure 21: BSD100 102061 results
Figure 22: PIRMt 258 results
Figure 23: PIRMv 84 results
Figure 24: PIRMv 86 results
Figure 25: Full image comparison. From top left to bottom right: LR, HR, best combination result and ESRGAN result


List of Tables

Table 1: Training iterations differences with ESRGAN
Table 2: Combination strategies to evaluate
Table 3: Results of the candidate combination methods
Table 4: Global results comparing the top 10 combinations and benchmark models
Table 5: Relative comparison between the best combination and ESRGAN
Table 6: Top 10 results per dataset
Table 7: Results on a subset of images, per image
Table 8: Results by cluster
Table 9: ESRGAN gains
Table 10: Individual models gains
Table 11: Aggregate gains for the individual models
Table 12: Phase 2 gains for the individual models
Table 13: Full gains aggregate for combination models
Table 14: Complete gains comparison
Table 15: Naïve model combinations and results


List of acronyms and abbreviations

BO     Bayesian Optimization
CNN    Convolutional Neural Network
CPU    Central Processing Unit
CV     Computer Vision
FCN    Fully Convolutional Network
GAN    Generative Adversarial Neural Network
GPU    Graphics Processing Unit
HR     High Resolution
LPIPS  Learned Perceptual Image Patch Similarity
LR     Low Resolution
MOS    Mean Opinion Score
PSNR   Peak Signal-to-Noise Ratio
SISR   Single Image Super-Resolution
SOTA   State-of-the-art
SR     Super-Resolution (sometimes also Super-Resolved)
SSIM   Structural Similarity Index Metric
SWA    Stochastic Weight Averaging
VRAM   Video Random Access Memory


1 Introduction

A particular type of deep learning network is the Convolutional Neural Network (CNN), which is designed to work with images by using learnable convolutional filters as its basic building block. Downscaling or adding noise to a high-resolution image is an easy task, but performing the transformation in the opposite direction is not trivial. The use of CNNs for these ill-posed low-level Computer Vision (CV) tasks of image super-resolution [1], denoising and deblurring [2] has been investigated, producing state-of-the-art results.

These types of networks benefit greatly from certain design choices, for example network depth and the amount of training data [3]. However, these choices are not without challenges: the deeper the network, the more hardware resources are required to train it, and in some cases centralizing all the data required for a model to be optimally trained can be unfeasible, especially when using advanced techniques such as Generative Adversarial Neural Networks (GANs) [4], which reconstruct image details better but make the training process very unstable.

The main objective of this work is to evaluate options for executing multiple simpler and independent GAN training sessions for image super-resolution and to apply the ensemble learning methodology to combine the resulting models into a single final model, in order to obtain better performance from CNNs for low-level CV tasks such as single image super-resolution.

1.1 Background

This work builds on the established knowledge of existing Machine Learning and Deep Learning techniques, either directly or by applying techniques from related methods, such as Ensemble Learning.

The specific case of the low-level CV task of Single Image Super-Resolution (SISR) is used in this work as an empirical example to test the validity of the hypothesis, using a positivist approach that can be generalized to other tasks, such as denoising or deblurring, given that the basic architectures and strategies employed are fundamentally the same; the most important changes are the data used and its specific domain (natural images, medical images, deep space photography, and others).

State-of-the-art results in single image super-resolution have been achieved by implementing Generative Adversarial Neural Networks (GANs) [4], which consist of a pair of networks, a Generator and a Discriminator, that compete during training: one with the objective of discriminating between real images from the target domain and fake results from the generator, which is in turn optimized to fool the discriminator with realistic results.

A single image super-resolution (SISR) model, ESRGAN [5], is used in this work, given that it was the winner of the "2018 Perceptual Image Restoration and Manipulation (PIRM) challenge on perceptual super-resolution (SR)" [6]. While the method produces results that have more details than previous ones, it also introduces noise and artifacts that are not appealing to human judges. In order to solve this, the authors also presented a strategy that consists of linearly interpolating the parameters of the final model trained for a perceptual quality objective with those of a model focused on high Peak Signal-to-Noise Ratio (PSNR) results, obtaining super-resolved images that sacrifice some details to reduce the artifacts.

In the case of the PIRM challenge, human judges evaluated the results to choose a winning method. This was required because traditional distortion metrics like Peak Signal-to-Noise Ratio (PSNR) and the Structural Similarity Index Metric (SSIM), used to evaluate image reconstruction tasks, do not correlate with measures such as the Mean Opinion Score (MOS), which is based on user studies [7]. However, the performance of most recent networks is instead being evaluated using metrics such as the Learned Perceptual Image Patch Similarity (LPIPS) [8] as a proxy of human perception.

1.2 Problem

GANs are notoriously difficult to train due to the framework's unstable nature, for different reasons such as:

- Non-convexity of the system
- High sensitivity to hyper-parameter selection
- Diminishing gradients, when one of the networks becomes too successful in comparison to the other, limiting learning
- Non-convergence, when the parameters in the system start oscillating and no equilibrium is found

Some techniques to prevent these sources of instability have been presented in [9], but they don't fully solve the instability of the framework. One possible solution is to use Ensemble Learning strategies to train independent models that are specialized to a subset of the data or to different hyper-parameter configurations, and then combine these "overfit" models into a final model capable of producing better results than the originally trained models.

However, at the moment there are no specific strategies for applying efficient Ensemble Learning to low-level CV tasks using GANs. While there have been experiments that empirically demonstrate that traditional Ensemble Learning strategies can increase the performance of CNNs on high-level computer vision tasks like image classification [10], their use with GANs for low-level tasks remains limited. In particular, the question of how to efficiently combine individually trained models to obtain better performance than any of the original models remains open.

The two main limitations found with these networks are that, traditionally, the final ensemble has to retain all the individually trained learners, requiring large amounts of storage and computing resources at inference time, and that the aggregation mechanism combines the results of multiple models by averaging the final outputs; for low-level CV tasks the outputs are images, and directly interpolating them leads to blurry or noisy results, often with ghosting effects or artifacts [5], even if the individual models were trained for the same task.

An alternative way to ensemble the models is required in order to overcome these limitations. Using the knowledge exposed in [28], [12] and [32], the weights that represent the models can be linearly interpolated to create a combination of any two models: as long as there is a correlation between the filters of the two models to be interpolated, the result of the linear interpolation will exhibit a behavior that is intermediate between the two original models.


Interpolating two CNN models is not typically feasible, even if they have the same architecture, because each model is initialized randomly, and while they can learn the same filters during training, these filters will be arranged differently in the weight space. If there is no correlation, the visual results of the combination will be black images, which can be explained intuitively by noticing that the average of parameters combined in this fashion will tend towards the most probable value of the distribution, near magnitude zero, representing the black color. Additionally, as shown in [11] in a test with MNIST character classification, if different initializations are used to independently train two models, interpolating the models by parameter averaging results in a situation where the lowest loss is that of either of the two original models and any intermediate combination only increases the loss, with a peak halfway between the two models, a behavior that is not desirable. To prevent these behaviors, additional restrictions must be incorporated in the formulation of an experiment aimed at interpolating model parameters.

1.3 Purpose

Considering this setting as the starting point, there is a need for an efficient method of ensembling GAN models. In general, the method evaluated in this work should be usable with any CNN for low-level CV tasks, but given the limited time and resources for the project, a single architecture is chosen for one specific task, in this case SISR.

The main purpose of the method is to benefit machine learning practitioners that use the GAN framework in their experiments for low-level computer vision tasks, by dividing the whole end-to-end training into smaller training sessions that can later be combined into a single model that exhibits higher performance than the individual models for a specific objective, facilitating the training of such networks by reducing the natural instability of the framework. This final model can be used at inference time without the constraints of classic ensemble learning strategies.

The hypothesis is that, in the process of combining multiple models that are over-fitting to their local data, the final model becomes better able to generalize to previously unseen data. The main contribution of this work is to explore alternative ensembling mechanisms by which SISR models that have been trained independently can be combined into a single model with higher performance.

The research question then becomes: in the space of traditional machine learning ensemble methodologies, is there at least one way to combine the results of multiple independently trained generative CNN models for low-level computer vision tasks like SISR, such that the combination produces a single combined model with better perceptual performance than any of the independently trained models?


1.4 Objectives

In order to answer the research question, the work has been divided into the following sub-goals:

- Train four independent individual SISR generative CNN models with independent data, in a way that guarantees they can be combined in a later step
- Test different aggregation strategies to generate a single model from the previously generated individual models, using the Ensemble Learning literature as reference
- Evaluate the combinations against the original models as baselines, using the LPIPS perceptual metric

This last evaluation makes it possible to finally answer the research question.

1.4.1 Ethics and Sustainability

Ideally, the process of dividing a larger GAN model into subset models that can later be ensembled into a single one would lead to a positive sustainability impact, where the resources needed to train each individual model are not only smaller but also required for shorter training times; these are some of the technical challenges that are common in the field.

While ethical considerations arise in every machine learning application, such as cases with unbalanced data and biases, the degree project itself makes use of publicly available datasets that are commonly used for computer vision tasks and does not involve any major ethical issues.

1.5 Research Methodology

An empirical methodology is used, evaluating the results of multiple candidate approaches, both in their feasibility and their performance. In order to have a baseline to compare against, an existing state-of-the-art architecture with published and repeatable results is chosen, in this case "Enhanced Super-Resolution Generative Adversarial Networks" (ESRGAN) [5], and the datasets used for training and validation consist of publicly available images that are standard for computer vision tasks. A quantitative study takes place, in which as much of the ESRGAN work as possible is maintained, introducing only the changes necessary to evaluate the validity of the hypothesis.

CNNs for tasks such as image super-resolution, as in the case of ESRGAN, require a different way of ensembling the models than averaging or interpolating their outputs. In classic Ensemble Learning, one mechanism is to use a simple average of the results of the models to define the final result, but given that some models can exhibit better results than others [12], the principles of the stacked weighted ensemble methodology allow assigning a higher weight to those models and a smaller contribution to the rest of the ensemble. This is the criterion explored as the candidate solution to combine the models in this work, but instead of operating on the models' results, it operates on the interpolation of the parameters (parameter space) of the models.


However, in order to combine the models' parameters, the filter correlation problem has to be solved first. It was demonstrated in [11] that if, instead of randomly initializing the two models to be combined, a common initialization is used before independently training them, averaging the parameters results in a loss lower than that of either of the original models. It was also observed in [13] that by using a seed model as a common initialization for two models that are trained independently, it is possible to induce a correlation between the parameters and filters of CNN models in a way that allows for an effective interpolation between the two.

This condition of using a common seed model as initialization for the independent models is introduced in the formulation of this work, in order to facilitate the ensembling of these models using parameter interpolation. This seed model can be made to coincide with the PSNR-oriented phase of ESRGAN, where a model is trained without the perceptual objective and GAN framework and is used as a pretrained initialization to stabilize the training of the second phase.

Differently from the ESRGAN paper, four fully independent training sessions then follow from the same seed model, each session using its own subset of the dataset, and the resulting models are finally aggregated into a single model using the stacked weighted ensemble approach.

In order to optimize the weights that the stacked weighted ensemble uses to combine the models, multiple optimization techniques are considered, including gradient-based mechanisms such as Gradient Descent, as well as hyper-parameter search using Random Search, Grid Search and Bayesian Optimization. Candidate combinations from each of these methods are generated and compared, with the final results evaluated on multiple validation datasets to test their generalization capability.

Given that this particular task has a perceptual component for comparing results and since how to effectively evaluate perceptual metrics quantitatively is an ongoing question in the field, some image comparisons will require qualitative evaluation, by observing and describing the results, as well as using the perceptual LPIPS metric. The results will be stored and analyzed using quantitative and qualitative evaluations to gain additional insights and understanding.

1.6 Delimitations

Out of scope for this project are formulating a new neural network architecture or new loss functions, and evaluating the effects of different image augmentation strategies and hyper-parameters via ablation studies.

The images considered in scope are natural photographs, both due to the availability of such images in the public domain and to allow comparison with other works of similar scope.

1.7 Outline

Chapter 2 presents the background information about ensemble learning, convolutional neural networks for single image super-resolution and how these concepts can be integrated to formulate a potential solution to the goals of this project.


Chapter 3 contains the details of the methodology and methods used to solve the problem. This includes the conceptual formulation of the potential solution, as well as the evaluation framework that is used to validate and contextualize the results.

In Chapter 4 the details of the experiments are described, and the analysis of the results is explored in Chapter 5, including a discussion of these results. Finally, in Chapter 6 some conclusions and additional comments can be found, along with some potential areas where the research in this work can continue.


2 Background

This chapter presents basic background information about ensemble learning and single image super-resolution using convolutional neural networks. The characteristics of the metrics used to evaluate these models are also introduced, as well as the difference between distortion and perceptual metrics, which is part of the challenge of this project. Alternatives for ensembling that can be applied to low-level computer vision models are also described.

2.1 Image Super-Resolution

Obtaining a high-resolution (HR) image from one or multiple low-resolution (LR) images is a classic computer vision problem known as super-resolution.

Basic linear interpolation upsampling methods such as Nearest Neighbor, Bilinear and Bicubic [14] [15] have been used to naively approximate the blank pixels that appear when the original image is extended to occupy a larger space. For example, in the case of a 2x upscale, this means duplicating the number of pixels in the "height" and "width" dimensions of the image, effectively quadrupling the total amount of pixels. For larger scales, the amount of missing data grows rapidly. Nearest Neighbor assigns the closest pixel value to the blank pixels, while Bilinear and Bicubic interpolation perform more intermediate calculations to obtain values between pixels. While these techniques are useful and have low complexity, the results are often blurry and details are lost when trying to reverse a downscaling operation, as can be seen in Figure 1.

Another way to think about this is that multiple different high-resolution images can generate the same degraded version, and going from this degraded version back to one specific high-resolution image is not possible because of the information loss: downscaling and then upscaling an image with the bicubic kernel results in a final image with lost details.

Figure 1: Example of downscaling and upscaling with basic interpolator algorithms (bicubic)
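This degradation round trip is easy to reproduce in code; a minimal sketch using Pillow, with a hypothetical input file name:

```python
from PIL import Image

# Downscale a high-resolution image by 4x, then upscale it back with the
# bicubic kernel; high-frequency detail is lost in the round trip.
hr = Image.open("input_hr.png")  # hypothetical input file
lr = hr.resize((hr.width // 4, hr.height // 4), Image.BICUBIC)
upscaled = lr.resize((hr.width, hr.height), Image.BICUBIC)
upscaled.save("input_bicubic_x4.png")  # noticeably blurrier than the original
```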


2.2 Evaluation Metrics for Computer Vision

In order to assess the similarity between two images, such as to evaluate the capacity of a super-resolution algorithm to reconstruct the high-resolution (HR) image from a low-resolution (LR) version, different types of metrics can be used, some of the more traditional being the PSNR and SSIM.

PSNR is defined in terms of the maximum possible pixel value (L), also known as the dynamic range, and the Mean Squared Error (MSE, also known as the L2 loss) between the images [8]. When comparing two images, with ground truth $X$ and reconstructed image $X^{SR}$ over a total of $N$ pixels, the MSE and PSNR can be calculated with the following equations:

$$\mathrm{MSE} = \frac{1}{N}\sum_{i=1}^{N}\left\|X(i) - X^{SR}(i)\right\|^2 \,, \qquad \mathrm{PSNR} = 10\,\log_{10}\frac{L^2}{\mathrm{MSE}}$$
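For reference, both quantities can be computed in a few lines; a minimal sketch with NumPy, assuming 8-bit images so that L = 255:

```python
import numpy as np

def psnr(x: np.ndarray, x_sr: np.ndarray, dynamic_range: float = 255.0) -> float:
    """PSNR between a ground truth x and a reconstruction x_sr, in dB."""
    mse = np.mean((x.astype(np.float64) - x_sr.astype(np.float64)) ** 2)
    if mse == 0:
        return float("inf")  # identical images
    return 10.0 * np.log10(dynamic_range ** 2 / mse)
```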

On the other hand, SSIM [16] incorporates three components that are more independent: luminance, contrast and structure. The equation to calculate the SSIM between two images is as follows:

$$\mathrm{SSIM}(X, X^{SR}) = \frac{(2\,\mu_X\,\mu_{X^{SR}} + C_1)(2\,\sigma_{X X^{SR}} + C_2)}{(\mu_X^2 + \mu_{X^{SR}}^2 + C_1)(\sigma_X^2 + \sigma_{X^{SR}}^2 + C_2)}$$

where $C_1$ and $C_2$ are constants to avoid instability. $\mu_X$ and $\sigma_X$ represent the mean (luminance) and the standard deviation (contrast) of the ground truth, the mean and standard deviation of the reconstructed image $X^{SR}$ are denoted as $\mu_{X^{SR}}$ and $\sigma_{X^{SR}}$ respectively, and $\sigma_{X X^{SR}}$ is the covariance between $X$ and $X^{SR}$.

PSNR measures pixel-wise differences, while SSIM measures the structure of the images. Both metrics work well for evaluating degradations and distortions in images, but algorithms optimized with these metrics as targets tend to generate images that define edges and shapes well yet are over-smoothed and lack the high-frequency details (textures) needed to look natural to human observers.

They are very simple, shallow functions that fail to account for many nuances of human perception, producing scores that do not correlate with the perceptual judgements human beings make when assessing the similarity of two images.

Recent CNN architectures have made use of pretrained classification networks such as VGG and AlexNet [17] to extract feature maps and calculate the distance (loss) between the target images and the models' results, becoming more perceptually accurate and capable of reconstructing more detailed images, but usually exhibiting a tradeoff in the PSNR/SSIM metrics.

This perception-distortion tradeoff is evaluated in detail in [7], where the authors describe how distortion metrics are at odds with the human Mean Opinion Score (MOS), which quantifies perceptual quality, meaning that as distortion-metric performance increases, the perceptual results degrade.

Furthermore, while PSNR and SSIM can be calculated directly, "perceptual quality" is instead usually estimated through user studies like MOS, which leads to results that are often incomplete, inaccurate and hard to reproduce. As of 2020, how to objectively evaluate perceptual-quality algorithms remains an open problem.

In [8], the authors not only demonstrated that perceptual similarity is an emergent property of deep networks, but also developed a metric and loss function based on the linear calibration of those classification networks, named "Learned Perceptual Image Patch Similarity" (LPIPS), which is much closer to the results of evaluations by humans. The tests were made using different types of distortions to calculate the distance between the images, which is relevant to multiple CV cases considered in this work. In Figure 2, the relationship the authors found between different metrics and human perception can be observed.

Figure 2: Quantitative comparison of evaluation metrics (from "The Unreasonable Effectiveness of Deep Features as a Perceptual Metric", IEEE/CVF Conference on Computer Vision and Pattern Recognition, Salt Lake City, UT, 2018, pp. 586-595; https://richzhang.github.io/PerceptualSimilarity/)

In this project, the objective is to evaluate whether it is possible to increase the perceptual performance of the final model; LPIPS is used as a perceptual proxy to estimate the changes in the results, decoupling the evaluation from user-study methods.
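Computing LPIPS is straightforward with the authors' lpips package; a minimal sketch, where random tensors stand in for a real image pair:

```python
import torch
import lpips

# LPIPS distance using AlexNet features, as proposed in [8].
loss_fn = lpips.LPIPS(net="alex")

# Inputs are (N, 3, H, W) tensors scaled to [-1, 1]; lower means more similar.
img0 = torch.rand(1, 3, 128, 128) * 2 - 1
img1 = torch.rand(1, 3, 128, 128) * 2 - 1
print(loss_fn(img0, img1).item())
```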

2.3 Convolutional Neural Networks for Image Super-Resolution

For some years, the main use of neural network-based techniques like Convolutional Neural Networks (CNNs) in relation to images was for high-level tasks, such as classifying images based on their content and detecting or recognizing objects and subjects in images.

However, with the advances in techniques and technologies, image processing has become one of the main use cases for deep learning, as it allows avoiding the traditional, explicit a priori statistical modeling of signal corruptions used in CV; instead, the networks learn to map degraded or downscaled observations to clean or high-resolution versions of the same images.

CNNs have been successfully adopted to solve ill-posed low-level image tasks such as Single Image Super-Resolution (SISR), with the seminal work SRCNN [1] being one of the earliest implementations of CNNs for this task. SRCNN learns an end-to-end non-linear mapping between the low- and high-resolution image domains, achieving much higher restoration quality than basic interpolation algorithms.

The mapping functions in SRCNN are learned from external low- and high-resolution exemplar pairs, which form the training dataset. Given $I^{HR}$ as the high-resolution images and $I^{LR}$ as their low-resolution counterparts, the super-resolved image $I^{SR}$ is formulated as:

$$I^{SR} = N_\theta(I^{LR})$$

where $N_\theta$ corresponds to the feedforward SRCNN with parameters $\theta$. After completing its training, the model $N_\theta$ can be used to upscale images that were not part of the training.

Even with the increase in performance gained with the use of CNNs like SRCNN for super-resolution, the architectures were oriented towards the traditional distortion metrics, using loss functions that optimize the models in pixel space, like the Mean Squared Error (MSE) loss. In other words, $I^{SR}$ and $I^{HR}$ are compared pixel by pixel to calculate the distance and update the model after each iteration. The pixel-wise MSE loss is calculated as:

$$l^{SR}_{MSE} = \frac{1}{r^2 W H}\sum_{x=1}^{rW}\sum_{y=1}^{rH}\left(I^{HR}_{x,y} - G_{\theta_G}(I^{LR})_{x,y}\right)^2$$

This calculation is done directly in the images' pixel space, with $W$ and $H$ representing the dimensions of the LR image and $r$ the upscaling factor. This classic MSE loss correctly captures the low-frequency information of the images, which corresponds to colors and shapes, but the results tend to be blurry, since the high-frequency information is lost in the reconstruction.

2.4 Generative Adversarial Neural Networks (GAN)

Based on SRCNN, new strategies have been developed to keep increasing the performance of CNN-based SISR models, most of which rely on designing very deep networks with millions of parameters. However, state-of-the-art results for image generation networks have been achieved in combination with another strategy, namely the use of Generative Adversarial Neural Networks (GANs) [4].

GANs are a class of algorithms implemented by a system of two neural networks that are trained simultaneously in lock-step and contest with each other in a min-max game framework. The two networks are a generative model G, which aims to minimize the distance between the distribution of generated images and real images, and a discriminative model D, which estimates the probability that a sample came from the real training data.


In the case of image super-resolution, the first work making use of GANs for the task was SRGAN [18], where the formulation is that, given a feed-forward CNN generator network $G_{\theta_G}$ with parameters $\theta_G = \{W_{1:L}; b_{1:L}\}$ (the weights and biases of a network with $L$ layers) and a discriminator network $D_{\theta_D}$, the adversarial min-max problem is defined as:

$$\min_{\theta_G} \max_{\theta_D} \; \mathbb{E}_{I^{HR} \sim p_{\mathrm{train}}(I^{HR})}\!\left[\log D_{\theta_D}(I^{HR})\right] + \mathbb{E}_{I^{LR} \sim p_{G}(I^{LR})}\!\left[\log\!\left(1 - D_{\theta_D}\!\left(G_{\theta_G}(I^{LR})\right)\right)\right]$$

where $I^{HR}$ and $I^{LR}$ correspond to the high-resolution images and their low-resolution input counterparts, which are used to produce the super-resolved image $I^{SR} = G_{\theta_G}(I^{LR})$. This specific use of a GAN is called a "conditional GAN", since the generator uses the low-resolution image as the condition to generate the synthesized images, instead of relying on the purely random latent vectors of the original GAN definition.

In this formulation, D is trained to estimate the probability that an image originates from the real high-resolution domain rather than being a fake image generated by G, which in turn is optimized to produce images similar to those of the target domain, encouraging solutions that reside in the manifold of natural images instead of relying only on traditional pixel-wise losses such as MSE or L1. The entire conditional GAN framework for this specific case of single image super-resolution can be seen in Figure 3.

Figure 3: Conditional GAN framework for SISR

Given the unstable nature of the min-max problem, GANs are typically difficult to train, suffering from problems such as infinite or not-a-number (NaN) outputs and mode collapse, and they require special considerations when used for training. One of these considerations is that the loss function for SRGAN includes three components: a pixel-wise loss (MSE), a feature loss, and an adversarial loss.

While the MSE loss component on its own tends to produce blurry results, it is able to capture low-frequency data correctly, and it helps keep the system more stable than depending only on adversarial components.

The feature loss uses a third, pretrained CNN, the VGG19 convolutional network for image recognition [19], to extract high-level feature representations of both $I^{SR}$ and $I^{HR}$ and calculate the loss as the Euclidean distance between these feature representations in "feature space", instead of the pixel-wise distance, as follows:

$$l^{SR}_{VGG/i,j} = \frac{1}{W_{i,j} H_{i,j}} \sum_{x=1}^{W_{i,j}} \sum_{y=1}^{H_{i,j}} \left(\varphi_{i,j}(I^{HR})_{x,y} - \varphi_{i,j}\!\left(G_{\theta_G}(I^{LR})\right)_{x,y}\right)^2$$

where $W_{i,j}$ and $H_{i,j}$ represent the dimensions of the respective feature maps and $\varphi_{i,j}$ indicates the extracted feature map obtained by the j-th convolution (after activation) within the VGG19 network. This loss encourages $I^{SR}$ to have a feature representation similar to that of $I^{HR}$, which correlates with high-frequency details instead of per-pixel similarity, as demonstrated in [20].
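A sketch of such a feature loss in PyTorch is shown below; the layer cut-off and the assumption of ImageNet-normalized inputs are illustrative choices, with a frozen torchvision VGG19 standing in for the fully trained network of the papers:

```python
import torch
import torch.nn as nn
from torchvision import models

class VGGFeatureLoss(nn.Module):
    """Euclidean distance between VGG19 feature maps, in the style of the
    SRGAN feature loss. Slicing at index 36 keeps features after the last
    activation; ESRGAN instead evaluates features before activation."""

    def __init__(self, cutoff: int = 36):
        super().__init__()
        vgg = models.vgg19(weights=models.VGG19_Weights.IMAGENET1K_V1)
        self.features = vgg.features[:cutoff].eval()
        for p in self.features.parameters():
            p.requires_grad = False  # the VGG network stays frozen

    def forward(self, sr: torch.Tensor, hr: torch.Tensor) -> torch.Tensor:
        # sr and hr are assumed to be ImageNet-normalized (N, 3, H, W) tensors.
        return torch.mean((self.features(sr) - self.features(hr)) ** 2)
```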

Finally, the generator adversarial loss component is defined as:

$$l^{SR}_{Gen} = \sum_{n=1}^{N} -\log D_{\theta_D}\!\left(G_{\theta_G}(I^{LR})\right)$$

where $D_{\theta_D}(G_{\theta_G}(I^{LR}))$ is the probability that the reconstructed image belongs to the real image domain. The adversarial loss is back-propagated to both networks, allowing the generator to learn the high-frequency details that the discriminator uses to determine which domain the images belong to.

"Enhanced Super-Resolution Generative Adversarial Network" (ESRGAN) [5] is a state-of-the-art CNN architecture that builds upon the developments made in SRCNN [1] and SRGAN [18] and focuses on generating images with realistic textures, introducing modifications that increased the perceptual performance for SISR tasks. This architecture was the winner of the "2018 Perceptual Image Restoration and Manipulation (PIRM) challenge on perceptual super-resolution (SR)" [6] at ECCV 2018, which was specifically designed to target the perception-distortion tradeoff in CV tasks.

ESRGAN uses the same framework as SRGAN, but introduces some modifications to obtain its results. Like SRGAN, it uses a perceptual loss based on the fully trained VGG19 classification network to optimize the model in feature space instead of pixel space. This optimizes the network to synthesize realistic textures and fine details rather than pixel-accurate reconstructions of the HR images. In contrast to SRGAN, ESRGAN evaluates the feature loss before the activation layers, to improve the information obtained from the loss.

It also employs a Generative Adversarial Network (GAN) to encourage the network to favor solutions that look more like natural images. In particular, ESRGAN uses a Relativistic Average GAN [21] which calculates relative realness, instead of an absolute “true” or “false” value for real or fake images respectively as is the case for the vanilla GAN definition used in SRGAN.

ESRGAN uses a deeper fully-convolutional neural network than SRGAN, built from Residual-in-Residual Dense Blocks (RRDB) without batch normalization layers (which otherwise occasionally introduce BN artifacts between iterations), for superior performance and ease of training. The residual information is used to further incorporate the LR image prior deep in the network, allowing the framework to focus on recovering high-frequency texture details. Figure 4 shows the ESRGAN network architecture and its basic building block, the Residual-in-Residual Dense Block (RRDB).

Another step taken to further stabilize the GAN training and obtain more visually pleasing results is that, before using the GAN framework and feature loss, a model is first trained using only the pixel loss, for which ESRGAN employs the L1 loss instead of MSE as in SRGAN. This results in a model that is optimized for the PSNR distortion metric. This model is trained to convergence and used as initialization for the training session that incorporates the feature and adversarial losses, so the generator avoids undesired local optima and the discriminator receives valid super-resolved images from the beginning of its training, instead of extremely fake ones.

Figure 4: ESRGAN architecture and its Residual-in-Residual Dense Block (RRDB) (from "ESRGAN: Enhanced Super-Resolution Generative Adversarial Networks", in Computer Vision – ECCV 2018 Workshops, Lecture Notes in Computer Science, vol. 11133, pp. 63-79, 2019; https://link.springer.com/chapter/10.1007%2F978-3-030-11021-5_5)

The ESRGAN authors found that the model trained with perceptual losses results in images with artifacts that are not appealing to the human eye. This effect happens with other GAN-based methods as well and is typically addressed by interpolating eight versions of the same image, rotated and flipped in different directions and upscaled with the same model, known as the "geometric ensemble" [22] [23]. This method has been used extensively in competitions to increase the PSNR and SSIM results, and it works because the average of the eight versions of the image produces a stronger signal where all the images agree and evens out the parts where they disagree, such as noise patterns or upscaling errors. However, in ESRGAN the authors identified that the resulting images can show artifacts and ghosting due to large discrepancies between the images.

Another method tested to reduce the unpleasant artifacts and noise in the upscaled images is a pixel-by-pixel interpolation between the output images of the pretrained PSNR model, which used only a pixel-wise (L1) loss in Euclidean space, and the final perceptual model, which used the feature and adversarial losses, following this formulation:

$$I_G^{INTERP} = (1-\alpha)\, I_G^{PSNR} + \alpha\, I_G^{GAN}$$

where $I_G^{PSNR}$ and $I_G^{GAN}$ represent the super-resolved results of the PSNR and the perceptual models respectively, and $I_G^{INTERP}$ is the resulting interpolated image. However, the authors found that this produces images that either introduce more artifacts or are too blurry, which degrades their quality.

An alternative interpolation strategy was devised: instead of interpolating the resulting images from the models, the parameters of the final perceptual-oriented model and the PSNR model are interpolated to obtain a single model. This combination effectively increases the PSNR and reduces the artifacts in exchange for some perceptual performance, but the results fare better with human judges than the geometric self-ensemble, the interpolated output images, or the fully perceptual model.

The interpolation is done by taking all the corresponding parameters of the two networks in a linear interpolation:

$$\theta_G^{INTERP} = (1-\alpha)\,\theta_G^{PSNR} + \alpha\,\theta_G^{GAN}$$

where $\theta_G^{INTERP}$, $\theta_G^{PSNR}$ and $\theta_G^{GAN}$ are the parameters of the final interpolation, the PSNR-oriented model and the perceptual-oriented model respectively, and $\alpha \in [0, 1]$ is the interpolation parameter. This technique allows fine-tuning the perceptual and distortion components without having to train new models with different weight parameters for the losses. It also demonstrates that, while the results obtained with ESRGAN are state-of-the-art, the final model had to be manually modified to obtain better perceptual results.
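This network interpolation can be reproduced in a few lines of PyTorch; a minimal sketch, with hypothetical checkpoint file names:

```python
import torch

def interpolate_networks(psnr_path: str, gan_path: str, alpha: float) -> dict:
    """theta_interp = (1 - alpha) * theta_psnr + alpha * theta_gan, applied to
    every corresponding parameter of two checkpoints of the same architecture."""
    theta_psnr = torch.load(psnr_path, map_location="cpu")
    theta_gan = torch.load(gan_path, map_location="cpu")
    return {k: (1.0 - alpha) * theta_psnr[k] + alpha * theta_gan[k]
            for k in theta_psnr}

# Hypothetical file names; alpha = 0.8 keeps most of the perceptual behavior
# while reducing some of the artifacts.
interp = interpolate_networks("RRDB_PSNR_x4.pth", "RRDB_ESRGAN_x4.pth", 0.8)
torch.save(interp, "interp_0.8.pth")
```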

Additionally, while perceptual losses were used, no perceptual metric was incorporated during the network training to evaluate its results, leaving space to explore alternatives that search for more automatic optimizations, such as ensemble learning.

2.5 Ensemble Learning

The main concept behind the ensemble methodology is to train and evaluate several, mostly weak, individual classifiers and combine them to obtain a single result that outperforms every individual learner [24] [25], obtaining better predictive performance.

Ensembles of traditional machine learning models for structured data, be it for classification or regression, typically have outputs whose aggregation can be solved by averaging (mean) the results for continuous values, or by taking the mode of the values if these are categorical. Furthermore, storing a full ensemble of very simple learners doesn't require many resources, so it can be maintained for inference in its entirety.

In the case of neural networks and CNNs for high-level tasks like classification, regression or object detection, since the final results typically go through a fully connected layer (FCL) to evaluate the target labels, it is still possible to consider the traditional ensemble approach as well [26].


Several specific cases of neural network ensembles are evaluated and compared in [10], and one of the main conclusions from this work is that ensemble methods for neural networks are most beneficial when using a diverse set of individual learners. Moreover, a combination of the decisions of several individual neural network learners is only useful if they disagree on the decision for at least some of the inputs [27], given that if all the learners agree, there is no more information to be gained from the combination than from using a single one of them. In [28], the authors also devised a novel weighted ensemble learning method that specifically maximizes the diversity and the individual accuracy of each model, improving performance over other methods. Introducing diversity to the individual learners is an important consideration in the design of the experiment in this work, as it is in the traditional ensemble as well as the Deep Ensemble literature [29], in order to de-correlate each learner's predictions so that each one provides non-redundant information to the final aggregation.

Additionally, in recent works like [12] and [30], an important observation is made by the researchers: in a neural network ensemble, it is possible to obtain better results in regression and classification by ensembling a subset of the available basic learners instead of all of them, mainly because not all of the basic learners contribute new information to the ensemble. This means that some of the original models can be discarded in the combination process of an ensemble to obtain the best performance.

However, storing the totality of the ensemble for use during inference becomes a limitation in the case of CNNs, as the storage and memory requirements of these networks are larger than those of the more traditional machine learning models used in classic ensemble learning. For that reason, alternative strategies to combine the independent models have to be evaluated.

2.6 Model Combination Strategies

While classic Ensemble Learning techniques face limitations when considering GAN networks for image super-resolution, they serve as a framework for exploring alternative strategies that could be better suited to this specific case. One of these techniques is the stacked weighted ensemble, which can be employed in the models' parameter space instead of on their results, as would be the classic case. A canonical stacked weighted ensemble formulation for $N$ models is:

$$\theta_f = \sum_{j=0}^{N} \omega_j\,\theta_j$$

where $\theta_j$, $\omega_j$ and $\theta_f$ represent the parameters of the models being combined, the weight coefficient assigned to each of them in the combination, and the final combined model respectively. The final combination has to result in a magnitude equivalent to that of any one of the original models, a condition defined by the L1 norm of the coefficients. This condition introduces an additional constraint:

$$0 \le \sum_{j=0}^{N} \omega_j \le 1$$

which defines that the sum of all the coefficients lies in the [0, 1] range. This in turn also sets the minimum and maximum values for each individual coefficient to the same range, with 0 indicating no contribution and 1 indicating 100% contribution to the final model. These weight coefficients can be written as a vector of the form:

$$\bar{\omega} = [\omega_0, \omega_1, \omega_2, \omega_3, \ldots, \omega_N]$$

The coefficients in this formulation have to be defined by optimizing the combination for a specific objective, which in the case of this work is the combination of the models that results in the best perceptual performance, measured with the LPIPS metric.
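As a reference for the optimization strategies discussed next, the combination itself can be sketched as a weighted sum of PyTorch state dictionaries; the checkpoint file names are hypothetical:

```python
import torch

def combine_models(paths: list[str], omega: list[float]) -> dict:
    """theta_f = sum_j omega_j * theta_j over N checkpoints that share the
    same architecture (and, as discussed earlier, a common seed model)."""
    assert len(paths) == len(omega)
    assert 0.0 <= sum(omega) <= 1.0, "coefficients must satisfy the L1 norm"
    states = [torch.load(p, map_location="cpu") for p in paths]
    return {k: sum(w * s[k] for w, s in zip(omega, states))
            for k in states[0]}

theta_f = combine_models(
    ["model_0.pth", "model_1.pth", "model_2.pth", "model_3.pth"],
    [0.25, 0.25, 0.25, 0.25],  # equal contribution from the four models
)
```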

There are different methods that can be employed to optimize this formulation and, for the purpose of this work, they can be grouped in two blocks: the first directly calculates the gradient of the function that creates the weighted average of multiple models (Gradient Descent Optimization), and the second makes use of a meta-learning strategy that wraps the function as a black box and evaluates the model combinations as they are created, using search algorithms to explore the weight space.

Both options are able to produce optimal candidate combinations, and the methods considered in this work are explained next.

2.6.1 Gradient Descent Optimization

Gradient descent is an iterative optimization algorithm for finding a local minimum of a convex function $F(x)$ differentiable in the neighborhood of a point $\rho$, taking guided steps that are proportional to the negative of the gradient of the function [31]. The algorithm can be expressed with the following formula:

$$\rho_{n+1} = \rho_n - \alpha\,\nabla F(\rho_n)$$

where $\rho_n$ represents the current point at step $n$, $\rho_{n+1}$ is the next position, $\nabla F$ is the gradient of the function giving the direction of fastest ascent, and $\alpha \in \mathbb{R}^+$ is a constant representing the learning rate. The term $\alpha\,\nabla F(\rho_n)$ is subtracted from $\rho_n$ in order to move in the direction of fastest descent, which gives the algorithm its name.

For the particular case of this work, the function $F(x)$ to be optimized is the combined model, parametrized by the coefficient vector $\bar{\omega}$ and evaluated with the perceptual LPIPS metric.

Multiple executions of these fine-tuning gradient descent optimizations have to be run to validate the convergence of the strategy. These can result in different combination vectors when the optimization reaches different minima, and the resulting vectors (leading to different combinations of the models) can then be evaluated for generalization capacity on the unseen data of the evaluation datasets.
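A rough, self-contained sketch of this optimization is shown below, using torch.func.functional_call (PyTorch 2.x); the toy upscaler, random state dictionaries and random image pair are stand-ins for the real ESRGAN models and fine-tuning data, and the softmax parametrization (positive coefficients summing to 1) is one possible way of enforcing the L1-norm constraint:

```python
import torch
import torch.nn as nn
import lpips
from torch.func import functional_call

# Stand-ins for the real setup: a toy 4x upscaler instead of ESRGAN, random
# state dicts instead of the four trained models, random tensors as images.
generator = nn.Sequential(nn.Conv2d(3, 48, 3, padding=1), nn.PixelShuffle(4))
states = [{k: torch.randn_like(v) for k, v in generator.state_dict().items()}
          for _ in range(4)]
lr_batch = torch.rand(1, 3, 32, 32)
hr_batch = torch.rand(1, 3, 128, 128)

omega = torch.full((4,), 0.25, requires_grad=True)  # equal weights to start
optimizer = torch.optim.Adam([omega], lr=1e-2)
lpips_fn = lpips.LPIPS(net="alex")

for step in range(100):
    w = torch.softmax(omega, dim=0)  # positive coefficients summing to 1
    combined = {k: sum(w[j] * states[j][k] for j in range(4))
                for k in states[0]}
    sr = functional_call(generator, combined, (lr_batch,))
    loss = lpips_fn(sr * 2 - 1, hr_batch * 2 - 1).mean()  # LPIPS wants [-1, 1]
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
```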

2.6.2 Search Algorithms

An alternative set of options to find optimal coefficient combinations is to search the combination space with methods that do not require the gradient of the problem being optimized and do not assume that the function to optimize is convex in nature [32]. These types of algorithms are frequently used in the hyper-parameter search of machine learning models. In this case, the combination can be made by using a meta-learner, similar to a classical Stacked Ensemble, with the hyper-parameters to search being the combination coefficients. The meta-learner searches different combinations of the parameters, evaluates the results of the combined model with fine-tuning images and registers the results from the performance metrics.

Figure 5: Grid and Random search of nine trials (from "Random Search for Hyper-Parameter Optimization", Journal of Machine Learning Research 13 (2012) 281-305; http://www.jmlr.org/papers/volume13/bergstra12a/bergstra12a.pdf)

These algorithms can be used to explore the combination space either by sampling the entirety of this space or with predefined combinations of the coefficients. In order to search for the optimal coefficients, we can refer to some traditional algorithms that are also used for hyper-parameter search in machine learning. The search strategies considered in this work are briefly explained next.

2.6.2.1 Random Search

Random search optimization is used to sample the entirety of the combination space, with no particular restriction on the search other than that the coefficient vector has to comply with the L1 norm.

The pseudo-algorithm for random search repeats the following steps until a termination criterion is met:

1. Initialize the $\bar{\omega}$ coefficient vector with values that comply with the L1 norm
2. Evaluate the results produced with the model combination of the initialized vector using the LPIPS metric

The expected average coefficient as the number of runs $n \to \infty$ is the unbiased result defined as:

$$\omega_n = \frac{1}{N}\sum_{i=0}^{N}\omega_i$$

which in the case of four coefficients would be $\bar{\omega} = [0.25, 0.25, 0.25, 0.25]$. While random search can require a very long time to obtain a sufficiently large sample of the search space, it is particularly efficient at exploring this space, with a high probability of finding optimal points, as can be seen in Figure 5.
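A minimal sketch of this loop, assuming a hypothetical score_fn(omega) helper that builds the combined model from a coefficient vector and returns its LPIPS score on the fine-tuning images:

```python
import numpy as np

def random_search(score_fn, n_models: int = 4, n_trials: int = 200):
    """Sample coefficient vectors that satisfy the L1-norm constraint and
    keep the one with the lowest LPIPS score (lower is better)."""
    rng = np.random.default_rng(seed=0)
    best_omega, best_score = None, float("inf")
    for _ in range(n_trials):
        omega = rng.dirichlet(np.ones(n_models))  # positive, sums to 1
        score = score_fn(omega)
        if score < best_score:
            best_omega, best_score = omega, score
    return best_omega, best_score
```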



2.6.2.2 Grid Search

Another approach to sampling the coefficient space is Grid Search. Unlike random search, grid search relies on a manually specified subset of the search space; as the size of this subset $s \to \infty$, the formulation of grid search becomes equivalent to that of random search. Using a list of candidate coefficient values between 0 and 1, a grid of all possible combination vectors is generated. This process is a Cartesian product.

Similar to random search, the pseudo-algorithm for grid search starts with the first possible combination vector in the grid and repeats the following steps until the grid is exhausted:

1. Select the next combination vector, in order, from the grid
2. Evaluate the results produced with the model combination of the selected vector using the LPIPS metric

One limitation of Grid Search is that it is restricted to exactly the set of combinations defined by the candidate values, so any potential optimal point outside of that grid will be missed, as exemplified in Figure 5.
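The corresponding sketch for grid search, with the same hypothetical score_fn and a manually specified candidate list:

```python
from itertools import product

def grid_search(score_fn, candidates=(0.0, 0.25, 0.5, 0.75, 1.0), n_models=4):
    """Evaluate the Cartesian product of candidate values, keeping only the
    coefficient vectors that satisfy the L1-norm constraint."""
    best_omega, best_score = None, float("inf")
    for omega in product(candidates, repeat=n_models):
        if not 0.0 < sum(omega) <= 1.0:  # discard invalid vectors
            continue
        score = score_fn(omega)
        if score < best_score:
            best_omega, best_score = omega, score
    return best_omega, best_score
```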

2.6.2.3 Bayesian Optimization

Bayesian Optimization (BO) is an optimization method, also evaluated here to find optimal points of a function $F(x)$, that can be used when the optimization target is not known, making it appropriate in this case where the resulting metrics are not known a priori [32]. BO holds a probabilistic belief about $F(x)$ and makes use of an inexpensive acquisition function that represents the expected loss associated with evaluating $F(x)$ at the next candidate point, using Bayesian inference and a Gaussian Process (GP) to search for the optimum of a black-box function in as few iterations as possible.

BO chooses a Gaussian process prior:

$$p(F) = \mathcal{GP}(F;\, \mu;\, K)$$

where $\mu$ is the mean and $K$ the covariance of $F$ respectively, and the distribution is conditioned on the observations $D = (X, F)$ as:

$$p(F \mid D) = \mathcal{GP}(F;\, \mu_{F|D};\, K_{F|D})$$

With these observations, the inexpensive acquisition function $a(x)$ is designed in such a way that it can be optimized to select the location of the next observation.

BO iteratively evaluates candidate parameter configurations, which in this case are coefficient vectors, updating the configuration to gather observations that further sample the function and reveal as much information as possible, in particular the location of the optimum points.

The BO process balances exploration of the function, in order to reduce uncertainty, and exploitation, which consists of evaluating the function near the best results of the exploration to find the optimum.

BO has been used to find optimal machine learning training hyper-parameters [33]; in this case, the black-box function is the evaluation of the combination of the models, the target to optimize is the LPIPS perceptual metric, and the parameters are the coefficients in the vector $\bar{\omega}$.
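A sketch using scikit-optimize's Gaussian-process minimizer; score_fn is the same hypothetical objective as in the previous sketches, and the L1-norm constraint on the coefficients would have to be handled separately, for example by penalizing invalid vectors inside score_fn:

```python
from skopt import gp_minimize

# score_fn builds the combined model from a coefficient vector and returns
# its LPIPS score on the fine-tuning images (lower is better).
result = gp_minimize(
    func=score_fn,
    dimensions=[(0.0, 1.0)] * 4,  # one [0, 1] interval per coefficient
    n_calls=50,                   # evaluation budget
    random_state=0,
)
print(result.x, result.fun)  # best coefficient vector and its score
```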


2.7 Related Work

In addition to the previously referenced works, there are also alternative ways in which machine learning models, and more specifically deep learning models for image tasks, have been ensembled in the past.

In the case of generative CNNs for low-level CV tasks, traditional ensemble methods are infrequently used. This is mainly due to the type of output of the models (images) and the large amount of resources required to store all the trained models of an ensemble for inference.

However, concepts from the theory can be used to create alternative strategies for building ensembles or combinations of models in ways that are relevant to CNNs for CV tasks, each with its own advantages and limitations. These alternatives can be applied at different points of the process, including during or after training. Some of these options are visualized in Figure 6.

Figure 6: Relevant CNN ensembling strategies

The alternatives and some of their characteristics are described in the following subsections.

2.7.1 Classic ensemble for regression or classification tasks

This technique is the classic ensemble learning strategy, where several different models receive the same input and their results are combined by aggregating the numerical (regression) or categorical (classification) predictions of the models at inference time.

The aggregation can be made using simple voting mechanisms, an average of the results (weighted or not), or with another model that learns to predict the best value or label based on the results of the models in the ensemble (meta-learning). This alternative retains all the original learners, and the ensemble is made in the model space. An evaluation of its effectiveness for image classification tasks using deep CNNs is explored at length in [10], but considering the case of very deep networks such as ESRGAN and the hardware resources they would require at inference, this technique becomes impractical for the purpose of combining multiple models.
