
AUTOREGRESSIVE DENSITY

ESTIMATION IN LATENT SPACES

by

ALBERTO OLMO HERNÁNDEZ
B.S., Universitat Autònoma de Barcelona, 2016

A thesis submitted to the Graduate Faculty of the University of Colorado Colorado Springs

in partial fulfillment of the requirements for the degree of

Master of Science

Department of Computer Science 2017


Thesis for the Master of Science degree by Alberto Olmo Hernández

has been approved for the Department of Computer Science

by

Jonathan Ventura, Chair
Jugal Kalita
Terrance E. Boult


Olmo Hernández, Alberto (M.S., Computer Science)
Autoregressive Density Estimation in Latent Spaces
Thesis directed by Assistant Professor Jonathan Ventura.

ABSTRACT

We propose an extension of recent autoregressive density estimation approaches such as PixelCNN and WaveNet that models the density of a latent variable space rather than the output space. In other words, we propose a model which will sequentially generate an encoded version of the output and then decode it to produce the final output. By operating over an encoded representation of the output space, we can significantly speed up the sample generation process, thus enabling higher-resolution generation in an equivalent amount of time. Our experiments show that we can obtain good quality image synthesis results in standard image datasets by applying a Variational Autoencoder to pixel blocks independently. Thus, the PixelCNN model was fed with blocks of pixels encoded in their latent space. Our approach is orthogonal to other autoregressive density estimation extensions such as the recent “parallel multiscale” approach and in the future they could ultimately be merged together.


ACKNOWLEDGEMENTS

I want to thank my advisor Dr. Jonathan Ventura for his unconditional help throughout the whole thesis, and the committee for their constructive contributions to the project.


TABLE OF CONTENTS

CHAPTER

I INTRODUCTION
    Introduction

II RELATED WORK
    Related Work
    Variational Autoencoders
    Generative Adversarial Networks
    VAEs and GANs mixed
    Autoregressive models
        PixelRNN
        Multi-Scale PixelRNN
        Row LSTMs
        Diagonal BiLSTMs
        PixelCNN
    The problem

III METHOD
    Method
    Block VAE
    Block CNN
    Network final shape
    Evaluation
        Datasets
            CIFAR-10
            MNIST
            LFW
        Experimentation measurements
            Structural Similarity Measure
            Negative Log-Likelihood
        Preliminary results
        Experiment 1: Determining the number of epochs
        Experiment 2: Testing BlockCNN
        Experiment 3: Measuring performance with MNIST
        Experiment 4: Measuring performance with CIFAR-10
        Experiment 5: Measuring performance with LFW

IV CONCLUSIONS
    Conclusions


LIST OF FIGURES

FIGURE

2.1 Illustration of an encoder and a decoder
2.2 Architecture and example output of a VAE
2.3 GAN architecture and output example
2.4 PixelRNN and PixelCNN architectures
3.1 Illustration of pixel grouping examples
3.2 Architecture of our proposed solution
3.3 Preliminary results of BlockVAE
3.4 BlockVAE outputs with different epochs
3.5 CIFAR-10 frogs outputs from BlockCNN model
3.6 BlockVAE and BlockCNN samples with different block sizes
3.7 CIFAR-10 outputs of BlockVAE and BlockCNN
3.8 BlockVAE LFW samples
3.9 Example half and fully synthesized images using dataset LFW and different block sizes


LIST OF TABLES

TABLE

2.1 Differences between PixelRNN and PixelCNN
3.1 Timing, SSIM and NLL for different epochs with CIFAR-10
3.2 MNIST BlockVAE results
3.3 MNIST BlockCNN results
3.4 Final MNIST results
3.5 CIFAR-10 BlockVAE results
3.6 CIFAR-10 BlockCNN results
3.7 Final CIFAR-10 results


CHAPTER I

INTRODUCTION

Recent successes in generative modeling of images [1] and audio [2] employ an autoregressive density estimation approach where the distribution of each sample is conditioned on all previous samples. State-of-the-art approaches model the conditional probability distributions with a deep neural network. While such approaches are able to model complex output spaces, generation of synthetic examples is slow because each sample must be generated sequentially.

One idea to speed up the generation process is to attempt to generate multiple samples at each iteration. The recent “parallel multiscale” approach [3] does exactly that, generating equally spaced samples in parallel at each step. We propose a different and orthogonal idea, which is to generate a block of neighboring samples at each iteration. Modeling the conditional probability distribution of a block of samples as a set of discrete variables would be difficult because the output space grows exponentially with the number of samples. Instead, we propose to simultaneously learn a variational auto-encoder [4] such that the latent space representation of the block is well-modeled by a multivariate Gaussian. The way this works is by means of an encoder which brings the data from the high-dimensional input to a bottleneck layer, thus reducing the number of neurons to work with and speeding up the execution. To combine our approach with the parallel multiscale one, we could generate equally-spaced blocks of samples in


parallel.

Our research questions are: how well can we model natural images with BlockCNN, and what speedup can we achieve with it? From here, we hypothesize that there will be a tradeoff between sampling speed and sample quality, and that we can find the optimal balance between the block size and the quality of the generated images.

In the first part of this thesis we discuss the related work that has been done in this field and explain the background needed to understand our contribution: we describe how an Autoencoder and a Variational Autoencoder work and their similarities, showing their main objective, their weaknesses and strengths, and what these look like in terms of architecture. In the same fashion, we also explain what Generative Adversarial Networks are and compare them with the previously explained Variational Autoencoders, pointing out how GANs can outperform VAEs in some cases and vice versa. We then explain what autoregressive models are, followed by the PixelRNN and PixelCNN implementations and their variants. We end the section by explaining the main problem we are addressing and how we are going to approach it. Next, we describe the methodology we have followed: we present the BlockVAE and BlockCNN models which, combined, are our proposal to approach the problem. Then, in the evaluation section, we present the datasets we have used to experiment with our implementations, the measures used to evaluate them, and the experiments we have conducted along with their respective results. Finally, in the conclusion, we discuss our results and point out possible future work. The conclusions of this thesis are as follows:

• Our results show that the speedup we can achieve increases with the block size we choose.


• There is a trade off between the speedup we can obtain and the quality of the samples that are generated.

• The larger the input images are, the better our BlockVAE and BlockCNN models perform.


CHAPTER II

RELATED WORK

In machine learning, generative models are probabilistic models such that, given some training data with probability distribution p(x), the model is capable of generating samples similar to those it learned from. Currently there exist three main families of generative models: variational autoencoders or VAEs, Generative Adversarial Networks or GANs, and autoregressive models such as the PixelRNN and PixelCNN neural networks.

Variational Autoencoders

Figure 2.1: Illustration of an encoder and a decoder

Illustration of an encoder and a decoder. Left: encoder with its input data x and the hidden representation z. Right: decoder of z to x′. Note that there is a certain loss

when decoding (x ∼ x′).

An autoencoder [5] consists of two neural networks, an encoder and a decoder, and a loss function [6]. The input we would use for the encoder is a datapoint x. The output of the encoder will be a hidden representation which we can call z, with weights and biases denoted by α. As an example, let's assume x is a 32×32 pixel image; therefore x will be an array of 1024 positions. The encoder compresses this 1024-dimensional


space into a much more reduced one. This step in the process is also referred to as the bottleneck, where the encoder has to learn a way to compress x efficiently. Therefore the encoder can be denoted as q_α(z|x). Since the lower-dimensional space is stochastic, the encoder outputs the parameters of q_α(z|x). On the other hand, the decoder is another neural network which has z as input and outputs the parameters of the probability distribution of the data. It has weights and biases β and can be expressed as p_β(x|z). The decoder will get as input the latent representation z and will output one parameter for each of the 1024 pixels. During this process some of the information is lost, since the decoder must reconstruct the high-dimensional output from the much smaller latent representation.

In regards to variational autoencoders [6, 4], and in the same fashion as the aforementioned autoencoders, these are relatively new approaches to unsupervised learning of distributions. Variational autoencoders (or VAEs in short) are built on top of neural networks which can be trained with the so-called Stochastic Gradient Descent algorithm (a form of gradient descent optimization that minimizes an objective function written as a sum of differentiable functions). Variational autoencoders also possess two neural networks working as an encoder and a decoder. In regards to their functions, if we have x as a sample of data and z as its latent representation, that is to say, its encoded representation in latent space, these can encode and decode x (represented by Enc() and Dec() respectively) in the following manner:

z \sim \text{Enc}(x) = q(z|x), \qquad \tilde{x} \sim \text{Dec}(z) = p(\tilde{x}|z) \quad (2.1)

VAEs have already shown relatively good performance in generating complicated data like the MNIST dataset [4, 7] or CIFAR images [8]. However, when training generative models like these, the more complicated the dependencies are amongst


dimensions, the more difficult it is to train the model. An example of this would be a generated image from the CIFAR-10 dataset (a dataset with 10 different groups of images) where the top half of the image belongs to class C1 and the bottom half is

another, different class of this dataset. It would be reasonable to first make the model decide which image class to generate in advance. This decision is formally called a latent variable. That is, before our model generates anything, it will randomly sample a class value from the set of classes [C1, C2, ..., C10].

However, we must ensure that our model is truly representative of our dataset by making sure that for every image or datapoint in the dataset there is a latent variable that will generate something very similar to the original datapoint. We formally describe this by stating that, by means of a probability density function P, we are capable of generating a vector of latent variables z such that z ∈ Z, which we can easily sample from P(z) defined over Z. We therefore have the data X, the latent variable z, the probability distribution of the latent variable P(z), the probability distribution of the data P(X) and the distribution of generating data given a latent variable P(X|z). We will be able to model the data using the law of total probability in relation to the latent variable as follows:

P(X) = \int P(X|z) \, P(z) \, dz \quad (2.2)

If we only know P(X|z) and P(z), we still need to infer the posterior P(z|X), which is what makes our latent variable z likely under our data. The problem here is that inferring P(z|X) directly is intractable, which can be addressed by the Variational Inference method.
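Concretely, variational inference sidesteps the intractable posterior by introducing the approximate encoder distribution q(z|X) and maximizing the standard evidence lower bound (ELBO). The textbook form of that bound is shown below; its two terms correspond to the generative and latent losses discussed next:

```latex
\log P(X) \;\ge\;
\underbrace{\mathbb{E}_{z \sim q(z|X)}\big[\log P(X|z)\big]}_{\text{reconstruction (generative) term}}
\;-\;
\underbrace{D_{KL}\big(q(z|X)\,\|\,P(z)\big)}_{\text{latent term}}
```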

In regards to the loss term, two separate losses are summed up. The first one measures how well the network reconstructed the images by comparing the output


image with the original. This is also called the generative loss. The most common loss to use is the mean squared error function. This one goes element by element (pixel by pixel in the case of imagery) and computes the mean of all the added errors. Formally, if we have n elements and our original and reconstructed images are I_o and I_r respectively, we will have that:

mse = \frac{\sum (I_o - I_r)^2}{n} \quad (2.3)

The second loss function is the latent loss, which measures the error of the latent vector with respect to the unit Gaussian distribution. The error goes up if the latent vector does not stick to this kind of distribution. The most commonly used one is the Kullback-Leibler divergence [9]. Thus, for imagery, which has discrete probability density functions P and Q, the general Kullback-Leibler (KL) divergence is as follows:

D_{KL}(P \,\|\, Q) = \sum_i P(i) \log \frac{P(i)}{Q(i)} \quad (2.4)

Finally, when combining both the generative 2.3 and latent 2.4 functions, the network has to trade off, finding the best combination of low image loss (high similarity between input and output images) and low latent loss (unit Gaussian distribution of latent vectors). The latent loss only evaluates to zero, that is to say, it is only perfect, when the mean is 0 and the standard deviation is 1 (unit Gaussian). Therefore, the main idea is to treat this as an optimization problem by modeling p(z|x) using a Gaussian distribution and minimizing the difference using both loss functions [10]. Thus, it is vital to make an appropriate choice of similarity metrics, as they provide the main part of the training signal via the reconstruction error objective.
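As a concrete illustration of these two terms, the short NumPy sketch below computes the element-wise MSE of equation 2.3 and the discrete KL divergence of equation 2.4. The test arrays and the epsilon guard are illustrative only and not part of the thesis code.

```python
import numpy as np

# Reconstruction (generative) loss: mean squared error between two images (Eq. 2.3).
def mse(original, reconstructed):
    return np.mean((original - reconstructed) ** 2)

# Latent loss: discrete KL divergence between distributions P and Q (Eq. 2.4).
def kl_divergence(p, q, eps=1e-12):
    p = np.asarray(p, dtype=np.float64)
    q = np.asarray(q, dtype=np.float64)
    return np.sum(p * np.log((p + eps) / (q + eps)))

original = np.random.rand(32, 32)
reconstructed = original + 0.05 * np.random.randn(32, 32)
print("MSE:", mse(original, reconstructed))

p = np.array([0.7, 0.2, 0.1])
q = np.array([0.5, 0.3, 0.2])
print("KL(P || Q):", kl_divergence(p, q))
```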

Normally and as shown, element-wise metrics are used for this, however, they are not very suitable for imagery since they do not model the properties of the


Figure 2.2: Architecture and example output of a VAE

Left: Architecture of a Variational Autoencoder and its flow of information. The first image represents the input and the last one the output. Right: An output of a VAE trained on the MNIST dataset [12].

human visual perception. As an example, a small image translation would make the pixel-wise error very large, whereas in reality, and as seen from a human perspective, the change would hardly be noticeable. Furthermore, when the mean squared error metric is used in a VAE, it tends to produce blurry images [11], as shown in figure 2.2 (right). Therefore, if several likely samples x are mapped to the same encoding z (which happens unless the encoding q(z|x) is lossless), the posterior p(x|z) becomes highly non-Gaussian. This is basically where the fuzziness comes from: the mean of the best-fitting Gaussian is just some average of p(x|z).

In figure 2.2 we can see the flow of information and a sample output of a variational autoencoder. Initially, we input an image which will be flattened to a one-dimensional array and sent to an encoding convolutional network, which will compress it into two vectors, namely the mean and the standard deviation vectors. Once in this form, and given these arrays, we can apply a normal distribution and create the so-called latent space. This is done via the reparameterization trick stated in [4], where both the mean and standard deviation vectors are combined during training to get the latent one. This latent representation is so representative of the initial image that we will be able to recover it with a deconvolutional decoder network. However, due to the fact that the generative loss function is based on an average, there will also exist a certain loss between the input


image and the output one, thus, the latter will normally have a blurry effect in it. If we take a look at figure 2.2 again, the right image displays this blurriness effect given by the output of a VAE after being trained on the MNIST (handwritten digits) dataset.

Despite the blurriness they cause, the fact that we can use these element-wise metrics is also a strength of VAEs worth taking into account. Since they follow an encoding-decoding scheme, we can simply compare the original images to the output ones and see how well the model is performing. As we show in the next section, the same strategy cannot be applied to Generative Adversarial Networks.

Generative Adversarial Networks

Generative Adversarial Networks or GANs [13] are a relatively new neural network paradigm whose main concept lies in the competition of two neural network models. One of them takes the role of generator, getting noise as its input and generating samples from it. The second model is called the discriminator, and its purpose is to take the samples from the generator together with training data and discern between the real training data and the generated samples. Thus, the purpose of this architecture is to train both models simultaneously and ultimately make the generated samples indistinguishable from the real training data. An analogy of this paradigm is to think of the generative model as a team of counterfeiters whose purpose is to fake currency without being detected, while the discriminative model plays the role of the police trying to detect the counterfeit currency. The competition against each other's team will eventually make the counterfeits indistinguishable from the genuine ones. If we take a look at figure 2.3 (left), we can see the basic architecture of a GAN and its flow of information: as stated, the generator G will compete against the discriminator D and try to deceive it. Thereby, the generator will get an input


of random noise and create new image samples based on it, which later will be mixed with real sampled data and input to the discriminator. There is feedback between networks G and D that will make the generator more deceptive each time.

Nowadays, GANs have been applied to the modelling of natural images and are producing excellent results in imagery generation tasks. Moreover, Generative Adversarial Networks have proven to be better at generating sharper results with imagery than variational autoencoders [4] do. However, we cannot compare the output of a GAN directly to the input to see how well it performed, as GANs use random noise when sampling new imagery. This is in fact a weakness which, in contrast, variational autoencoders do not have. Thus, GANs suffer from the lack of a heuristic cost function, such as the element-wise independent mean squared error, which is attractive for representation learning. GANs can also be unstable to train and can result in generators that produce nonsensical outputs. Additionally, GANs can become very efficient at capturing the global statistics of the dataset they are working with, but not as good when trying to capture the details that make the samples look real to the human eye. In figure 2.3 (right) we can notice how at a quick glance the outputs seem very realistic, as they appear coherent, but if we look closely enough, we will notice that they do not represent real things but just some sort of harmony and consistency between their shapes and colors.

Due to this, some new approaches have been published to try to improve on this deficiency. In the Improved Techniques for Training GANs publication by OpenAI [14], they improve the effectiveness of GANs for semi-supervised learning by learning on additional unlabeled examples. They also propose a solution to the lack of an evaluation metric by introducing what they call the Inception score, which gives a basis for comparing the quality of the models. Finally, they claim to achieve state-of-the-art results on several datasets in computer vision.


Figure 2.3: GAN architecture and output example

Left: The two learned models (generator G and discriminator D) during the training process in a Generative Adversarial Network and its flow of information. Right: An example of the output of a GAN after 17,800 iterations. Code by Kevin Frans [16].

In a second implementation by Chen et al. named InfoGAN [15], they present a modification to the GAN architecture which encourages it to learn more meaningful representations. They do so by maximizing the mutual information between a small subset of the noise variables and the observations. They claim that although this implementation is simple, it is surprisingly effective, as it was able to discover hidden meaningful representations on several image datasets such as the MNIST digits, the CelebA faces and the SVHN house numbers datasets.

In conclusion, GANs are an effective way of creating new imagery from random data and are capable of obtaining sharper results than VAEs (whose outputs tend to be blurry due to the loss functions they use), but they still lack some of the detail needed to look realistic to the human eye, as well as a reliable evaluation metric.

VAEs and GANs mixed

Generative Adversarial Networks [13] are currently one of the best approaches for learning generative models. GANs allow training with large datasets in a relatively fast way and, when trained on imagery, they can produce visually compelling samples. However, they also come with some weaknesses. GANs lack stability in


optimization, which leads to the problem of mode collapse, where generated data does not reflect the diversity of the underlying data distribution. Nonetheless, there are GAN variants that aim to address these problems. Thus, autoencoders have already been merged with GANs to improve on these deficiencies. Plug and play generative networks (PPGNs) [17], variational auto-encoder GANs (VAE-GAN) [18] or the adversarial generator encoders (AGE) [19] are some examples of new approaches that merge VAEs with GANs.

Autoregressive models

An autoregressive model can be used to represent random processes. The output variable is linearly dependent on its own previous values and on an imperfectly predictable stochastic term. In a multiple regression model, we forecast the variable of interest using a linear combination of predictors; in an autoregression model, we forecast the variable of interest using a linear combination of past values of the variable itself, which is what the term autoregression indicates. An autoregressive model of order p, with constant c and noise w_t, will have the following form (extracted from [20]):

y_t = c + \phi_1 y_{t-1} + \phi_2 y_{t-2} + \dots + \phi_p y_{t-p} + w_t \quad (2.5)

This equation has the notation AR(p) and can also be expressed as:

X_t = c + \sum_{i=1}^{p} \phi_i X_{t-i} + w_t \quad (2.6)


Here [φ_1, . . . , φ_p] are the parameters of the time series, and if we change them we will vary the time series patterns.
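To make the role of the coefficients concrete, the following NumPy sketch simulates an AR(p) process as in equation 2.6; the particular coefficient values are illustrative only.

```python
import numpy as np

def simulate_ar(phi, c=0.0, n=200, noise_std=1.0, seed=0):
    """Simulate an AR(p) process X_t = c + sum_i phi_i * X_{t-i} + w_t (Eq. 2.6)."""
    rng = np.random.default_rng(seed)
    p = len(phi)
    y = np.zeros(n + p)  # p leading zeros serve as the initial history
    for t in range(p, n + p):
        # y[t - p:t][::-1] gives [y_{t-1}, ..., y_{t-p}]
        y[t] = c + np.dot(phi, y[t - p:t][::-1]) + rng.normal(0.0, noise_std)
    return y[p:]

# Changing the coefficients phi changes the time-series patterns.
series_a = simulate_ar([0.9])        # slowly varying AR(1)
series_b = simulate_ar([0.5, -0.4])  # oscillating AR(2)
```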

PixelRNN

Recurrent neural networks are a type of artificial neural networks which have loops or directed cycles in between units. Thus they are capable of exhibiting dynamic temporal behavior and internal memory. These are very powerful models capable of offering a compact shared parameterization of a series of conditional distributions and turn out to be very suitable for speech and handwriting recognition.

Autoregressive models based on deep neural networks have already been applied to the generation of imagery in the past [21], and LSTM layers have been the state of the art in performance when estimating large-scale datasets. The so-called Long Short-Term Memory or LSTM layers [22, 23] compute all the states along one spatial dimension of the data in each image. They were introduced by Hochreiter & Schmidhuber in 1997 [22] and have been refined over time [23, 24]. Given that LSTM layers process the image pixel by pixel, and taking into account the number of pixels in an image dataset, the required time to train the LSTM is huge. Given the hidden state H(i, j) of pixel p(i, j) we will have that:

H(i, j) = f\big(H(i-1, j),\; H(i, j-1),\; p(i, j)\big) \quad (2.7)

where f denotes the recurrent update. This means that until all previous hidden states of pixels have been computed, the hidden state of pixel p(i, j) cannot be calculated either. Therefore no parallelization is possible when computing the hidden states. In 2016, a new type of two-dimensional RNN was presented in the Pixel Recurrent Neural Networks publication [1], which applied bidimensional RNNs to the modelling of natural imagery. They presented the Row LSTM and the Diagonal BiLSTM, and with these two new and different architectures of Pixel Recurrent Neural Networks they achieved better performance when


using LSTMs in the processing of images.

Multi-Scale PixelRNN: The PixelRNN approach also has a Multi-Scale implementation. This one is composed of both conditional and unconditional PixelRNN networks. There can be one or more conditional PixelRNNs. Firstly, the unconditional RNN generates an s × s image which is subsampled from the original one. Then, the conditional network takes the subsampled image as an additional input and generates a larger one.

Row LSTMs: Row LSTMs are unidirectional layers that, by means of a one-dimensional convolution, process the imagery one row at a time from top to bottom, computing the features for a whole row at once. However, unlike the previous LSTMs, these compute the hidden state of pixel p in the following manner:

H(i, j) = f\big(H(i-1, j-1),\; H(i-1, j),\; H(i-1, j+1),\; p(i, j)\big) \quad (2.8)

A graphic representation of this can be seen in figure 2.4.

Figure 2.4: PixelRNN and PixelCNN architectures

Illustration of how the different models of PixelRNN compute the hidden state from the red pixel. Blue pixels are the dependencies of the red. From left to right, these are the architectures proposed in the PixelCNN, Row LSTM and Diagonal BiLSTM implementations. The figure was extracted from the Pixel Recurrent Neural Networks publication [1]


As figure 2.4 (center) shows, the hidden state of each pixel explicitly depends on the hidden states of the 3 pixels above it, and those in turn depend on their corresponding 9 pixels above, and so on. Therefore, the context of a pixel's hidden state has a triangular shape. However, as we can also see in the same figure, only a few pixels contribute to the one whose hidden state is being calculated, unlike the original LSTM implementation where each pixel was dependent on all previous pixels from top to bottom and from left to right (see equation 2.7). Thus there is a loss due to this abstraction. Therefore, and as seen, the Row LSTM implementation solves the dependency problem for each pixel, but it has a cost: each pixel will have an incomplete context used for computing its hidden state.

Diagonal BiLSTMs: The Diagonal BiLSTM was designed with both goals in mind: parallelizing the computation of each pixel and being capable of capturing its entire context given any image size. Thus, each of the directions of the layer scans the image diagonally, starting from a corner at the top and ending at the opposite corner at the bottom. In each step, it computes at once the LSTM state along a diagonal of the image. In figure 2.4 (right) we can see that the hidden state of a pixel p(i, j) depends on pixels p(i, j − 1) and p(i − 1, j). It covers forward and backward dependencies, thus having all the previously generated pixels included in the history of the current pixel being computed.

PixelCNN

In the aforementioned Pixel Recurrent Neural Networks publication [1], they also introduce the so-called PixelCNN (figure 2.4, left). PixelCNN uses multiple convolutional layers to preserve spatial resolution and improve parallelization by using a large receptive field. However, this parallelization can only be applied in the training and evaluation of the imagery, but not in their sampling. Preserving the spatial information


Table 2.1: Differences between PixelRNN and PixelCNN

PixelRNN
  Strengths: Good performance by effectively handling long-range dependencies.
  Weaknesses: Sequential computation of each pixel.

PixelCNN
  Strengths: Faster to train than PixelRNN.
  Weaknesses: The receptive field is bounded and there exists a blind spot problem.

is very important for pixel prediction values; therefore, there are no downsampling or upsampling layers such as max or min pooling. However, they do use what they call a 3 × 3 Mask B convolution to avoid seeing the future context, but that produces the so-called blind spot problem. This problem arises due to the fact that not all pixel dependencies are included in the computation of the hidden state of each pixel, that is to say, pixels are not dependent on all previous pixels. In table 2.1 we can see the differences between the PixelRNN and PixelCNN implementations. As will be seen in the following sections, we try to address and solve this problem with our approach.
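For illustration, the sketch below builds the kind of binary mask that such a masked convolution multiplies its kernel by. It is a simplified single-channel version and ignores the RGB channel ordering used in the original PixelCNN, so it is only an approximation of the idea.

```python
import numpy as np

def causal_mask(kernel_size, mask_type="B"):
    """Binary mask for a PixelCNN-style masked convolution (single channel).

    Positions below the centre, and to its right on the same row, are zeroed
    so a pixel never sees "future" context. Mask A also hides the centre
    pixel (first layer); mask B keeps it (later layers).
    """
    k = kernel_size
    mask = np.ones((k, k), dtype=np.float32)
    centre = k // 2
    mask[centre, centre + 1:] = 0.0   # right of the centre on the same row
    mask[centre + 1:, :] = 0.0        # every row below the centre
    if mask_type == "A":
        mask[centre, centre] = 0.0    # hide the current pixel as well
    return mask

print(causal_mask(3, "B"))
# [[1. 1. 1.]
#  [1. 1. 0.]
#  [0. 0. 0.]]
```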

The problem

As pointed out in the Pixel Recurrent Neural Networks publication [1], both training and evaluation of the PixelCNN implementation are sufficiently fast. However, the sampling of new imagery is done in a purely sequential way, leaving the door open to possible solutions (such as the one we propose with our approach). This autoregressive density estimator will model the joint distribution of an entire signal x:

x = \{x_1, \dots, x_n\} \quad (2.9)

p(x) = \prod_i p(x_i \mid x_{1:i-1}) \quad (2.10)

Each conditional probability density function or pdf is represented by a deep neural network; in practice, a convolutional neural network is used to produce all the conditional distributions in parallel. Because the output distribution of a sample is not well represented by parametric distributions such as a mixture of Gaussians, typically the range for a sample is quantized to a small number of bits and the distribution is produced by softmax normalization.
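To illustrate how such a model is sampled from, the following naive loop draws one pixel at a time from its softmax output. Here `model` is a stand-in for any network returning per-position distributions; the sketch only shows why one forward pass per sample is required, not the thesis implementation.

```python
import numpy as np

def sample_image(model, height, width, num_levels=256, rng=None):
    """Naive sequential autoregressive sampling: one forward pass per pixel.

    `model(img)` is assumed to return, for every position, a softmax over the
    quantized intensity levels; only the distribution at the pixel currently
    being generated is used at each step.
    """
    rng = rng or np.random.default_rng()
    img = np.zeros((height, width), dtype=np.int64)
    for i in range(height):
        for j in range(width):
            probs = model(img)[i, j]                     # conditional pdf p(x_ij | x_<ij)
            img[i, j] = rng.choice(num_levels, p=probs)  # draw the next sample
    return img
```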

In order to generate a sequence from the distribution, we must select each sample sequentially, meaning that we must run a forward pass of the neural network at each iteration, which makes generation slower the more pixels the image has. Caching of network outputs can be used to speed up the process [25], but it still requires one iteration per sample. Grouping samples together leads to faster synthesis [3]. Samples can be grouped into G groups of T samples, so that the joint distribution becomes a product over groups of samples rather than individual samples.

p(x^{(1:G)}_{1:T}) = \prod_{g=1}^{G} p(x^{(g)}_{1:T} \mid x^{(1:g-1)}_{1:T}) \quad (2.11)

In the parallel multiscale approach [3], the sample groups are chosen such that no neighboring pixels are in the same group. This allows for parallel generation of the samples in a group. However, this also creates less dependency on other pixels when sampling and could have problems generating highly structured imagery.

In addition, there is still a deficiency in the PixelCNN implementation that we try to address with our approach: PixelCNN does not have global coherency, that is to say, as stated in their publication [1], it bases the generative process on what it has learned from its previous convolutions (with a bounded receptive field)


but not on the whole image itself. As stated, the parallel multiscale implementation [3] also lacks this feature. In our work we propose to address this flaw by first using a variational autoencoder to represent each block of pixels with a Gaussian distribution in latent space, thus extracting the most meaningful information from each image, and by making the PixelCNN neural network learn from the previously encoded blocks.


CHAPTER III

METHOD

In order to approach the problem we propose the use of two models, namely BlockVAE and BlockCNN. The former will encode groups of pixels to a latent space, which will be fed to the latter. After the BlockCNN model trains on those blocks, it will be capable of generating latent blocks of its own, which follow a Gaussian distribution.

Block VAE

We propose an alternate grouping strategy which we call blocked grouping. Each group is a contiguous block of n × n samples. In figure 3.1 we can see the multiscale grouping (left) and the block grouping of pixels (right). Thus, with blocked grouping we can produce an entire signal of n samples in n/T iterations instead of the n we were used to with the original PixelCNN implementation [1].

Previous approaches only model the probability density function of a single channel of one sample at a time. In that setup, it is possible to treat the sample as a discrete variable and model its distribution with a softmax normalization. With block grouping, however, we propose to model all channels of all samples in the same block. This is in fact not practical to do with a softmax, as its output space grows exponentially with the number of channels and samples. Therefore, instead of directly attempting to model the joint distribution of a block of discrete variables, our proposal is to introduce


Figure 3.1: Illustration of pixel grouping examples

(a) Multiscale grouping:
1 2 1 2
3 4 3 4
1 2 1 2
3 4 3 4

(b) Block grouping:
1 1 2 2
1 1 2 2
3 3 4 4
3 3 4 4

Illustration of pixel grouping strategies. A pixel's number indicates its group. In multiscale grouping, groups do not contain neighboring pixels. In block grouping, groups contain contiguous blocks of pixels.

a Variational Autoencoder or VAE which will act on the blocks [4]. This VAE will learn to encode the blocks to a latent space where the distribution is well modeled by a multivariate Gaussian. It will also learn to decode this latent representation back to the block. With this approach we can represent conditional probability density functions using multivariate Gaussians. Thus, the BlockVAE learns to encode a block of samples x^{(g)}_{1:T} to a latent representation z^{(g)} of lower dimension and to decode it back when needed.
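A minimal NumPy sketch of this blocked grouping is shown below: it splits an s × s image into flattened, non-overlapping blocks (the form the BlockVAE consumes) and reassembles them. The function names and the reshape-based approach are our own illustration, not the thesis code.

```python
import numpy as np

def image_to_blocks(img, block_size):
    """Split an s x s x c image into flattened, non-overlapping blocks.

    Returns an array of shape (num_blocks, block_size * block_size * c),
    ordered row by row. Assumes s is divisible by block_size.
    """
    s, _, c = img.shape
    n = s // block_size
    return (img.reshape(n, block_size, n, block_size, c)
               .transpose(0, 2, 1, 3, 4)
               .reshape(n * n, block_size * block_size * c))

def blocks_to_image(blocks, s, block_size, c):
    """Inverse of image_to_blocks."""
    n = s // block_size
    return (blocks.reshape(n, n, block_size, block_size, c)
                  .transpose(0, 2, 1, 3, 4)
                  .reshape(s, s, c))

img = np.random.rand(32, 32, 3)
blocks = image_to_blocks(img, 4)                       # shape (64, 48)
assert np.allclose(blocks_to_image(blocks, 32, 4, 3), img)
```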

In order to tune the BlockVAE implementation, we use two loss functions characteristic of Variational Autoencoders: a latent loss function and a generative loss function. These are the Kullback-Leibler (KL) loss and the binary cross-entropy generative loss. The KL loss can be seen in equation 2.4 in its general form. Below we provide the KL loss as we implemented it, which was extracted from the Auto-Encoding Variational Bayes publication [4]:

-D_{KL}(q_\phi(z) \,\|\, p_\theta(z)) = \frac{1}{2} \sum_{j=1}^{J} \big(1 + \log((\sigma_j)^2) - (\mu_j)^2 - (\sigma_j)^2\big) \quad (3.1)

where σ represents the standard deviation vector and µ the variational mean vector, as shown in figure 2.2 (left).


As far as the generative loss function is concerned, the binary cross-entropy seems to be the most appropriate one to use. This is the cross-entropy between two probability distributions, in this case the probability distribution function of the original image and that of the sampled one. For an approximating discrete probability distribution q and a true discrete probability distribution p, the cross-entropy has the following form:

H(p, q) = -\sum_x p(x) \log q(x) \quad (3.2)

The generative loss function will be crucial and can also depend on the dataset one is using. Another thing to bear in mind when choosing it is the number of channels of each image. Each BlockVAE instance will behave as a Variational Autoencoder, with the main difference being the aforementioned blocking strategy. Thus, the BlockVAE will depend on the dimensions of each block and the dimensions of the latent space (how much compression we want to apply to each image when encoding it to its latent variable).

The input shape of the BlockVAE is computed from the chosen size of each block in its flattened form. As an example, if a block size of 8 is chosen, its flattened input shape will be 8 × 8 = 64. The intermediate layer must be smaller than the shape of the input layer, otherwise the VAE will only retain the information of each flattened block without performing any compression. It can be implemented on its own with a densely connected neural network and a relu activation [26], which will take the samples (in this case, the blocks) from the original dimensions to the intermediate latent space. By adding n layers of this kind, the complexity of the network is also increased. If one wants to make the input conditional on the number of classes in the


dataset, the input should also depend on them, and the intermediate layers will now include an additional dense layer added to the previous one. Regardless of the input, two new dense layers are now needed in order to generate the two vectors shown in figure 2.2 (left), which are the mean and the standard deviation vectors. Those densely connected layers will both have He initialization kernels [27]. The next step, as also shown in the figure, is to merge both vectors into a new one that will contain the encoded representation of the input. This new vector must have a Gaussian representation, that is to say, it needs to be a normal distribution of the input. So, if z_mean and z_logvar are the mean and standard deviation vectors respectively, we create the latent variable z such that:

z = z_{mean} + e \cdot z_{logvar} \quad (3.3)

where e can be drawn from a random normal distribution with a mean of 0 and a standard deviation hyperparameter. Finally, the last step in the BlockVAE, following the VAE implementation, is to generate the decoder (see figure 2.2, left) that will bring the latent representation z back to its original state when needed. The decoder will have the same structure as the encoder in terms of layers: the output from the intermediate layer will be directed to a densely connected layer with the same number of neurons and a relu activation function. In the same way as before, the more layers of this type are created, the more complexity one can add to the network. Finally, a final dense layer with a sigmoid activation function and a He normal initializer [27] is appended to this series of layers.

In the next chapter of this thesis, we will discuss how well the BlockVAE model performed by showing some of its sample-generation times as well as objective quality measurements.


Figure 3.2: Architecture of our proposed solution

Illustration of the proposed training paradigm. A variational auto-encoder (BlockVAE) encodes and decodes contiguous blocks of the input image; each block is processed independently. A PixelCNN operating on the encoded blocks (BlockCNN) uses masked convolutions to learn conditional pdfs representing the joint distribution of the encoded image blocks.

Block CNN

The BlockCNN is a density estimator that operates on blocks and learns to represent the conditional distributions over the latent space such that:

p(z^{(1:G)}) = \prod_{g=1}^{G} p(z^{(g)} \mid z^{(1:g-1)}) \quad (3.4)

Thereby, BlockCNN will have the blocks encoded by BlockVAE as its training samples. Note that this approach differs significantly from the PixelVAE [28], which replaces the decoder in the VAE with a PixelCNN and thus still needs to produce each sample sequentially. The BlockCNN implementation will use the aforementioned PixelCNN convolutional neural network in order to perform the same generative steps but in a blocky fashion. PixelCNN uses multiple convolutional layers to preserve spatial resolution and improve parallelization by using a large receptive field. However, this receptive field, although large, is not unbounded. Thus BlockCNN will be able to learn and represent the conditional distributions as shown in equation 3.4. The way it works is by training on the latent space of all image blocks output by BlockVAE and


then use its decoder to generate every sample. One of the problems our implementation will inherit from PixelCNN is the fact that there is no accurate measure to see how well each block is generated.

Network final shape

In figure 3.2 we can see what the whole approach looks like and the flow of information when combining the BlockVAE and BlockCNN models. In this section we are going to describe how they work step by step. It is worth mentioning that the figure depicts the training process for just one input image.

First off, the image to be processed is separated into chunks of block size × block size pixels (the block size parameter has to be set before training). After this, each one of the blocks is sent to the BlockVAE's Variational Autoencoder, which compresses them to their latent space. The blocks are then decoded again and, by means of both aforementioned losses, the Kullback-Leibler loss and the binary cross-entropy loss, the model is able to tune itself and train until both are minimized.

Next, once the encoder and decoder are trained, we can use them to train the second model too. Bearing in mind the block size parameter, the BlockCNN model will now use it as well and train on the same dataset as the BlockVAE did. Each one of the samples to train on is divided again into the same number of chunks; then the blocks are encoded with BlockVAE and sent to BlockCNN for it to train on them. The PixelCNN that operates inside the latter will learn the conditional probability density functions which represent the joint distribution of the encoded image blocks. Finally, once the model has trained, we can use it to generate image samples; however, instead of doing it pixel by pixel as the PixelCNN implementation does, it will do it by blocks of pdfs (of size block size), which just have to be decoded back by means of the decoder


in BlockVAE.

In our implementation, both models were trained separately; however, as shown in the same figure, this can also be done simultaneously: right after being encoded, each block can be sent to BlockCNN so that it trains in parallel.
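The whole training and sampling flow can be summarized in pseudocode. Here `block_vae`, `block_cnn` and their `fit`/`encode`/`decode`/`sample` methods are hypothetical placeholders (e.g., wrappers around the sketches shown earlier) used only to make the order of the steps explicit.

```python
import numpy as np

def train(images, block_vae, block_cnn, block_size):
    # 1. Split every training image into block_size x block_size chunks.
    blocks = np.concatenate([image_to_blocks(im, block_size) for im in images])
    # 2. Train the BlockVAE (KL + binary cross-entropy losses) on the blocks.
    block_vae.fit(blocks)
    # 3. Encode the blocks of each image and train the BlockCNN on the latents.
    latents = np.stack([block_vae.encode(image_to_blocks(im, block_size))
                        for im in images])
    block_cnn.fit(latents)

def sample(block_vae, block_cnn, s, block_size, channels):
    # 4. Sample a grid of latent blocks autoregressively, one block per step...
    latent_grid = block_cnn.sample(num_blocks=(s // block_size) ** 2)
    # 5. ...then decode every block independently with the BlockVAE decoder.
    blocks = block_vae.decode(latent_grid)
    return blocks_to_image(blocks, s, block_size, channels)
```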

EVALUATION

Datasets

As far as the experimentation is concerned, in order to prove our hypotheses we used three different datasets, namely CIFAR-10, MNIST and LFW.

CIFAR-10

The CIFAR-10 dataset is a labeled subset of the 80 million tiny images dataset [29]. The images were collected by Alex Krizhevsky, Vinod Nair, and Geoffrey Hinton. The CIFAR-10 dataset contains 60,000 images of ten different classes: airplane, automobile, bird, cat, deer, dog, frog, horse, ship and truck. Therefore, each class contains exactly 6,000 images in which the dominant object of each image is the one indicated by the class name. It is worth mentioning that all classes are mutually exclusive, which means that for the classes automobile and truck, neither of them will include a pickup truck, as it could be considered a hybrid of both. In this work, we only experiment with the frog class of the CIFAR-10 dataset. We selected the CIFAR-10 dataset because the image resolution we were looking for was 32 × 32 and there is already a function in the library we are using for the implementation that conveniently loads the frog dataset at this resolution.


MNIST

The MNIST (Modified National Institute of Standards and Technology) dataset consists of digits written by high school students and employees of the United States Census Bureau [30]. It is a subset of the NIST dataset and contains 70,000 images of hand-written numbers: 60,000 correspond to the training set and 10,000 to the testing set. The original images from NIST were in black and white and fit in a 20 × 20 pixel box. The resulting images in MNIST contain gray levels due to the anti-aliasing technique used by the normalization algorithm. The images were also centered in a 28 × 28 image and translated to position them at the center of mass of their pixels.

LFW

The Labeled Faces in the Wild (LFW) dataset is a collection of labeled face images mainly created for the study of face recognition in unconstrained images [31], that is to say, the faces in its imagery are not constrained to any fixed position and can be looking in any direction. The only constraint the images have is that they were detected by the Viola-Jones face detector [32]. However, it also contains four different types of LFW images, including the aforementioned one and three types of aligned images: the "funneled" images (ICCV 2007), LFW-a, which uses an unpublished method of alignment, and the "deep funneled" images (NIPS 2012). The deep funneled images were aligned via a deep learning algorithm by Gary B. Huang et al. [33] which ensures the same alignment for each picture, thus having less variation in them. Because of that, we decided to use this variant. The dataset contains 13,233 images of 5,749 people and, among these people, 1,680 have two or more images of themselves. The faces were collected from the web and all images have a resolution of


128 × 128 pixels.

Initially, CIFAR-10 was our starting dataset, but we later switched to MNIST as it was more convenient due to its simplicity. The latter has only one color channel (grayscale) instead of the three (RGB) in CIFAR-10, so less processing is needed. Nonetheless, for our last experiments we decided to switch back to CIFAR-10, but now instead of fixing a single class, we trained on the whole dataset to see how well it performed. Finally, we decided to try the LFW dataset and see how well our models worked with it too, as its images are larger than the previous ones.

Experimentation measurements

Some measures were needed in order to assess the final quality of the images generated by both the BlockVAE, after going through the process of encoding and decoding, and the BlockCNN, after generating the output images. One of the most appealing measures to use for this job is the mean squared error or MSE. This is the most simplistic and commonly used metric of quality between two images. It is calculated by averaging the squared differences between the intensities of the distorted and reference image pixels. However, although simplistic and mathematically convenient in terms of optimization, MSE does not match very well how we perceive visual quality. Thus, another more compelling measure, named the structural similarity measure or SSIM [34], was used instead. In regards to how well an image was reconstructed after the encoding and decoding process of BlockVAE, the chosen measure was the Negative Log-Likelihood or NLL.

All experiments were conducted on two machines. Both of them used Ubuntu 16.04.1 and had GPU capabilities. Machine I was the primary one; it had an Intel Xeon processor running at 2.33 GHz with 16GB of main memory and a


Nvidia GeForce 780 with 3GB of GDDR5 main memory. In regards to the secondary machine Machine II, this one was a laptop that had an Intel i5-6300 processor working at 2.30 GHz and 12GB of available main memory. Its GPU was a Nvidia GeForce GTX 960M with 2GB of memory.

Structural Similarity Measure

The Structural SIMilarity (SSIM) index is a method that measures the similarity between two grayscale images. It is an improved version of the universal image quality index. In order to assess the SSIM for color images, we measured it for each channel and averaged the results. The SSIM index can be viewed as a quality measure of one of the images being compared, provided the other image is regarded as of perfect quality. The main difference with respect to a technique such as the mean squared error is that MSE approaches absolute error, as it is assumed that the loss of perceptual quality is related to the visibility of the error in the signal. However, two images with the same MSE can have very different types of errors, as shown in their publication [34]. On the other hand, SSIM is a perception-based model which considers image degradation while incorporating perceptual phenomena in terms of luminosity and contrast. An SSIM value of over 95% can be considered good enough in terms of human perception. Hence, the structural similarity measure is the most convenient one to use. In order to estimate how well the BlockVAE was performing in terms of the structural similarity measure, we chose to sample 10,000 images with it and calculate their SSIM value taking each input image as the reference.
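A per-channel SSIM computed this way might look as follows, assuming a recent scikit-image; the random test images are placeholders.

```python
import numpy as np
from skimage.metrics import structural_similarity as ssim

def mean_ssim(reference, generated):
    """Average SSIM over colour channels, as described above.

    Both inputs are HxWxC float arrays in [0, 1]; `data_range` is required
    for float inputs. For grayscale images a single call to `ssim` suffices.
    """
    scores = [ssim(reference[..., c], generated[..., c], data_range=1.0)
              for c in range(reference.shape[-1])]
    return float(np.mean(scores))

ref = np.random.rand(32, 32, 3)
out = np.clip(ref + 0.02 * np.random.randn(32, 32, 3), 0.0, 1.0)
print(mean_ssim(ref, out))
```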


Negative Log-Likelihood

In order to know how accurate our experiments with the BlockVAE model were, we decided to use the so-called negative log-likelihood. For the MNIST dataset this measure is commonly expressed in nats, and in bits per dimension (bits/dim) for CIFAR-10. The negative log-likelihood of two images can be calculated as follows:

NLL = -\sum_i \big[ p_i \log(q_i) + (1 - p_i) \log(1 - q_i) \big] \quad (3.5)

This equation will give the result expressed in nats. If we want it in bits/dim, we will need to plug in the result of 3.5 in the following formula [36]:

NLL\ \text{(bits/dim)} = \frac{-\big( NLL / (32 \times 32 \times 3) - \log(128) \big)}{\log(2)} \quad (3.6)

We chose these two measures as they are the current state of the art when calculating how much difference exists between two probability density functions [28], and they let us compare our results directly to those shown in the PixelCNN publication [1]. The measurement uses the total discrete log-likelihood and is normalized by the dimensionality of the images. The negative log-likelihood can be understood as the number of bits that a compression scheme based on our model would need to compress every RGB dimension. Once the negative log-likelihood of the BlockVAE is calculated, we also need to take into account that of the BlockCNN. In order to do so, it has to be derived from the mean squared error probability distribution function:

p(x) = \frac{1}{\sqrt{\det(2\pi\Sigma)}} \, e^{-\frac{1}{2}(x-\mu)^T \Sigma^{-1} (x-\mu)} \quad (3.7)

Where x and µ are the BlockCNN output and input respectively. Now, let's


assume that b is the block size, s is the image size (so the images have s × s dimensions) and n = s/b. We must also bear in mind that, since we are using a Variational Autoencoder in the BlockVAE implementation, it outputs normal distributions that will be sent to BlockCNN as x, so Σ will only be filled with ones on its diagonal, making Σ the identity matrix. With this assumption we can formulate the following simplification:

p(x) = \frac{1}{\sqrt{(2\pi)^n}} \, e^{-\frac{1}{2}\sum_i (x_i-\mu_i)^2} \quad (3.8)

Calculating the negative log-likelihood for BlockCNN means applying the negative logarithm to 3.8, thus getting:

NLL = \frac{n}{2} \log(2\pi) + \frac{1}{2} \sum_i (x_i - \mu_i)^2 \quad (3.9)

Currently, the output from the BlockCNN is evaluated as follows:

L = \frac{1}{N} \sum_{j=1}^{N} \sum_i (x_i^j - \mu_i^j)^2 \quad (3.10)

Where N is the number of images that we are sampling. Thus, the final negative log-likelihood for BlockCNN will have the following form:

NLL = \frac{1}{N} \sum_{j=1}^{N} \Big[ \frac{n}{2} \ln(2\pi) + \frac{1}{2} \sum_i (x_i^j - \mu_i^j)^2 \Big] = \frac{N}{N}\Big(\frac{n}{2}\ln(2\pi)\Big) + \frac{1}{2} L = \frac{n}{2}\ln(2\pi) + \frac{1}{2} L \quad (3.11)

Hence, in order to calculate the correct averaged negative log-likelihood when using a certain block size, we need to first use 3.5 to get the NLL from BlockVAE and then 3.11 to do so for BlockCNN. Finally, the summation of both will give us the final


Figure 3.3: Preliminary results of BlockVAE

Example outputs of the preliminary results of BlockVAE. Left: Training image from CIFAR-10 (input and VAE output). Center: Testing image. Right: VAE decoder output after randomly sampling in latent space. Without the BlockCNN, the blocks were assumed independent and thus the resulting image does not resemble the training data.

NLL measure.
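The sketch below strings equations 3.5, 3.6 and 3.11 together in NumPy; the epsilon guard and the CIFAR-10 dimensionality default are illustrative assumptions.

```python
import numpy as np

def nll_blockvae_nats(p, q, eps=1e-12):
    """Eq. 3.5: NLL of the BlockVAE reconstruction q against the input image p."""
    return -np.sum(p * np.log(q + eps) + (1.0 - p) * np.log(1.0 - q + eps))

def nats_to_bits_per_dim(nll_nats, dims=32 * 32 * 3):
    """Eq. 3.6 as given above, with the CIFAR-10 dimensionality as the default."""
    return -((nll_nats / dims) - np.log(128)) / np.log(2)

def nll_blockcnn(samples, targets, n):
    """Eq. 3.11: BlockCNN NLL derived from the mean squared error term L."""
    L = np.mean([np.sum((x - mu) ** 2) for x, mu in zip(samples, targets)])
    return n / 2.0 * np.log(2.0 * np.pi) + 0.5 * L

# Final measure: the sum of the BlockVAE and BlockCNN contributions, e.g.
# total_nll = nll_blockvae_nats(p, q) + nll_blockcnn(samples, targets, n)
```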

For each one of the following experiments with BlockVAE, and in accordance with our implementation of the solution, we first had to follow a series of preliminary steps: first, we had to convert all training data into a series of blocks. The dimensions of each block can be selected with the hyperparameter block size, which required each dataset to be composed of imagery with the same number of pixels in width as in height. This assumption was made so as to have an exact number of blocks per image.2 The two datasets used had dimensions of n × n. Thus, in order to speed up the selection of blocks within the training data, N randomly chosen blocks from the dataset were used as the training data for the BlockVAE. Once this is done and the BlockVAE instance is trained with them, the encoder and decoder are ready and can be used at will. Thus, all experiments with BlockVAE had this first part in common. Both the encoder and decoder were used in the order shown in figure 2.2 (left).

Preliminary results

Once the first version of the BlockVAE was implemented, the first thing we had to know was how well it could perform against a series of training images. In order to do so, we used the 32 × 32 CIFAR-10 dataset; in particular, the frogs class. Figure

2 As an example of this, any CIFAR-10 image has a resolution of 32 × 32, so, if we make the block size 4, each image divides exactly into 8 × 8 = 64 blocks.


3.3 shows the results of learning a VAE on 4×4 blocks. As mentioned, it is first necessary to flatten the dimensions of each block. We must take into account that each image from CIFAR-10 has RGB values; therefore, in this case the flattened input and output will be a one-dimensional array of length 4 × 4 × 3 = 48. Thus, for this first experiment, we trained a small VAE with layer sizes of (48, 32, 16, 32, 48) on 10,000 randomly sampled patches from the training data. The comparison of input and decoded images shows that the BlockVAE is able to accurately encode the small blocks using a small latent vector with a Gaussian prior.

As we can see, the rightmost image is a sample which depicts only randomness and noise. This image shows the result of randomly sampling the latent representation of each block and then decoding each one independently. That is to say, although it was generated (partially) following equation 3.4, it shows this incoherence because no conditional dependence was yet implemented between any of the generated blocks, and therefore they are not decoded based on any previous ones. Hence, the conditioning part of 3.4 still needed to be applied; what we obtained instead was:

p(z^{(1:G)}) = \prod_{g=1}^{G} p(z^{(g)}) \quad (3.12)

This also proved the need for the BlockCNN model, which can represent the conditional distribution between blocks.

Experiment 1: Determining the number of epochs

It is true that usually a model is just trained once and, after that, there is no need to train it every time we want to use it, in this case to sample imagery. However, our main goal after creating both the BlockVAE and BlockCNN models was to find a good configuration for each of them in order to get quality samples.


Figure 3.4: BlockVAE outputs with different epochs

Example outputs of the BlockVAE. All of them were sampled with a block size of 4. The first image from the left is the original; from there, the sampled images were obtained after 100, 150, 200, 300 and 400 epochs respectively.

Table 3.1: Timing, SSIM and NLL for different epochs with CIFAR-10

Epochs   100      150      200       250       300       350       400
Time     62.56s   92.97s   122.56s   154.79s   185.61s   215.17s   245.86s
SSIM     0.956    0.957    0.971     0.972     0.971     0.969     0.972
NLL      6.202    6.202    6.204     6.203     6.204     6.203     6.203

This table shows the elapsed training time, the SSIM and the NLL for different numbers of epochs using the CIFAR-10 dataset. Epochs range from 100 to 400 in strides of 50. NLLs are expressed in bits/dim.

Therefore, we needed to adjust several parameters each time and retrain both models. Nonetheless, some of these parameters were likely to matter less for the outputs than others when using the same dataset; an example is the number of epochs, which we decided to fix once we found the best value to use. By adopting different epoch configurations we gathered the outputs and determined the best ones for both datasets. We were able to see that there was a trade-off between the number of epochs we used, the quality of the imagery that was generated and the processing time that the networks needed to train. In general, after a certain number of epochs, the outputs do not get much better, but the elapsed time keeps increasing. However, we found that there was a sweet spot which made us decide on a specific number of epochs to use as a base. The first of the networks tested was the BlockVAE with the CIFAR-10 dataset. We decided to start from an initial value of 100 epochs and increase it in strides of 50 up to a maximum of 400. The block size selected here was

In figure 3.4 we can see that our expectation was confirmed: although it is difficult to appreciate large changes between the samples, since all of them are decoded well by the BlockVAE, the most noticeable difference is between the original image and the sample generated with 100 epochs. Looking closely, from 150 epochs onwards the image quality does not improve much further.

To obtain a more objective measure of how well each epoch setting performed, we ran a series of tests measuring the training time, the SSIM and the NLL. The latter two were averaged over 10,000 test images: each image was encoded and decoded, and the reconstruction was compared with its ground-truth image to obtain the corresponding NLL and SSIM values.
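As a rough illustration of this measurement (assumed, not the thesis code), the averaged SSIM could be computed as follows; encode_decode is a hypothetical helper that runs one image through the trained BlockVAE (encode then decode).

import numpy as np
from skimage.metrics import structural_similarity

def mean_ssim(test_images, encode_decode):
    scores = []
    for img in test_images:                    # e.g. 10,000 test images
        recon = encode_decode(img)             # BlockVAE round trip
        scores.append(structural_similarity(
            img, recon,
            data_range=1.0,                    # images assumed scaled to [0, 1]
            channel_axis=-1))                  # drop for grayscale; older skimage uses multichannel=True
    return float(np.mean(scores))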

In table 3.1 we can see that the training time increases linearly with the number of epochs: every additional 50 epochs adds roughly 30 seconds. The NLL did not convey much information about quality, as all settings produced almost exactly the same value, so we relied on the structural similarity measure, which agreed with our own visual judgment. There is a noticeable increase between 150 and 200 epochs. We therefore concluded that the best number of epochs to use in the following experiments (as long as significant settings such as the learning rate were not changed) lay between those two values, 150 and 200.

Experiment 2: Testing BlockCNN

It is worth pointing out that, in short, the main purpose of BlockCNN was to implement the PixelCNN convolutional neural network described in the Pixel Recurrent Neural Networks publication [1] in a blocky fashion: it is trained on the encoded blocks that BlockVAE outputs, which follow a normal distribution.

Figure 3.5: CIFAR-10 frog outputs from the BlockCNN model. Panels show an occluded input, half-synthesized results, and a fully synthesized sample. These are the first example outputs of the BlockCNN model after training on the frogs class of the CIFAR-10 dataset.

Once the first version of BlockCNN was created, we had to test how well it performed in order to tune it and address its weaknesses. We therefore gave BlockCNN occluded examples of CIFAR-10 frog images: each image was split horizontally down the middle, with the top half left visible and the bottom half occluded. An example can be seen in figure 3.5 (left). The job of BlockCNN was then to model the probability density of the intensity values in the top half in chunks of block size × block size blocks (to match its training on blocks of the same size encoded with BlockVAE), sample the latent blocks of the bottom half of each image, and finally use the BlockVAE decoder to generate the bottom half. Figure 3.5 shows some outputs of this process. The model was capable of extracting the most important information from the dataset and generating a reasonably coherent bottom half, although not a high-quality one. However, as the same figure also shows, a fully sampled image exhibits nothing but randomness and noise.
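The completion loop just described can be sketched roughly as below. All helper names (blockvae_encode, blockvae_decode, blockcnn_sample_next) are hypothetical stand-ins for the trained models; the real interfaces may differ, so this is only meant to make the block-wise conditioning explicit.

import numpy as np

def complete_bottom_half(image, block, blockvae_encode, blockvae_decode,
                         blockcnn_sample_next):
    # image: H x W x C float array; the bottom half is treated as occluded.
    h, w, c = image.shape
    rows, cols = h // block, w // block
    latents = [[None] * cols for _ in range(rows)]
    # Encode the visible top-half blocks so the sampler has context.
    for r in range(rows // 2):
        for col in range(cols):
            patch = image[r*block:(r+1)*block, col*block:(col+1)*block]
            latents[r][col] = blockvae_encode(patch.reshape(-1))
    # Sample the occluded blocks in raster order, conditioning on all blocks
    # generated so far, then decode each latent back into pixels.
    out = image.copy()
    for r in range(rows // 2, rows):
        for col in range(cols):
            z = blockcnn_sample_next(latents, r, col)
            latents[r][col] = z
            out[r*block:(r+1)*block, col*block:(col+1)*block] = \
                blockvae_decode(z).reshape(block, block, c)
    return out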

The main conclusion from this experiment was that BlockCNN is trained on encoded blocks from BlockVAE that carry a certain amount of loss. That loss translates into blurriness when the blocks are decoded, so the generated blocks will inherently contain blur as well.


Figure 3.6: BlockVAE and BlockCNN samples with different block sizes. Samples of the BlockVAE model (top row: block sizes 1, 2, 4, 7, 14 and 28), BlockCNN half-synthesized results (middle row: block sizes 1, 2, 4, 7 and 14) and BlockCNN fully synthesized results (bottom row: block sizes 1, 4, 7, 14 and 28), all trained on the MNIST dataset.

Experiment 3: Measuring performance with MNIST

Finally, once both models were developed, the goal of these experiments was to determine how well they perform in terms of output quality and achievable speedup. To this end, we conducted five experiments with block sizes of 2, 4, 7, 14 and 28 pixels, all of which divide 28 evenly so that every pixel of each image is covered when it is split into blocks. We measured several quantities for each model. For the BlockVAE model we measured the elapsed time to generate two different types of images. For the first, the task of the model was simply to encode a randomly selected image from the dataset and decode it again, so that we could see how much the two differ.


We have to remember that, since BlockVAE implements a Variational Autoencoder, there is normally some reconstruction loss, and we wanted to measure how large it was. To do so we used the structural similarity measure and the negative log-likelihood, both averaged over 10,000 images. Each SSIM value was obtained by comparing the original image (before encoding) with the corresponding BlockVAE output after decoding. For the second set of BlockVAE experiments we also sampled a series of images generated entirely from a normal distribution. Finally, we conducted another set of experiments using the BlockCNN model, which we explain in more detail later. All experiments in this section were executed on Machine I; its specifications can be found in the Experimentation measurements section.

The results of the first experiment are collected in table 3.2 under the name Encoded-decoded image, with some outputs shown in figure 3.6. Looking at the output images, the model performed very well for almost all block sizes: it was capable of encoding to a latent space and decoding the images again without much loss. A more objective confirmation is the SSIM, which only drops below 90% similarity for a block size of 2. It is worth noting that the SSIM results apply only to this experiment; in the second one we generate a whole image from scratch, so there is no ground truth to compare the randomly sampled image against. The negative log-likelihood was calculated over the whole set of test images and turns out not to be very competitive, as we will see later in the final results table 3.4. We also include the generative loss and the KL loss in order to show that, even though the former can get very low, there is a significant trade-off between the two: it may be easy to obtain good results for one of them, but difficult for both at the same time.


Finally, the bottom rows of table 3.2 show the speedups obtained for the encoded-decoded images; these converge to around 3x from a block size of 4 onwards, which suggests there is still a sequential part of the execution that cannot be parallelized.
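For reference, the two terms reported in table 3.2 correspond to the usual VAE objective (the exact per-term weighting used in our implementation is not restated here):

\mathcal{L}(x) \;=\; \underbrace{\mathbb{E}_{q(z \mid x)}\big[-\log p(x \mid z)\big]}_{\text{generative loss}}
\;+\; \underbrace{D_{\mathrm{KL}}\big(q(z \mid x) \,\|\, p(z)\big)}_{\text{KL loss}},
\qquad
D_{\mathrm{KL}} \;=\; -\tfrac{1}{2}\sum_{j}\big(1 + \log \sigma_j^2 - \mu_j^2 - \sigma_j^2\big)

where the closed form of D_KL assumes a diagonal Gaussian encoder q(z|x) = N(\mu, \mathrm{diag}(\sigma^2)) and the standard normal prior p(z) = N(0, I). Driving the reconstruction term down tends to push more information through the latent code and hence raise the KL term, which is one way to read the trade-off visible in the Gen. loss and KL loss rows of the table.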

Table 3.2: MNIST BlockVAE results

Block size                   28       14       7        4        2        1
Encoded-decoded image (s)    0.111    0.110    0.110    0.121    0.168    0.327
Sampled image (s)            0.00741  0.00615  0.00687  0.0113   0.0327   0.10047
SSIM                         97.17%   96.84%   96.81%   97.88%   81.91%   92.48%
NLL (nats)                   363.20   425.84   417.45   455.76   395.62   1384.58
Gen. loss (nats)             53.94    54.92    52.47    51.13    85.67    52.67
KL loss (nats)               309.26   370.72   364.98   404.63   309.94   1331.90

BlockVAE speedups
Encoded-decoded image        2.96x    2.96x    2.96x    2.69x    1.94x    1x
Sampled image                13.56x   16.33x   14.63x   8.87x    3.07x    1x

Measurements extracted from the BlockVAE model using the MNIST dataset and different block sizes. Timings are expressed in seconds, SSIM as a percentage similarity (the greater the better) and NLL in nats (the lower the better). The bottom rows show the corresponding speedups relative to a block size of 1.

The second BlockVAE experiment applied what the model had learned during training to a random normal distribution: instead of encoding and decoding an image from the dataset, we drew latent codes from a standard normal distribution and decoded them. From the Sampled image row of the same table 3.2 we can see that these timings also improve as the number of pixels per block grows.
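A rough sketch of this sampled-image path (again with hypothetical helper names; blockvae_decode stands in for the trained decoder):

import numpy as np

def sample_image(blockvae_decode, latent_dim, img_size=28, block=4, channels=1):
    # Draw one latent per block from the N(0, I) prior and decode independently.
    rows = cols = img_size // block
    out = np.zeros((img_size, img_size, channels), dtype=np.float32)
    for r in range(rows):
        for c in range(cols):
            z = np.random.standard_normal(latent_dim)
            out[r*block:(r+1)*block, c*block:(c+1)*block] = \
                blockvae_decode(z).reshape(block, block, channels)
    return out

In practice the decoder can process all blocks of one image in a single batch, and larger blocks mean fewer decoder calls per image, which presumably explains why the timings improve with block size.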

Regarding the BlockCNN model, we conducted two experiments with the same set of block sizes. For the first one, we selected 8 different images from the training set and split each one horizontally down the middle: the top half was kept as context for the model and the bottom half was occluded and left for the generator to fill in. Some of the results can be seen in the middle row of figure 3.6.


For these we only assessed quality based on the appearance of the outputs. As we can see, for block sizes of 4, 7 and 14 the sampled images are quite decent, and there does not appear to be much loss introduced by the VAE.

Table 3.3: MNIST BlockCNN results

Block size          28      14      7       4       2       1
Sampled image (s)   0.0100  0.0369  0.0765  0.244   1.69    14.34
NLL (nats)          84.76   8.00    5.05    5.94    19.98   25.58

Generation time and NLL for example images from the BlockCNN model on the MNIST dataset with different block sizes. Timings are expressed in seconds and NLL in nats.

In the second experiment, the model had to generate the whole output without any kind of hint, based only on what it had learned previously, using the same set of block sizes. Specifically, it had to generate latent-space blocks from scratch starting from a random standard normal distribution; the blocks were then decoded with the BlockVAE decoder. Some of the outputs can be seen in the bottom row of figure 3.6. Table 3.3 shows the average time it took to generate each image, as well as the speedup of each block size relative to block size 1. We also measured the negative log-likelihood of each configuration following formula 3.11; these values appear in the same table. In terms of output quality, the generated images are still plausible for block sizes of 4 and 7, but for larger block sizes and the configuration used the model was not able to produce plausible samples. We did not use a block size of 28 for the half-sampled images, since the maximum block size that can generate half an image is 14. Finally, the configuration for a block size of 1 had to be slightly simplified so that the GPU of Machine I could handle the number of blocks to process: the training set was reduced to 1,500 images and the test patch to 500. Although necessary, this change prevented the model from learning enough, as its output image shows.


Table 3.4: Final MNIST results

Block size          28       14       7        4        2       1        PCNN
Sampled image (s)   0.0174   0.0431   0.0834   0.255    2.02    14.44    N/A
Total NLL (nats)    447.99   452.37   442.48   482.33   415.6   1427.23  81.30
Speedup (sampled)   829.88x  335.02x  173.14x  56.63x   7.15x   1x       N/A

Final results of the experiments with BlockVAE and BlockCNN on the MNIST dataset.

Finally, table 3.4 shows the combined timings of BlockVAE and BlockCNN for both sampling experiments. As the table shows, the negative log-likelihood is still not good enough to compete with the performance of PixelCNN. However, we achieve large speedups that grow with the block size. We believe this is partly because, for a block size of 1 (the base case), there is a large overhead when initializing the GPU due to the number of blocks to process. If we instead take a block size of 4 as the base case, the speedup stabilizes at 3.6x, 5.9x and 14.7x with respect to the timings for block sizes of 7, 14 and 28 respectively.
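For completeness, since table 3.1 reports NLL in bits/dim while the MNIST tables use nats per image, the usual conversion (assuming D = 28 × 28 = 784 pixel dimensions for MNIST) is:

\text{NLL}_{\text{bits/dim}} \;=\; \frac{\text{NLL}_{\text{nats}}}{D \,\ln 2},
\qquad \text{e.g.}\quad \frac{447.99}{784 \cdot \ln 2} \approx 0.82 \ \text{bits/dim}.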

Experiment 4: Measuring performance with CIFAR-10

In the same fashion as the previous experiment, we wanted to see how well both models perform on slightly larger images and how they behave as the block size changes. Since the CIFAR-10 dataset contains images of 32 × 32 pixels, the block sizes this time had to divide those dimensions, so we used 1, 2, 4, 8, 16 and 32 pixels per side. The training set consisted of 10,000 images. All CIFAR-10 experiments were conducted on Machine I, whose specifications can be seen in the Experimentation measurements section; we chose this machine because its GPU was capable of handling


Figure 3.7: CIFAR-10 outputs of BlockVAE and BlockCNN. Samples of the BlockVAE model (top row: block sizes 1, 2, 4, 8, 16 and 32), BlockCNN half-synthesized results (middle row: block sizes 1, 2, 4, 8 and 16) and BlockCNN fully synthesized results (bottom row: block sizes 1, 2, 4, 8, 16 and 32), all trained on the CIFAR-10 dataset.
