
DEGREE PROJECT IN INFORMATION AND COMMUNICATION TECHNOLOGY,
SECOND CYCLE, 30 CREDITS
STOCKHOLM, SWEDEN 2021

Introducing Sparsity into the Current Landscape of Disentangled Representation Learning

ELIAS ÅGEBY

KTH ROYAL INSTITUTE OF TECHNOLOGY


Master’s Programme, Information and Network Engineering, 120 credits
Date: January 27, 2021
Supervisor: Borja Rodríguez Gálvez
Examiner: Ragnar Thobaben



Abstract

In many scenarios it is natural to assume that a set of data is generated given a set of latent factors. If we consider some high-dimensional data, there might only be a few degrees of variability which are essential to the generation of such data. These degrees of variability are not always directly interpretable, but are still often highly descriptive. The desideratum of disentangled representation learning is to learn a representation which aligns with such latent factors. A representation that is disentangled will present optimal, task-agnostic properties and hence will be useful for a wide variety of downstream tasks. In this work we survey the current state of disentangled representation learning. We review recent advances within the field by discussing the definition, comparing state-of-the-art methods, and contrasting quantitative metrics. Further, we present the β-SVAE, which by modifying the prior distribution of a Variational Autoencoder successfully imposes a sparsity constraint on disentangled representation learning. The β-SVAE achieves higher sparsity than current state-of-the-art methods while remaining disentangled.



Sammanfattning

In many scenarios it is natural to assume that a data set is generated given a set of latent factors. If we consider some high-dimensional data, there might only be a few degrees of variability which are essential to the generation of such data. These degrees of variability are not always directly interpretable, but are still often highly descriptive. The goal of disentangled representation learning is to find a representation which aligns with such latent factors. A representation that is disentangled will display optimal, task-agnostic properties and will therefore be useful for a wide variety of downstream tasks. In this work we survey the current state of disentangled representation learning. We review recent advances within the field by discussing the definition, comparing state-of-the-art methods, and contrasting quantitative metrics. Further, we present the β-SVAE, which by modifying the prior distribution of a Variational Autoencoder introduces a sparsity constraint on the learned representation. The presented method, the β-SVAE, achieves higher sparsity than current state-of-the-art methods while the requirement of disentangled representations is maintained.



Acknowledgments

Firstly, I would like to thank my supervisor, Borja Rodríguez Gálvez, for giving me the chance to work on this thesis, as well as for great guidance and support throughout. Secondly, I want to thank Ragnar Thobaben for valuable feedback. I would also like to acknowledge Zeheng Li, Johan Sörell and Hilding Wollbo for working with me on the precursor to this project. Finally, I am very appreciative of anyone who has taken their time to provide their insights and everyone who has supported me from start to finish.

Stockholm, January 2021
Elias Ågeby



Contents

1 Introduction
    1.1 Background
    1.2 Related Work
    1.3 Ethical Considerations
    1.4 Notation

2 Unsupervised Representation Learning
    2.1 Unsupervised Learning
    2.2 Challenges of High Dimensionality
    2.3 Representation Learning Desiderata
        2.3.1 Properties of Optimal Representations
        2.3.2 Distributed Representations
        2.3.3 Invariant Features
        2.3.4 Sparse Representations
        2.3.5 Meta-Priors of Representation Learning
    2.4 Autoencoders
        2.4.1 Regularized Autoencoders
        2.4.2 Sparse Autoencoders
        2.4.3 Connection to Probabilistic PCA
        2.4.4 Connection to Sparse Coding
    2.5 The Information Bottleneck Method

3 Disentangled Representations
    3.1 Exploring the Definition
        3.1.1 A Probabilistic Interpretation
    3.2 Learning Disentangled Representations
        3.2.1 Variational Autoencoder (VAE)
        3.2.2 β-VAE
        3.2.3 FactorVAE
        3.2.4 β-TCVAE
        3.2.5 Emergence of Optimal Representations
    3.3 Evaluating Disentangling Properties
        3.3.1 Current State of Evaluation Practices
        3.3.2 β-VAE Score
        3.3.3 FactorVAE Score
        3.3.4 Mutual Information Gap (MIG)

4 Experimental Analysis
    4.1 Experimental Design
        4.1.1 Methods
        4.1.2 Metrics
        4.1.3 Datasets
    4.2 Quantitative Benchmarks
    4.3 Effect of Distributional Bias in a Real World Setting
    4.4 Qualitative Performance
    4.5 Correlation of Metrics
    4.6 Collapsing Latent Dimensions
    4.7 Discussion

5 Conclusion

A Experimental Analysis Details
    A.1 Experimental Design
    A.2 Reparametrized Sampling from Laplacian
    A.3 Training Progression
    A.4 Extensive Results

B Proofs
    B.1 Analytical bound on Laplacian KL-divergence


List of Figures

1.1 Binary Attributes of CelebA
4.1 Example Images from the Datasets (a. DSprites, b. Shapes3d, c. CelebA)
4.2 Metric Benchmarks
4.3 Reconstruction (a. DSprites, b. Shapes3d, c. CelebA)
4.4 Latent Traversals on Shapes3d (a. β-VAE, b. β-TCVAE, c. FactorVAE, d. β-SVAE)
4.5 Latent Traversals on DSprites (a. β-VAE, b. β-TCVAE, c. FactorVAE, d. β-SVAE)
4.6 Latent Traversals on CelebA (a. β-VAE, b. β-TCVAE, c. FactorVAE, d. β-SVAE)
4.7 Correlation of Metrics
4.8 Correlation testing of DMIG and FactorVAE-Score (a. Logarithmic Correlation, b. Exponential Correlation)
4.9 Collapsed Model Reconstruction
4.10 Number of Collapsed Dimensions
A.1 Metric Progression
A.2 Full Reconstruction (a. DSprites, b. Shapes3d, c. CelebA)
A.3 Full Latent Traversal on DSprites - DMIG (a. β-VAE, b. β-TCVAE, c. FactorVAE, d. β-SVAE)
A.4 Full Latent Traversal on Shapes3d - DMIG (a. β-VAE, b. β-TCVAE, c. FactorVAE, d. β-SVAE)
A.5 Full Latent Traversal on CelebA - DMIG (a. β-VAE, b. β-TCVAE, c. FactorVAE, d. β-SVAE)
A.6 Full Latent Traversal on DSprites - FactorVAE Score (a. β-VAE, b. β-TCVAE, c. FactorVAE, d. β-SVAE)
A.7 Full Latent Traversal on Shapes3d - FactorVAE Score (a. β-VAE, b. β-TCVAE, c. FactorVAE, d. β-SVAE)
A.8 Full Latent Traversal on CelebA - FactorVAE Score (a. β-VAE, b. β-TCVAE, c. FactorVAE, d. β-SVAE)


List of Tables

4.1 Detailed Dataset Information
4.2 R²-score
A.1 Architectural Details (a. Encoder, b. Decoder, c. FactorVAE Discriminator)
A.2 Regularization Strength Hyperparameters
A.3 Hyperparameters (a. Model)

Chapter 1

Introduction

1.1 Background

A vast number of machine learning models are highly dependent on the choice of data representation, or features, of the data to which they are applied. Consequently, a lot of effort is spent on designing preprocessing pipelines and transformations, using prior knowledge of the data at hand to extract the best possible features for a given machine learning application. Feature extraction is an expensive task and a testament to a shortcoming within the field of machine learning: algorithms are not able to extract and organize the discriminative information present in the data on their own. It can further be argued that this is a major restriction in constructing general Artificial Intelligence (AI).

In high-dimensional data, there are in general only a few explanatory factors that are relevant for any given machine learning task. Extracting all relevant features in an unsupervised manner, thereby decreasing the need for manual feature engineering, would enable rapid progress when developing new models and entering new domains. Not only is high-dimensional input data in many cases excessive w.r.t. relevant information, it is also not readily interpretable for humans; hard to visualize; resource demanding, both computationally and in storage; and it suffers from the curse of dimensionality.

Moreover, unsupervised learning approaches are beneficial for feature extraction, where the learnt representation aligns with statistical structures in the data; they are also more akin to biological intelligence, by utilizing low-level sensory data; less prone to overfitting [1]; and have been successfully applied in a wide variety of applications [2–4].

Representation Learning has, in recent years, developed into a self-sufficient field within machine learning with a dedicated conference, the International Conference on Learning Representations (ICLR), incepted in 2013. The interest in learning representations is driven by state-of-the-art results in a wide variety of applications such as speech recognition [5–7], object recognition [8], and Natural Language Processing (NLP) [9–11]. Notably, a combination of word embeddings and image representations is employed to build Google’s image search [12].

Transfer learning, domain adaptation, and multi-task learning deal with sharing knowledge across different learning tasks and domains. Representation Learning shows promise in these fields (e.g., [13–15]), since this transferability is arguably inherent in the desired properties of learned representations.

Optimality of a representation is, however, ambiguous. No objective can be formulated for achieving optimality for an unknown downstream task, be it visualizing data or classification. Instead, we can establish some properties of optimal representations. An information theoretical approach deems such properties to be sufficiency for a downstream task, minimality of the representation, and invariance to nuisances [16]. Such properties are reflected in the formulation of the Information Bottleneck (IB) [16, 17].

Within Disentangled Representation Learning, the desideratum is to learn a representation which aligns with the ground truth, data generative, factors. Such statistically independent features will be maximally compact while also being explicit and interpretable [18]. Disentangling the explanatory factors of input data is by many argued to be fundamental for learning optimal representations [19–21] and presents possibilities for generalizable learning algorithms [22, 23]. Given a disentangled representation of the data, it is not only serviceable as a representation with optimal properties, but also enables novel example generation and novelty detection in a straightforward manner [24].

Research Question 1.

What is the current state of disentangled representation learning? How is it defined, achieved, and evaluated?

Bengio et al. [19] present a set of general-purpose priors which, if incorporated, would improve representation learning in the direction of AI. One such prior is that of sparsity, implying that for some observation of the world, only a subset of the possible explanatory factors is relevant. Further, settings that resemble the real world contain many factors that are sparsely distributed.

Sparsity is, beyond being descriptive of our natural world, beneficial from a computational perspective. Employing sparsity enables reduced storage requirements [25] and increased computational efficiency [26]. Exemplifying such efficiency, with the recent introduction of the NVIDIA Ampere architecture, an NVIDIA A100 GPU has been shown to provide a 50% reduction in computational time when running Bidirectional Encoder Representations from Transformers (BERT) [27], a state-of-the-art method within NLP, with sparse computations in place of dense ones [28].

Research Question 2.

Is it possible to enforce sparsity in disentangled representation learning, and, if possible, can it be beneficial?

1.2 Related Work

The Deep Variational Information Bottleneck (Deep VIB) [29] takes a variational approach to the IB using neural networks to parametrize the encoder and decoder. However, they consider the task of classification and mainly investigate the generalization performance and adversarial robustness. Another variational approach to the IB is considered in [30]. A lower bound on the IB is maximized using an algorithm analogous to variational Expectation Maximization (EM). By defining a sparse prior (Student-t) on the representation, with Gaussian encoder and decoder, they learn relevant sparse codes for the task of classification. Further, Achille and Soatto [31] investigate connections between disentanglement, the Variational Autoencoder (VAE), and the IB. Another investigation within the area is done in [32]; with the intention to learn disentangled, interpretable representations from sequential data, they present the Factorized Action Variational Autoencoder (FAVAE).

Locatello et al. present an extensive experimental comparison of state-of-the-art disentangling methods [33], where results are gathered from a large-scale study. For a theoretical overview of autoencoder-based representation learning we refer the reader to [21], which gives a comprehensive comparison of recent methods.

1.3 Ethical Considerations

Due to the nature of this work, being primarily theoretical and virtual, its impact on sustainability is insignificant. Undoubtedly, the methods we present can be realized as virtual tools supporting decision making and can implicitly be beneficial for improving sustainability within other areas. However, the explicit effect of our presented results on sustainability, ecology, etc., is negligible.

More interesting is the impact of virtual, machine learning, tools on society and the ethical issues arising from them. In fact, the usage of these predictive tools in society is increasing [34] and is infiltrating areas such as credit scoring, hiring processes, and criminal investigations. Intuitively, it might appear beneficial from an ethical standpoint to use ‘objective’ machines to make decisions. But as a matter of fact, predictive algorithms are created by humans, and human bias inevitably becomes embedded both in source code and in the datasets that we use to train our machine learning models.


In [35] it is shown that a racial bias exists both in commonly used facial recognition datasets and in commercially available gender classification systems. Most notably, the highest error rate of these classification systems was within the group of ‘darker-skinned females’, which further implies a gender bias in the systems. It is therefore very important to proceed with caution when implementing, training, and evaluating machine learning models, as well as when composing datasets, so as not to introduce any ethically problematic biases, especially when working with human data.

Within machine learning, the goal is to achieve a good generalization from training data to test data, and to accomplish such a generalization, outliers of the training data are inherently ignored. This is not necessarily a problem in the context of machine learning. But when contextualized into society it becomes one, as minorities, or underrepresented groups, become the outliers of the data and are consequently ignored.

More specifically, when considering disentangled representation learning, the goal is to uncover latent, generative, factors of the data. If there exists, for example, a gender bias in the dataset, a possible consequence is that gender is not uncovered as a factor, resulting in all individuals being interpreted as male or, conversely, female.

The transparency of biases within datasets is therefore very important, and accountability for biases introduced in machine learning algorithms working with data involving humans needs to be established. In this work, we have used the dataset CelebA [36], which is annotated with 40 binary attributes, the distribution of which is visualized in Figure 1.1. We note that there is a slight gender bias and a strong age bias.

Figure 1.1 – The distribution of binary attributes in CelebA.
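For reference, the attribute frequencies behind Figure 1.1 can be recomputed directly from the CelebA annotations. The sketch below is illustrative only: it assumes the standard annotation file list_attr_celeba.txt (image count on the first line, attribute names on the second, then one row of ±1 values per image) and is not the code used to produce the figure.

```python
import pandas as pd

# Assumes the standard CelebA annotation file `list_attr_celeba.txt`:
# line 1 is the image count, line 2 the 40 attribute names, and every
# following line holds an image filename and 40 values in {-1, +1}.
attrs = pd.read_csv("list_attr_celeba.txt", sep=r"\s+", skiprows=1, index_col=0)

# Fraction of images for which each binary attribute is positive.
positive_fraction = (attrs > 0).mean()

print(positive_fraction[["Male", "Young"]])   # the attributes behind the gender and age biases
print(positive_fraction.sort_values())
```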

Within this work, we investigate the current state of disentangled representation learning through Research Question 1. We conclude that the practice of disclosing biases within datasets is not established within the field. Furthermore, we develop a new method as we explore Research Question 2. The method developed will not directly be usable as a tool which has an effect on society. However, it might be used to preprocess raw input data for a wide variety of downstream tasks, which might come to cause ethical implications. It is therefore important to note that the methods presented within this work are not invariant to biases. Biases within the datasets will induce a bias within the learnt representations. We therefore urge anyone utilizing the presented methods to be aware of potential biases.

1.4 Notation

We will in this work refer to the probability density function or probability mass function of the random variable X, for an absolutely continuous or discrete random variable respectively, by writing p(X), or simply the density of X.

Chapter 2

Unsupervised Representation Learning

In the following chapter we will establish concepts that are foundational to the work presented later. Simultaneously, we aim to develop an intuition for the benefits of unsupervised representation learning and we discuss some notable connections within the area. We begin by establishing unsupervised learning and its role in modern machine learning in Section 2.1. In Section 2.2 we discuss the challenges that come with high-dimensional data and how representation learning solves the posed problem. We then formally present the goal of representation learning and how we can strive for optimal representations of high-dimensional data in Section 2.3. Autoencoders are presented as an unsupervised representation learning method in Section 2.4. Finally, we discuss the IB [17] in Section 2.5, which will be important when following later discussions of representation optimality.

2.1 Unsupervised Learning

Definition 2.1.1 (Unsupervised Learning).
Let the random variables X and Ψ be some input data and a set of parameters, respectively. Then, unsupervised learning is a task of unconditional density estimation of the distribution p(X|Ψ), w.r.t. Ψ.

In contrast, by introducing a corresponding target for some task, we can define supervised learning, of which classification and regression are typical applications.

Definition 2.1.2 (Supervised Learning).
Let X, Y, and Ψ be random variables, where Y is a corresponding target to the input data X, and Ψ is a set of parameters. Then, supervised learning is a task of conditional density estimation of p(Y|X, Ψ).

Thus, the problem formulation of unsupervised learning assumes an explanatory approach to the data rather than describing how the target can be inferred from the data. For some cases, it is natural to assume that the data is generated from a set of hidden ground truth factors. Consider some high-dimensional data; there might only be a few degrees of variability which are essential to the generation of the data. These degrees of variability are not always directly interpretable in the data but are still often highly descriptive. Such hidden degrees of variability are the latent variables of the generative model of which we observe the data. Given such a case, unsupervised learning is a natural approach to discover these latent factors.

Take the task of visual object recognition as an example. A set of latent variables for the observed data could include the color, position, rotation, and shape of the object. These features are a good description of higher-order correlations of pixel intensities and are often highly discriminative for predicting a class label. Therefore, when the input contains structures that can be well described by latent variables that simultaneously are highly informative about a downstream task, it is a more effective approach to first detect them through unsupervised learning [1], i.e. feature extraction.

Conversely, in a completely supervised approach, the backpropagation-based optimization updates will ignore the inherent structure of the input data and will instead model how the output depends on the input. The model will therefore learn to approximate a function whose variations mainly have an effect at a decision boundary. This is a less effective approach if the inherent structure of the input data can better be explained by its latent variables [1]. Inherently, unsupervised learning permits the usage of unannotated data and does not demand human supervision. Moreover, annotated data contains relatively little information, often an insufficient amount to reliably estimate the parameters of complex models [37].

Beyond this, unsupervised learning is more akin to biological intelligence, where perception of new information does not necessarily require a label. As highlighted by Hinton [38],

When we’re learning to see, nobody’s telling us what the right answers are — we just look. Every so often, your mother says “that’s a dog”, but that’s very little information. You’d be lucky if you got a few bits of information — even one bit per second — that way. The brain’s visual system has 10^14 neural connections. And you only live for 10^9 seconds. So it’s no use learning one bit per second. You need more like 10^5 bits per second. And there’s only one place you can get that much information: from the input itself.

Many high-performing algorithms of today, achieving state-of-the-art results, often lack the generalizing, robust properties that are characteristic of biological intelligence [22]. As argued by [19], AI must fundamentally be able to understand the world around us, and the only way to achieve that is if an AI is able to identify and disentangle the underlying explanatory factors hidden in large amounts of observed low-level sensory data. Hence, unsupervised learning is a natural approach towards developing more generally applicable AI.

Unsupervised learning can preferably be used to learn feature extractors that align with statistical structures of the input data. Representing the input data in a way that aligns with explanatory factors will render the representations useful for most downstream tasks. Developing algorithms that can find the optimal representations of raw input data is the endeavor of representation learning.

2.2 Challenges of High Dimensionality

A representation of some high-dimensional data is, within representation learning, intended to be interpretable and compact, while also being informative. Data with large amounts of features is often uninterpretable, hard to visualize, computationally expensive, and requires more storage, while simultaneously containing information that is not relevant for a given task.

In many cases, it is beneficial to reduce a high-dimensional input to a lower dimension where only essential factors of the input data are captured and non-important information is filtered out. For example, a lower-dimensional input enables fast nearest-neighbor search, and two-dimensional reduction is often used for visualizing complex data. Principal Component Analysis (PCA) is an example of dimensionality reduction, famously used for face recognition [39].

To paint the picture of excessiveness, consider the classification task of CIFAR-10 [40] from an information theory perspective. The task is to discriminate between 10 uniformly distributed classes, given an input image that is 8-bit, 32 by 32 pixels, and RGB. A well chosen feature space can represent all classes given only 4 binary features,¹ while the original feature space of the images allows for a total of 2^24576 combinations.
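The numbers above are straightforward to verify; the snippet below is only a back-of-the-envelope check of the class entropy and the size of the raw input space.

```python
import math

# 10 uniformly distributed classes need H(Y) = log2(10) ~ 3.32 bits,
# i.e. 4 binary features, while an 8-bit, 32-by-32, RGB image lives in
# a space of 2^(32 * 32 * 3 * 8) = 2^24576 configurations.
entropy_bits = math.log2(10)
raw_bits = 32 * 32 * 3 * 8

print(f"H(Y) = {entropy_bits:.2f} bits -> {math.ceil(entropy_bits)} binary features")
print(f"raw input space: 2^{raw_bits} possible images")
```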

Innately, when working with high-dimensional data, a common problem, the curse of dimensionality, will inevitably be encountered. It was first presented by Bellman [41] to describe the problem of an exponentially increasing volume when introducing extra dimensions in Euclidean space. Within machine learning, however, the problem often refers to the exponentially increasing sparsity of data in an increasingly large feature space.

¹ The entropy H(Y) = −E[log2 p(Y = y)] ≈ 3.32 bits, where p(Y = y) = 1/10.

Generalizing correctly, from the training set to the full possible feature space, grows exponentially harder with the sparsity of the data. Consider a feature space with 100 dimensions, which is considered moderate, and a training set of 10^12 examples, which is a vast set by modern standards. The training set only occupies 10^-18 of the full feature space [42].

If we further examine some high-dimensional input space where we employ a distance metric, e.g. the Euclidean norm, and assume that a subset of the features are relevant while a majority are irrelevant, the effect of the relevant features on the distance metric will be marginal with respect to the effect of the irrelevant features. This makes two similar examples indistinguishable from two dissimilar examples.

Extending even further, imagine a d-dimensional hypercube with uniform probability density which confines a hypersphere; it can be shown that the probability of drawing a sample within the hypersphere approaches zero as the dimensionality increases towards infinity. The density concentrates in the corners of a high-dimensional hypercube, and points near the hypersphere's center will, beyond some dimensionality, be closer to a face of the hypercube than to their nearest neighbor. Consequently, the expected distance between two similar examples will, in a high-dimensional space, be larger than that to something dissimilar, rendering distance metrics useless, even in the case of only relevant features.
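This concentration effect is easy to illustrate numerically. The sketch below computes the probability that a uniform sample from the hypercube [−1, 1]^d falls inside the inscribed unit hypersphere; the chosen dimensions are arbitrary.

```python
import math

def inscribed_ball_fraction(d):
    """Fraction of the hypercube [-1, 1]^d occupied by the inscribed unit hypersphere."""
    ball = math.pi ** (d / 2) / math.gamma(d / 2 + 1)   # volume of the unit-radius d-ball
    cube = 2.0 ** d                                      # volume of the hypercube [-1, 1]^d
    return ball / cube

for d in (2, 5, 10, 20, 50):
    print(f"d = {d:2d}: P(uniform sample lands in the hypersphere) = {inscribed_ball_fraction(d):.2e}")
```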

2.3 Representation Learning Desiderata

To combat the curse of dimensionality, while extracting relevant information for a wide variety of downstream tasks and imposing desired properties, we can learn efficient ways to represent high-dimensional data.

Definition 2.3.1 (The Task of Representation Learning).
Let the random variable X ∈ 𝒳 ⊆ R^C be some input data and let T be a random variable describing some downstream task. We introduce a representation of X as Z ∈ 𝒵 ⊆ R^D, where typically D < C. Then a mapping h: 𝒳 → 𝒵 is sought s.t.

p(T|X) = p(T|Z)   (2.1)

and Z displays desired properties, e.g. compactness and/or interpretability.

In practice, deep and other neural network approaches to supervised learning tasks will inherently display feature extraction in accordance with Definition 2.3.1. If we observe the full network, from input to output, the output of an arbitrary, intermediate layer could be considered as a representation of the input. A partial explanation of the success of deep learning lies in the ability to extract features that are invariant to nuisances such as translations, rotations, and occlusions while also being disentangled [16].

Naively, the optimal representation, when the downstream task is given, could simply be considered as the representation that optimizes the objective of the downstream task; e.g. in a classification task, the objective is to minimize misclassifications. However, in the reality of representation learning, the downstream task is unknown, or assumed to be a wide variety of tasks, while certain properties of the representation are desired. Naturally, the question of optimality is no longer trivial.

Consequently, a fundamental challenge of representation learning is establishing a clear objective. Lacking such an objective, a representation cannot be deemed optimal. Instead, we have to rely on properties of learnt representations to discriminate between good and bad representations. The question of how to formulate training criteria to achieve such properties is an open one and is further discussed in [19]. Perhaps it is sufficient to train a good model for maximizing likelihood, or perhaps general-purpose priors need to be introduced with the hope that they benefit learning a representation with desired properties.

2.3.1 Properties of Optimal Representations

From a probabilistic perspective, a good representation is one which captures the posterior distribution of the underlying explanatory factors of the observed data while also containing the relevant information to be useful as input to a supervised task. More formally, we can take an information theoretical approach to the question of optimal representations. Following that of [16], an optimal representation should display the properties of sufficiency, minimality, and invariance.

Definition 2.3.2 (Sufficient Representation).
Let the r.v. X be some input data with the r.v. Z as a representation and let the r.v. T describe some downstream task. The representation is then sufficient for the task described by T if

I(T; Z) = I(T; X).   (2.2)

Definition 2.3.3 (Minimal Representation).
Let the r.v. X be some input data with the r.v. Z as a representation. The representation is then minimal if I(Z; X) is minimized s.t. Z retains as little information about X as possible. In particular, if I(T; Z) = I(T; X) = I(Z; X), Z is a minimal sufficient representation.

Definition 2.3.4 (Invariant Representation).
Let the r.v. X be some input data with the r.v. Z as a representation and let the r.v. T describe some downstream task. Assume that some r.v. N is a nuisance for the task described by T. Then, the representation Z is invariant to N if

I(Z; N) = 0.   (2.3)

If a representation which satisfies these properties exists, it is not unique. A further property can hence be enforced, that of minimal statistical dependence between latent dimensions, i.e. maximizing disentanglement.

Ridgeway [24] summarizes the desired properties of representations as being compact and faithful to information represented in the input, analogous to minimality and sufficiency; explicit in the representation of attributes for a downstream task, which includes invariance of attributes; and lastly, as being interpretable by humans.

2.3.2 Distributed Representations

Expressiveness is a key property of a good representation, meaning that for a reasonably-sized representation a large number of possible inputs can be represented. For the sake of comparison, consider the expressiveness of a representation as the number of input regions (or configurations) it can discriminate relative to the number of parameters it requires [19]. Representation learning algorithms such as traditional clustering algorithms, Gaussian mixtures, nearest-neighbor algorithms, decision trees, and the Gaussian Support Vector Machine (SVM), which learn one-hot representations, will require O(N) parameters (and/or O(N) examples) to differentiate between O(N) input regions. In contrast, learners of distributed or sparse representations, such as the Restricted Boltzmann Machine (RBM), sparse coding, autoencoders, or multi-layer neural networks, can represent up to O(2^k) input regions using O(N) parameters. For dense representations k = N, and in sparse representations k is the number of non-zero (or active) elements.
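The gap in expressiveness is easy to see numerically; the toy computation below (with an arbitrary choice of N) contrasts a one-hot code with a dense distributed binary code of the same size.

```python
# A one-hot (local) code with N units distinguishes N input regions, while a
# distributed binary code with the same N units can address up to 2^N regions
# (the dense case k = N above). N is an arbitrary illustrative choice.
N = 30
print(f"one-hot code with {N} units:     {N} distinguishable regions")
print(f"distributed code with {N} units: {2 ** N} distinguishable regions")
```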

2.3.3 Invariant Features

The advantage of acquiring invariant features is that they, by definition, have a reduced sensitivity in the direction of invariance. For a given task, it is desirable to have features insensitive to variations of the data that are uninformative for the task at hand. Generally, abstract features will present such invariance to local changes of the input. Since abstract concepts can be created through less abstract ones, deep hierarchical architectures have the potential of representing abstract features. The convolutional neural network [43] demonstrates this by explicitly including a pooling layer in the architecture, and the C-cells of the Neocognitron [44] work in a similar manner, where small local changes will result in the same output.


It is, however, not trivial to determine a priori which variations are informative and which are not. Furthermore, it is common to use the same set of representations for several different tasks which require different sets of relevant features. To combat this, the most robust approach is, according to Bengio et al., to ‘disentangle as many factors as possible, discarding as little information about the data as is practical’ [19]. Disentangled representations are further discussed in Chapter 3.

2.3.4 Sparse Representations

By learning sparse representations, we implicitly select the smallest set of features that provides relevant information for a downstream task. Accordingly, sparse representations can reduce storage and computation, prevent overfitting, and help with understanding the representation [37].

Sparsity is also beneficial when applying quantization and compression. A sparse feature is often considered as being either active or inactive, facilitating the choice of quantization cells. Considering a sparse distribution, such as the zero-mean Laplace distribution with unit variance, drawn samples will predominantly be close to zero. An efficient coding scheme could then assign a small codeword to close-to-zero samples and achieve a low rate.
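As a small numerical illustration of this point (not taken from the experiments in this work), the snippet below compares how much probability mass a unit-variance Laplace distribution places near zero relative to a unit-variance Gaussian.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 100_000

# Zero-mean, unit-variance Laplace: Var = 2 b^2 = 1  =>  scale b = 1 / sqrt(2).
laplace = rng.laplace(loc=0.0, scale=1.0 / np.sqrt(2), size=n)
gaussian = rng.normal(loc=0.0, scale=1.0, size=n)

threshold = 0.5   # treat |z| < threshold as "(near-)inactive", i.e. a cheap codeword
print("Laplace  mass near zero:", np.mean(np.abs(laplace) < threshold))
print("Gaussian mass near zero:", np.mean(np.abs(gaussian) < threshold))
```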

In signal processing, wavelet basis functions are used to represent signals, such as images or speech, efficiently. Comparable to wavelet basis functions are the receptive fields of simple cells in the mammalian primary visual cortex. Olshausen and Field [45] have shown that by adopting a coding strategy that maximizes sparsity, the coding learns filters similar to the biological equivalent. Hence, it can be argued that biological visual systems utilize sparsity.

Learning algorithms based on non-linear kernels are highly computationally limited, having to evaluate the kernel function for every pair of training samples. Training such learners can be computationally infeasible and prediction times are excessive. The SVM finds sparse, kernel-based solutions to problems within, e.g., classification, regression, and novelty detection in an efficient manner [46]. The SVM only considers the relevant training samples for prediction, which constitute the support vectors, and is consequently a more efficient approach. Geometrically, the SVM only considers the training points on the decision boundary in the non-linear feature (or dual) space.

2.3.5 Meta-Priors of Representation Learning

A common approach within Machine Learning (ML) to combat problems of generalizability and robustness is to introduce inductive biases aligning with underlying data structures through the model architecture [22, 47–49]. Famously, the Convolutional Neural Network (CNN) [50] enforces the translation symmetry characteristic of visual data through the convolutional operator ingrained in the architecture of the network.

Alternatively, there is the approach of learning representations where the underlying data structure is characterized [16, 17, 29, 31, 51]. Instead of introducing biases through architecture, we can impose priors on representations that express notions of the world structure. Bengio et al. [19] present a set of examples of such general-purpose priors that are not task-specific.

a) Smoothness: Let f be a target function that is to be learned. Then the assumption of smoothness is s.t. x ≈ y generally implies f(x) ≈ f(y). The assumption of smoothness is present in many machine learning algorithms as a basic prior; it is, however, not sufficient to deal with the curse of dimensionality.

For advanced tasks, simple parametric models are not able to capture enough complexity of interest if they are not provided with an appropriate feature space. Instead, flexibility has been explored in local non-parametric learners, such as kernel machines with a fixed generic local-response kernel, e.g. the Gaussian kernel. The learners are local in the sense that the learned target value f(x) mostly depends on training examples in the close neighborhood of x.

However, many of these methods only exploit the principle of local generalization [52–55]. Under the assumption of smoothness, the mapping of all of the target function's wrinkles is done solely through the provided training examples, and generalization becomes a local interpolation between neighboring examples. With the curse of dimensionality, the number of wrinkles may grow exponentially with the number of relevant factors, and local generalization may not be able to reliably map the target function.

In short, smoothness is a useful assumption, but it is insufficient as a generic prior to deal with a raw input space. Bengio et al. [19] instead advocate for flexible and non-parametric algorithms that do not only incorporate the general-purpose prior of smoothness.

b) Multiple explanatory factors: The data generating distribution is explainable by a set of different latent factors of variation. What is learned about a factor is, for most configurations of the explanatory factors, generalizable. Distributed representations, further discussed in Section 2.3.2, extend upon this prior.

c) Hierarchical organization of explanatory factors: The set of explanatory factors can be organized in a hierarchical manner. Factors that are the most abstract are at the top of the hierarchy and can be explained by less abstract ones, much like how the world can be organized in terms of concepts.

d) Semi-supervised learning: Let X be an input and Y a prediction target; then a subset of the latent factors of X will also, to an extent, be explanatory of Y. Thus, representations that are beneficial for the unsupervised task are often also beneficial for the supervised task. Statistical strengths therefore transfer between supervised and unsupervised learning.

e) Shared factors across tasks: The set of explanatory factors of X that are useful for the task of predicting Y is shared by other prediction tasks. Statistical strengths of factors can consequently be shared across tasks, e.g. in Transfer Learning.

f) Manifolds: Probability mass concentrates near regions with a much lower dimensionality than that of the original input space. This phenomenon partly counteracts the curse of dimensionality, and we can consider it a blessing of concentration. Consider the dataset of handwritten digits, MNIST [56]. Even though the input space is vast and can represent a wide array of images, examples of the dataset are not uniformly distributed in the full space, but rather concentrate in a local space representing digits. Some autoencoders, such as the Manifold Tangent Classifier [8], explicitly exploit this.

g) Natural Clustering: Different values of categorical variables, object classes for example, are associated with separate manifolds. Local variations on the manifold tend to preserve the value of a category. Examples x ∼ P(X|Y = i) for different classes i tend to be well separated and non-overlapping.

h) Temporal and Spatial Coherence: Samples that are close in either time or space tend to (i) relate to the same categorical values of relevant factors, or (ii) result in a small variation on the surface of the high-density manifold. Different factors change at different temporal or spatial scales. Most interestingly, many categorical concepts of interest change slowly. The principle of temporal coherence is used in [57] to learn invariant features.

i) Sparsity: For an observation, only a small subset of the possible explanatory factors is important, which in a representation could be exhibited either as features that are predominantly zero or as features that are insensitive to small variations of the observation.

j) Simplicity of Factor Dependencies: A good set of extracted features is connected through simple, typically linear, dependencies, which can be demonstrated by employing a linear predictor on a learned representation.

2.4 Autoencoders

An autoencoder [58–61] consists mainly of two parametric mappings: an encoder of the input to a representation, and a decoder that can reconstruct such an input from its representation. For an extensive review of autoencoder-based representation learning the reader is referred to [21].

Autoencoders are designed such that they do not simply learn to output perfect reconstructions of all possible inputs, but rather only of inputs resembling the training data. One way to accomplish this is by constraining the dimensionality of the representation to be less than that of the input, in the form of an undercomplete autoencoder. Due to this restriction, the autoencoder needs to prioritize relevant features, and often important properties of the input will be represented in the learnt code.

Architecturally, the autoencoder consists of two parts: the encoder f_ψ(x), which is a mapping from input to representation, and the decoder f_ξ(z), which maps a representation back to the input space. The encoder and decoder are parametrized in closed form by the sets of parameters ψ and ξ, respectively. Modern autoencoders have extended the idea of encoding and decoding as deterministic functions and consider the mappings as stochastic, i.e. p_ψ(Z|X) and p_ξ(X|Z), where X and Z are random variables.

Applications of autoencoders are diverse and include, among others, dimensionality reduction and information retrieval [62]. In [1], for example, a stack of RBMs was trained and used to initialize a deep autoencoder which outperformed PCA at the same dimensionality reduction, w.r.t. reconstruction error, while providing qualitatively better representations.

Learning the parameters of the autoencoder is done simultaneously for encoder and decoder parameters by minimizing some objective

L_AE(ξ, ψ; x) = Σ_i L(x, f_ξ(f_ψ(x)))   (2.4)

where L(·, ·) is some reconstruction error and x a training example.
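As an illustration of (2.4), the following is a minimal sketch of an undercomplete autoencoder trained with a squared reconstruction error in PyTorch; the layer sizes, optimizer settings, and random data are arbitrary placeholders, not the architecture used later in this work.

```python
import torch
import torch.nn as nn

# Undercomplete autoencoder: encoder f_psi and decoder f_xi, trained jointly
# by minimizing the squared reconstruction error of (2.4).
encoder = nn.Sequential(nn.Linear(784, 128), nn.ReLU(), nn.Linear(128, 16))
decoder = nn.Sequential(nn.Linear(16, 128), nn.ReLU(), nn.Linear(128, 784))
optimizer = torch.optim.Adam(list(encoder.parameters()) + list(decoder.parameters()), lr=1e-3)

x = torch.rand(64, 784)              # stand-in batch of flattened inputs
for _ in range(100):
    z = encoder(x)                   # representation f_psi(x)
    x_hat = decoder(z)               # reconstruction f_xi(f_psi(x))
    loss = ((x - x_hat) ** 2).sum(dim=1).mean()
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
```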

2.4.1 Regularized Autoencoders

Given too much capacity in the encoder and decoder, an undercomplete autoencoder fails to learn relevant features. The same is true when the code dimensionality exceeds that of the input, in the case of overcomplete autoencoders, or when they are equal, even with a linear encoder and decoder. It shall however be noted that overcomplete sparse autoencoders are able to learn relevant features. More recent research on autoencoders has shown promising results by constraining the representation through regularization rather than by modifying the dimensionality. The purpose of introducing regularization is to force the representation to be insensitive to local changes in the input. Applying this is the Denoising Autoencoder [63, 64], where the mapping becomes insensitive to small random perturbations of the input, as well as the Contractive Autoencoder [65], where the reconstruction error is forced to be high when moving in most directions around a training example.

2.4.2 Sparse Autoencoders

The idea of sparsity regularization was introduced in [66], where a linear encoder and decoder in combination with a ‘sparsifying’ non-linearity are used to learn sparse, overcomplete features. The k-Sparse autoencoder [67] investigates the effect of sparsity itself and is shown to outperform denoising autoencoders, RBMs, and networks trained with dropout. Another perspective on sparsity regularization is that of the sparse autoencoder framework as approximating maximum likelihood training of a generative model with latent variables [62].

Penalizing hidden unit biases can enforce sparsity in the representation, by forcing the offset parameters to be more negative [66, 68–70]. Alternatively, the hidden unit activations themselves can be penalized [57, 71].
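A sketch of the latter approach is given below: the reconstruction loss is augmented with an ℓ1 penalty on the hidden activations. The encoder, decoder, and the regularization weight lam are hypothetical placeholders in the style of the previous sketch, not values or code from this work.

```python
def sparse_ae_loss(x, encoder, decoder, lam=1e-3):
    """Squared reconstruction error plus an L1 activity penalty on the hidden code."""
    z = encoder(x)                                  # hidden activations
    x_hat = decoder(z)
    reconstruction = ((x - x_hat) ** 2).sum(dim=1).mean()
    sparsity = z.abs().sum(dim=1).mean()            # L1 penalty pushing activations towards zero
    return reconstruction + lam * sparsity
```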

2.4.3 Connection to Probabilistic PCA

PCA is a technique used to study multivariate data; it is used in several applications such as dimensionality reduction, feature extraction, lossy data compression, and data visualization [72]. There exist two main definitions of PCA that result in the same algorithm [46]: either as the orthogonal projection of the data onto a lower-dimensional linear space such that the variance of the projected data is maximized [73], or, equivalently, as the linear projection that minimizes the average projection cost, defined as the mean squared distance between projections and corresponding data points [74].

Reformulating PCA as the maximum likelihood solution of a probabilistic latent variable model brings several advantages over conventional PCA [46], such as being able to formulate a computationally efficient EM algorithm; being foundational for a Bayesian treatment of PCA; allowing samples to be drawn from the distribution by running the model generatively; allowing the management of missing values by combining the probabilistic model with EM; and permitting its application to classification tasks due to its ability to model class-conditional densities.


Firstly, an explicit latent random variable Z is introduced which represents the principal component subspace. Secondly, a prior distribution of the latent variable is defined to be a zero-mean Gaussian with unit-covariance

p(Z) = N(Z|0, I).   (2.5)

Lastly, the conditional distribution of the observed random variable X is defined as Gaussian

p(X|Z = z) = N(X|Wz + µ, σ²I).   (2.6)

This is known as probabilistic PCA, which was independently proposed in [75, 76]. Probabilistic PCA is closely related to factor analysis [77], where the latter uses a diagonal covariance matrix rather than an isotropic one.
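Ancestral sampling from the model defined by (2.5) and (2.6) is straightforward; the sketch below uses arbitrary dimensions and parameters purely for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)

# Probabilistic PCA generative model: z ~ N(0, I_D), x | z ~ N(W z + mu, sigma^2 I_C).
# D, C, W, mu, and sigma are arbitrary illustrative choices.
D, C, sigma = 2, 10, 0.1
W = rng.normal(size=(C, D))
mu = np.zeros(C)

z = rng.normal(size=D)                          # latent sample from (2.5)
x = W @ z + mu + sigma * rng.normal(size=C)     # observed sample from (2.6)
print(x)
```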

A linear autoencoder, where the encoder and decoder are linear, that minimizes (2.4) with a squared reconstruction error will learn the same subspace as probabilistic PCA. This holds true even when using a sigmoid non-linearity in the encoder [58], given that the weights of the encoder and the decoder are not tied (W_enc ≠ W_dec^T).

It has further been shown in [78] that adding a regularization term Σ_{j,i} s(W_j x^(i)) to a linear autoencoder with tied weights, using a convex non-linearity s(x), results in an efficient learning method for linear Independent Component Analysis (ICA).

2.4.4 Connection to Sparse Coding

Sparse coding, first presented in [45] as a model of simple cells in the visual cortex, differs from PCA by adding a penalty to enforce that a sparse activation encodes each input. A latent representation is related to the data through a linear mapping known as the dictionary. Recovery of the sparse code z* for a new input x is done by minimizing the reconstruction error and the sparsity penalty

z* = argmin_z ||x − Wz||_2^2 + λ||z||_1   (2.7)

and optimizing the objective

L_SC = Σ_t ||x^(t) − W z*^(t)||_2^2   (2.8)

will lead to learning the dictionary W [37]. Note that the columns of W are usually constrained to be unit-norm [19].
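One standard way to solve (2.7) for a fixed dictionary is proximal gradient descent (ISTA), alternating a gradient step on the reconstruction term with soft-thresholding; the sketch below is an illustrative implementation with a random, column-normalized dictionary, not the dictionary learning procedure of (2.8).

```python
import numpy as np

def ista(x, W, lam=0.1, n_iter=200):
    """Recover z* = argmin_z ||x - W z||_2^2 + lam * ||z||_1, as in (2.7), by ISTA."""
    step = 1.0 / (2.0 * np.linalg.norm(W, ord=2) ** 2)    # 1 / Lipschitz constant of the smooth part
    z = np.zeros(W.shape[1])
    for _ in range(n_iter):
        grad = 2.0 * W.T @ (W @ z - x)                    # gradient of the reconstruction term
        z = z - step * grad
        z = np.sign(z) * np.maximum(np.abs(z) - step * lam, 0.0)   # soft-thresholding (prox of the L1 term)
    return z

# Toy usage with a random, column-normalized dictionary.
rng = np.random.default_rng(0)
W = rng.normal(size=(20, 50))
W /= np.linalg.norm(W, axis=0)
x = rng.normal(size=20)
print("non-zero code dimensions:", np.count_nonzero(ista(x, W)))
```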

Similarly to PCA, sparse coding also has a probabilistic interpretation

p(Z) = Laplace(Z|0, I)   (2.9)
p(X|Z = z) = N(X|Wz + µ, σ²I).   (2.10)

Instead of a Gaussian prior on the latent variable Z, a Laplacian prior, which is sparsity enforcing and corresponds to ℓ1-regularization, is used [19].

It has recently been shown that the expected gradient of the squared loss function of a linear autoencoder is very close to zero in the neighborhood of the generating overcomplete dictionary [79], demonstrating that autoencoders can be trained by a gradient descent approach to solve the sparse coding problem.

2.5 The Information Bottleneck Method

Given two statistically dependent random variables X and Y with known joint density p(X, Y), X ∈ 𝒳, Y ∈ 𝒴, the IB method [17] considers the problem of extracting the relevant information that X contains about Y. An optimal representation of X would retain the relevant features while compressing X by dismissing the irrelevant information. We can quantify the compression of X, after introducing the bottleneck variable Z, by the mutual information I(X; Z), where Z is a representation of X under the assumption of the Markov chain constraint, i.e. that Z is conditionally independent of Y given X.

The relevant information is defined by Tishby et al. [17] as the information that remains from X about Y after compression, i.e. in the representation Z, given a minimum level of compression. Due to the Data Processing Inequality (DPI) [80] we know that the representation will always retain more information about X than about Y, i.e., I(X; Z) ≥ I(Z; Y). Hence, the optimal representation can be found by optimizing

max_Z I(Z; Y)   s.t.   I(X; Z) ≤ r,   (2.11)

where r is the minimum compression level. This problem is generally computationally intractable, and by relaxing the problem a solution can be found by instead minimizing the Lagrangian

L_IB(p(Z, X), p(Y, Z)) = I(X; Z) − βI(Y; Z)   (2.12)

subject to the Markov chain constraint. The Lagrange multiplier β controls the trade-off between the compression of X and the information about Y retained in the representation.
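For discrete variables with a known joint distribution, the quantities in (2.12) can be evaluated directly. The sketch below computes the IB Lagrangian for a given stochastic encoder p(z|x) under the Markov chain constraint; it is a didactic illustration, not a practical IB solver.

```python
import numpy as np

def mutual_information(p_ab):
    """I(A; B) in nats for a joint distribution given as a 2-D array p_ab[a, b]."""
    p_a = p_ab.sum(axis=1, keepdims=True)      # marginal p(a)
    p_b = p_ab.sum(axis=0, keepdims=True)      # marginal p(b)
    mask = p_ab > 0
    return float(np.sum(p_ab[mask] * np.log(p_ab[mask] / (p_a @ p_b)[mask])))

def ib_lagrangian(p_xy, p_z_given_x, beta):
    """Evaluate L_IB = I(X; Z) - beta * I(Y; Z) for an encoder p(z|x), using the
    Markov chain constraint so that p(z, y) = sum_x p(z|x) p(x, y)."""
    p_x = p_xy.sum(axis=1)
    p_xz = p_z_given_x * p_x[:, None]          # joint p(x, z)
    p_zy = p_z_given_x.T @ p_xy                # joint p(z, y)
    return mutual_information(p_xz) - beta * mutual_information(p_zy)

# Toy usage: X uniform over four symbols, Y correlated with a binary function of X,
# and a hard encoder that keeps only that binary function of X.
p_xy = np.array([[0.20, 0.05], [0.20, 0.05], [0.05, 0.20], [0.05, 0.20]])
p_z_given_x = np.array([[1.0, 0.0], [1.0, 0.0], [0.0, 1.0], [0.0, 1.0]])
print(ib_lagrangian(p_xy, p_z_given_x, beta=4.0))
```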

Chapter 3

Disentangled Representations

We will now move on to investigate disentangled representation learning in detail and simultaneously review the current state of the field. To begin with, we will in Section 3.1 discuss the definition by exploring properties of a disentangled representation and a formal definition proposed by Higgins et al. [22]. We then continue to propose a probabilistic interpretation in Section 3.1.1, which bridges the gap between the formal definition and current state-of-the-art methods. These methods are presented in detail in Section 3.2. Subsequently, we review the evaluation practices of disentangled representations. A detailed discussion of quantitative disentanglement metrics as well as qualitative evaluation methods is presented in Section 3.3. Lastly, we make the case for introducing sparsity into the current landscape of disentangled representation learning in Section 3.4, where we also outline a method for enforcing sparsity. Throughout this chapter we also briefly touch on the implementation of methods and metrics. Implementations are publicly available at [81].

3.1 Exploring the Definition

Firstly, we need to establish the world view that is assumed when discussing disentangled representations. We suppose that some observations have a set of generative, underlying factors of variation that are independent from each other. These factors are hidden from direct observation and are only observable indirectly. In the context of visual, disentangled representation learning, the observations are a set of images constituting our data, and the latent factors are referred to as data generative factors. Some generative process creates the data given the latent factors. The goal is to learn a representation which axis-aligns with the data generative factors, with some additional constraints which we infer by exploring the definition of disentangled representations.

Secondly, we must recognize that there is no universally accepted definition of disentangled representations. However, three properties have been used to describe disentanglement in an intuitive manner: modularity, compactness, and explicitness [18, 82], sometimes, to an extent interchangeably, referred to as disentanglement, completeness, and informativeness. a) Modularity: requires that each dimension of the representation contains information about at most one factor. b) Compactness: a given factor is associated with one or a few code dimensions. c) Explicitness: requires that all factors can be decoded from the representation using a linear transformation.

Compactness is, however, one of several points of contention within the field: whether or not a factor should be allowed to be encoded in more than one dimension. Among those arguing against strict compactness are [18, 22]. There is also a disparity among metrics on the topic, as highlighted by [22].

The first formal definition of disentangled representations [22] is in line with the established intuitions, while simultaneously providing principled resolutions to disagreements. By connecting the study of symmetry transformations to vector representations using group theory, a definition can be formulated as:


Definition 3.1.1 (Disentangled Representation).

A vector representation is called a disentangled representation, with respect to a particular decomposition of a symmetry group into subgroups, if it decomposes into independent subspaces, where each subspace is affected by the action of a single subgroup and the actions of all other subgroups leave the subspace unaffected.

In hopes of clarifying Definition 3.1.1, we briefly reiterate the formal definition of Higgins et al. while considering the example of a confined two-dimensional world. However, we refer the reader to [22] for the comprehensive definition, as well as preliminaries on group theory. On the two-dimensional plane of this world, there exists some object which can scale in size, move in the two directions along the plane, and assume a set of colors. These four properties characterize this world and we let V be the set of these world states.

Allow us to diverge shortly to examine the concept of symmetry. A symmetry of some object is a transformation that leaves certain characteristics of the object invariant. Typical examples of symmetry transformations are translation and rotation. Take a potato as an example of an object: it will remain a potato whether or not it is moved up, down, or sideways; and it will remain a potato whether or not it is rotated on the table on which it rests. In the context of machine learning, recall the classification problem of MNIST [56], mapping images of handwritten digits to corresponding labels. Symmetries of the digits include the thickness of the lines and their tilt; when these properties are altered,¹ the digit still remains the same.

The definition of Higgins et al. builds on the assumption that the world dynamics can be described by its symmetry transformations. They argue for this assumption by describing the role of symmetries within the field of physics and the paradigm shift that studying them caused, starting with Noether’s theorem [83]. Studying the transformations of a system proved to be beneficial for uncovering new properties of the system and provided a way of generalizing knowledge to new domains. Intriguingly, this would theoretically apply to the field of visual machine learning as well. If we can understand what properties of the world remain the same when transformed in some particular ways, we could generalize our acquired knowledge. For example, a new object can be recognized as an object because it behaves similarly to other objects under symmetry transformations, which in the context of scene understanding could for example be translation and rotation. Notably, feature extraction within visual data analysis works with feature extractors that are designed to extract keypoints that are invariant to scale and rotation [84]. For example, the representations learned for MNIST digit classification can be useful to distinguish between different speed limit signs (domain adaptation) and/or to distinguish between different characters in Optical Character Recognition (OCR) (multi-task or transfer learning).

Returning to our postulated two-dimensional world, we note that the four distinct properties are the symmetry transformations of that world. Horizontal and vertical translation does not change the identity of the object; neither does changing color or scaling. This set of transformations composes a symmetry group, and the effects of these transformations are the actions of the symmetry group on the world state. It shall also be emphasized that these actions on the world states are all independent of each other. Moving horizontally does not affect the color, scale, or vertical position of the object. Such actions are defined by Higgins et al. as disentangled group actions.

Definition 3.1.2 (Disentangled Group Action).

Consider a group action $\cdot\colon G \times V \to V$, where the group $G$ decomposes as a direct product $G = G_1 \times G_2$. If we denote the action of the full group as $\cdot$ and the actions of each subgroup $i$ as $\cdot_i$, then the action is disentangled, with respect to the decomposition of $G$, if there is a decomposition $V = V_1 \times V_2$ and actions $\cdot_i\colon G_i \times V_i \to V_i$, $i \in \{1, 2\}$, such that

$(g_1, g_2) \cdot (v_1, v_2) = (g_1 \cdot_1 v_1,\, g_2 \cdot_2 v_2).$ (3.1)

Or, in the case of the group decomposition $G = G_1 \times \ldots \times G_n$: the action is disentangled, with respect to the decomposition of $G$, if there is a decomposition $V = V_1 \times \ldots \times V_n$ or $V = V_1 \oplus \ldots \oplus V_n$ such that each $V_i$ is fixed by the actions of all the $G_j$, $j \neq i$, and affected only by $G_i$.

In our example, the symmetry group can be decomposed into four separate symmetry subgroups: one each for horizontal and vertical translation, one for color, and one for scale.
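To make the decomposition concrete, the following is a minimal Python sketch (not taken from the thesis) of the example world: each subgroup action touches exactly one component of the world state and leaves the others unchanged, mirroring (3.1). The class and function names are hypothetical.

```python
# A minimal sketch of the two-dimensional example world. Each symmetry
# subgroup acts on its own component of the world state and leaves the
# other components untouched, i.e. the action is disentangled with respect
# to the decomposition G = Gx x Gy x Gc x Gs.
from dataclasses import dataclass, replace

@dataclass(frozen=True)
class WorldState:
    x: float      # horizontal position, acted on by Gx only
    y: float      # vertical position, acted on by Gy only
    color: int    # color index, acted on by Gc only
    scale: float  # object scale, acted on by Gs only

# Actions of the individual subgroups: each touches exactly one component.
def translate_x(v: WorldState, dx: float) -> WorldState:
    return replace(v, x=v.x + dx)

def translate_y(v: WorldState, dy: float) -> WorldState:
    return replace(v, y=v.y + dy)

def recolor(v: WorldState, c: int) -> WorldState:
    return replace(v, color=c)

def rescale(v: WorldState, s: float) -> WorldState:
    return replace(v, scale=v.scale * s)

# The action of the full group is the component-wise composition of the
# subgroup actions, as in (3.1).
def act(v: WorldState, dx: float, dy: float, c: int, s: float) -> WorldState:
    return rescale(recolor(translate_y(translate_x(v, dx), dy), c), s)

v = WorldState(x=0.0, y=0.0, color=0, scale=1.0)
print(act(v, dx=1.0, dy=0.0, c=2, s=1.5))  # only x, color, and scale change
```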

Continuing to the central topic of representations: we observe the world states through observations in the set $X$, and there is an action $\cdot\colon G \times V \to V$ on the world states. We then strive to find a corresponding action $\cdot'\colon G \times Z \to Z$ so that the symmetry structure of $V$ is reflected in $Z$; i.e., the action on the representation set $Z$ should correspond to the action on $V$. If we change the color of the object, acting on the world states, we want that action to be reflected in the representation. This will be the case if the mapping from world states to representations, $f\colon V \to Z$, is equivariant.

Definition 3.1.3 (Equivariant Map).

Given that $V$ and $Z$ are both group sets for the group $G$, i.e. $\cdot'\colon G \times Z \to Z$ and $\cdot\colon G \times V \to V$, the mapping $f\colon V \to Z$ is equivariant if

$g \cdot' f(v) = f(g \cdot v), \quad \forall\, g \in G,\ v \in V.$ (3.2)

With the above as a foundation, we can examine the formal definition of a disentangled representation by Higgins et al. and ask the question of when a representation of our two-dimensional example world can be defined as disentangled.

Definition 3.1.4 (Disentangled Representation).

The representation $Z$ is disentangled with respect to the decomposition $G = G_1 \times \ldots \times G_n$ of a symmetry group $G$ if the following are satisfied:

1. There is an action $\cdot\colon G \times Z \to Z$.

2. The map $f\colon V \to Z$ is equivariant between the actions on $V$ and $Z$.

3. There is a decomposition $Z = Z_1 \times \ldots \times Z_n$ or $Z = Z_1 \oplus \ldots \oplus Z_n$ such that each $Z_i$ is fixed by the actions of all $G_j$, $j \neq i$, and affected only by $G_i$.

In summary, we have established that our two-dimensional grid-world has a symmetry group $G$ that can be decomposed as $G = G_x \times G_y \times G_c \times G_s$, where $G_x$ is the set of all horizontal translation transformations, $G_y$ the vertical translation transformations, $G_c$ the color transformations, and $G_s$ the scaling transformations. Since we are able to scale the object without affecting position or color, as previously discussed, we have an action $\cdot\colon G \times V \to V$ that is disentangled with respect to this decomposition, i.e. a disentangled group action. What we then need to find is an equivariant mapping from world states to representations. Given that such a mapping is found, we can conclude that the representation $Z$ is disentangled with respect to the group symmetries $G$.
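As an illustration of what such an equivariant mapping has to satisfy, the following toy Python check verifies the equivariance condition (3.2) numerically for the translation subgroups, assuming a hand-crafted representation map f that simply stacks the four factors into a vector; all names and values are illustrative only and are not part of the thesis.

```python
# Toy numerical check of the equivariance condition (3.2) for the example
# world: g ·' f(v) should equal f(g · v) for the translation subgroups.
import numpy as np

# world state v = (x, y, color, scale); a group element from the translation
# subgroups Gx x Gy is parametrized by (dx, dy)
def act_on_v(v, dx, dy):
    x, y, color, scale = v
    return (x + dx, y + dy, color, scale)

def f(v):
    # hypothetical representation map: stack the four factors into a vector
    return np.array(v, dtype=float)

def act_on_z(z, dx, dy):
    # corresponding action on Z: only the Gx and Gy subspaces are affected
    return z + np.array([dx, dy, 0.0, 0.0])

v = (0.0, 0.0, 1.0, 1.0)        # x, y, color, scale
dx, dy = 2.0, -1.0
lhs = act_on_z(f(v), dx, dy)    # g ·' f(v)
rhs = f(act_on_v(v, dx, dy))    # f(g · v)
assert np.allclose(lhs, rhs)    # the equivariance condition (3.2) holds
```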

3.1.1 A Probabilistic Interpretation

Having assumed the world view of disentangled representations, where we establish that a set of independent, underlying factors of variation constitutes a set of world states, we can now connect the formal definition (Definition 3.1.4) with the probabilistic framework of [85]. This way we can lay the foundation for some methods that learn the required equivariant map from world states to representations.

Let our observations of the world state be a random variable $X \in \mathcal{X}$ with a density $p(X)$. Realizations of $X$ are created by a generative model that involves the unobserved random variables $V$ and $W$. The density function of the independent factors, or world states, $V \in \mathcal{V} \subseteq \mathbb{R}^K$, is assumed to be factorial, $p(V) = \prod_{k=1}^{K} p(V_k)$. Some factors that have an effect on $X$ will not be conditionally independent, and we therefore introduce the random variable $W$ for these conditionally dependent factors. The generative model is then defined by the density function

$p(X, V, W) = p(X \mid V, W)\, p(W \mid V)\, p(V).$ (3.3)
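For intuition, a minimal numpy sketch of ancestral sampling from a generative model of the form (3.3) is given below; the concrete distributions (linear-Gaussian) and dimensionalities are arbitrary illustrative assumptions, not the ones used later in the thesis.

```python
# Ancestral sampling from a generative model of the form (3.3):
# v ~ p(V) (factorial), w ~ p(W | V), x ~ p(X | V, W).
import numpy as np

rng = np.random.default_rng(0)
K, M, D_x = 4, 2, 8                       # independent factors, dependent factors, data dims
A = np.ones((M, K)) / K                   # fixed parameters of p(W | V)
B = rng.standard_normal((D_x, K + M))     # fixed parameters of p(X | V, W)

def sample_v():
    # factorial prior p(V): each factor drawn independently
    return rng.standard_normal(K)

def sample_w_given_v(v):
    # conditionally dependent factors, here a noisy linear function of v
    return A @ v + 0.1 * rng.standard_normal(M)

def sample_x_given_vw(v, w):
    # observation model p(X | V, W): noisy linear readout of all factors
    return B @ np.concatenate([v, w]) + 0.05 * rng.standard_normal(D_x)

v = sample_v()
w = sample_w_given_v(v)
x = sample_x_given_vw(v, w)
print(x.shape)                            # (8,)
```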

The latent representation $Z$ is a random variable such that $Z \in \mathcal{Z} \subseteq \mathbb{R}^D$, where $D \geq K$ due to the constraint of modularity. Recall the composite mapping $f\colon \mathcal{V} \to \mathcal{Z}$, $f = h \circ b$, which we desire to be equivariant. For the task of unsupervised learning, the mapping from world states to observations, $b\colon \mathcal{V} \to \mathcal{X}$, is fixed and considered as the ground-truth generative process. Hence, to learn an equivariant composite mapping, we develop a generative model which estimates the joint density of $X$ and $Z$,

$p(X, Z) = p(X \mid Z)\, p(Z) = p(Z \mid X)\, p(X).$ (3.4)

Within the probabilistic interpretation of the framework established in Section 3.1, we can define the mappings that constitute the composite mapping $f\colon \mathcal{V} \to \mathcal{Z}$ as

$b_w = x \sim p(X \mid V, W = w)\colon \mathcal{V} \to \mathcal{X},$ (3.5)

$h = z \sim p(Z \mid X)\colon \mathcal{X} \to \mathcal{Z}.$ (3.6)
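Continuing the toy sampler above, a hypothetical encoder h, here simply a least-squares pseudo-inverse of the readout matrix B, illustrates how (3.5) and (3.6) compose into the mapping f = h ∘ b_w; in practice h would be an amortized, learned posterior rather than this hand-crafted inverse.

```python
# Sketch of the composite mapping f = h ∘ b_w from (3.5)-(3.6), reusing the
# toy generative model sketched above (A, B, sample_* and numpy as np).
def b_w(v, w):
    # mapping from world states to observations, fixed ground-truth process
    return sample_x_given_vw(v, w)

def h(x):
    # hypothetical encoder: map an observation back to a latent code
    return np.linalg.pinv(B) @ x

v = sample_v()
w = sample_w_given_v(v)
z = h(b_w(v, w))                                 # composite mapping f(v)
print(np.round(z - np.concatenate([v, w]), 2))   # small residual (noise only)
```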

In Section 3.2 we will investigate models that aim to estimate the joint density $p(X, Z)$ to infer disentangled latent representations. Note that hereafter we denote $p(X = x)$ as $p(x)$ for brevity.

3.2 Learning Disentangled Representations

State-of-the-art approaches for learning disentangled representations are predominantly based on the VAE [86], although an adversarial approach, the InfoGAN [87], has also been shown to learn representations that are disentangled and interpretable.

Mechanisms for enforcing the meta-priors of [19] in autoencoders are identified in [21] as regularization of the encoding distribution, choice of the encoding distribution, and choice of a flexible prior distribution of the representation. We will be considering completely unsupervised methods that regularize the encoding distribution of a VAE.

The models under consideration are the β-VAE [85], FactorVAE [88], and β-TCVAE [89]. Briefly summarizing the results of these methods, we can establish the β-VAE as the baseline. The FactorVAE achieves better disentanglement at the cost of an increased reconstruction error. Combating this, the β-TCVAE modifies the objective of the FactorVAE to keep enforcing disentanglement while not penalizing terms that benefit reconstruction.

A detailed review and experimental comparison of several state-of-the-art algorithms (β-VAE [85], AnnealedVAE [23], FactorVAE [88], β-TCVAE [89], DIP-VAE-I and DIP-VAE-II [90]) is presented in [33].

3.2.1 Variational Autoencoder (VAE)

Following the approach of Kingma and Welling [86], let the dataset $\mathcal{D} = \{x^{(i)}\}_{i=1}^{N}$ consist of $N$ i.i.d. samples of a random variable $X$. We assume that $x$ is generated by a random process that involves an unobserved random variable $Z$. Samples of $X$ are generated by first generating a value $z^{(i)}$ from a prior distribution $p_\theta(Z)$. Then $x^{(i)}$ is created by the generative model $p_\theta(X \mid Z = z^{(i)})$. Note that this framework is compatible with that of Section 3.1.1.

As an approximation to the intractable true posterior $p_\theta(Z \mid X)$, a probabilistic encoder $q_\phi(Z \mid X)$ is introduced. Similarly, we will refer to $p_\theta(X \mid Z)$ as a probabilistic decoder while considering $Z$ as a code. The encoder and decoder are considered probabilistic since they output distributions, e.g. Gaussian, over possible values of their outputs.

Consider the marginal likelihood, which is a sum over the marginal likelihoods of the individual datapoints, $\log p_\theta(x^{(i)})$. A variational lower bound on the marginal likelihood can be formulated as $\log p_\theta(x) \geq \mathcal{L}_{\text{VAE}}(\theta, \phi; x)$, where

$\mathcal{L}_{\text{VAE}}(\theta, \phi; x) = \mathbb{E}_{q_\phi(Z \mid X = x)}\left[\log p_\theta(X = x \mid Z = z)\right] - \mathrm{KL}\left(q_\phi(Z \mid X = x) \,\|\, p_\theta(Z)\right),$ (3.7)

in which $z$ is sampled from $q_\phi(Z \mid X = x)$. We will be referring to (3.7) as the Evidence Lower Bound Objective (ELBO), which we want to maximize jointly with respect to the variational parameters $\phi$ and the generative parameters $\theta$. By introducing an auxiliary variable $\epsilon$ with marginal $p(\epsilon)$, it is often possible to express $z$ as a deterministic variable $z = g_\phi(\epsilon, x)$, where $g_\phi(\cdot)$ is some function parametrized by $\phi$. Thus, the reparametrization trick [86] allows for the construction of an unbiased Monte Carlo estimator of the lower bound that is differentiable with respect to $\phi$.
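A minimal PyTorch illustration of the reparametrization trick, with arbitrary toy values, is shown below: z is written as a deterministic, differentiable function of the variational parameters and an auxiliary noise sample, so that gradients can flow through the sampling step.

```python
# Reparametrization trick: z = g_phi(eps, x) with eps ~ N(0, I), so gradients
# of a downstream loss reach the variational parameters (toy values only).
import torch

mu = torch.zeros(10, requires_grad=True)
log_sigma = torch.zeros(10, requires_grad=True)
eps = torch.randn(10)                      # eps ~ p(eps) = N(0, I)
z = mu + eps * log_sigma.exp()             # deterministic function of (mu, sigma, eps)
loss = (z ** 2).sum()                      # any differentiable function of z
loss.backward()                            # gradients reach mu and log_sigma
print(mu.grad.shape, log_sigma.grad.shape)
```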

Described above is a general approach which is exemplified through the VAE [86]. We can relate the ELBO (3.7) to the objective of a regularized autoencoder. The first term of the ELBO is the log-likelihood of the input $x^{(i)}$ under the generative model $p_\theta(X \mid Z = z^{(i)})$, which can be considered as a negative reconstruction error. The KL divergence of the approximate posterior from the prior, the second term of (3.7), is a regularization term.

For the VAE formulation, the prior over the latent variables is assumed to be a centered isotropic multivariate Gaussian, $p_\theta(Z) = \mathcal{N}(Z; 0, I)$. The generative model $p_\theta(x \mid z)$ is assumed to be multivariate Bernoulli in the case of binary data and Gaussian for real-valued data. Let the variational approximate posterior be a multivariate Gaussian with a diagonal covariance, $q_\phi(Z \mid X) = \mathcal{N}(Z; \mu^{(i)}, \sigma^{2(i)} I)$. The distribution parameters for both the generative model and the variational posterior are outputs of neural networks. The samples for the estimation of the ELBO can be drawn from the posterior $q_\phi(Z \mid X = x^{(i)})$ as $z^{(i)} = g_\phi(x^{(i)}, \epsilon) = \mu^{(i)} + \epsilon \sigma^{(i)}$, where $\epsilon \sim \mathcal{N}(0, I)$. When inferring the representation, $z$ is taken to be the mean of the posterior $q_\phi(Z \mid X)$.
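The following is a minimal PyTorch sketch of the VAE just described, assuming fully connected layers and a Bernoulli decoder for binary data; the layer sizes and architecture are illustrative assumptions rather than the configuration used in the thesis experiments.

```python
# Minimal VAE sketch: Gaussian prior and posterior, Bernoulli decoder for
# binary data, and the reparametrization trick z = mu + eps * sigma.
import torch
import torch.nn as nn
import torch.nn.functional as F

class VAE(nn.Module):
    def __init__(self, x_dim=784, z_dim=10, h_dim=400):
        super().__init__()
        self.enc = nn.Sequential(nn.Linear(x_dim, h_dim), nn.ReLU())
        self.enc_mu = nn.Linear(h_dim, z_dim)
        self.enc_logvar = nn.Linear(h_dim, z_dim)
        self.dec = nn.Sequential(nn.Linear(z_dim, h_dim), nn.ReLU(),
                                 nn.Linear(h_dim, x_dim))

    def forward(self, x):
        h = self.enc(x)
        mu, logvar = self.enc_mu(h), self.enc_logvar(h)
        eps = torch.randn_like(mu)
        z = mu + eps * torch.exp(0.5 * logvar)   # reparametrization trick
        x_logits = self.dec(z)
        return x_logits, mu, logvar

def negative_elbo(x, x_logits, mu, logvar):
    # negative reconstruction log-likelihood (Bernoulli decoder)
    rec = F.binary_cross_entropy_with_logits(x_logits, x, reduction="sum")
    # KL( N(mu, sigma^2 I) || N(0, I) ), available in closed form
    kl = -0.5 * torch.sum(1 + logvar - mu.pow(2) - logvar.exp())
    return rec + kl          # minimizing this maximizes the ELBO (3.7)

model = VAE()
x = torch.rand(32, 784).bernoulli()    # dummy binary data
loss = negative_elbo(x, *model(x))
loss.backward()                        # gradients w.r.t. both theta and phi
```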

3.2.2 β-VAE

The β-VAE [85] modifies the VAE by reweighting the ELBO, pressuring the variational posterior to be closer to the latent prior.

$\mathcal{L}_{\beta\text{-VAE}}(\phi, \theta; x, \beta) = \mathbb{E}_{q_\phi(Z \mid X = x)}\left[\log p_\theta(X \mid Z = z)\right] - \beta\, \mathrm{KL}\left(q_\phi(Z \mid X = x) \,\|\, p(Z)\right)$ (3.8)

Higgins et al. [85] argue that using a β > 1 is important to enforce a stronger constraint on the latent bottleneck than in the original VAE formulation. By enforcing this constraint the capacity of $Z$ is restricted, which, in combination with maximising the log-likelihood of the data $X$ given the parameters $\phi, \theta$, forces the model to learn the most efficient representation of the data. It is shown in [85] that the β-VAE outperforms, quantitatively and qualitatively, the original VAE formulation of [86] as well as InfoGAN [87] and DC-IGN [91]. It is hypothesized in [85] that higher values of β encourage the learning of disentangled representations: because the data is generated by some conditionally independent ground-truth factors, and the KL-divergence term of the β-VAE objective encourages conditional independence in the encoding distribution $q_\phi(Z \mid X)$, the added pressure from a higher β will enforce learning of disentangled representations.

However, there is a trade-off between information preservation and restriction of the latent channel capacity. When β > 1, restricting the latent channel can lead to poorer reconstructions due to a loss of high-frequency detail.
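Assuming the VAE sketch from the previous subsection (including its imports and model), the β-VAE objective (3.8) only changes the weighting of the KL term, as the following sketch shows; the value β = 4 is an arbitrary illustrative choice.

```python
# Sketch of the beta-VAE objective (3.8): identical to the VAE loss above
# except that the KL term is weighted by beta > 1.
def negative_beta_elbo(x, x_logits, mu, logvar, beta=4.0):
    rec = F.binary_cross_entropy_with_logits(x_logits, x, reduction="sum")
    kl = -0.5 * torch.sum(1 + logvar - mu.pow(2) - logvar.exp())
    return rec + beta * kl   # beta > 1 strengthens the bottleneck constraint

# usage, reusing the hypothetical model from the VAE sketch:
# loss = negative_beta_elbo(x, *model(x), beta=4.0)
```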

3.2.3 FactorVAE

It is shown in [88] that the expectation of the Kullback-Leibler divergence term in (3.8) over the underlying data distribution $p(X)$ decomposes as

$\mathbb{E}_{p(X)}\left[\mathrm{KL}\left(q_\phi(Z \mid X) \,\|\, p(Z)\right)\right] = I(X; Z) + \mathrm{KL}\left(q_\phi(Z) \,\|\, p(Z)\right).$ (3.9)

In the β-VAE, a higher value of β will, through the $\mathrm{KL}\left(q_\phi(Z) \,\|\, p(Z)\right)$ term, push the marginal encoding distribution towards the factorial prior and encourage independence between the latent variables. It will, however, also penalize the mutual information term $I(X; Z)$, reducing the amount of information that $Z$ carries about $X$ and thereby degrading reconstruction quality.
