Improving Zero-Shot Learning via Distribution Embeddings

VIVEK CHALUMURI

KTH ROYAL INSTITUTE OF TECHNOLOGY
SCHOOL OF ELECTRICAL ENGINEERING AND COMPUTER SCIENCE
STOCKHOLM, SWEDEN 2020

Examiner
Hedvig Kjellström, KTH Royal Institute of Technology

Supervisors
Bac Nguyen Cong, Sony, Stuttgart
Josephine Sullivan, KTH Royal Institute of Technology

Abstract

Zero-Shot Learning (ZSL) for image classification aims to recognize images from novel classes for which we have no training examples. A common approach to tackling such a problem is to transfer knowledge from seen to unseen classes using some auxiliary semantic information about the class labels in the form of class embeddings. Most of the existing methods represent image features and class embeddings as point vectors, and such vector representation limits the expressivity in terms of modeling the intra-class variability of the image classes. In this thesis, we propose three novel ZSL methods that represent image features and class labels as distributions and learn their corresponding parameters as distribution embeddings, so that the intra-class variability of the image classes is better modeled. The first model is a Triplet model, where image features and class embeddings are projected as Gaussian distributions in a common space and their associations are learned by metric learning. Next, we have a Triplet-VAE model, where two VAEs are trained with triplet-based distributional alignment for ZSL. The third model is a simple Probabilistic Classifier for ZSL, which is inspired by energy-based models. When evaluated on the common benchmark ZSL datasets, the proposed methods improve over the existing state-of-the-art methods in both the traditional ZSL and the more challenging Generalized ZSL (GZSL) settings.

Keywords: Zero-Shot Learning (ZSL), Generalized Zero-Shot Learning (GZSL), Image Classification, Metric Learning, Distribution Embeddings, Triplet Loss


Acknowledgements

First, I would like to thank my supervisor Bac Nguyen Cong for guiding me through this project. I would also like to thank my KTH supervisor Josephine Sullivan and examiner Hedvig Kjellström for overseeing my thesis. Finally, I would like to thank my family for their support and encouragement throughout this thesis period.

Abbreviations

NLP: Natural Language Processing
DNN: Deep Neural Network
CNN: Convolutional Neural Network
RNN: Recurrent Neural Network
VAE: Variational Auto-Encoder
SOTA: State of the Art

Contents

1 Introduction
1.1 Motivation
1.2 Objective
1.3 Ethics and Sustainability
1.4 Thesis Outline

2 Background
2.1 Deep Learning
2.1.1 Multi-Layer Feed-Forward network
2.1.2 Convolutional Neural Networks (CNNs)
2.1.3 Feature Extraction
2.1.4 Variational Autoencoders
2.2 Metric Learning
2.2.1 Distribution Metrics
2.2.2 Loss Functions
2.3 Zero-Shot Learning
2.3.1 Semantic Space
2.3.2 Learning Settings
2.3.3 Evaluation Metrics
2.4 Related Work
2.4.1 Zero-Shot Learning
2.4.2 Distribution Representation
2.4.3 Sample Synthesis
2.4.4 Dataset Splits
2.4.5 Summary

3 Methods
3.1 Problem Definition
3.2 Datasets
3.3 Models
3.3.1 Generative Model
3.3.2 Triplet Model
3.3.3 Triplet-VAE Model
3.3.4 Classifier Model
3.3.5 GZSL Sample Synthesis

4 Experiments and Evaluation
4.1 Baseline Generative Model
4.2 Best Distance Metric
4.3 Model Evaluation
4.4 Sample Synthesis

5 Conclusion
5.1 Discussion
5.2 Future Work

Introduction

An abundance of labeled data has led to significant advances in Computer Vision in the past decade. In a conventional image classification task, a model learns from numerous representative images of each visual category it encounters during training. However, with new visual classes emerging every day, collecting sufficient labeled data and training a model on it is challenging. Due to this lack of data, classifying examples from classes that were unavailable during training becomes an important task. Humans, on the other hand, are exceptional at recognizing novel visual categories: we can leverage our background knowledge of these classes to recognize them. For instance, given the information that zebras look similar to horses but with black and white stripes, someone who has seen a horse might identify a zebra without ever having seen one before. Zero-Shot Learning (ZSL) for image classification aims to model this intuition and recognize images from visual categories not encountered during the training phase. For this, ZSL assumes that even in the absence of labeled images for training, we have auxiliary semantic information that relates unseen classes to seen classes. This extra information could be in the form of human-annotated class attributes [1], word vectors [2], or natural language descriptions [3] of the class labels.

The common purpose of any such auxiliary semantic information is to encode the distinctive features of each class.

ZSL can thus be framed as a transfer learning problem. Most previous methods try to achieve this knowledge transfer either in the visual space, in the auxiliary information space, or in some common space. The common trait among most of these methods is that they tackle ZSL by representing class labels and image features as fixed embeddings in some vector space where their similarity is preserved, i.e., each example is surrounded by its correct label. By doing so, classifying an unseen example corresponds to a nearest-neighbor search problem. As these existing methods often represent a class label as a point vector, they limit the expressivity in terms of modeling the intra-class variability among different image classes. A possible solution to overcome this limitation is to represent class labels by distributions (e.g., Gaussians) and learn their parameters (e.g., mean and variance) as distribution embeddings instead of the usual point vector embeddings.

These distribution embeddings of each class can be defined as functions of the auxiliary semantic information of the respective classes. Similar to class labels, image features can also be represented as distribution embeddings to account for their intra-class variability. Thus, learning distributions for features from both the image space and the label space in an efficient way could lead to improved ZSL.

Recent research has also focused on a more challenging and practical setting of ZSL known as Generalized Zero-Shot Learning (GZSL). Unlike ZSL, in this setting we have both seen and unseen categories together at test time. Training only on the seen classes creates a strong bias towards them, which makes it challenging to predict among both seen and unseen classes together; due to this, the most common ZSL methods fail in this setting. Methods that treat ZSL as a missing data problem and synthesize samples of unseen classes have shown great promise in GZSL, because once we have synthesized samples, we can train a supervised classifier on all the image samples, both seen and unseen together. However, these methods for the GZSL setting don't necessarily guarantee the best performance for traditional ZSL. In this project, we build upon the previous research to develop simple yet high-performance algorithms that represent classes as distributions and account for both ZSL and GZSL.


1.2 Objective

In ZSL, representing features as vector embeddings limits the expressivity in terms of modeling the intra-class variability, which leads to performance degradation. The objective of this thesis is to explore ZSL methods that represent features as distributions, thus accounting for the intra-class variability among the image classes. As an end goal, we aim to develop high-performance novel ZSL methods that learn concept distributions in both the input/image space and the label/semantic space in an efficient way.

To summarize, the following research questions will be explored in this thesis:

• How to design a ZSL algorithm that takes into account the intra-class variability?

• Which distance metric is best suited to measure the similarity of two class distributions in ZSL?

Finally, this thesis will present three novel ZSL algorithms, together with the results and details of the relevant experiments. The proposed algorithms will be evaluated in the ZSL and GZSL settings, and the results will be compared to previous State-of-the-Art (SOTA) methods.

1.3 Ethics and Sustainability

The rate at which novel visual classes are created or added every day makes it impractical for traditional image classification methods to accommodate them without complete re-training. The aim of this thesis, and of ZSL research in general, is to develop approaches to recognize such novel visual classes on the fly. In such a scenario, going into the future, ZSL might become a necessary feature of all image classification methods in order to be more sustainable, saving considerable model-training resources and manual image-labeling time.

As is the case with every new advancement in the field of AI, there might be some ethical concerns with potential applications in the future. For instance, applications like reading body language and human activity recognition that can greatly benefit from ZSL might also come with some privacy concerns. Moreover, there is already existing research trying to apply ZSL for mind-reading or neural activity recognition from MRI scans [6]. Currently, ZSL research is at a very nascent stage, and ethical concerns over its potential applications are no different from those in any other field of AI research.

1.4 Thesis Outline

The rest of this thesis is organized as follows:

• Chapter 2: Background, contains the necessary background theory of machine learning concepts and relevant ZSL related works.

• Chapter 3: Methods, introduces the technical definition of the objective and presents the novel methods and models proposed in this project.

• Chapter 4: Experiments and Evaluation, contains the details of all the experiments conducted and their respective results.

• Chapter 5: Conclusion, as the name suggests, concludes this thesis with some discussion on current and future work.


Background

This chapter describes the theory and concepts required to comprehend the ideas and experiments presented in this thesis. We start with a brief introduction to Deep Learning, followed by topics like Metric Learning and Zero-Shot Learning, which are more relevant to the methods presented in this thesis. The previous research related to Zero-Shot Learning is discussed at the end of the chapter.

2.1 Deep Learning

A Deep Learning (DL) model aims to learn complicated functions that represent high-level abstractions of the data using multiple levels of non-linear operations [7]. Each level tries to learn from the previous one to extract higher-level features. A typical example would be the task of recognizing a cat in an image: first, the model learns simple edge patterns, then curves, and as we go to higher levels, it learns shapes and eventually learns to recognize a cat.

The fundamental component of any DL model is a Deep Neural Network (DNN) or a Multi-Layer Feed-Forward network. A DNN usually consists of multiple layers of linear and non-linear transformations that enable learning. There are many variations of such deep neural networks based on the application and the type of data at hand. For instance, Convolutional Neural Networks (CNNs) for images [8] and Recurrent Neural Networks (RNNs) [9] for sequential data or text are some popular choices. Irrespective of the underlying architecture of these networks, the basic principles remain the same.

Understanding a Multi-Layer Feed-Forward network, which is the most basic of all deep learning architectures, gives a precise understanding of the topic.

Figure 2.1.1: A Multi-Layer Feed-Forward neural network with one hidden layer. [10]

2.1.1 Multi-Layer Feed-Forward network

A Multi-Layer Feed-Forward neural network must have at least three sequential layers: an input layer, a hidden layer, and an output layer. Adding more hidden layers increases the depth of the model, thus contributing the term 'deep' to deep learning. Such a network can act as a universal function approximator [11], capable of learning any measurable function.

Each layer of the network is a set of neurons, where a neuron can be described as a mapping. Each neuron in a layer is connected to all the neurons in the subsequent layer. Figure 2.1.1 shows a three-layered neural network that has $D$ neurons in the input layer, $M$ neurons in the hidden layer, and $K$ neurons in the output layer.

Mapping from one layer to another happens through the weight matrices $W^{(1)}$ and $W^{(2)}$. Learning in this scenario corresponds to learning the values of these weight matrices. All layers except the input layer use a non-linear activation. The Rectified Linear Unit (ReLU) is one of the most common activation functions and is defined as $h(x) = \max\{0, x\}$. If $h$ and $\sigma$ are the activation functions for the hidden and output layer respectively, then in the above network, for a given input $x \in \mathbb{R}^D$,

$$z_m = h\left(\sum_{i=1}^{D} W_{mi}^{(1)} x_i + W_{m0}^{(1)}\right) \qquad (2.1)$$

$$y_k = \sigma\left(\sum_{m=1}^{M} W_{km}^{(2)} z_m + W_{k0}^{(2)}\right) \qquad (2.2)$$

$W_{m0}^{(1)}$ and $W_{k0}^{(2)}$ in the above equations correspond to bias parameters. They are linked to extra nodes $x_0$ and $z_0$ that take a fixed value of 1.

$$E(W) = \mathrm{compare}(y, y_t) \qquad (2.3)$$

$$W^{\tau+1} = W^{\tau} - \eta \nabla E(W) \qquad (2.4)$$

Neural network training happens in a supervised fashion using stochastic gradient descent via backpropagation [12]: an error function $E(W)$ compares the output $y$ with the real targets $y_t$ to compute an error value, and the gradient $\nabla E(W)$ is calculated for each network parameter, i.e., the weights $W^{(1)}$ and $W^{(2)}$, with respect to the error function and is then used to update the values of the network parameters. For this reason, activation functions used to induce non-linearity in the network must have gradients defined (almost) everywhere. In Equation 2.4, $\eta$ is called the learning rate; this parameter controls the size of the update in each iteration. This optimization process is repeated until a local minimum of the error function is reached. To conclude, a Multi-Layer Feed-Forward neural network is simply a non-linear function that maps inputs $\{x_i\}$ to outputs $\{y_k\}$, controlled by adjustable parameters $\{W^{(n)}\}$.
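To make the forward pass and the update of Equation 2.4 concrete, the following is a minimal PyTorch sketch of such a two-layer feed-forward network trained with SGD. The layer sizes, learning rate, and dummy data are illustrative placeholders, not values used in the thesis.

```python
import torch
import torch.nn as nn

# A two-layer feed-forward network: input (D) -> hidden (M, ReLU) -> output (K).
# D, M, K are illustrative placeholders, not values used in the thesis.
D, M, K = 8, 16, 3
model = nn.Sequential(
    nn.Linear(D, M),   # W^(1) plus the bias W^(1)_{m0}
    nn.ReLU(),         # h(x) = max{0, x}
    nn.Linear(M, K),   # W^(2) plus the bias W^(2)_{k0}
)

criterion = nn.CrossEntropyLoss()                         # plays the role of E(W) = compare(y, y_t)
optimizer = torch.optim.SGD(model.parameters(), lr=0.1)   # eta = 0.1

x = torch.randn(32, D)             # a batch of dummy inputs
y_t = torch.randint(0, K, (32,))   # dummy target labels

y = model(x)                       # forward pass, as in Equations 2.1 and 2.2
loss = criterion(y, y_t)           # error value E(W)
optimizer.zero_grad()
loss.backward()                    # gradients via backpropagation
optimizer.step()                   # W <- W - eta * grad E(W), Equation 2.4
```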

2.1.2 Convolutional Neural Networks (CNNs)

Convolutional Neural Networks, or CNNs, are specialized neural networks capable of accounting for the spatial structure of the data [13]. CNNs can be used with any data with a grid-like topology [14], for instance natural images, where each pixel is often spatially correlated with the surrounding pixels, or speech and natural language text, where words in the local context are related to each other. At present, they are widely used for applications in computer vision and image analysis.

Figure 2.1.2: A simple CNN architecture with two convolutional layers.

The presence of one or more convolutional layers in the network differentiates CNNs from regular Feed-Forward networks. In a convolutional layer, we define a kernel of fixed size (say 5 × 5 pixels), which slides across the image, resulting in the convolution of the input with the kernel. The advantage of such an approach is that we now have only 5 × 5 = 25 weight parameters to learn, and they will detect a particular type of feature present anywhere in the image. Such weight sharing [13] reduces the total number of learnable parameters compared to Fully-Connected Feed-Forward networks, where every unit of one layer is connected to every unit in the next layer. The usual practice is to define multiple kernels per convolutional layer, each kernel acting as a filter and detecting one type of feature from the input. The outputs of convolutional layers are called feature maps. As we add more convolutional layers, we attempt to learn higher-level features from the current feature maps.

Figure 2.1.2 shows the architecture of a simple CNN with two convolutional layers.

Convolutional layers are usually followed by a pooling layer, which has no learnable parameters. The purpose of the pooling layer is to reduce the spatial size of the feature maps. Average pooling and max pooling are two of the common techniques, where we take the average or the maximum value of a portion of the feature map. Furthermore, since pooling summarizes a local portion of the image, it helps in extracting dominant features that are invariant to transformations [14]. For brevity, Figure 2.1.2 excludes activations, but each trainable layer includes a non-linear activation like ReLU.
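As a quick, hedged illustration of the weight-sharing argument above (the sizes are arbitrary and not the thesis's architecture), the following sketch builds one convolutional layer with a single 5 × 5 kernel followed by max pooling and verifies its parameter count.

```python
import torch
import torch.nn as nn

# One convolutional layer with a single 5x5 kernel: 25 weights + 1 bias,
# shared across all spatial positions of the image.
conv = nn.Conv2d(in_channels=1, out_channels=1, kernel_size=5, padding=2)
pool = nn.MaxPool2d(kernel_size=2)   # halves the spatial size of the feature map

print(sum(p.numel() for p in conv.parameters()))  # 26 = 25 shared weights + 1 bias

x = torch.randn(1, 1, 28, 28)        # a dummy single-channel image
feature_map = pool(torch.relu(conv(x)))
print(feature_map.shape)             # torch.Size([1, 1, 14, 14])
```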

2.1.3 Feature Extraction

As we saw in the above section, convolutional layers enable us to extract hierarchical features from images efficiently. For image classification, we usually need to learn a complicated non-linear relationship between image features and image classes, and as mentioned in Section 2.1.1, a multi-layer feed-forward network can act as a universal function approximator. Thus it can be used after the convolutional layers to take the features learned from the images and classify them. Hence, in all CNNs, the set of convolutional and pooling layers is referred to as the feature extraction part, and their final output is condensed into a vector representation used for any other purpose, such as classification.

In the past decade, many deep CNN architectures have been proposed for image classification; VGGNet [15], GoogleNet [16], and ResNet [17] are a few of the popular ones. These networks are trained and evaluated on ImageNet [18], which is one of the largest hierarchical image datasets available. When trained on such huge datasets, these deep networks learn many features that are fundamental to most natural images. For this reason, deep CNNs pre-trained on a large dataset can be used to extract features from any new image. In this thesis, we utilize ResNet-101, pre-trained on ImageNet data, to extract features from the images of all our datasets.
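The feature-extraction step described here can be sketched as follows with torchvision. This is only an assumed minimal pipeline for using a pre-trained ResNet-101 as a fixed feature extractor; the image path and preprocessing constants are placeholders, not the exact setup of [45].

```python
import torch
from torchvision import models, transforms
from PIL import Image

# Load ResNet-101 pre-trained on ImageNet and drop its final classification layer,
# keeping the 2048-dimensional pooled features.
resnet = models.resnet101(pretrained=True)
resnet.fc = torch.nn.Identity()
resnet.eval()

preprocess = transforms.Compose([
    transforms.Resize(256),
    transforms.CenterCrop(224),
    transforms.ToTensor(),
    transforms.Normalize(mean=[0.485, 0.456, 0.406], std=[0.229, 0.224, 0.225]),
])

# "image.jpg" is a placeholder path.
img = preprocess(Image.open("image.jpg").convert("RGB")).unsqueeze(0)
with torch.no_grad():
    features = resnet(img)   # shape: (1, 2048), used as the image feature vector
```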

2.1.4 Variational Autoencoders

A Variational Autoencoder (VAE) [19] is a powerful generative model for learning latent representations of the data. A VAE architecture consists of two neural networks: an Encoder and a Decoder. In variational inference, for a data point $x \in \mathbb{R}^D$ and its latent representation $z \in \mathbb{R}^L$, the posterior distribution $p(z|x)$ is approximated by the variational distribution $q_\phi(z|x)$, where $\phi$ are the parameters of the Encoder network. The distribution $q_\phi(z|x)$ is modeled as a Gaussian; hence the encoder outputs the mean ($\mu_x \in \mathbb{R}^L$) and the diagonal covariance ($\Sigma_x \in \mathbb{R}^{L \times L}$). From this Gaussian $\mathcal{N}(\mu_x, \Sigma_x)$ a latent vector $z$ is sampled. As the Encoder maps data from the input space to the latent space, the Decoder maps data from the latent space back to the input space by learning the distribution $p_\theta(x|z)$, where $\theta$ are the parameters of the Decoder network. Learning happens by maximizing the variational lower bound given by:

$$\mathcal{L}_{VAE} = \mathbb{E}_{q_\phi(z|x)}\!\left[\log p_\theta(x|z)\right] - D_{KL}\!\left(q_\phi(z|x)\,\|\,p_\theta(z)\right) \qquad (2.5)$$

The first term in the above loss can be seen as the reconstruction loss, which is usually computed as the L1 or L2 distance between the reconstructed output and the input. The second term is the KL-Divergence between the approximated posterior distribution $q_\phi(z|x)$ and the prior $p_\theta(z)$. Figure 2.1.3 further illustrates the VAE architecture.

Figure 2.1.3: VAE architecture with Encoder and Decoder networks.
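A minimal sketch of the objective in Equation 2.5 for a Gaussian encoder with diagonal covariance is given below. The layer widths are placeholders, the prior is assumed to be a standard normal, and the KL term uses its standard closed form; this is an illustration, not the thesis's exact VAE.

```python
import torch
import torch.nn as nn

class VAE(nn.Module):
    def __init__(self, d_in=64, d_latent=8):
        super().__init__()
        self.enc = nn.Sequential(nn.Linear(d_in, 32), nn.ReLU())
        self.mu = nn.Linear(32, d_latent)        # mean of q_phi(z|x)
        self.logvar = nn.Linear(32, d_latent)    # log of the diagonal covariance
        self.dec = nn.Sequential(nn.Linear(d_latent, 32), nn.ReLU(), nn.Linear(32, d_in))

    def forward(self, x):
        h = self.enc(x)
        mu, logvar = self.mu(h), self.logvar(h)
        z = mu + torch.exp(0.5 * logvar) * torch.randn_like(mu)  # reparameterized sample
        return self.dec(z), mu, logvar

def vae_loss(x, x_rec, mu, logvar):
    rec = ((x - x_rec) ** 2).sum(dim=1)                            # L2 reconstruction term
    kl = 0.5 * (mu ** 2 + logvar.exp() - 1.0 - logvar).sum(dim=1)  # D_KL(q_phi(z|x) || N(0, I))
    return (rec + kl).mean()                                       # negative of Equation 2.5

x = torch.randn(16, 64)
model = VAE()
loss = vae_loss(x, *model(x))
```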

2.2 Metric Learning

Metric learning aims to learn the relative similarity between the input samples. The two main components of metric learning are: (1) A distance metric or similarity measure to compare two samples. (2) A loss function that enables model learning by pulling similar samples closer together and pushing dissimilar samples further apart.

As seen in the previous section, CNNs can be used to represent images as compact feature vectors. We can then employ metric learning on these vectors to learn a space where similar images are close to each other. For vector representations, the squared Euclidean distance is the most common distance metric. If $A$ and $B$ are two images and $f(\cdot)$ is a feature extraction network, then the distance between them can be defined as:

$$D(f(A), f(B)) = \|f(A) - f(B)\|_2^2 \qquad (2.6)$$

If images are represented as distributions like Gaussians instead of vectors, then we need to establish distance metrics for distributions. The next section summarizes a few such metrics.

2.2.1 Distribution Metrics

In this section, we will define the most common distance metrics for distributions.

All the distance metrics in this section are defined for the following two $k$-dimensional multivariate Gaussian distributions: $d_1 \sim \mathcal{N}(\mu_1, \Sigma_1)$ and $d_2 \sim \mathcal{N}(\mu_2, \Sigma_2)$.

• Wasserstein Distance
The Wasserstein distance, also known as the earth mover's distance, dates back to the eighteenth century. It was first formulated as an optimal transport problem to measure the effort required to move one pile of dirt to another pile of a different shape [20]. The 2-Wasserstein distance for Gaussians has a closed-form solution and is given by:

$$D_{W_2}(d_1, d_2) = \left[\,\|\mu_1 - \mu_2\|_2^2 + \mathrm{Tr}(\Sigma_1) + \mathrm{Tr}(\Sigma_2) - 2\,\mathrm{Tr}\!\left(\left(\Sigma_2^{1/2}\,\Sigma_1\,\Sigma_2^{1/2}\right)^{1/2}\right)\right]^{1/2} \qquad (2.7)$$

This can be further simplified in the case of a diagonal covariance:

$$D_{W_2}(d_1, d_2) = \left[\,\|\mu_1 - \mu_2\|_2^2 + \|\Sigma_1^{1/2} - \Sigma_2^{1/2}\|_F^2\,\right]^{1/2} \qquad (2.8)$$

• Bhattacharyya Distance
The Bhattacharyya distance [21] is another reliable measure of similarity between two probability distributions. For any two distributions $\mu$ and $\nu$, it can be defined as:

$$D_B(\mu, \nu) = -\ln \int_x \sqrt{\mu(x)\,\nu(x)}\,dx \qquad (2.9)$$

For our multivariate Gaussian distributions $d_1, d_2$ this can be simplified to:

$$D_B(d_1, d_2) = \frac{1}{8}\,(\mu_1 - \mu_2)^T\,\Sigma^{-1}\,(\mu_1 - \mu_2) + \frac{1}{2}\ln\frac{|\Sigma|}{\sqrt{|\Sigma_1|\,|\Sigma_2|}} \qquad (2.10)$$

where $\Sigma = \frac{\Sigma_1 + \Sigma_2}{2}$.

• KL Divergence
KL divergence, also known as relative entropy, measures the difference between two probability distributions. For two distributions $p$ and $q$, it can be defined as:

$$D_{KL}(p\,\|\,q) = \int_x p(x)\log\frac{p(x)}{q(x)}\,dx \qquad (2.11)$$

Even though it is popularly used as a similarity measure, unlike the above two distance measures KL divergence does not qualify as a statistical metric, since it is asymmetric: $D_{KL}(p\,\|\,q) \neq D_{KL}(q\,\|\,p)$. For our multivariate Gaussian distributions it can be defined as:

$$D_{KL}(d_1\,\|\,d_2) = \frac{1}{2}\left((\mu_2 - \mu_1)^T\,\Sigma_2^{-1}\,(\mu_2 - \mu_1) - k + \mathrm{Tr}\!\left(\Sigma_2^{-1}\Sigma_1\right) + \ln\frac{|\Sigma_2|}{|\Sigma_1|}\right) \qquad (2.12)$$
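For reference, the sketch below computes the three distances above for $k$-dimensional Gaussians with diagonal covariances, which is the case used later in the thesis. Inputs are mean and variance vectors; the functions follow Equations 2.8, 2.10, and 2.12, and the toy values at the end are arbitrary.

```python
import torch

# All inputs are mean vectors (mu) and diagonal covariance vectors (var) of shape (k,).

def wasserstein2(mu1, var1, mu2, var2):
    # Equation 2.8: diagonal-covariance form of the 2-Wasserstein distance.
    return torch.sqrt(((mu1 - mu2) ** 2).sum() + ((var1.sqrt() - var2.sqrt()) ** 2).sum())

def bhattacharyya(mu1, var1, mu2, var2):
    # Equation 2.10 with Sigma = (Sigma_1 + Sigma_2) / 2; log-determinants of
    # diagonal matrices reduce to sums of logs.
    var = 0.5 * (var1 + var2)
    term1 = 0.125 * (((mu1 - mu2) ** 2) / var).sum()
    term2 = 0.5 * (torch.log(var).sum() - 0.5 * (torch.log(var1).sum() + torch.log(var2).sum()))
    return term1 + term2

def kl_divergence(mu1, var1, mu2, var2):
    # Equation 2.12: KL(d1 || d2) for diagonal Gaussians.
    k = mu1.shape[0]
    return 0.5 * ((((mu2 - mu1) ** 2) / var2).sum() - k
                  + (var1 / var2).sum() + torch.log(var2 / var1).sum())

mu1, var1 = torch.zeros(4), torch.ones(4)
mu2, var2 = torch.ones(4), 2 * torch.ones(4)
print(wasserstein2(mu1, var1, mu2, var2),
      bhattacharyya(mu1, var1, mu2, var2),
      kl_divergence(mu1, var1, mu2, var2))
```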

2.2.2 Loss Functions

In metric learning, there are different loss functions, such as ranking loss, hinge loss, or triplet loss, that serve this purpose. Irrespective of their naming, most of these loss functions have a very similar formulation, and we will describe a few of them in this section. Since we defined distance metrics for distributions in the previous section, we will also define the loss functions for metric learning with distributions. For the rest of this section, we consider $f(\cdot)$ to be a network that, given feature vectors, outputs distribution parameters, and $D(\cdot,\cdot)$ to be the distance metric used to calculate the dissimilarity between two distributions.

• Pair-wise Ranking Loss
For a pair of data points $x_0$ and $x_1$, this loss can be defined as:

$$L_p(x_0, x_1, y) = y \cdot D\!\left(f(x_0), f(x_1)\right) + (1-y)\cdot\max\left\{0,\; m - D\!\left(f(x_0), f(x_1)\right)\right\} \qquad (2.13)$$

In the above equation, $y$ is a flag which is 1 when both data points belong to the same class and 0 otherwise, and $m$ is the margin parameter. So, for points from the same class, the loss to minimize is the distance between them; when they are from different classes and are closer than the fixed margin, we want to push them farther apart.

• Triplet Loss
Here, instead of pairs, we pick a triplet of samples $x, x^+, x^-$. The first sample is called the anchor, the second sample belongs to the same class as the anchor, and the third sample belongs to any class other than that of the anchor. Here too we define a margin parameter $m$, similar to the previous loss function.

$$L_{tri}(x, x^+, x^-) = \max\left\{0,\; m + D\!\left(f(x), f(x^+)\right) - D\!\left(f(x), f(x^-)\right)\right\} \qquad (2.14)$$

With this loss, for each triplet, two samples of the same class are pulled together and two samples of dissimilar classes are pushed apart in the same iteration.

Figure 2.2.1: Metric learning with triplet loss (left) and N-pair loss (right) [22]

• N-Pair Loss
N-pair loss [22] is a further extension of triplet loss. Here, we pick $N$ pairs of samples from $N$ different classes, and in one iteration we simultaneously push each sample farther away from all $N-1$ dissimilar samples. Figure 2.2.1 further illustrates the difference between triplet loss and N-pair loss. The loss is defined as:

$$L_{N\text{-}pair}\!\left(\{x_i, x_i^+\}_{i=1}^{N}\right) = \frac{1}{N}\sum_{i=1}^{N}\log\left(1 + \sum_{j\neq i}\exp\left(D\!\left(f(x_i), f(x_i^+)\right) - D\!\left(f(x_i), f(x_j^+)\right)\right)\right) \qquad (2.15)$$
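A minimal sketch of these losses with a pluggable distance is shown below. It assumes the anchor-positive and anchor-negative distances have already been computed with one of the metrics from Section 2.2.1; the N-pair variant follows the distance-based form of Equation 2.15, and the margin and toy values are placeholders.

```python
import torch

def triplet_loss(d_pos, d_neg, margin=1.0):
    # Equation 2.14: max(0, m + D(f(x), f(x+)) - D(f(x), f(x-))),
    # where d_pos and d_neg are precomputed anchor-positive and anchor-negative distances.
    return torch.clamp(margin + d_pos - d_neg, min=0.0)

def n_pair_loss(d_pos, d_neg):
    # Distance-based form of Equation 2.15: d_pos has shape (N,) with the anchor-positive
    # distances, d_neg has shape (N, N-1) with the distances to the other classes' positives.
    return torch.log1p(torch.exp(d_pos.unsqueeze(1) - d_neg).sum(dim=1)).mean()

# Toy usage with arbitrary distances.
print(triplet_loss(torch.tensor(0.8), torch.tensor(1.2)))
print(n_pair_loss(torch.rand(5), torch.rand(5, 4)))
```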

2.3 Zero-Shot Learning

Zero-Shot Learning (ZSL) aims to recognize instances from classes that were not seen during the training phase. There is an increasing interest in ZSL as it offers an alternative where labeled data is limited or training with every new visual category is impossible. Apart from image classification, ZSL has wide-ranging applications in various domains. It has been applied to unseen action detection in videos [23] and video retrieval with text descriptions. In NLP, it has been used for machine translation [24]. In the medical domain, [6] applies ZSL to decoding fMRI images. In this thesis, we focus only on image classification. Irrespective of the application, in all ZSL methods the principal learning task is to transfer knowledge from seen to unseen classes. For this, all ZSL methods leverage some form of auxiliary semantic information that acts as a link between seen and unseen classes. The rest of this section introduces common ways of creating these semantic spaces and different problem settings for ZSL.


Figure 2.3.1: Binary attribute space

2.3.1 Semantic Space

There are several ways to create a semantic space that contains the auxiliary semantic information required for ZSL. Each class is represented as an embedding in this space, referred to as a class prototype. Two of the most common ways of creating such a semantic space are:

• Attribute Spaces: This is the most intuitive way of getting auxiliary information for ZSL. Here, we imitate the human style of novel class recognition. Humans connect seen and unseen classes based on the attributes of each class. So, by manually annotating each class with a fixed set of attributes, we can create a semantic space.

For instance, to recognize animals we can represent each class with the physical attributes of that animal, such as color, habitat, food habit, etc. Figure 2.3.1 shows one example of creating attribute-based class prototypes. Even though in this example we have binary attribute values, creating a binary attribute space [25], it is more common to have both binary and real-valued attributes.

• Learned Spaces: In learned semantic spaces, machine learning models are employed to create the semantic space. With large text corpora like Wikipedia and the Google News dataset available, word embedding models can be learned efficiently in an unsupervised way. Word2Vec [9] and GloVe [26] are two of the most popular approaches: Word2Vec learns word embeddings by training a neural network to predict target words from the surrounding context words, while GloVe uses co-occurrence statistics of words in the text to learn the model. Once we have such a word embedding model, we can obtain class prototypes for ZSL by simply picking the word embeddings corresponding to each class label [2]. More recently, text descriptions [27] of each class have also been used to create class prototypes. In these learned semantic spaces, each dimension of the class prototype doesn't have any specific meaning, unlike in attribute spaces, but on the other hand they are much less labor-intensive to create.

Most recent ZSL works are based on attribute-based semantic spaces, and in this project we also work predominantly with attribute-based image ZSL datasets.

2.3.2 Learning Settings

ZSL can operate under different learning settings based on the data available at training and testing time. As introduced in Chapter 1, based on the data available at test time, the two learning settings are: (1) traditional ZSL, also referred to simply as ZSL, when we have only unseen classes at inference time; (2) Generalized ZSL, or GZSL, when both seen and unseen classes are present at testing time. GZSL is a more challenging setting, as training only on seen classes often creates a bias that results in poor performance when the model has to classify among seen and unseen classes together. Based on the data available at training time, ZSL can further be categorized as:

• Inductive ZSL: In this setting, at training time we have only labeled image data from seen classes, along with the class prototypes of both seen and unseen classes. The objective is to transfer knowledge from the semantic space to the visual space in order to predict images from unseen classes.

• Transductive ZSL: Here, unlabeled images from the unseen classes are also available along with the labeled seen-class images. Class prototypes of seen and unseen classes are available too.

The transductive setting is relatively easy compared to the inductive setting, since we have some information on the distribution of the unseen classes in the visual space. In this project, we address both ZSL and GZSL, but only in the more challenging inductive setting.

2.3.3 Evaluation Metrics

To encourage good performance for both densely and sparsely populated classes, ZSL adopts the average per-class top-1 accuracy: the top-1 accuracy is calculated for each class separately and then averaged over all classes. We evaluate GZSL by the harmonic mean of the average per-class top-1 accuracies of the seen and unseen classes. It can be defined as follows:

$$H = \frac{2 \cdot Acc_{seen} \cdot Acc_{unseen}}{Acc_{seen} + Acc_{unseen}} \qquad (2.17)$$
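The two evaluation measures can be sketched as follows. The per-class accuracy function assumes integer class labels, and the toy arrays are only for illustration.

```python
import numpy as np

def per_class_accuracy(y_true, y_pred):
    # Average per-class top-1 accuracy: compute accuracy within each class, then average,
    # so that sparsely populated classes count as much as dense ones.
    classes = np.unique(y_true)
    accs = [np.mean(y_pred[y_true == c] == c) for c in classes]
    return float(np.mean(accs))

def harmonic_mean(acc_seen, acc_unseen):
    # Equation 2.17: H = 2 * Acc_seen * Acc_unseen / (Acc_seen + Acc_unseen).
    return 2 * acc_seen * acc_unseen / (acc_seen + acc_unseen)

y_true = np.array([0, 0, 1, 1, 1, 2])
y_pred = np.array([0, 1, 1, 1, 0, 2])
print(per_class_accuracy(y_true, y_pred), harmonic_mean(0.7, 0.5))
```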

2.4 Related Work

This section describes the prominent trends in Zero-Shot Learning for image classification. Sections 2.4.1-2.4.3 discuss the relevant ZSL and GZSL methods, while Section 2.4.4 covers the special requirements on dataset splits for benchmarking ZSL and GZSL.

2.4.1 Zero-Shot Learning

The interest in ZSL has increased significantly in recent years. [1, 28, 29] are a few of the earliest works that utilized attributes as semantic information. Over the years, ZSL has been attempted with various other forms of semantic information, like class embeddings obtained from Word2Vec [2], GloVe [26], or the WordNet hierarchy [30].

Natural language descriptions of each class have been utilized to create class embeddings in [27]. More recently, [31] utilizes human gaze localization to create embeddings for each class to be used for ZSL. In this project, we primarily focus on ZSL with attributes. All the datasets utilized in this project are accompanied by human-annotated attributes for each class.

Irrespective of the type of semantic information used, the crux of most ZSL methods is to learn a mapping from the image feature space to the semantic space utilizing the information from the seen classes. The earliest works in ZSL [28, 32, 33] focused on directly or indirectly mapping image instances to the class-attribute space. Once the mapping is learned, image features are projected into the class-attribute space, and the label prediction is based on a nearest-neighbor search over the unseen class-attribute embeddings. Extending this, [34] formulated ZSL as a distance metric learning problem where image features are projected to the semantic space, and a linear mapping of the Mahalanobis distance is applied to learn the metric relations between pairs of image features and class attributes.

Even without an explicit metric learning formulation, learning a bilinear compatibility function between the image feature space and the semantic space has been a predominant technique in the ZSL literature. A bilinear compatibility function pulls objects from the same class closer to each other and pushes objects of different classes away from each other [35]. ALE [36] and DeViSE [37] use ranking loss to learn the bilinear compatibility between the image and semantic spaces, while SJE [2] utilizes a structured SVM loss for the same. ESZSL [38] uses a square loss with explicit regularization for learning the bilinear compatibility. [39] takes a pair of visual and semantic features as the input to a deep neural network that outputs the compatibility score. None of these methods employ deep learning to create the metric space; in one of our methods, we employ deep neural networks along with a triplet loss to learn the metric space.

2.4.2 Distribution Representation

Most of the ZSL methods learn the cross-modal mapping between visual and semantic space with a point vector representation of features. Such representation leads to the loss of information regarding the intra-class variability among the features.

To improve over this, [40] advocates for the use of distribution-based embeddings for both image and semantic features. [40] represents image and class categories as Gaussian distributions before learning a multi-modal mapping between them.

However, explicitly obtaining distribution-based embeddings for features of various modalities is a laborious task and thus not very effective as a general solution. To overcome the intra-class variability problem of vector representations, GFZSL [41] introduces a generative framework for zero-shot learning which represents each class-conditional distribution as a Gaussian. [41] improves significantly over earlier ZSL methods by learning the parameters of the class-conditional distribution as a function of the class-attribute vectors. However, [41] utilizes kernel-based techniques for parameter estimation and uses offline learning to learn the parameters of the seen classes separately. Inspired by these, in our project we focus on creating novel ZSL methods that account for the intra-class variability of the features from both modalities.

2.4.3 Sample Synthesis

Several recent methods treat ZSL as a missing data problem, where unseen class samples are missing from a regular supervised learning setup. As more and more generative frameworks for ZSL are being proposed, synthesizing samples [42–44] to account for the missing data is becoming common. [43] uses GANs to generate complete images, while [42] utilizes conditional variational autoencoders to generate image features. On the other hand, CADA-VAE [44] applies aligned variational autoencoders to generate features in a low-dimensional latent space. By generating samples of the unseen classes, ZSL can be treated as a regular supervised learning problem.

A common problem in ZSL is the bias towards the seen classes, since by definition we usually train only with seen-class information. This becomes a major problem especially in GZSL, which is a more general setting and has seen classes present at test time. Most conventional ZSL methods tend to fail at GZSL even with decent results for ZSL, as these models are usually highly biased towards the classes seen during training. Sample-synthesis methods like the ones described above offer a remedy for this bias problem. Moreover, by being able to generate both seen and unseen class samples, these models can control the operating point of the subsequent classifier by generating a different number of features per class for seen and unseen classes. For instance, [44], which is one of the current SOTA methods for GZSL, generates 400 samples per unseen class while generating only 200 samples per seen class before training a supervised softmax classifier on them.

2.4.4 Dataset Splits

A majority of ZSL methods, whether deep learning based or not, use feature vectors representing the images, instead of the whole images, as the input to their models. These feature vectors are usually extracted from deep networks like GoogleNet [16], VGGNet [15], or ResNet [17] that were pre-trained on huge image datasets like ImageNet [18]. ZSL by definition requires disjoint train and test classes. [45] observes that this feature extraction using deep networks is a part of the training procedure and hence should not include any information from the test sets if the zero-shot assumption is to hold. This makes it essential to split the datasets into train and test sets such that there is no overlap between the image classes in the test set and the ImageNet classes used for training the feature extraction network. Moreover, [45] shows how the results of most prior ZSL methods degrade on the common benchmark datasets when new data splits obeying the ZSL assumption are used. Since then, as a standard method of evaluation, all proposed ZSL and GZSL methods are evaluated with the new dataset splits proposed by [45].

2.4.5 Summary

In this project, we explore all the best techniques for ZSL described above, which include deep neural networks for metric learning, a distributional treatment of features from both modalities to account for intra-class variability, and finally sample synthesis to improve GZSL. For all our methods, we evaluate on the benchmark datasets with the dataset splits proposed by [45].

Methods

We start this chapter with a formal definition of Zero-Shot and Generalized Zero-Shot learning problems along with the mathematical notation to be followed throughout the thesis. We then describe the datasets used in the project. And finally, we discuss all the methods explored in this project along with their respective network architectures, background theory, and inference techniques.

3.1 Problem Definition

In an experimental setup, seen classes ($S$) refer to the set of classes to which the labeled training instances belong. Further, we have unlabeled testing instances which belong to unseen classes ($U$). We have $N_s$ seen classes and $N_u$ unseen classes. The $D$-dimensional feature space ($X$) is a real-valued space in which instances from both seen and unseen classes are represented as vectors. The vectors are obtained from each instance or image by feature extraction, as explained in Section 2.1.3. We operate under the assumption that each instance belongs to only one class. The auxiliary information of each class $y \in Y \subset S \cup U$ is represented by a vector known as the class-attribute vector or class prototype ($a_y$). These class prototypes reside in an $L$-dimensional space called the semantic space or auxiliary information space ($A$).

A training sample is represented by $(x_i, y_i)$, where $x_i \in \mathbb{R}^D$ is the feature vector of the training instance and $y_i \in S$ is the corresponding class label. Since we have a unique class prototype for each class, a training sample can also be referred to as $(x_i, a_{y_i})$, where $a_{y_i} \in \mathbb{R}^L$. Figure 3.1.1 further illustrates what a sample from the training data looks like.


Figure 3.1.1: An illustration of a training sample.

$X$: $D$-dimensional feature space
$A$: $L$-dimensional semantic space
$S$, $U$: Sets of seen and unseen classes respectively
$N_s$, $N_u$: Number of seen and unseen classes respectively
$X_{tr}$, $X_{te}$: Training and testing instances respectively
$Y_{tr}$, $Y_{te}$: Labels of training and testing instances respectively
$N_{tr}$, $N_{te}$: Number of training and testing instances respectively
$D_{tr}$: Training dataset containing samples from seen classes
$(x_i, y_i)$: $i$-th training sample, with $x_i \in X$ and $y_i \in S$
$a_{y_i}$: Class prototype of the instance with label $y_i$

Table 3.1.1: Key Notations

A test sample is represented by $(x_{te}, y_{te})$, where $x_{te} \in \mathbb{R}^D$. For ZSL, $y_{te} \in U$ and the learning objective is to obtain a model $f_{zsl}: \mathbb{R}^D \rightarrow U$. For GZSL, a test sample can belong to either a seen or an unseen class, hence $y_{te} \in S \cup U$ and the objective is to obtain a model $f_{gzsl}: \mathbb{R}^D \rightarrow S \cup U$. The key notations are summarized in Table 3.1.1.

3.2 Datasets

In this section, we describe the benchmark datasets we use to evaluate our ZSL and GZSL methods. The datasets AWA1 and AWA2 are coarse datasets of animals in diverse backgrounds. The SUN and CUB datasets are much more fine-grained, with limited per-class data, making them very challenging. As the objective of our evaluation is to compare our methods with previous competitive methods, we use the same feature extraction technique as used by others: ResNet-101 pre-trained on ImageNet, without any further fine-tuning, is used to extract features from the images of all the datasets. All the datasets, along with the training, validation, and testing splits, are provided by [45]. The data splits ensure that the unseen classes in the test split do not overlap with the 1000 classes of ImageNet, so that the zero-shot assumption remains intact.

• Animals with Attributes (AWA)

AWA1 [28] is one of the most popular datasets for evaluating zero-shot and attribute-based classification methods. It contains 30,475 images of animals from 50 classes with diverse backgrounds. The dataset includes an 85-dimensional attribute vector for each class. These attributes comprise both binary and continuous features and are obtained manually by human annotators. For ZSL, we use 40 classes for training and 10 classes for evaluation. The raw images of AWA1 are no longer freely available, but the features are still accessible. For this reason, [45] published an updated dataset, AWA2, with 37,322 open-sourced images and the same attributes as AWA1. We evaluate our methods on both AWA1 and AWA2.

• SUN Scene Recognition (SUN)

The SUN [46] dataset consists of 14,340 images from 717 different classes of scenes, like kitchen, lobby, resort, etc. This dataset is very fine-grained, with limited data for each class, making it one of the most challenging to evaluate on. The dataset includes a 102-dimensional human-annotated attribute vector with each image. Such vectors from the images of each class are averaged separately to get a class-attribute vector. For ZSL, the data are split into 645 seen and 72 unseen classes, obeying the zero-shot assumption [45].

• Caltech UCSD Birds 200 (CUB)

The CUB [47] is another fine-grained dataset with 11,788 images from 200 classes of different bird species. For each class, we have around 60 images. For ZSL, we use 150 seen and 50 unseen classes. The dataset provides a 312-dimensional human-annotated attribute vector for each class. Along with that, it also provides a textual description of each class. A character-level convolutional recurrent neural network (CRNN) is trained on these text descriptions by [27] to obtain 1024-dimensional embeddings for each class. Most recent works evaluate on the CUB dataset with these embeddings, as they provide superior performance.

3.3 Models

This section describes the three novel models for Zero-Shot and Generalized Zero-Shot Learning. We begin with Section 3.3.1, which presents a baseline generative model inspired by the work of GFZSL [41], which demonstrated a simple generative framework for ZSL. Section 3.3.2 gives the details of our Triplet model, where we represent the image and attribute features as Gaussian distributions in a common space and apply metric learning to learn the associations. In Section 3.3.3, we present a Triplet-VAE model, where two VAEs are trained with triplet-based distributional alignment for ZSL. Section 3.3.4 gives the details of our third model, which is a simple Classifier model for ZSL and GZSL inspired by energy-based models. Each of these subsections contains the model details, network architecture, and relevant theory. Finally, in Section 3.3.5, we suggest an alternative and more effective way of performing GZSL inference for all three models.

3.3.1 Generative Model

In this model, we learn the class distributions, parameterized on the class-attribute vectors of seen and unseen classes, in an end-to-end fashion. Given the data from a class, the class-conditional distribution $p(x_i\,|\,a_{y_i}, \theta)$ is modeled as a Gaussian, where $\theta$ represents the global parameters of the model. The parameters of this Gaussian distribution are the mean vector $\mu_{y_i} \in \mathbb{R}^D$ and the diagonal covariance $\Sigma_{y_i} \in \mathbb{R}^{D \times D}$. These parameters are modeled by a neural network $f_\theta: a_{y_i} \mapsto \{\mu_{y_i}, \Sigma_{y_i}\}$. Being a generative model, the aim is to maximize the joint distribution $P(X, Y\,|\,\theta)$, where $Y = S \cup U$. Since we don't have unseen class samples, we maximize $P(X, S\,|\,\theta)$ and expect the learned $\theta$ to generalize well to the unseen classes at test time.


Figure 3.3.1: An illustration of our generative model.

$$P(X, S\,|\,\theta) = \prod_{(x_i, a_{y_i}) \in D_{tr}} p(x_i, a_{y_i}\,|\,\theta) \qquad (3.1)$$

$$\log P(X, S\,|\,\theta) = \sum_{(x_i, a_{y_i}) \in D_{tr}} \log p(x_i, a_{y_i}\,|\,\theta) = \sum_{(x_i, a_{y_i}) \in D_{tr}} \Big(\log p(x_i\,|\,a_{y_i}, \theta) + \log p(a_{y_i}\,|\,\theta)\Big) \qquad (3.2)$$

By modelling only the class-conditional distribution, the objective becomes:

$$\operatorname*{argmax}_{\theta} \sum_{(x_i, a_{y_i}) \in D_{tr}} \log p(x_i\,|\,a_{y_i}, \theta) \qquad (3.3)$$

Figure 3.3.1 illustrates the network architecture of this model. The model is inspired by the work of GFZSL [41], which first learns the parameters of the class-conditional distribution using MAP/MLE estimation from the data and then separately learns the mapping from attributes to the Gaussian parameters of each class using simple regression. Unlike them, we use neural networks and perform end-to-end learning of the distribution parameters.

For each sample $(x_i, a_{y_i}) \in D_{tr}$, the neural network takes $a_{y_i}$ as the input. For efficiency, we model the covariance as a diagonal matrix. The inner workings of the model can be formulated as follows:

$$\mu_{y_i} = f_{\theta_\mu}(a_{y_i}) \qquad (3.4)$$

$$\Sigma_{y_i} = f_{\theta_\Sigma}(a_{y_i}) \qquad (3.5)$$

For $N_B$ samples in a batch, the objective in Equation 3.3 can be formulated as a batch loss in the following way:

$$\mathrm{Loss} = \sum_{i=1}^{N_B} \frac{1}{2}\log\left(|\Sigma_{y_i}|\right) + \frac{1}{2}\,(x_i - \mu_{y_i})^T\,(\Sigma_{y_i})^{-1}\,(x_i - \mu_{y_i}) \qquad (3.6)$$

Once the model is trained and the parameters $\theta = \{\theta_\mu, \theta_\Sigma\}$ are learned, for every input seen or unseen class-attribute vector $a_{y_i}$ the model outputs the corresponding parameters of the Gaussian distribution $\{\mu_{y_i}, \Sigma_{y_i}\}$. For ZSL inference, an input feature $x^+$ results in the label prediction $y^+$ based on:

$$y^+ = \operatorname*{argmax}_{y_i \in U}\; \log p(x^+\,|\,a_{y_i}, \theta) \qquad (3.7)$$

For convenience in the calculations, instead of the diagonal covariance, we model the neural network output to be the inverse of the diagonal covariance. Also, to ensure a positive semi-definite covariance matrix, we constrain the covariance output of the network to have only positive values. For all the datasets, we use two neural networks with two layers each for predicting the mean and the diagonal covariance vector outputs. All hidden network layers are followed by a batchnorm and a dropout layer. For non-linearity, we use ReLU in all hidden layers.
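A minimal sketch of this baseline generative model is given below, under the stated design: two small networks map a class-attribute vector to the mean and the (positive) inverse diagonal covariance, the batch loss follows Equation 3.6, and inference follows Equation 3.7. The hidden width, the softplus positivity constraint, and the omission of batchnorm and dropout are simplifications; the 85/2048 dimensions merely mirror the AWA attributes and ResNet-101 features.

```python
import torch
import torch.nn as nn

class GenerativeZSL(nn.Module):
    """Maps a class-attribute vector a_y to (mean, inverse diagonal covariance)."""
    def __init__(self, attr_dim, feat_dim, hidden=512):
        super().__init__()
        self.mu_net = nn.Sequential(nn.Linear(attr_dim, hidden), nn.ReLU(), nn.Linear(hidden, feat_dim))
        self.prec_net = nn.Sequential(nn.Linear(attr_dim, hidden), nn.ReLU(), nn.Linear(hidden, feat_dim))

    def forward(self, attrs):
        mu = self.mu_net(attrs)
        prec = nn.functional.softplus(self.prec_net(attrs))  # positive inverse variances
        return mu, prec

def nll_loss(x, mu, prec):
    # Equation 3.6 with the inverse diagonal covariance:
    # 0.5 * (log|Sigma| + (x - mu)^T Sigma^-1 (x - mu)), averaged over the batch.
    return 0.5 * (-torch.log(prec).sum(dim=1) + (prec * (x - mu) ** 2).sum(dim=1)).mean()

def predict(x, unseen_attrs, model):
    # Equation 3.7: pick the unseen class whose Gaussian gives x the highest log-likelihood.
    mu, prec = model(unseen_attrs)              # (N_u, D) each
    diff = x.unsqueeze(1) - mu.unsqueeze(0)     # (B, N_u, D)
    log_lik = 0.5 * (torch.log(prec).sum(dim=1) - (prec * diff ** 2).sum(dim=2))
    return log_lik.argmax(dim=1)                # index into the unseen classes

model = GenerativeZSL(attr_dim=85, feat_dim=2048)
x, attrs = torch.randn(32, 2048), torch.randn(32, 85)
loss = nll_loss(x, *model(attrs))
```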

3.3.2 Triplet Model

For ZSL, representing images and attributes as point vectors results in the loss of important information about the intra-class variability. Thus, in this model, to account for intra-class variability, we aim to learn a common latent space where both the image feature and the class attribute are represented by a distribution. We then employ metric learning with an online batch triplet loss among these distributions. Another advantage of this approach is that once the model learns to associate image distributions with the corresponding attribute distributions, we can use it to generate seen and unseen class samples in the latent space from the attribute vectors.

Metric learning and variations of the triplet loss have previously been used for ZSL (Section 2.4.1). But unlike them, we represent features from both modalities as distributions instead of point vectors, and we employ deep neural networks to create the metric space. Figure 3.3.2 illustrates the network architecture of our Triplet model for ZSL. In the latent space, we represent the features as Gaussian distributions.


Figure 3.3.2: Illustration of the proposed Triplet Model

Networks $f_\theta: X \rightarrow \mathcal{N}_x(\mu_x, \Sigma_x)$ and $f_\phi: A \rightarrow \mathcal{N}_a(\mu_a, \Sigma_a)$ generate the parameter vectors of the respective distributions, where $\mu_x, \mu_a \in \mathbb{R}^K$ are the mean vectors and $\Sigma_x, \Sigma_a \in \mathbb{R}^K$ are the corresponding diagonal covariance vectors.

To measure the similarity between two distributions, we experimented with the KL divergence (KLD), the Bhattacharyya (Btc) distance, and the 2-Wasserstein (W2) distance (Section 2.2.1). We found that the W2 distance produced the best results. For multivariate Gaussians, the closed-form solution of the W2 distance for distributions $d_1$ and $d_2$ is given by:

$$W_{d_1,d_2} = \left[\,\|\mu_1 - \mu_2\|_2^2 + \mathrm{Tr}(\Sigma_1) + \mathrm{Tr}(\Sigma_2) - 2\,\mathrm{Tr}\!\left(\left(\Sigma_2^{1/2}\,\Sigma_1\,\Sigma_2^{1/2}\right)^{1/2}\right)\right]^{1/2} \qquad (3.8)$$

In the case of diagonal covariance matrices, this distance simplifies to:

$$W_{d_1,d_2} = \left[\,\|\mu_1 - \mu_2\|_2^2 + \|\Sigma_1^{1/2} - \Sigma_2^{1/2}\|_F^2\,\right]^{1/2} \qquad (3.9)$$

The triplet loss with negative attribute samples gave better results than a model with negative image-sample triplets or even a mixed model. The triplet loss for a positive pair $(x_p, a_p)$ and a negative attribute sample $a_n$ is given by:

$$L_t(x_p, a_p, a_n) = \left[\,\alpha + D\!\left(f_\theta(x_p), f_\phi(a_p)\right) - D\!\left(f_\theta(x_p), f_\phi(a_n)\right)\right]_+ \qquad (3.10)$$

In the above equation, $[\cdot]_+$ denotes $\max(0, \cdot)$ and $D(a, b)$ denotes the W2 distance between the distributions $a$ and $b$. Constructing triplets explicitly for training can be inefficient, so instead we use an online batch triplet loss strategy.
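A minimal sketch of the core pieces of this Triplet model is given below: two embedding networks producing diagonal Gaussians in a common latent space, the diagonal W2 distance of Equation 3.9, and the triplet loss of Equation 3.10 with a negative attribute sample. The latent size, hidden width, and margin are placeholders, and the online batch triplet mining is omitted.

```python
import torch
import torch.nn as nn

class GaussianEmbed(nn.Module):
    """Maps an input vector to the mean and diagonal variance of a Gaussian in the latent space."""
    def __init__(self, d_in, d_latent=64, hidden=512):
        super().__init__()
        self.body = nn.Sequential(nn.Linear(d_in, hidden), nn.ReLU())
        self.mu = nn.Linear(hidden, d_latent)
        self.var = nn.Sequential(nn.Linear(hidden, d_latent), nn.Softplus())  # positive variances

    def forward(self, v):
        h = self.body(v)
        return self.mu(h), self.var(h)

def w2_diag(p, q):
    # Equation 3.9: 2-Wasserstein distance for diagonal Gaussians p = (mu, var), q = (mu, var).
    (mu1, var1), (mu2, var2) = p, q
    return torch.sqrt(((mu1 - mu2) ** 2).sum(dim=-1) + ((var1.sqrt() - var2.sqrt()) ** 2).sum(dim=-1))

def triplet_loss(img_dist, pos_attr_dist, neg_attr_dist, alpha=1.0):
    # Equation 3.10: [alpha + D(f_theta(x_p), f_phi(a_p)) - D(f_theta(x_p), f_phi(a_n))]_+
    return torch.clamp(alpha + w2_diag(img_dist, pos_attr_dist) - w2_diag(img_dist, neg_attr_dist), min=0.0).mean()

f_theta = GaussianEmbed(d_in=2048)   # image-feature branch
f_phi = GaussianEmbed(d_in=85)       # class-attribute branch
x, a_pos, a_neg = torch.randn(32, 2048), torch.randn(32, 85), torch.randn(32, 85)
loss = triplet_loss(f_theta(x), f_phi(a_pos), f_phi(a_neg))
```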
