
Linköping University | Department of Computer and Information Science

Master's thesis, 30 ECTS | Datateknik

2020 | LIU-IDA/STAT-A--20/008--SE

Self-Supervised Representation Learning for Content Based Image Retrieval

Hariprasath Govindarajan

Supervisor: Amanda Olmin
Examiner: Jose M. Peña

Linköpings universitet, SE-581 83 Linköping, +46 13 28 10 00, www.liu.se

Upphovsrätt

This document is made available on the Internet - or its future replacement - for a period of 25 years from the date of publication, provided that no exceptional circumstances arise.

Access to the document implies permission for anyone to read, download and print single copies for personal use, and to use it unchanged for non-commercial research and for teaching. Subsequent transfer of copyright cannot revoke this permission. All other use of the document requires the consent of the author. To guarantee authenticity, security and accessibility, there are solutions of a technical and administrative nature.

The author's moral rights include the right to be named as the author to the extent required by good practice when the document is used as described above, as well as protection against the document being altered or presented in a form or context that is offensive to the author's literary or artistic reputation or integrity.

For additional information about Linköping University Electronic Press, see the publisher's home page: http://www.ep.liu.se/.

Copyright

The publishers will keep this document online on the Internet - or its possible replacement - for a period of 25 years starting from the date of publication barring exceptional circumstances.

The online availability of the document implies permanent permission for anyone to read, to download, or to print out single copies for his/her own use and to use it unchanged for non-commercial research and educational purposes. Subsequent transfers of copyright cannot revoke this permission. All other uses of the document are conditional upon the consent of the copyright owner. The publisher has taken technical and administrative measures to assure authenticity, security and accessibility.

According to intellectual property law the author has the right to be mentioned when his/her work is accessed as described above and to be protected against infringement.

For additional information about the Linköping University Electronic Press and its procedures for publication and for assurance of document integrity, please refer to its www home page: http://www.ep.liu.se/.


Abstract

Automotive technologies and fully autonomous driving have seen a tremendous growth in recent times and have benefitted from extensive deep learning research. State-of-the-art deep learning methods are largely supervised and require labelled data for training. However, the annotation process for image data is time-consuming and costly in terms of human efforts. It is of interest to find informative samples for labelling by Content Based Image Retrieval (CBIR). Generally, a CBIR method takes a query image as input and returns a set of images that are semantically similar to the query image. The image retrieval is achieved by transforming images to feature representations in a latent space, where it is possible to reason about image similarity in terms of image content.

In this thesis, a self-supervised method is developed to learn feature representations of road scene images. The self-supervised method learns feature representations for images by adapting intermediate convolutional features from an existing deep Convolutional Neural Network (CNN). A contrastive approach based on Noise Contrastive Estimation (NCE) is used to train the feature learning model. For complex images like road scenes, where multiple image aspects can occur simultaneously, it is important to embed all the salient image aspects in the feature representation. To achieve this, the output feature representation is obtained as an ensemble of feature embeddings which are learned by focusing on different image aspects. An attention mechanism is incorporated to encourage each ensemble member to focus on different image aspects. For comparison, a self-supervised model without attention is considered and a simple dimensionality reduction approach using SVD is treated as the baseline.

The methods are evaluated on nine different evaluation datasets using CBIR performance metrics. The datasets correspond to different image aspects and concern the images at different spatial levels - global, semi-global and local. The feature representations learned by the self-supervised methods are shown to perform better than the SVD approach. Taking into account that no labelled data is required for training, learning representations for road scene images using self-supervised methods appears to be a promising direction. Usage of multiple query images to emphasize a query intention is investigated and a clear improvement in CBIR performance is observed. It is inconclusive whether the addition of an attention mechanism impacts CBIR performance. The attention method shows some positive signs based on qualitative analysis and also performs better than the other methods for one of the evaluation datasets containing a local aspect. This method for learning feature representations is promising but requires further research involving more diverse and complex image aspects.


Acknowledgments

I would like to thank my supervisor at Linköping University, Amanda Olmin, for being so supportive and for guiding me throughout the thesis by providing useful feedback. Thanks for your careful supervision.

I would like to thank my external supervisors at Veoneer, Dennis Lundström and Peter Lindskog, for their continued support starting from our first collaboration in the Research Project course during the Fall 2019 semester. My sincere gratitude for trusting me with this challenging thesis and for all the valuable feedback on my work.

Many thanks to Fredrik Lindsten for making the Research Project possible which paved the way for much greater things.

I would also like to thank my classmates, who have helped and inspired me and made this an interesting and enjoyable learning experience.

I thank my friends, Sreedeep and Sankar, for always being there for me and especially for all the entertainment during corona times. Thanks for always being there during those much needed breaks.

Last but not least, thanks to my parents for encouraging me and making it possible for me to pursue the master's programme.


Contents

Abstract
Acknowledgments
Contents
List of Figures
List of Tables
1 Introduction
  1.1 Background
  1.2 Aim
  1.3 Related Work
  1.4 Outline
2 Data
  2.1 Visual Representation Learning Dataset
  2.2 Evaluation Datasets
3 Theory
  3.1 Feedforward Neural Network
  3.2 Convolutional Neural Network (CNN)
  3.3 Attention
  3.4 Self-Supervised Visual Representation Learning
  3.5 Contrastive Learning
  3.6 Content Based Image Retrieval (CBIR)
  3.7 Singular Value Decomposition
  3.8 Evaluation Metrics for CBIR
4 Method
  4.1 Content Based Image Retrieval - A Statistical Perspective
  4.2 Baseline Method - Singular Value Decomposition (SVD)
  4.3 Attention Based Ensemble for Representation Learning
5 Results
  5.1 Evaluation Experiments
  5.2 CBIR Performance Metrics
  5.3 Analysis of Separation Distances
  5.4 Analysis of CBIR Performance
  5.5 Examples of CBIR
  5.6 Visualization of Attention Masks
6 Discussion
  6.1 Results
  6.2 Method
  6.3 Ethical Considerations
7 Conclusion


List of Figures

2.1 Scene with foggy conditions
2.2 Scene with snowy conditions
2.3 Scene inside a tunnel
2.4 Scene containing a gravel road
2.5 Scene containing a waterlogged road
2.6 Scene containing roads with tar marks
2.7 Scene containing shadows on the road
2.8 Scene containing tires lying on the road
2.9 Scene containing pedestrians lying on the road
3.1 Commonly used activation functions namely Sigmoid, tanh and ReLU
3.2 An example of a Feedforward Neural Network
3.3 Example of convolution operation
3.4 Example of a CNN used for image classification
3.5 Content Based Image Retrieval Process
4.1 ABE Model Architecture
4.2 Attention Module, A^1_m(x^(k+p))
4.3 Global Feature Embedding Function
4.4 Model without Attention
5.1 mP@K using one query image
5.2 mP@K using two query images
5.3 mAP@K using one query image
5.4 mAP@K using two query images
5.5 CBIR Results for eval-fog
5.6 CBIR Results for eval-snow
5.7 CBIR Results for eval-gravel
5.8 CBIR Results for eval-waterlogged
5.9 Visualization of attention masks for the first eval-fog query
5.10 Visualization of attention masks for the second eval-fog query
7.1 Training curve of ABE-8 model loss
7.2 Training curve of ABE-12 model loss
7.3 Training curve of ABE-16 model loss
7.4 CBIR Results for eval-tunnel
7.5 CBIR Results for eval-tarmarks
7.6 CBIR Results for eval-shadows
7.7 CBIR Results for eval-tire
7.8 CBIR Results for eval-pedestrians
7.9 Visualization of attention masks for the first eval-fog query


List of Tables

2.1 Evaluation Datasets
4.1 Selected optimal hyperparameters for the models ABE-8, ABE-12, ABE-16 and No Attention
4.2 Model Training Details
5.1 mP@K for 1 query image aggregated over 10 query sets
5.2 mP@K for 2 query images aggregated over 10 query sets
5.3 mAP@K for 1 query image aggregated over 10 query sets
5.4 mAP@K for 2 query images aggregated over 10 query sets
5.5 Separation distance measure for one query image


1 Introduction

1.1 Background

While driving an automobile, humans face unpredictable situations on the road and human intelligence has proven to be effective at managing these uncertainties in the process of perceiving and making safe decisions. In the last decade, the mobility industry has seen an unprecedented transformation powered by innovative automotive technologies and has advanced towards bringing fully autonomous vehicles on roads to reality. However, these smart automotive technologies must ensure the highest levels of safety even in unexpected scenarios in order to be trusted and widely adopted.

Veoneer, a Swedish automotive active safety company, focuses on developing safety electronics, Advanced Driver Assistance Systems (ADAS), Collaborative Driving and Automated Driving (AD). Vision systems, which are often powered by deep learning models, are a key component of such technologies and perceive the surroundings of a vehicle by means of images. In order to ensure that a system works in a safe manner, the underlying models need to be reliable and are expected to have high accuracy and low uncertainty. This is achieved using state-of-the-art deep learning models by increasing the amount of training data [46]. However, such models are mostly supervised in nature and require labelled data to train them. For instance, a semantic segmentation model requires different regions of an image to be accurately labelled as different object categories. Labelling images for vision tasks is an expensive and time consuming process. On the other hand, data collection and storage technologies have advanced, so a large amount of unlabelled data is available.

One proposed solution to this problem is to identify and label only the informative samples which would help in improving the model. Active Learning is a research area which works on this principle by iteratively adding more informative samples to the set of labelled training data. However, it is an open challenge to identify the informative samples. One way to tackle this problem is to frame the task of finding informative samples as an information retrieval task. The images which lead to incorrect or highly uncertain predictions in a model can be used to find more images in the unlabelled dataset that are semantically similar and that could constitute the informative samples [55]. Such an information retrieval task of finding semantically similar images based on given query/search images is called Content Based Image Retrieval (CBIR). In this thesis, a method for learning semantic visual representations for CBIR applied to road scene images is investigated.


The performance of CBIR methods is primarily governed by the feature representation and the similarity metric that they employ. One of the most challenging problems in CBIR research is to overcome the "semantic gap" between the primitive pixel level information stored by machines and the high-level semantic concepts that a human perceives [51]. A feature representation is a numeric vector containing features that denote high-level semantic concepts of an image. For instance, a simple binary feature vector of length N could contain N features which denote the presence or absence of N objects in an image. In the past, feature representations were extracted using hand-crafted features or features derived using computer vision techniques from colours, shapes and textures [26]. In recent times, deep convolutional neural networks (CNN) have led to state-of-the-art performance in a range of visual learning tasks like image classification, semantic segmentation and object recognition. A CNN consists of a sequence of layers, of which some are convolutional layers, and each output in a layer of a CNN learns a complex combination of features from the previous layer. The layers deeper in the CNN tend to learn complex semantic concepts of the image which help in performing the particular task that the CNN is trained to perform [12]. Feature representations extracted from such CNN layers have shown promising results in CBIR methods [51].

Much of CBIR research has been focused on image datasets like ImageNet [44], Caltech256 [15], Paris6k [39], Oxford [38] and the Google Landmarks Dataset [33], where the images contain one subject which dominates the image. These subjects can be a famous monument, as in the case of the Google Landmarks Dataset, Caltech256, Paris6k and Oxford, or some distinct object class (for example: cat, dog, car, etc.) as in ImageNet. To our knowledge, there is no research that evaluates CBIR for a dataset consisting of complex images like road scenes. Road scenes are more challenging as they contain multiple different objects in each image and multiple different image aspects that exist simultaneously. For example, a road scene image can consist of multiple objects like cars, buildings, traffic signs, etc. and also display other conditions like fogginess, poor road conditions, etc.

1.2 Aim

The main aim of this thesis is to develop a CBIR method for road scene images. A CBIR method takes as input a query containing one or more images and outputs a set of results retrieved from an image database which are the most relevant to the query. Internally, a CBIR system needs to compare the semantic similarity of the query to images in the image database based on their content. Since road scenes contain multiple objects and scene aspects, it is difficult for a user to precisely convey their intention with a single query image. For instance, a user might want to search for images containing fog but the query image that they use might also contain other objects like trees and cars. One way to emphasize the fog aspect in a query is by providing two query images which both contain fog but are otherwise different in content.

In order to compare images, a CBIR system uses high-level semantic feature representations to reason about the similarity of images to a query. When using multiple query images to emphasize an image aspect, the image aspect should exist in the feature vectors of the images without being influenced by other image aspects. To do this efficiently, different overlapping aspects of an image should be represented at different positions in the feature vector. One way to achieve this is by means of attention [2]. By incorporating an attention mechanism in the representation learning model, it is possible to focus on specific regions of an image while learning specific parts of the feature representation of that image. Hence, an attention-based representation learning model will be investigated and the performance of the learned features will be evaluated for the task of CBIR using standard evaluation metrics. Veoneer have trained a deep CNN that performs well on various visual tasks. It would be beneficial to leverage the high-dimensional intermediate features learned by this model for CBIR. Since multiple image aspects can be observed simultaneously in road scenes, it is not


possible to categorize the images into disjoint groups. Labelling image pairs in a large dataset as similar or dissimilar is avoided as it is a very tedious task. As similarity labels for the images are not available, the method to be developed in this thesis should be unsupervised or self-supervised in nature. Self-supervised learning is a research area that is proving to be effective for learning feature representations. Self-supervised feature representations with minimal fine-tuning have been shown to produce good results on classification, segmentation and object detection tasks, sometimes even outperforming supervised baselines [19]. Based on this motivation, the research questions for this thesis are stated as follows:

1. How effective is self-supervised learning at adapting features from a road scene CNN to perform CBIR?

2. What is the effect on image retrieval performance of using multiple query images to emphasize a query concept?

3. How does the performance vary between queries for aspects which concern the image at different spatial levels - global, semi-global and local? For instance, fog and snow are global aspects, road conditions are semi-global aspects and the presence of certain objects like cars or debris on the road are local aspects.

4. What is the effect on image retrieval performance of using an attention-based model for learning image features?

1.3 Related Work

A self-supervised learning method is characterized by an artificially designed pretext task. The formulation of the loss function that is used to train the self-supervised learning model depends on the pretext task. By using a pretext task to formulate the model loss, a self-supervised learning model can be trained on unlabelled datasets. Jing et al. [21] categorize these pretext tasks or methods into four types, namely generation-based methods, context-based methods, free semantic label-based methods and cross modal-based methods. Among these, the generation-based methods and context-based methods are of most relevance.

In generation-based methods, semantic feature representations are learned by generating either an entire image or a smaller patch of an image. Zhang et al. [56] propose colorization as a pretext task where the model learns to generate a color image from a black-and-white image. This method differs from autoencoders [13] in that the input and output are different channels of the same image. Pathak et al. [37] propose inpainting as a pretext task where the model learns to reconstruct a missing patch in an image based on context from the rest of the image. Approaches based on Generative Adversarial Networks (GANs) [14], namely CycleGAN [58], BiGAN [8] and BigBiGAN [9], have also been proposed, where a discriminator learns to differentiate between representations from a data encoder and a fake representation generator. In methods based on GANs, the pretext task is to generate images that meet a specific purpose like transferring image style [58] or enhancing image resolution [28]. Such pretext tasks could be challenging even for a human as they involve understanding and reasoning about the semantics of different images and their spatial relationships.

Context-based methods are learned through pretext tasks that predict the context of a single image, multiple images or different patches of the same image. Gidaris et al. [10] propose a model that learns semantic feature representations by classifying images into classes based on their artificially introduced rotations (0°, 90°, 180°, 270°). Another context classification task formulated by Doersch et al. [7] involves predicting the relative spatial context between a pair of patches from the same image. As a further extension, several models have been proposed to learn visual representations by solving jigsaw puzzles consisting of nine patches of each image [34, 1, 22]. Alternatively, DeepCluster [3] uses pseudo-labels that are


generated iteratively by clustering the learned representations. When video data is available, it is possible to learn representations from the temporal context of frames by predicting whether a given order of frames is correct [31, 52] or by sorting a set of frames to match the correct order [29].

In comparison to generative and predictive methods of learning features, recent works in contrastive learning [35, 19, 18] have shown results which surpass the former's performance on downstream tasks like image classification and object detection. Contrastive methods learn representations by contrasting positive and negative examples corresponding to a context. Contrastive Predictive Coding (CPC) [35] is a framework that learns latent feature representations applicable to different input modes like images, audio and text. The latent feature representation is learned by an autoregressive model (GRU RNN) that is trained using the Noise Contrastive Estimation (NCE) loss. The NCE loss is explained in detail in the Theory chapter. For vision tasks, Oord et al. [35] achieve this by dividing each image into 49 patches using a 7x7 grid. A feature embedding function is learned that converts each image patch into a feature representation. The feature representations are then used to select the correct patch at a position given the patches that appeared previously along the height of the image. Hénaff et al. [19] improve the performance of the CPC framework for vision mainly by increasing model capacity and adding random patch-level augmentations.

In Contrastive Multiview Coding (CMC) [47], feature representations are learned from two different views of the same scene and trained by contrasting positive and negative pairs of these feature representations. For instance, the views can be different channels of an image or nearby frames in a video. Wu et al. [53] propose a contrastive model with a non-parametric classifier that is trained using NCE loss to bring image pairs from the same class closer in the feature space and push image pairs from different classes farther apart in the feature space.

The attention mechanism, first proposed by Bahdanau et al. [2], has gained prominence in the Natural Language Processing domain and is an essential component of state-of-the-art language models based on the Transformer architecture [50] like GPT [40], BERT [6] and GPT2 [41]. The attention model pays "attention" to different sets of words of a sentence irrespective of their position while learning each feature in a representation. This is useful while learning visual representations for complex images as well.

Selfie [49] is an attention based model that is trained by predicting masked image patches, similar to how BERT [6] predicts missing words in text. The Selfie model is similar to CPC [35] but replaces the recurrent model with attention. However, the task of predicting masked image patches is difficult for road scene images. For instance, even humans cannot confidently predict whether a car should be present in a certain masked region of the image based only on the context of the remaining image. The method developed in this thesis is a contrastive method that also incorporates an attention mechanism to learn feature representations of images. This method is built on ideas from Attention Based Ensemble (ABE) for deep metric learning [23]. In this work, Kim et al. [23] propose a method that learns representations concatenated from an ensemble of learners which are trained to pay "attention" to different aspects of an image. The loss function consists of a divergence component that requires the learners to learn representations which are as diverse as possible and a contrastive component that requires these representations to be effective at contrasting between similar and dissimilar pairs of images.

Commonly, features learned by self-supervised learning are evaluated on the basis of their performance in the image classification task using the ImageNet [44] dataset. Some of the approaches also report promising performance on image retrieval tasks [34, 23, 18, 22, 3, 53]. However, the image retrieval datasets used for evaluation namely ImageNet [44], Caltech256 [15], Paris6k [39] and Oxford [38] are much simpler than images of road scenes.


1.4 Outline

The thesis is structured as follows:

• The Data chapter introduces the training and evaluation datasets and provides an overview of them.

• The Theory chapter presents the theoretical background related to the methods that are formulated and developed in this thesis.

• The Method chapter explains the methods in detail along with details about their implementation.

• The Results chapter reports all the evaluations in terms of performance metrics, visualizations and examples of image retrieval, together with an analysis of the obtained results.

• The Discussion chapter discusses the methods developed in this thesis in a wider context and attempts to reason about the results obtained.

• The Conclusion provides the conclusions drawn regarding the research questions that this thesis aimed to answer.


2 Data

This chapter introduces the image datasets that are used in the thesis. These image datasets come from an internal Veoneer data collection that is obtained from vehicles using their 4th generation Mono Vision cameras. The datasets contain cropped high definition (HD) images with 3 color channels and correspond to diverse scenarios in terms of road types, locations, weather conditions and time of day.

2.1 Visual Representation Learning Dataset

The dataset that is used for training the self-supervised visual representation learning method developed in this thesis consists of 15092 image sequences. In the rest of the thesis, this dataset will be referred to as the SSVRL-Dataset. The image sequences from the SSVRL-Dataset are divided into a training set (SSVRL-Train), containing 11801 image sequences, and a validation set (SSVRL-Val), containing 3291 image sequences. Each image sequence contains 6 image frames from a video and the time interval between the frames is 5 seconds. This means that there are 15092 × 6 image frames in total.

2.2 Evaluation Datasets

The CBIR method developed in this thesis is to be evaluated on unseen data which was not involved in the training of the model. For this purpose, the following 9 image concepts of interest for CBIR are identified:

1. Scenes with foggy conditions (Figure 2.1)
2. Scenes with snowy conditions (Figure 2.2)
3. Scenes inside a tunnel (Figure 2.3)
4. Scenes containing gravel (Figure 2.4)
5. Scenes containing a waterlogged road (Figure 2.5)
6. Scenes containing roads with tar marks (Figure 2.6)
7. Scenes containing shadows on the road (Figure 2.7)
8. Scenes containing tires lying on the road (Figure 2.8)
9. Scenes containing pedestrians lying on the road (Figure 2.9)

With respect to each image concept, the images which contain the image concept are positive samples and the images which do not contain the image concept are negative samples. Each image retrieval task involves searching and retrieving the positive samples of a particular image concept from a dataset containing both positive and negative samples of that image concept. Hence, an evaluation dataset containing both positive and negative samples is constructed for each image concept. The images from all of the datasets are considered to come from a common data distribution of road scene images. Similarly, the positive samples for each image concept are considered to come from a distribution of road scene images conditional on the image concept.

Labels marking the presence or absence of fog, snow, tunnel, gravel and waterlogged road are readily available and these are used to construct evaluation datasets for the aforementioned image concepts. For the remaining image concepts of interest, labels are not readily available. In these cases, images are selected by careful human inspection. As human inspection is a tedious process, the sizes of these manually constructed evaluation datasets are smaller than the other evaluation datasets. Table 2.1 shows the sizes of the evaluation datasets and their counts of positive samples for each CBIR task. In the rest of the thesis, these datasets will be referred to using the names shown in Table 2.1. Figures 2.1 to 2.9 show positive samples from each evaluation dataset.

#  CBIR Evaluation Concept                           Dataset Name       # Total  # Positive
1  Scenes with foggy conditions                      eval-fog              5667          60
2  Scenes with snowy conditions                      eval-snow            18426         388
3  Scenes inside a tunnel                            eval-tunnel          14690        1348
4  Scenes containing gravel                          eval-gravel          19618         161
5  Scenes containing a waterlogged road              eval-waterlogged     20832        1447
6  Scenes containing roads with tar marks            eval-tarmarks          989         141
7  Scenes containing shadows on the road             eval-shadows           822          79
8  Scenes containing tires lying on the road         eval-tire              291          19
9  Scenes containing pedestrians lying on the road   eval-pedestrians       261          33

Table 2.1: Evaluation Datasets

Figure 2.1: Scene with foggy conditions

Figure 2.2: Scene with snowy conditions

Figure 2.3: Scene inside a tunnel

Figure 2.4: Scene containing a gravel road

Figure 2.5: Scene containing a waterlogged road

Figure 2.6: Scene containing roads with tar marks

Figure 2.7: Scene containing shadows on the road

Figure 2.8: Scene containing tires lying on the road


3 Theory

In this chapter, the theoretical background for the methods used in this thesis is explained.

3.1 Feedforward Neural Network

A neural network is a mathematical model that is inspired by the structure of neurons in a human brain. If y = f*(x) is the underlying function that describes the relationship between a response variable y and an explanatory variable x, a neural network is a non-linear mapping f(x; θ) that approximates this function by learning the parameters θ corresponding to the best approximation [12]. Here, the vector y would hold continuous numerical values in the case of regression models and discrete class labels in the case of classification models.

The most basic unit of a neural network is a neuron, which takes as input a vector x_n and computes an output that is a linear combination of the input, given by:

z_n = x_n^T w_n + b_n

where w_n is a vector of weights and b_n is the bias term. Usually, a non-linear transformation is applied to this output and this is known as the activation function. Examples of commonly used activation functions such as Sigmoid, tanh and the Rectified Linear Unit (ReLU) [32] are shown in Figure 3.1. A non-linear activation function enables the neural network to learn a non-linear mapping from the input x to the output y.

Figure 3.1: Commonly used activation functions, namely Sigmoid(x) = 1/(1 + e^{-x}), tanh(x) and ReLU(x) = max(0, x)


Figure 3.2: An example of a Feedforward Neural Network

Neurons are grouped into layers to form a hierarchical structure and each layer applies a non-linear transformation to the input from the previous layer. The parameters for each layer contain the parameters for all the neurons in that layer and these are aggregated into a matrix of weights W and a vector of bias terms b. A neural network with N layers can be represented as a chain of functions applied to the input, f_N(f_{N-1}(... f_3(f_2(f_1(x))))). The first layer is known as the input layer, which receives the input to the neural network; the last layer is known as the output layer, which produces the output of the neural network; and the intermediate layers are known as hidden layers. The functional form of the n-th layer can be written as:

z_n = \varphi(x_n^T W_n + b_n)

where z_n is the output of the layer and \varphi(\cdot) is an activation function. The final output of the neural network is denoted by ŷ. This kind of neural network, where the output is computed by passing the input forward through the sequence of layers without feedback connections, is known as a feedforward network [12]. An example of a feedforward neural network with one hidden layer is shown in Figure 3.2.

The training process of the neural network involves learning the optimal parameters θ = {(W_1, b_1), (W_2, b_2), ..., (W_N, b_N)} for all of the layers in the network. At the beginning, the parameters are usually initialized randomly with values close to zero. Then, the input is propagated through the layers in the network to obtain the output ŷ and this step is known as forward propagation. In the case of supervised learning, where labelled data is available, the true value y corresponding to every training input x is known. A loss function L(y, ŷ) is defined for the neural network that compares the output ŷ = f(x; θ) with the actual values of y from the data. The layer parameters are updated using the backpropagation algorithm [43], where the loss is propagated backwards through the layers of the network. Specifically, the gradient of the loss with respect to each layer parameter is computed and this gradient is used to update the layer parameter using a defined learning rate, η. For example, the parameters of the n-th layer, namely the weight matrix W_n and the bias term b_n, are updated as follows:

W_n = W_n - \eta \frac{\partial L}{\partial W_n}, \qquad b_n = b_n - \eta \frac{\partial L}{\partial b_n}


This type of parameter update is known as gradient descent. For large datasets, this is usually done by dividing the data into smaller parts known as batches. When gradient descent is performed in batches, it is known as batch gradient descent and when gradient descent is performed using individual data samples, it is known as stochastic gradient descent [5]. Compared to earlier optimization algorithms, Adam [24] is a gradient based optimization algorithm that has been shown to be more robust for large datasets and better suited for the non-convex optimization problems which are common in neural networks. Adam [24] achieves this by adapting the learning rates for each parameter in the neural network using exponentially moving averages of the first and second statistical moments of the gradients.
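To make the update rule above concrete, the following minimal sketch (not taken from the thesis) performs one batch gradient descent step for a single linear layer trained with a mean squared error loss; the data, array shapes and learning rate are illustrative assumptions.

```python
import numpy as np

# Illustrative only: one batch gradient descent step for a single linear layer
# with a mean squared error loss; data, shapes and learning rate are made up.
rng = np.random.default_rng(0)
X = rng.normal(size=(32, 4))           # one batch of 32 samples with 4 features
y = rng.normal(size=(32, 1))           # regression targets
W = np.zeros((4, 1))                   # weight matrix W_n, initialised near zero
b = np.zeros(1)                        # bias term b_n
eta = 0.1                              # learning rate

y_hat = X @ W + b                      # forward propagation
loss = np.mean((y_hat - y) ** 2)       # loss L(y, y_hat)

grad_out = 2.0 * (y_hat - y) / len(X)  # dL/dy_hat
grad_W = X.T @ grad_out                # dL/dW, backpropagated through the layer
grad_b = grad_out.sum(axis=0)          # dL/db

W = W - eta * grad_W                   # gradient descent update of the weights
b = b - eta * grad_b                   # gradient descent update of the bias
```

In practice, an optimizer such as Adam would adapt the learning rate per parameter instead of using the fixed eta above.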

3.2 Convolutional Neural Network (CNN)

Feedforward neural networks can be used for modelling data with images as inputs but there are some limitations. An image is denoted as a matrix of pixels where each pixel has a specific color. Pixel values are generally represented as a combination of multiple color channels - RGB, CMYK and Lab are a few examples. Hence, images of width W and height H having C color channels are stored as 3-dimensional arrays of size W × H × C. In a feedforward network, every neuron in the input layer is connected to every neuron in the first hidden layer. For high-dimensional inputs like images, this results in a very large number of layer parameters and computations.

A CNN [27] overcomes this computational problem by utilizing the convolution operation in some of the layers. In the simple case with only one channel, the convolution operation is performed using a kernel (also known as a filter) matrix that is applied to an input matrix to get an output matrix, as shown in Figure 3.3, by moving the kernel matrix along the height and width of the input matrix. At each position where the kernel matrix is moved to, the sum of the element-wise products of the kernel matrix and the overlapping region in the input matrix is computed. The convolution operation requires a smaller number of parameters than a layer in a feedforward network because the same kernel is applied at all locations of the input. When 2D convolution is applied to a 3-dimensional input like an image, the kernel is also 3-dimensional with the same number of channels as the input. The operation is still called 2D convolution as the movement of the kernel occurs along only 2 dimensions - the height and width of the input. The amount by which the kernel is moved at every step is known as the stride. A convolution layer can be viewed as a feature extraction layer that applies several kernels in parallel, where each kernel extracts different features. For example, the earlier convolutional layers in a CNN would identify primitive features like edges and shapes whereas layers deeper in the CNN would identify complex objects.
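As an illustration of the operation just described (a sketch, not the thesis implementation), the following slides a 2 × 2 kernel over a 3 × 4 single-channel input with a configurable stride and sums the element-wise products at each position:

```python
import numpy as np

def conv2d_valid(inp, kernel, stride=1):
    """Slide the kernel over the input and sum element-wise products ("valid" padding)."""
    kh, kw = kernel.shape
    out_h = (inp.shape[0] - kh) // stride + 1
    out_w = (inp.shape[1] - kw) // stride + 1
    out = np.empty((out_h, out_w))
    for i in range(out_h):
        for j in range(out_w):
            patch = inp[i * stride:i * stride + kh, j * stride:j * stride + kw]
            out[i, j] = np.sum(patch * kernel)   # element-wise product, then sum
    return out

inp = np.arange(12, dtype=float).reshape(3, 4)   # 3 x 4 single-channel input
kernel = np.ones((2, 2))                         # 2 x 2 kernel
print(conv2d_valid(inp, kernel))                 # 2 x 3 output matrix
```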

The convolutional layer is commonly followed by an activation function, which applies a non-linear transformation to the convolution output, and a pooling layer that reduces dimensionality by aggregating the activation values within a small neighborhood of each neuron. There are different types of pooling methods and MaxPooling is one of them.

Figure 3.3: Example of a convolution operation, where a 2 × 2 kernel is applied to a 3 × 4 input matrix to produce a 2 × 3 output matrix


Figure 3.4: Example of a CNN used for image classification

MaxPooling [42] enables the model to have translational invariance when input features are displaced by a small amount. It is common to collectively refer to this set of three consecutive layers as a convolutional layer and this convention is followed in this thesis as well. An example of a CNN for image classification having two convolutional layers is shown in Figure 3.4. Dropout and Batch Normalization are commonly used methods to regularize CNNs and prevent overfitting. Dropout enables robust training of CNN parameters by randomly dropping some neurons based on a given dropout probability [45]. Batch Normalization [20] normalizes the output of the layer to which it is applied. This reduces internal covariate shift, leading to stability in training the CNN and faster convergence.

From a statistical viewpoint presented by Goodfellow et al. [12], a CNN is equivalent to a fully connected network with an infinitely strong prior for the weights. This strong prior requires weights for all output neurons in a spatial neighborhood to be shared but shifted in position. Also, the weights for an output neuron are required to be non-zero only within a small spatial neighborhood of the input defined by the dimensions of the kernel.

3.3 Attention

Attention is a concept that was first proposed as an approach to solve machine translation tasks [2]. The core idea of attention is to focus or in other words "attend" to a specific set of inputs while computing an output. In machine translation, a sentence is translated from one language to another but the order of the words could be different in the semantics of the target language. Attention is used as a mechanism to select those words in the source sentence to which the model should give importance while predicting the word at a particular position in the target sentence.

In an early application of attention for computer vision tasks, Xu et al. [54] used the attention mechanism to generate captions for images. In this work, the model pays attention to specific regions in the image while generating each word in the caption. Consider the output of a convolutional layer in a CNN, z, of dimension N_x × N_y × N_c. Convolutional layers extract features at each spatial position from a neighborhood of that position in the previous layer. This means that each position in a convolutional layer corresponds to a spatial region in the original image. Each spatial position in z is represented as a vector z_i of length N_c. Given the input z, the role of attention is to compute an output by focusing on a selected position (p_x, p_y). If all positions are serialized, let p_i be an indicator variable that indicates if location i is selected and let s be the output of the attention mechanism. In the simplest case, the output is the vector corresponding to the selected position:

s = \sum_{i=1}^{N_x N_y} p_i z_i

Xu et al. [54] proposed two different ways of using the attention mechanism and called them "hard" attention and "soft" attention. Hard attention does this in a stochastic way by sampling a spatial position p_i = (p_x, p_y) from a multinoulli distribution over all the positions, p_i ~ Multinoulli(α_i). The parameters α_i of the multinoulli distribution are computed as a function of the input z and a context vector c representing the current position in the output, α_i = f(z_i, c). When α_i is computed only using the relationship with the input features in z, such that α_i = f(z_i, z), it is known as self-attention. The stochastic process of hard attention is not differentiable and training the model by backpropagation requires more complex methods like policy gradients.

On the other hand, soft attention is a deterministic way of applying attention and computes the output s as an expected value instead of sampling. The set of values α_i, which are in the range of 0 to 1, is known as the attention mask and they denote the importance of each position in the image.

s = \mathbb{E}_p\left[ \sum_{i=1}^{N_x N_y} p_i z_i \right] = \sum_{i=1}^{N_x N_y} \alpha_i z_i
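A minimal sketch of the soft attention computation above (illustrative only; the scores, shapes and softmax normalization are assumptions, not the thesis implementation):

```python
import numpy as np

def soft_attention(z, scores):
    """Soft attention over a convolutional feature map.

    z:      feature map of shape (Nx, Ny, Nc), one vector z_i per spatial position
    scores: unnormalised relevance scores of shape (Nx, Ny), e.g. f(z_i, c)
    Returns the attention mask alpha and the attended output s = sum_i alpha_i * z_i."""
    flat_scores = scores.reshape(-1)
    alpha = np.exp(flat_scores - flat_scores.max())
    alpha /= alpha.sum()                           # softmax -> values in (0, 1) summing to 1
    z_flat = z.reshape(-1, z.shape[-1])            # (Nx*Ny, Nc)
    s = (alpha[:, None] * z_flat).sum(axis=0)      # expected value over positions
    return alpha.reshape(scores.shape), s

rng = np.random.default_rng(0)
z = rng.normal(size=(7, 7, 64))                    # dummy convolutional output
scores = rng.normal(size=(7, 7))                   # dummy relevance scores
alpha, s = soft_attention(z, scores)               # s has length Nc = 64
```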

However, this is not the only way to apply attention and many other ways exist. In this thesis, attention is applied in a manner similar to Kim et al. [23], where the attention is applied not only over the spatial positions but also over the channels, leading to three-dimensional attention masks.

3.4 Self-Supervised Visual Representation Learning

Supervised learning produces state-of-the-art results when provided with large amounts of labelled data. But labelled data is often expensive to obtain, especially for complex problems where human annotation is necessary. For example, semantic segmentation requires annotation of object classes for every pixel in an image and object recognition requires bounding boxes with appropriate object classes for every object in the image. Such annotation is time consuming and expensive in terms of human effort. On the other hand, technological advancements in data collection and storage have enabled the collection and storage of very large amounts of data. However, from the perspective of supervised learning, these large collections of data are not useful unless they are labelled.

Unsupervised learning is a type of learning which does not involve any data labels or human supervision. Self-supervised learning can be viewed as a type of unsupervised learning but it distinguishes itself from conventional unsupervised learning methods by utilizing supervisory signals that are derived from the data itself. Language models like BERT [6] are self-supervised representation learning models which are trained using the task of predicting a masked word in a sentence given the context of words that appeared before and after it. Such tasks, which are artificially constructed from unlabelled data without any human supervision, are known as pretext tasks.

One simple example of a pretext task for images involves randomly rotating some images in a dataset by 90°, 180° or 270° and the task would be to predict the correct rotation label [10]. In order to perform well in this task, the model should learn a latent feature representation that can efficiently reason about the feasible orientations of different objects. The latent feature representations learned by a model while performing the pretext task could contain semantic information which is useful for other tasks as well. A latent feature representation of an image is a numerical vector that describes different aspects of the image in a latent space. The performance on the pretext task is usually not so important and the learned representations are evaluated based on their performance in other tasks known as downstream tasks. Downstream tasks could be tasks like image classification, semantic segmentation or CBIR.
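As an illustration of such a rotation pretext task (a sketch with assumed shapes and names, not the thesis training code), the following generates rotated images and the corresponding rotation labels that a classifier would then be trained to predict:

```python
import numpy as np

def rotation_pretext_batch(images, rng):
    """Rotate each image by 0, 90, 180 or 270 degrees and return the rotation label.

    images: array of shape (N, H, W, C) with square images; labels 0..3 are the
    pretext targets corresponding to k quarter-turn rotations."""
    labels = rng.integers(0, 4, size=len(images))
    rotated = np.stack([np.rot90(img, k=int(k), axes=(0, 1))
                        for img, k in zip(images, labels)])
    return rotated, labels

rng = np.random.default_rng(0)
batch = rng.random((8, 64, 64, 3))                 # dummy batch of 8 square images
rotated, labels = rotation_pretext_batch(batch, rng)
# "rotated" would be fed to the model and "labels" used as classification targets
```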

For downstream tasks, the visual feature representations for new images are obtained from the intermediate feature representations of the self-supervised learning model. In other words, the self-supervised learning model is used as an image encoder that transforms high-dimensional pixel data to high-level semantic feature vectors. Models for tasks like image


classification or semantic segmentation can be trained using these feature representations as input [21]. Since the feature representations have lower dimensionality than the images, these models would be much smaller in size compared to models that are trained directly on images as input. For image retrieval tasks, these representations can either be directly used for distance/similarity metric computations or fine-tuned with a dataset having similarity labels. Methods based on self-supervised representations have been shown to outperform supervised pre-training baselines on several visual tasks [18].

3.5 Contrastive Learning

In generative or predictive pretext tasks for self-supervised representation learning, the model loss is computed in the data space (x, y). For instance, reconstruction loss and adversarial loss are commonly used for generative pretext tasks [36] and cross-entropy loss is used for predictive classification tasks [56]. On the other hand, contrastive tasks involve model loss computation in the latent space by contrasting latent representations of positive and negative samples given a specific context. For instance, in a task of selecting the correct image patch that fits a masked region of a bigger image as in [49], the unmasked image region acts as the context. Given this context, the objective would be to select the correct image patch (positive sample) among other distractor image patches (negative samples).

The goal of a contrastive method is to learn an encoder f that satisfies the condition p(f(x^+) | c) > p(f(x^-) | c), where x^+ is a positive sample, x^- is a negative sample and c is the context. This can be achieved by using a softmax classifier considering the entire set of target images X as the possible classes. The loss \mathcal{L} for such a classifier is commonly computed using the negative log-likelihood and this loss is also known as cross-entropy loss.

\mathcal{L} = -\mathbb{E}_X \left[ \log \frac{\exp(g(f(c), f(x)))}{\sum_{x' \in X} \exp(g(f(c), f(x')))} \right]    (3.1)

Here, g(a, b) is a function that computes a similarity metric between two given feature vectors a and b. The output of the classifier, which denotes the probability of x_i being the positive sample from the set X, can be derived using Bayes' theorem as follows [35]:

p(x_i | X, c) = \frac{p(x_i | c) \prod_{k \neq i} p(x_k)}{\sum_{j=1}^{N_X} p(x_j | c) \prod_{k \neq j} p(x_k)}
             = \frac{\frac{p(x_i | c)}{p(x_i)} \prod_{k=1}^{N_X} p(x_k)}{\sum_{j=1}^{N_X} \frac{p(x_j | c)}{p(x_j)} \prod_{k=1}^{N_X} p(x_k)}
             = \frac{\frac{p(x_i | c)}{p(x_i)}}{\sum_{j=1}^{N_X} \frac{p(x_j | c)}{p(x_j)}}    (3.2)

By comparing equations 3.1 and 3.2, it can be seen that \exp(g(f(c), f(x_i))) models the density ratio p(x_i | c) / p(x_i). When x_i is the true positive sample, this density ratio is at its maximum and hence, it is more likely that x_i comes from the conditional distribution p(x_i | c) than from the data distribution p(x_i).

However, there could be a large number of negative samples given each context and this results in the set of targets X being large as well. So, computing the normalization term \sum_{x' \in X} \exp(g(f(c), f(x'))) over all targets becomes computationally expensive.


Noise Contrastive Estimation (NCE) [17] is a method to overcome this problem by considering a smaller number of negative samples which are sampled randomly at the time of loss computation. The NCE loss [35] for a set X_N containing N - 1 negative samples out of a total of N samples is given by:

\mathcal{L}_{NCE} = -\mathbb{E}_{X_N} \left[ \log \frac{\exp(g(f(c), f(x)))}{\sum_{x' \in X_N} \exp(g(f(c), f(x')))} \right]

As derived in Oord et al. [35], the mutual information between c and a positive sample x has a lower bound given by:

I(x; c) \geq \log(N) - \mathcal{L}_{NCE}

It can be seen that minimizing the NCE loss maximizes a lower bound on the mutual information between the context and the positive sample. Also, higher values of N lead to a tighter bound on the mutual information.
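A minimal numerical sketch of this kind of NCE loss for a batch of contexts, positives and randomly sampled negatives (the dot product stands in for the similarity function g; shapes and names are assumptions, not the thesis implementation):

```python
import numpy as np

def nce_loss(context, positives, negatives):
    """Contrastive loss over a batch; the dot product plays the role of g(a, b).

    context:   (B, D) context embeddings f(c)
    positives: (B, D) embedding of the positive sample for each context
    negatives: (B, M, D) embeddings of M randomly sampled negatives per context"""
    pos_logits = np.sum(context * positives, axis=1, keepdims=True)       # (B, 1)
    neg_logits = np.einsum("bd,bmd->bm", context, negatives)              # (B, M)
    logits = np.concatenate([pos_logits, neg_logits], axis=1)             # (B, 1 + M)
    m = logits.max(axis=1, keepdims=True)
    log_norm = m + np.log(np.exp(logits - m).sum(axis=1, keepdims=True))  # logsumexp
    return float(-(pos_logits - log_norm).mean())  # negative log-likelihood of the positive

rng = np.random.default_rng(0)
c = rng.normal(size=(8, 128))
x_pos = rng.normal(size=(8, 128))
x_neg = rng.normal(size=(8, 15, 128))   # 15 negatives, i.e. N = 16 samples per context
print(nce_loss(c, x_pos, x_neg))
```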

3.6 Content Based Image Retrieval (CBIR)

The purpose of a CBIR method is to take as input a query, which could consist of one or more query images, and retrieve a set of images from an image dataset that best reflects the query. A CBIR method (Figure 3.5) consists of two important components, namely the feature representation and the similarity metric [51]. The feature representation is a way to transform the pixel-level information of images into a numeric feature vector which denotes meaningful semantic information about the image. In recent research, deep learning based methods to learn image representations for CBIR have outperformed conventional approaches using hand-crafted global and local features [51].


Figure 3.5: Content Based Image Retrieval Process

A similarity/distance metric is a measure of relevance between the query and an image from the database [57]. Larger distance metric corresponds to smaller similarity and smaller


distance metric corresponds to larger similarity. For a query containing N query images Q_i and an image I in the image database, the distance metric can be computed as the mean of the distances to each query image as follows:

\text{distance metric} = \frac{1}{N} \sum_{i=1}^{N} d(f(I), f(Q_i))

where f(\cdot) transforms an image to its feature vector and d(\cdot, \cdot) calculates the distance metric between two vectors. The similarity metric is computed in a similar way but using a similarity function in place of the distance function. For example, Euclidean distance is a commonly used distance metric and cosine similarity is a commonly used similarity metric. For two feature vectors v_1 and v_2, the Euclidean distance and cosine similarity between them are computed as follows:

d_{euclidean}(v_1, v_2) = \|v_1 - v_2\|_2

s_{cosine}(v_1, v_2) = \frac{v_1 \cdot v_2}{\|v_1\| \, \|v_2\|}

where \cdot denotes the dot product between two vectors. When the vectors are of unit length, the cosine similarity between two vectors is simply their dot product, s_{cosine}(v_1, v_2) = v_1 \cdot v_2.

Usually, the feature vectors are computed and stored for all the images in the image database beforehand. For each query, the feature vectors are computed for the query image(s) and the similarity/distance metrics are computed between the query and each image in the image database. The top K images with the lowest distance metric values or highest similarity metric values are returned as the retrieval result.
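The following sketch (illustrative only; the function and variable names are assumptions) ranks pre-computed database features against a multi-image query using either mean cosine similarity or mean Euclidean distance and returns the indices of the top K images:

```python
import numpy as np

def retrieve_top_k(query_feats, db_feats, k=10, metric="cosine"):
    """Rank database images against a query of one or more images.

    query_feats: (Q, D) feature vectors of the query images
    db_feats:    (N, D) pre-computed feature vectors of the database images
    The score of a database image is the mean similarity (or distance) to all query images."""
    if metric == "cosine":
        q = query_feats / np.linalg.norm(query_feats, axis=1, keepdims=True)
        d = db_feats / np.linalg.norm(db_feats, axis=1, keepdims=True)
        scores = (d @ q.T).mean(axis=1)             # higher is more similar
        order = np.argsort(-scores)
    else:                                           # Euclidean distance
        dists = np.linalg.norm(db_feats[:, None, :] - query_feats[None, :, :], axis=2)
        order = np.argsort(dists.mean(axis=1))      # lower is more similar
    return order[:k]                                # indices of the top K images

rng = np.random.default_rng(0)
db = rng.normal(size=(1000, 768))                   # database of 1000 feature vectors
query = rng.normal(size=(2, 768))                   # a query with two images
print(retrieve_top_k(query, db, k=5))
```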

3.7 Singular Value Decomposition

Dimensionality reduction is one simple approach to transform high-dimensional pixel-level information to a low dimensional feature representation. Singular Value Decomposition (SVD) [11] is a dimensionality reduction technique that can be used to learn a low-dimensional latent representation of a dataset matrix. In text mining, documents are commonly represented as term count vectors which contain the counts of all the words in the vocabulary. A set of documents is represented as a term count matrix by stacking the individual term count vectors. Applying SVD to a term count matrix is known as latent semantic analysis [25]. This method extracts semantic concepts of text into a low dimensional latent space and these latent feature vectors are commonly used in text retrieval systems [30]. This low-rank matrix factorization can be mathematically formalized for a dataset matrix X of dimension n × p as follows:

X \approx U \Sigma V^T

where U is an orthonormal matrix of dimensions n × r such that U^T U = I, \Sigma is a diagonal matrix of dimensions r × r containing the singular values and V is an orthonormal matrix of dimensions p × r such that V^T V = I. Given this decomposition of the dataset matrix, the low-dimensional latent representation of the data with r dimensions can be obtained by computing the matrix product XV. Any new data can then be transformed to the low-dimensional latent representation by computing the matrix product with V.
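A minimal sketch of this projection using NumPy's economy SVD (illustrative; the truncation to r components and the function names are assumptions, not the thesis baseline code):

```python
import numpy as np

def fit_svd_projection(X, r):
    """Learn an r-dimensional latent space from a data matrix X of shape (n, p)."""
    # Economy SVD: X ≈ U @ diag(S) @ Vt, where Vt has shape (min(n, p), p)
    U, S, Vt = np.linalg.svd(X, full_matrices=False)
    V_r = Vt[:r].T            # (p, r) projection matrix: top-r right singular vectors
    return V_r

def transform(X_new, V_r):
    """Project new data into the latent representation X_new @ V_r."""
    return X_new @ V_r

# Usage: fit the projection on a feature matrix, then project new samples
X = np.random.default_rng(0).normal(size=(100, 50))
V_r = fit_svd_projection(X, r=10)
latent = transform(X, V_r)    # shape (100, 10)
```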

3.8 Evaluation Metrics for CBIR

Essentially, CBIR is an information retrieval task and standard evaluation metrics from information retrieval research are applied to CBIR as well. The most widely used evaluation


metrics in CBIR research are Mean Average Precision up to particular ranks (mAP@K) and Mean Precision at particular ranks (mP@K).

Mean Precision at particular ranks (mP@K)

The precision at a particular rank (P@K) [30] denotes the accuracy of the retrieved results up to that rank according to their relevance to the query. Mean precision averages the precision at a particular rank over a set of queries. For a set of Q queries and K retrieved images denoted by R_k^{(q)} for each query q, the mean precision at rank K (mP@K) is computed as follows:

mP@K = \frac{1}{Q} \sum_{q=1}^{Q} \frac{1}{K} \sum_{k=1}^{K} I(R_k^{(q)})

where I(R_k^{(q)}) is an indicator function which evaluates to 1 if the retrieved image R_k^{(q)} is relevant to the query q and 0 otherwise. mP@K is not stable when averaged over different types of queries, that is, when each query q has a different user intention or information need. This is caused by different information needs having different amounts of relevant items in the sample search space. However, this is not a problem if the averaging occurs over queries with the same information need.

Mean Average Precision up to particular ranks (mAP@K)

Even though two different methods could have the same precision, one of them might rank the relevant images in a more appropriate manner. So, it is also important to evaluate the quality of the ranking when the retrieved images are ranked in the order of their relevance. Mean average precision up to a particular rank measures the quality of ranking by computing the mean of average precisions up to that rank for a set of queries. Average precision [30] for a set of retrieved results is defined as the average of the precisions computed at every rank where a relevant retrieval result is present. For a set of Q queries and K retrieved images denoted by R_k^{(q)} for each query q, the mean Average Precision up to rank K (mAP@K) is computed as follows:

mAP@K = \frac{1}{Q} \sum_{q=1}^{Q} \frac{1}{|K_r|} \sum_{k \in K_r} \text{Precision}_q(k)

where K_r is the set of all ranks with relevant retrieval results, |K_r| is the number of relevant retrieval results in the set of K retrieved images for query q and Precision_q(k) is the precision of the retrieved results up to rank k for query q.
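A small sketch of both metrics for ranked retrieval lists (illustrative; the boolean relevance arrays and helper names are assumptions, not the thesis evaluation code):

```python
import numpy as np

def precision_at_k(relevant, k):
    """relevant: 0/1 array over the ranked retrieval list for one query."""
    return float(np.mean(relevant[:k]))

def average_precision_at_k(relevant, k):
    """Average of precisions at every rank (<= k) where a relevant image appears."""
    rel_k = np.asarray(relevant[:k], dtype=bool)
    if rel_k.sum() == 0:
        return 0.0
    ranks = np.where(rel_k)[0] + 1                      # 1-based ranks of relevant hits
    return float(np.mean([rel_k[:r].mean() for r in ranks]))

def mean_metric(relevance_lists, k, fn):
    """Average a per-query metric over a set of queries (mP@K or mAP@K)."""
    return float(np.mean([fn(rel, k) for rel in relevance_lists]))

# Example: two queries with ranked relevance judgements (1 = relevant, 0 = not)
queries = [np.array([1, 0, 1, 1, 0]), np.array([0, 1, 0, 0, 1])]
mp_at_5 = mean_metric(queries, 5, precision_at_k)
map_at_5 = mean_metric(queries, 5, average_precision_at_k)
```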


4 Method

4.1 Content Based Image Retrieval - A Statistical Perspective

The main aim of this thesis is to develop a CBIR method specifically for images of road scenes. As explained in the Theory chapter, a CBIR method takes as input a query consisting of one or more query images and outputs, as a result, a set of images from an image dataset which best match the query. This is achieved by transforming the images from pixel values to a semantic feature vector using a feature embedding function and computing a distance/similarity metric in the feature embedding space. So, the key components of a CBIR method are the feature embedding function that transforms an image to the feature vector and the similarity/distance metric.

The road scene images x are of dimensionality W × H × C. These are here considered to come from a distribution of road scene images: x ~ D_RS. From a statistical viewpoint, the CBIR method can be viewed as modelling the conditional probability distribution p(x | Q), where Q is a set of query images. The conditional probability distribution is analogous to the similarity metric and measures the relevance between the image x and the query Q. The CBIR method would then list the top K images from the dataset having the highest probability values as the result. In general terms, the relationship between the conditional probability and the similarity metric can be expressed as follows:

p(x | Q_i) \propto f(sim(x, Q_i))

where $\text{sim}(\cdot)$ denotes the similarity between two images and $f(\cdot)$ is some function of the similarity value. However, the similarity of images should be evaluated in terms of the image content perceived by a human user. A human user perceives high-level concepts of an image, like the objects which are present in the image and the environment in which the objects are present. So, the similarity metric cannot be directly computed based on the primitive pixel level information, and a high-level feature representation of images is required which denotes the semantic concepts that exist in the image.
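To make the relationship between similarity and retrieval concrete, the sketch below ranks a dataset by a similarity score in an embedding space; the cosine similarity, the random embeddings and the top-K selection are illustrative choices and are not prescribed by the thesis.

import numpy as np

def retrieve_top_k(query_embedding, dataset_embeddings, k=10):
    """Rank dataset images by cosine similarity to the query embedding."""
    q = query_embedding / np.linalg.norm(query_embedding)
    d = dataset_embeddings / np.linalg.norm(dataset_embeddings, axis=1, keepdims=True)
    similarities = d @ q                  # plays the role of sim(x, Q)
    # Any monotone function f of the similarity gives the same ranking,
    # so sorting by similarity is equivalent to sorting by p(x | Q).
    return np.argsort(-similarities)[:k]

# Hypothetical embeddings: 1000 dataset images and one query, each of size 768.
rng = np.random.default_rng(0)
dataset = rng.normal(size=(1000, 768))
query = rng.normal(size=768)
print(retrieve_top_k(query, dataset, k=5))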

In order to develop an effective CBIR method, the main task is to build a feature embedding model, $f^*(x)$, that transforms the raw pixel level information of images to a latent feature space that contains high-level semantic features. For every image retrieval task, the similarity metric has to be computed between the query and all the images in the dataset. The computational cost of this task increases when the size of the feature representation increases. So, low dimensional latent feature spaces are preferred. Kim et al. [23] find feature vectors of size 512 to be suitable for their evaluation datasets. However, their evaluation datasets are much simpler. Since road scene images are more complex, a larger feature size is expected to perform better. Throughout this thesis, feature vectors of size 768 are used. The choice to maintain a fixed feature vector size across all methods is made to maintain uniformity and fairness while comparing the different methods.

Veoneer have trained a deep CNN which performs well on multiple visual tasks and this CNN will be referred to as VCNN. The intermediate convolutional layers of a CNN learn high-level spatial features [12] and it is beneficial to leverage these features for learning the feature representation for CBIR. Let $v^{(k)}(\cdot)$ represent the application of all the convolutional layers up to layer $k$ of VCNN. The output of layer $k$ from VCNN can be obtained as $x^{(k)} = v^{(k)}(x)$ and it denotes the intermediate convolutional features. These intermediate features $x^{(k)}$, of dimensionality $W^{(k)} \times H^{(k)} \times C^{(k)}$, can be considered to come from a distribution of road scene images in a latent space of the same dimensionality, $x^{(k)} \sim \mathcal{D}_{RS}^{(k)}$. By leveraging the high-level spatial information learned in VCNN, feature vectors for images are learned by using the intermediate features from VCNN as input. Hence, the feature embedding model can be expressed as $f^*(x) = f(x^{(k)}) = f(v^{(k)}(x))$. This thesis is focused on learning the function $f(\cdot)$ by using the SSVRL-Dataset explained in section 2.1.
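Since the actual VCNN is proprietary and not described here, the following sketch uses a publicly available torchvision backbone as a stand-in to show how intermediate features $x^{(k)}$ can be extracted from a frozen CNN; the backbone choice and the truncation point are assumptions made for illustration only.

import torch
import torchvision

# Stand-in for VCNN: a pretrained ResNet-18, frozen and truncated after layer3
# (playing the role of v^(k)).
backbone = torchvision.models.resnet18(weights="IMAGENET1K_V1")
v_k = torch.nn.Sequential(*list(backbone.children())[:-3])  # up to and including layer3
for p in v_k.parameters():
    p.requires_grad = False
v_k.eval()

with torch.no_grad():
    x = torch.randn(1, 3, 224, 224)    # a dummy road scene image of size W x H x C
    x_k = v_k(x)                       # intermediate features of shape (1, C_k, H_k, W_k)
print(x_k.shape)                       # torch.Size([1, 256, 14, 14])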

Firstly, a simple baseline method using the dimensionality reduction method Singular Value Decomposition is presented. Then, the Attention Based Ensemble method that is built on ideas from Kim et al. [23] is presented. Finally, the evaluation experiments and their methodology are presented.

4.2 Baseline Method - Singular Value Decomposition (SVD)

One way to transform a set of high dimensional features to a low dimensional latent space is by means of dimensionality reduction. In previous research, low dimensional features obtained through the straightforward approach of applying MaxPooling to every channel of the high dimensional convolutional features were found to perform effectively on simpler datasets [48]. One such simple dimensionality reduction method is Singular Value Decomposition (SVD) [11], which is considered as the baseline method in this thesis. As explained in Section 3.7, SVD is mathematically expressed as follows for a dataset matrix $X$ of dimension $n \times p$:

$$X \approx U \Sigma V^T$$

For every image $x$, the intermediate features from layer $k$ of VCNN are obtained and flattened into a feature vector, $x_f$. The SVD is performed using $x_f$ as the input. The feature values $x_f$ are centered by subtracting their respective means, $x'_f = x_f - \bar{x}_f$. The data matrix should be formed by stacking up $x'_f$ for all images in the SSVRL-Dataset. However, $x'_f$ is high-dimensional, with approximately $2 \times 10^5$ dimensions, and the dataset consists of $15092 \times 6$ image frames. Computing the SVD for the complete dataset is intractable with the available computational resources. Instead, only one frame is considered per image sequence and the total number of image frames is limited to 10000. These 10000 image sequences are sampled at random from the total of 15092 image sequences.

By performing SVD, the matrices $U$, $\Sigma$ and $V$ are obtained. The low-dimensional feature vector $f_{svd}(x_f)$ for any image having input feature vector $x_f$ is obtained by computing the matrix product:

$$f_{svd}(x_f) = x'_f V$$

where only the leading 768 columns of $V$, corresponding to the largest singular values, are retained so that the resulting feature vector has the fixed size of 768 used throughout this thesis.
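A minimal sketch of this baseline is given below, assuming the flattened intermediate features are already available as a NumPy array; the feature dimensionality is much smaller than the real $\sim 2 \times 10^5$ and the function name is a placeholder, so this only illustrates the pipeline rather than the actual implementation.

import numpy as np

# Hypothetical flattened intermediate features: one frame per sequence,
# 10000 frames, each with a (much reduced) feature dimensionality for illustration.
rng = np.random.default_rng(0)
features = rng.normal(size=(10000, 2048))     # stand-in for the flattened x_f vectors

mean = features.mean(axis=0)
centered = features - mean                     # x'_f = x_f - mean(x_f)

# Economy SVD of the centered data matrix; keep the top 768 components.
U, S, Vt = np.linalg.svd(centered, full_matrices=False)
V_768 = Vt[:768].T                             # shape (p, 768)

def f_svd(x_f):
    """Project a flattened feature vector to the 768-dimensional latent space."""
    return (x_f - mean) @ V_768

print(f_svd(features[0]).shape)                # (768,)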


4.3 Attention Based Ensemble for Representation Learning

Conceptual Presentation

The SVD method learns a low dimensional feature representation based on patterns in the input features, but this representation is not specifically trained to be good at reasoning about image similarity. The SVD method also does not have any means to infer which image aspects are useful for evaluating the similarity of images. For this reason, a more involved method that focuses on different image aspects and is trained specifically with image similarity based objectives is beneficial for reasoning about image similarity.

For learning such a feature embedding model, the concept of a metric space is relevant. A metric space is defined as a combination of a set or a multidimensional space and a metric on that set or space. A metric is simply a function that defines the distance between any two members of that set or any two points in that space. Let us consider two metric spaces $\mathcal{X}$ and $\mathcal{Y}$ of dimensions $N_\mathcal{X}$ and $N_\mathcal{Y}$, having metrics $d_\mathcal{X}$ and $d_\mathcal{Y}$ respectively. The function $f: \mathcal{X} \rightarrow \mathcal{Y}$ is called an isometric or distance preserving embedding function between the spaces $\mathcal{X}$ and $\mathcal{Y}$ if $d_\mathcal{X}(x_i, x_j) = d_\mathcal{Y}(y_i, y_j)$, where $y_i = f(x_i)$ and $y_j = f(x_j)$.

Consider the metric space $\mathcal{X}$ in which the images are represented in terms of their pixel values of dimensionality $W \times H \times C$. While learning the isometric feature embedding function for a set of images, the intention is to go from the metric space $\mathcal{X}$, where the metric defining the semantic similarity of images is unknown, to a low-dimensional latent metric space $\mathcal{Y}$ having a known metric.

Model Architecture

Kim et al. [23] proposed the method M-way Attention Based Ensemble (ABE-M), where the transformation $f: \mathcal{X} \rightarrow \mathcal{Y}$ is broken down into the following two steps:

1. Spatial feature extraction, $s: \mathcal{X} \rightarrow \mathcal{Z}$

2. Global feature embedding function, $g: \mathcal{Z} \rightarrow \mathcal{Y}$

The output feature embedding is proposed to be an ensemble of $M$ feature embeddings. If the individual feature embeddings are denoted by $f_m(x)$, the ensemble of $M$ feature embeddings is obtained by concatenation as follows: $f(x) = \{f_1(x); f_2(x); ...; f_M(x)\}$. An ensemble of feature embeddings is useful when the feature embeddings of the ensemble members are diverse. This diversity is achieved by encouraging each feature embedding to focus on different aspects of the image. So, each feature embedding varies only in terms of its spatial feature extractor, such that $f_m(x) = g(s_m(x))$. The overall model architecture is shown in Figure 4.1.
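The ensemble structure can be sketched as follows in PyTorch; the module names and the way the attention and embedding sub-modules are passed in are placeholders, and possible realisations of those sub-modules are sketched later in this section.

import torch
import torch.nn as nn

class ABEEnsemble(nn.Module):
    """f(x) = concat(g(s_1(x)), ..., g(s_M(x))) with a single shared g."""

    def __init__(self, attention_modules, global_embedding):
        super().__init__()
        self.attention_modules = nn.ModuleList(attention_modules)  # one per head
        self.global_embedding = global_embedding                   # shared g(.)

    def forward(self, x_k, x_kp):
        # x_k: intermediate features x^(k); x_kp: later-layer features x^(k+p).
        embeddings = []
        for attention in self.attention_modules:
            a_m = attention(x_kp)            # attention mask for head m
            s_m = x_k * a_m                  # attended spatial features s_m(x)
            embeddings.append(self.global_embedding(s_m))
        return torch.cat(embeddings, dim=1)  # ensemble of M embeddings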

Each spatial feature extractor $s_m(x)$ extracts features related to diverse aspects of the image. $s_m(x)$ is a deep CNN and contains an attention module. The attention module enables each $s_m(x)$ to focus on different regions of the image to extract diverse spatial features. For instance, one of the spatial feature extractors could focus on the regions containing a road to extract features related to the quality of the road, road surface, road markings, etc. Another spatial feature extractor could focus on the vehicles and other objects on the road. In this manner, overlapping aspects of a road scene could be separately extracted and encoded into different feature embeddings. Such an approach leads to a feature space where it is meaningful to define content based similarity.

Figure 4.1: ABE Model Architecture. The image x is passed through the frozen VCNN up to the bottleneck, and the M trainable attention modules operate on the resulting features.

Let us consider the output of a convolutional layer $k$ from VCNN as $x^{(k)} = v^{(k)}(x)$, which is a spatial volume of dimensions $W_k \times H_k \times C_k$. This convolutional layer is a layer close to the bottleneck of VCNN. The spatial feature extractors are formulated as $s_m(x) = s'_m(x^{(k)}) = s'_m(v^{(k)}(x))$. The m-th attention module selects relevant features from $x^{(k)}$ for the m-th feature embedding by computing an attention mask that denotes the relevance of each feature in the $W_k \times H_k \times C_k$ space of $x^{(k)}$. Instead of directly using $x^{(k)}$, another convolutional layer that is $p$ layers ahead in VCNN is considered:

$$x^{(k+p)} = v^{(k+p)}(x) = v'(v^{(k)}(x))$$

This layer choice follows Kim et al. [23], who use convolutional features from a later layer to learn the attention masks. The convolutional features $x^{(k)}$ contain fine-grained but slightly lower level features compared to $x^{(k+p)}$. The fine-grained features in $x^{(k)}$ are more useful than $x^{(k+p)}$ for learning a feature embedding. For example, a layer deep in the CNN might only identify a car, while an earlier layer might identify the different parts of a car. The later layer is useful for identifying the spatial location of the car, but the earlier layer contains fine-grained features which can differentiate between cars of different types. The attention masks should identify the regions of the image where a spatial feature extractor must focus, whereas the fine-grained features can be obtained from the earlier layer. Figure 4.2 shows the architecture of the attention module starting from $x^{(k+p)}$. With this rationale, the attention masks are computed as follows:

$$a_m = A_m(v^{(k)}(x)) = A'_m(v^{(k+p)}(x))$$

In VCNN, $x^{(k+p)}$ is of lower dimensionality compared to $x^{(k)}$. So, an upsampling block is used in the attention module to ensure that the dimensionality of the attention mask $a_m$ matches that of $x^{(k)}$.

Figure 4.2: Attention Module, $A'_m(x^{(k+p)})$. The module consists of an upsampling block (Convolution 2D, Batch Normalization, ReLU activation, Upsampling 2D) followed by a Convolution 2D layer with a 1x1 filter and a Sigmoid activation.

Consider a binary random variable $r_m$ of dimensionality $W_k \times H_k \times C_k$. $r_{m,whc}$ denotes whether the feature $x^{(k)}_{whc}$ is relevant to the m-th spatial feature extractor or not. Let these relevance variables $r_{m,whc}$ be Bernoulli distributed, each with parameter $a_{m,whc}$, as follows:

$$r_{m,whc} \sim \text{Bern}(a_{m,whc})$$

By using this relevance variable, the spatial features from $x^{(k)}$ which are selected by the m-th spatial feature extractor are given by:

$$z_{m,whc} = x^{(k)}_{whc} \cdot r_{m,whc}$$

However, such a stochastic hard attention model is not differentiable and cannot be trained easily by backpropagation. To avoid this, a deterministic soft attention model is preferred. The soft attention model is obtained by computing $s_m(x)_{whc}$ as the expectation of $z_{m,whc}$:

$$s_m(x)_{whc} = E(z_{m,whc}) = x^{(k)}_{whc} \cdot E(r_{m,whc}) = x^{(k)}_{whc} \cdot a_{m,whc}$$

In vector form, this is equivalent to ($\odot$ denotes element-wise multiplication):

$$s_m(x) = x^{(k)} \odot A_m(x^{(k)}) = x^{(k)} \odot A'_m(x^{(k+p)})$$
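A possible PyTorch realisation of such an attention module is sketched below; the channel counts, the upsampling factor and the exact layer ordering are assumptions meant to match the description of Figure 4.2, not the actual VCNN configuration.

import torch
import torch.nn as nn

class AttentionModule(nn.Module):
    """A'_m: computes an attention mask a_m in [0, 1] from x^(k+p)."""

    def __init__(self, in_channels, out_channels, scale_factor=2):
        super().__init__()
        # Upsampling block: bring x^(k+p) to the spatial size and depth of x^(k).
        self.upsample = nn.Sequential(
            nn.Conv2d(in_channels, out_channels, kernel_size=3, padding=1),
            nn.BatchNorm2d(out_channels),
            nn.ReLU(inplace=True),
            nn.Upsample(scale_factor=scale_factor, mode="nearest"),
        )
        # A 1x1 convolution followed by a sigmoid gives one relevance value per feature.
        self.mask = nn.Sequential(
            nn.Conv2d(out_channels, out_channels, kernel_size=1),
            nn.Sigmoid(),
        )

    def forward(self, x_kp):
        return self.mask(self.upsample(x_kp))   # a_m, same shape as x^(k)

# Soft attention: s_m(x) is the element-wise product of x^(k) and a_m.
attention = AttentionModule(in_channels=512, out_channels=256, scale_factor=2)
x_k = torch.randn(1, 256, 28, 28)    # hypothetical intermediate features x^(k)
x_kp = torch.randn(1, 512, 14, 14)   # hypothetical later-layer features x^(k+p)
s_m = x_k * attention(x_kp)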

The global feature embedding function $g(\cdot)$ is then applied to the spatial features extracted by each attention head to obtain the feature embedding, $f_m(x) = g(s_m(x))$. The global feature embedding function is a CNN that contains a series of convolutional layers followed by a global average pooling layer and a fully connected layer. The global feature embedding function is common for all M attention heads and all the parameters are shared between the heads. Using a common function $g(\cdot)$ for all M attention heads ensures that the diversity in the M feature embeddings comes only from the diversity in the spatial regions attended to by each of the M spatial feature extractors. Figure 4.3 shows the structure of the global feature embedding function, $g(\cdot)$.
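A sketch of one possible instantiation of such a shared embedding head is given below; the channel sizes, the dropout rate, the kernel sizes and the choice of M = 8 heads (giving an output size of 768/M = 96) are illustrative assumptions rather than the configuration used in the thesis.

import torch
import torch.nn as nn
import torch.nn.functional as F

class GlobalEmbedding(nn.Module):
    """g(.): shared across all M heads, maps s_m(x) to a 768/M-dimensional embedding."""

    def __init__(self, in_channels=256, embedding_size=768 // 8):
        super().__init__()

        def down_block(c_in, c_out):
            # Downsampling block: Conv2D -> BatchNorm -> ReLU -> MaxPooling.
            return nn.Sequential(
                nn.Conv2d(c_in, c_out, kernel_size=3, padding=1),
                nn.BatchNorm2d(c_out),
                nn.ReLU(inplace=True),
                nn.MaxPool2d(2),
            )

        self.blocks = nn.Sequential(
            down_block(in_channels, 256),
            down_block(256, 512),
            down_block(512, 512),
            nn.Dropout2d(0.2),                 # spatial dropout
        )
        self.fc = nn.Linear(512, embedding_size)

    def forward(self, s_m):
        h = self.blocks(s_m)
        h = h.mean(dim=(2, 3))                  # global average pooling
        return F.normalize(self.fc(h), dim=1)   # L2 normalized embedding

g = GlobalEmbedding(in_channels=256, embedding_size=96)
print(g(torch.randn(1, 256, 28, 28)).shape)     # torch.Size([1, 96])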

Figure 4.3: Global Feature Embedding Function. It consists of three downsampling blocks (Convolution 2D, Batch Normalization, ReLU activation, MaxPooling), spatial dropout, global average pooling, a fully connected layer and L2 normalization.

In order to evaluate the effect of using attention on the feature embedding, another model that does not use the attention mechanism to extract spatial features is also considered. This model directly applies a global feature embedding function to the intermediate convolutional features $x^{(k)}$. The structure of the global feature embedding function is the same as in the attention based model (Figure 4.3), but the output layer is adjusted such that the output feature embedding is of size 768 instead of 768/M. The structure of this model is shown in Figure 4.4.

Figure 4.4: Model without Attention. The image x is passed through the frozen VCNN up to the bottleneck and the trainable global feature embedding CNN is applied directly to the intermediate features.

Pretext task and loss functions

The SSVRL-Dataset consists of 15092 image sequences and each image sequence consists of 6 image frames. The image frames are separated by a time interval of 5 seconds. Any two
