
DEGREE PROJECT IN INFORMATION AND COMMUNICATION TECHNOLOGY,
SECOND CYCLE, 30 CREDITS
STOCKHOLM, SWEDEN 2019

Unsupervised Learning of Visual Features for Fashion Classification

SUMEET DHARIWAL

KTH ROYAL INSTITUTE OF TECHNOLOGY

SCHOOL OF ELECTRICAL ENGINEERING AND COMPUTER SCIENCE


Unsupervised Learning of Visual Features for Fashion Classification

SUMEET DHARIWAL

Master in Data Science
Date: June 27, 2019
Supervisor: Dr. Ying Liu

Examiner: Dr. Vladimir Vlassov

Industrial Supervisor: Abubakrelsedik Karali

School of Electrical Engineering and Computer Science
Host company: RISE SICS

Swedish title: Unsupervised Lärande av Visuella Funktioner för Fashion Classification


Abstract

Deep Learning has changed the way computer vision tasks are solved today. Deep Learning approaches have achieved state-of-the-art results in computer vision problems like image classification, image verification, object detection, and image segmentation. However, most of this success has been achieved by training deep neural networks on labelled datasets. While this way of training neural networks results in classifiers with better accuracies, it might not be the most efficient way to solve computer vision problems, because manually labelling the images/data points is a resource-consuming process that can cost a lot of time and money for the organizations that employ deep learning to develop various products and services.

Fashion and e-commerce is one such domain where there is a need to leverage image data without relying too much on labels. Doing so makes it possible to automatically label the category, attributes and other metadata of images, generally used to display the inventory digitally, without relying on humans to annotate them manually.

The aim of this master thesis is to explore the effectiveness of unsupervised deep learning approaches for fashion classification, so that the data can be classified by relying on only a few labelled data points. Two unsupervised approaches, one based on clustering of features called DeepCluster and the other based on rotation as a self-supervision task, are compared to a fully supervised model on the DeepFashion dataset.

Through empirical experiments, it has been shown that these unsupervised deep learning techniques can attain comparable classification accuracies (~1-4% lower than those achieved by a fully supervised model), making them suitable alternatives to supervised approaches.


Sammanfattning

Deep Learning has changed how computer vision tasks are solved in the current age. Deep Learning methods have achieved state-of-the-art results in computer vision problems such as image classification, image verification, object detection and image segmentation. However, most of this success has been achieved by training deep neural networks on labelled datasets. While this way of training neural networks results in classifiers with better accuracy, it may not be the most efficient way to solve computer vision problems, because manually labelling the images/data points is a resource-demanding process that can cost organizations that use deep learning to develop various products and services a lot of time and money.

Fashion and e-commerce is one such domain where there is a need to leverage image data without relying too much on labels. This process can be useful for automatically labelling the category, attributes and other metadata of images, typically used to display the inventory digitally, without relying on humans to annotate them manually.

The aim of this master's thesis is to investigate the effectiveness of unsupervised deep learning approaches for fashion classification, so that data can be classified by relying on only a few labelled data points. Two unsupervised approaches, one based on clustering of features called DeepCluster and the other based on rotation as a self-supervision task, are compared with a fully supervised model on the DeepFashion dataset.

Through empirical experiments, it has been shown that these unsupervised deep learning techniques can be used to achieve comparable classification accuracies (~1-4% lower than those achieved by a fully supervised model), thereby making them suitable alternatives to supervised approaches.


Contents

1 Introduction
  1.1 Deep Learning and Labelled Data
  1.2 What is Unsupervised Deep Learning?
  1.3 Motivation behind Unsupervised Deep Learning
  1.4 Research Question and Hypotheses
    1.4.1 Challenge
    1.4.2 Research Questions
    1.4.3 Hypotheses
    1.4.4 Contributions
  1.5 Ethical Considerations and Sustainability

2 Related Work
  2.1 Transfer Learning for Computer Vision tasks
    2.1.1 Extracting features from a pre-trained CNN
  2.2 Supervised approaches for Fashion Classification
  2.3 Semi-supervised approaches for Computer Vision tasks
  2.4 Unsupervised approaches for Computer Vision tasks
    2.4.1 Clustering based approach
    2.4.2 Self Supervised approach
    2.4.3 Generative models based approach

3 Methods
  3.1 Dataset
  3.2 Implemented Models
    3.2.1 Fully supervised pre-training
    3.2.2 DeepCluster based pre-training
    3.2.3 Self Supervised pre-training using Rotation
  3.3 Computational Resources Specification
  3.4 Choice of CNN architectures and Hyperparameters
  3.5 Choice of Deep Learning Framework
  3.6 Evaluation Metrics
    3.6.1 Quantitative
    3.6.2 Qualitative

4 Experiments and Results
  4.1 Experiment 1
    4.1.1 Objective
    4.1.2 CNN architecture used
    4.1.3 Pretext task: Classification on ImageNet dataset
    4.1.4 Downstream task: Clothing item category classification on DeepFashion test
    4.1.5 Implementation Steps
    4.1.6 Results
  4.2 Experiment 2
    4.2.1 Objective
    4.2.2 CNN architecture used
    4.2.3 Pretext task: Classification on ImageNet dataset
    4.2.4 Downstream task: Clothing item category classification on DeepFashion test
    4.2.5 Implementation Steps
    4.2.6 Results
  4.3 Experiment 3
    4.3.1 Objective
    4.3.2 CNN architecture used
    4.3.3 Pretext task: Rotation Classification on a given dataset
    4.3.4 Downstream task: Clothing item category classification on DeepFashion test
    4.3.5 Implementation Steps
    4.3.6 Results
  4.4 Grad-CAM visualizations

5 Discussion
  5.1 Limitations and Challenges
    5.1.1 Dataset issues
    5.1.2 Choosing hyperparameters
    5.1.3 Computational resources
  5.2 Applications and Future Work
    5.2.1 Future Work
    5.2.2 Applications

6 Conclusions

Bibliography


Chapter 1

Introduction

Computer Vision is a field of study whose goal is to make a computer see the world the way a human does. For the human eye, it is very easy to look at an image, a video or an outdoor scene and point out what objects, colours or scenes it contains. However, the same problem can be extremely difficult for a computer.

One of the primary reasons behind this is that researchers have still not been able to work out how human vision works and how it maps to the perception of the human brain. One of the most noted works in understanding this problem was Hubel and Wiesel's experiment [1] on the visual cortex of a cat and how its cells interact with each other to create a map of the visual world.

Over the years, researchers came up with different ideas and models to simulate a computer’s vision to bring it closer to human vision. One of the most promising ideas was Artificial Neural Networks where researchers tried to draw inspiration from the human brain and how neurons interact with each other to understand an external stimulus and generate a response.

In recent years, with the availability of computing power and big data, neural networks have been able to outperform previous methods and have shown that they can be used to understand the content of an image. Although computers are still not as efficient as humans at visually understanding data, neural networks have definitely made significant contributions to bringing computer vision closer to human vision.

Neural networks have led to the emergence of the field called Deep Learning, where researchers use neural networks with deep architectures, i.e. networks that involve multiple hidden layers, to build a model complex enough to capture more detail during learning.


Deep Neural Networks have shown superior performance in Computer Vision tasks like image classification (classifying the type of image), object detection (detecting objects in an image) and image segmentation (understanding which pixels belong to an object in an image).

1.1 Deep Learning and Labelled Data

Supervised learning is a subfield of Machine Learning where a labelled dataset D(X, Y) is used for learning a model, where X represents the data and Y represents the label associated with it.

In Supervised Learning, the training of a model is supervised by the labels associated with the data points. The model's parameters are updated iteratively until it achieves a good accuracy on the training data.

It can be safely said that much of the early success of Deep Learning has been driven by using labelled data to train deep neural networks. ImageNet [2] was one of the datasets that popularized using neural networks on labelled data to achieve results superior to their predecessors. The AlexNet [3] results demonstrated that, given enough labelled data, a deep neural network can outperform all previous machine learning approaches based on hand-crafted features, proving the effectiveness of deep networks.

According to a talk given by Andrew Ng at ExtractConf 2015 [4], "almost all the value today of deep learning is through supervised learning or learning from labelled data".

1.2 What is Unsupervised Deep Learning?

Besides the supervised way of training a deep learning model, there is another way of training the model without relying on any labels associated with the data. This approach is called Unsupervised Learning.

Unsupervised Learning is carried out on a dataset D(X), where X represents a data point in the form of an image, text etc. The data does not have any label or category associated with it. Unsupervised Deep Learning is a widely studied domain in the machine learning and deep learning community and has been used for clustering, dimensionality reduction, anomaly detection, density estimation etc. It has also been used frequently in computer vision applications. For example, the Bag-of-Features model has been used for image classification [5] [6]. Similarly, unsupervised methods have been used for dimensionality reduction. Autoencoders [7] showed that latent variables can be learnt on an unlabelled dataset, thus providing good-quality general-purpose features.

1.3 Motivation behind Unsupervised Deep Learning

The ImageNet challenge [2] is one example where researchers have shown that high accuracies can be achieved by leveraging the capability of neural networks to learn complex features, provided there is enough data to train on.

Most computer vision tasks like classification, object detection, semantic segmentation etc. are solved using Supervised Learning, which comprises the following steps:

• Obtain lots of data and label them,

• Define an objective/loss function,

• Train the network.

But there is an inherent problem in this approach. The first step of obtaining data and manually labelling each and every data point is expensive, and organizations have to pay manual annotators a lot of money before they can start training their models. So the question that arises is: is there any other way of training neural networks without relying on manually annotated labels?

If the answer is yes, there is a possibility to tap into the billions of data points that are being uploaded to the internet in the form of images, videos, audio etc., without needing to manually annotate them.

Unsupervised learning is one of the machine learning subfields that offers this possibility of training neural networks on unlabelled data.

1.4 Research Question and Hypotheses

1.4.1 Challenge

Fashion classification and recognition using computer vision have gained a lot of attention because of the emergence of several e-commerce sites and the shift of human behaviour from offline to online shopping. But before a fashion item is listed on an e-commerce website, many steps are needed to put it into the digital inventory.

Firstly, images of the product captured from several angles need to be obtained.

Then those images need to be categorized by item type, e.g. a dress, a pair of pants, a shirt etc. On top of that, several attributes like design, colour and texture also need to be labelled, e.g. a pink floral dress, a black and white striped cotton dress, a plain blue nylon shirt etc.

The problem does not seem complex if there are only a handful of items to be listed; a human labeller can do it within a few minutes. But at the scale of fashion giants like Amazon and H&M, where there are millions of products listed on their websites and the inventory is updated daily, the situation becomes complex.

These fashion companies often employ human labellers to classify and categorize the images. However, this is a costly, time-consuming and still error-prone process.

Most of the previous substantial work in computer vision research on fashion classification has been done using a supervised approach. So the question that needs to be tackled is whether it is possible to automate this process using just a few labelled data points.

1.4.2 Research Questions

The following two research questions have been addressed in this thesis:

Research Question 1: "Is it possible to use an unsupervised deep learning approach to achieve classification accuracy comparable to that of a fully supervised deep learning model on a fashion dataset?"

To answer this question, prominent unsupervised deep learning approaches need to be compared with a fully supervised model trained on the chosen dataset.

Research Question 2: "Is it better to use a dataset from the same domain (fashion in this case) during both the pretext task and the downstream task, rather than pre-training on a dataset from one domain and using that model to classify a dataset from a different domain during the downstream task?"

To answer this question, the rotation based model was pre-trained on different datasets (both fashion and non-fashion) and then fine-tuned on DeepFashion, and the classification accuracies were compared.


1.4.3 Hypotheses

The hypothesis for Research Question 1 is that although a fully supervised model would perform better, by choosing a suitable unsupervised deep learning approach and a suitable architecture, it is possible to close the gap between an unsupervised and a supervised approach for fashion recognition.

The hypothesis for Research Question 2 is that transfer learning within the same domain (i.e. using fashion datasets) during the pretext and downstream tasks will give superior accuracies compared to pre-training on a dataset from a non-fashion domain.

1.4.4 Contributions

The main contributions of the thesis project are as follows:

• An empirical analysis of the classification accuracies of two implemented unsupervised deep learning approaches versus a fully supervised one.

• A demonstration of the impact of choosing different deep learning architectures on classification accuracy.

• A demonstration of the impact of choosing different datasets during unsupervised pre-training on accuracy.

• A way of performing a qualitative assessment of learnt visual features on a fashion dataset.

1.5 Ethical Considerations and Sustainability

In view of technological advances in AI and Computer Vision systems, there is a need to discuss the social, legal and economic impact of such systems from the perspective of ethics and sustainability.

A Computer Vision system should avoid the pitfalls of bias based on race, gender, and other such attributes. Work by researchers supporting ethical practices in AI has exposed the vulnerability of Computer Vision systems to becoming biased if trained on a highly skewed dataset. For example, Buolamwini et al. [8] showed that current facial recognition systems discriminate on attributes like gender or skin tone, predominantly because the datasets chosen for training the facial recognition models were highly imbalanced in gender and skin type. As a result, the recognition systems gave the lowest classification error on light-skinned males and the highest error on dark-skinned females.

Another property that a good Computer Vision system should have is transparency and explainability. A model should not be a black box: the developers and end users of the system should be able to interpret its results in a transparent manner. This also helps to ensure that the model works in an unbiased manner.

In terms of sustainability, an advanced and accurate Computer Vision system can foster innovation and contribute to efficient resource utilization by fully or partially automating manual tasks. That aligns with Goal 9 (Industry, Innovation and Infrastructure) of the UN's Sustainable Development Goals 2015-2030 [9]. Similarly, an unbiased Computer Vision system that does not discriminate on the basis of gender, race, ethnicity etc. is in line with Goal 5 (Gender Equality) and Goal 10 (Reduced Inequalities) of the same goals.

In this project, the DeepFashion dataset has been chosen, which is an open dataset available for research. While the dataset contains pictures representing mostly fair-skinned male and female models wearing fashion items, no attention has been paid to the gender or skin type of the persons in the images. The sole goal of the trained models is to classify the fashion item present in the image.

In terms of interpretability of results, the Grad-CAM analysis (Section 3.6.2 and Section 4.4) helps in understanding which parts of the images the CNN model was focusing on during training and testing. The analysis clearly shows that the faces of the models were not contributing to the classification results.

In terms of sustainability, an automated classifier in the fashion and e-commerce domain can lead to efficient use of resources and innovation in the sector, thus contributing towards Goal 9 discussed above. However, one possible side-effect of such automated fashion classifier systems is that they might replace the jobs of human annotators doing manual labelling of data in fashion and e-commerce organizations. This indicates the need to prepare this segment of workers to acquire other advanced skills so that they can transition to other roles to support their livelihoods.


Chapter 2

Related Work

2.1 Transfer Learning for Computer Vision tasks

Transfer learning is a subdomain of deep learning where the features learnt on a dataset D1 during a task T1 can be transferred to another dataset D2 for a task T2.

The main difference between transfer learning and traditional learning is that in traditional learning, the training for task T1 is isolated from a different task T2. Figure 2.1 and Figure 2.2 visually depict the difference between the two approaches.

In traditional learning there is no knowledge transfer between two tasks, and the models need to be trained from scratch for each task individually. This works well when there is enough labelled data for the task T2 for which a reliable model needs to be developed, but it fails when there is not enough labelled data.

That is where transfer learning can be leveraged: by training a model on some initial task T1 and using the knowledge learnt in that task to create a model for a task T2 for which there is not enough labelled data.


Figure 2.1: Traditional Learning [10]

Figure 2.2: Transfer Learning [10]

Currently, the most widely used neural network architecture in Computer Vision is the Convolutional Neural Network, or CNN [11].


Previous works have shown that the pre-training phase of CNNs is a fundamental step in solving Computer Vision tasks like classification, object detection and segmentation [12] [13].

2.1.1 Extracting features from a pre-trained CNN

The standard practice in computer vision tasks is to use a pre-trained CNN and transfer its knowledge to the task at hand.

Figure 2.3: Transfer Learning in a CNN [14]

A CNN is typically trained on a dataset like ImageNet during task T1, and the features learnt up to a certain layer are extracted. These features are then combined with a classifier layer based on the final task T2. Figure 2.3 is an example of a transfer learning approach used with CNN architectures. The idea is that the features learnt during the pre-training phase are general-purpose features and can be transferred easily. Generally, the classifier layer, or the last few convolutional layers in addition to the classifier layer, are fine-tuned during the latter task T2.
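As an illustration of this practice, the following is a minimal PyTorch sketch of using a pre-trained CNN as a fixed feature extractor; torchvision's AlexNet stands in here for an arbitrary pre-trained network, and the batch of random images is a placeholder:

import torch
from torchvision import models

# Reuse the convolutional stack of a pre-trained network as a fixed
# feature extractor; a new task-specific classifier can then be
# trained on the extracted features.
cnn = models.alexnet(pretrained=True)
cnn.eval()

images = torch.randn(8, 3, 224, 224)  # placeholder batch of 8 images
with torch.no_grad():
    feats = cnn.features(images)                  # conv features: (8, 256, 6, 6)
    feats = torch.flatten(cnn.avgpool(feats), 1)  # flattened: (8, 9216)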

2.2 Supervised approaches for Fashion Classification

Most of the work in computer vision research related to fashion has been done employing supervised approaches. Most of the previous works [15] [16] [17] [18] [19] rely on training models in a supervised way on published, labelled datasets like DeepFashion [20]. The original DeepFashion paper [20] proposed a model called FashionNet that is trained in a supervised way and jointly learns features by predicting landmarks and attributes.


The most recent work [16], which achieves the highest accuracy published so far, uses a two-stream network that combines the tasks of landmark detection and attribute/category classification. The proposed deep convolutional network was trained in a supervised way.

2.3 Semi-supervised approaches for Computer Vision tasks

Semi-supervised learning is a class of machine learning that uses a subset of labelled data in combination with unlabelled data to train a model. A recent paper [21] shows superior results on the ImageNet benchmark by employing a teacher/student paradigm, where the teacher is trained on a labelled dataset and performs prediction and sampling on a huge unlabelled dataset, which is then used to pre-train a student model.

2.4 Unsupervised approaches for Computer Vision tasks

2.4.1 Clustering based approach

This approach is based on the idea of pre-training CNNs by applying a clustering method to their intermediate features.

Coates et al. [22] used the k-means clustering method to pre-train CNNs; the idea was to learn each sequential layer in a bottom-up approach. Yang et al. [23] used a recurrent framework to learn deep representations for a CNN which could then be transferred to other tasks; however, the published work did not explore bigger datasets like ImageNet and was demonstrated on datasets on the order of 10k images. Bojanowski et al. [24] trained a CNN end-to-end without any supervision, using a loss similar to k-means or discriminative clustering. Similarly, Caron et al. [25] (DeepCluster) pre-trained a CNN by iteratively clustering the intermediate features and used the cluster assignments as pseudo-labels to provide supervision for training.

2.4.2 Self Supervised approach

Self-supervised learning is a form of unsupervised learning where the data provides the supervision.


As a general strategy, some part of the data is withheld and the neural network is tasked with predicting it. A loss is defined for this proxy task, and the objective is to learn some meaningful representations in the process of solving it, so that they can be transferred to the final task. In the terminology commonly used in the self-supervised research domain, there are two types of tasks:

• Pretext task: the proxy task used to learn representations on an unlabelled dataset, e.g. predicting the rotation angle of an image.

• Downstream task: the final task to which the representations learnt in the pretext task are transferred. An example of a downstream task is classification or object detection on a given labelled fashion dataset.

Figure 2.4 depicts the general strategy of knowledge transfer used in the field of Self-Supervised Learning.


Figure 2.4: Self-Supervised Learning Pipeline [26]


Self Supervision in Images

This kind of approach is generally used when the pretext task is trained on static image datasets.

Doersch et al. [27] was one of the initial works that sparked further work on self-supervision in images. The idea was to extract random patches from a given image and task the network with predicting the position of the second patch relative to the first. Figure 2.5 shows an example of how the input data and its label are constructed.

Figure 2.5: Learning By Context Prediction [27]

The architecture used was a Siamese [28] network where each CNN was fed one patch and the network's output was a classification over 8 possible locations. Figure 2.6 shows a high-level architecture of the network used.


Figure 2.6: CNN Architecture for Context Prediction

Building on this work, Noroozi et al. [29] used a pretext task of predicting the correct order of a permutation of patches in a jigsaw puzzle, and showed improved accuracies over the previous work. Figure 2.7 shows the Siamese-based CNN architecture used in this work.

Figure 2.7: Siamese CNN for solving Jigsaw Puzzles [29]


Another work, by Dosovitskiy et al. [30], used the data itself as a signal via exemplar networks. The idea was to perturb original image patches with cropping/affine transformations and ask the network to classify the exemplar images as the same class.

Gidaris et al. [31] used rotation as a pretext task where the image was rotated by a predefined set of angles and the objective of the neural network was to predict the angle. This approach is the best performing method so far for self-supervision in images.

Self Supervision in Videos

This kind of supervision uses unlabelled videos during the pretext task. As a general strategy, various approaches rely on video sequence order, video tracking etc.

Wang et al. [32] used video tracking to find corresponding pairs, i.e. the first and last frames of a tracked object. The network used was a Siamese-triplet network with a ranking-based loss function to differentiate closer pairs from farther random pairs. Figure 2.8 shows this way of self-supervision in videos.


Figure 2.8: Unsupervised tracking in videos [32]

Misra et al. [33] give a sequence of frames to the CNN and task it with predicting whether the temporal order of the frames is correct or not. Figure 2.9 depicts some examples of frames in correct as well as incorrect order. With this simple task, they show that a CNN can learn meaningful representations which can then be transferred to other downstream tasks.


Figure 2.9: Predicting correct temporal order [33]

2.4.3 Generative models based approach

Autoencoders and GANs [34] are unsupervised learning approaches that learn a mapping between a given input and produced output. Works like Donahue et al. [35] (the BiGAN framework) and Dumoulin et al. [36] show that, using GAN frameworks, representations can be learnt that give results competitive with the previously discussed unsupervised and self-supervised approaches.

A more recent work, Chen et al. [37], is also a GAN-based approach, using a discriminator network with rotation as self-supervision. The role of the discriminator thus becomes to distinguish between fake and real images as well as to predict the rotation. The authors show that this Self-Supervised GAN learns relatively powerful visual representations that compete with self-supervised approaches like context prediction and rotation, and with GAN-based approaches like Donahue et al., on the downstream task of ImageNet classification.


Chapter 3

Methods

3.1 Dataset

Since the focus of this project was fashion classification, the primary dataset used is DeepFashion [20]. The dataset contains images of clothing items with labelled categories like top, dress, skirt, sweater etc. It also has attributes associated with each clothing item in five groups: texture, fabric, shape, style and part. In addition, the dataset provides landmark annotations.

There are around 200k images in the training dataset, 40k images in the validation dataset and around 40k images in the test dataset. The experiments in this project concern category classification, with 46 categories to classify. Some example images from this dataset are shown in Figure 3.1.


Figure 3.1: Example images of different categories and attributes in DeepFashion [38]

There is also a class imbalance among the categories in the dataset, although the train/validation/test datasets have similar-looking distributions. Figure 3.2, Figure 3.3 and Figure 3.4 show the distribution of clothing item categories in the training, validation and test datasets respectively.

Figure 3.2: DeepFashion training dataset (Top 20 categories)


Figure 3.3: DeepFashion validation dataset (Top 20 categories)

Figure 3.4: DeepFashion test dataset (Top 20 categories)

Besides DeepFashion as the primary dataset, several secondary datasets were chosen, primarily to answer Research Question 2 (Section 1.4.2).

The following secondary datasets were chosen from two domains:

• Fashion Domain

1. FashionAI [39]


• Non-Fashion Domain

1. ImageNet (ILSVRC2012) [40]

2. Places365-Standard [41]

3. CelebA [42]

3.2 Implemented Models

Three approaches have been compared for fashion category classification on the DeepFashion dataset:

1. Fully supervised pre-training

2. DeepCluster based pre-training [25]

3. Self Supervised pre-training using Rotation [31]

3.2.1 Fully supervised pre-training

In this approach, a model is trained on a dataset A in a supervised way, i.e. the labels of the dataset are used to train the model. The learnt features are then transferred to train the network on the DeepFashion dataset.

3.2.2 DeepCluster based pre-training

This approach is based on unsupervised learning of features. DeepCluster [25] uses a clustering method to learn the parameters of the neural network. At each epoch, the features of the convolutional network are clustered and the cluster assignments give a pseudo-label for each data point. These pseudo-labels supervise the training of the network by acting as a proxy for the true label of the image. Figure 3.5 depicts the DeepCluster approach.

Figure 3.5: DeepCluster approach [25]
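To make the training procedure concrete, the following is a simplified PyTorch sketch of one DeepCluster epoch. It is an illustration, not the official implementation (which additionally applies PCA and whitening to the features and uses faiss-based k-means); the assumption that both data loaders yield (images, dataset_indices) pairs over the same dataset is also made here for clarity.

import torch
import torch.nn as nn
from sklearn.cluster import KMeans

def deepcluster_epoch(convnet, classifier, feature_loader, train_loader,
                      optimizer, n_clusters=1000, device="cpu"):
    # Step 1: extract features for every image (no labels needed).
    convnet.eval()
    feats, order = [], []
    with torch.no_grad():
        for images, idx in feature_loader:
            feats.append(convnet(images.to(device)).flatten(1).cpu())
            order.append(idx)
    feats = torch.cat(feats).numpy()
    order = torch.cat(order)

    # Step 2: cluster the features; assignments become pseudo-labels.
    assignments = KMeans(n_clusters=n_clusters, n_init=10).fit_predict(feats)
    pseudo = torch.full((len(order),), -1, dtype=torch.long)
    pseudo[order] = torch.from_numpy(assignments).long()

    # Step 3: ordinary supervised training against the pseudo-labels.
    convnet.train()
    criterion = nn.CrossEntropyLoss()
    for images, idx in train_loader:
        logits = classifier(convnet(images.to(device)).flatten(1))
        loss = criterion(logits, pseudo[idx].to(device))
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()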


3.2.3 Self Supervised pre-training using Rotation

This approach is based on using self-supervision to pre-train a network on a dataset A, where the task of the CNN is to predict the rotation angle of an image.

The authors of this work [31] used four geometric transformations of 0, 90, 180 and 270 degrees, and the CNN-based model had to predict the right rotation angle of the image. The authors showed that the learnt features, when transferred to downstream tasks like classification, gave results superior to other self-supervised techniques. Figure 3.6 shows the task of predicting the rotation angle.

Figure 3.6: Self-Supervised task of predicting rotation [31]
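A minimal sketch of how the pretext inputs and labels can be constructed is shown below; the rotate_batch helper is hypothetical (not from the paper's code) and assumes images arrive as a standard (N, C, H, W) tensor:

import torch

def rotate_batch(images: torch.Tensor):
    # Each image is rotated by 0/90/180/270 degrees and labelled with
    # the index of its rotation, giving 4N images and 4N pseudo-labels.
    rotated, labels = [], []
    for k in range(4):  # k 90-degree turns over the spatial dimensions
        rotated.append(torch.rot90(images, k, dims=(2, 3)))
        labels.append(torch.full((images.size(0),), k, dtype=torch.long))
    return torch.cat(rotated), torch.cat(labels)

The pretext model is then trained with an ordinary 4-way cross-entropy loss on the rotated images and their rotation labels.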

3.3 Computational Resources Specification

For carrying out experiments, the following platforms were used:

• Google Colab [43] with GPU specification: Tesla K80 12GB

• RISE's internal GPU Servers with GPU specification: GeForce GTX TITAN X 12GB

• Hopsworks [44] by Logical Clocks [45] with GPU specification: GeForce GTX 1080 Ti 12GB

3.4 Choice of CNN architectures and Hyperparameters

For all the experiments, the two CNN architectures chosen were AlexNet and VGG-16. AlexNet was chosen to provide a baseline for a given approach: it was one of the earliest architectures to show impressive results on ImageNet and is used by several researchers to establish a baseline accuracy in their experiments.

VGG-16 was used to show the effect on accuracy when a much deeper architecture is used in comparison with AlexNet. VGG-16 is much deeper than AlexNet and computationally expensive, but not as expensive as other deep architectures like ResNet or DenseNet, and hence was a reasonable choice given the computational resources available during the project.

For the learning rate, a suitable value was chosen after some experimentation. For the fine-tuning experiments, a learning rate of 0.001 was used and decayed by half every 25 epochs.

For the fully connected block constituting the final classifier layers, the following configuration of layers was used during fine-tuning:

• AlexNet: Linear Layer (512 units), Dropout (0.2), Linear Layer (128 units), Dropout (0.1), Linear Layer (46 units).

• VGG-16: Linear Layer (4096 units), Dropout (0.2), Linear Layer (512 units), Dropout (0.1), Linear Layer (46 units).
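A minimal PyTorch sketch of the AlexNet configuration above is shown below; the ReLU activations, the SGD optimizer and the 256 * 6 * 6 input size (AlexNet's flattened conv output) are assumptions, since the text only specifies the linear and dropout layers and the learning rate schedule:

import torch.nn as nn
import torch.optim as optim

head = nn.Sequential(
    nn.Linear(256 * 6 * 6, 512), nn.ReLU(inplace=True), nn.Dropout(0.2),
    nn.Linear(512, 128), nn.ReLU(inplace=True), nn.Dropout(0.1),
    nn.Linear(128, 46),  # 46 DeepFashion categories
)

# lr = 0.001, halved every 25 epochs, as stated above.
optimizer = optim.SGD(head.parameters(), lr=0.001, momentum=0.9)
scheduler = optim.lr_scheduler.StepLR(optimizer, step_size=25, gamma=0.5)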

3.5 Choice of Deep Learning Framework

PyTorch 1.0 [46] was used for carrying out all the experiments. PyTorch makes it easy to implement deep learning models and is heavily used by academic researchers. It also provides models pre-trained on ImageNet [47] for different CNN architectures, which is helpful for carrying out experiments.


3.6 Evaluation Metrics

3.6.1 Quantitative

Since the task for all the quantitative experiments was category classification, top-k accuracy was used as an evaluation metric.

Top-k accuracy means that a prediction is considered correct if the actual class appears among the top k predictions. Both the pretext and downstream tasks used this metric.
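For reference, a generic sketch of how top-k accuracy can be computed in PyTorch (an illustration, not the project's evaluation code):

import torch

def topk_accuracy(logits: torch.Tensor, targets: torch.Tensor, k: int) -> float:
    # Fraction of samples whose true class is among the k highest-scoring
    # predictions; logits is (N, num_classes), targets is (N,).
    topk = logits.topk(k, dim=1).indices
    hits = (topk == targets.unsqueeze(1)).any(dim=1)
    return hits.float().mean().item()

# Example: top-3 accuracy on a batch of 4 samples over 5 classes.
logits = torch.randn(4, 5)
targets = torch.tensor([0, 2, 1, 4])
print(topk_accuracy(logits, targets, k=3))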

3.6.2 Qualitative

For qualitative analysis of the results, an approach called Grad-CAM [48] was used. Grad-CAM is a generalization of the CAM approach laid out in [49] that supports any deep learning architecture, and hence was suitable for this project.

The idea behind this approach is to understand which portion of the image the convolutional neural network is actually focusing on. This is achieved by generating a heat map overlaid on the actual image, showing which portion of the image the network was paying attention to when it classified the image into a certain category.

Class Activation Mapping (CAM) [49] uses global average pooling on the final convolutional layer. For each activation map in the final layer, global average pooling outputs one feature. These features then feed a fully connected layer that is connected to the final softmax layer. To generate a class activation mapping, the weights for a given category are projected back onto the individual activation maps of the final convolutional layer, and the activation maps weighted by these projected weights are summed to give the final mapping. Figure 3.7 shows the CAM approach.


Figure 3.7: Class Activation Mapping [49]

However, the CAM approach is only suitable for CNN architectures without fully connected layers. Gradient-weighted Class Activation Mapping (Grad-CAM) overcomes this limitation and supports any CNN-based architecture, i.e. with or without fully connected layers. Grad-CAM achieves this by propagating the class-specific gradients back to the last convolutional feature maps.

Figure 3.8: Grad-CAM [48]

Figure 3.8, taken from the original paper, shows the Grad-CAM approach. The image is first forward-propagated through the network, producing scores for each class. To compute the heatmap for a target class, all gradients except the one for the target class, which is set to 1, are set to 0. The gradients for the target class are back-propagated to the last convolutional layer, and the heatmap is generated by weighting the activation maps by weights obtained from global average pooling of these back-propagated gradients.
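To make the procedure concrete, the following is a minimal generic Grad-CAM sketch in PyTorch (using the modern backward-hook API, not the implementation used in this project; last_conv is assumed to be the last convolutional module of the network):

import torch
import torch.nn.functional as F

def grad_cam(model, last_conv, image, target_class):
    # image is a (1, C, H, W) tensor; returns a normalized (1, h, w) heatmap.
    acts, grads = {}, {}
    h1 = last_conv.register_forward_hook(lambda m, i, o: acts.update(a=o))
    h2 = last_conv.register_full_backward_hook(
        lambda m, gi, go: grads.update(g=go[0]))

    scores = model(image)               # forward pass: per-class scores
    model.zero_grad()
    scores[0, target_class].backward()  # backprop only the target class score
    h1.remove(); h2.remove()

    weights = grads["g"].mean(dim=(2, 3), keepdim=True)  # GAP of gradients
    cam = F.relu((weights * acts["a"]).sum(dim=1))       # weighted sum of maps
    return cam / (cam.max() + 1e-8)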


Chapter 4

Experiments and Results

This chapter gives details about the various experiments carried out and the analysis of their results.

Three main empirical experiments were carried out to answer Research Questions 1 and 2 posed in Section 1.4.2. The first two experiments report DeepFashion test dataset classification accuracies of different models pre-trained on ImageNet (with or without labels, depending on the approach) and then fine-tuned on the DeepFashion training dataset. The third experiment shows the impact of choosing different datasets during pre-training and then fine-tuning on a smaller fraction of labelled images (the DeepFashion validation dataset) compared to the first two experiments. While these three experiments are primarily quantitative, a qualitative analysis using Grad-CAM is shown at the end.

4.1 Experiment 1

4.1.1 Objective

This experiment had two objectives:

1. To find out which of the approaches listed in Section 3.2 performs best on the DeepFashion dataset for category classification.

2. To find out how the performance changes when the CNN architecture changes.


4.1.2 CNN architecture used

For the architectures, AlexNet [3] and VGG-16 [50] with batch normalization were used. The choice was made based on the available compute power and to demonstrate the impact of choosing a relatively deeper architecture.

4.1.3 Pretext task: Classification on ImageNet dataset

First, a model (from Section 3.2) was pre-trained on the ImageNet dataset. Figure 4.1 shows the layers in the AlexNet model that were trained during this task.

Figure 4.1: PRETEXT TASK: ImageNet Classification

For the ImageNet Supervised model (Section 3.2.1), training was done using the actual labels of the ImageNet dataset: the task of the model was to predict the actual class of the image, given by the label. The official pre-trained models provided by PyTorch [47] were used.

For the DeepCluster model (Section 3.2.2), training was done on the ImageNet dataset without using any labels: the task of the model was to predict the pseudo-label of the image, obtained by clustering the features. The implementation used the official code [51] released by the authors of the paper.

For the Rotation model (Section 3.2.3), training was done on the ImageNet dataset without using any labels: the task of the model was to predict the rotation angle of the image. The implementation used the official code [52] released by the authors of the paper.


4.1.4 Downstream task: Clothing item category classification on DeepFashion test

The learnt model was then fine-tuned on the DeepFashion training dataset.

Taking the pre-trained model, the conv layers were frozen and a fully connected block with 3 linear layers was attached so that the final linear layer output 46 values. A softmax layer was used to give the final output. The following configuration of fully connected layers was used during fine-tuning:

• AlexNet: Linear Layer (512 units), Dropout (0.2), Linear Layer (128 units), Dropout (0.1), Linear Layer (46 units).

• VGG-16: Linear Layer (4096 units), Dropout (0.2), Linear Layer (512 units), Dropout (0.1), Linear Layer (46 units).

Figure 4.2 shows the layers fine-tuned in the AlexNet model for the downstream task.

Figure 4.2: DOWNSTREAM TASK: DeepFashion Clothing Item Category Classification
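A minimal sketch of this setup in PyTorch is shown below; torchvision's AlexNet stands in for whichever pre-trained model (supervised, DeepCluster or Rotation) is being fine-tuned, so only the new classifier's parameters remain trainable:

import torch.nn as nn
from torchvision import models

model = models.alexnet(pretrained=True)
for param in model.features.parameters():
    param.requires_grad = False  # conv layers stay frozen

model.classifier = nn.Sequential(  # the 3-linear-layer block listed above
    nn.Linear(256 * 6 * 6, 512), nn.ReLU(inplace=True), nn.Dropout(0.2),
    nn.Linear(512, 128), nn.ReLU(inplace=True), nn.Dropout(0.1),
    nn.Linear(128, 46),
)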

4.1.5 Implementation Steps

• Three different models were pre-trained on ImageNet with the AlexNet architecture:

– ImageNet Supervised (Section 3.2.1)
– DeepCluster (Section 3.2.2)
– Rotation (Section 3.2.3)


• The fully connected layers were then fine-tuned for 50 epochs on the DeepFashion training dataset, keeping the model with the lowest error on the DeepFashion validation dataset.

• Accuracies were then measured on the DeepFashion test dataset.

• The above steps were then repeated for the VGG-16 architecture.

4.1.6 Results

Experiment 1

Method                 AlexNet (Top-1 / Top-3 / Top-5)   VGG-16 (Top-1 / Top-3 / Top-5)
ImageNet Supervised    56.34 / 79.09 / 87.27             60.22 / 82.12 / 89.60
DeepCluster            52.29 / 75.24 / 84.83             58.18 / 79.75 / 87.96
Rotation               55.74 / 78.51 / 87.30             59.56 / 81.02 / 88.30

Table 4.1: Category classification accuracies on the DeepFashion test dataset. All Top-k accuracies are percentages. Each row corresponds to a different model pre-trained with the AlexNet/VGG-16 architecture on the ImageNet dataset and then fine-tuned (fully connected layers) on the DeepFashion training dataset.

As expected, the fully supervised model performs best, but the rotation model comes close in accuracy. This shows that even an unsupervised model can be effective at learning meaningful representations which can then be transferred to other tasks.

Also, the VGG-16 vs AlexNet results show that a deeper architecture can be expected to perform better than a shallower one.

4.2 Experiment 2

4.2.1 Objective

This experiment had the following objective:

• To find out whether fine-tuning more layers than just the fully connected layers leads to an improvement in accuracy.


4.2.2 CNN architecture used

Due to computational resource constraints, this experiment was carried out only for the AlexNet architecture.

4.2.3 Pretext task: Classification on ImageNet dataset

First, a model (from Section 3.2) was pre-trained on the ImageNet dataset. Figure 4.3 shows the layers in the AlexNet model that were trained during the pretext task.

Figure 4.3: PRETEXT TASK: ImageNet Classification

For the ImageNet Supervised model (Section 3.2.1), training was done using the actual labels of the ImageNet dataset: the task of the model was to predict the actual class of the image, given by the label. The official pre-trained models provided by PyTorch [47] were used.

For the DeepCluster model (Section 3.2.2), training was done on the ImageNet dataset without using any labels: the task of the model was to predict the pseudo-label of the image, obtained by clustering the features. The implementation used the official code [51] released by the authors of the paper.

For the Rotation model (Section 3.2.3), training was done on the ImageNet dataset without using any labels: the task of the model was to predict the rotation angle of the image. The implementation used the official code [52] released by the authors of the paper.


4.2.4 Downstream task: Clothing item category classification on DeepFashion test

The learnt model was then fine-tuned on the DeepFashion training dataset.

Taking the pre-trained model, the conv1, conv2 and conv3 layers were frozen, while the conv4 and conv5 layers, along with the fully connected layers, were fine-tuned.

The following configuration of fully connected layers was used during fine-tuning:

• AlexNet: Linear Layer (512 units), Dropout (0.2), Linear Layer (128 units), Dropout (0.1), Linear Layer (46 units).

• VGG-16: Linear Layer (4096 units), Dropout (0.2), Linear Layer (512 units), Dropout (0.1), Linear Layer (46 units).

Figure 4.4 shows the layers in the AlexNet model that were fine-tuned for the downstream task.

Figure 4.4: DOWNSTREAM TASK: DeepFashion Clothing Item Category Classification
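A minimal sketch of the partial freezing in PyTorch is shown below; in torchvision's AlexNet, features[8] and features[10] hold the conv4 and conv5 weights, so that index mapping is an assumption made here for illustration:

from torchvision import models

model = models.alexnet(pretrained=True)
for i, module in enumerate(model.features):
    for param in module.parameters():
        # Freeze conv1-conv3; leave conv4 (index 8) and conv5 (index 10)
        # trainable along with the fully connected block.
        param.requires_grad = i >= 8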

4.2.5 Implementation Steps

• Three different models were pre-trained on ImageNet with the AlexNet architecture:

– ImageNet Supervised (Section 3.2.1)
– DeepCluster (Section 3.2.2)
– Rotation (Section 3.2.3)


• The conv4 + conv5 + fully connected layers of the AlexNet model were then fine-tuned for 100 epochs on the DeepFashion training dataset, keeping the model with the lowest error on the DeepFashion validation dataset.

• Accuracies were then measured on the DeepFashion test dataset.

4.2.6 Results

Experiment 2 (AlexNet)

Method                 Top-1 / Top-3 / Top-5
ImageNet Supervised    62.93 / 83.69 / 90.78
DeepCluster            60.82 / 82.07 / 89.61
Rotation               62.94 / 83.81 / 90.78

Table 4.2: Category classification accuracies on the DeepFashion test dataset. All Top-k accuracies are percentages. Each row corresponds to a different model pre-trained with the AlexNet architecture on the ImageNet dataset and then fine-tuned (conv4 + conv5 + fully connected layers) on the DeepFashion training dataset.

The results here are quite interesting. With more layers fine-tuned, the accuracies improved for all models, and surprisingly the rotation model performed as well as the fully supervised model. The general increase in accuracy may be because the last few convolutional layers were initially overfitting to the pre-training task, and fine-tuning them had a regularizing effect.

4.3 Experiment 3

This experiment involved only the rotation model (Section 3.2.3). This model was chosen specifically because it was the best performing unsupervised model in the previous two experiments. In addition, the goal of this experiment was to see how different settings, i.e. different datasets during pre-training and fewer labels during fine-tuning, affect the classification accuracy of an unsupervised model.


4.3.1 Objective

The objectives of the experiment were as follows:

1. To validate the hypothesis for Research Question 2 (stated in Section 1.4.2 and 1.4.3)

2. To check the performance of the model when it is trained on just a few labelled data points of the DeepFashion dataset. This is important because, in any given domain, we might be able to acquire a significant amount of data while only a fraction of it can be labelled in a reasonable time. So it was important to see whether the unlabelled data, along with the few labelled data points, could be leveraged to classify unseen data.

4.3.2 CNN architecture used

Due to computational resource constraints, this experiment was carried out only for the AlexNet architecture.

4.3.3 Pretext task: Rotation Classification on a given dataset

The Rotation model (Section 3.2.3) was trained on different datasets without using any labels; the task of the model during training was to predict the rotation angle of the image. The implementation used the official code [52] released by the authors of the paper. Figure 4.5 shows the layers trained in the AlexNet model during the pretext task.


Figure 4.5: PRETEXT TASK: Rotation Angle Classification

The following datasets were used, without using any labels:

• ImageNet

• DeepFashion training dataset (full and its subsets)

• FashionAI [39]

• CelebA [42]

• Places365-Standard [41]

4.3.4 Downstream task: Clothing item category classification on DeepFashion test

The model learnt from the pretext task was then fine-tuned on the DeepFashion validation dataset.

Taking the pre-trained model, the conv layers were frozen and a fully connected block with 3 linear layers was attached so that the final linear layer output 46 values. A softmax layer was used to give the final output. The following configuration of fully connected layers was used during fine-tuning:

• AlexNet: Linear Layer (512 units), Dropout (0.2), Linear Layer (128 units), Dropout (0.1), Linear Layer (46 units).

• VGG-16: Linear Layer (4096 units), Dropout (0.2), Linear Layer (512 units), Dropout (0.1), Linear Layer (46 units).


Figure 4.6 shows the layers in the AlexNet model that were fine-tuned during the downstream task.

Figure 4.6: DOWNSTREAM TASK: DeepFashion Clothing Item Category Classification

4.3.5 Implementation Steps

• The rotation model was trained for 50 epochs with the AlexNet architecture on the following datasets, without using any labels:

– ImageNet
– DeepFashion training dataset (full and its subsets)
– FashionAI [39]
– CelebA [42]
– Places365-Standard [41]

• The fully connected layers were then fine-tuned for 50 epochs on the DeepFashion validation dataset, keeping the model with the lowest error.

• Accuracies were then measured on the DeepFashion test dataset.


4.3.6 Results

Experiment 3

Dataset                      #images   Top-1   Top-3   Top-5
ImageNet                     ~1.2M     52.11   75.30   84.98
FashionAI                    ~160K     52.19   75.33   85.10
CelebA                       ~182K     45.98   69.74   80.62
Places365-Standard           ~1.8M     51.90   75.11   84.47
DeepFashion train            200K      52.58   75.82   85.29
DeepFashion train subset 1   100K      47.24   71.67   82.62
DeepFashion train subset 2   50K       45.45   69.89   80.98

Table 4.3: Category classification accuracies on the DeepFashion test dataset. All Top-k accuracies are percentages. Each row corresponds to the rotation model pre-trained on the given dataset and then fine-tuned on the DeepFashion validation dataset.

Comparing these results with Experiment 1, the first observation is that, although only one-fifth of the data was used for fine-tuning (i.e. the 40K images of the validation dataset instead of the 200K images of the training dataset), the accuracy on the DeepFashion test dataset dropped by only ~3-4% (comparing with the Experiment 1 results, since only fully connected layers were fine-tuned in both cases). This shows the possibility of achieving realistic accuracies using relatively few labelled fashion images.

Secondly, we can see that as long as the dataset is not too specific, e.g. CelebA (a dataset containing faces of people), the rotation model can learn quite good transferable features.

Although the results for the fashion datasets (DeepFashion and FashionAI) were expected to be significantly higher, under the hypothesis that it is better to use datasets from the same domain during the pretext and downstream tasks, that is not the case, as ImageNet and Places365 show very close results. Hence, it seems safer to say that it is better to use a pretext-task dataset on which the model has the opportunity to learn a variety of features, rather than features overly specific to one domain like faces.


4.4 Grad-CAM visualizations

For qualitative analysis, Grad-CAM visualizations were produced using the AlexNet rotation model trained in Experiment 2. Figure 4.7 shows visualizations on some images from the DeepFashion test dataset.

Figure 4.7: Grad-CAM visualizations on the DeepFashion test dataset

The figure above gives a glimpse of where the network was failing. Since the dataset assigns a single labelled category per image, the network was sometimes confused by images showing the full body of a person wearing multiple clothing items.

For example, in Figure 4.7 (a), the person is wearing a blouse and shorts, and the image had the assigned label of blouse, but the CNN focused just on the shorts. In terms of training samples for these categories, there were ~17.7K images for blouse and ~14.1K images for shorts. Since the sample sizes for both categories are similar, it is hard to decipher why the model chose one category over the other.

The case was different in Figure 4.7 (b), which contains a person wearing a hoodie and jeans. There were ~5.1K training samples for jeans versus around ~2.9K images for hoodie. Since the number of jeans images shown to the model during training was almost twice the number of hoodie images, the model may have learnt to detect jeans much more dominantly than hoodies, leading to a misclassification of the image that was labelled as hoodie but contained both a hoodie and jeans.

In Figure 4.7 (c), it was relatively easy for the network to identify the category correctly, since the major portion of the picture contained one clothing item.


Overall, these results suggest that it is better to train the model with multiple labels per image so that the CNN looks at multiple regions of the image. Another point that came out of this analysis is that it is important to visualize the outputs of CNNs rather than just evaluating them numerically in terms of accuracy. This helps in understanding where the network is failing and which parts of the image it uses to discriminate when making decisions.


Chapter 5

Discussion

5.1 Limitations and Challenges

5.1.1 Dataset issues

• One of the issues present in the dataset is the class imbalance problem, which led to most of the denser classes having better predictions than others. Weighting the cross-entropy loss using its weight parameter [53] was attempted, but that did not affect the accuracies. This data imbalance challenge has not been discussed in previously published works using this dataset, so ways to tackle it still need to be researched.

• Another issue observed from the Grad-CAM visualizations is that the images would need multiple labels, since several categories of clothing can be present in a given image.

• Also, some of the images were of poor quality or contained irrelevant content, and it is hard to detect programmatically which images are irrelevant.

5.1.2 Choosing hyperparameters

It was observed that identifying a good learning rate and the number of fully connected layers is largely a trial-and-error process and can be time-consuming. As a general practice during the pre-training phase, it is better to use a relatively high learning rate like 0.01 for relatively shallow networks like AlexNet, while deeper architectures need a much lower learning rate. Also, while fine-tuning during the downstream task, it was observed after some experimentation that better results were obtained if training started with lower learning rates on the order of 10^-3 and gradually decayed them after some epochs.

5.1.3 Computational resources

Having access to computational resources like disk space and GPUs was a challenge. For example, training on the ImageNet dataset required around 150 GB of disk space and multiple GPUs across the different models. Also, because of this constraint, the variance in the accuracies achieved by the different models used in the experiments could not be measured and reported.

5.2 Applications and Future Work

5.2.1 Future Work

As part of future work, the following extensions could be made:

• Deeper CNN architectures like ResNet [54], trained using the presented unsupervised approaches, could be evaluated on DeepFashion. One of the recent papers [55] that demonstrates the highest accuracy on DeepFashion classification uses a ResNet-50 model trained in a fully supervised way, so a comparison against a ResNet-50 trained in an unsupervised way would be interesting.

• Other self-supervision approaches based on images/videos could also be compared with the current results.

• The current work is limited to category classification using unsupervised approaches, but it could be extended to attribute classification and landmark detection, since the DeepFashion dataset provides this extra information about the images.

5.2.2 Applications

An application of the approaches shown in this project could be a fashion item category classifier used by a fashion e-commerce team, trained on a large dataset consisting mostly of unlabelled images along with a few data points labelled by human annotators. This would reduce the effort, time, and resources the fashion firm would otherwise spend manually annotating the entire dataset.

Also, as a natural extension of this work, classifiers that perform attribute classification could be built, i.e. to identify whether a given image of a shirt contains a solid-colour or a striped shirt. Since the dataset contains attribute annotations, this would only require modifying the network architecture to predict multiple attributes, as sketched below. Such functionality would help fashion companies perform much more fine-grained digital inventory management without relying on humans to label the attributes, which can otherwise be a cumbersome and error-prone process.
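A minimal sketch of such a multi-attribute head, assuming PyTorch and an AlexNet backbone; attribute prediction is multi-label, so each attribute gets an independent sigmoid output trained with binary cross-entropy. The attribute count matches DeepFashion's 1,000 attribute annotations, but the rest of the setup is illustrative:

import torch
import torch.nn as nn
import torchvision

NUM_ATTRIBUTES = 1000  # DeepFashion annotates 1,000 attributes

backbone = torchvision.models.alexnet()
# Swap the final category classifier for a multi-attribute output layer.
backbone.classifier[6] = nn.Linear(4096, NUM_ATTRIBUTES)

# One independent binary decision per attribute, e.g. "striped", "solid".
criterion = nn.BCEWithLogitsLoss()

images = torch.randn(2, 3, 224, 224)
attribute_targets = torch.randint(0, 2, (2, NUM_ATTRIBUTES)).float()
loss = criterion(backbone(images), attribute_targets)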

(53)

Chapter 6

Conclusions

The scope of the project was to explore the effectiveness of unsupervised deep learning models in fashion classification in comparison to their supervised counterparts.

The quantitative experiments showed four major things. Firstly, unsupervised approaches like DeepCluster or Rotation Self-Supervision can reach accuracies quite close to, i.e. ~1-4 % lower than, a fully supervised approach. Secondly, deeper architectures tend to give better accuracies. Thirdly, retraining the last few convolutional layers during fine-tuning tends to increase accuracy, irrespective of whether the model was pre-trained in a supervised or unsupervised way. Finally, a rotation model that is pre-trained in an unsupervised way and then fine-tuned on a much smaller dataset is only ~3-4 % less accurate than a fully supervised model with a similar fine-tuning procedure but trained on a five times larger labelled dataset during the fine-tuning phase.

Also, the qualitative analysis demonstrated that visualizing where the CNN model focuses on the image can be helpful for interpreting where the model works well and where it misclassifies.

Overall, the experiments showed that unsupervised approaches can be used to overcome the limitation of requiring labelled data and achieve comparable, if not superior, results to fully supervised deep learning for fashion classification. However, more research effort is required to improve the efficacy of unsupervised deep learning; this would help us tap into the huge amounts of data generated through the internet, sensors, and other devices.

Also, a key conclusion from this project is that Self-Supervised Learning shows great potential for bridging the gap between supervised and unsupervised deep learning. Therefore, more research effort in this field, e.g. designing better pretext tasks, can help move forward in that direction.


Bibliography

[1] David H Hubel and Torsten N Wiesel. “Receptive fields, binocular interaction and functional architecture in the cat’s visual cortex”. In: The Journal of Physiology 160.1 (1962), pp. 106–154.

[2] ImageNet Large Scale Visual Recognition Competition (ILSVRC). http://www.image-net.org/challenges/LSVRC/.

[3] Alex Krizhevsky, Ilya Sutskever, and Geoffrey E Hinton. “Imagenet classification with deep convolutional neural networks”. In: Advances in Neural Information Processing Systems. 2012, pp. 1097–1105.

[4] Andrew Ng’s slides at Extract Data Conference | SlideShare. https://www.slideshare.net/ExtractConf.

[5] Eric Nowak, Frédéric Jurie, and Bill Triggs. “Sampling strategies for bag-of-features image classification”. In: European Conference on Computer Vision. Springer, 2006, pp. 490–503.

[6] Marcin Marszałek et al. “Learning object representations for visual object class recognition”. In: Visual Recognition Challenge Workshop, in conjunction with ICCV. 2007.

[7] Geoffrey E Hinton and Ruslan R Salakhutdinov. “Reducing the dimensionality of data with neural networks”. In: Science 313.5786 (2006), pp. 504–507.

[8] Joy Buolamwini and Timnit Gebru. “Gender shades: Intersectional accuracy disparities in commercial gender classification”. In: Conference on Fairness, Accountability and Transparency. 2018, pp. 77–91.

[9] About the Sustainable Development Goals - United Nations Sustainable Development. https://www.un.org/sustainabledevelopment/sustainable-development-goals/.

[10] Transfer Learning - Machine Learning’s Next Frontier. http://ruder.io/transfer-learning/.


[11] Convolutional neural network - Wikipedia. https://en.wikipedia.org/wiki/Convolutional_neural_network.

[12] Shaoqing Ren et al. “Faster R-CNN: Towards real-time object detection with region proposal networks”. In: Advances in Neural Information Processing Systems. 2015, pp. 91–99.

[13] Jonathan Long, Evan Shelhamer, and Trevor Darrell. “Fully convolutional networks for semantic segmentation”. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. 2015, pp. 3431–3440.

[14] Transfer learning in CNN. https://cdn-images-1.medium.com/max/2000/1*qfQ3hmHLwApXZBN-A85r8g.png.

[15] Mariya I Vasileva et al. “Learning type-aware embeddings for fashion compatibility”. In: Proceedings of the European Conference on Computer Vision (ECCV). 2018, pp. 390–405.

[16] Peizhao Li et al. “Two-Stream Multi-Task Network for Fashion Recognition”. In: arXiv preprint arXiv:1901.10172 (2019).

[17] Kota Hara, Vignesh Jagadeesh, and Robinson Piramuthu. “Fashion apparel detection: the role of deep convolutional neural network and pose-dependent priors”. In: 2016 IEEE Winter Conference on Applications of Computer Vision (WACV). IEEE, 2016, pp. 1–9.

[18] Beatriz Quintino Ferreira et al. “A Unified Model with Structured Output for Fashion Images Classification”. In: arXiv preprint arXiv:1806.09445 (2018).

[19] Wenguan Wang et al. “Attentive fashion grammar network for fashion landmark detection and clothing category classification”. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. 2018, pp. 4271–4280.

[20] Ziwei Liu et al. “DeepFashion: Powering robust clothes recognition and retrieval with rich annotations”. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. 2016, pp. 1096–1104.

[21] I Zeki Yalniz et al. “Billion-scale semi-supervised learning for image classification”. In: arXiv preprint arXiv:1905.00546 (2019).

[22] Adam Coates and Andrew Y Ng. “Learning feature representations with k-means”. In: Neural Networks: Tricks of the Trade. Springer, 2012, pp. 561–580.
