Applying Similarity Condition Embedding Network to an industrial fashion dataset


VIKTOR TÖRNEGREN

KTH ROYAL INSTITUTE OF TECHNOLOGY


Master in Machine Learning
Date: September 15, 2020
Supervisor: Anastasiia Varava from KTH
Supervisor: Stefan Gudmundsson from H&M
Examiner: Danica Kragic
School of Electrical Engineering and Computer Science
Host company: Hennes & Mauritz (H&M) AB


Abstract


Sammanfattning


Acknowledgments

I would sincerely like to thank my industrial supervisor Stefan Gudmundsson for his encouraging help and suggestions. I would also like to thank my academic supervisor Anastasiia Varava for taking the time to read my report and respond with useful feedback. In addition, I am very grateful that H&M provided me with all the resources I needed to complete my master thesis. Last but not least, a big thank you to my master thesis colleagues at H&M, Georgios Deligiorgis and Lingxi Xiong, for taking the time to have discussions and for offering useful suggestions in times of need.

Contents

1 Introduction
2 Background
2.1 Theory
2.1.1 Convolutional Neural Networks
2.1.2 ResNet
2.1.3 Triplet Loss
2.2 Related work
2.2.1 Bidirectional LSTM
2.2.2 Type-Aware Embeddings
2.2.3 Similarity Condition Embedding (SCE) Network
2.2.4 Compatibility Estimation
2.2.5 Fill-In-The-Blank (FITB)
3 Methods
3.1 Data
3.1.1 Polyvore Dataset
3.1.2 H&M Outfit Dataset
3.1.3 H&M Transaction Dataset
3.2 SCE Network
3.3 Evaluation Metrics
3.3.1 Compatibility Estimation (CE)
3.3.2 Outfit Completion, Fill-In-The-Blank (FITB)
3.3.3 Similarity and classification
3.3.4 Customer basket completion, Fill-In-The-Blank (FITB)
3.4 Software and hyperparameters
4 Results
4.1 Polyvore outfits
4.2 H&M outfits
4.3 H&M transaction
5 Discussion
5.1 H&M outfits: CE and FITB
5.2 H&M outfits: Similarity and Classification
5.3 H&M transaction: FITB
6 Conclusions
Bibliography

Chapter 1 Introduction

Research on fashion data that utilizes computer vision has mostly focused on classification of clothes [4, 5] and retrieval of clothing items [6, 7, 8, 9]. However, due to the increasing popularity of online shopping, as well as the fact that fashion plays an important role in many people's lives, researchers have in recent years started experimenting with models that can complete an outfit and measure the compatibility between pairs of items within an outfit. To be able to compose an outfit it is important to distinguish between item similarity and item compatibility. Item similarity means that, for instance, two t-shirts are good substitutes for each other, while item compatibility is defined as garments of different product types that fit well together in an outfit. One of the first models to utilize different similarity conditions on fashion data was presented by Veit, Belongie, and Karaletsos [1]. They applied their network to a dataset containing shoes and were able to cluster this data on, among other things, gender, heel height and fashion sense based on age. After that, [2, 10, 3] followed with algorithms that can measure outfit compatibility (see Section 2.2.4) as well as recommend items to complete an outfit (see Section 2.2.5).

Moreover, [2, 10, 3] use a dataset named Polyvore that consists of outfits put together by regular people; thus, it is not really representative of how fashion companies construct outfits. In addition, it is not clear whether this data only contains items for women or whether it also contains garments for men. However, looking over the images from the Polyvore dataset, it seems to contain only outfits for women. Hence, there exists a need for a new dataset that contains clothing items for both men and women and where the outfits better reflect how fashion companies construct outfits. In this project we have, in collaboration with fashion experts from the Swedish company Hennes & Mauritz (H&M), created a dataset that contains outfits for both men and women, see Section 3.1.2.

The purpose of this paper is to reimplement the Similarity Condition Embedding Network (SCE-Net) from [3]. This model will be evaluated on outfit compatibility (see Section 3.3.1) and outfit completion (see Section 3.3.2). We will further investigate how well this network generalizes to unseen clothing categories, and we will perform a cross-gender evaluation, that is, train on a dataset that only contains clothing items for women and test on a dataset that only contains items for men, and vice versa. In addition, we will evaluate whether the feature vectors from the general embedding space (feature vectors extracted from a Convolutional Neural Network) can be used to measure similarity between clothing items, and whether they can be used to classify the clothing type and the colour of a garment. Moreover, in addition to the H&M outfit data we have created a second dataset that contains transaction data from both physical stores and H&M's online shop, see Section 3.1.3. On this data we will investigate whether the SCE-net can be used to predict a missing item in a customer's shopping basket, see Section 3.3.4. That is, given the items in a basket of clothing items, what would be the most likely additional item?

The research questions that we address are:

• Can a neural network be applied to a clothing outfit to assess the compatibility of the outfit, as well as to complete the outfit by recommending an additional clothing item?

• Can a neural network predict the next item a customer will buy?

To answer these questions, we will use our reimplementation of the SCE-network and apply it to an outfit dataset that we have constructed in collaboration with H&M, as well as to transactional data provided by H&M.

Chapter 2 Background

2.1 Theory

2.1.1 Convolutional Neural Networks

Convolutional Neural Networks (CNNs) date back to 1989, when LeCun et al. [11] applied them to handwritten digits. However, due to the high computational cost, which requires strong GPUs, deep CNNs did not gain popularity until 2012, when Krizhevsky, Sutskever, and Hinton [12] introduced AlexNet. CNNs are an extension of regular feed-forward neural networks that are specialized to handle data with a grid-like topology, such as images and time-series data [13]. These networks are built from several hidden layers that consist of convolutional layers, pooling layers and fully-connected layers. An example of a CNN architecture can be seen in Figure 2.1.

[Figure 2.1 diagram: input image, convolutional layer with non-linearities, pooling, convolutional layer with non-linearities, pooling, FC layer, output layer]

Figure 2.1: Example of a simple CNN as in [11]. The network alternates between convolutional layers and pooling layers. The feature maps from the last pooling layer are then fed into fully-connected layers (FC layers). To get the output as probabilities, softmax is often applied.

Figure 2.2: An example of 2D convolution. The kernel is sliding over the width and height of the input and the output is equal to the dot product between the kernel and the input. Image is taken from [13].

corners which is something that is usually picked up by the earlier layers in a CNN.

As explained above, within a convolutional layer one or several kernels are moved over the input, which creates several feature maps that are stacked together. After that, a non-linear activation function such as the rectified linear unit (ReLU) is applied to the feature maps. The next step is to apply a pooling function to the output of the convolution, that is, the feature map. The pooling function reduces the dimension of the output by replacing different parts of the feature map with a summary statistic. A popular pooling function is max pooling, which replaces each neighborhood of the output with the maximum value from that neighborhood. This in turn makes the network invariant to small shifts and distortions of the input. As shown in Figure 2.1, most CNNs consist of alternating convolutional layers and pooling layers. These are in most cases followed by one or several fully-connected layers, and to get probabilities over the output a softmax function is often applied to the output of the last fully-connected layer. For more information regarding convolutional neural networks, the interested reader is referred to Chapter 9 in [13].
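To make the convolution, ReLU and max pooling steps concrete, the following is a minimal sketch in plain Python. The toy image, the 2x2 kernel and the function names are illustrative assumptions and are not taken from the thesis; a real CNN would learn its kernels and use an optimized library.

```python
def conv2d(image, kernel):
    """Slide the kernel over the image and take the dot product at each position."""
    kh, kw = len(kernel), len(kernel[0])
    out_h = len(image) - kh + 1
    out_w = len(image[0]) - kw + 1
    return [
        [
            sum(image[r + i][c + j] * kernel[i][j]
                for i in range(kh) for j in range(kw))
            for c in range(out_w)
        ]
        for r in range(out_h)
    ]


def relu(feature_map):
    """Apply the rectified linear unit elementwise."""
    return [[max(0.0, v) for v in row] for row in feature_map]


def max_pool(feature_map, size=2):
    """Non-overlapping max pooling; rows/columns that do not fill a window are dropped."""
    rows = len(feature_map) // size
    cols = len(feature_map[0]) // size
    return [
        [
            max(feature_map[r * size + i][c * size + j]
                for i in range(size) for j in range(size))
            for c in range(cols)
        ]
        for r in range(rows)
    ]


# Hypothetical 5x5 single-channel image and a toy 2x2 difference kernel.
image = [
    [1, 2, 0, 1, 3],
    [0, 1, 2, 3, 1],
    [1, 0, 1, 2, 2],
    [2, 1, 0, 1, 0],
    [1, 3, 2, 0, 1],
]
kernel = [[1, 0], [0, -1]]

feature_map = relu(conv2d(image, kernel))  # 4x4 feature map
pooled = max_pool(feature_map)             # 2x2 summary of the feature map
```

Each pooled value summarizes a 2x2 neighborhood, so moving a strong response by one pixel inside a neighborhood leaves the pooled output unchanged.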

2.1.2 ResNet

to directly optimize the original mapping; this is illustrated in Figure 2.3. By utilizing this technique they were able to build a model with 152 layers that won the ILSVRC 2015 classification task.

Figure 2.3: Example of a residual building block. Image is taken from [17].

2.1.3 Triplet Loss

Within computer vision when one wants to train a model to distinguish images by similarity in some unified embedding space, it is very common to apply the triplet loss function [10, 2, 1, 3, 22, 23]. The core idea is that the model should be able to reduce the distance between similar images and increase the distance between dissimilar images. To do this the triplet loss function utilizes an anchor image, a positive image that is similar to the anchor and a negative image that is dissimilar to the anchor. The function is defined as

Figure 2.4: The triplet loss reduces the distance between the anchor image and the positive while it increases the distance between the anchor and the negative image. This figure is taken from [22].
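As a rough illustration of the idea in Figure 2.4, a commonly used hinge form of the triplet loss can be sketched as follows. The margin value, the use of the unsquared Euclidean distance and the function names are assumptions made for this example, not necessarily the exact definition used in this thesis.

```python
import math


def euclidean(u, v):
    """Euclidean distance between two embedding vectors."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))


def triplet_loss(anchor, positive, negative, margin=0.2):
    """Zero once the negative is at least `margin` farther away than the positive."""
    return max(0.0, euclidean(anchor, positive) - euclidean(anchor, negative) + margin)


# Hypothetical 2-D embeddings.
anchor = [0.0, 0.0]
easy = triplet_loss(anchor, positive=[0.0, 1.0], negative=[3.0, 4.0])  # negative already far away
hard = triplet_loss(anchor, positive=[3.0, 4.0], negative=[0.0, 1.0])  # negative closer than positive
```

Only triplets that violate the margin contribute to the loss, which is why the choice of negatives matters during training.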

2.2 Related work

2.2.1 Bidirectional LSTM


2.2.2 Type-Aware Embeddings

One of the current state-of-the-art methods for outfit compatibility and outfit completion is called Type-Aware Embeddings. This method was introduced by Vasileva et al. [2] and is based on the paper Conditional Similarity Networks by Veit, Belongie, and Karaletsos [1]. When it comes to building an outfit, Vasileva et al. [2] agree with the reasoning of [25, 10] that one needs to be able to measure both item similarity and item compatibility. Here, item similarity means that, for instance, a jacket is a good substitute for another jacket, while item compatibility means that two items of different types complement each other and make a good match in an outfit. However, according to Vasileva et al. [2], using a pairwise distance, e.g. the Euclidean distance, to measure compatibility between items in a unified embedding space gives rise to improper triangles. This means that if a jacket is compatible with a shirt, and this specific shirt is compatible with a pair of shoes, then the model will force the jacket to be compatible with the pair of shoes even though they might not match. To overcome this issue, the authors of [2] suggest that, instead of relying only on a shared embedding space, it is important to also utilize type-specific embeddings that make sure that two items can only match in one context.


2.2.3 Similarity Condition Embedding (SCE) Network

One of the disadvantages of the type-aware embedding model of Vasileva et al. [2] is that the type-spaces need to be predefined; in fact, they use 66 of these. To overcome this issue, Tan et al. [3] suggest masking different parts of the feature vector from the general embedding space and re-weighting them to learn different types of similarity conditions without explicit supervision. Their solution is based on the fact that clothing items are similar with respect to multiple visual attributes, e.g. category, shape, colour-pattern, colour and season. This is aligned with the reasoning in [10, 2, 1]. These attributes are contained in the feature vector from the general embedding space. By masking different parts of the feature vector and then re-weighting the masked vectors, the model can learn different representations of an item, which can then be used to measure compatibility between clothing items. Both the masks and the weights are randomly initialized and then learned during training.

To extract feature vectors, each image is projected into a general embedding space via a ResNet CNN [25] (see Section 2.1.2) with 18 layers. The masks and the re-weighting are then applied via an elementwise product to pairs of images, which projects them into different similarity embeddings that are used to measure pairwise compatibility with the Euclidean distance. This step is vital since, for instance, a blue summer jacket and a beige pair of shorts are dissimilar in colour and clothing category but similar in season (summer), which means that they might be compatible. Similar to [2], the outfit compatibility is calculated as the average of the pairwise compatibility between the items within an outfit. Moreover, this model is trained end-to-end and, as in [10, 2], the goal is to measure outfit compatibility as well as to predict a missing item in an outfit (outfit completion).

2.2.4 Compatibility Estimation

only consists of, for instance, sweaters. Further, Vasileva et al. [2] argue that when an outfit is put together randomly, it is very likely that the resulting set contains clothing items that together do not make an outfit. This in turn makes it very easy for the model to distinguish between real and fake outfits. Instead, Vasileva et al. [2] suggest that it is important to take item category into account when sampling an item. For instance, if a real outfit consists of three items $(I_1, I_2, I_3)$, then when creating the fake outfit $(I'_1, I'_2, I'_3)$, $I'_1$ should be sampled from the same category that $I_1$ belongs to, $I'_2$ from the same category as $I_2$ and $I'_3$ from the same category as $I_3$. This yields a set of items that can be seen as an outfit, while the probability that these items are pairwise compatible is very low. This technique is also utilized by [3].
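The category-aware sampling described above can be sketched as follows. The item identifiers, the category names and the function name are hypothetical; the sketch also assumes that every category contains at least two items, so that a replacement different from the original item always exists.

```python
import random


def sample_negative_outfit(outfit, items_by_category, category_of, rng):
    """Replace every item with a random, different item from the same category."""
    negative = []
    for item in outfit:
        candidates = [c for c in items_by_category[category_of[item]] if c != item]
        negative.append(rng.choice(candidates))
    return negative


# Hypothetical catalogue with three categories.
items_by_category = {
    "top": ["top_1", "top_2", "top_3"],
    "bottom": ["bottom_1", "bottom_2"],
    "shoes": ["shoes_1", "shoes_2", "shoes_3"],
}
category_of = {item: cat for cat, items in items_by_category.items() for item in items}

outfit = ["top_1", "bottom_2", "shoes_3"]
negative = sample_negative_outfit(outfit, items_by_category, category_of, random.Random(0))
```

The negative outfit keeps the category structure of the real one (top, bottom, shoes), so a model cannot separate real from fake outfits by category composition alone.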

2.2.5 Fill-In-The-Blank (FITB)

Chapter 3 Methods

3.1 Data

3.1.1 Polyvore Dataset

Before the web page (www.polyvore.com) was shut down, it was one of the more popular platforms where users could create outfits consisting of different fashionable clothing items. These outfits contain images, a text description of each image, and ratings from other users. Hence, this platform has also been used by researchers to create fashion datasets. Han et al. [10] were the first to create such a dataset and make it publicly available. It consists of 21,899 outfits and 164,379 unique items and is often referred to as the Maryland Polyvore dataset. However, since Vasileva et al. [2] did not agree with the sampling technique of Han et al. [10] (see Section 2.2.4 and Section 2.2.5), they decided to create their own dataset from www.polyvore.com, which contains 68,306 outfits and 365,054 items¹. Further, Vasileva et al. [2] realized that a clothing item can be contained in more than one outfit. Thus, they created one dataset called disjoint, where an item that belongs to outfits in the train split cannot be in the validation split or in the test split. In addition, they created a dataset named non-disjoint, where an item can appear in all three splits, whereas an outfit can only exist in one of the three splits (train, validation, test). The experiments conducted by [2] reveal that the disjoint dataset and the non-disjoint dataset produce similar results. Moreover, in this thesis we utilize the SCE Network from [3], and since they use the non-disjoint dataset we will also use that dataset to see if we can reproduce their results regarding compatibility estimation and fill-in-the-blank.

¹ The Polyvore data from [2]: https://drive.google.com/file/d/13-J4fAPZahauaGycw3j_YvbAHO7tOTW5/view

3.1.2 H&M Outfit Dataset

One of the biggest contributions of this thesis is the outfit dataset from H&M. This dataset will be publicly available, and it will help researchers improve machine learning models for the fashion industry, because pairwise compatible items are assigned by fashion experts instead of regular users as in the Polyvore dataset. Another difference between the H&M data and the Polyvore data is that for the former we can choose to train on items for men, for women, or for both men and women. For the Polyvore dataset it is not clear whether it contains items for both men and women; however, from just looking at the images, it seems that this dataset only contains items for women. Further, it should be noted that this dataset was not simply handed over to us by H&M. We had to create it ourselves, and this has actually been the most time-consuming part of this thesis. From H&M we got a dataframe that contained pairwise compatible clothing items. We then had to merge this dataframe with several other dataframes to gather all the necessary information about the images. This information was then used to download the correct images from the H&M database. A correct image should show the front of the item and should not contain a model or a mannequin. An example of a correct and an unwanted image can be seen in Figure 3.1.

(a) Correct image (b) Unwanted image

Figure 3.1: To the left is an example of how an image should look like to be


Figure 3.2: Example of an outfit from the H&M dataset.

To build the outfits, we utilized the fact that each pair of compatible clothing items can be seen as an edge in a graph. Hence, we built an undirected graph that contains all of the pairwise compatible items. We then extracted all maximal cliques in the graph [26], where each clique is an outfit. This technique ensures that each item in an outfit is pairwise compatible with all of the other items within the clique. It should also be noted that we had to set up some rules before building the graph. Items that cover the torso were divided into three categories: short-sleeve/shirts, sweaters and jackets. Accessories were divided into five categories: headwear, bags, scarves, sunglasses and belts. Items that cover the legs were assigned to a single category, and the same goes for dresses and shoes. We then decided that an outfit cannot contain more than one item from each category, and that a dress cannot be in an outfit that already contains an item that covers the legs. An example of an outfit is illustrated in Figure 3.2. Utilizing this technique, three different datasets were created: menswear, ladieswear and men/ladies/divided, where divided is a special category from H&M that contains items for both men and women. The total number of outfits as well as the number of unique items within each dataset can be seen in Table 3.1. Here, it should be noted that, in contrast to the Polyvore dataset, the H&M outfit data contains more outfits than unique items. This means that a clothing item can be represented in multiple outfits.
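The clique-based construction of outfits can be illustrated with a small sketch. It uses a basic Bron-Kerbosch recursion without pivoting, which is an assumption made for this example; the thesis cites [26] for the clique extraction and the exact algorithm or library used there is not shown here. The item names are hypothetical and the category rules described above are not modelled.

```python
def build_adjacency(pairs):
    """Turn pairwise compatible items into an undirected adjacency map."""
    adjacency = {}
    for a, b in pairs:
        adjacency.setdefault(a, set()).add(b)
        adjacency.setdefault(b, set()).add(a)
    return adjacency


def maximal_cliques(adjacency):
    """Yield every maximal clique (basic Bron-Kerbosch, no pivoting)."""
    def expand(clique, candidates, excluded):
        if not candidates and not excluded:
            yield frozenset(clique)
            return
        for node in list(candidates):
            yield from expand(
                clique | {node},
                candidates & adjacency[node],
                excluded & adjacency[node],
            )
            # The node has been fully explored: move it from candidates to excluded.
            candidates = candidates - {node}
            excluded = excluded | {node}

    yield from expand(set(), set(adjacency), set())


# Hypothetical compatibility pairs assigned by a fashion expert.
pairs = [
    ("tshirt", "jeans"),
    ("tshirt", "sneakers"),
    ("jeans", "sneakers"),
    ("jeans", "jacket"),
]
outfits = set(maximal_cliques(build_adjacency(pairs)))
```

Here the t-shirt, jeans and sneakers form one outfit because all three pairs are compatible, while the jacket only forms an outfit with the jeans. An item such as the jeans can therefore appear in several outfits, which is consistent with the dataset containing more outfits than unique items.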

Table 3.1: Comparison of the different outfit datasets

| Dataset | Nr of outfits | Nr of unique items |
|---|---|---|
| Polyvore [2] | 68,306 | 365,054 |
| H&M men/ladies/divided | 115,433 | 28,649 |
| H&M Ladieswear | 50,228 | 14,999 |

3.1.3 H&M Transaction Dataset

In addition to the H&M outfit data, we also generated a dataset that contains baskets of items based on customer transactions from both physical stores and the online shop. We created one dataset that contains items bought by male customers and another dataset that contains items bought by female customers. To create the data we used two constraints: firstly, each basket needs to have at least two items, and secondly, each customer needs to have at least four purchases in their history within a specific time period. Apart from these conditions, we used the same dataframes, categories and image types as for the H&M outfit data in Section 3.1.2.
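The two constraints can be sketched as a simple filter. The data layout (a list of customer and item-list tuples that already lies within the chosen time period) and the function name are assumptions for illustration, as is the choice to count a customer's purchases before removing single-item baskets; the text does not specify that order.

```python
from collections import Counter


def filter_baskets(baskets, min_items=2, min_purchases=4):
    """Keep baskets with enough items from customers with enough purchases."""
    # Assumption: every basket counts as one purchase, regardless of its size.
    purchases = Counter(customer for customer, _ in baskets)
    return [
        (customer, items)
        for customer, items in baskets
        if len(items) >= min_items and purchases[customer] >= min_purchases
    ]


# Hypothetical transactions: customer "a" has four purchases, customer "b" only two.
baskets = [
    ("a", ["t1", "j1"]),
    ("a", ["s1"]),
    ("a", ["t2", "j2", "h1"]),
    ("a", ["d1", "b1"]),
    ("b", ["t3", "j3"]),
    ("b", ["t4", "s2"]),
]
filtered = filter_baskets(baskets)
```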

3.2 SCE Network

training. Each mask has dimension $D$, and in Figure 3.3 the masks are denoted as $C_1, \ldots, C_M$. For each feature vector the model will learn $M$ new embeddings

$$E_{ij} = V_i \odot C_j, \tag{3.1}$$

where $\odot$ is the elementwise product and $E_{ij}$ denotes the embedding that results from applying mask $C_j$ to the feature vector $V_i$ from image $x_i$. Applying each mask to a feature vector $V_i$ yields a matrix $O = [E_{i1}, \ldots, E_{iM}]$ with dimension $D \times M$, where each column is an embedding. To get an importance weighted embedding, a vector $w$ that contains weights is multiplied with the embedding matrix $O$

$$E_i^{(ij)} = w^{(ij)} O^T. \tag{3.2}$$

These weights make it possible for the model to learn what the most important concepts are in an unsupervised way. More specifically, by applying weights the model can learn the importance of each similarity condition mask for pairs of images. To learn the weight vector $w^{(ij)}$, feature vectors from a pair of images $x_i$ and $x_j$ are extracted and then concatenated

$$y = \mathrm{concat}(V_i, V_j), \tag{3.3}$$

where $V_i$ is the feature vector from image $x_i$ and $V_j$ is the feature vector from image $x_j$. The vector $y$ is then fed into a multilayer perceptron with two fully-connected layers and a ReLU activation in between them. At the end, a softmax activation function is applied, which results in a weight vector $w$ with dimension $M$, that is, one weight for each mask. This operation is denoted as the condition weight branch in Figure 3.3.
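Equations 3.1 to 3.3 can be traced with a small sketch in plain Python. The layer sizes, the zero-initialized weight matrices and the omission of bias terms are assumptions chosen to keep the example short; in the actual network the masks and the weights of the condition weight branch are learned.

```python
import math


def softmax(z):
    """Numerically stable softmax."""
    m = max(z)
    exps = [math.exp(v - m) for v in z]
    total = sum(exps)
    return [e / total for e in exps]


def matvec(W, v):
    """Matrix-vector product for a matrix stored as a list of rows."""
    return [sum(w * x for w, x in zip(row, v)) for row in W]


def condition_weights(v_i, v_j, W1, W2):
    """Condition weight branch: concat (3.3), FC, ReLU, FC, softmax (biases omitted)."""
    y = v_i + v_j  # list concatenation of the two feature vectors
    hidden = [max(0.0, h) for h in matvec(W1, y)]
    return softmax(matvec(W2, hidden))


def weighted_embedding(v_i, masks, weights):
    """Mask the feature vector (3.1) and combine the masked embeddings (3.2)."""
    return [
        sum(w * v_i[d] * mask[d] for w, mask in zip(weights, masks))
        for d in range(len(v_i))
    ]


# Toy setting: D = 2 features, M = 2 masks, 3 hidden units (all values hypothetical).
v_i, v_j = [2.0, 4.0], [1.0, 3.0]
masks = [[1.0, 0.0], [0.0, 1.0]]
W1 = [[0.0] * 4 for _ in range(3)]  # hidden x 2D
W2 = [[0.0] * 3 for _ in range(2)]  # M x hidden

weights = condition_weights(v_i, v_j, W1, W2)
embedding = weighted_embedding(v_i, masks, weights)
```

With all-zero branch weights the softmax assigns equal importance to both masks, so the resulting embedding is simply the average of the two masked vectors.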

For the model to be able to distinguish between similar and dissimilar contents in pairs of images, the triplet loss is utilized, since this loss function strives to reduce the distance between images with similar concepts and increase the distance between images with dissimilar concepts, see Section 2.1.3. We define $x_a$ as the anchor image, $x_p$ as a positive image, that is, an image that has been classified by a fashion expert as compatible with the anchor image, and $x_n$ as a negative image. The negative image is randomly sampled from the same product category as the positive image, and is thus most likely not compatible with the anchor image. We define the triplet loss function as

$$L_{triplet} = \max\{0,\; d(E_a^{(ap)}, E_p^{(ap)}) - d(E_a^{(an)}, E_n^{(an)}) + \mu\}, \tag{3.4}$$

where $E$ are the importance weighted masked embeddings from Equation 3.2, $d(\cdot)$ is the Euclidean distance and $\mu$ is a margin. Further, to regularize the similarity condition masks to be sparse, as in [3, 1] we apply the L1 norm

$$L_1(C) = \|C\|_1, \tag{3.5}$$

and, as in [3, 1, 2], to keep regularity in the general embedding space $g(x; \theta)$ we utilize the L2 norm

$$L_2(g(x; \theta)) = \|g(x; \theta)\|_2^2. \tag{3.6}$$

Hence, the final loss is a composition of Equation 3.4, Equation 3.5 and Equation 3.6

$$L = L_{triplet} + \lambda_1 L_1 + \lambda_2 L_2, \tag{3.7}$$

where $\lambda_1$ and $\lambda_2$ are hyperparameters. Moreover, it should be noted that this network is trained end-to-end. An illustration of the SCE-network is presented in Figure 3.3.

Figure 3.3: An illustration of Similarity Condition Embedding Network. This image is taken from [3].
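The composition of the final loss in Equation 3.7 can be sketched as follows. The numeric values are hypothetical, and summing the L1 term over all masks and the L2 term over all general embeddings in a triplet is an assumption about how the regularizers are aggregated.

```python
def l1_norm(masks):
    """Sum of absolute mask entries (Equation 3.5), summed over all masks."""
    return sum(abs(c) for mask in masks for c in mask)


def squared_l2_norm(vector):
    """Squared Euclidean norm of one general embedding (Equation 3.6)."""
    return sum(x * x for x in vector)


def total_loss(triplet, masks, general_embeddings, lambda1, lambda2):
    """Triplet loss plus the weighted regularizers (Equation 3.7)."""
    l1 = l1_norm(masks)
    l2 = sum(squared_l2_norm(v) for v in general_embeddings)
    return triplet + lambda1 * l1 + lambda2 * l2


# Hypothetical values: a triplet loss of 0.5, two masks and one general embedding.
loss = total_loss(
    triplet=0.5,
    masks=[[1.0, -1.0], [0.0, 2.0]],
    general_embeddings=[[3.0, 4.0]],
    lambda1=0.1,
    lambda2=0.01,
)
```

The L1 term pushes mask entries towards zero, so each mask keeps only the feature dimensions that matter for its similarity condition.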

3.3 Evaluation Metrics

3.3.1 Compatibility Estimation (CE)


item from the outfit. The model then calculates the Euclidean distance between each pair and takes the average as an estimate of outfit compatibility. The performance on outfit compatibility is evaluated with the area under the receiver operating characteristic curve (AUC). Hence, we use both positive and negative outfits. As in [3, 2], a negative outfit is created by taking a positive outfit and, for each item in that outfit, randomly sampling a new item of the same clothing type. This ensures that an outfit only consists of items that can actually form a real outfit, for instance a t-shirt, jeans and shoes. At the same time, it is very unlikely that the items in this random set are pairwise compatible.
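The evaluation described above can be sketched in two steps: an outfit-level distance computed as the mean over all item pairs, and a rank-based AUC over positive and negative outfits. The `pair_distance` argument is a hypothetical stand-in for the distance between the pair-specific SCE embeddings, and the toy distances and labels are invented for illustration.

```python
import math
from itertools import combinations


def outfit_distance(outfit, pair_distance):
    """Average pairwise distance between all items in an outfit (lower = more compatible)."""
    pairs = list(combinations(outfit, 2))
    return sum(pair_distance(a, b) for a, b in pairs) / len(pairs)


def auc(scores, labels):
    """Probability that a positive outfit scores higher than a negative one (ties count 0.5)."""
    pos = [s for s, label in zip(scores, labels) if label == 1]
    neg = [s for s, label in zip(scores, labels) if label == 0]
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0 for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


# Toy outfit with items represented as 2-D points; math.dist is the Euclidean distance.
toy_distance = outfit_distance([(0, 0), (3, 4), (6, 8)], math.dist)

# Hypothetical outfit distances; the compatibility score is the negated distance.
distances = [0.2, 0.4, 0.9, 0.3]
labels = [1, 1, 0, 0]  # 1 = real outfit, 0 = sampled negative outfit
score = auc([-d for d in distances], labels)
```

An AUC of 0.5 corresponds to random ranking and 1.0 to every real outfit being ranked above every negative outfit.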

3.3.2 Outfit Completion, Fill-In-The-Blank (FITB)

As in [3, 10, 2], the model's ability to complete an outfit is measured with the Fill-In-The-Blank (FITB) metric, see Section 2.2.5. This metric is applied to the H&M outfit data as well as the Polyvore data. The FITB task can be seen as a test of whether the SCE-network can recommend a clothing item that, together with a user's predefined garments, creates a fashionable outfit. The idea is to mask a random item from an outfit and then create an answer set that contains four items. One of these clothing items is the masked item and, similar to [3, 2], the three remaining garments are randomly sampled from the same product type as the masked item. The outfit with the missing item is often referred to as the question. At test time the model iterates over each item in the answer set, pairs it with the clothes from the question, calculates the Euclidean distance between each pair of garments, and takes the average as an estimate of how well the clothing items fit together as an outfit. When this is done for all questions, FITB is evaluated on accuracy, defined as the total number of correct answers divided by the number of questions. Since the answer set consists of four items of which only one is correct, a random model would yield an accuracy of 25%.
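The FITB procedure can be sketched as follows. Items are represented as plain numbers and `pair_distance` is a hypothetical stand-in for the distance between SCE embeddings. The sketch averages only the distances between a candidate and the question items; including the distances among the question items themselves would add the same constant to every candidate and would not change which one is selected.

```python
def answer_fitb(question, answers, pair_distance):
    """Return the index of the answer with the lowest mean distance to the question items."""
    def mean_distance(candidate):
        return sum(pair_distance(candidate, item) for item in question) / len(question)

    return min(range(len(answers)), key=lambda k: mean_distance(answers[k]))


def fitb_accuracy(tasks, pair_distance):
    """Number of correctly answered questions divided by the number of questions."""
    correct = sum(
        answer_fitb(question, answers, pair_distance) == correct_index
        for question, answers, correct_index in tasks
    )
    return correct / len(tasks)


def toy_distance(a, b):
    """Hypothetical 1-D 'embedding' distance."""
    return abs(a - b)


# Each task: (question items, four candidate answers, index of the masked item).
tasks = [
    ([1.0, 2.0], [10.0, 1.5, -5.0, 7.0], 1),  # the model picks index 1: correct
    ([0.0], [0.1, 5.0, 6.0, 7.0], 2),         # the model picks index 0: wrong
]
accuracy = fitb_accuracy(tasks, toy_distance)
```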

3.3.3 Similarity and classification


wants to divide the data in. This method is initialised with random centers for each cluster, and each vector is then assigned to a cluster by minimizing the squared Euclidean distance. After that, new centers are assigned by calculating the mean of the vectors in each cluster. Once the centers have stopped moving, the algorithm has converged. When the K-means algorithm has clustered the data, we utilize the mode of each cluster to decide the colour/product type of that cluster. To get a metric that represents how well these vectors can be used for similarity, we use the accuracy of each cluster, where accuracy is defined as the number of items that belong to the mode divided by the total number of items in the cluster. Regarding the classification task, we implement one multilayer perceptron that classifies colour and another one that classifies product type. These multilayer perceptrons consist of three hidden layers with ReLU activations in between, and as loss function we apply the cross-entropy loss.
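The per-cluster accuracy defined above (items belonging to the mode divided by the cluster size) can be sketched as follows. The cluster assignments are taken as given here, for example from a K-means run, and the labels are hypothetical.

```python
from collections import Counter


def cluster_scores(cluster_ids, labels):
    """Map each cluster to (mode label, share of items that carry the mode label)."""
    members_by_cluster = {}
    for cluster_id, label in zip(cluster_ids, labels):
        members_by_cluster.setdefault(cluster_id, []).append(label)

    scores = {}
    for cluster_id, members in members_by_cluster.items():
        mode, count = Counter(members).most_common(1)[0]
        scores[cluster_id] = (mode, count / len(members))
    return scores


# Hypothetical cluster assignments and colour labels for five items.
cluster_ids = [0, 0, 0, 1, 1]
colours = ["Black", "Black", "Blue", "Beige", "Beige"]
scores = cluster_scores(cluster_ids, colours)
```

In this toy example cluster 0 has mode Black with a score of 2/3, and cluster 1 has mode Beige with a score of 1.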

3.3.4 Customer basket completion, Fill-In-The-Blank (FITB)

3.4 Software and hyperparameters


Chapter 4 Results

4.1 Polyvore outfits

Our reimplementation of the Similarity Condition Embedding (SCE) network is first evaluated on the non-disjoint Polyvore dataset described in Section 3.1.1. The reason for this is that we want to make sure that we can reproduce the results from [3], since the authors of that paper are the ones who introduced the SCE-network.

Table 4.1: Comparison of different models as well as our reimplementation of the SCE-net on the Polyvore dataset. The models are evaluated on outfit compatibility and fill-in-the-blank (outfit completion). Within parenthesis, dim stands for the dimension of the feature vector from the general embedding space.

| Model | CE AUC | FITB Acc |
|---|---|---|
| Bi-LSTM (dim=512) [2] | 0.65 | 39.7 |
| TAE-net (dim=64) [2] | 0.86 | 55.3 |
| SCE-net (dim=64) [3] | 0.91 | 61.6 |
| SCE-net (dim=64) (ours) | 0.90 | 60.2 |

Table 4.1 presents the results from the experiment conducted on the Polyvore dataset. Firstly, it should be noted that (dim), within parenthesis, stands for the dimension of the feature vectors from the general embedding space. More specifically, it is the dimension of the feature vectors extracted from a CNN; for the SCE-net the CNN is an 18-layer deep ResNet. Secondly, Table 4.1 shows that our reimplementation of the SCE-net gives very similar results to the original one introduced in [3]. However, we get slightly lower results on both compatibility estimation and FITB. The reason for this could be that Tan et al. [3] trained their model several times and then picked the model with the best results. Hence, we maintain that our results are good enough for reproducibility. Thirdly, Table 4.1 also shows that the SCE-net performs better than the TAE-net from [2], even though the TAE-net requires predefined similarity conditions (also known as type-aware embeddings), while the SCE-net learns these conditions in an unsupervised way.

4.2 H&M outfits

Similar to the Polyvore dataset, the H&M dataset is non-disjoint. That is, outfits that are contained in the train split cannot appear in the validation or the test split, while we still allow items to be contained in outfits across the train, validation and test sets. However, as explained in Section 3.1.2, a big difference between the Polyvore data and the H&M data is that the latter contains outfits for both men and ladies. Thus, our first test on the H&M data was to see how the SCE-network performs on data that contains outfits for men/ladies/divided, only ladies and only men. At the same time, we also investigated how the dimension of the feature vectors extracted from ResNet 18 (the general embedding space) affects the performance on compatibility estimation and FITB. As can be seen in Table 4.2, our reimplementation of the SCE-net produces good results on all three datasets. The table also shows that there is a very small difference between using feature vectors with dimension 64 and dimension 512. Hence, for the remaining experiments we utilize feature vectors with dimension equal to 64. It should also be noted that the data containing outfits for men, ladies and divided has the best performance in terms of outfit completion (FITB) and compatibility estimation. Second best is the dataset that only has outfits for ladies, and last is the data that only contains clothing items for men. This is most likely due to the fact that the men/ladies/divided data is the largest in terms of outfits as well as clothing items, and the data that only contains ladieswear is the second largest.

Table 4.2: Comparison of different versions of the H&M dataset as well as comparison of different dimension size of the feature vectors from the general embedding space. The models are evaluated on outfit compatibility and fill-in-the-blank (outfit completion).

| Dataset | Model | CE AUC | FITB Acc |
|---|---|---|---|
| men/ladies/divided | SCE-net (dim=64) | 0.92 | 77.3 |
| men/ladies/divided | SCE-net (dim=512) | 0.93 | 78.6 |
| Ladieswear | SCE-net (dim=64) | 0.90 | 72.8 |
| Ladieswear | SCE-net (dim=512) | 0.89 | 72.7 |
| Menswear | SCE-net (dim=64) | 0.88 | 70.3 |
| Menswear | SCE-net (dim=512) | 0.88 | 69.8 |

Table 4.3: Comparison of how different numbers of similarity condition masks affect the performance of the SCE-net (dim=64).

| Number of masks | Menswear CE AUC | Menswear FITB Acc | Ladieswear CE AUC | Ladieswear FITB Acc |
|---|---|---|---|---|
| 1 | 0.86 | 65.9 | 0.88 | 69.5 |
| 2 | 0.87 | 68.0 | 0.89 | 70.3 |
| 5 | 0.88 | 70.3 | 0.90 | 72.8 |
| 8 | 0.87 | 67.2 | 0.90 | 71.8 |
| 16 | 0.85 | 66.1 | 0.89 | 71.4 |

contained in the men dataset are also contained in the ladies dataset. Moreover, the results in Table 4.4 are not outstanding. However, as mentioned in Section 3.3.2, a model that guesses randomly on the FITB task would yield an accuracy of 25%, and since the accuracy in both of our tests is greater than 25%, it seems that the model can actually be trained on one gender and then be applied to another gender.

Table 4.4: Cross-gender evaluation of the SCE-net (dim=64) model by training and testing across different genders. The rows show which gender the model has been trained on while the columns show which gender the model has been tested on.

| Trained on | Evaluated on Men: CE AUC | Evaluated on Men: FITB Acc | Evaluated on Ladies: CE AUC | Evaluated on Ladies: FITB Acc |
|---|---|---|---|---|
| Menswear | 0.88 | 70.3 | 0.61 | 37.2 |
| Ladieswear | 0.67 | 41.9 | 0.90 | 72.8 |

To further investigate how the SCE-net generalizes to unseen data, we train a model without accessories and then test it on a dataset that contains accessories. As stated in Section 3.1.2, we have five different types of accessories: headwear, bags, scarves, sunglasses and belts. From Table 4.5 it can be seen that the results for the dataset that only contains clothes for men and the one that only contains garments for ladies are very similar. Further, the results are not perfect, but they still show that the model can adapt to unseen categories.

Table 4.5: Evaluation of an SCE-net model that is trained on a dataset without accessories and then tested on a dataset that contains accessories.

Model               CE AUC   FITB Acc

Dataset Ladieswear
SCE-net (dim=64)    0.73     50.1

Dataset Menswear
SCE-net (dim=64)    0.74     50.9


Table 4.6: Evaluation of the general embedding space on similarity of colour between clothing items.

Cluster   Nr of items in cluster   Mode    Score
1         146                      Black   41/146 = 0.28
2         194                      Beige   80/194 = 0.41
3         192                      White   46/192 = 0.24
4         117                      Blue    42/117 = 0.36
5         197                      Black   112/197 = 0.57
6         172                      Black   53/172 = 0.31
7         139                      Black   36/139 = 0.26
8         182                      Grey    33/182 = 0.18
9         80                       Blue    43/80 = 0.54
10        197                      Black   74/197 = 0.38
11        128                      Black   61/128 = 0.48
12        155                      Blue    89/155 = 0.57
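The Score column in Table 4.6 is the share of items in a cluster that carry the cluster's most common colour label, for example 41/146 = 0.28 for cluster 1. A minimal sketch of that calculation in plain Python is given here. The function name `cluster_mode_scores` and the toy labels are illustrative assumptions, not the code or data used in this project.

```python
from collections import Counter


def cluster_mode_scores(cluster_ids, colour_labels):
    """Return {cluster_id: (mode_label, mode_count, cluster_size, score)}.

    score = (number of items with the most common label) / (cluster size).
    """
    by_cluster = {}
    for cid, label in zip(cluster_ids, colour_labels):
        by_cluster.setdefault(cid, []).append(label)

    scores = {}
    for cid, labels in by_cluster.items():
        mode_label, mode_count = Counter(labels).most_common(1)[0]
        scores[cid] = (mode_label, mode_count, len(labels), mode_count / len(labels))
    return scores


# Toy example with hypothetical labels (not H&M data).
cluster_ids = [1, 1, 1, 1, 2, 2, 2]
colours = ["Black", "Black", "Grey", "Blue", "Beige", "Beige", "White"]
scores = cluster_mode_scores(cluster_ids, colours)
```

A low score does not necessarily mean that the cluster is poor. It can also mean that the colour labels overlap, so that visually close items carry different labels.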


Figure 4.2: The first item in each row is the query product; the remaining garments are the closest neighbours to the query item. The distance between the products is measured with the Euclidean distance.
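The neighbour lookup behind a figure such as Figure 4.2 can be sketched as a brute-force search over Euclidean distances. The sketch below is only an illustration: the function names, the item identifiers and the 2-dimensional toy vectors are hypothetical, whereas the feature vectors in this project come from the general embedding space.

```python
import math


def euclidean(u, v):
    """Euclidean distance between two equally long vectors."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))


def nearest_neighbours(query_id, embeddings, k=5):
    """Return the ids of the k items closest to the query item."""
    query = embeddings[query_id]
    ranked = sorted(
        (euclidean(query, vec), item_id)
        for item_id, vec in embeddings.items()
        if item_id != query_id
    )
    return [item_id for _, item_id in ranked[:k]]


# Hypothetical item ids and toy vectors.
embeddings = {
    "tshirt_a": [0.0, 0.0],
    "tshirt_b": [0.1, 0.0],
    "jeans_a": [1.0, 1.0],
    "scarf_a": [0.0, 0.5],
}
neighbours = nearest_neighbours("tshirt_a", embeddings, k=2)
```

For a large catalogue an index structure would normally replace the full sort, but the ranking criterion stays the same.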

Figure 4.3: An illustration of how the accuracy for similarity of colour


Figure 4.4: An illustration of how the accuracy for similarity of product type increases as the number of clusters increases. Similarity is measured on the feature vectors from the general embedding space.

Table 4.7: Evaluation of feature vectors from the general embedding space regarding classification of colour and product type. These vectors are extracted from a deep residual neural network with 18 layers and have dimension 64.

Colour   Product type
74%      72%

4.3 H&M transaction


set are sampled from the same garment type as the masked item, and Table 4.9 shows FITB results where the items in the answer set are sampled such that the most frequently bought items have a greater probability of being drawn. Comparing these tables, we can see that the network performs better when the items in the answer set are picked from the most frequently bought items. The tables also show that we get better results on the data that contains items for men than on the data with products for ladies.
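The two ways of building a FITB answer set described above could look roughly as follows. This is a sketch under stated assumptions: the function names, the `catalogue` and `purchase_counts` structures and the choice to draw wrong answers without replacement are illustrative, and the number of wrong answers is left as a parameter.

```python
import random


def sample_same_category(masked_item, catalogue, n_wrong, rng):
    """Draw wrong answers uniformly from the masked item's own category.

    catalogue maps item id -> category. Raises ValueError if the category
    holds fewer than n_wrong other items.
    """
    category = catalogue[masked_item]
    pool = [i for i, c in catalogue.items() if c == category and i != masked_item]
    return rng.sample(pool, n_wrong)


def sample_by_frequency(masked_item, purchase_counts, n_wrong, rng):
    """Draw distinct wrong answers with probability proportional to how
    often each item was bought (assumes positive counts)."""
    pool = {i: c for i, c in purchase_counts.items() if i != masked_item}
    chosen = []
    while len(chosen) < n_wrong and pool:
        items = list(pool)
        weights = [pool[i] for i in items]
        pick = rng.choices(items, weights=weights, k=1)[0]
        chosen.append(pick)
        del pool[pick]  # without replacement (an assumption of this sketch)
    return chosen


# Hypothetical toy data.
rng = random.Random(0)
catalogue = {"a": "trousers", "b": "trousers", "c": "trousers",
             "d": "shirt", "e": "trousers"}
purchase_counts = {"a": 10, "b": 5, "c": 1, "d": 1, "e": 3}
same_category = sample_same_category("a", catalogue, 3, rng)
by_frequency = sample_by_frequency("a", purchase_counts, 3, rng)
```

Wrong answers drawn from the same category as the masked item are likely to be harder to tell apart from the correct one than popular items from arbitrary categories, which is consistent with the lower accuracy reported for the same-category setting.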

Table 4.8: Evaluation of SCE-net (dim=64) on the transaction data, where the answers in the FITB set are sampled from the same product category as the masked item.

Model               Dataset Menswear   Dataset Ladieswear
                    FITB Acc           FITB Acc
SCE-net (dim=64)    56.6               47.4

Table 4.9: Evaluation of SCE-net (dim=64) on the transaction data, where the answers in the FITB set are sampled in such a way that the most frequently bought items have a higher probability of being picked.

Model               Dataset Menswear   Dataset Ladieswear
                    FITB Acc           FITB Acc


Discussion

In this chapter we reflect on the results for the SCE-net on the Polyvore outfit data, the H&M outfit data and the H&M transaction data. To begin with, we would like to highlight that the Similarity Condition Embedding Network outperforms the Type-Aware Embedding network (TAE-net) on the Polyvore dataset, even though the TAE-net uses predefined similarity conditions while the SCE-net learns these conditions in an unsupervised way. Further, unlike the bidirectional LSTM approach, the SCE-net places no constraints on the order in which the images of an outfit are fed to the network. These two facts played a strong role in why we chose the SCE-net in our effort to help H&M apply an algorithm that can recommend products to their customers as well as measure similarity between clothing items.

5.1 H&M outfits: CE and FITB

Moreover, our results in Table 4.2 show that our reimplementation of the SCE-net has no problem handling data that is specific to one gender as well as data that contains items for both men and ladies. This shows that the network learns similarity conditions that are specific to outfits for men and outfits for ladies. However, it should be noted that the dataset that contains items for both men and women gives the best results on both outfit completion and compatibility estimation. This is expected, since that dataset is the largest one we use and it has a very high variation of clothing items. For the gender-specific datasets, the network performs better on the ladies data than on the data that only contains outfits for men. Here too, the larger amount of data and its higher variability is the most reasonable explanation. Further, Table 4.2 also shows that the SCE-net is very effective at compatibility estimation and outfit completion even when the dimension of the feature vectors is small. In fact, embeddings of size 64 and 512 produce almost the same results. This is important, since a lower dimension of the feature vectors means a lighter model and hence a lower computational cost.

Table 4.3 shows the results of the experiment on how the number of similarity condition masks affects the performance of the SCE-net. Firstly, it should be noted that five masks produce the best results. This should be compared to the TAE-net, which utilizes 66 conditions; keep in mind that those conditions need to be predefined, while the SCE-net learns its conditions via unsupervised learning. Secondly, the data that only contains items for men and the data that only contains items for ladies behave very similarly. However, when we increase the number of condition masks to 16, the performance on the ladies data degrades less than on the men data (FITB accuracy drops from 72.8 to 71.4 for ladies, compared with 70.3 to 66.1 for men). This is most likely because women's outfits have a higher variety of clothing items than outfits for men. Hence, when we increase the number of masks, the model overfits faster on the men dataset.


5.2 H&M outfits: Similarity and Classification

As mentioned before, in this project we distinguish between similarity and compatibility. We define two items as similar if they have the same colour or belong to the same product type, while compatibility is defined for clothing items from different product types that fit well together in an outfit. Compatibility is measured on masked feature vectors, and similarity is measured on the feature vectors extracted from a CNN (the general embedding space). Our results regarding similarity of colour revealed that it is not possible to divide colour into a few clusters such as black, blue, white, green etc. Colour is far more complex than that, since garments often contain multiple colours, patterns and stripes, which we illustrate with some sample images in Figure 4.1. Further, we provide evidence that the accuracy of our similarity metric for colour as well as for product type increases as the number of clusters (labels) increases. This makes sense, because what one person defines as the colour yellow, or as a blouse, might not be the same as what another person does.
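The difference between measuring on general feature vectors and on masked feature vectors can be illustrated with a small sketch. It assumes a simplified form of the mechanism in [3]: each condition mask is applied element-wise to the general embedding, and the masked copies are combined with softmax weights. The names `masked_embedding` and `pair_logits` and all numbers are hypothetical; in the real network the weights are produced by the conditional weight branch rather than set by hand.

```python
import math


def softmax(logits):
    """Turn arbitrary scores into positive weights that sum to one."""
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def masked_embedding(general_vec, masks, condition_logits):
    """Weighted sum of element-wise masked copies of a general embedding."""
    weights = softmax(condition_logits)
    out = [0.0] * len(general_vec)
    for weight, mask in zip(weights, masks):
        for d, (x, m) in enumerate(zip(general_vec, mask)):
            out[d] += weight * m * x
    return out


def euclidean(u, v):
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))


# Toy 2-d general embeddings for two items (hypothetical values).
top = [2.0, 4.0]
trousers = [2.0, 0.0]
masks = [[1.0, 0.0], [0.0, 1.0]]   # two condition masks
pair_logits = [0.0, 0.0]           # stand-in for the weight branch output

# Similarity: distance in the general embedding space.
similarity_distance = euclidean(top, trousers)
# Compatibility: distance between the masked feature vectors.
compatibility_distance = euclidean(
    masked_embedding(top, masks, pair_logits),
    masked_embedding(trousers, masks, pair_logits),
)
```

With equal weights on both masks the masked distance is half the general distance in this toy case. Unequal weights would emphasise the dimensions selected by one mask over the other.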

Regarding the feature vectors from the general embedding space, we provide evidence that they can be used as input to a simple multilayer perceptron that classifies colour and product type. This experiment is not directly relevant to the Similarity Condition Embedding Network itself. However, it indicates that the information learnt by the SCE-net can be applied to other projects, which increases the business value of the model for companies such as H&M. Keep in mind that it is both time consuming and expensive to train a model that utilizes a deep convolutional neural network. Hence, it is valuable for a company to be able to use the output from one network in many different projects.

5.3 H&M transaction: FITB


Conclusions

In this project we have reimplemented the Similarity Condition Embedding Network from [3]. It is a model that is trained end-to-end, and it uses a conditional weight branch to learn the relative importance of different similarity condition masks. We present evidence that our reimplementation of the SCE-net works very well, both for completing an outfit and for predicting the fashion compatibility of an outfit, on data that contains items for both men and women as well as on gender-specific datasets. We have further presented results indicating that the SCE-net can generalize to unseen categories and that the network can be trained on one gender and then tested on data containing products for another gender. Regarding the feature vectors from the general embedding space, we provide evidence that they can be used to measure similarity between clothing items and that they can be fed into a simple multilayer perceptron (MLP) that classifies the colour and the product type of a garment. In addition, we have presented results implying that the SCE-net can be utilized on transactional data to predict clothing items in a customer's basket.

Moreover, our results as well as the process of training the model shed light on some concerns regarding the quantity and quality of the data. We have spent a significant amount of time getting familiar with the data and choosing which tables, columns and labels need to be included. Firstly, we noticed that the model is very sensitive to the input images; for instance, it can easily be disturbed by images that contain mannequins, zoomed-in images that do not show the entire item and images that show multiple colour versions of a clothing item. We can also see that the colour labels make it very hard to measure similarity with the feature vectors from the general embedding space, since these labels have a very high overlap with each other. Further, our results also indicate that the network performance is positively correlated with the amount of data.

Regarding the limitations of our project, it could be argued that we should have experimented with the Type-Aware Embedding model [2] on the H&M data. However, this model uses 66 predefined similarity conditions, and the exact definitions of these conditions are not given in the paper by Vasileva et al. [2]. Moreover, just creating the H&M data took up most of the time for this project, so we did not have time to create a new dataset with 66 predefined similarity conditions. Hence, we leave it for future work to compare the Type-Aware Embedding Network against the SCE-net on the H&M dataset. Another thing that would be interesting to investigate is whether the results for outfit compatibility estimation can be improved by using a multilayer perceptron instead of averaging over the pairwise compatibility. This would, however, increase the time complexity of both training and testing the model, since we would still need the pairwise compatibility for the Fill-In-The-Blank task.
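The averaging over pairwise compatibility mentioned above, and its reuse for the Fill-In-The-Blank task, can be sketched as follows. This is an illustration only: `pair_distance` is a plain Euclidean stand-in for the pairwise compatibility computed on masked feature vectors, and the function names and toy vectors are hypothetical.

```python
import math
from itertools import combinations


def pair_distance(u, v):
    """Stand-in for the pairwise compatibility distance (assumed Euclidean)."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))


def outfit_score(items):
    """Average distance over all item pairs; needs at least two items.

    In this sketch a lower score means a more compatible outfit.
    """
    pairs = list(combinations(items, 2))
    return sum(pair_distance(u, v) for u, v in pairs) / len(pairs)


def fill_in_the_blank(partial_outfit, candidates):
    """Index of the candidate with the smallest mean distance to the outfit."""
    def mean_distance(candidate):
        total = sum(pair_distance(candidate, item) for item in partial_outfit)
        return total / len(partial_outfit)

    return min(range(len(candidates)), key=lambda i: mean_distance(candidates[i]))


# Hypothetical toy vectors: a two-item outfit and four candidate answers.
outfit = [[0.0, 0.0], [0.0, 2.0]]
candidates = [[5.0, 5.0], [0.0, 1.0], [3.0, 3.0], [-4.0, 0.0]]
best = fill_in_the_blank(outfit, candidates)
chance_level = 1 / len(candidates)  # 25% with four candidates
```

Both functions rely on the same pairwise distances, which is why replacing the average with a multilayer perceptron for compatibility estimation would add work on top of the pairwise computation rather than replace it.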

Bibliography

[1] Andreas Veit, Serge Belongie, and Theofanis Karaletsos. “Conditional similarity networks”. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. 2017, pp. 830–838.

[2] Mariya I Vasileva et al. “Learning type-aware embeddings for fashion compatibility”. In: Proceedings of the European Conference on Computer Vision (ECCV). 2018, pp. 390–405.

[3] Reuben Tan et al. “Learning Similarity Conditions Without Explicit Supervision”. In: Proceedings of the IEEE International Conference on Computer Vision. 2019, pp. 10373–10382.

[4] Lukas Bossard et al. “Apparel classification with style”. In: Asian conference on computer vision. Springer. 2012, pp. 321–335.

[5] Tong Xiao et al. “Learning from massive noisy labeled data for image classification”. In: Proceedings of the IEEE conference on computer vision and pattern recognition. 2015, pp. 2691–2699.

[6] Xiaodan Liang et al. “Clothes co-parsing via joint image segmentation and labeling with application to clothing retrieval”. In: IEEE Transactions on Multimedia 18.6 (2016), pp. 1175–1186.

[7] Si Liu et al. “Street-to-shop: Cross-scenario clothing retrieval via parts alignment and auxiliary set”. In: 2012 IEEE Conference on Computer Vision and Pattern Recognition. IEEE. 2012, pp. 3330–3337.

[8] Ziwei Liu et al. “Deepfashion: Powering robust clothes recognition and retrieval with rich annotations”. In: Proceedings of the IEEE conference on computer vision and pattern recognition. 2016, pp. 1096–1104.

[9] Yuying Ge et al. “Deepfashion2: A versatile benchmark for detection, pose estimation, segmentation and re-identification of clothing images”. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. 2019, pp. 5337–5345.


[10] Xintong Han et al. “Learning fashion compatibility with bidirectional lstms”. In: Proceedings of the 25th ACM international conference on Multimedia. 2017, pp. 1078–1086.

[11] Yann LeCun et al. “Backpropagation applied to handwritten zip code recognition”. In: Neural computation 1.4 (1989), pp. 541–551.

[12] Alex Krizhevsky, Ilya Sutskever, and Geoffrey E Hinton. “Imagenet classification with deep convolutional neural networks”. In: Advances in neural information processing systems. 2012, pp. 1097–1105.

[13] Ian Goodfellow, Yoshua Bengio, and Aaron Courville. Deep learning. MIT press, 2016.

[14] Olga Russakovsky et al. “Imagenet large scale visual recognition challenge”. In: International journal of computer vision 115.3 (2015), pp. 211–252.

[15] Karen Simonyan and Andrew Zisserman. “Very deep convolutional networks for large-scale image recognition”. In: arXiv preprint arXiv:1409.1556 (2014).

[16] Christian Szegedy et al. “Going deeper with convolutions”. In: Proceedings of the IEEE conference on computer vision and pattern recognition. 2015, pp. 1–9.

[17] Kaiming He et al. “Deep residual learning for image recognition”. In: Proceedings of the IEEE conference on computer vision and pattern recognition. 2016, pp. 770–778.

[18] Xavier Glorot and Yoshua Bengio. “Understanding the difficulty of training deep feedforward neural networks”. In: Proceedings of the thirteenth international conference on artificial intelligence and statistics. 2010, pp. 249–256.

[19] Sergey Ioffe and Christian Szegedy. “Batch normalization: Accelerating deep network training by reducing internal covariate shift”. In: arXiv preprint arXiv:1502.03167 (2015).

[20] Kaiming He and Jian Sun. “Convolutional neural networks at constrained time cost”. In: Proceedings of the IEEE conference on computer vision and pattern recognition. 2015, pp. 5353–5360.


[22] Florian Schroff, Dmitry Kalenichenko, and James Philbin. “Facenet: A unified embedding for face recognition and clustering”. In: Proceedings of the IEEE conference on computer vision and pattern recognition. 2015, pp. 815–823.

[23] Vassileios Balntas et al. “Learning local feature descriptors with triplets and shallow convolutional neural networks.” In: BMVC. Vol. 1. 2. 2016, p. 3.

[24] Christian Szegedy et al. “Rethinking the inception architecture for computer vision”. In: Proceedings of the IEEE conference on computer vision and pattern recognition. 2016, pp. 2818–2826.

[25] Ruining He, Charles Packer, and Julian McAuley. “Learning compatibility across categories for heterogeneous item recommendation”. In: 2016 IEEE 16th International Conference on Data Mining (ICDM). IEEE. 2016, pp. 937–942.

[26] Coen Bron and Joep Kerbosch. “Algorithm 457: finding all cliques of an undirected graph”. In: Communications of the ACM 16.9 (1973), pp. 575–577.


Appendix A

Ethics and Sustainability

In recent years the interest in machine learning and artificial intelligence has rapidly increased. A lot of companies want to incorporate machine learning and artificial intelligence into their business, and this has created a fear among people that they will lose their job to a computer or a robot. Hence, we would like to state that even though the Similarity Condition Embedding Network can decide whether an outfit is fashionable or not, as well as complete an outfit, the network is dependent on fashion experts, since it is trained with labelled data that has been created in collaboration with stylists. This model will therefore not work without the help of fashion experts, and the main idea is to make it easier to create multiple outfits from items that have been paired together by stylists.

Moreover, another big concern for the world right now is sustainability. This is of course a big issue for fashion companies. For instance, these companies have a lot of different people who decide what kind of clothes the company should sell, and these people might not know exactly what kind of garments are in stock right now. Hence, a fashion company might end up with a large number of clothing items that are practically the same, which is not good from an environmental perspective. Thus, it would be beneficial for a company to be able to easily find out how many products it has that are very similar to each other. Our results show that the feature vectors extracted from the general embedding space can be used to cluster clothing items that are similar in both colour and product type. This will hopefully help fashion companies keep track of similar items and, in the end, help them decrease the production of products that are basically the same.
