
Towards color compatibility in fashion using machine learning

XINHUI WANG

Master in Information and Network Engineering
Date: July 9, 2019
Supervisor: Dr. Ying Liu (KTH) & Christer Norström (Norna)
Examiner: Prof. Vladimir Vlassov
School of Electrical Engineering and Computer Science
Host company: Norna AB
KTH


Abstract

Sammanfattning (Swedish Abstract)


Acknowledgement


Contents

1 Introduction
  1.1 Motivation
  1.2 Aim
  1.3 Research Question
  1.4 Contributions
  1.5 Ethics and sustainability
    1.5.1 Ethics
    1.5.2 Sustainability
  1.6 Thesis organization

2 Background
  2.1 Artificial neural network
  2.2 Convolutional neural network
  2.3 Collaborative filtering

3 Related work
  3.1 Semantic segmentation
  3.2 Color compatibility

4 System design
  4.1 Pipeline
  4.2 Deeplab V2
  4.3 Color extraction
    4.3.1 K-means
    4.3.2 Pantone color
  4.4 Matrix factorization
    4.4.1 Principle
    4.4.2 Loss function
    4.4.3 Training algorithm
  4.5 Item-to-item collaborative filtering
    4.5.1 Principle
    4.5.2 Color quantization

5 Datasets
  5.1 ModaNet
    5.1.1 Description
    5.1.2 Data acquisition
    5.1.3 Ground truth generation

6 Experiments and results
  6.1 Semantic segmentation
    6.1.1 Implementation
    6.1.2 Evaluation results
    6.1.3 Model comparison for semantic segmentation
  6.2 Color extraction
    6.2.1 Implementation
    6.2.2 Evaluation results
  6.3 Matrix factorization
    6.3.1 Implementation
    6.3.2 Evaluation results
  6.4 Item-to-item collaborative filtering
    6.4.1 Implementation
    6.4.2 Evaluation results

7 Discussion and conclusions
  7.1 Discussion
  7.2 Conclusions

Chapter 1

Introduction

1.1 Motivation

Fashion understanding and analysis have been a popular topic with great value in business. Tasks such as predicting the most popular clothing styles of the next season, making personal recommendations on clothing items, and online clothing retrieval are all of high interest in the fashion industry [1]. In recent years, machine learning techniques have been shown to play a significant role in the fashion field, especially in trend forecasting, interactive search and recommendation [2]. Ma et al. [3] propose a Bimodal Correlative Deep Autoencoder to build a Fashion Semantic Space and quantitatively learn fashion styles. Kalantidis, Kennedy, and Li [4] formulate an automatic suggestion model for retrieving similar clothing products based on a pose estimation model and a classification model. These attempts give insight into fashion and are influencing how people shop and what they buy [2].

Fashion is to a large degree connected to images. Images from social media platforms, blogs of celebrities, and authoritative magazines or organizations have a significant impact on fashion trends. The abundant visual data extracted from images is beneficial for fashion analysis, and obtaining suitable feature representations is the first step towards downstream fashion-related tasks. To this end, semantic segmentation of clothing, also known as clothing parsing, is a natural way to tackle this problem. Semantic segmentation partitions an image into regions of objects of interest, classifying each pixel to an object; in our case, the objects are different classes of clothing items. With segmentation masks of the different objects, it becomes easier and more effective to extract their individual feature representations, which is crucial for fashion analysis.

Several works [4, 5, 6] utilize deep convolutional models to study the clothing semantic parsing problem, but research in this field has not yet received much attention. In particular, improving accuracy, processing speed and data acquisition remain open tasks to be investigated.

Among all the features of clothing, such as texture, fabric and shape, color is one of the dominant ones. A Canadian study showed that people's first impression of clothing is its color, which is instantaneous and can last long. Zou et al. [7] employ machine learning techniques to build a quantitative model and find that color has a higher influence on clothing fashion updates than several other attributes. Understanding which color or color combination is fashionable in the current season, or predicting popular colors for the next period, is highly valued in the retail business. By learning the compatibility between colors, it is possible to make recommendations. For example, if a person is wearing a red dress and wants to buy a hat, then based on the color combinations learned from fashion data, it is easy and credible to recommend a suitable hat color. Such recommendations improve customers' online shopping experiences and help them find, in a shorter time, the right product tailored to their style from thousands of items.

1.2 Aim

The aim of this thesis work is to implement semantic segmentation on fashion datasets and to develop models that automatically learn which color combinations of clothing items are popular, so that they can be further employed to recommend proper colors. This study investigates color compatibility quantitatively in the fashion field and possibly makes color recommendations. In doing so, it gives insight into fashion trends in color as well as improving customers' online shopping experiences.

1.3 Research Question

This study will answer the following questions:

• How to deploy a semantic segmentation model to segment different clothing items out and extract their dominant colors?

• How to develop models that learn color compatibility quantitatively and recommend proper color combinations?

To answer these questions, several subtasks need to be taken into consideration. Since data availability in fashion is scarce, how to obtain a high-quality, large-scale dataset and make the best use of it is a task of interest. Besides that, how to extract the correct color of each clothing item is also of vital importance.

1.4 Contributions

• We implement a semantic segmentation model on a large-scale and high-quality fashion dataset, ModaNet. Our model achieves state-of-the-art performance and is more generalizable compared to other models proposed in the fashion field.

• We employ item-to-item collaborative filtering to construct a color recommendation system and learn color compatibility in fashion quantitatively.

• Our recommendation system makes high-quality color recommendations with a hit-rate of 0.49 for top-5 recommendations.

1.5 Ethics and sustainability

1.5.1 Ethics

1.5.2 Sustainability

In terms of sustainability, one of the most significant issues lies in the retail business. Fashion changes fast, so the stock of retailers should change with fashion accordingly. Otherwise, retailers are likely to produce far more or far fewer products than the market demands, both of which lead to economic loss. To this end, if we could apply machine learning techniques to understand and predict fashion trends accurately, retailers could adjust their stock in advance so as to keep balance with market demand.

1.6 Thesis organization

Chapter 2

Background

2.1 Artificial neural network

An artificial neural network (ANN) is a mathematical model inspired by biological neurons. It tries to mimic the neural network of the brain and simulates the functionality of neuron activation [8].

The basic units of an ANN are neurons. The neurons are connected by synapses which convey information. The working principle of an artificial neuron is shown in Figure 2.1. Each input signal $x_i$ is multiplied by its associated weight $w_i$ and the sum is passed to the neuron. A positive weight means activation of the neuron while a negative weight means inhibition. Most often a bias is added to the sum, and an activation function is then applied to obtain the output $y$. A mathematical description of a neuron is:

$$y = \varphi\Big(\sum_{i=0}^{m} w_i x_i + b\Big) \quad (2.1)$$
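As a concrete illustration, the sketch below evaluates Equation 2.1 for a single neuron in NumPy; the sigmoid activation and the example values are chosen arbitrarily for demonstration, not taken from the thesis.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def neuron_forward(x, w, b, phi=sigmoid):
    """Single artificial neuron (Equation 2.1): y = phi(sum_i w_i * x_i + b)."""
    return phi(np.dot(w, x) + b)

# Three inputs with arbitrary example weights and bias.
x = np.array([0.5, -1.0, 2.0])
w = np.array([0.1, 0.4, -0.2])
print(neuron_forward(x, w, b=0.3))  # a value in (0, 1) due to the sigmoid
```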

An artificial neural network is composed of multiple neurons, or even multiple layers of neurons, and can solve complex problems. One popular framework is the feed-forward network, where information flows in only one direction, from input to output. The multilayer perceptron (MLP) is a common kind of feed-forward network; its architecture is shown in Figure 2.2. It consists of an input layer that passes the input vector to the network, one or more hidden layers, and an output layer. It has been shown that such networks can be trained to approximate almost any measurable function and to generalize to new, unseen data [9]. Many activation functions can be applied to the network, such as sigmoid, tanh and ReLU [10]. An activation function is usually non-linear, to enable the network to solve complex problems and learn complicated structures. The choice of activation function depends on the functionality one wants to achieve. In many cases, sigmoid is used due to its easily differentiable property, which is important for the back-propagation algorithm [9].

Back-propagation is a learning algorithm that aims to find the best mapping from inputs to target outputs by updating the weights and biases [11]. A cost function is needed to define the loss between the predicted value and the true value. Mean square error (MSE) is a common cost function, especially for regression problems. It is calculated as in Equation 2.2, where $y_i$ represents the true value, $\hat{y}_i$ the predicted value and $K$ the number of neurons in the output layer:

$$\varepsilon(y, \hat{y}) = \sum_{i=0}^{K} (y_i - \hat{y}_i)^2 \quad (2.2)$$

Back-propagation contains two procedures, forward propagation and backward propagation. In forward propagation, the output of each neuron in the hidden and output layers is calculated from Equation 2.1. Backward propagation then proceeds from the output to the input to reduce the error [9]. Gradient descent is used as the optimization algorithm to minimize the cost function. The basic idea is to iteratively update the parameters in the direction of steepest descent, which is defined by the negative gradient [12]. The update rule is shown in Equation 2.3, where $w_{ij}$ denotes the weight from neuron $i$ in the current layer to neuron $j$ in the next layer, $\varepsilon$ denotes the total error, and $\eta$ is the learning rate. However, gradient descent can get stuck at local optima, so techniques such as momentum and regularization terms are applied.

$$w_{ij} \leftarrow w_{ij} - \eta \frac{\partial \varepsilon}{\partial w_{ij}} \quad (2.3)$$
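To make the update rule concrete, here is a minimal NumPy sketch that minimizes the MSE of Equation 2.2 for a single linear neuron using the gradient step of Equation 2.3; the data and learning rate are invented for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 3))        # 100 samples, 3 input features
true_w = np.array([1.0, -2.0, 0.5])
y = X @ true_w                       # targets from a known linear mapping

w = np.zeros(3)                      # weights to learn
eta = 0.05                           # learning rate
for _ in range(200):
    y_hat = X @ w
    grad = 2 * X.T @ (y_hat - y) / len(y)  # d(MSE)/dw, cf. Equation 2.2
    w = w - eta * grad                     # steepest-descent step, Equation 2.3
print(w)                             # converges towards [1.0, -2.0, 0.5]
```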


Figure 2.1: Principle of an artificial neuron [8]

Figure 2.2: Architecture of an MLP [13]

2.2 Convolutional neural network

A convolutional neural network (CNN) is a type of ANN designed for image-focused tasks, especially for dealing with pixel data. Nowadays, it is widely used in fields such as image classification, object detection and segmentation [14]. Figure 2.3 depicts a CNN structure for classification. It takes in an image and, through different kinds of layers, tries to capture spatial dependencies within pixels and learn important features of interest. Unlike regular neural networks, a CNN handles images of different sizes well by reducing them to a form that is easier to process.


Figure 2.3: A CNN for image classification [17]

The core building block is the convolutional layer, in which each learnable kernel slides over the image spatially and computes dot products between the kernel and the area of the input covered by the kernel. With each kernel, a convolutional layer produces a 2D activation map, whose size depends on the stride and padding. All the activation maps are stacked together along the depth dimension and passed to the next layer [15]. Neurons in a layer are connected to only a small region of the previous layer instead of in a fully-connected manner, which differs from a regular neural network. In this way, convolutional layers extract image features for making good predictions. Specifically, the first few convolutional layers learn low-level features, such as edges and corners; with more layers, the network can capture high-level semantic features that reflect a concrete understanding of images, as humans have.


Figure 2.4: Pooling layer [18]

2.3 Collaborative filtering

Nowadays, recommendation systems are almost everywhere in people's lives. They aim to predict people's future preferences over a set of items and recommend those of highest interest to a user. This is a form of personalization, making it easier for people to choose the right thing from thousands of items on the web. For instance, when people buy products on Amazon, it automatically recommends other products they may be interested in according to their shopping histories as well as general taste.

There are two main approaches to collaborative filtering: neighborhood methods and latent factor methods. Latent factor methods map both users and items into a common low-dimensional space, where the property captured by each dimension is called a latent factor. For a specific user, the model can recommend an item that is close to the user in the latent space.

(a) Neighborhood method (b) Latent factor method

Chapter 3

Related work

3.1 Semantic segmentation

Semantic segmentation of fashion images, also known as clothing parsing, segments different clothing items out by classifying each pixel into a clothing category. It has been a hot topic in computer vision due to the huge potential of related applications, such as clothing retrieval and recommendation [4, 21, 22], and a lot of research has been carried out in this field.

Kalantidis, Kennedy, and Li [4] present a framework that starts with articulated pose estimation to segment the human area and then clusters possible image regions to classify different clothing classes. Based on the segmentation results, they extract feature representations of color and texture and implement clothing retrieval.

Tangseng, Wu, and Yamaguchi [6] extend the fully-convolutional network (FCN) architecture with a side-branch network called an outfit encoder. They train the encoder to learn combinatorial preferences in clothing parsing and apply a conditional random field (CRF) as post-processing to fine-tune the contours. Liu et al. [21] propose a weakly supervised parsing method that relies on image-level tags instead of pixel-level annotations. They combine a human estimation module for detecting human key points, a Markov random field (MRF)-based color and category module, and an SVM category classifier to perform clothing parsing on images with image-level annotations.

In this study, we deploy Deeplab V2, a state-of-the-art semantic segmentation model with deep learning that has remarkable performance on several challenging datasets. Currently proposed semantic segmentation models for fashion images are trained on relatively small datasets, such as Fashionista and CFPD, and the quality of these datasets is not very high, with noisy annotations and inconsistent contours. In this thesis, we train Deeplab V2 on ModaNet, a brand-new, high-quality dataset that is currently the largest fashion dataset for semantic segmentation. Having adequate fashion images helps to improve the performance of the model.

3.2 Color compatibility

Color compatibility theory has been investigated as a research topic for a long time. Understanding preferences for color aesthetics is of vital significance for design-related industries, such as advertising and fashion [23].

O'Donovan, Agarwala, and Hertzmann [24] study people's color preferences by calculating distances between hue palettes on three datasets containing a large number of color themes with people's ratings. They also develop a linear regression model for predicting ratings and learn which features of color themes matter most. In [23], they further study personal preferences for different color themes, building a linear feature-based matrix factorization model to predict individual ratings of different color palettes. Phan, Fu, and Chan [25] introduce a divide-and-conquer sorting algorithm to arrange the colors of a palette in a coherent order, so that density estimation, interpolation and prediction on palettes become meaningful. This is similar to our idea of sorting the color palettes of clothing items in a fixed order.

Kita and Miyata [26] construct a statistical model that can predict ratings of palettes with any number of colors, breaking the restriction of fixed-size palettes. In addition, the model allows extending a given palette to any length while maintaining the harmony of the colors.

Chapter 4

System design

In this chapter, we first describe the pipeline of our model and then explain the principle of each part separately, following the flow of the pipeline: the semantic segmentation model, color extraction, matrix factorization and item-to-item collaborative filtering.

4.1 Pipeline

The pipeline of our model is displayed in Figure 4.1. The input is images from a fashion dataset; here we use ModaNet, a high-quality street fashion database with pixel-level annotations. First, we train a semantic segmentation model on fashion images to segment the clothing items of interest. Given the segmentation maps, we deploy color extraction methods to extract a dominant color for each item and quantize the colors to a limited number. In this study, we select eight clothing categories that are of the most interest in real-world commerce applications and order their colors in a top-to-bottom sequence according to where people usually wear them. In this way, from every image we obtain an ordered color palette of eight clothing items (leaving the place of a non-existing item blank). These palettes are the feature representations of colors. Then, we propose two color recommendation methods based on the generated features, matrix factorization and item-to-item collaborative filtering, to analyze color compatibility in fashion and recommend proper color combinations.


Figure 4.1: Pipeline of the model

4.2 Deeplab V2

Deeplab is a state-of-the-art family of models for semantic segmentation with deep learning. Deeplab V2 improves on the first version and achieves remarkable results on several challenging segmentation datasets. It overcomes three main problems in applying deep convolutional neural networks to segmentation tasks by utilizing convolution with upsampled filters, called atrous convolution, atrous spatial pyramid pooling (ASPP), and fully connected conditional random fields (CRF) [27].


Figure 4.2: Pipeline of Deeplab V2 [27]

4.3 Color extraction

4.3.1 K-means

Given the segmentation of clothing items, we can extract the color of each item separately; here, we use one dominant color to represent every item. K-means is a standard approach to extract colors from an image. It is a clustering method through which a set of data points can be grouped into K disjoint subsets, where the points in each subset are close in distance. Specifically, the K-means algorithm works in the following way:

• Randomly select K points as the initial centroids of the subsets. The choice of K is determined by the number of groups into which the data should be separated.

• Assign each data point to the closest group based on a chosen metric (Euclidean distance is a common one).

• Recalculate the mean of the points in each subset and set it as the new centroid of the group, then reassign the data points to the clusters according to the new centroids.

• Repeat the previous step until the centroids converge.

We perform the clustering in the CIELAB (Lab) color space, which is designed so that distances correspond more closely to the human perception of lightness and color. The Lab color model is a three-axis system, as shown in Figure 4.3. The vertical L* axis represents lightness, from the darkest black at L* = 0 to the brightest white at L* = 100. The a* axis goes from green to red while the b* axis goes from blue to yellow. In practice, the values of a* and b* usually range from -128 to +127.

In our case, for every segmentation mask of a clothing item, we first map the mask to the original image to obtain the RGB values of the covered pixels, then transform the colors to Lab space and run K-means to cluster the pixels. We set K = 5 and choose the centroid of the subset with the largest number of pixels as the dominant color of the segmented clothing. The centroid is finally transformed back to RGB space for later processing.
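A minimal sketch of this procedure is shown below, using scikit-learn's KMeans and scikit-image's RGB-Lab conversions; the function name and the exact pre-processing are our own simplifications rather than the thesis code.

```python
import numpy as np
from sklearn.cluster import KMeans
from skimage import color

def dominant_color(image_rgb, mask, k=5):
    """Dominant color of the masked garment: cluster the pixels in Lab
    space with K-means (K = 5) and return the centroid of the largest
    cluster, converted back to RGB.

    image_rgb: HxWx3 uint8 image; mask: HxW boolean segmentation mask.
    """
    pixels = image_rgb[mask].astype(np.float64) / 255.0       # masked RGB in [0, 1]
    lab = color.rgb2lab(pixels[np.newaxis, :, :])[0]          # convert to Lab
    km = KMeans(n_clusters=k, n_init=10, random_state=0).fit(lab)
    largest = np.bincount(km.labels_).argmax()                # biggest cluster
    centroid = km.cluster_centers_[largest]
    rgb = color.lab2rgb(centroid[np.newaxis, np.newaxis, :])[0, 0]
    return (rgb * 255).astype(np.uint8)
```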

Figure 4.3: CIELAB color space [30]

4.3.2 Pantone color

The extracted dominant colors are mapped to the Pantone color system, a standardized palette widely used in the fashion industry. For each Pantone color we can find its corresponding RGB values, so the extracted dominant color of a clothing item is mapped to its closest Pantone color in Euclidean distance.
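The mapping itself is a nearest-neighbor lookup; a sketch is given below, where `pantone_rgb`, an N×3 table of Pantone RGB values, is a hypothetical input that would have to be supplied from a color reference.

```python
import numpy as np

def nearest_palette_color(rgb, pantone_rgb):
    """Map an extracted RGB color to the closest entry of a reference
    palette (e.g. Pantone RGB values) by Euclidean distance."""
    rgb = np.asarray(rgb, dtype=np.float64)
    pantone_rgb = np.asarray(pantone_rgb, dtype=np.float64)  # hypothetical Nx3 table
    dists = np.linalg.norm(pantone_rgb - rgb, axis=1)
    return pantone_rgb[dists.argmin()]
```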

4.4 Matrix factorization

4.4.1 Principle

Matrix factorization is a popular method for producing product recommendations. It works as a latent factor model, as discussed in section 2.3. Basically, it exploits the interactions between users and items by finding their latent factors, which characterize user preferences and item patterns in a lower dimension [20]. In this sense, matrix factorization is a dimension reduction approach that decomposes the user-item interaction matrix into the product of two low-rank rectangular matrices. To illustrate the principle clearly, we use the example of predicting movie ratings, as shown in Figure 4.4. Here, the user-item matrix records the ratings users give to movies. The matrix is sparse because users usually watch only a small portion of all movies. The goal is to predict the ratings that users would give to movies they have not seen before and recommend those they may be interested in. We assume each user can be characterized by N latent features; for example, one feature might be how much a user likes comedy movies. Similarly, each item is described by N latent attributes, and the corresponding feature might be how much the movie is related to comedy. The proper number of latent factors is not deterministic; it is a hyper-parameter we need to tune during training. The training process finds a user matrix and an item matrix such that their product is as close as possible to the original user-item matrix.


Figure 4.4: Matrix factorization of movie ratings [31]

Figure 4.5: Matrix factorization of color palettes

4.4.2 Loss function

Each row of the user matrix is denoted as a user vector $p_u$ and each column of the item matrix as an item vector $q_i$. The predicted value of user $u$ for item $i$ is calculated by Equation 4.1:

$$\hat{r}_{ui} = q_i^T p_u \quad (4.1)$$

In practice, however, the data often has biases associated with users and items [20], which influence the tendency of the values: some users tend to give higher values than others and some items tend to receive low values. Therefore, we add a global bias $\mu$, a user bias $b_u$ and an item bias $b_i$ when predicting $\hat{r}_{ui}$, as in Equation 4.2, where $\mu$ is the average of all known values in the user-item matrix and $b_u$, $b_i$ are updated during training:

$$\hat{r}_{ui} = \mu + b_i + b_u + q_i^T p_u \quad (4.2)$$

Besides, in order to avoid overfitting and make the model more generalizable, we add a regularization term. The loss function is:

$$L = \sum_{u,i} \left(r_{ui} - \mu - b_i - b_u - q_i^T p_u\right)^2 + \lambda\left(\|p_u\|^2 + \|q_i\|^2 + b_u^2 + b_i^2\right) \quad (4.3)$$

where $r_{ui}$ ranges over all the known values in the training set and $\lambda$ controls the degree of regularization.

4.4.3 Training algorithm

There are two common approaches to minimize the loss function in matrix factorization: stochastic gradient descent (SGD) and alternating least squares (ALS). Here, we use SGD because it is in general faster and easier to converge than ALS [32].

Stochastic gradient descent takes an individual sample at a time and calculates the derivative of the loss function with respect to each parameter. According to Equation 4.3, the update rules for the user and item vectors and the bias terms are:

$$p_u \leftarrow p_u + \eta(e_{ui} \cdot q_i - \lambda \cdot p_u) \quad (4.4)$$

$$q_i \leftarrow q_i + \eta(e_{ui} \cdot p_u - \lambda \cdot q_i) \quad (4.5)$$

$$b_u \leftarrow b_u + \eta(e_{ui} - \lambda \cdot b_u) \quad (4.6)$$

$$b_i \leftarrow b_i + \eta(e_{ui} - \lambda \cdot b_i) \quad (4.7)$$

where $e_{ui} = r_{ui} - \mu - b_i - b_u - q_i^T p_u$ denotes the prediction error and $\eta$ is the learning rate.
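The following sketch implements Equations 4.2-4.7 for a dense matrix with missing entries marked as NaN; the default hyper-parameters mirror those reported later in section 6.3.1, but the code itself is an illustrative reimplementation, not the thesis code.

```python
import numpy as np

def train_mf(R, n_factors=10, eta=0.001, lam=0.01, n_iters=3000, seed=0):
    """SGD training of biased matrix factorization (Equations 4.2-4.7).

    R: user-item matrix with np.nan for missing entries.
    """
    rng = np.random.default_rng(seed)
    n_users, n_items = R.shape
    P = 0.1 * rng.standard_normal((n_users, n_factors))  # user vectors
    Q = 0.1 * rng.standard_normal((n_items, n_factors))  # item vectors
    b_u, b_i = np.zeros(n_users), np.zeros(n_items)
    mu = np.nanmean(R)                                   # global bias
    known = list(zip(*np.where(~np.isnan(R))))           # observed (u, i) pairs
    for _ in range(n_iters):
        u, i = known[rng.integers(len(known))]           # one sample at a time
        pu, qi = P[u].copy(), Q[i].copy()
        e = R[u, i] - (mu + b_u[u] + b_i[i] + qi @ pu)   # prediction error e_ui
        P[u] += eta * (e * qi - lam * pu)                # Equation 4.4
        Q[i] += eta * (e * pu - lam * qi)                # Equation 4.5
        b_u[u] += eta * (e - lam * b_u[u])               # Equation 4.6
        b_i[i] += eta * (e - lam * b_i[i])               # Equation 4.7
    return P, Q, b_u, b_i, mu
```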

4.5 Item-to-item collaborative filtering

4.5.1 Principle


In our work, we treat the different colors as items. By exploring the similarity between colors, we can learn which color pairs are more likely to occur together in people's outfits, i.e., which combinations are good and compatible in general taste. Meanwhile, we can recommend colors to a user based on the colors of the clothing he or she is wearing. Starting from the user-item matrix, the first step is to build a co-occurrence matrix that counts the occurrences of each pair of colors by iterating over all color pairs of all users. Figure 4.6 (a) shows a simple illustration of a co-occurrence matrix, assuming only four items (colors). The entry at row Color2 and column Color1 is 3, indicating that Color1 and Color2 appear together three times in the dataset. Here, the co-occurrence matrix is symmetric because we treat all colors equally and do not consider which clothing item each color belongs to.

Given the co-occurrence matrix, the next step is to compute a similarity matrix. In our case, the similarity between color pairs measures the compatibility of color combinations. One common similarity measure between two items is the cosine measure:

$$w_{ij} = \frac{|N(i) \cap N(j)|}{\sqrt{|N(i)|\,|N(j)|}} \quad (4.8)$$

where $|N(i)|$ is the number of times Color $i$ occurs, $|N(j)|$ the number of times Color $j$ occurs, and $|N(i) \cap N(j)|$ the number of times Color $i$ and Color $j$ co-occur, which can be read from the co-occurrence matrix. Figure 4.6 (b) shows the similarity matrix calculated from the co-occurrence matrix in Figure 4.6 (a); for example, the similarity between Color1 and Color2 is $\frac{3}{\sqrt{(3+4)(3+5+2)}} = 0.39$. Equation 4.8 describes how often Color $i$ occurs when Color $j$ shows up; in other words, it estimates the probability of Color $i$ being paired with Color $j$, and a higher value in the similarity matrix indicates that the two colors are very likely to occur together, i.e., they are a good combination in fashion. For a query color, by looking up its column in the similarity matrix and finding the N largest values, we can recommend the top N colors that are compatible with the query color.
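The construction is easy to express in code. The sketch below builds the co-occurrence counts and the cosine similarity of Equation 4.8 from a list of outfits, each given as the quantized color indices worn together in one image; this representation is our own simplification.

```python
import numpy as np

def color_similarity(outfits, n_colors):
    """Co-occurrence and cosine similarity (Equation 4.8) of colors.

    outfits: iterable of lists of color indices seen together in one image.
    """
    C = np.zeros((n_colors, n_colors))        # co-occurrence counts
    occ = np.zeros(n_colors)                  # per-color occurrence counts
    for colors in outfits:
        for c in set(colors):
            occ[c] += 1
        for a in colors:
            for b in colors:
                if a != b:
                    C[a, b] += 1
    with np.errstate(divide="ignore", invalid="ignore"):
        W = C / np.sqrt(np.outer(occ, occ))   # w_ij = |N(i) ∩ N(j)| / sqrt(|N(i)||N(j)|)
    return np.nan_to_num(W), C

# For a query color q, the top-N compatible colors are the indices of the
# N largest values in column q of W.
```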

4.5.2 Color quantization

(a) Co-occurrence matrix (b) Similarity matrix

Figure 4.6: An example of creating a co-occurrence matrix and a similarity matrix

Figure 4.7: Color quantization method [35]

The Pantone mapping still leaves a large number of distinct colors; for item-to-item collaborative filtering, this number of colors is too large. In order to prevent the probability matrix from getting too sparse, we need to quantize the colors to a relatively small number. Meanwhile, the quantized colors should be sorted smoothly with a regular pattern, so that color behavior is easier to read off the similarity matrix. Based on these two requirements, we find the colors in Figure 4.7 a good quantization scheme, with 117 colors in total. Figure 4.8 sorts the colors in one dimension: every run of nine consecutive colors represents one dominant hue at different levels of brightness. We utilize this color sequence to build our color-to-color similarity matrix.

Chapter 5

Datasets

5.1 ModaNet

5.1.1 Description

ModaNet is a new large-scale fashion dataset proposed by eBay, built on top of the Paperdoll dataset [36]. It consists of 55,176 high-quality, fully-annotated street fashion images and is currently the largest fashion dataset for semantic segmentation and object detection [1]. Specifically, it has the following advantages over other datasets. First, ModaNet is 10× larger than previously proposed fashion datasets with pixel-level annotations [21, 37], which are usually limited to hundreds or thousands of images; an adequate number of images makes it possible to train state-of-the-art deep learning segmentation models. Second, the images are carefully selected using data cleaning techniques: Zheng et al. [1] run Faster-RCNN to collect images containing only one person and train a classifier to detect images of sufficient quality for annotation. Third, the images in the dataset cover diverse human poses, so they can be used in more scenarios.

ModaNet groups highly-related categories into 13 meta categories that are of the most interest in real life and e-commerce. Statistically, the most popular objects with the highest occurrences in the dataset are footwear, top, outer, pants and bags. Furthermore, images usually contain 3 to 5 fashion objects, with an average of 4 objects [1].

5.1.2 Data acquisition

ModaNet is selected on top of the Paperdoll dataset. In [1], the authors provide a JSON file containing all image ids and their corresponding segmentation information; the image ids are the same ids as in the Paperdoll dataset. The first step is therefore to match the ids in the ModaNet annotation file to the ids in the Paperdoll file, which leads to the URL links from which the fashion images can be downloaded. Some of the ids in ModaNet do not exist in the Paperdoll dataset, and some URLs are no longer available because they refer to the Chictopia website, the largest fashion website, where people can post their images and also delete them. After this pre-processing, we finally downloaded 28,628 images, which is still a considerable amount for training a segmentation model.

5.1.3 Ground truth generation

The segmentation annotations are provided as polygons that only record the vertices, or key points, of each category, which saves a lot of memory compared to storing a full segmentation map of the same size as the image. The annotation format of ModaNet follows the same style as the COCO dataset [38], and the COCO API provides functions for transforming polygons into segmentation maps. One thing to mention is that an object can be covered by multiple polygon annotations; in that case we need to merge them and label them with the same category. Some generated ground truth masks are shown in Figure 5.1. For visualization, we use different colors to represent different categories, as depicted in Figure 5.2. For training the Deeplab model, the ground truth masks are transformed into 2-D arrays where each element is the class number of the corresponding image pixel.
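A minimal sketch of the conversion, assuming COCO-style annotation dictionaries and the pycocotools mask utilities (the helper names are ours):

```python
import numpy as np
from pycocotools import mask as mask_utils

def polygons_to_mask(polygons, height, width):
    """Binary mask of one object from its COCO-style polygon list;
    multiple polygons covering the same object are merged."""
    rles = mask_utils.frPyObjects(polygons, height, width)
    return mask_utils.decode(mask_utils.merge(rles))       # HxW uint8

def ground_truth_map(annotations, height, width):
    """2-D class map for training Deeplab: each element holds the
    category id of the pixel (0 = background)."""
    gt = np.zeros((height, width), dtype=np.uint8)
    for ann in annotations:   # ann: {"segmentation": [...], "category_id": int}
        m = polygons_to_mask(ann["segmentation"], height, width)
        gt[m > 0] = ann["category_id"]
    return gt
```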

Figure 5.1: Instances of ground truth with pixel-level annotations [1]

Chapter 6

Experiments and results

In this chapter, we describe our experiments and evaluation results, following the pipeline shown in Figure 4.1. First, we present the semantic segmentation model with implementation details and both quantitative and qualitative results. Secondly, we show some color extraction examples. Thirdly, we discuss the implementation of matrix factorization and display the results. Fourthly, we explain how to apply item-to-item collaborative filtering to the extracted color pairs and evaluate the method with hit-rate and top-5 color recommendations.

To be clear, we will go through the following experiments:

• semantic segmentation model
• color extraction approach
• matrix factorization
• item-to-item collaborative filtering

6.1 Semantic segmentation

6.1.1 Implementation

We randomly select 25,000 images as the training set and 3,628 images as the test set. We implement the Deeplab V2 model in the TensorFlow framework and use ResNet-101 [39] pre-trained on ImageNet as the backbone. Good weight initialization is very important in such deep models; a random initialization may not lead to convergence. It is shown in [1] that adopting ResNet-101 gives better segmentation results than VGG-16. The model is trained on an Nvidia GeForce RTX 2080 with a batch size of 4 due to GPU memory limitations. For the learning rate we choose the "poly" policy, which is found to be more effective than the "step" policy (gradually decreasing the learning rate with a fixed step size) [1]. The "poly" policy is defined in Equation 6.1, where $lr$ denotes the initial learning rate, $iter$ the current iteration, $max\_iter$ the total number of iterations, $lr_{iter}$ the learning rate at the current iteration, and $power$ the decay rate. Here, we set $lr = 2.5 \times 10^{-4}$, $power = 0.9$ and $max\_iter = 600{,}000$, which equals 96 epochs. The loss function is the cross entropy between the predicted mask and the ground truth mask.

$$lr_{iter} = lr \times \left(1 - \frac{iter}{max\_iter}\right)^{power} \quad (6.1)$$
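Equation 6.1 is straightforward to implement; a minimal sketch with the hyper-parameters above:

```python
def poly_lr(iteration, base_lr=2.5e-4, max_iter=600000, power=0.9):
    """Learning rate at a given iteration under the "poly" policy (Eq. 6.1)."""
    return base_lr * (1.0 - iteration / max_iter) ** power

print(poly_lr(0))        # 2.5e-4 at the start of training
print(poly_lr(300000))   # roughly half-way decayed
print(poly_lr(599999))   # close to 0 at the end
```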

The standard evaluation metrics we use are mean intersection over union (mIoU) and mean pixel accuracy. IoU measures the area of overlap between the target and the predicted mask divided by the area of their union. It is calculated by Equation 6.2, where $TP$, $FP$ and $FN$ are true positives, false positives and false negatives, respectively. mIoU is the average IoU over all classes (14 classes in our case).

$$IoU = \frac{target \cap prediction}{target \cup prediction} = \frac{TP}{TP + FP + FN} \quad (6.2)$$

Pixel accuracy is the percentage of pixels in the image that are correctly classified, defined in Equation 6.3, where $TN$ is the number of true negatives. Mean pixel accuracy is the average over all classes.

$$Pixel\ accuracy = \frac{TP + TN}{TP + TN + FP + FN} \quad (6.3)$$
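Both metrics can be computed from a per-class confusion matrix; the sketch below is a simplified NumPy version of Equations 6.2 and 6.3, not the evaluation code used in the thesis.

```python
import numpy as np

def segmentation_metrics(conf):
    """mIoU (Eq. 6.2) and accuracies (Eq. 6.3) from a CxC confusion
    matrix where conf[t, p] counts pixels of true class t predicted as p."""
    tp = np.diag(conf).astype(np.float64)
    fp = conf.sum(axis=0) - tp            # false positives per class
    fn = conf.sum(axis=1) - tp            # false negatives per class
    iou = tp / np.maximum(tp + fp + fn, 1)
    class_acc = tp / np.maximum(conf.sum(axis=1), 1)  # per-class accuracy
    pixel_acc = tp.sum() / conf.sum()     # overall fraction of correct pixels
    return iou.mean(), class_acc.mean(), pixel_acc
```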

6.1.2 Evaluation results

Quantitative results


Figure 6.1 shows the IoU and accuracy of our training and test sets using Deeplab V2, together with the results from [1]. Our trained model evaluated on the test set has a reasonable performance, with an mIoU of 0.64 and an accuracy of 0.96. Categories that cover small areas in images, such as belt, footwear and scarf, have relatively low IoU. This may be because such small objects are hard to detect due to the small number of pixels they account for, and they sometimes tend to be over-smoothed to ensure category consistency. Looking at the commonly used fashion datasets for semantic segmentation, Fashionista [6] and CFPD [21], the models in [6, 42] achieve an mIoU of around 0.50-0.54 on these two datasets. From this point of view, we believe that our model trained on ModaNet has promising performance in the field of semantic segmentation in fashion. Moreover, comparing our test results with those reported in [1] on ModaNet, our model performs better. Although Deeplab V3 is a more advanced model than the Deeplab V2 we use, in [1] the network is only fine-tuned: the weights of all convolutional layers are kept unchanged and only the last few fully connected layers are trained on ModaNet. This is a form of transfer learning, but it may suffer from overfitting. Instead, we train the whole Deeplab V2 network starting from the weights pre-trained on ImageNet. Training takes longer, but the model converges gradually and is less prone to overfitting, as validated by the mIoU and accuracy of the training and test sets.

Figure 6.1: IoU and accuracy of our training set, test set, and the results implemented in [1] using Deeplab V3

Qualitative results


Figure 6.2 shows two examples with good predictions. In Figure 6.2 (a), the prediction is fairly good even without CRF, except for some boundaries of the top around the neck, which are smoothed over after CRF. In Figure 6.2 (b), the model misclassifies part of the outer as scarf and detects a few pixels of footwear as boots; the CRF does a great job of removing these wrong labels. Admittedly, some errors remain at the boundaries of the top and the bag, which overlap in this case.

Figure 6.3 shows an example of misclassification and the over-smoothing influence of CRF. The model predicts most of the outer as top because the outer is folded into an unusual shape; top and outer sometimes tend to be confused. Moreover, after applying CRF, parts of the sunglasses and footwear appear over-smoothed. CRF tries to maintain category consistency in continuous regions in order to remove misclassified predictions, but it can also filter out correctly-predicted small objects. Thus, choosing suitable CRF parameter configurations is an essential task for future work.

In Figure 6.4, we plot two instances of misclassified and ambiguous predictions. In Figure 6.4 (a), the model predicts the outer as dress, with some misclassified pixels of top that are smoothed away after CRF. In fact, it is not obvious from the ground truth that there is an outer over a dress, and it is plausible to regard it as a dress. Figure 6.4 (b) shows a case where the model predicts a top as a dress; however, the annotation itself is ambiguous, and it is hard to distinguish whether it is a top or a dress even for humans.

6.1.3 Model comparison for semantic segmentation

Figure 6.2: Good examples for two test images. From left to right: input image [1], ground truth segmentation, predicted segmentation, and the prediction after CRF


We compare our model with two state-of-the-art models using 5-fold cross validation: we train the model 5 times with different training data and evaluate on 5 disjoint test sets. Table 6.1 shows the accuracy and mIoU of the two state-of-the-art models and of our model trained on the CFPD dataset. Our model achieves competitive results: its accuracy is close to the highest value and its mIoU is around 0.01 lower than that of [43]. Moreover, our results come from cross validation while the other two are tested on a single random split, which may not be as convincing.

Table 6.1: Accuracy and mIoU of two state-of-the-art models and our model trained on the CFPD dataset

Model      | Accuracy | mIoU without CRF | mIoU with CRF
OE [6]     | 0.915    | 0.514            | 0.547
FPN [43]   | 0.935    | 0.530            | 0.544
Our model  | 0.932    | 0.519            | 0.534

(a) Train on CFPD, test on ModaNet: accuracy = 0.93, mIoU = 0.52

(b) Train on ModaNet, test on ModaNet: accuracy = 0.96, mIoU = 0.66

Figure 6.5: Confusion matrices for models trained on different datasets and tested on ModaNet

In conclusion, we can derive two insights from this comparison. First, the quality of ModaNet is better than that of CFPD: ModaNet is much larger, and CFPD has been found to contain noisy labels and inconsistent ground truth boundaries. In this sense, ModaNet is more suitable for training a semantic segmentation model. Second, since we use a better dataset, our Deeplab model trained on ModaNet generalizes better and can be applied to more scenarios.

6.2 Color extraction

6.2.1 Implementation

(a) Train on CFPD, test on CFPD: accuracy = 0.96, mIoU = 0.62

(b) Train on ModaNet, test on CFPD: accuracy = 0.96, mIoU = 0.64

Figure 6.6: Confusion matrices for models trained on different datasets and tested on CFPD

We extract colors based on the segmentation results. Out of the 13 categories the model can segment, we choose the 8 main categories of most interest and extract their colors. Small categories that do not appear very often in the dataset, such as belt, hat and scarf, are not taken into consideration. In addition, we place the dominant colors of the 8 categories in a fixed order, from top to bottom, according to where people wear the clothing. The colors come in the following sequence: top, outer, dress, pants, skirt, bag, boots, footwear.

6.2.2 Evaluation results

We show some examples of color extraction. Figure 6.7 displays a successful instance, where (a) is the input image and in (b) the upper row is the extracted color palette using K-means and the bottom row is the palette after quantization. We use "/" to indicate that a category is not present in the input image. In this case, K-means performs very well in extracting the colors of clothing with pure, simple patterns, and the quantized colors are quite similar to the original colors, with little distortion.


(a) Input image [1]

(b) Color palettes

Figure 6.7: A successful example of color extraction

Figure 6.8 shows an example where K-means does not work well. When a garment has a complex, multi-colored pattern, K-means splits the pixels into different small groups to balance the total distances; in this case it extracts a color that is different from what we directly perceive from the dress.

(a) Input image [1]

(b) Color palettes

Figure 6.8: An example when K-means doesn’t work well

6.3 Matrix factorization

6.3.1 Implementation

Table 6.2: RMSE of the training and test sets using different numbers of latent factors

             | Latent factors = 10 | Latent factors = 20
Training set | 0.084               | 0.078
Test set     | 0.161               | 0.157

We flatten these palettes to create the user-item interaction matrix, which records the colors in RGB space. We thus have 24 items, each representing one color channel of one clothing item, and 28,628 users (images), resulting in a 28,628 × 24 user-item matrix.

We normalize the RGB values to the range 0 to 1 (always positive values) and fill the missing entries of the matrix with 0. Note that 0 is just a marker for missing entries; it does not mean the values are truly 0. To split training and test sets, we randomly remove 3 color values per user from the training data and use them as the test set. We set the regularization term λ to 0.01 and the learning rate η to 0.001. For the number of latent factors, we train the model with 10 and with 20 factors, for 3,000 iterations each. At each iteration, the root mean square error (RMSE) between the predicted and true values is calculated.

6.3.2 Evaluation results

In addition, we plot some examples of color palettes predicted by matrix factorization. As shown in Figure 6.10, the predictions for the four known colors are relatively similar to the original colors, except for the footwear. In the completed color palettes, the model suggests dark colors for pants and boots and warm light colors for the other items. Visually, the predicted colors show a certain harmony with the extracted colors, but it is still hard to validate how good the color combinations are. Figure 6.11 shows another case where the model predicts a visually compatible result: the generated color palette is in a bright pink tone, in accordance with the person's original outfit. In Figure 6.12, the difference between the prediction and the extracted colors is relatively large, and the model tends to recommend cold colors. In fact, we find that in cases where the extracted dominant colors are not very distinctive (common colors such as black and gray), the model has a preference for suggesting cold colors. One reason is that cold colors occur very often in the dataset, meaning they are popular colors in general taste. Another possible reason is that cold colors usually have small values in one or two RGB channels, and predicting small values may help the model decrease its loss.

To sum up, matrix factorization does not seem to be a proper method for recommending colors: the loss of the model remains relatively high, and since we do not know what feature each latent factor represents, it is difficult to evaluate the recommended colors.

(a) Latent factors=10 (b) Latent factors=20


(a) Input image [1] (b) Color palettes

Figure 6.10: Recommending colors using matrix factorization

(a) Input image [1] (b) Color palettes

Figure 6.11: Recommending colors using matrix factorization

6.4 Item-to-item collaborative filtering

6.4.1 Implementation


(a) Input image [1] (b) Color palettes

Figure 6.12: Recommending colors using matrix factorization

We first build the co-occurrence matrix without considering which clothing category each color comes from, so that the color pairs are category-unrelated. To avoid bias, from the 8 categories we choose the colors of the 4 categories that have the most even color distributions: top, outer, dress and skirt. The other categories, such as pants and boots, show quite skewed color distributions in real life, because the majority of these items are black or brown.

The evaluation metric we use here is the hit-rate (HR). Hit-rate [44, 45, 46] is a standard metric for top-N recommendation algorithms. To split the training and test sets, we randomly select one of the extracted colors of each user as the test set and use the rest as the training set. Then, for each user, we make top-N color recommendations based on the colors in the training set that he or she is wearing. The hit-rate is:

$$HR = \frac{\text{number of hits}}{n} \quad (6.4)$$

where a hit means that the held-out test color appears among the top-N recommendations, and $n$ is the total number of test colors.
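Evaluating Equation 6.4 is a simple count; below is a sketch under the split described above, where each user has exactly one held-out color (the function name is ours).

```python
def hit_rate(held_out, top_n_recs):
    """Hit-rate (Equation 6.4): fraction of held-out test colors that
    appear in the corresponding top-N recommendation lists.

    held_out: one hidden color index per user;
    top_n_recs: the top-N recommended color indices per user.
    """
    hits = sum(1 for h, recs in zip(held_out, top_n_recs) if h in recs)
    return hits / len(held_out)
```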


6.4.2 Evaluation results

Results for ModaNet

For better visualization, we plot the heatmap of the generated color similarity matrix, shown in Figure 6.13. As discussed in section 4.5.1, the similarity matrix represents the probability of occurrence of one color when it is paired with other colors, so we also call it the probability matrix. In the probability matrix, we leave the places where the value is 0 blank. The colormap on the right side shows the range from 0 to the maximum probability value; note that the probability here is a relative value and is not normalized. The order of the colors in the matrix is as in Figure 6.14. From the probability matrix, we can obtain some insights into color combinations in fashion:

• In the bottom part and on the right side of the heatmap, the colors indexed from 108 to 116 have relatively uniform distributions over the other colors. This means that white, gray and black are compatible with almost all colors. In the bottom-right corner in particular, the values in the square are higher than elsewhere, indicating that these colors are most compatible among themselves. For example, when the query color is white, the model will tend to recommend white, gray and black as good combinations.

• In the upper-left corner (colors indexed from 0 to 26) and in the middle (colors indexed around 70), there are concentrations of values, showing that red, orange and blue are popular colors in the dataset.

Figure 6.13: Probability matrix of colors using the ModaNet dataset


Figure 6.15: Some examples of top-5 color recommendations

Our model achieves an HR value of 0.49, which means that almost half of the colors in the test set are successfully recalled in the top-5 recommendations. According to [44, 45], the HR value varies a lot between datasets: for a large dataset that contains thousands or millions of items, an HR value around 0.2 is considered good, and with fewer items the value may be higher. In our case, an HR value of 0.49 for 117 items is quite reasonable.

Results for CFPD and DeepFashion

(a) Probability matrix using CFPD (b) Probability matrix using DeepFashion

Figure 6.16: Probability matrices of colors using CFPD and DeepFashion

By obtaining the probability matrices of color combinations of two more datasets, CFPD and DeepFashion [47], we can get a better understanding of color compatibility.

Figure 6.16 (a) plots the probability matrix of colors using CFPD. It looks quite similar to the probability matrix using ModaNet, following the same color patterns. Figure 6.16 (b) shows the probability matrix of colors using DeepFashion; here, we use 50,000 images from the category and attribute prediction benchmark of DeepFashion. Its probability matrix also shows similar color patterns to that of ModaNet, except that it has some large values along the diagonal, which means that for some colors their most compatible colors are themselves. We believe this diagonal pattern is likely an error caused by inaccurate segmentation: DeepFashion is not very clean and contains quite a lot of magazine images, photos of crowds and zoomed-in product photos, which sometimes confuse our segmentation model. In general, however, the probability matrices of the two datasets follow the conclusions we obtained from ModaNet, which further validates our insights into compatible color combinations in fashion.

Category-related probability matrix using ModaNet


(a) Probability matrix of top, outer color pairs

(b) Probability matrix of outer, skirt color pairs

(c) Probability matrix of top, skirt color pairs

(d) Probability matrix of outer, dress color pairs

Chapter 7

Discussion and conclusions

7.1 Discussion

This thesis focuses on automating color compatibility of garments and possibly making color recommendations. We follow the pipeline shown in Figure 4.1. First, we train Deeplab V2 on the ModaNet dataset, a large-scale and high-quality fashion dataset. Having adequate training images makes our semantic segmentation model outperform other proposed models trained on the currently common fashion datasets. Secondly, based on the segmentation maps, we employ K-means to extract the dominant color of each garment. K-means performs well when the clothing item has a simple, pure pattern, but it sometimes fails on garments with complex patterns. Thirdly, given the color palettes, we first propose matrix factorization to predict the color values of the missing garments in order to make recommendations. However, this method has a relatively high loss because of the sparsity of the data and the inner connections among the R, G, B values of one clothing item; it turns out that using matrix factorization to predict ratings is different from using it to directly predict color values, where the values depend on each other. Alternatively, we propose item-to-item collaborative filtering to learn color compatibility quantitatively. Our recommendation model makes visually compatible color recommendations, and the results are evaluated with a hit-rate of 0.49, indicating that almost half of the hidden colors are successfully recommended by the model.

For future work, one step is to improve the performance of the semantic segmentation model, which is the basis for extracting correct colors. For example, we may pre-train the model on a larger dataset, such as DeepFashion, and use the pre-trained weights as the initialization of the parameters. Also, finding a proper parameter configuration for the CRF is crucial to improve the mIoU. Another step is to take the colors of skin and hair into consideration when making color recommendations for garments: we could deploy semantic segmentation models trained on skin and hair datasets and extract their colors, contributing to more convincing color recommendations.

7.2 Conclusions

Bibliography

[1] Shuai Zheng et al. “ModaNet: A Large-Scale Street Fashion Dataset with Polygon Annotations”. In: CoRR abs/1807.01394 (2018).

[2] Wei-Lin Hsiao and Kristen Grauman. “Creating Capsule Wardrobes from Fashion Images”. In: CoRR abs/1712.02662 (2017).

[3] Yihui Ma et al. "Towards Better Understanding the Clothing Fashion Styles: A Multimodal Deep Learning Approach". In: AAAI. 2017.

[4] Yannis Kalantidis, Lyndon Kennedy, and Li-Jia Li. "Getting the Look: Clothing Recognition and Segmentation for Automatic Product Suggestions in Everyday Photos". In: Proceedings of the 3rd ACM Conference on International Conference on Multimedia Retrieval. ICMR '13. ACM, 2013, pp. 105–112.

[5] Yuhang He, Lu Yang, and Long Chen. "Real-Time Fashion-Guided Clothing Semantic Parsing: A Lightweight Multi-Scale Inception Neural Network and Benchmark". In: AAAI Workshops. 2017.

[6] Pongsate Tangseng, Zhipeng Wu, and Kota Yamaguchi. “Looking at Outfit to Parse Clothing”. In: CoRR abs/1703.01386 (2017).

[7] Qin Zou et al. "Who Leads the Clothing Fashion: Style, Color, or Texture? A Computational Study". In: CoRR abs/1608.07444 (2016).

[8] R. E. Uhrig. "Introduction to artificial neural networks". In: Proceedings of IECON '95 - 21st Annual Conference on IEEE Industrial Electronics. Vol. 1. Nov. 1995, pp. 33–37.

[9] M.W. Gardner and Stephen Dorling. "Artificial Neural Networks (The Multilayer Perceptron) – A Review of Applications in the Atmospheric Sciences". In: Atmospheric Environment 32 (Aug. 1998).

[10] Bing Xu et al. "Empirical Evaluation of Rectified Activations in Convolutional Network". In: CoRR abs/1505.00853 (2015).


[11] David E. Rumelhart, Geoffrey E. Hinton, and Ronald J. Williams. "Neurocomputing: Foundations of Research". Ed. by James A. Anderson and Edward Rosenfeld. MIT Press, 1988. Chap. Learning Representations by Back-propagating Errors, pp. 696–699.

[12] Léon Bottou. "Large-scale machine learning with stochastic gradient descent". In: COMPSTAT. 2010.

[13] How Neural Networks Work. https://chatbotslife.com/how-neural-networks-work-ff4c7ad371f7/. Accessed May 15, 2019.

[14] Alex Krizhevsky, Ilya Sutskever, and Geoffrey E. Hinton. "ImageNet Classification with Deep Convolutional Neural Networks". In: Commun. ACM 60.6 (May 2017), pp. 84–90.

[15] Keiron O'Shea and Ryan Nash. "An Introduction to Convolutional Neural Networks". In: CoRR abs/1511.08458 (2015).

[16] Jiuxiang Gu et al. “Recent Advances in Convolutional Neural Networks”. In: CoRR abs/1512.07108 (2015).

[17] Sumit Saha. A Comprehensive Guide to Convolutional Neural Networks. https://towardsdatascience.com/a-comprehensive-guide-to-convolutional-neural-networks-the-eli5-way-3bd2b1164a53/. Accessed May 20, 2019.

[18] Harsh Singhal. Convolutional Neural Network with TensorFlow implementation. https://medium.com/data-science-group-iitr/building-a-convolutional-neural-network-in-python-with-tensorflow-d251c3ca8117/. Accessed May 3, 2019.

[19] Yunhong Zhou et al. "Large-Scale Parallel Collaborative Filtering for the Netflix Prize". In: Proc. 4th Int'l Conf. Algorithmic Aspects in Information and Management, LNCS 5034. Springer, 2008, pp. 337–348.

[20] Yehuda Koren, Robert Bell, and Chris Volinsky. "Matrix Factorization Techniques for Recommender Systems". In: Computer 42.8 (Aug. 2009), pp. 30–37.

[21] Si Liu et al. "Fashion Parsing With Weak Color-Category Labels". In: IEEE Transactions on Multimedia 16 (2014), pp. 253–265.


[23] Peter O'Donovan, Aseem Agarwala, and Aaron Hertzmann. "Collaborative Filtering of Color Aesthetics". In: Proc. Computational Aesthetics (CAe). 2014.

[24] Peter O'Donovan, Aseem Agarwala, and Aaron Hertzmann. "Color Compatibility From Large Datasets". In: ACM Trans. Graph. 30.4 (2011).

[25] Huy Q. Phan, Hongbo Fu, and Antoni B. Chan. "Color Orchestra: Ordering Color Palettes for Interpolation and Prediction". In: CoRR abs/1703.06003 (2017).

[26] Naoki Kita and Kazunori Miyata. "Aesthetic Rating and Color Suggestion for Color Palettes". In: Comput. Graph. Forum 35 (2016), pp. 127–136.

[27] Liang-Chieh Chen et al. “DeepLab: Semantic Image Segmentation with Deep Convolutional Nets, Atrous Convolution, and Fully Connected CRFs”. In: CoRR abs/1606.00915 (2016).

[28] J. Long, E. Shelhamer, and T. Darrell. "Fully convolutional networks for semantic segmentation". In: 2015 IEEE Conference on Computer Vision and Pattern Recognition (CVPR). June 2015, pp. 3431–3440.

[29] Philipp Krähenbühl and Vladlen Koltun. "Efficient Inference in Fully Connected CRFs with Gaussian Edge Potentials". In: CoRR abs/1210.5644 (2012).

[30] What is CIE Lab Color Space? http://colorapplications.com/2018/09/14/17/03/30/241/color-measurement/cas_km/what-is-cie-lab-color-space/. Accessed April 30, 2019.

[31] Soumya Ghosh. Simple Matrix Factorization example on the Movielens dataset using Pyspark. https://medium.com/@connectwithghosh/simple-matrix-factorization-example-on-the-movielens-dataset-using-pyspark-9b7e3f567536/. Accessed March 10, 2019.

[32] Yifan Hu, Yehuda Koren, and Chris Volinsky. "Collaborative Filtering for Implicit Feedback Datasets". In: Proceedings of the 2008 Eighth IEEE International Conference on Data Mining. 2008, pp. 263–272.

[33] Badrul Sarwar et al. "Item-based Collaborative Filtering Recommendation Algorithms". In: Proceedings of the 10th International Conference on World Wide Web. WWW '01. Hong Kong: ACM, 2001, pp. 285–295.

[34] G. Linden, B. Smith, and J. York. “Amazon.com recommendations: item-to-item collaborative filtering”. In: IEEE Internet Computing 7.1 (Jan. 2003), pp. 76–80.

[35] RGB Color Codes Chart. https://www.rapidtables.com/web/color/RGB_Color.html/. Accessed April 12, 2019.

[36] Kota Yamaguchi, M. Hadi Kiapour, and Tamara L. Berg. "Paper Doll Parsing: Retrieving Similar Styles to Parse Clothing Items". In: Proceedings of the 2013 IEEE International Conference on Computer Vision. ICCV '13. IEEE Computer Society, 2013, pp. 3519–3526.

[37] Kota Yamaguchi. "Parsing Clothing in Fashion Photographs". In: Proceedings of the 2012 IEEE Conference on Computer Vision and Pattern Recognition (CVPR). CVPR '12. IEEE Computer Society, 2012, pp. 3570–3577.

[38] Tsung-Yi Lin et al. “Microsoft COCO: Common Objects in Context”. In: CoRR abs/1405.0312 (2014).

[39] Kaiming He et al. “Deep Residual Learning for Image Recognition”. In: CoRR abs/1512.03385 (2015).

[40] Liang-Chieh Chen et al. "Encoder-Decoder with Atrous Separable Convolution for Semantic Image Segmentation". In: CoRR abs/1802.02611 (2018).

[41] François Chollet. “Xception: Deep Learning with Depthwise Separable Convolutions”. In: CoRR abs/1610.02357 (2016).

[42] Wei Ji et al. "Semantic Locality-aware Deformable Network for Clothing Segmentation". In: Proceedings of the 27th International Joint Conference on Artificial Intelligence. AAAI Press, 2018, pp. 764–770.

[43] John Martinsson and Olof Mogren. "Semantic segmentation of fashion images using feature pyramid networks". In: ICCV. To be published.

[44] Mukund Deshpande and George Karypis. "Item-based top-N Recommendation Algorithms". In: ACM Trans. Inf. Syst. 22.1 (Jan. 2004), pp. 143–177.


[46] George Karypis. “Evaluation of Item-Based Top-N Recommendation Algorithms”. In: CIKM. 2001.

[47] Z. Liu et al. "DeepFashion: Powering Robust Clothes Recognition and Retrieval with Rich Annotations". In: 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR). 2016, pp. 1096–1104.

TRITA-EECS-EX-2019:531
