
DEGREE PROJECT IN COMPUTER SCIENCE AND ENGINEERING, SECOND CYCLE, 30 CREDITS
STOCKHOLM, SWEDEN 2020

Semantic Image Segmentation on Clothing Imagery with Deep Neural Networks

HELENA ALINDER

KTH ROYAL INSTITUTE OF TECHNOLOGY


Semantic Image Segmentation on Clothing Imagery with Deep Neural Networks

HELENA ALINDER

Master in Computer Science
Date: June 29, 2020

Supervisor: Hamid Reza Faragardi
Examiner: Erik Fransén

School of Electrical Engineering and Computer Science
Host company: Sellpy (Sellhelp AB)

Swedish title: Semantisk bildsegmentering på bilder av kläder med djupa neurala nätverk


Abstract

Semantic Image Segmentation is a field within machine learning and computer vision, where the goal is to link each pixel in an image with a label. A successful segmentation will label all pixels that belong to an object with the correct label, and this prediction can be measured with a score known as mean Intersection over Union (mIoU).

In a selling process for second-hand clothes, the clothes are placed on a mannequin and then photographed and post-processed. The post-processing algorithm attempts to remove the pole of the mannequin and crop out the mannequin itself to create a clear background. The algorithm uses traditional computer vision and requires specific lighting and position settings, and if these settings are faulty the algorithm performs badly. This thesis investigates how to conduct Semantic Image Segmentation with Deep Neural Networks for removing the pole and cropping out the mannequin, and whether the networks perform better than the traditional algorithm on images with bad lighting.

Two deep neural networks were investigated: DeepLabv3+ and Gated-Shape CNN. The models’ performances were measured by their mIoU score and evaluated on a regular clothing dataset and an augmented clothing dataset, consisting of images that the traditional algorithm had problems with segmenting.

The conclusion of the thesis is that DeepLabv3+ performs better than Gated-Shape CNN on regular clothing imagery, reaching an overall mIoU of 91.81%, and the overall performances of the networks on regular clothing imagery are statistically significantly different. DeepLabv3+ also performs better than the traditional algorithm when segmenting augmented clothing imagery, images that the traditional algorithm had problems with segmenting, and the overall performances are statistically significantly different. There is no statistically significant difference between the overall performance of DeepLabv3+ and GSCNN and the overall performance of GSCNN and the traditional algorithm when segmenting augmented images.


Sammanfattning (Swedish Abstract)

Semantic image segmentation is a field within machine learning and data analysis where the goal is to link each pixel in an image to a class. A successful segmentation gives every pixel that belongs to an object the same, correct class, and the predicted segmentation can be measured with a metric called mean Intersection over Union (mIoU).

A selling process for second-hand clothes includes placing the clothes on a mannequin, photographing them and post-processing the photos. The algorithm that handles the post-processing attempts to remove the pole the mannequin is placed on and crop out the mannequin to create a clean background. The algorithm uses traditional image analysis and requires specific lighting and placement settings; otherwise it has difficulty producing good segmentations. This study investigates how semantic image segmentation can be performed with deep neural networks to remove the pole and crop out the mannequin, and it also examines whether the neural networks achieve better results than the traditional algorithm on images with bad lighting settings.

Two deep neural networks were investigated: DeepLabv3+ and Gated-Shape CNN. The networks’ performance was measured by their mIoU, and they were evaluated on one dataset of regular clothing images and one of augmented clothing images, images that the traditional algorithm segments badly.

The conclusion of the study is that DeepLabv3+ performs better than Gated-Shape CNN on regular clothing images, reaching an mIoU of 91.81%, and the difference between their results is statistically significant. DeepLabv3+ also achieves better results than the traditional algorithm when segmenting augmented images, images that the traditional algorithm had problems with segmenting, and the difference between their results is statistically significant. There is no statistically significant difference between the results of DeepLabv3+ and GSCNN, or between the results of GSCNN and the traditional algorithm, when segmenting augmented images.


Contents

1 Introduction
  1.1 Problem Background
  1.2 Objective
  1.3 Research Questions
  1.4 Research Methodology
  1.5 Limitations
  1.6 Thesis Outline

2 Background
  2.1 Semantic Image Segmentation
    2.1.1 Mean Intersection over Union
    2.1.2 Overall Pixel and Per-Class
  2.2 Convolutional Neural Network
    2.2.1 Convolutional Layer
    2.2.2 Pooling Layer
    2.2.3 Conditional Random Field
    2.2.4 Atrous Convolution
    2.2.5 Gated Convolutional Layer
  2.3 Transfer Learning
  2.4 Data Augmentation
  2.5 Traditional Computer Vision Algorithm

3 Related Work
  3.1 DeepLab
    3.1.1 DeepLabv1
    3.1.2 DeepLabv2
    3.1.3 DeepLabv3
    3.1.4 DeepLabv3+
  3.2 Gated-Shape CNN
  3.3 Similar Tasks with Clothing Imagery
    3.3.1 DeepFashion2
    3.3.2 Pedestrian Analysis
    3.3.3 Relevant Product Suggestions

4 Method
  4.1 Data
    4.1.1 Cityscapes dataset
    4.1.2 Sellpy dataset
    4.1.3 Augmented Sellpy Dataset
  4.2 DeepLabv3+
    4.2.1 Hyperparameter Optimization
    4.2.2 Loss Weight
  4.3 Gated-Shape CNN
    4.3.1 Hyperparameter Optimization
  4.4 Evaluation of Results

5 Results
  5.1 Hyperparameter Optimization
    5.1.1 DeepLabv3+
    5.1.2 GSCNN
  5.2 Performance on Regular Images
    5.2.1 DeepLabv3+
    5.2.2 GSCNN
    5.2.3 ANOVA Test
    5.2.4 Summary
  5.3 Performance on Augmented Images
    5.3.1 ANOVA Test
    5.3.2 Summary

6 Discussion
  6.1 Model Comparison
    6.1.1 Regular Dataset
    6.1.2 Augmented Dataset
  6.2 Methodology
    6.2.1 Hyperparameter Optimization
    6.2.2 Data
    6.2.3 Evaluation
  6.3 Validity of Results

7 Conclusion
  7.1 Future Research


Chapter 1

Introduction

The clothing industry plays a large part in environmental change, ranking fourth in terms of its impact [1] by causing 10% of the global carbon emissions and 20% of the global wastewater [2]. Its impact is calculated over the whole life cycle of clothing: the extraction of raw materials, the creation of the garments, the selling of the clothing, its usage and lastly its discarding. One efficient way to minimize the carbon emissions and wastewater from the clothing industry is to prolong the life of clothes and buy clothes second-hand instead of newly produced ones [1].

1.1 Problem Background

Sellpy¹ is a retail company that sells used goods online and handles the whole advertisement process for its users, thus minimizing the effort it takes for people to sell their used goods. The goods can be anything from clothes to electronics; however, the absolute majority of the sold items are clothes.

Two steps in Sellpy’s selling process are the description and photographing of items. When photographing clothes, the clothing is put on a rotatable mannequin that is placed on a pole and photographed from three different angles. Afterwards, the images are post-processed with an algorithm that has two main steps: one, the mannequin with clothing is correctly rotated if the pole the mannequin is placed on is slanted, and two, the mannequin with clothing and the pole are cropped out to create a clean image with a clear background. Since the clothes are second-hand, their worth might be considerably lower than when they were first bought, causing Sellpy to earn little for each garment. Because of this, it is important to streamline the process of selling the clothes.

¹ https://www.sellpy.se/

Computer vision is a constantly developing research area where image segmentation has proven useful in important fields such as medicine, e.g. locating tumours [3], and object detection, e.g. car systems that detect pedestrians [4]. Semantic image segmentation is the process of detecting specific objects in images by labelling each pixel. The most used technique for image segmentation today is based on artificial intelligence and deep neural networks. Deep neural networks have shown to be very successful in image segmentation in ways that only a decade ago would have seemed impossible [5]. The concept of image segmentation is also used in commercial fields, such as analysing what clothing a person is wearing in fashion photographs [6] [7].

This thesis aims to investigate if deep neural networks can be used to perform image segmentation on clothing imagery, where the goal is to segment three objects in the images: the background, the mannequin, and the mannequin’s pole. Figure 1.1 visualizes which objects in an image should be segmented.

Figure 1.1: The three objects to segment: background, mannequin and pole.

1.2 Objective

The process of cropping out the mannequin with clothes and the pole the mannequin is placed on is an important step in Sellpy’s selling process. The current solution uses traditional computer vision and is considered quite naive, since it requires specific photograph settings to perform correct segmentation. When these settings change, the algorithm segments the image incorrectly, causing the advertisement photographs to become bad. One example of this bad performance is when the image is too bright and contains light clothing, causing the algorithm to have difficulties understanding where the contours of the mannequin are.

The objective of this thesis is to investigate two deep neural networks, determine which performs semantic image segmentation on clothing imagery best, and examine whether they perform better than Sellpy’s current algorithm, which uses traditional computer vision to segment images. The performance of the networks is evaluated with the metric mean Intersection over Union (mIoU), which describes how well the networks’ predicted segmentations match the ground truth masks’ position and coverage. The metric is described in more detail in section 2.1.1.

1.3 Research Questions

The research questions that are investigated in this thesis are:

• What mean Intersection over Union do DeepLabv3+ and GSCNN achieve when conducting Semantic Image Segmentation on a dataset consisting of clothing imagery?

• What mean Intersection over Union do DeepLabv3+ and GSCNN achieve compared to a traditional computer vision algorithm when segmenting bright images?

1.4 Research Methodology

The research methodology for this thesis was originally suggested by the host company and was then adapted according to the research methodology presented by Walliman [8]. Figure 1.2 visualizes each step of the thesis’ work. The research problem was identified and the objective of the thesis was defined. A survey of related work was conducted for a clear overview of the research area, followed by a formulation of the research questions. After formulating the research questions, a thorough literature study was conducted to establish a deeper understanding of the research area and to decide which deep neural networks to compare. The literature study was followed by the implementation of the networks and the collection and analysis of the experiment data. If needed, the networks were modified and new data collected. The final results were analyzed and discussed, allowing conclusions to be drawn.


Figure 1.2: Research methodology adapted from Walliman [8]

1.5 Limitations

The purpose of this thesis is to investigate if deep neural networks can be used to perform image segmentation on clothing imagery. It is a long-term goal for Sellpy to be able to use potential findings from this thesis, but the scope of the thesis does not include implementing functionality in Sellpy’s selling process.

The deep neural networks will be trained on data created from the algorithm which uses traditional computer vision. Thus, the goal of the thesis is to investigate if it is possible to use deep neural networks to conduct semantic image segmentation on regular clothing imagery, and not to investigate whether the deep neural networks perform better than the traditional computer vision algorithm on regular clothing imagery. The comparison with the traditional algorithm is done on edited images, images that the algorithm is known for having problems with segmenting, and it is evaluated whether the deep neural networks can segment these images better.

Deep learning within image segmentation is a constantly changing field that is not as mature as traditional computer vision. It is therefore important that State-of-the-Art methods are compared with newer ones, with the goal of either further establishing the conventional methods or strengthening newer ones. In this thesis, a State-of-the-Art deep neural network is compared to a recently built deep neural network, and it is investigated which has the best performance when conducting image segmentation on different types of clothing imagery.


With deep learning within image segmentation being in quite a new state, the research field of commercial clothing imagery segmentation can be considered immature as well. However, the interest in image segmentation on clothing imagery is growing, because the fashion industry sees value in, for instance, predicting fashion trends or recommending the right products. The fashion industry has a crucial impact on the environmental footprint, and the conclusions of this thesis might contribute to lowering carbon emissions and wastewater, since they might improve the possibility and success of selling second-hand clothing online.

1.6 Thesis Outline

This report is organized in seven chapters. Chapter 2 presents the theoretical background needed for an understanding of semantic image segmentation. Chapter 3 presents previous work related to image segmentation as well as the two deep neural networks investigated in the thesis. Chapter 4 describes the methods used in the thesis. Chapter 5 presents the results of the conducted method. Chapter 6 discusses the results and methods as well as potential future research. Chapter 7 summarizes the findings of the thesis and provides answers to the research questions.


Chapter 2

Background

This chapter provides a theoretical background for the thesis. It presents the subject of Semantic Image Segmentation and how it can be evaluated through the score mean Intersection over Union, as well as describes Convolutional Neural Networks and how they are used in semantic image segmentation. Lastly, it describes how deep neural networks can be effectively trained despite their complexity and how augmented data can be used to generalize the networks.

2.1 Semantic Image Segmentation

Semantic image segmentation is a field within machine learning and computer vision concerned with the process of linking each pixel in an image with a label, so-called pixel-wise semantic segmentation. A successful segmentation labels the correct pixels with the corresponding object label, for instance labelling all pixels belonging to a car in an image with the same label. Classic image segmentation has been done with machine learning and traditional computer vision, but deep learning techniques have in recent years proven to outperform these traditional methods both in accuracy, how correct the predicted segmentation is, and sometimes efficiency, how fast the segmentation is predicted. One of the best performing deep architectures is the Convolutional Neural Network [5].

An extension of pixel-wise semantic segmentation is the task of instance segmentation, where the goal is to find each instance of objects of the same class, e.g. several cars in an image [5]. Semantic segmentation and instance segmentation may seem quite related, but instance segmentation is more difficult than semantic segmentation for several reasons, one of the most important ones being that each object class can have an arbitrary number of instances and that it does not matter which instance gets which label, as long as each pixel in an instance gets the same label [9].

This thesis investigates deep neural networks for semantic image segmentation, because the dataset used for training and evaluation only has single instances of each label, rendering instance segmentation unnecessary for this use case.

2.1.1 Mean Intersection over Union

When evaluating the accuracy of semantic image segmentation, the mean Intersection over Union (mIoU) metric is commonly used to describe performance. Intersection over Union describes the ratio between the area of overlap of the segmentation and the ground truth and the area of their union, and serves as a standardized way of measuring the performance of a segmentation algorithm [5]. Scoring an mIoU of 1.0 would mean a perfect segmentation; thus, a high mIoU is considered a good segmentation, given that the model is not overfitted to specific data. Typically, when training and evaluating segmentation models, the segmented areas, also known as masks, are represented by coordinate arrays where each vertex in the mask/polygon is a pixel in the image.

Intersection over Union = Area of Overlap / Area of Union

Figure 2.1: Illustration of the denominator and the numerator in the Intersection over Union
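To make the metric concrete, the following is a minimal NumPy sketch (the function names are illustrative, not from the thesis) that computes the per-class IoU and averages it into mIoU, assuming the prediction and ground truth are 2-D arrays of integer labels rather than the polygon representation described above:

```python
import numpy as np

def iou(pred: np.ndarray, truth: np.ndarray, label: int) -> float:
    """IoU for one class: area of overlap divided by area of union."""
    pred_mask = pred == label
    truth_mask = truth == label
    intersection = np.logical_and(pred_mask, truth_mask).sum()
    union = np.logical_or(pred_mask, truth_mask).sum()
    return intersection / union if union > 0 else float("nan")

def mean_iou(pred: np.ndarray, truth: np.ndarray, labels) -> float:
    """mIoU: the per-class IoU averaged over all labels."""
    return float(np.nanmean([iou(pred, truth, l) for l in labels]))
```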

2.1.2 Overall Pixel and Per-Class

Two other metrics that can be used to measure the performance of a segmentation algorithm are Overall Pixel accuracy and Per-Class accuracy. Overall Pixel accuracy measures the number of pixels that are correctly labelled, which causes biased measurements when a dataset consists of imbalanced classes. As an example, if a background label takes up 75% of an image and a segmentation algorithm predicts the whole image as background, the Overall Pixel accuracy would be 75% even though no other class was predicted. Per-Class accuracy measures the number of correctly labelled pixels for each class and averages over all existing classes. This causes biased measurements for datasets that have a large background class, since all incorrect pixels would only affect the accuracy of the background label and not the other object label accuracies [10].

2.2 Convolutional Neural Network

A Convolutional Neural Network (CNN) is a type of artificial neural network that has proven powerful in computer vision tasks, having higher accuracy and in many cases higher efficiency compared to traditional computer vision techniques [5]. CNNs are built from different components, the three most common ones being convolutional layers, pooling layers, and fully connected layers. The convolutional and pooling layers’ task is to extract features, while the fully connected layers’ task is to map features to the final output. Conditional Random Fields and Atrous Convolution are two ways of giving a CNN a better understanding of the context of an image, and Gated Convolutional Layers include a gating mechanism for propagating information through layers.

2.2.1 Convolutional Layer

CNNs’ high performance in computer vision depends on how they extract features, which is done through a linear operation called convolution. In image processing, the linear operation analyzes each pixel of the image to extract features, and since features in images can appear anywhere, convolution is a very efficient way of analyzing images [11]. This thesis makes use of CNNs when conducting semantic image segmentation, and thus all examples from here on will be of images and how CNNs can be used to process them.

Figure 2.2 visualizes the linear operation convolution, where the input tensor is an image and the kernel is a matrix of weights. The input tensor and the kernel are multiplied element-wise, and the resulting matrix is summed, giving the value of the corresponding position in the feature map [11].

Figure 2.2: Convolution: a linear operation for feature extraction (input tensor, kernel, and resulting feature map)
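A minimal sketch of the operation in figure 2.2, assuming a single-channel input, no padding and stride 1; real CNN layers add channels, bias terms and learned kernels, but the sliding multiply-and-sum is the same:

```python
import numpy as np

def convolve2d(image: np.ndarray, kernel: np.ndarray) -> np.ndarray:
    """'Valid' convolution as used in CNNs: slide the kernel over the
    input, multiply element-wise and sum into the feature map."""
    kh, kw = kernel.shape
    out_h = image.shape[0] - kh + 1
    out_w = image.shape[1] - kw + 1
    feature_map = np.empty((out_h, out_w))
    for i in range(out_h):
        for j in range(out_w):
            window = image[i:i + kh, j:j + kw]
            feature_map[i, j] = (window * kernel).sum()
    return feature_map
```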

2.2.2 Pooling Layer

Convolutional layers extract the feature maps of an image by analyzing each pixel, which causes the feature map to be dependent on the dimensionality and structure of the image. A pooling layer’s task is to downsample the feature map so that all the important features are kept but the feature map gets a reduced in-plane dimensionality. The downsampling also decreases the parameters of the network and thus its computational complexity [11] [12]. However, when conducting semantic image segmentation it is preferable to be able to do accurate pixel-wise segmentation, which becomes difficult with pooling layers since they decrease localization, i.e. exactly where an object is located [13]. Two ways of giving a CNN a better understanding of the context of images are using atrous convolution or adding Conditional Random Fields (CRF) as post-processing to refine the segmentation results [5].
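For illustration, a 2x2 max pooling sketch: each output value keeps only the strongest activation in its window, which halves the in-plane dimensions and discards exact positions, the localization loss described above.

```python
import numpy as np

def max_pool(feature_map: np.ndarray, size: int = 2, stride: int = 2) -> np.ndarray:
    """Max pooling: keep the maximum activation in each window."""
    out_h = (feature_map.shape[0] - size) // stride + 1
    out_w = (feature_map.shape[1] - size) // stride + 1
    out = np.empty((out_h, out_w))
    for i in range(out_h):
        for j in range(out_w):
            window = feature_map[i * stride:i * stride + size,
                                 j * stride:j * stride + size]
            out[i, j] = window.max()
    return out
```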


2.2.3 Conditional Random Field

Conditional Random Fields (CRF) are a form of post-processing with the purpose of refining segmentation results by capturing fine-grained details in an image [5]. CRFs have been used for a long period of time; they refine segmentation results by combining the per-pixel class scores with low-level information, such as pixel interactions, thus allowing the pixels’ context in the image to be taken into account [14].

2.2.4 Atrous Convolution

Atrous convolution is a technique to reduce the pooling factors in neural networks [13]. It is widely known as dilated convolution, but will in this thesis be called atrous for consistency with other papers. By using atrous convolution it is possible to exponentially expand the receptive field of a neural network while keeping the resolution intact [15]. Atrous convolution achieves this by expanding the area over which the kernel is applied, without changing the values of the kernel. Figure 2.3 shows atrous convolutions of rate 1, which is a regular convolution, and of rates 2 and 4 respectively. The white matrix is the input and the cyan matrix is the receptive field of the convolution, where the red dots represent the weights of the kernel and the cells without red dots are weights with value zero.

Figure 2.3: Atrous convolution as explained by Yu and Koltun [15]
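A sketch of how such a kernel is dilated, assuming the square kernels shown above: zeros are inserted between the true weights, so a 3x3 kernel at rate r covers a (2r+1)x(2r+1) area with the same nine trainable weights.

```python
import numpy as np

def dilate_kernel(kernel: np.ndarray, rate: int) -> np.ndarray:
    """Insert rate-1 zeros between kernel weights; rate 1 leaves the
    kernel unchanged (a regular convolution)."""
    if rate == 1:
        return kernel
    k_h, k_w = kernel.shape
    size_h = k_h + (k_h - 1) * (rate - 1)
    size_w = k_w + (k_w - 1) * (rate - 1)
    dilated = np.zeros((size_h, size_w), dtype=kernel.dtype)
    dilated[::rate, ::rate] = kernel  # true weights; the rest stay zero
    return dilated
```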

2.2.5 Gated Convolutional Layer

Gating mechanisms allow models to control how information flows in neural networks and have shown to be especially useful in Recurrent Neural Networks (RNN) [16]. In convolutional neural networks, output gates have proven useful for controlling what information should be propagated through layers. Dauphin et al. [16] have shown that this gating mechanism is successful within language modeling, since it allows models to select relevant words or features for predicting the upcoming word. Van den Oord et al. [17] have shown that a gating mechanism is also useful for conditional image generation, showing that the proposed Gated PixelCNN, consisting of gated convolutional layers, performs better than its predecessor PixelCNN both in terms of performance and convergence.
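A minimal sketch of the gating idea from Dauphin et al. [16]: two parallel convolutions over the same input produce features and gate logits (the convolutions themselves are omitted here for brevity), and the sigmoid gate decides element-wise how much of each feature is propagated.

```python
import numpy as np

def sigmoid(x: np.ndarray) -> np.ndarray:
    return 1.0 / (1.0 + np.exp(-x))

def gated_layer_output(features: np.ndarray, gate_logits: np.ndarray) -> np.ndarray:
    """Gated output: each feature passes through in proportion to its
    gate activation, giving the network control over information flow."""
    return features * sigmoid(gate_logits)
```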

2.3 Transfer Learning

Due to the complexity of deep neural networks it is not always feasible to conduct full training on them; instead one can use a model with pre-trained weights and fine-tune it with appropriate data [5]. This process is called transfer learning, and studies have shown that even if the pre-trained models have been trained for very different tasks or datasets, they can still be more appropriate to use than initializing the model with random weights [18].
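A hypothetical Keras sketch of the idea (not the setup used in this thesis; the head architecture and sizes here are invented): a backbone initialized with ImageNet weights is frozen, and only a small task-specific segmentation head is trained.

```python
import tensorflow as tf

# Start from ImageNet weights instead of random initialization and
# train only a small per-pixel classification head (3 classes).
base = tf.keras.applications.Xception(
    weights="imagenet", include_top=False, input_shape=(512, 512, 3))
base.trainable = False  # freeze the pre-trained feature extractor

model = tf.keras.Sequential([
    base,
    tf.keras.layers.Conv2D(3, 1, activation="softmax"),  # class scores
    tf.keras.layers.UpSampling2D(32),  # back to input resolution
])
model.compile(optimizer="adam", loss="sparse_categorical_crossentropy")
```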

2.4 Data Augmentation

In machine learning in general and deep learning in particular, data augmentation is the process of transforming training data in different ways to create better training data, with the goal of either speeding up convergence or minimizing the risk of overfitting and creating a more general model [5]. The traditional way to transform data within image processing is to change the geometry or color of images, for instance by rotating images, changing the color space, or cropping the image. These transformations are all affine transformations with regard to the original image, meaning that all points and lines in the image are kept intact in relation to each other [19].
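As a concrete example, a small OpenCV sketch of one such affine transformation, a random rotation around the image center (the angle range is chosen arbitrarily here):

```python
import cv2
import numpy as np

def random_rotation(image: np.ndarray) -> np.ndarray:
    """Affine augmentation: rotate a few degrees around the center;
    points and lines keep their relative relationships."""
    h, w = image.shape[:2]
    angle = np.random.uniform(-10, 10)
    matrix = cv2.getRotationMatrix2D((w / 2, h / 2), angle, 1.0)
    return cv2.warpAffine(image, matrix, (w, h))
```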


2.5 Traditional Computer Vision Algorithm

As mentioned in chapter 1, Sellpy has an algorithm that uses traditional computer vision for segmenting clothing imagery. The algorithm consists of two main steps: one, segment and remove the pole of the clothing mannequin, and two, segment and crop out the mannequin to create a clear background. The algorithm uses the Python library OpenCV and starts by altering the contrast of the image and adding a black and white filter. This is done so that OpenCV can more easily find all significant contours of the image. Which contours are significant is decided by Sellpy, and exact values are set in the code. When the pole contour has been found, the pixels within that contour are all set to white. The procedure of finding significant contours is then repeated, so that the mannequin can be centered on a white background.
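Sellpy's exact thresholds are internal, so the following sketch only illustrates the described pipeline with made-up values: raise the contrast, binarize, and keep contours large enough to count as significant. Filling the found pole contour with white would then be a single cv2.drawContours call with cv2.FILLED.

```python
import cv2
import numpy as np

def find_significant_contours(image: np.ndarray, min_area: float = 500.0):
    """Illustrative contour pipeline; thresholds here are invented."""
    gray = cv2.cvtColor(image, cv2.COLOR_BGR2GRAY)
    gray = cv2.convertScaleAbs(gray, alpha=1.5, beta=0)  # raise contrast
    _, binary = cv2.threshold(gray, 200, 255, cv2.THRESH_BINARY_INV)
    contours, _ = cv2.findContours(binary, cv2.RETR_EXTERNAL,
                                   cv2.CHAIN_APPROX_SIMPLE)
    return [c for c in contours if cv2.contourArea(c) > min_area]
```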


Chapter 3

Related Work

This chapter presents related work within semantic image segmentation in general and image processing applied to clothing imagery. First, two deep neural networks, DeepLab and Gated-Shape CNN, their development process, and their specific modules are presented. The two models were compared and evaluated in this thesis, and therefore an in-depth explanation of them is appropriate. Both models are convolutional neural networks (CNN), and section 2.2 contains an explanation of CNNs. Secondly, related work within image processing of clothing imagery, both with traditional computer vision and with deep learning models, is presented to provide a clear understanding of how and in what research fields image processing is used successfully.

3.1 DeepLab

DeepLab is a State-of-the-Art Deep Convolutional Neural Network (DCNN) for semantic image segmentation that has been regularly developed and improved since 2014 [14] [20] [21]. DeepLab, and its latest version DeepLabv3+, has been part of research in different fields within image segmentation, where the neural network has been tweaked and extended to best fit different datasets, for instance medical images of tumors [3] and images of pedestrians [22]. This section presents DeepLab's development and its latest and most successful version, which is the one investigated in this thesis.

3.1.1 DeepLabv1

In 2014, Liang-Chieh Chen, George Papandreou et al. [14] presented a new system called DeepLab, which combines DCNNs and probabilistic graphical models and is able to do pixel-level classification with State-of-the-Art accuracy. Chen et al. [14] show that plain DCNNs cannot perform pixel-level classification due to insufficient localization in the output of the final layer. This is caused by the pooling layers, see section 2.2.2, which are successful for high-level classification tasks, such as object detection, but have a localization trade-off. The presented solution is to combine the output of the final layer with a fully connected Conditional Random Field (CRF), explained in section 2.2.3, and to use atrous convolution at each layer to extract dense feature maps efficiently. Atrous convolution is further explained in section 2.2.4. With 71.6% mIoU on the PASCAL VOC-2012 [23] dataset, DeepLab outperformed the then State-of-the-Art models for semantic image segmentation by at least 7%.

3.1.2 DeepLabv2

Chen et al. [20] continued to develop DeepLab, presenting a new version in 2017. DeepLabv2 combines a DCNN and a fully connected CRF, but employs a different kind of atrous convolution called Atrous Spatial Pyramid Pooling (ASPP). ASPP consists of multiple parallel convolutional layers that utilize different dilations. Each convolutional layer creates different feature maps that are processed on separate branches before they are merged into one final feature map. These multiple parallel convolutional layers create multiple receptive fields and thus enable understanding of objects and image context at multiple scales. DeepLabv2 reaches a 79.7% mIoU on PASCAL VOC-2012 [23] and is thus a great improvement over DeepLabv1. DeepLabv2 is also, unlike DeepLabv1, evaluated on the Cityscapes dataset, scoring an mIoU of 70.4%.

3.1.3 DeepLabv3

DeepLabv3 builds further on DeepLabv2 and focuses on improving the Atrous Spatial Pyramid Pooling explained in section 3.1.2. Chen et al. [24] discovered that the true weights (not the zero weights added for dilation) of the kernel became smaller when the atrous rates of the convolutional layers became larger. As a comparison, if the sampled rate value were the same as or close to the feature map size, the kernel would act as a 1x1 filter, because only the central weight in the kernel would prove to be effective. The problem is solved by using image-level features and adding global average pooling on the last layer of the model. The new ASPP makes DeepLabv3 perform better than DeepLabv2 even though the CRF for dense feature extraction is removed, scoring an mIoU of 81.3% on the Cityscapes dataset [25].


3.1.4 DeepLabv3+

DeepLabv3+ is an extension of DeepLabv3 that adds an encoder-decoder structure to improve segmentation results in general, and along the boundaries of the segmented objects in particular, and uses the improved backbone model Xception. DeepLabv3 is used as an encoder and uses atrous convolution, defining the output stride as the ratio of the input image size to the size of the final encoded feature map. DeepLabv3 employs a low output stride of 16 to be able to extract denser feature maps, which is preferable in the task of semantic segmentation. The purpose of the decoder is to refine the segmentation along object boundaries [21]. Figure 3.1 represents the encoder-decoder structure and shows the concatenation of the upsampled features from the encoder with the convolved low-level features from the DCNN. After concatenation the result is refined with a 3x3 convolution and then lastly upsampled by a factor of 4. DeepLabv3+ set a new State-of-the-Art performance, scoring an mIoU of 82.1% on the Cityscapes dataset [25].

Figure 3.1: The encoder-decoder structure of DeepLabv3+ [21]

3.2 Gated-Shape CNN

Gated-Shape CNN (GSCNN) is a new CNN architecture presented in 2019 by Takikawa et al. [26]. State-of-the-Art models used for semantic image segmentation process color, shape, and texture together, even though these attributes represent quite different types of information. Takikawa et al. [26] instead propose an architecture that consists of two streams, one regular stream and one shape stream. The regular stream is a standard CNN, but the shape stream is specifically designed to only process shape information. Gated convolutional layers, which are explained in section 2.2.5, are used to pass high-level information from the regular stream to the shape stream, so that the shape stream can filter out relevant information. Relevant information in this context means information relevant to object boundaries. At the final step of the architecture there is a fusion module consisting of an ASPP, whose task is to combine the regular stream and the shape stream to refine the segmentation results, both in terms of mask predictions in general and boundary predictions in particular. The architecture has proven successful by achieving an mIoU about 2% higher than the State-of-the-Art DeepLabv3+ on the benchmark dataset Cityscapes, and it performed especially well on poles, something that the neural networks investigated in this thesis will also try to segment. The backbones used in the DeepLabv3+ models compared with GSCNN are ResNet-50, ResNet-101, and WideResNet. The research has been accepted by and published in the International Conference on Computer Vision 2019. The general concept of GSCNN is visualized in figure 3.2.

Both DeepLabv3+ and GSCNN propose new architecture modules with the goal of better segmenting object boundaries, but they do so in different ways. DeepLabv3+ makes use of a decoder that refines features by concatenating low-level features with features from the encoder. GSCNN, on the other hand, makes use of high-level features to filter out information that is irrelevant for predicting object boundaries, thus keeping only low-level boundary-relevant information in the shape stream.


3.3 Similar Tasks with Clothing Imagery

Most research within image segmentation with deep learning models is applied to datasets that consist of common objects, e.g. vehicles, furniture, or animals, in common environments, e.g. in cities or at home, but there is less research on image segmentation on clothing imagery. Research often tries to detect people in images, but more rarely is the goal to detect clothing specifically. However, there is some research within this area as well: section 3.3.1 describes a dataset consisting of fashion clothing only, while sections 3.3.2 and 3.3.3 describe two research papers where clothing type or clothing patterns help determine the predicted classification or segmentation.

3.3.1 DeepFashion2

The fashion industry sees great potential in fashion analysis, which has caused research within the field to grow during recent years [27]. DeepFashion [28] is a benchmark dataset presented in 2016 that accelerated overall fashion analysis research due to its rich annotations of clothing imagery. DeepFashion2 extends this benchmark with, among other things, more specific annotations and masks for segmentation. DeepFashion2 is a benchmark for the following tasks: clothing detection, pose estimation, clothing retrieval, and clothing segmentation. Along with the benchmark, the deep learning model Mask R-CNN is used to evaluate the dataset for the image segmentation task, scoring an average precision of 83.4% at an mIoU threshold of 75%.

DeepFashion2 was evaluated with one deep learning model, whereas this thesis presents a comparison of two. The DeepFashion2 dataset also consists of images of people with clothing in different environments, making the dataset different from the dataset from Sellpy, which consists of images of mannequins with clothing in one static environment.

3.3.2 Pedestrian Analysis

The ability to do pedestrian analysis is important for intelligent surveillance programs and other security systems that make use of computer vision, e.g. camera systems in cars. Liu et al. [4] present HydraPlus-Net, a deep neural network that shows great performance in terms of efficiency and generalization ability when recognizing pedestrians’ attributes and re-identifying people, even outperforming State-of-the-Art methods. HydraPlus-Net captures clothing types and attributes at a high level and clothing patterns at a low level, which contributes to the network’s ability to distinguish people with similar appearance from each other.

HydraPlus-Net is trained and evaluated on images of people in different environments, with the goal of reliable and efficient object detection and classification of people. Object detection and classification is a different task than the one presented in this thesis; however, the research on HydraPlus-Net provides an understanding of how and why clothes can be analyzed in different tasks with deep learning.

3.3.3 Relevant Product Suggestions

Kalantidis, Kennedy, and Li [29] present a three-step approach for suggesting relevant products based on everyday photos of clothing. First, a pose estimation is conducted to locate relevant parts of the image in terms of clothing. Secondly, clothes are segmented and clustered according to the type of clothing. Lastly, the segmented parts are classified according to the clusters. The performance is State-of-the-Art and 50 times faster than previous methods.

This approach makes use of traditional computer vision, specifically a greedy segmentation algorithm presented by Felzenszwalb and Huttenlocher [30] that creates a graph representation of an image and makes pairwise region comparisons. This thesis instead compares deep neural networks for image segmentation, since deep learning models have shown to achieve high mIoU when trained on large enough datasets.


Chapter 4

Method

This thesis investigates two deep neural networks for semantic image segmentation and compares their performance when segmenting mannequins with clothes. A literature study was conducted to decide which two deep neural networks to compare. This chapter starts with an explanation of what data was used in this thesis and how it was created. Afterwards, the implementations of the chosen deep neural networks DeepLab and Gated-Shape CNN are presented. DeepLab is a State-of-the-Art semantic image segmentation model that generally performs well on different kinds of datasets, and Gated-Shape CNN is a novel deep neural network for semantic image segmentation that performed well on the Cityscapes dataset. Lastly, a description of how the models were evaluated is presented.

4.1 Data

As mentioned in chapter 2, deep neural networks require high computational power when trained from scratch, and it is often preferable to start with pre-trained models and continue training them on task-specific data. The models used in this thesis were pre-trained on the Cityscapes dataset and then further trained on data from Sellpy's current algorithm for cropping images.

Sellpy has an algorithm for segmenting mannequins with clothes that makes use of traditional computer vision. The main problem with this algorithm is that it requires specific lighting and positioning settings for the photos, causing the algorithm to perform badly when an image is, for instance, too bright or rotated. The goal of this thesis is to investigate if deep neural networks can be trained on Sellpy data and perform better segmentation on erroneous photographs. A custom dataset of Sellpy images was created for the training of the deep neural networks. The dataset was created with the help of the existing algorithm that makes use of traditional computer vision, to be able to get a large amount of data, which would not have been possible if each image were segmented manually. To ensure valid segmentations by the current algorithm, only images of colorful clothes were chosen. A second dataset was also created, containing augmented images. These images had their gamma levels adjusted to simulate brighter images, which the old algorithm has problems with segmenting, and this augmented dataset was split into four test datasets so that a statistical test could be conducted. All images in the datasets were also resized to a smaller width and height to reduce the time needed to train the neural networks. These test datasets were used to evaluate the trained deep neural networks and find out if they can perform segmentation on erroneous images.

4.1.1 Cityscapes dataset

The Cityscapes dataset consists of images of urban scenes covering 8 groups with a total of 30 classes, which have annotations for both pixel-wise semantic segmentation and instance-wise semantic segmentation. The Cityscapes dataset has a total of 5000 fine annotations and 20 000 coarse annotations and is a widely used benchmark with more than 190 results from different models [25]. Due to the Cityscapes license, a more detailed description of the format of the dataset cannot be provided.

Figure 4.1: Coarse ground truth
Figure 4.2: Fine ground truth

4.1.2 Sellpy dataset

Both networks used models that were pre-trained on the Cityscapes dataset and then further trained, validated, and evaluated on data from Sellpy. The Sellpy dataset was created with the help of the existing naive algorithm that makes use of traditional computer vision to rotate and crop images of mannequins with clothes on. The algorithm originally only saved the final version of the image after the rotation and cropping operations, which required changes to the algorithm so that it could also save the correct contours to create a proper segmentation dataset. The Sellpy dataset consists of three classes: background, mannequin, and pole.

The initial goal was to create a dataset that had the same format as the Cityscapes dataset, because both DeepLabv3+ and GSCNN had existing scripts that converted Cityscapes to the correct format for the models. However, in the end the Sellpy dataset had two different structures. One imitated the Cityscapes format and was used for GSCNN, and the second imitated the Pascal VOC 2012 format and was used for DeepLabv3+. The reason for using the structure of Pascal VOC 2012 for DeepLabv3+ was simply that it was the easier structure to use when running the code for DeepLabv3+.

When creating the structure for the GSCNN model, a total of five files were created, among them a JSON file containing the coordinate arrays representing the masks in the images.

When creating the structure for DeepLabv3+, two images were saved: the original one and one 8-bit image containing a color palette that translates single-channel pixel values to RGB colors. Text files specifying which images should be used for training, validation, and testing were also saved.

The Sellpy dataset was split into three parts: training, validation, and evaluation. The total number of dataset images was divided into approximately 80% training data, 10% validation data, and 10% evaluation data, where the number of training images was 5134, the number of validation images was 627, and the number of evaluation images was 628. An additional test dataset was also created to be able to conduct a statistical test; it was split into four parts of approximately 175 images each.


4.1.3 Augmented Sellpy Dataset

To measure how the deep learning models performed compared to the existing algorithm, a dataset of augmented images was created. A total of 800 images of white clothing were collected and segmented with the old algorithm for ground truth masks, and then the images were edited by increasing the gamma to simulate a brighter photo. The edited images were once again segmented with the old algorithm to calculate its mIoU and then evaluated with DeepLabv3+ and GSCNN to compare mIoU. The 800 images were split equally over four smaller test datasets.
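The exact gamma transform is not spelled out in the thesis, so the following is one common way to do it with an OpenCV lookup table; the factor 2 matches the doubling described in section 4.4.

```python
import cv2
import numpy as np

def adjust_gamma(image: np.ndarray, gamma: float = 2.0) -> np.ndarray:
    """Brighten an 8-bit image by raising its gamma; gamma=2.0 simulates
    the over-bright photos the traditional algorithm struggles with."""
    table = (((np.arange(256) / 255.0) ** (1.0 / gamma)) * 255).astype("uint8")
    return cv2.LUT(image, table)
```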

4.2 DeepLabv3+

The DeepLab model was run and implemented with the open-source TensorFlow repository¹, the same one used by Chen et al. [21]. Amazon AWS was used to access computational power and GPUs, to be able to run DeepLab effectively. DeepLabv3+ used Xception_65 as the backbone, which was the backbone that achieved the best mIoU in Chen et al. [21], pre-trained on ImageNet, with the initial checkpoint being the one with the best weights for the Cityscapes benchmark. In Takikawa et al. [26], ResNet and WideResNet were used as backbones for DeepLab.

4.2.1 Hyperparameter Optimization

Since DeepLab is a State-of-the-Art model, the GitHub repository for the models includes extensive information on how to install the model, how to train it (both on existing datasets and custom datasets) and what hyperparameters to use. The FAQ of DeepLab explains that to get the best results when fine-tuning the batch norm of a pre-trained network on a custom dataset, one should use as large a training batch size as possible. With the concept of transfer learning (section 2.3) in mind, the network used a pre-trained model. The training batch sizes that were tested were 16, 18, and 20, since the computational resources for the model were limited, and the learning rate spanned {1×10⁻², 1×10⁻³, 1×10⁻⁴, 1×10⁻⁵}. The hyperparameter output stride was always set to 16, since it is the recommended output stride to use together with atrous rates {6, 12, 18} [21].

The optimal hyperparameters were found by a grid search; due to time constraints the optimization was performed on only the validation part of the Sellpy dataset.
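A sketch of such a grid search; train_and_validate is a hypothetical stand-in for training DeepLabv3+ with one configuration and returning its validation mIoU.

```python
import itertools

def grid_search(train_and_validate):
    """Try every (batch size, learning rate) pair and keep the best mIoU."""
    batch_sizes = [16, 18, 20]
    learning_rates = [1e-2, 1e-3, 1e-4, 1e-5]
    results = {}
    for batch_size, lr in itertools.product(batch_sizes, learning_rates):
        results[(batch_size, lr)] = train_and_validate(
            batch_size=batch_size, learning_rate=lr, output_stride=16)
    return max(results, key=results.get)
```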


4.2.2 Loss Weight

Each image in the Sellpy dataset is relatively similar to the others, since the pole and the mannequin with clothing have similar placement. However, semantic image segmentation is a classification problem at pixel level, which causes the dataset to become imbalanced, since the pole takes up a considerably smaller number of pixels than the background and the mannequin. In general the pole consists of relatively few pixels, which may cause the model to get a good mIoU and loss even if the pole is not predicted at all. Therefore, when training DeepLabv3+, loss weights were needed for each label, so that the network would not settle for an overall good loss while the pole label was badly predicted, thus making sure the network was trained without the imbalance between the labels skewing the predictions.
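A sketch of label-weighted cross-entropy in TensorFlow; the weight values below are invented for illustration, chosen only so that the rare pole class contributes more to the loss.

```python
import tensorflow as tf

# Hypothetical per-label weights: background, mannequin, pole.
class_weights = tf.constant([1.0, 1.5, 10.0])

def weighted_loss(labels, logits):
    """Cross-entropy where each pixel is scaled by its label's weight."""
    per_pixel = tf.nn.sparse_softmax_cross_entropy_with_logits(
        labels=labels, logits=logits)
    weights = tf.gather(class_weights, labels)
    return tf.reduce_mean(per_pixel * weights)
```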

4.3 Gated-Shape CNN

The GSCNN model was run and implemented with the open-source GSCNN repository², which is the official code used in Takikawa et al. [26]. Amazon AWS was used for this model as well, since GPUs were a prerequisite for training the network. GSCNN used WideResNet-38 as backbone. The repository also provides a checkpoint for the network pre-trained on Cityscapes, which was used as the initial checkpoint before further training on the Sellpy dataset.

4.3.1 Hyperparameter Optimization

The hyperparameters for GSCNN were optimized as well, although less extensive information was available for this model. The training batch sizes that were tested were 14, 16, and 18, and the learning rate spanned {1×10⁻², 1×10⁻³, 1×10⁻⁴, 1×10⁻⁵}. The hyperparameter output stride was set to 8, which was the output stride recommended by Takikawa et al. [26] together with atrous rates {12, 24, 36}.

The optimal hyperparameters for GSCNN were found by a grid search as well, and the search was likewise performed on only the validation part of the Sellpy dataset due to time constraints.


4.4 Evaluation of Results

A one-way ANOVA test was conducted to determine if the results of the models differ significantly. Four test datasets were created, with different images of clothing in each. The models were evaluated on each dataset, and the resulting mIoU scores were used as data for the ANOVA test, to be able to decide if there is any statistically significant difference between the values and what the effect size is. The ANOVA test was performed with significance level 0.05.

A second ANOVA test was conducted to decide how the models perform compared to the old algorithm when performing semantic image segmentation on augmented images. The old algorithm has had problems with segmenting images of lighter clothing, because the images are too bright and the way traditional computer vision is used in the algorithm causes it to have problems with identifying contours. To test how the models performed compared to the old algorithm, a smaller test dataset was created, where images of white clothing had been collected and then augmented to make the images brighter. This was done by editing the gamma level of each image to 2 times the current gamma level.
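The statistical test itself is a standard one-way ANOVA; a minimal SciPy sketch using the four per-dataset mIoU scores that appear later in table 5.5:

```python
from scipy import stats

# mIoU per test dataset (table 5.5)
deeplabv3plus = [0.9211, 0.9243, 0.9256, 0.9231]
gscnn = [0.8907, 0.8922, 0.8944, 0.8905]

f_statistic, p_value = stats.f_oneway(deeplabv3plus, gscnn)
reject_null = p_value < 0.05  # significance level used in the thesis
```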


Chapter 5

Results

This chapter presents the results that were collected when conducting the experiments described in chapter 4.

First, the experiments of the hyperparameter optimization of DeepLabv3+ and GSCNN are presented. The hyperparameter configurations included different batch sizes and learning rates, and the networks were trained on a smaller portion of the Sellpy dataset.

Secondly, the models were extensively trained with the optimal hyperparameters on the full Sellpy dataset and evaluated on four different test datasets by their mIoU. The resulting mIoU scores were used in an ANOVA test to decide if the results are statistically significant and what the effect size is.

Lastly, the models and the existing algorithm at Sellpy were evaluated on images that the existing algorithm was known to have problems with: bright images of clothing in light colors. The resulting mIoU scores were used in an ANOVA test as well.

5.1 Hyperparameter Optimization

5.1.1 DeepLabv3+

The batch size and learning rate were tested by a grid search where all permutations of batch size = {16, 18, 20} and learning rate = {1×10⁻², 1×10⁻³, 1×10⁻⁴, 1×10⁻⁵} were tested. Chen et al. [21] recommended batch size 18 for the pre-trained network, and that value was therefore used as a starting point, with the other batch sizes spanning higher and lower. The output stride was set to 16 and the corresponding atrous rates were {6, 12, 18}, as recommended by Chen et al. [21]. The best result was an mIoU of 0.8316 when training on regular images using batch size 18 and learning rate 1×10⁻³, as seen in table 5.1. Due to time constraints the hyperparameters were optimized for 26, 29, and 32 epochs respectively for each batch size. To decide whether the training time might have caused a lower mIoU for lower learning rates, one set of hyperparameters with a lower learning rate was trained for more epochs. After training with the hyperparameters batch size = 18, learning rate = 1×10⁻⁴ and output stride = 16 for 86 epochs, the mIoU had improved from 0.5837 to 0.8307 and was still improving, although at a considerably smaller rate than before. Since it required approximately three times longer training to achieve almost the same mIoU, the training was stopped; the time constraints would have rendered these hyperparameters unusable when training on the full training dataset.

Epochs  Batch size  Learning rate  Output stride  mIoU
26      16          1×10⁻²         16             0.8123
26      16          1×10⁻³         16             0.2087
26      16          1×10⁻⁴         16             0.5832
26      16          1×10⁻⁵         16             0.4019
29      18          1×10⁻²         16             0.8147
29      18          1×10⁻³         16             0.8316
29      18          1×10⁻⁴         16             0.5837
29      18          1×10⁻⁵         16             0.3438
32      20          1×10⁻²         16             0.8267
32      20          1×10⁻³         16             0.8093
32      20          1×10⁻⁴         16             0.5823
32      20          1×10⁻⁵         16             0.3414

Table 5.1: Hyperparameter optimization results for DeepLabv3+

5.1.2 GSCNN

A grid search was used to find the optimal hyperparameters for GSCNN as well, with the following configurations tested: batch size = {14, 16, 18} and learning rate = {1×10⁻², 1×10⁻³, 1×10⁻⁴, 1×10⁻⁵}. Takikawa et al. [26] recommended batch size 16 for the pre-trained network, and that value was therefore used as a starting point, with the other batch sizes spanning higher and lower. The output stride was set to 8 with corresponding atrous rates {12, 24, 36}, as recommended by Takikawa et al. [26]. The best result was an mIoU of 0.8841 when using batch size 16 and learning rate 1×10⁻³, as seen in table 5.2. The hyperparameters were optimized for 25 epochs each. The resulting mIoU scores were, in comparison to the results for DeepLabv3+, quite even across configurations, and therefore no further training was done for the second-best hyperparameters.


Batch size  Learning rate  Output stride  mIoU
14          1×10⁻²         8              0.8595
14          1×10⁻³         8              0.8791
14          1×10⁻⁴         8              0.8185
14          1×10⁻⁵         8              0.7483
16          1×10⁻²         8              0.8451
16          1×10⁻³         8              0.8841
16          1×10⁻⁴         8              0.8700
16          1×10⁻⁵         8              0.7880
18          1×10⁻²         8              0.8404
18          1×10⁻³         8              0.8739
18          1×10⁻⁴         8              0.7156
18          1×10⁻⁵         8              0.7399

Table 5.2: Hyperparameter optimization results for GSCNN

5.2 Performance on Regular Images

5.2.1 DeepLabv3+

DeepLabv3+ was trained with the optimal hyperparameters presented in section 5.1.1: the batch size set to 18, the learning rate set to 1×10⁻³ and the output stride set to 16. The network was trained for 60 epochs, stopping when the training had not improved for 10 epochs, and was then evaluated on the evaluation dataset consisting of 627 images. Table 5.3 shows the results of the training. The overall mIoU is 0.9181, and the network is most successful when predicting the background segmentation, scoring an mIoU of 0.9673, and least successful when predicting the pole, scoring an mIoU of 0.8684. Figure 5.1 shows an original image and figure 5.2 shows the prediction DeepLabv3+ made.

DeepLabv3+  Overall  Background  Mannequin  Pole
mIoU        0.9181   0.9673      0.9187     0.8684

Table 5.3: DeepLabv3+'s mIoU for each label when evaluating regular images


Figure 5.1: Original image
Figure 5.2: DeepLabv3+ prediction

5.2.2 GSCNN

GSCNN was trained with the optimal hyperparameters presented in section 5.1.2: the batch size set to 16, the learning rate set to 1×10⁻³ and the output stride set to 8. The model was trained for 70 epochs, stopping when the training had not improved for 10 epochs, and was then evaluated on the evaluation dataset consisting of 627 images. Table 5.4 shows the results of the training. The overall mIoU is 0.8897, and the model is most successful when predicting the background segmentation, scoring an mIoU of 0.9657, and least successful when predicting the pole, scoring an mIoU of 0.8112. Figure 5.3 shows an original image and figure 5.4 shows the prediction GSCNN made.

GSCNN  Overall  Background  Mannequin  Pole
mIoU   0.8897   0.9657      0.8921     0.8112

Table 5.4: GSCNN's mIoU for each label when evaluating regular images


Figure 5.3: Original image
Figure 5.4: GSCNN prediction

5.2.3 ANOVA test

The results were evaluated with a one-way ANOVA test; the null hypothesis was that there is no statistically significant difference between the results from DeepLabv3+ and GSCNN, tested at significance level 0.05. The models were evaluated on four smaller test datasets. The null hypothesis could be rejected at significance level 0.05, and thus there is a statistically significant difference between the models' results. The ANOVA test also yielded a large effect size of 9.84, which indicates that the difference between the models' results is large. The resulting mIoU scores used in the ANOVA test are shown in table 5.5.

Model       Test 1  Test 2  Test 3  Test 4
DeepLabv3+  0.9211  0.9243  0.9256  0.9231
GSCNN       0.8907  0.8922  0.8944  0.8905

Table 5.5: Mean IoU values for regular images for the ANOVA test


5.2.4 Summary

The final results of the deep neural networks' performance on the dataset consisting of regular clothing imagery showed that DeepLabv3+ outperforms GSCNN with an mIoU difference of 3.2%, and that their overall mIoU scores are statistically significantly different at p-level 0.05 with a large effect size of 9.84.

5.3 Performance on Augmented Images

DeepLabv3+, GSCNN, and the Sellpy algorithm were evaluated on four augmented test datasets, consisting of images of white clothing with a gamma level of 2 times the original. The Sellpy algorithm had an overall mIoU of 0.8642, whilst DeepLabv3+ and GSCNN had overall mIoUs of 0.8928 and 0.8834 respectively. The most successful algorithm for predicting the pole was the Sellpy algorithm, while GSCNN was most successful in predicting the background and the mannequin. However, GSCNN performed especially badly when predicting the pole, causing its overall mIoU to be lower than DeepLabv3+'s.

Algorithm    Overall   Background   Mannequin   Pole
Sellpy       0.8642    0.9283       0.7333      0.9328
DeepLabv3+   0.8928    0.9538       0.8443      0.8803
GSCNN        0.8834    0.9665       0.8675      0.8161

Table 5.6: Each algorithm’s mIoU for each label when evaluating augmented images
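The augmentation itself is a per-pixel gamma adjustment. The exact transform used to build the augmented dataset is not reproduced in this work; a common formulation, assuming 8-bit RGB input, is sketched below. An exponent of 1/gamma with gamma = 2 brightens the image, mimicking the overexposed conditions the Sellpy algorithm struggles with.

    import numpy as np

    def adjust_gamma(image: np.ndarray, gamma: float = 2.0) -> np.ndarray:
        # Normalize to [0, 1], apply the power-law transform, rescale to 8 bits.
        normalized = image.astype(np.float32) / 255.0
        corrected = np.power(normalized, 1.0 / gamma)
        return (corrected * 255.0).astype(np.uint8)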

5.3.1 ANOVA Test

The results were evaluated with a one-way ANOVA test. The null hypothesis was that there is no statistically significant difference between the results of DeepLabv3+, GSCNN and the Sellpy algorithm, tested at significance level 0.05. The models were evaluated on four smaller test datasets. The null hypothesis could be rejected at significance level 0.05, indicating that there is a statistically significant difference between the models' results. More specifically, the test showed that the means of DeepLabv3+ and the Sellpy algorithm are statistically significantly different, and that there was no statistically significant difference between the results of DeepLabv3+ and GSCNN, nor between the results of the Sellpy algorithm and GSCNN. The observed effect size was large, with a value of 1.58, which indicates that the difference between the means of DeepLabv3+ and the Sellpy algorithm is large. The resulting mIoUs used in the ANOVA test are shown in table 5.7; a sketch of the omnibus and post-hoc tests follows the table. Figures 5.5-5.8 show an augmented image together with the predictions of the Sellpy algorithm, DeepLabv3+ and GSCNN.

Algorithm    Test 1   Test 2   Test 3   Test 4
Sellpy       0.8642   0.8736   0.8508   0.8578
DeepLabv3+   0.8928   0.8978   0.8768   0.8907
GSCNN        0.8834   0.8757   0.8698   0.8718

Table 5.7: Mean IoU values for augmented images for ANOVA Test
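The pairwise conclusions above are typically obtained with a post-hoc test following the three-group ANOVA. The thesis does not state which post-hoc procedure was used; the sketch below uses Tukey's HSD from statsmodels as one common choice, on the values from table 5.7:

    from scipy.stats import f_oneway
    from statsmodels.stats.multicomp import pairwise_tukeyhsd

    sellpy = [0.8642, 0.8736, 0.8508, 0.8578]
    deeplab = [0.8928, 0.8978, 0.8768, 0.8907]
    gscnn = [0.8834, 0.8757, 0.8698, 0.8718]

    # Omnibus test: is there any difference between the three groups?
    print(f_oneway(sellpy, deeplab, gscnn))

    # Post-hoc test: which specific pairs of groups differ?
    scores = sellpy + deeplab + gscnn
    groups = ["Sellpy"] * 4 + ["DeepLabv3+"] * 4 + ["GSCNN"] * 4
    print(pairwise_tukeyhsd(scores, groups, alpha=0.05))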

5.3.2 Summary

The final results of the performances of Sellpy's algorithm and the deep neural networks on the dataset consisting of augmented clothing imagery showed that DeepLabv3+ performs better than Sellpy's algorithm with an mIoU difference of 2.8%, and that the difference between their overall mIoUs is statistically significant at p-level 0.05, with a large effect size of 1.58. The scored mIoU of DeepLabv3+ is also higher than the mIoU of GSCNN, but there is no statistically significant difference between them. Similarly, the mIoU of GSCNN is higher than the mIoU of Sellpy's algorithm, but there is no statistically significant difference between them.


Figure 5.5: Original augmented image
Figure 5.6: The Sellpy algorithm prediction
Figure 5.7: DeepLabv3+ prediction
Figure 5.8: GSCNN prediction


Chapter 6

Discussion

The purpose of this thesis was to investigate whether deep neural networks could be used to conduct semantic image segmentation of advertisement images of clothing, and to compare their performance to a traditional computer vision algorithm. Two deep neural networks were trained with optimal hyperparameters and evaluated on regular images and augmented images. Their performance was compared to each other's and to a traditional computer vision algorithm. The results presented in chapter 5 are discussed in this chapter, starting with a comparison of the networks and the old algorithm, followed by a discussion of the methodology, and finishing with a discussion of ethical, sustainable, and societal aspects of the work.

6.1 Model Comparison

6.1.1 Regular Dataset

When trained on the Sellpy training dataset and then evaluated on a separate dataset, DeepLabv3+ and GSCNN scored an average mIoU of 0.9235 and 0.8920 respectively. DeepLabv3+ thus outperforms GSCNN by 3.2%, and the difference between the results was shown to be statistically significant according to a one-way ANOVA test of four means. The ANOVA test was used to ensure that the difference between them is valid, meaning that one of the models in fact performs better than the other, and to exclude the possibility that the difference arose by chance. As presented in chapter 3, Takikawa et al. [26] had shown that GSCNN performed better than DeepLabv3+ on the Cityscapes dataset. More specifically, GSCNN reached an mIoU that was 2% higher than DeepLabv3+'s, and the object that GSCNN segmented especially well was poles.


Not only did the results of this work show that DeepLabv3+ performed better than GSCNN on the Sellpy dataset, but DeepLabv3+ also performed better on all labels, including the pole label, where the difference in mIoU was largest.

One could draw the conclusion that DeepLabv3+ is a generally better model than GSCNN, but it is important to note two things about the results of GSCNN. First, the authors of GSCNN employed their own version of DeepLabv3+, which used the backbones ResNet-50, ResNet-101, and WideResNet. In this work the backbone used for DeepLabv3+ was Xception, the same backbone that Chen et al. [21] used when achieving a new state-of-the-art score of 82.1% on the Cityscapes test dataset. This indicates that Xception is in fact a better performing backbone than ResNet-50, ResNet-101, and WideResNet when used in DeepLabv3+, and might be the reason for DeepLabv3+ outperforming GSCNN in this experiment. Secondly, the results achieved when outperforming DeepLabv3+ were for a completely different dataset. Not only does the Cityscapes dataset have 10 times more labels than the Sellpy dataset, but its images are also more complex than those in the Sellpy dataset, where each image is very similar to the others. As for GSCNN's performance on the pole label, one can note that images in the Cityscapes dataset often contain many poles, which are longer and make up a larger portion of the image, as opposed to the Sellpy dataset, where each image contains only one, smaller pole.

The atrous rates used in DeepLabv3+ and GSCNN were based on the values used in the works of Chen et al. [21] and Takikawa et al. [26] when training on the Cityscapes dataset. GSCNN used atrous rates {16, 24, 36} while DeepLabv3+ used {6, 12, 18}, meaning that GSCNN had a larger receptive field than DeepLabv3+, which indicates that the network is less dependent on the in-plane dimensionality of the images. One can argue that a large receptive field is not as necessary when training on the Sellpy dataset, since the labels appear in roughly the same positions across images, in contrast to for instance the Cityscapes dataset; the in-plane dependence of the network is therefore not necessarily a bad feature in this use case. A sketch of how these rates translate into dilated convolutions is shown below.
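As a minimal illustration of the atrous rates, the sketch below builds one dilated 3 × 3 convolution per rate in the spirit of an ASPP branch; the class and channel sizes are illustrative, not the exact DeepLabv3+ or GSCNN implementations.

    import torch.nn as nn

    class AtrousBranch(nn.Module):
        # One ASPP-style branch: a 3x3 convolution whose dilation rate
        # enlarges the receptive field; padding == dilation keeps the
        # spatial dimensions unchanged.
        def __init__(self, in_ch: int, out_ch: int, rate: int):
            super().__init__()
            self.conv = nn.Conv2d(in_ch, out_ch, kernel_size=3,
                                  padding=rate, dilation=rate)

        def forward(self, x):
            return self.conv(x)

    # DeepLabv3+ used rates {6, 12, 18}; GSCNN used {16, 24, 36},
    # giving GSCNN the larger receptive field.
    deeplab_branches = [AtrousBranch(256, 256, r) for r in (6, 12, 18)]
    gscnn_branches = [AtrousBranch(256, 256, r) for r in (16, 24, 36)]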

6.1.2 Augmented Dataset

When evaluating DeepLabv3+, GSCNN and the Sellpy algorithm on augmented images, whose gamma levels had been adjusted to make them brighter, DeepLabv3+ performed better than the old algorithm by 2.8%, scoring an average mIoU of 0.8895 whilst the Sellpy algorithm reached 0.8616, and the results were shown to be statistically significantly different. GSCNN also performed generally better than the Sellpy algorithm, but the ANOVA test showed that the results were not statistically significantly different, meaning that it is not certain that GSCNN in fact performs better; it is not possible to exclude that the higher score is due to, for instance, chance. DeepLabv3+ performed generally better than GSCNN, but the ANOVA test showed that these results were not statistically significantly different either.

6.2 Methodology

6.2.1 Hyperparameter Optimization

The hyperparameter optimization was done by training the models on a smaller portion of the dataset with different batch sizes and learning rates. Due to time constraints, the batch sizes and learning rates were varied over 3 and 5 candidate values respectively. When analyzing the results this limitation should be taken into account, since an even more optimal set of hyperparameters might exist outside the grid and could have been found if a larger set of candidate values had been searched. A sketch of the grid search procedure is shown below.
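As a minimal sketch of the grid search, assuming a hypothetical train_and_evaluate helper that trains the model briefly and returns the validation mIoU (the actual training loop is not reproduced here, and the candidate values below are illustrative):

    import random

    def train_and_evaluate(batch_size: int, learning_rate: float) -> float:
        # Hypothetical stand-in: in practice this trains the network for a
        # restricted number of epochs and evaluates on the validation split.
        return random.random()

    batch_sizes = [8, 16, 18]                        # 3 candidate values
    learning_rates = [1e-2, 5e-3, 1e-3, 5e-4, 1e-4]  # 5 candidate values

    best_config, best_miou = None, -1.0
    for bs in batch_sizes:
        for lr in learning_rates:
            miou = train_and_evaluate(bs, lr)
            if miou > best_miou:
                best_config, best_miou = (bs, lr), miou

    print(best_config, best_miou)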

The models were trained for a restricted number of epochs, which may have impacted the resulting optimal hyperparameters. The results in table 5.1 and table 5.2 indicate that no learning rate lower than 1 × 10⁻³ is optimal, but that might be a consequence of the number of trained epochs. The optimal hyperparameters for DeepLabv3+ were batch size 18 and learning rate 1 × 10⁻³, reaching an mIoU of 0.8316. To strengthen this result, DeepLabv3+ was trained for longer with a lower learning rate of 1 × 10⁻⁴ and with the batch size set to 18, and this configuration reached an mIoU of 0.8307. This indicates that the resulting hyperparameters are in fact the most optimal among the configurations in the grid search.

Both models also had more hyperparameters that could be set before training, but this work limited the tuning to batch size and learning rate, which are two basic hyperparameters to tune. The rest of the hyperparameters were set according to the best performing values for DeepLabv3+ by Chen et al. [21] and for GSCNN by Takikawa et al. [26], and those values are therefore considered valid in this work.


6.2.2 Data

The purpose of this work is to investigate whether deep learning algorithms can be used instead of traditional computer vision when segmenting images of clothing, because the current Sellpy algorithm is known to make mistakes on images that are too bright or too dark. To be able to train the deep learning models a dataset had to be created, and this dataset was created with the old algorithm. Since the algorithm can make mistakes when segmenting clothes, the dataset risks including images with incorrect segmentations as ground truths. To minimize this risk, the majority of the segmented images were of colorful clothing, which the algorithm was known to segment well. 1000 images were checked by hand to get an understanding of how often the old algorithm makes mistakes; 51 images were recorded as having an incorrect segmentation of the mannequin as ground truth, and few to no images had incorrect segmentations of the pole.

Thus, in terms of mIoU both DeepLabv3+ and GSCNN perform relatively well, but since training, validation, and evaluation depend on an imperfect dataset, the deep learning models' measured ability to segment clothes in general might be skewed.

6.2.3 Evaluation

The models were evaluated on four different test datasets so that a one-way ANOVA test could be conducted to determine whether their results are statistically significantly different or not. A second one-way ANOVA test was conducted for the means of the deep neural networks and the Sellpy algorithm. All the test datasets, both those with regular images and those with augmented images, contained distinct images. The use of distinct images means that no image was used in more than one of training, validation, or evaluation, which strengthens the evaluation in two ways: first, it excludes the possibility that the networks are fitted to a specific portion of the data, which can be a risk when limited data is available, and second, the results represent the networks' ability to generalize to additional images provided by Sellpy.

Two additional ways to improve the evaluation would be to use more than four test datasets and to use more images in each test dataset when performing the ANOVA tests. In the evaluation, approximately 175 images were in each test set, which might be considered low compared to the high number of training images (5134). Using more test datasets, and more images in each set, would strengthen the statistical analysis and could potentially yield another outcome for the statistically significant difference between the groups.


The Sellpy dataset was only split once, meaning that one training, validation and evaluation set was used. When a limited amount of data is available it can be preferable to use N-fold cross validation, to access more training and validation data and create a more generalized model. In this work, 5134 images were considered sufficient for training the neural networks, and thus no N-fold cross validation was performed. A sketch of such a split is shown below.
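As a minimal sketch of the N-fold alternative, assuming a list of image paths (the file names below are placeholders), scikit-learn's KFold produces the five train/validation splits:

    from sklearn.model_selection import KFold

    image_paths = [f"img_{i:04d}.png" for i in range(5134)]  # placeholder paths

    kfold = KFold(n_splits=5, shuffle=True, random_state=0)
    for fold, (train_idx, val_idx) in enumerate(kfold.split(image_paths)):
        # Each fold trains on ~80% of the data and validates on the rest.
        print(f"fold {fold}: {len(train_idx)} train / {len(val_idx)} validation")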

The metric mIoU was used for evaluating the deep neural networks and the Sellpy algorithm. Overall pixel accuracy and per-class accuracy give biased measurements on datasets with imbalanced classes or with a large background label, which is the case for the Sellpy dataset. The mIoU metric does not suffer from this bias on such datasets, and it was therefore considered suitable for the evaluation. A sketch of the metric is shown below.
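As a minimal sketch of the metric, assuming integer label maps with the three classes used in this work (background, mannequin, pole):

    import numpy as np

    def mean_iou(pred: np.ndarray, target: np.ndarray, num_classes: int = 3) -> float:
        # Per-class intersection over union, averaged over the classes that
        # occur in either the prediction or the ground truth.
        ious = []
        for c in range(num_classes):
            pred_c = pred == c
            target_c = target == c
            union = np.logical_or(pred_c, target_c).sum()
            if union > 0:
                intersection = np.logical_and(pred_c, target_c).sum()
                ious.append(intersection / union)
        return float(np.mean(ious))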

6.3 Validity of Results

The validity of this thesis' results can be evaluated by considering the construct, conclusion and internal validity of the study [31]. This section summarizes parts of the discussion of the results and connects them to each validity aspect.

Construct validity concerns whether the results reflect the theory behind them [31]. In this study, it is shown that DeepLabv3+ performs better than GSCNN, which is opposite to the results of Takikawa et al. [26]. This can be caused by two main factors: first, the models are evaluated on a different dataset in this study, indicating that the models perform differently depending on the use case, and second, the backbone used for DeepLabv3+ in this study differs from the one used by Takikawa et al. [26], which might be a threat to the validity of the results, but might also indicate that DeepLabv3+ is the better network when using its most successful backbone. Conclusion validity mainly concerns the statistical validity of the results. As mentioned in section 6.2.3, the statistical analysis would be strengthened if the models were evaluated on more than four test datasets and if each test dataset consisted of more images. The statistical test on the augmented test datasets might yield another outcome if more datasets were evaluated, and this is therefore a threat to the validity of the results.

Internal validity concerns the relationship between independent variables and the outcome of an experiment, i.e. whether there exists another variable that might have affected the outcome [31]. As mentioned in section 6.2.1, a more extensive grid search for optimal hyperparameters might have yielded another outcome for the study.

References
