Feature extraction from images with augmented feature inputs

ANDREAS DRANGEL

KTH ROYAL INSTITUTE OF TECHNOLOGY

SCHOOL OF COMPUTER SCIENCE AND COMMUNICATION


Master in Computer Science
Date: November 21, 2017
Supervisor: Anders Holst
Examiner: Anders Lansner

Swedish title: Särdragsextrahering från bilder med förstärkt särdragsinmatning


Abstract

Machine learning models for visual recognition tasks such as image recognition have lately been a common research area. However, not much research has been done on settings where multiple features are to be extracted from the same input. This thesis investigates if and how knowledge about one feature influences the performance of a model classifying another feature, as well as how the similarity and generality of the feature data distributions influence model performance.

Incorporating augmentation inputs in the form of extra feature information in image models was found to yield different results depending on feature data distribution similarity and level of generality. Care must be taken when augmenting with features in order for the features not to become completely redundant or to completely take over the learning process. Selecting reasonable augmentation inputs might yield desired synergy effects which influence model performance for the better.


Sammanfattning

Machine learning models for visual recognition tasks such as image recognition have lately been a common research area. However, not much research has focused on extracting multiple features from the same input. This thesis aims to investigate how knowledge about one feature influences the performance of a model classifying another feature, as well as how similarity and generality in the feature data distributions influence model performance.

Incorporating augmentation inputs in the form of extra feature information in image classification models was shown to yield different results depending on the similarity and generality of the feature data distributions. Care must be taken when augmenting with features so that the augmenting features do not become completely redundant or completely take over during the training process. Selecting reasonable augmentation features may bring about desired synergy effects which influence model performance for the better.


Contents

1 Introduction
  1.1 Problem formulation
  1.2 Research question
  1.3 Limited scope
  1.4 Report outline

2 Background and related work
  2.1 Machine Learning
    2.1.1 Classification
    2.1.2 Supervised vs unsupervised learning
    2.1.3 Training
    2.1.4 Inference
  2.2 Deep Learning
    2.2.1 Feedforward neural networks
    2.2.2 Neurons and activation functions
    2.2.3 Convolutional neural networks
    2.2.4 Training deep neural networks
    2.2.5 Transfer learning
    2.2.6 Software and libraries

3 Methods
  3.1 The Dataset
    3.1.1 Handling imbalance
    3.1.2 Data augmentation
  3.2 Base model
  3.3 1-of-K input augmentation model
  3.4 Multiple 1-of-K augmentation model
  3.5 No image model

4 Results
  4.1 Base models
  4.2 No image models
  4.3 1-of-K input augmentation models
  4.4 Multiple 1-of-K augmentation models

5 Discussion and Conclusions
  5.1 Discussion
    5.1.1 Dependence on feature data distribution similarity and generality
    5.1.2 Method criticism
  5.2 Conclusion
  5.3 Future work

Bibliography

A Dataset description


Chapter 1 Introduction

The use of machine learning systems for visual recognition tasks is a field that has gained enormous traction in recent years, in tasks such as image classification [15] and object detection [23, 22]. Early research in neuroscience in the 1960s [12, 10, 11] inspired the construction of convolutional neural networks, which are used in today's state-of-the-art machine learning models for visual recognition. Combining the concept of convolutional neural networks with the hardware capabilities of today to construct deep neural networks (DNNs) has resulted in models that are on par with human performance, or in some particular use cases even surpass human performance, in visual recognition tasks [7].

Most DNN models of today aim to classify one type of feature in one particular domain, where inputs are passed through the network with the goal of classifying the input as one of K possible output classes within the domain. This is also where most research in the area is focused. Less research is focused on the setting where multiple types of features should be extracted from the same input. The dataset to be investigated in this thesis consists of a wide collection of second-hand items: mainly clothing articles, but also shoes, kitchenware and a wide range of other sorts of items. The dataset is described in more detail in the beginning of the "Methods" chapter. Each item is described by five features - a general type and four levels of nested categories of varying specificity (a deeper nested level means a more specific category). This thesis aims to research how information about one such feature influences the knowledge of other features.


1.1 Problem formulation

This thesis aims to research the problem of classifying dependent features of varying specificity using image inputs. The features to be classified are "type" and four nested levels of "categories". The dataset and features are covered more in depth in the "Dataset" section of the "Methods" chapter. The hypothesis is that having information about one feature influences what we may know about the other features because of how they are related to one another.

1.2 Research question

The research question this thesis seeks to answer is

Can information about one feature extracted from one model, influence the performance of another model classifying another (possibly dependent) feature? How does the representation of this extra feature information influence model performance?

This will be investigated by constructing baseline models without extra features incorporated, then constructing models where other features are represented in the inputs. The latter models will then be benchmarked against the baseline models in order to investigate if and how feature augmentations from other feature models influence model performance.

1.3 Limited scope

The scope of this thesis will be limited to benchmarking and investigating the specific dataset covered in the "Dataset" section of the "Methods" chapter, with the specific features it entails. Due to time constraints, the thesis is also limited to using convolutional image features from the VGG16 deep learning model pre-trained on the ImageNet dataset, as well as the VGG16 deep learning network structure.


1.4 Report outline

The first chapter (Introduction) introduces the problem and the field to the reader, as well as the research question to investigate. The second chapter (Background and related work) presents necessary theoretical background and terminology in machine learning and deep learning, which is needed in order to understand methods and arguments in later chapters. The third chapter (Methods) presents the methods used in the thesis in order to investigate the problem and research question. The fourth chapter (Results) presents the results of the models produced in the Methods chapter, while the fifth chapter (Discussion and Conclusions) discusses the results presented in the Results chapter and aims to provide a conclusion to the research question.


Chapter 2 Background and related work

This section aims to describe necessary background and terminology in machine learning, and can therefore be skipped by a reader with previous knowledge in the field.

2.1 Machine Learning

The aim of using Machine Learning in order to solve a task is to learn a model in the form of a functional mapping f from a given set of data X, rather than tackling the problem by using handcrafted rules or heuristics. The latter approach seldom yields good results due to exceptions to rules, edge cases and so on, while the former generalizes better and is often a better approximation of the true underlying function [4].

2.1.1 Classification

The goal of classification in machine learning is to use the functional mapping of the learned model to assign an input vector X to one of K discrete classes, i.e.

$$f : \mathbb{R}^n \rightarrow \{1, \ldots, K\}$$

The output can either be an immediate answer out of the K classes, or the function can output a probability distribution over the K classes, in which case the class with the highest probability is returned [7].
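As a minimal sketch of the latter case (NumPy; the probability values are made up for illustration):

import numpy as np

# Hypothetical model output: a probability distribution over K = 3 classes.
probs = np.array([0.1, 0.7, 0.2])

predicted_class = int(np.argmax(probs))  # class 1, the most probable class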


2.1.2 Supervised vs unsupervised learning

The most common type of machine learning used in practice is supervised learning, where we are given a dataset with inputs X and corresponding outputs Y, often annotated by a human. In unsupervised learning we are instead only given inputs X, and the aim of the models is often to find interesting structures in the data. A common usage of unsupervised learning is data clustering [20].

2.1.3 Training

The actual learning of the model is done during the training phase. The dataset is divided into a training set and a test set, where the training set is used to iteratively alter the parameters of the model to better fit the data. During training, optimization is done by minimizing the training error through a cost function J(θ). A common approach is maximum likelihood, which is done by minimizing the negative log-likelihood with respect to the parameters θ,

$$J(\theta) = -\,\mathbb{E}_{\mathbf{x}, y \sim \hat{p}_{\text{data}}} \log p_{\text{model}}(y \mid \mathbf{x}; \theta) \qquad (2.1)$$

i.e. the cross-entropy between the model distribution and the training data [7]. When J ≈ 0 the model correctly classifies most inputs and is performing well, while a large value of J means that the model incorrectly classifies many inputs and therefore performs poorly [21].
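As a small numerical sketch of equation (2.1) for a single training example (NumPy; the probability vector is made up):

import numpy as np

def negative_log_likelihood(probs, true_class):
    # Cross-entropy for one example: -log p_model(y | x; theta)
    return -np.log(probs[true_class])

probs = np.array([0.1, 0.7, 0.2])         # hypothetical model output
print(negative_log_likelihood(probs, 1))  # ~0.36: confident and correct
print(negative_log_likelihood(probs, 2))  # ~1.61: the true class was deemed unlikely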

A machine learning model generalizes well if it performs well at classifying previously unseen data. The model's ability to generalize can be measured by applying the cost function to the test set, which contains previously unseen data, thereby calculating the test error. Overfitting occurs when the parameters θ of the model are too adapted to the training data and the model therefore generalizes poorly, i.e. the gap between the training error and the test error is too large, while underfitting occurs when the training error itself is too large [7].

A common approach to preventing overfitting is to use regularization. The idea is to reduce the test error without influencing the training error too much, by putting constraints on the weights of the model in the form of

$$\tilde{J}(\theta; X, t) = J(\theta; X, t) + \alpha \Omega(\theta) \qquad (2.2)$$


where α ∈ [0, ∞) is a hyperparameter controlling the relative contribution of the regularization constraint Ω [7].

Many machine learning models have certain parameters that control the behavior of the model but are not trained on the training set. The α parameter used for regularization is one such example; the learning rate λ of gradient descent is another. One approach to optimizing the hyperparameters of a model is to extract a validation set from the training set. The idea is then to learn the parameters θ from the training set and use the validation set to tune the hyperparameters [7].

A common approach to training neural networks and optimizing the cost function is stochastic gradient descent. The idea behind gradient descent is to iteratively evaluate the gradient at each step and move in small steps in the direction of the negative gradient, which effectively decreases the value of the cost function [7]. There are other proposed variants of gradient descent, such as Adagrad [6], RMSprop [27] and Adam [14], which aim to improve on gradient descent by using different gradient update methods and adaptive learning rates.
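A minimal sketch of one stochastic gradient descent update; grad_fn is a hypothetical function returning the gradient of the cost on a mini-batch:

import numpy as np

def sgd_step(theta, grad_fn, x_batch, y_batch, learning_rate=0.01):
    # Estimate the gradient on the mini-batch and move against it.
    grad = grad_fn(theta, x_batch, y_batch)
    return theta - learning_rate * grad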

2.1.4 Inference

Inference in a machine learning model is the process of feeding a previously unseen input to the model to generate an output prediction. This is a separate phase from the training phase and typically uses less computational power. The inputs are fed through the network using the parameters produced in the training phase until an output is produced. The test set is used to measure inference performance, which ultimately measures the practical performance of the model - i.e. how well the model generalizes to previously unseen data.

Optimizing for the generalization error is what differentiates machine learning models from pure optimization problems, where the optimal solution can be found using the available data. In machine learning models the training error is minimized as well, but more importantly the generalization error should be low in order for the model to perform well. A low training error does not always translate to a model with good performance, as overfitting may occur. Using different regularization techniques during the training phase is a popular way of achieving better generalization performance, which in turn yields better models [7].

2.2 Deep Learning

This section aims to describe relevant theory in deep learning.

2.2.1 Feedforward neural networks

The most essential deep learning models are called feedforward neural networks, so called since they feed information forward through the network until they produce an output. They resemble networks since they can be visualized by chaining together many different vector-based functions in a network structure $f_n(f_{n-1}(\cdots f_2(f_1(\mathbf{x}))))$, where the vector output of one function, also called a layer, is fed as a fully-connected input to the next layer. The depth of the model refers to the length of the layer chain, hence the name deep learning. The term neural arises from the fact that the network structure is loosely inspired by neuroscience and how the brain functions. Neural networks consist of an input layer and an output layer, while any layers in between are called hidden layers [7]. Figure 2.1 depicts an example of a feedforward neural network.

2.2.2 Neurons and activation functions

Each node, or neuron, in a neural network is constructed as a linear summation of weights W and features h from the previous layer, with an additional bias b, of the form $\hat{y} = W^T h + b$. This summation is then passed to an activation function which determines the output of the neuron, i.e. whether it fires or not. An important concept when designing neural networks is that the gradient of the cost function must be considerable in order to guide the parameter updates of the model during training. Activation functions that saturate (become flat and insensitive to small input changes) suffer from small gradients, which makes them infeasible during training [7]. Figure 2.2 depicts a representation of a neuron in an artificial neural network.
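A minimal sketch of the computation performed by a single neuron (NumPy; the weights, bias and inputs are made-up values):

import numpy as np

W = np.array([0.5, -0.3, 0.8])   # weights from the previous layer
b = 0.1                          # bias
h = np.array([1.0, 2.0, 0.5])    # features from the previous layer

z = W.T @ h + b                  # linear summation y_hat = W^T h + b
y = max(z, 0.0)                  # activation function (here a ReLU)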


Figure 2.1: A depiction of a feedforward neural network with one input layer, two hidden layers and an output layer. Image adapted from Nielsen [21].


Figure 2.2: A depiction of a neuron as represented in a neural network, of the form $y_k = \varphi(W_k^T x + b_k)$. Image taken from Haykin [8].


Figure 2.3: The sigmoid function, saturating towards 0 and 1.

Sigmoid unit

The idea behind the sigmoid activation function is to smoothen the activation by continuously saturating the value of the function towards 0 and 1. It is defined as

$$\sigma(x) = \frac{1}{1 + e^{-x}} \qquad (2.3)$$

The exponential in the denominator effectively causes the saturation. Since the cost function uses the negative log-likelihood, it cancels out the saturating exponential, which helps to prevent small gradients due to saturation for the sigmoid. The sigmoid is a popular activation function for the output layer since it squashes the output value between zero and one, which can be interpreted as a probability [7]. Figure 2.3 depicts an illustration of the sigmoid activation function.

Rectified Linear Unit

The most common activation function for hidden layer neurons in feedforward networks is the Rectified Linear Unit, or ReLU. A ReLU is a non-linear activation function defined as

$$f(x) = \max(0, x) \qquad (2.4)$$

Since the function is close to linear (it is piecewise linear, composed of two linear pieces) the gradient is relatively easy to compute, which is convenient during optimization in the training phase [7]. Figure 2.4 depicts an illustration of the ReLU activation function.

Figure 2.4: A depiction showing a ReLU activation function.

Softmax Unit

The softmax activation function saturates towards zero and one and is defined as

$$\mathrm{softmax}(\mathbf{x})_i = \frac{e^{x_i}}{\sum_j e^{x_j}} \qquad (2.5)$$

By similar logic as for the sigmoid activation function (the log-likelihood canceling out exponentials before the gradient is calculated), the softmax is a feasible activation function. Similar to the sigmoid, the softmax squashes the input vector to an output vector of values between zero and one, which in addition sum to one. This means that the output of the softmax activation function can be interpreted as a probability distribution, and it is therefore a suitable output unit activation function [17].
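A minimal NumPy sketch of the three activation functions defined in equations (2.3)-(2.5):

import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))     # equation (2.3)

def relu(x):
    return np.maximum(0.0, x)           # equation (2.4)

def softmax(x):
    e = np.exp(x - np.max(x))           # shifting the input improves numerical stability
    return e / e.sum()                  # equation (2.5)

x = np.array([2.0, -1.0, 0.5])
print(softmax(x).sum())                 # 1.0: the output is a valid probability distribution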


2.2.3 Convolutional neural networks

Neurophysiologists David Hubel and Torsten Wiesel collaborated in the 1960s in order to determine how the mammalian visual system works. They connected measurement equipment to the primary visual cortex of cats and monkeys and displayed lights in various shapes and forms in front of their eyes in order to determine how the neurons in the visual cortex fired. They were later awarded a Nobel prize for their accomplishments. Convolutional neural networks are heavily inspired by this research in neuroscience and are perhaps one of the greatest success stories of how biology can inspire artificial intelligence.

A convolutional layer consists of three individual stages. The first stage performs several convolutions in parallel, the second stage performs a non-linear activation on the outputs of the convolutions, while the third stage uses a pooling function that further modifies the output of the layer [7].

The convolution operation

The convolution operation is a linear operation on two functions. It is expressed as s(t) = (x ∗ w)(t) and is defined in the discrete domain as

$$s(t) = \sum_{a=-\infty}^{\infty} x(a)\, w(t-a) \qquad (2.6)$$

where x is referred to as the input and w as the kernel [7]. The operation effectively computes the integral of the intersected product as the kernel function is shifted over the input function [28].
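A minimal sketch of the discrete convolution in equation (2.6) (NumPy; the signal and kernel values are made up):

import numpy as np

x = np.array([0.0, 1.0, 2.0, 3.0, 4.0])   # input
w = np.array([1.0, 0.0, -1.0])            # kernel

s = np.convolve(x, w)                     # computes s(t) = sum_a x(a) w(t - a)
print(s)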

By making the kernel smaller than the input, the receptive field - i.e. the set of neurons that effectively affect a particular neuron - is shrunk dramatically. This leads to sparse interactions between neurons compared to a fully-connected layer. In a fully-connected layer each neural connection is associated with its own weight, while re-using the same kernel across a convolutional layer leads to parameter sharing. This means that the same kernel parameters are used for every input and kernel computation across the layer. Sparse interactions together with parameter sharing effectively lead to lower memory requirements for parameter storage compared to a fully-connected layer. Another effect of the convolution operation is that the output can be seen as small meaningful features; for example, edges can be represented in early layers by a kernel that occupies only tens or hundreds of pixels [7].

The pooling operation

The pooling operation maps a convolutional output layer to a summary statistic of nearby outputs. The stride of the pooling operation determines how far the pooling neighborhood moves at each step. A popular pooling operation in convolutional networks is max pooling, where a neighborhood of neurons is mapped to the maximum neuron value within the neighborhood. Another is average pooling, where the average over the neighborhood is returned. The pooling operation can be seen as an infinitely strong prior that the previous layer must be translation invariant, since small translations in the previous layer cause most outputs to keep their value due to the summary statistic in the pooling. This is convenient for classification tasks, where we are more interested in whether a specific feature exists within the data than in the exact location of the feature.

The convolution operation together with the pooling operation downsizes the network and requires less parameter storage compared to a fully-connected layer. The downsizing of the output layer puts constraints on the number of convolutional layers that can be added to the network. A technique for getting around this problem is zero padding the input by adding zeros on both sides of the inputs [7].
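A minimal sketch of max pooling with a 2x2 neighborhood and stride 2 (NumPy; the feature map values are made up):

import numpy as np

def max_pool_2x2(feature_map):
    # Map each non-overlapping 2x2 neighborhood to its maximum value.
    h, w = feature_map.shape
    trimmed = feature_map[: h // 2 * 2, : w // 2 * 2]   # drop odd borders
    return trimmed.reshape(h // 2, 2, w // 2, 2).max(axis=(1, 3))

fm = np.array([[1., 3., 2., 0.],
               [4., 2., 1., 1.],
               [0., 1., 5., 6.],
               [2., 2., 7., 8.]])
print(max_pool_2x2(fm))   # [[4. 2.] [2. 8.]]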

Lee, Gallagher, and Tu [16] investigate parameterized pooling functions in convolutional layers. They introduce parameterized pooling functions with relatively few learned parameters, which results in a performance boost in comparison to their baseline convolutional net with max pooling, which most state-of-the-art convolutional models use. They investigate proportion-based combinations of max and average pooling as well as other pooling functions such as gated max-average pooling and tree pooling, where different pooling filters and combinations of the filters are learned. There is a trade-off between increased training time and increased performance when learning parameters for the pooling function.

Simonyan and Zisserman [24] and Szegedy et al. [26] show that deep convolutional networks with small filters are preferable to shallow convolutional networks with larger filters, by constructing and benchmarking convolutional architectures which perform very well on the ImageNet dataset.

2.2.4 Training deep neural networks

Stochastic gradient descent, or any of its variants with adaptive learning rates, is most often used during the training phase of a deep neural network. A technique called back propagation uses the chain rule of calculus to iteratively update the parameters at each iteration [7].

Dropout is an effective regularization technique proposed by Srivastava et al. [25] which is implemented by keeping a neuron active with some probability p, controlled by a hyperparameter; otherwise its output is set to zero, which in effect samples a sub-network from the full neural network. The effect of dropout is to make the presence of any particular neuron unreliable, thereby preventing neurons from building co-adaptations on other neurons, which in turn makes the network generalize better to unseen data.
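A minimal sketch of dropout at training time (NumPy). This is the commonly used "inverted" formulation, which scales the kept activations by 1/p so that no rescaling is needed at inference time:

import numpy as np

def dropout(h, p_keep=0.5, training=True):
    if not training:
        return h                                         # no dropout at inference
    mask = (np.random.rand(*h.shape) < p_keep) / p_keep  # sample and rescale
    return h * mask                                      # zero out dropped neurons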

A more recent reparametrization technique is batch normalization, proposed by Ioffe and Szegedy [13], which has become common in the deep learning community. Using batch normalization they are able to use far fewer training steps while still yielding state-of-the-art performance. Batch normalization aims to prevent internal covariate shift, which is defined as the change in the distribution of network activations due to the change in network parameters during training. The idea is to normalize the activations for each mini-batch to a unit Gaussian, but each layer can also learn a linear transformation of the normalized features. This way the layers can benefit from unit Gaussian normalization where beneficial, while the network can, through back-propagation, learn to undo the batch normalization where that is helpful. Batch normalization also functions as a form of regularization, since it processes a batch of training examples each iteration, which slightly decreases the dependence on other forms of regularization such as dropout.

Early stopping is a technique used to prevent overfitting the parameters to the training data. The idea is to store the parameter setting at the point in time with the lowest validation error; instead of optimizing until the optimization algorithm reaches a local minimum, it is run until the validation set error has not improved for a certain amount of time. When the optimization terminates, the parameters with the best validation set error are returned rather than the latest parameter setting. Early stopping can be thought of as a way of optimizing the training time hyperparameter [7].

Another common technique for training deep neural networks is data augmentation. This is performed by making small changes to the training data, such as mirroring the images across the vertical axis, small rotations, adding noise to the images and similar measures. The effect is that the training data space is expanded, which often increases the generalization ability of the model [7].

Goodfellow, Bengio, and Courville [7] suggest starting off with convolutional neural networks with ReLUs as hidden unit activation functions for visual recognition tasks. Stochastic gradient descent with momentum or Adam are suggested as appropriate optimization algorithms to start with.

He et al. [9] propose initializing the parameter weights for ReLU neurons by sampling from a zero-mean Gaussian with standard deviation $\sqrt{2/n}$, where n is the number of inputs, which is also backed by Li, Karpathy, and Johnson [17]. This initialization is shown to empirically improve the convergence rate. They also present the Parametric Rectified Linear Unit, or PReLU, where the coefficient of the negative part is not constant but is learned during the training phase. The coefficient parameter is shared across layers, which makes the amount of extra parameters negligible compared to the amount of weights. They show that exchanging the ReLU activation functions for PReLU activation functions improves classification accuracy. This suggests that learned activation functions improve accuracy, which is also consistent with the work of Agostinelli et al. [1], where adaptive piecewise linear units are learned. Learning activation functions increases the number of parameters and therefore the training time, while yielding an increase in accuracy.
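A minimal sketch of this initialization scheme (NumPy; the layer sizes are made-up placeholders):

import numpy as np

n_in, n_out = 4096, 4096   # fan-in and fan-out of a hypothetical dense layer

# Zero-mean Gaussian with standard deviation sqrt(2 / n_in), as in He et al. [9].
W = np.random.randn(n_in, n_out) * np.sqrt(2.0 / n_in)
b = np.zeros(n_out)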

Andrychowicz et al. [2] investigate how to cast the design of the optimization algorithm as a learning problem, by using gradient descent to learn a gradient descent-based optimization algorithm. They also investigate the transferability of the learned optimization algorithm to other tasks and determine that the transferability is high between architectures with similar activation functions and structures, even when transferring to models with more parameters. The learned optimization algorithm converges slightly faster and to lower test errors compared to current state-of-the-art gradient descent-based optimization algorithms.

2.2.5 Transfer learning

Transfer learning is a subset of representation learning that aims to transfer knowledge from one domain to another. In the context of neural networks this is done by transferring whole network architectures, subsets of networks and/or parameters from one trained network to another. In visual recognition, low-level features such as edges and visual shapes are often shared across the image domain, which is why transfer learning can improve generalization even when the datasets of the different networks are different [7].

Yosinski et al. [29] investigate the transferability of features in deep neural networks. They conclude that the transferability depends on the similarity between the dataset of the transfer network and that of the network to be constructed. However, initializing a network with transferred features rather than random initialization, in almost any number of layers, often yields a generalization boost after tuning the network to the target dataset. This indicates that the effect of the transferred features persists even after fine-tuning.

Lim, Salakhutdinov, and Torralba [18] investigate a different approach to transfer learning, where the training data is augmented by borrowing and transforming instances from other classes. Their model learns which training instances to borrow and how the transformations should be done in order to become more similar to the target class. They show that their object detector with borrowed and transformed examples improves upon the state-of-the-art on the small SUN09 dataset.

2.2.6 Software and libraries

TensorFlow is a machine learning software library presented in Martín Abadi et al. [19]. It uses dataflow graphs to let tensors flow through the computational graph. A more high-level deep learning software library is Keras, presented by Chollet et al. [5]. Keras is very common in the deep learning community and can be used to architecturally construct neural networks while using TensorFlow as a backend. Keras also has several pre-trained deep neural networks which can be easily imported and used for transfer learning.
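As a small sketch of the last point, a pre-trained network can be imported in Keras roughly as follows (the exact module layout may differ between Keras versions):

from keras.applications.vgg16 import VGG16

# Convolutional layers only, with ImageNet weights; the classifier top is dropped.
feature_extractor = VGG16(weights="imagenet", include_top=False,
                          input_shape=(224, 224, 3))
feature_extractor.summary()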


Chapter 3 Methods

This chapter aims to describe the methods and experiments conducted in the project. All models were programmed in the programming language Python (version 3.5.1). All experiments were conducted on an Amazon cloud computing cluster running Ubuntu 16.04 with a Tesla K80 graphics card.

3.1 The Dataset

The dataset consists of 835 383 images of second-hand items described by five features - a general type feature and four nested levels of categories, with more specificity the deeper the nested level. The dataset mainly consists of clothing items, but other articles are also present in the distribution of data. More detailed descriptions of the distributions of the different features are available in appendix A. All items are photographed at standardized photograph stations with a white background, in order to blend in well on a web marketplace with white background. Upper body clothing items are photographed dressed on mannequins, while other items are photographed at photograph stations mounted on a table with white background. Due to poor lighting conditions, the background often has "bleeding" hints of gray in it. The photographs are taken by human workers, which means that the angles and distances between camera and item vary from item to item. A sample of item images from the dataset can be seen in Figure 3.1.

All images were downsized to fit within 224x224 pixels, and the dataset was randomly divided into a training/validation/test split of 80%, 10% and 10% respectively.

Figure 3.1: Example images of some items from the dataset.

Since the dataset comes from real human-labeled data, the distribution of labels within the different features is very imbalanced. For example, the type feature comes from free-text human-labeled data and contains 14 150 different types, where the most common type contains 102 605 entries, while there are 7 964 least common types with only one entry each. The type feature can be seen as dependent on the category features, meaning that information about one feature may influence the knowledge of the other. For example, the most common type entry is Klänning (Dress in English), which is a subset of the most common top-level category Kläder (Clothes in English).

While the category features (hereafter referred to as cat1, cat2, cat3 and cat4 for the respective nested category levels) come from a pre-defined set of categories, they still suffer from major imbalance because of the nature of human-labeled data.

3.1.1 Handling imbalance

A minimum threshold of 1000 items per label was set in order to obtain enough data per label to be able to train the models. This limited the distribution of labels within the feature data significantly, while still allowing for good performance in real models, since the class labels with above 1000 entries still cover above 90% of the total amount of items within the different features.

3.1.2 Data augmentation

In order not to overfit the weights within the networks towards any specific label during training, over-sampling and down-sampling were used for the training set. In addition to the 1000 items per label threshold, a down-and-over-sample threshold was set to 5000 items. Any class label that had more than 1000 items but less than 5000 items was oversampled to 5000 items using data augmentation in the form of randomized slight rotations, zooming and horizontal flips. Any class label that had over 5000 items was down-sampled by simply randomizing a fraction of the total amount of items in order to yield 5000 items. This yielded even class distributions within the feature data for the different models, which prevents a model from overfitting towards any specific class label simply because more data with that label is available.
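A sketch of how such augmentation could be configured in Keras; the parameter values below are illustrative assumptions, not the exact settings used in the thesis:

from keras.preprocessing.image import ImageDataGenerator

# Randomized slight rotations, zooming and horizontal flips, as described above.
augmenter = ImageDataGenerator(rotation_range=10,   # degrees; assumed value
                               zoom_range=0.1,      # assumed value
                               horizontal_flip=True)

# augmenter.flow(x_train, y_train, batch_size=32) then yields augmented batches.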

Figure 3.2: An architectural representation of the VGG16 convolutional network model, with three fully-connected layers at the end. Image taken from Baidu [3].

3.2 Base model

A base model was constructed based on the convolutional layers of the VGG16 convolutional network trained on the ImageNet dataset. See figure 3.2 for an architectural overview of the VGG16 network. The convolutional layers were frozen, meaning that their weights were not updated during back propagation in the training phase. This essentially implies that the convolutional layers from the VGG16 network function as a feature extractor for the following layers. The fully-connected layers were stripped from the VGG16 network, and two fully-connected layers with 4096 neurons each and ReLU activation functions were added on top of the last convolutional layer, as well as a fully-connected softmax layer with as many neurons as there are class labels within the feature data distribution in question. The three fully-connected layers of the model were then trained from scratch using the training sets described above. Adam was used as an optimizer with a learning rate of 0.0001, together with a learning rate reducer which reduces the learning rate by a factor of 0.2 every time 5 epochs pass without any gain in validation accuracy. Early stopping was also used, and the training algorithm ran for a maximum of 40 epochs.
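A sketch of how this model could be assembled in Keras under the description above; num_classes and the early stopping patience are placeholders, since the thesis does not specify every setting:

from keras.applications.vgg16 import VGG16
from keras.callbacks import EarlyStopping, ReduceLROnPlateau
from keras.layers import Dense, Flatten
from keras.models import Model
from keras.optimizers import Adam

num_classes = 100   # placeholder: number of class labels for the feature in question

# Frozen VGG16 convolutional layers act as a fixed feature extractor.
base = VGG16(weights="imagenet", include_top=False, input_shape=(224, 224, 3))
for layer in base.layers:
    layer.trainable = False

x = Flatten()(base.output)
x = Dense(4096, activation="relu")(x)
x = Dense(4096, activation="relu")(x)
out = Dense(num_classes, activation="softmax")(x)

model = Model(inputs=base.input, outputs=out)
model.compile(optimizer=Adam(lr=1e-4),
              loss="categorical_crossentropy", metrics=["accuracy"])

callbacks = [ReduceLROnPlateau(monitor="val_acc", factor=0.2, patience=5),
             EarlyStopping(monitor="val_acc", patience=10)]   # patience assumed
# model.fit(..., epochs=40, callbacks=callbacks)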


Figure 3.3: A depiction of the 1-of-K input augmentation model.

3.3 1-of-K input augmentation model

A first experiment was conducted emulating the use of the output of one base model as an additional 1-of-K-encoded input to the fully-connected layers. The image is passed through the frozen convolutional network, which outputs abstract image features as in the base model, while another feature is passed as a 1-of-K-encoded input, thereby emulating that the other feature comes as a result from another model network. The image features are flattened and then concatenated with the 1-of-K encoding, and the whole concatenated layer is connected to three fully-connected layers as before.

The extra 1-of-K-encoded input is only encoded if the extra feature is among the pre-defined class labels selected from the 1000-label threshold described in the training data preparation. Otherwise the extra input vector is not encoded, yielding a vector of zeros which corresponds to no augmentation.
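A sketch of the augmented model in the Keras functional API, reusing the frozen base and num_classes from the previous sketch; k_classes, the size of the 1-of-K encoding, is a placeholder:

from keras.layers import Dense, Flatten, Input, concatenate
from keras.models import Model

k_classes = 50   # placeholder: size of the 1-of-K augmentation vector

image_features = Flatten()(base.output)           # from the frozen VGG16 base
onehot_input = Input(shape=(k_classes,))          # 1-of-K encoded extra feature

x = concatenate([image_features, onehot_input])   # join the two input branches
x = Dense(4096, activation="relu")(x)
x = Dense(4096, activation="relu")(x)
out = Dense(num_classes, activation="softmax")(x)

model = Model(inputs=[base.input, onehot_input], outputs=out)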


3.4 Multiple 1-of-K augmentation model

A second experiment was conducted using the same principle as the first experiment, but with multiple 1-of-K-encoded augmentation inputs from multiple additional features. As in the previous experiment, the image was passed through the frozen convolutional network and an abstract image feature output was produced. The image features were then flattened and concatenated with the 1-of-K encoded augmentation inputs.

3.5 No image model

A third experiment was conducted with the goal of determining how well information about one feature maps to another feature. No image data was used in this model; instead, only a 1-of-K-encoded input from one feature was fed to a small neural network consisting of the input layer and a classification layer for the feature to be classified.
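A sketch of this model in Keras, with the same placeholder sizes as before:

from keras.layers import Dense, Input
from keras.models import Model

onehot_input = Input(shape=(k_classes,))   # 1-of-K input from one feature
out = Dense(num_classes, activation="softmax")(onehot_input)

no_image_model = Model(inputs=onehot_input, outputs=out)
no_image_model.compile(optimizer="adam",
                       loss="categorical_crossentropy", metrics=["accuracy"])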


Chapter 4 Results

This chapter presents the results of the different models presented in the Methods chapter.

4.1 Base models

We see in table 4.1 that the base model classification based on image features has poor performance across the board. The accuracy, precision, recall and F-scores suggest that not much learning has been achieved and that the models have overfitted towards one or more classes.

Interpreting the confusion matrix for the type base model in figure 4.1, many misclassifications come from different types of similar-looking clothing items as well as shoes. For example, a lot of different types of pants are misclassified as "Träningsbyxor" (Training pants), many different types of tops are misclassified as "Tröja" (Shirt), and different types of shoes are misclassified as "Kilklackar" (Wedge heels).

The cat1 base model confusion matrix in figure 4.2 suggests overfitting towards six class labels, since they are the only ones predicted. This is probably due to broad and general class labels, as well as confusion between "Barnkläder och Barnskor" (Children's clothing and children's shoes), "Kläder" (Clothing) and "Skor" (Shoes), which are hard for the model to tell apart.

The confusion matrix in figure 4.3 for the base cat2 model suggests overfitting towards broader class labels, such as "Accessoarer" (Accessories), "Herrkläder" (Men's clothing) and "Barnkläder" (Children's clothing).

Name   Train loss   Val loss   Epochs   Accuracy   Precision   Recall   F-score
type     12.9850     13.2140      7      0.1844      0.2093     0.0649    0.0991
cat1     12.2361     11.7193     22      0.3113      0.0905     0.0059    0.0110
cat2     12.5655     12.2516      1      0.2677      0.2313     0.1016    0.1411
cat3     12.6696     13.4867     26      0.1607      0.2143     0.0535    0.0856
cat4     14.7995     14.7828      8      0.0832      0.0831     0.0092    0.0166

Table 4.1: Result table of the base models.

All the size class labels (which contain "Strl" (Size) in the label) suffer from misclassification, which most likely disturbs the model: extracting a size feature from an image in this model is a hard task, since the images are photographed with varying distances and angles between the camera and the item.

The confusion matrix in figure 4.4 suggests misclassification mainly due to similarity between class labels. For example, many instances of shirt-like items are misclassified as "Tröjor, kortärmat" (Shirts, short-sleeved), while many boot-like items are misclassified into the broader category "Boots & kängor" (Boots).

The base model for cat4 is heavily overfitted towards five different class labels, mainly due to the nature of the cat4 feature data distribution. This can be read from the confusion matrix in figure 4.5. The feature data mostly consists of different size labels, which are hard to extract from images.

4.2 No image models

From table 4.2 we can see that some feature mappings without image features are quite good. Reading from figure 4.18 we can see that the model is quite certain about some type-augmented cat1 labels, for example "Smycken & Ädelstenar" (Jewelry & Gems), "Kläder" (Clothes), "Accessoarer" (Accessories) and "Skor" (Shoes), while all other class labels are misclassified as "DVD & Videofilmer" (DVDs & Videos). This explains the high accuracy but low recall that can be read from the results table.

Interpreting the results in figure 4.19 for the type-augmented cat2 model, we can see a clear tendency towards a diagonal, which explains the relatively high accuracy, while a lot of items are misclassified as "Hushållsmaskiner" (Household appliances), which brings the recall down severely.

Reading the results for the cat1-augmented cat2 model from figure 4.20 suggests that some label classifications are learned, while the spread of misclassifications is quite broad, which explains the low precision and particularly low recall that can be read from the results table.

The type-augmented cat3 model performs quite well, as can be seen in the results table. Interpreting the results in figure 4.21, we can see a clear diagonal tendency where many labels are correctly classified. Some flat items are misclassified as "Fön, Lock & Plattänger" (Blow dryers, Curling irons & Straighteners), which explains why the recall is not on par with the accuracy and precision.

Interpreting the results for the cat2-augmented cat3 model from the results table and the confusion matrix in figure 4.22, we can see a tendency towards a diagonal, while some vertical misclassification columns are present. Since the distribution of cat2 is more general than the distribution of cat3, some classified labels gather towards a couple of labels. For example, upper body clothing articles are classified as "Klänningar" (Dresses), which is a subcategory of, for example, the cat2 label "Damkläder" (Women's clothing), while almost all types of shoes are classified as "Kilklackar" (Wedges), which is a subcategory of the cat2 label "Damskor" (Women's shoes).

The no image models for cat4 perform very badly, which can be read from the results table and the confusion matrices in figures 4.23 and 4.24. This is due to the nature of the cat4 data distribution, which mostly contains different sizes, and how it relates to the respective augmentation features, which are more descriptive of what type of item it is. The mapping between an item type and a size is hard to make sense of for a model purely based on text data.

In general, a lot of misclassifications in the no image models are caused by augmenting features not being present in the 1-of-K input representation. For example, the vertical column of misclassifications as "Fön, Lock & Plattänger" (Blow dryers, Curling irons & Straighteners) in figure 4.21 can be explained by this. Looking at the distribution of types of items labeled with the cat3 feature "Dukar & Tabletter" (Tablecloths & Place mats) in the cat3 test set in table A.6, and the distribution of types of items labeled with the cat3 feature "Gardiner & Kappor" (Curtains & Valances) in the cat3 test set in table A.7, we can see that none of the types are available in the type feature space shown in table A.1, since none of those type labels sum above 1000 in the full dataset. This essentially translates to input vectors with only zeros for these items, which makes the model unable to learn these labels and guess wildly. The difference in generality between the features in the models also matters.

Name        Train loss   Val loss   Epochs   Accuracy   Precision   Recall   F-score
cat1-type      2.4369      2.1216     115     0.4080      0.8670     0.3508    0.4978
cat2-type      2.4334      2.3015      91     0.3688      0.8303     0.2440    0.3755
cat2-cat1      1.5870      1.6832     104     0.3343      0.5579     0.0548    0.0991
cat3-type      1.4820      1.2456      45     0.6534      0.8591     0.5596    0.6769
cat3-cat2      4.0028      4.2099       1     0.1627      0.2401     0.1581    0.1906
cat4-type      2.1053      2.3346      94     0.1746      0.7160     0.0330    0.0625
cat4-cat3      3.8618      4.0075       1     0.1847      0.2021     0.1574    0.1770

Table 4.2: Result table of the models without incorporated image features.

4.3 1-of-K input augmentation models

As can be seen in table 4.3, having information about the type heavily influences the classification accuracy and performance of the models for features cat1 and cat2. Comparing with the base model results in table 4.1 and the no image model results in table 4.2, we can see that the image features and the type augmentation feature synergize to boost the performance of the model beyond the performance of either of the single feature models. The respective confusion matrices in figures 4.6 and 4.8 also suggest this synergy.

The result for the cat1-augmented model for cat2 is interesting in that it performs better than either of the single feature models, even though the cat1 feature is more general than the cat2 feature. This suggests that the image features from the convolutional layers are clustered within the broader cat1 feature space, which synergizes into a well-performing model.

The type-augmented model for cat3 performs very similarly to the no image model with the same augmentation. This suggests that the model more or less ignores the image features and maps the type feature input to a cat3 output, except for the cases where an item has a zero-vector augmentation input due to its type not being present in the 1-of-K distribution of type available in table A.1; in those cases the image features are considered instead. This explains why there is no clear misclassification column in the confusion matrix for the type-augmented image model for cat3 in figure 4.10, while there is one in the confusion matrix for the type-augmented no image model for cat3 in figure 4.21.

Name        Train loss   Val loss   Epochs   Accuracy   Precision   Recall   F-score
cat1-type      0.8494      1.1952      3      0.6801      0.7639     0.6174    0.6825
cat2-type      0.9742      1.3488      3      0.5868      0.7439     0.4842    0.5858
cat2-cat1      0.8357      1.3251      4      0.6173      0.7503     0.5296    0.6204
cat3-type      0.7361      1.1028      4      0.6585      0.7300     0.5823    0.6474
cat3-cat2     12.9863     13.5488     14      0.1595      0.1961     0.0462    0.0747
cat4-type     14.4087     14.4009      8      0.1074      0.1121     0.0253    0.0413
cat4-cat3     14.2921     14.3751     10      0.1062      0.1139     0.0222    0.0372

Table 4.3: Result table of the 1-of-K input augmented models.

From the result table and the confusion matrix in figure 4.9 it can be read that the cat2-augmented model for cat3 shows no improvement to speak of, which suggests that no extra learning capability has come from the extra augmentation feature.

Continuing with the result table and the confusion matrices in figures 4.12 and 4.11, we can see that no particular learning has been achieved for any of the augmented cat4 image models.

4.4 Multiple 1-of-K augmentation models

Comparing table 4.4 for the multiple 1-of-K input augmentation models to table 4.3 for the single 1-of-K input augmentation models, we can see that no significant improvement in performance has been achieved by using multiple augmentation features.

Comparing the confusion matrices for the multiple 1-of-K augmentation models with the respective confusion matrices for the single 1-of-K augmentation models also suggests no particular performance improvement. This suggests that the strongest feature and image synergy is sustained and learned during the learning process, while the weakest augmentation feature is ignored.


Name             Train loss   Val loss   Epochs   Accuracy   Precision   Recall   F-score
cat2-type_cat1      1.1273     1.3253       2      0.5996      0.7675     0.4868    0.5950
cat3-type_cat2      0.7455     1.0809       4      0.6635      0.7341     0.5879    0.6525
cat3-cat1_cat2     11.4602    12.2538      35      0.2405      0.2921     0.1001    0.1491
cat4-type_cat3     14.3473    14.4721       2      0.1038      0.1136     0.0168    0.0292
cat4-cat2_cat3     14.2954    15.2071       1      0.0592      0.1193     0.0159    0.0281

Table 4.4: Result table of the multiple 1-of-K input augmented models.

Figure 4.1: Confusion matrix for the type base model.

Figure 4.2: Confusion matrix for the cat1 base model.

Figure 4.3: Confusion matrix for the cat2 base model.

Figure 4.4: Confusion matrix for the cat3 base model.

Figure 4.5: Confusion matrix for the cat4 base model.

Figure 4.6: Confusion matrix for the cat1 1-of-K model augmented with the type feature.

Figure 4.7: Confusion matrix for the cat2 1-of-K model augmented with the cat1 feature.

Figure 4.8: Confusion matrix for the cat2 1-of-K model augmented with the type feature.

Figure 4.9: Confusion matrix for the cat3 1-of-K model augmented with the cat2 feature.

Figure 4.10: Confusion matrix for the cat3 1-of-K model augmented with the type feature.

Figure 4.11: Confusion matrix for the cat4 1-of-K model augmented with the cat3 feature.

Figure 4.12: Confusion matrix for the cat4 1-of-K model augmented with the type feature.

Figure 4.13: Confusion matrix for the cat2 multiple 1-of-K model augmented with type and cat1 features.

Figure 4.14: Confusion matrix for the cat3 multiple 1-of-K model augmented with cat1 and cat2 features.

Figure 4.15: Confusion matrix for the cat3 multiple 1-of-K model augmented with type and cat2 features.

Figure 4.16: Confusion matrix for the cat4 multiple 1-of-K model augmented with cat2 and cat3 features.

Figure 4.17: Confusion matrix for the cat4 multiple 1-of-K model augmented with type and cat3 features.

Figure 4.18: Confusion matrix for the cat1 no image model augmented with type.

Figure 4.19: Confusion matrix for the cat2 no image model augmented with type.

Figure 4.20: Confusion matrix for the cat2 no image model augmented with cat1.

Figure 4.21: Confusion matrix for the cat3 no image model augmented with type.

Figure 4.22: Confusion matrix for the cat3 no image model augmented with cat2.

Figure 4.23: Confusion matrix for the cat4 no image model augmented with type.

Figure 4.24: Confusion matrix for the cat4 no image model augmented with cat3.


Chapter 5 Discussion and Conclusions

This chapter aims to discuss and draw conclusions from the results in order to answer the stated research question. Suggestions for improved methods and models are also presented as interesting areas of future work.

5.1 Discussion

5.1.1 Dependence on feature data distribution similarity and generality

The results suggest that domain similarity between the dependent features has a major impact on how much of a performance boost, if any, is achieved when augmenting the models. Another thing to consider is the generality of the respective feature data distributions.

The feature data distribution for type is quite narrow, which often helps the models. For cat1 and cat2, augmenting with type synergizes with the image features and yields better performance than the base models and no image models. The synergy effect emerges as the type feature augmentation narrows down similar image features to different class labels.

A more interesting result is the cat1-augmented image model for cat2, which receives a major performance increase due to the augmentation. The feature data distribution for cat1 is more general than that of cat2, which suggests that augmenting with cat1 effectively clusters the image features within the cat1 domain, which in turn yields good performance.


Since the type and cat3 domains are quite similar (see tables A.1 and A.4), the no image model for cat3 with 1-of-K type input has relatively good performance. The model more or less learns a mapping function from the type feature to the cat3 feature. Adding image features yields no particular performance increase, as the type input has enough information for the model to do the mapping.

The cat2-augmented model for cat3 suffers from too general augmentation features, which prevents the model from making sense of the augmentation features in relation to the image features and thereby prevents any model performance increase.

Because the cat4 domain mainly consists of different size class labels, the base model performance is very bad: size features are hard to extract from an image, since the models lack the concept of size. Sizing depends on factors such as object distance from the camera, region of fabrication and similar, which are not encoded within the image. Therefore, no augmented model for the cat4 feature yields any particular or significant performance increase, and no cat4 model manages to perform any real learning.

The multiple 1-of-K augmentation models received no significant performance increase, which suggests that the most prevalent augmentation feature is used in any possible synergy effect, while the extra augmentation feature is ignored during the learning process.

5.1.2 Method criticism

Due to time constraints on this project, there was no time to train the image feature extractor in the models from scratch. Instead, transfer learning was used with pre-trained weights and models to extract the image features. This heavily influenced the base model performance. For example, a model trained and tuned to this particular dataset should be able to distinguish quite well between different types and categories of boots, clothing and other items. Due to the nature of the pre-trained VGG16 network and weights based on the quite dissimilar ImageNet dataset, there were limitations to the base models and their performance. Better base model performance would decrease the gap between the augmentation models and the base models and would thereby affect the results.


5.2 Conclusion

The results indicate that information about one feature may influence the performance of a model classifying another feature, if incorporated into the model. The effect depends heavily on the domain similarity of the features, which determines how much sense the model can make of the extra information and whether model performance increases or not. The generality of the augmentation features also matters, particularly how similar in generality the classification domain and the augmentation domain are. It is therefore important how the augmentation features are selected, as well as how they relate to the classification domain. However, choosing reasonable augmentation features might cause synergy effects which in turn influence model performance for the better.

5.3 Future work

An interesting approach would be to train convolutional networks from scratch and use outputs from different convolutional networks (abstract image features) as concatenated image features passed on to fully-connected layers. Another approach would be to base the weights on VGG16 and ImageNet and then fine-tune them towards the specific dataset. Different class label domains would yield different image features, and making use of the different image features could improve upon classification in the one-domain-classification setting. Training deep networks from scratch was out of scope in this thesis, mainly due to the massive time consumption required to train networks with vast amounts of data.


Bibliography

[1] Forest Agostinelli et al. "Learning Activation Functions to Improve Deep Neural Networks". In: CoRR abs/1412.6830 (2014). URL: http://arxiv.org/abs/1412.6830.

[2] Marcin Andrychowicz et al. "Learning to learn by gradient descent by gradient descent". In: CoRR abs/1606.04474 (2016). URL: http://arxiv.org/abs/1606.04474.

[3] Baidu. Paddle Paddle book. [Online; accessed 8-Aug-2017]. URL: http://book.paddlepaddle.org/03.image_classification/.

[4] Christopher M. Bishop. Pattern Recognition and Machine Learning (Information Science and Statistics). Secaucus, NJ, USA: Springer-Verlag New York, Inc., 2006. ISBN: 0387310738.

[5] François Chollet et al. Keras. https://github.com/fchollet/keras. 2015.

[6] John Duchi, Elad Hazan, and Yoram Singer. "Adaptive Subgradient Methods for Online Learning and Stochastic Optimization". In: J. Mach. Learn. Res. 12 (July 2011), pp. 2121–2159. ISSN: 1532-4435. URL: http://dl.acm.org/citation.cfm?id=1953048.2021068.

[7] Ian Goodfellow, Yoshua Bengio, and Aaron Courville. Deep Learning. http://www.deeplearningbook.org. MIT Press, 2016.

[8] Simon Haykin. Neural Networks: A Comprehensive Foundation. 2nd. Upper Saddle River, NJ, USA: Prentice Hall PTR, 1998. ISBN: 0132733501.

[9] Kaiming He et al. "Delving Deep into Rectifiers: Surpassing Human-Level Performance on ImageNet Classification". In: CoRR abs/1502.01852 (2015). URL: http://arxiv.org/abs/1502.01852.

[10] D. Hubel and T. Wiesel. "Receptive fields, binocular interaction, and functional architecture in the cat's visual cortex". In: Journal of Physiology 160 (1962), pp. 106–154.

[11] David H. Hubel and Torsten N. Wiesel. "Receptive Fields and Functional Architecture of Monkey Striate Cortex". In: Journal of Physiology (London) 195 (1968), pp. 215–243.

[12] David H. Hubel and Torsten N. Wiesel. "Receptive Fields of Single Neurons in the Cat's Striate Cortex". In: Journal of Physiology 148 (1959), pp. 574–591.

[13] Sergey Ioffe and Christian Szegedy. "Batch Normalization: Accelerating Deep Network Training by Reducing Internal Covariate Shift". In: CoRR abs/1502.03167 (2015). URL: http://arxiv.org/abs/1502.03167.

[14] Diederik P. Kingma and Jimmy Ba. "Adam: A Method for Stochastic Optimization". In: CoRR abs/1412.6980 (2014). URL: http://arxiv.org/abs/1412.6980.

[15] Alex Krizhevsky, Ilya Sutskever, and Geoffrey E. Hinton. "ImageNet Classification with Deep Convolutional Neural Networks". In: Advances in Neural Information Processing Systems 25. Ed. by F. Pereira et al. Curran Associates, Inc., 2012, pp. 1097–1105. URL: http://papers.nips.cc/paper/4824-imagenet-classification-with-deep-convolutional-neural-networks.pdf.

[16] Chen-Yu Lee, Patrick W. Gallagher, and Zhuowen Tu. "Generalizing Pooling Functions in Convolutional Neural Networks: Mixed, Gated, and Tree". In: Proceedings of the 19th International Conference on Artificial Intelligence and Statistics. Ed. by Arthur Gretton and Christian C. Robert. Vol. 51. Proceedings of Machine Learning Research. Cadiz, Spain: PMLR, Sept. 2016, pp. 464–472. URL: http://proceedings.mlr.press/v51/lee16a.html.

[17] Fei-Fei Li, Andrej Karpathy, and Justin Johnson. "CS231n: Convolutional Neural Networks for Visual Recognition 2016". In: (2016). URL: http://cs231n.stanford.edu/.

[18] Joseph J. Lim, Ruslan Salakhutdinov, and Antonio Torralba. "Transfer Learning by Borrowing Examples for Multiclass Object Detection". In: NIPS. Ed. by John Shawe-Taylor et al. 2011, pp. 118–126. URL: http://dblp.uni-trier.de/db/conf/nips/nips2011.html#LimST11.

[19] Martín Abadi et al. TensorFlow: Large-Scale Machine Learning on Heterogeneous Systems. Software available from tensorflow.org. 2015. URL: http://tensorflow.org/.

[20] Kevin P. Murphy. Machine Learning: A Probabilistic Perspective. The MIT Press, 2012. ISBN: 0262018020, 9780262018029.

[21] Michael A. Nielsen. Neural Networks and Deep Learning. Determination Press, 2015.

[22] Joseph Redmon and Ali Farhadi. "YOLO9000: Better, Faster, Stronger". In: CoRR abs/1612.08242 (2016). URL: http://arxiv.org/abs/1612.08242.

[23] Joseph Redmon et al. "You Only Look Once: Unified, Real-Time Object Detection". In: CoRR abs/1506.02640 (2015). URL: http://arxiv.org/abs/1506.02640.

[24] K. Simonyan and A. Zisserman. "Very Deep Convolutional Networks for Large-Scale Image Recognition". In: CoRR abs/1409.1556 (2014).

[25] Nitish Srivastava et al. "Dropout: A Simple Way to Prevent Neural Networks from Overfitting". In: J. Mach. Learn. Res. 15.1 (Jan. 2014), pp. 1929–1958. ISSN: 1532-4435. URL: http://dl.acm.org/citation.cfm?id=2627435.2670313.

[26] Christian Szegedy et al. "Going Deeper with Convolutions". In: CoRR abs/1409.4842 (2014). URL: http://arxiv.org/abs/1409.4842.

[27] T. Tieleman and G. Hinton. Lecture 6.5—RmsProp: Divide the gradient by a running average of its recent magnitude. COURSERA: Neural Networks for Machine Learning. 2012.

[28] Eric W. Weisstein. Convolution. From MathWorld—A Wolfram Web Resource. Last visited on 14/6/2017. URL: http://mathworld.wolfram.com/Convolution.html.

[29] Jason Yosinski et al. "How Transferable Are Features in Deep Neural Networks?" In: Proceedings of the 27th International Conference on Neural Information Processing Systems. NIPS'14. Montreal, Canada: MIT Press, 2014, pp. 3320–3328. URL: http://dl.acm.org/citation.cfm?id=2969033.2969197.


Appendix A

Dataset description

Table A.1: Class labels of the type feature with more than 1000 entries

Name                      Amount in full dataset
Armband                   1852
Axelremsväska             3582
BH                        1513
Ballerinaskor             3932
Blus                      25180
Body                      1379
Boots                     4973
Byxdress                  2124
Byxor                     25724
Chinos                    1440
Dunjacka                  1430
Finskor                   2188
Fleecetröja               1162
Halsband                  2457
Halsduk                   3239
Handväska                 6373
Hoodie                    1381
Huvtröja                  3635
Högklackade Skor          3339
Jacka                     36061
Jeans                     31579
Jeansjacka                1633
Kappa                     11402
Kavaj                     13254
Keps                      1224
Kilklackar                1762
Kjol                      16892
Klackskor                 7977
Klänning                  91237
Kofta                     31828
Kostym                    3372
Kostymbyxor               2815
Kängor                    4701
Leggings                  1095
Linne                     8433
Loafers                   1320
Långklänning              2710
Mjukisbyxor               1491
Mössa                     2783
Overall                   1589
Parfym                    1485
Pikétröja                 6948
Plånbok                   1378
Polotröja                 2102
Poncho                    1525
Pumps                     1345
Rock                      1161
Ryggsäck                  1769
Sandaler                  3724
Sandaletter               1030
Shorts                    6405
Sjal                      3085
Skinnjacka                3423
Skjorta                   34118
Skor                      7263
Skärp                     2773
Slips                     1192
Sneakers                  10623
Solglasögon               2179
Stickad Tröja             1744
Stövlar                   5851
Stövletter                3401
T-shirt                   14551
Tjocktröja                1138
Topp                      20683
Träningsbyxor             2174
Träningslinne             1565
Träningsshorts            1030
Träningsskor              2560
Träningstights            2532
Träningströja             2961
Tröja                     56355
Tunika                    4791
Täckbyxor                 1830
Vinterjacka               1302
Väska                     4835
Väst                      4043


Table A.2: Class labels of the cat1 feature with more than 1000 entries

Name                      Amount in full dataset
Accessoarer               41723
Antikt & Design           7155
Barnkläder & Barnskor     53745
Bygg & Verktyg            1150
Böcker & Tidningar        2711
DVD & Videofilmer         1201
Datorer & Tillbehör       2310
Foto, Kameror & Optik     1245
Hem & Hushåll             22535
Hemelektronik             2738
Hobby                     2408
Klockor                   2080
Kläder                    495875
Leksaker & Barnartiklar   6486
Mobiltelefoni & Tele      1716
Samlarsaker               1486
Skor                      66815
Skönhetsvård              6622
Smycken & Ädelstenar      7418
Sport & Fritid            11116
TV-spel & Datorspel       1381
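A hypothetical sketch of how tables like these can be produced, assuming the dataset is loaded as a pandas DataFrame with columns named type and cat1 (names taken from the table captions):

# Count entries per class label and keep the labels above 1000 entries.
import pandas as pd

def labels_above(df, column, threshold=1000):
    counts = df[column].value_counts()
    return counts[counts > threshold].sort_index()

# e.g. labels_above(dataset, "type") and labels_above(dataset, "cat1")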
