Product Matching Using Image Similarity


1 June 2020


Melker Forssell Gustav Janér

Department of Information Technology


Product Matching Using Image Similarity

Melker Forssell, Gustav Janér

PriceRunner is an online shopping comparison company. To maintain up-to-date prices, PriceRunner has to process large amounts of data every day.

The processing of the data includes matching unknown products, referred to as offers, to known products. Offer data includes information about the product such as: title, description, price and often one image of the product. PriceRunner has previously implemented a textual-based machine learning (ML) model, but is also looking for new approaches to complement the current product matching system. The objective of this master’s thesis is to investigate the potential of using an image-based ML model for product matching. Our method uses a similarity learning approach where the network learns to recognise the similarity between images. To achieve this, a siamese neural network was trained with the triplet loss function. The network is trained to map similar images closer together and dissimilar images further apart in a vector space.

This approach is often used for face recognition, where there is a large number of classes, a limited number of images per class, and new classes are frequently added. This is also the case for the image data used in this thesis project. A general model was trained on images from the Clothing and Accessories hierarchy, one of the 16 top-level hierarchies at PriceRunner, consisting of 17 product categories.

The results varied between product categories. Some categories proved to be less suitable for image-based classification, while others excelled. The model handles new classes relatively well without any retraining, or with only brief retraining. It was concluded that there is potential in using images to complement the current product matching system at PriceRunner.

Printed by: Reprocentralen ITC. ISSN: 1401-5749, UPTEC IT 20016. Examiner: Lars-Åke Norden. Subject reviewer: Petter Ranefall. Supervisor: Carl Svärd.


PriceRunner is a company that offers online shopping comparisons; its main business is price comparison. To keep prices up to date, PriceRunner needs to process large amounts of data every day. Part of the processing consists of matching unknown products, referred to as offers, to known products at PriceRunner. PriceRunner has previously implemented a text-based machine learning model but is continuously investigating new methods to improve the matching of products. The goal of this master’s thesis is to investigate how images can be used for product matching. Our method is to use similarity learning, where the network learns to recognise similarity between images. To achieve this, a siamese network was trained with the triplet loss function. The network is optimised to place similar images close to each other in a vector space, while dissimilar images end up further apart. This methodology is used in face recognition, where there is a large number of classes with few images and new classes are added continuously.

The same holds for the data in this thesis: there are many classes with few images at PriceRunner, and the products are constantly updated. A general model was trained on Clothing and Accessories, one of the 16 top-level hierarchies, consisting of 17 product categories. The results of the model varied between the different categories. Certain categories proved to be unsuitable for image-based matching, while other categories stood out with good results. The trained model handles new products relatively well, and with a brief retraining the results became even better. The conclusion was that there is potential in using images to complement the current product matching system at PriceRunner.


We would like to thank Docent Petter Ranefall at Uppsala University for his guidance during the thesis project. We would also like to express our most sincere gratitude to Carl Svärd, Morgan Elfvin and Marcus Janota, together with the rest of the matching team and everyone at PriceRunner.


Summary
Acknowledgements
1 Introduction
1.1 Problem Statement
1.2 Research Questions
1.3 Purpose
1.4 Scope
1.5 Outline
2 Background
2.1 Data
2.1.1 Product Listing
2.1.2 PriceRunner Images
2.1.3 Offer Images
2.2 Image Trends in Product Categories
3 Theory
3.1 Artificial Neural Networks
3.1.1 Activation Function
3.1.2 Softmax Function
3.2 Convolutional Neural Networks
3.3 Siamese Neural Networks
3.3.1 Image Embeddings
3.3.2 Triplet Loss
3.3.3 Triplet Mining
4 Related Work
4.1 FaceNet
4.2 ResNet
5 Methodology
5.1 Data
5.2 Software
5.3 Hardware
5.4 Model Architecture
5.5 Data Collection
5.6 Preprocessing
5.7 Batch Generator
5.8 Triplet Generator
5.8.1 Implementation
5.8.2 Optimisation
5.9 Classification
5.9.1 Efficiency
6 Results and Analysis
6.1 Embedding Space and Training Effects
6.2 Handling New Products
7 Discussion
7.1 Discussion of Methodology
7.1.1 Constraints
7.1.1.1 Hardware
7.1.1.2 Data
7.1.2 Model Architecture
7.1.3 Preprocessing
7.1.4 Batch Generator
7.1.4.1 Batch Generator With Many Categories
7.1.5 Triplet Generator
7.1.6 Classification
7.1.7 Augmentations
7.2 Discussion of Results
7.2.1 Image Trends in Product Categories
8 Conclusions and Future Work
8.1 Conclusions
8.2 Limitations
8.3 Future Work
References
A Triplet Generator

1 A sample product listing (blue and orange dotted rectangles added for reference purposes)
2 Examples of good images
3 Examples of difficult images
4 Simple artificial neural network
5 ReLU function
6 Softmax layer
7 Basic concept of SNNs
8 Triplet loss objective [5]
9 Triplet loss
10 Three types of triplets depending on where the negative is placed. a = Anchor, p = Positive.
11 Training error (left) and test error (right) on CIFAR-10 with 20-layer and 56-layer “plain” networks. The deeper network has higher training error, and thus test error [12]
12 Residual learning: a building block [12]
13 The 3 steps of preprocessing
14 How increasing categories affects accuracy
15 How the embedding space affects the precision and recall
16 Models trained on top-level hierarchy, mid-level hierarchy and solo categories
17 The top-level hierarchy Clothing and Accessories trained for 100, 150 and 200 epochs
18 Handling of new products
19 Augmentations explored by Geoffrey Hinton and his team


Embedding Space The vector space an image embedding is mapped to.

Image Embedding Feature vector representation of an image.

Merchant Company that PriceRunner includes in its price comparison.

Offer Image Image from a merchant.

PriceRunner Image Image that is stored by PriceRunner.

Product Equivalent to a class in the setting of this thesis. The terms product and class will be used interchangeably in this thesis.

Reference Image PriceRunner image that is used as a reference point for a class during classification.

Test/Validation Image Offer image that is predicted during classification and is previously unseen by the network.

Acronyms

API Application Programming Interface.

EC2 Elastic Compute Cloud.

GB Gigabyte.

GDDR6 Graphics Double Data Rate type 6 synchronous dynamic random-access memory.

GPU Graphics Processing Unit.

ML Machine Learning.

RAM Random Access Memory.

S3 Simple Storage Service.

TPU Tensor Processing Unit.


1 Introduction

There is a semantic gap between human vision and computer vision. Humans process visual information by directly extracting semantically meaningful high level features, while computers process visual information by extracting low level features from matrices of pixel values [1]. The challenge of computer vision is not only to enable computers to distinguish visual features from a byte array of a digital image, but it is also to enable computers to attain a higher-level understanding of the image content - to bridge the semantic gap [2].

Human visual object recognition is rapid and accurate for classifying previously seen objects, and is generally independent of orientation and viewpoint [3]. Until recently, computers had not been able to match human-level capabilities in visual object recognition. However, with the rise of deep learning and recent advances in bio-inspired deep convolutional neural networks, computer vision is now on par with human-level performance, and even surpasses it for certain visual object recognition tasks [3].

Computer vision combined with machine learning can be used to create complex models for image analysis, where deep learning has proven to be especially successful [4]. One area of application for these models is the automation of image classification tasks. The authors of the FaceNet paper [5] successfully applied a deep metric learning approach that scales to large face recognition systems. Much of the theory and many of the methods in this thesis are inspired by the FaceNet paper.

PriceRunner is an online shopping comparison company that processes large amounts of data with different products every day. The processing of the data includes matching unknown products, referred to as offers, to known products. For this classification task, machine learning can be leveraged to enhance the automation of the product matching.

1.1 Problem Statement

The core business at PriceRunner is price comparison. To ensure valid price comparisons of all products, the prices from the merchants have to be up to date. To achieve this, PriceRunner fetches and receives large amounts of offer data from merchants every day. The offer data has to be processed and matched to specific products. Today, the task of matching an offer to a product is semi-automatic. Many offers contain a unique identifier, which allows for a direct match between an offer and a product. Offers that lack an identifier, or whose identifier is faulty, need to be matched manually. Manual matching of offers can be expensive and highly inefficient.

To reduce the need for manual labour during product matching, PriceRunner has developed an ML model that uses textual data for product matching. For offers without unique identifiers, this model is used to automatically predict and match offers to products based on textual data. PriceRunner is constantly working to improve the textual model and also to find new approaches to complement and enhance the current product matching system.

This master’s thesis proposes that the use of image data can be leveraged by an ML model to complement the current product matching system at PriceRunner.

1.2 Research Questions

• How can images be used in a supervised ML setting at PriceRunner?

• Can the matching problem at PriceRunner be posed as a metric learning problem?

• How should images be represented and what kind of model architecture should be used?

• Are certain product categories more suited for image-based classification than others? And if so, why?

1.3 Purpose

The purpose of this master’s thesis project is to investigate if and how image data can be leveraged to complement the current product matching system at PriceRunner.

1.4 Scope

This thesis project produces an image-based model to showcase how image data can be leveraged to complement PriceRunner’s current product matching system. The integration and deployment of the image-based model will not be carried out during the span of this thesis project. Therefore, certain aspects have not been optimized for a production setting.

1.5 Outline

Section 2 introduces the background to the problem by giving a brief introduction to PriceRunner and the data used in this thesis. Section 3 provides the reader with a theoretical background of the networks and algorithms used. Section 4 describes previous related work and research. Section 5 covers the methods used and the model implemented. Section 6 presents the results of the thesis. Sections 7 and 8 conclude the thesis by discussing the problem, methodology and results.

2 Background

The core business at PriceRunner is price comparison, i.e., given a product, find the store that sells that product at the lowest price. To handle price comparison, PriceRunner maintains a database with products and links to all available offers that are sold by PriceRunner’s affiliated stores, referred to as merchants.

Larger merchants regularly send feeds of new and updated offers directly to PriceRunner. For merchants that do not provide offer feeds themselves, PriceRunner uses web crawling to gather offer data. All offer feeds need to be ingested and processed by PriceRunner, so that at any given time the information about what the merchants are selling, and at what price, is up to date. One of the key activities that PriceRunner needs to perform during the processing of the offer feeds is to associate, or match, each offer to a product in PriceRunner’s database.

To maintain an up-to-date price comparison, the associations between offers and products are constantly updated. At any time, offers or products might be modified, added or removed. For example, there are new products launched every day that have to be created and added to PriceRunner’s database. Since the associations between offers and products are dynamic, product matching is a continuous process.

PriceRunner’s task of product matching is complex and computationally heavy. More than 100 million offers are processed every day from thousands of merchants across three countries. At the time of writing, PriceRunner has more than 2 million different products in its database, with new products being added every day. Yet, for many of the offers that are processed every day, there is no existing product in PriceRunner’s database. Therefore, the ideal sequence for the product matching system is the following for every offer:

1. Search for an existing product in PriceRunner’s database

2. - If the product exists: match it

   - Else if the product does not exist: create it

The available information for an offer typically includes name, description, price, image and in many cases a unique identifier, e.g. a Global Trade Item Number (GTIN) or an International Article Number (IAN), also known as European Article Number (EAN). For offers with a GTIN, IAN or EAN, the matching can simply be done by finding the corresponding identifier of a product in PriceRunner’s database. For offers without an identifier, or when the provided identifier is faulty, or when there is no matching product in PriceRunner’s database, PriceRunner has to consider additional information and other methods for product matching.
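As an illustration only, the identifier-based step can be sketched as a dictionary lookup. The function name `match_by_identifier`, the dictionary layout and the two outcome labels are hypothetical and are not taken from PriceRunner’s system; the sketch only shows that a direct match needs a usable identifier, and that everything else is handed over to other matching methods.

```python
from typing import Dict, Optional, Tuple


def match_by_identifier(
    products: Dict[str, str], identifier: Optional[str]
) -> Tuple[str, Optional[str]]:
    """Try a direct match of an offer against known products.

    `products` maps an identifier (e.g. a GTIN) to a product name.
    Returns ("matched", product) on a direct hit, otherwise
    ("fallback", None) so that text- or image-based matching can take over.
    """
    # An offer without an identifier can never be matched directly.
    if identifier is None:
        return ("fallback", None)
    product = products.get(identifier)
    if product is None:
        # Faulty identifier, or no such product in the database.
        return ("fallback", None)
    return ("matched", product)
```

An offer that ends up in the fallback branch is the kind of offer for which textual or image data becomes relevant.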

One approach is to use ML for product matching. PriceRunner has already successfully built and integrated a system that leverages ML. The first step of PriceRunner’s ML matching system is to execute a category mapping, which narrows down the list of candidate products an offer may be matched to. Based on the textual information of the products in the candidate list, the ML model makes a prediction. PriceRunner prioritises high precision for the product matching. This means that a prediction with low confidence will not be matched, while a prediction with high enough confidence will be accepted as a match. This leads to high precision at the cost of lower recall.

2.1 Data

PriceRunner has around 360 different product categories. This thesis project is based on the image data from 17 of those categories within the selected Clothing and Accessories top-level hierarchy. Clothing and Accessories includes the mid-level hierarchies: Accessories, Clothing and Shoes. For each of those mid-level hierarchies, there are product categories such as: Watches, Children’s Clothing and Children’s Shoes. And for every product category, there might also be several sub-categories.

These categories were selected in discussion with PriceRunner, as they are hard categories to match based on unique identifiers and textual data.

2.1.1 Product Listing

A product listing is a product that is listed on PriceRunner, see Figure 1. A product listing consists of product information, PriceRunner’s product images and matched offers. PriceRunner’s product images are the images that are referred to as PR images in this thesis. In the blue rectangle of Figure 1, there are PR images and information about the product. In the orange rectangle of Figure 1, there is a list of matched offers. Every matched offer in the list is a link to a merchant that is selling this product. For every offer, there is an image from the merchant associated with it; these images are referred to as offer images in this thesis.

Figure 1: A sample product listing (blue and orange dotted rectangles added for reference purposes)


2.1.2 PriceRunner Images

PR images are always manually verified, square, of good quality and have a white background. The PR images are focused on the product and do not contain noise. See examples of PR images in Figure 2.

PR images together with offer images constitute the training datasets. PR images are also used as the reference images for the models. Reference images are used as the reference points for a class during classification.

There are on average 1.9 PR images per product (class). This average is based on data from the top-level hierarchy Clothing and Accessories.

Figure 2: Examples of good images

2.1.3 Offer Images

For each product listing, there are matched offers. For each of those offers, there is at most one offer image from the merchant. It is sometimes the case that the offer image of a matched offer is either corrupt, the wrong image, or of exceptionally low resolution. Most of these cases were taken care of during either the data collection or the preprocessing. The offer images are not manually verified and have a large variety in quality, background colour, dimensions and noise. For instance, an image of a specific shoe product could be a distant image of a person wearing those shoes. See examples of difficult offer images in Figure 3.

Offer images together with PR images constitute the training datasets. However, offer images alone constitute the test and validation datasets. Offer images that are only used for testing or validation purposes are referred to as test/validation images. Test/validation images have not been used for training and are unseen by the network.

Currently, PriceRunner has an estimated accuracy of 97% on the matched offers, which means that about 3% of the matched offers may be wrong. This results in some mislabelled offer images being used for training, testing and validation.

There are on average 6.2 offer images per product (class) after preprocessing. This average is based on data from the top-level hierarchy Clothing and Accessories.


Figure 3: Examples of difficult images

2.2 Image Trends in Product Categories

The images of the same product category often follow common trends. For a product category, such trends can for example be that the images of the products usually include similar angles and exhibit similar noise.

Single-angle images: for some product categories, a majority of the product images only come in one angle. For example, the mid-level hierarchy Personal Care includes categories such as Skincare and Hair Products. A majority of the images in those categories show only a single frontal angle of the product. A difficult case for these kinds of images is products from the same brand that look very similar and differ only in the small text on the label.

Multi-angle images: many product categories, such as Shoes, have multiple images with different angles for each product. For some categories, PriceRunner often includes some image angles that are extremely uncommon for offer images. An example of this is the Shoes category, where PriceRunner sometimes includes an image of a shoe sole among the PR images of a product listing, while offer images of shoe soles from merchants are non-existent or at least very uncommon.

Clean images: the images of categories such as Children’s Clothing and Children’s Shoes are mainly focused solely on the product and do not contain noise. See Figure 2 for examples of clean images.

Noisy images: the images of some product categories often come with noise. For example, images in the categories Shoes and Clothes for adults often show different people wearing the product. See Figure 3 for examples of images with miscellaneous noise.

3 Theory

3.1 Artificial Neural Networks

An artificial neural network (ANN) comprises a set of neurons, or nodes, that are connected together by edges [6], as illustrated in Figure 4. Parallel subsets of nodes in the network form layers. The basic structure of an ANN has three layers: one input layer, one hidden layer and one output layer. The input layer receives the data that the network should process. The original input data is then forwarded through the network and transformed, where each layer’s output acts as the input to the next layer. The output of the final output layer, also referred to as the prediction layer, is the actual prediction of the network for the given input data. If the network has multiple hidden layers, it is a deep neural network [6].


Figure 4: Simple artificial neural network

3.1.1 Activation Function

A node receives a set of numeric input values and maps those to a single output value. Fundamentally, a node operates as a linear-regression function. However, the final output of a node is also passed through an activation function. The activation function applies a nonlinear transformation to a node’s output value. Three activation functions are commonly used in ANNs [6]:

1. Logistic / Sigmoid function

2. Hyperbolic Tangent / TANH function

3. Rectified Linear Unit / ReLU function

For this thesis, the ReLU function was chosen as the activation function for the network. The ReLU function is defined in Equation 1. The function returns zero for any negative input x, and returns x for any nonnegative input [6], as seen in Figure 5.

f(x) = max(0, x)    (1)
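A minimal sketch of Equation 1, together with a single node that applies ReLU to a weighted sum of its inputs. The function `node_output` and its parameters (`weights`, `bias`) are illustrative names chosen here, not part of the thesis implementation.

```python
from typing import Sequence


def relu(x: float) -> float:
    """Equation 1: zero for negative input, the input itself otherwise."""
    return max(0.0, x)


def node_output(
    inputs: Sequence[float], weights: Sequence[float], bias: float
) -> float:
    """Linear combination of the inputs plus a bias, passed through ReLU."""
    z = sum(w * x for w, x in zip(weights, inputs)) + bias
    return relu(z)
```

With inputs (1, 2), weights (0.5, -1) and zero bias, the weighted sum is negative, so the node outputs zero; this clipping is the nonlinearity that the activation function adds to the otherwise linear node.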


Figure 5: ReLU function

3.1.2 Softmax Function

Image classification tasks have conventionally been solved by training a network to learn discriminative features of the specific classes trained on [7]. For a multi-class classification problem, the final output layer is conventionally normalized by a softmax function, as seen in Figure 6. Given an image as input, a model with a softmax prediction layer returns a probability distribution over the classes it was trained on [8].


Figure 6: Softmax layer

Even though the softmax function can be used for multi-class image classification, the approach has limited scaling potential for certain data settings. When the number of classes increases and training data per class becomes scarce, and especially when new classes are added frequently, the conventional approach of a softmax prediction layer becomes impractical [7]. This is exemplified in Section 4.1, which describes the challenges of face recognition tasks.
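The softmax normalisation discussed in this section can be sketched as follows. This is a generic, numerically stabilised formulation (subtracting the maximum before exponentiating), not code from the thesis. Note that the output has exactly one entry per class, which is why adding a class changes the shape of the prediction layer.

```python
import math
from typing import List, Sequence


def softmax(logits: Sequence[float]) -> List[float]:
    """Turn raw output-layer scores into a probability distribution."""
    # Subtracting the maximum does not change the result but avoids overflow.
    shift = max(logits)
    exps = [math.exp(v - shift) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]
```

For three classes the function returns three probabilities that sum to one; a fourth class would require a fourth output node and, conventionally, retraining of the network.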

3.2 Convolutional Neural Networks

The convolutional neural network (CNN) architecture was originally designed for ML on image data [6]. Image data is spatially connected, and a network used for image recognition needs to be able to learn visual features and then recognize a learned visual feature independent of where in the image it occurs. CNNs accomplish this through groups of nodes that share weights, referred to as filters, or kernels. A filter is a matrix of shared weights that is convolved over input images. Each filter learns to recognize a distinct visual feature. A CNN normally comprises several convolutional and subsampling layers, followed by fully connected layers.
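To illustrate weight sharing, the sketch below slides one small filter over a greyscale image stored as nested lists. It assumes no padding and a stride of 1, and it computes a cross-correlation (the filter is not flipped), which is how convolutional layers are commonly implemented in ML libraries. The function name `convolve2d` is chosen for this example only.

```python
from typing import List, Sequence


def convolve2d(
    image: Sequence[Sequence[float]], kernel: Sequence[Sequence[float]]
) -> List[List[float]]:
    """Apply the same kernel weights at every valid position of the image."""
    kh, kw = len(kernel), len(kernel[0])
    out_h = len(image) - kh + 1
    out_w = len(image[0]) - kw + 1
    feature_map = []
    for i in range(out_h):
        row = []
        for j in range(out_w):
            acc = 0.0
            # Element-wise product of the kernel and the image patch at (i, j).
            for di in range(kh):
                for dj in range(kw):
                    acc += image[i + di][j + dj] * kernel[di][dj]
            row.append(acc)
        feature_map.append(row)
    return feature_map
```

A 1×2 kernel `[[-1, 1]]` responds with 1 wherever a dark pixel is followed by a bright one, regardless of the column in which that edge occurs, which is the position independence described above.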


3.3 Siamese Neural Networks

Twin networks, or Siamese neural networks (SNNs), were introduced in the 1990s by LeCun and Bromley [9]. SNNs implement a metric learning approach, where the model learns a general concept of a similarity metric of the input data, as opposed to learning class-specific concepts as in conventional classification approaches [7]. Through the metric learning approach, SNNs mitigate the issues of networks using the conventional softmax approach described in Section 3.1.2.

SNNs can be used for different types of data. In this thesis, however, SNNs are mainly explained from the perspective of image data. When dealing with image data, SNNs are composed of convolutional layers as described in Section 3.2, and can then be referred to as siamese convolutional neural networks.

SNNs output d-dimensional vector representations, referred to as embeddings, of input images. An image embedding is a mapping from an image to a d-dimensional Euclidean vector space. Distances between embeddings in the learned metric vector space directly correspond to a measure of similarity. Through this similarity metric, SNNs can be used for classification tasks [5].

A network is described as siamese because of the identical subnetworks of the model that share weights. A basic use case that exemplifies the concept of siamese networks is illustrated in Figure 7: input two images, encode their embeddings and then calculate the distance between the two embeddings in the vector space. A shorter distance between the embeddings indicates that the images are similar, while a longer distance indicates that the images are dissimilar. If the distance between the embeddings of a test/validation image and a reference image is within a specified threshold, the SNN will classify the test/validation image as being of the same class as the reference image [9].


Figure 7: Basic concept of SNNs

There is a trade-off between precision and recall that can be tuned by selecting different values for the threshold. The choice of threshold depends on what is optimised for: a low threshold yields higher precision but lower recall, and vice versa for a high threshold.
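The trade-off can be made concrete with a small sketch that counts true positives, false positives and false negatives for a list of image pairs at a given threshold. The distances and labels in the usage note are made-up illustration values, and returning a precision of 1.0 when nothing is predicted as a match is a convention chosen for this example.

```python
from typing import Sequence, Tuple


def precision_recall(
    distances: Sequence[float], same_class: Sequence[bool], threshold: float
) -> Tuple[float, float]:
    """Pairs with distance <= threshold are predicted to be the same class."""
    tp = fp = fn = 0
    for distance, is_same in zip(distances, same_class):
        predicted_match = distance <= threshold
        if predicted_match and is_same:
            tp += 1
        elif predicted_match and not is_same:
            fp += 1
        elif is_same:
            fn += 1
    # Convention (assumption): no predicted matches counts as perfect precision.
    precision = tp / (tp + fp) if (tp + fp) else 1.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    return precision, recall
```

For distances `[0.2, 0.4, 0.6, 0.9, 1.1]` with labels `[True, True, False, True, False]`, a threshold of 0.5 accepts only the two closest pairs (precision 1.0, recall 2/3), while a threshold of 1.0 also accepts one wrong pair (precision 0.75, recall 1.0).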

During the training phase, SNNs learn to produce embeddings that map similar images closer together and dissimilar images further apart in a vector space. The training process of SNNs differs depending on which loss function is used. Two commonly used loss functions for image similarity learning with SNNs are contrastive loss [10] and triplet loss. The triplet loss function was used for this thesis; it is described in more detail in Section 3.3.2.

After the training phase, it is possible to precompute embeddings of reference images. Reference images are passed as input to the SNN and the resulting reference embeddings are stored to be used later during the classification phase. The step of precomputing and storing reference embeddings is done to decrease the amount of computation required during the actual classification phase.
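A rough sketch of this idea, under several assumptions: `toy_embed` is a hypothetical stand-in for the trained network (here it simply returns its input as the embedding), the squared L2 distance is used as the distance metric, the list of references is non-empty, and a query is assigned to its single nearest reference if that reference lies within the threshold. The classification procedure actually used in the thesis may differ.

```python
from typing import Callable, List, Optional, Sequence, Tuple

Embedding = List[float]


def toy_embed(image: Sequence[float]) -> Embedding:
    """Hypothetical placeholder for the trained SNN: identity mapping."""
    return [float(v) for v in image]


def squared_l2(u: Sequence[float], v: Sequence[float]) -> float:
    return sum((a - b) ** 2 for a, b in zip(u, v))


def precompute_references(
    embed: Callable[[Sequence[float]], Embedding],
    reference_images: Sequence[Tuple[str, Sequence[float]]],
) -> List[Tuple[str, Embedding]]:
    """Embed every reference image once, so it is not recomputed per query."""
    return [(label, embed(image)) for label, image in reference_images]


def classify(
    embed: Callable[[Sequence[float]], Embedding],
    references: Sequence[Tuple[str, Embedding]],
    query_image: Sequence[float],
    threshold: float,
) -> Optional[str]:
    """Return the label of the nearest reference, or None if it is too far away."""
    query = embed(query_image)
    label, distance = min(
        ((lbl, squared_l2(query, ref)) for lbl, ref in references),
        key=lambda pair: pair[1],
    )
    return label if distance <= threshold else None
```

Only the query image is passed through the network at classification time; the stored reference embeddings are reused for every query.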

3.3.1 Image Embeddings

An image embedding is a feature vector representation of an image. Embeddings function as mappings from input images to a d-dimensional Euclidean vector space. The distances between embeddings in the vector space can be used to compare similarity between images [5].

The embedding dimensionality is a hyperparameter that can be used to optimize a trade-off between accuracy and computation time. Increasing the embedding dimensionality can result in the model being able to learn a more accurate embedding mapping to the larger vector space. However, increasing the embedding dimensionality also increases the computation time [5]. An embedding dimensionality of 128 was used for the models in this thesis.

3.3.2 Triplet Loss

A triplet is a set of an anchor image, a distinct positive image of the same class as the anchor, and a negative image of a different class [5]. The objective of the triplet loss function is to minimise the distance between the anchor and the positive, while maximising the distance between the anchor and the negative by a specified distance margin parameter, as seen in Figure 8.

Figure 8: Triplet loss objective [5]

During training, each image of every triplet is fed through the network to generate its embedding, see Figure 9. The distances between the triplet embeddings are then calculated. The distance metric used for the triplet loss in this project is the squared L2 (Euclidean) distance, see Equations 2 and 3. Stochastic gradient descent is then used to minimise the triplet loss function [5].


Figure 9: Triplet loss

Notation for the triplet loss function:

For any image x, the embedding representation of x is f(x) = X, where the function f is the mapping from an image to its embedding representation.

Anchor image a, anchor embedding: f(a) = A

Positive image p, positive embedding: f(p) = P

Negative image n, negative embedding: f(n) = N

distance = d

loss = l

The squared L2 distances between the embeddings of the anchor and the positive, and between the anchor and the negative, are calculated by Equations 2 and 3.

d(A, P) = ||A − P||²    (2)

d(A, N) = ||A − N||²    (3)

For any triplet, we want to satisfy the inequality in Equation 4.

d(A, P) < d(A, N)    (4)

The margin introduced in Equation 5 enforces a minimum difference between the two distances for the inequality to be satisfied.

d(A, P) + margin < d(A, N)    (5)

Rearranging Equation 5 gives Equation 6.

d(A, P) − d(A, N) + margin < 0    (6)

Taking the maximum of the left-hand side of Equation 6 and 0 results in the triplet loss function in Equation 7.

l(A, P, N) = max(d(A, P) − d(A, N) + margin, 0)    (7)
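Equations 2, 3 and 7 translate directly into a few lines of code. The sketch below operates on plain lists of floats rather than on tensors, and the margin and embedding values in the usage note are arbitrary illustration values.

```python
from typing import Sequence


def squared_l2(u: Sequence[float], v: Sequence[float]) -> float:
    """Equations 2 and 3: squared Euclidean distance between two embeddings."""
    return sum((a - b) ** 2 for a, b in zip(u, v))


def triplet_loss(
    anchor: Sequence[float],
    positive: Sequence[float],
    negative: Sequence[float],
    margin: float,
) -> float:
    """Equation 7: max(d(A, P) - d(A, N) + margin, 0)."""
    d_ap = squared_l2(anchor, positive)
    d_an = squared_l2(anchor, negative)
    return max(d_ap - d_an + margin, 0.0)
```

With A = (0, 0), P = (1, 0) and a margin of 0.5, a distant negative N = (3, 0) gives d(A, N) = 9 and a loss of 0, whereas a nearby negative N = (1, 0.5) gives d(A, N) = 1.25 and a loss of 0.25.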

3.3.3 Triplet Mining

Triplets that satisfy Equations 6 and 8, are easy triplets. Easy triplets do not contribute to the training process and lead to a slower convergence [5]. It is therefore important to select certain triplets that contribute to the training of the model to enable a faster convergence. The method of selecting triplets is referred to as triplet mining.

In the FaceNet [5] research paper, three types of triplets are defined: easy triplets, semi-hard triplets and hard triplets (in some literature, the triplet types are also referred to as easy negatives, semi-hard negatives and hard negatives). The type of a triplet depends on where the negative is placed, relative to the anchor and the positive, and on the margin value, as illustrated in Figure 10.

See Section 3.3.2 for definitions of anchor, positive, negative and margin.

l(A, P, N) = 0    (8)

d(A, P) < d(A, N) < d(A, P) + margin    (9)

d(A, N) < d(A, P)    (10)


Figure 10: Three types of triplets depending on where the negative is placed. a = Anchor, p = Positive.

• Easy triplets: triplets that satisfy Equation 8.

When the negative is outside the margin, represented by the green area in Figure 10.

• Semi-hard triplets: triplets that satisfy Equation9.

When the negative is within the margin, represented by the orange area in Figure 10.

• Hard triplets: triplets that satisfy Equation 10.

When the anchor is closer to the negative than the positive, represented by the red area in Figure 10.

Easy triplets do not contribute to the training since their loss is already 0. If only hard triplets are mined, training can converge prematurely to a poor local minimum. Therefore, semi-hard triplets were preferred during triplet mining in FaceNet [5].
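The three triplet types can be told apart directly from the two distances and the margin. The sketch below is one possible reading of Equations 8 to 10. The function name is hypothetical, and the handling of the tie d(A, N) = d(A, P), which neither Equation 9 nor Equation 10 covers, is an assumption.

```python
def triplet_type(d_ap, d_an, margin):
    """Classify a triplet from its anchor-positive and anchor-negative distances."""
    if d_an < d_ap:
        # Equation 10: the negative is closer to the anchor than the positive.
        return "hard"
    if d_an < d_ap + margin:
        # Equation 9: the negative is further away than the positive, but within
        # the margin. A tie (d_an == d_ap) is grouped here as an assumption.
        return "semi-hard"
    # Equation 8: the loss is 0 because the negative is outside the margin.
    return "easy"


print(triplet_type(0.1, 1.0, margin=0.2))
print(triplet_type(0.5, 0.6, margin=0.2))
print(triplet_type(0.5, 0.3, margin=0.2))
```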

There are two different approaches for triplet mining described in FaceNet [5]:


• Offline triplet mining: triplets are mined for each epoch

• Online triplet mining: triplets are mined for each batch

Both approaches use the most recent checkpoint of the network to ensure that the difficulty of triplets is adapted and increased as the training progresses. Since offline mining is performed on the entire data of an epoch, online mining is more efficient with considerably lower training times [11]. Therefore, online mining was used for this thesis project.

In the paper In Defense of the Triplet Loss for Person Re-Identification [11], two online mining strategies are presented: batch all and batch hard. Batch all selects all valid semi-hard and hard triplets and computes the average loss. Batch hard selects, for each anchor, the hardest positive and the hardest negative in the batch. The batch hard strategy was used for this thesis project, as it is the most efficient of the two online mining strategies [11].
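To make the batch hard strategy concrete, the following NumPy sketch selects the hardest positive and the hardest negative for each anchor in a batch and averages the resulting losses. It is an illustration only: it uses the squared L2 distance of Equation 2 and an assumed margin of 0.2, and it is not the tensor implementation used in the thesis.

```python
import numpy as np


def pairwise_squared_distances(embeddings):
    """Matrix of squared L2 distances between all embeddings in the batch."""
    diff = embeddings[:, None, :] - embeddings[None, :, :]
    return np.sum(diff ** 2, axis=-1)


def batch_hard_loss(embeddings, labels, margin=0.2):
    """Average triplet loss using the hardest positive and negative per anchor."""
    emb = np.asarray(embeddings, dtype=float)
    labels = np.asarray(labels)
    dist = pairwise_squared_distances(emb)

    same_class = labels[:, None] == labels[None, :]
    not_self = ~np.eye(len(labels), dtype=bool)
    positive_mask = same_class & not_self
    negative_mask = ~same_class

    # Hardest positive: the same-class embedding furthest from the anchor.
    hardest_positive = np.where(positive_mask, dist, -np.inf).max(axis=1)
    # Hardest negative: the other-class embedding closest to the anchor.
    hardest_negative = np.where(negative_mask, dist, np.inf).min(axis=1)

    # Only anchors with at least one positive and one negative form a triplet.
    usable = positive_mask.any(axis=1) & negative_mask.any(axis=1)
    if not usable.any():
        return 0.0
    losses = np.maximum(
        hardest_positive[usable] - hardest_negative[usable] + margin, 0.0
    )
    return float(losses.mean())


batch = [[0.0, 0.0], [1.0, 0.0], [0.0, 2.0], [3.0, 0.0]]
print(batch_hard_loss(batch, labels=[0, 0, 1, 1]))
```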

4 Related Work

4.1 FaceNet

In 2015, Google researchers developed a new approach for image recognition which was published in FaceNet: A Unified Embedding for Face Recognition and Clustering [5]. Previous approaches for image recognition based on neural networks conventionally used a softmax prediction layer, described in Section 3.1.2. For these approaches, the network is trained to learn class-specific discriminative features of the images trained on. The authors of FaceNet, however, created a network that is trained to learn a general similarity metric for the images trained on.

The network proposed in FaceNet is a kind of siamese neural network (SNN). SNNs return embeddings from input images, where distances between embeddings directly correspond to a measure of similarity. Further details of SNNs are described in Section 3.3. The FaceNet network was trained with the triplet loss function. More details of the triplet loss function are described in Section 3.3.2.

The approach described in the FaceNet research paper, where an SNN and the triplet loss function are used for image classification, was developed to address the challenges of implementing face recognition efficiently at scale [5]. In practice, face recognition tasks often include an extensive number of classes, a limited number of images per class, and continuously added and removed classes [7]. The data setting and challenges of face recognition are similar to those of the product matching task at PriceRunner: many classes, few images per class and a dynamic set of classes. More details about the data of the thesis are given in Section 2.1.

4.2 ResNet

ResNet is short for residual network, and it was created to mitigate issues caused by stacking too many layers in deep convolutional neural networks. The residual network architecture was created by researchers at Microsoft and published in Deep Residual Learning for Image Recognition [12].

Deeper CNNs have resulted in important strides for image classification. A deep network architecture naturally allows a model to learn low-, mid-, and high-level features of images, and the levels of features can be enriched by stacking more layers (increasing the network depth) [13]. The depth of a CNN is a significant performance factor [14]. However, deeper neural networks are more difficult to train.

As more layers are added to a network, the complexity increases and training becomes more challenging, exposing the degradation problem. Due to the degradation problem, accuracy becomes saturated and eventually starts to degrade as network depth increases [12]. Such degradation is not caused by overfitting, as increasing depth also leads to increased training error, illustrated in Figure 11.

Figure 11: Training error (left) and test error (right) on CIFAR-10 with 20-layer and 56-layer “plain” networks. The deeper network has higher training error, and thus test error [12]

ResNet mitigates the problem of degradation by introducing building blocks that use shortcut connections, as seen in Figure 12. Shortcut connections skip one or more layers, creating an identity mapping of the input. The identity mapping enables a neural network to better pass through abstractions that were learned in preceding layers [12].

Figure 12: Residual learning: a building block [12]
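The effect of a shortcut connection can be sketched in a few lines of NumPy. The example below is a simplification under stated assumptions: it uses two dense layers for the residual function F(x) instead of the convolutional layers of ResNet, and the function names are hypothetical.

```python
import numpy as np


def relu(x):
    return np.maximum(x, 0.0)


def residual_block(x, w1, w2):
    """Compute relu(F(x) + x), where F(x) = relu(x @ w1) @ w2."""
    fx = relu(x @ w1) @ w2
    # The shortcut connection adds the unchanged input to the block output.
    return relu(fx + x)


x = np.array([[1.0, -2.0, 3.0]])
zero_weights = np.zeros((3, 3))

# With all-zero weights F(x) is 0, so the block passes relu(x) through unchanged.
print(residual_block(x, zero_weights, zero_weights))
```

When the stacked layers learn nothing useful, the block still passes its input forward, which is why adding such blocks should not make the training error worse.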

5 Methodology

5.1 Data

All data used for this thesis is described in detail in Section 2.1.

5.2 Software

All models created during this thesis project were implemented in the programming language Python 3 with the tf.keras module. The tf.keras module is the TensorFlow core implementation of Keras [15]. TensorFlow is an open source library for developing and training ML models [16]. Keras is an open source, high-level API that simplifies the development of neural networks [17].

5.3 Hardware

The training was performed on an Amazon EC2 instance class g4dn.xlarge. It has 4 virtual CPU cores, a Tesla T4 GPU with 16GB of GDDR6 memory and 16GB of RAM.


5.4 Model Architecture

The network architecture in Listing 1 was used for the models in this thesis.

Model: "sequential"
_________________________________________________________________
Layer (type)                 Output Shape              Param #
=================================================================
conv2d (Conv2D)              (None, 112, 112, 256)     19456
_________________________________________________________________
max_pooling2d (MaxPooling2D) (None, 56, 56, 256)       0
_________________________________________________________________
conv2d_1 (Conv2D)            (None, 28, 28, 256)       1638656
_________________________________________________________________
max_pooling2d_1 (MaxPooling2 (None, 14, 14, 256)       0
_________________________________________________________________
conv2d_2 (Conv2D)            (None, 7, 7, 256)         1638656
_________________________________________________________________
max_pooling2d_2 (MaxPooling2 (None, 3, 3, 256)         0
_________________________________________________________________
flatten (Flatten)            (None, 2304)              0
_________________________________________________________________
dense (Dense)                (None, 128)               295040
_________________________________________________________________
lambda (Lambda)              (None, 128)               0
=================================================================
Total params: 3,591,808
Trainable params: 3,591,808
Non-trainable params: 0
_________________________________________________________________

Listing 1: Network architecture
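The output shapes and parameter counts in Listing 1 can be checked with simple arithmetic. The sketch below assumes 5x5 convolution kernels with stride 2 and "same" padding, 2x2 max pooling, and a 224x224 RGB input. The kernel size, stride, padding and pooling size are not stated in the listing; they are inferred here because they reproduce its numbers.

```python
def conv2d_params(kernel_size, in_channels, filters):
    """Weights plus one bias per filter."""
    return (kernel_size * kernel_size * in_channels + 1) * filters


def dense_params(inputs, units):
    """Weights plus one bias per unit."""
    return (inputs + 1) * units


def trace_network(input_size=224, input_channels=3, filters=256,
                  kernel_size=5, embedding_size=128):
    """Return (flattened size, total parameters) under the assumed settings."""
    size, channels, total = input_size, input_channels, 0
    for _ in range(3):
        total += conv2d_params(kernel_size, channels, filters)
        size = -(-size // 2)  # stride-2 convolution with "same" padding (ceil)
        size = size // 2      # 2x2 max pooling (floor)
        channels = filters
    flattened = size * size * channels
    total += dense_params(flattened, embedding_size)
    return flattened, total


print(trace_network())  # (2304, 3591808)
```

The spatial size goes 224, 112, 56, 28, 14, 7, 3, which matches the output shapes in Listing 1, and the total matches the reported 3,591,808 parameters.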

5.5 Data Collection

PR images were available in Amazon S3 buckets, that were managed by PriceRunner.

Different PriceRunner APIs were used to acquire product and image identification, in order to download the needed PR images. The offer images were not stored by PriceRunner and not directly available. PriceRunner maintains a database that contains information about matched product offers. This data was used to find image URLs of merchants’ offer images that could be downloaded.

To create different training and validation datasets of images, an extensive Python script was created to efficiently download and organise the PR and offer images of the needed product categories. Given a list of product categories, the script downloads all images of all products in the specified categories. The script organises the downloaded images by category, sub-category and product (class).


5.6 Preprocessing

All raw images have to be preprocessed to ensure a standard of quality and compatibility for the data used by the model. In particular, the offer images from the merchants are of varying quality and some images have to be filtered out.

The first step of the preprocessing is to verify that all images are valid, non-corrupt image files. Any file that is not in a valid image format, or that is broken, is discarded.

Then, the images are standardised so that every image is in the RGB format (the RGB image format has 3 colour dimensions: red, green, blue). The model requires the same format for all images. Therefore, RGBA images which have 4 colour dimensions and grayscale images which have 1 dimension, have to be converted into RGB. RGBA images had to be alpha blended [18] during the RGB conversion to prevent flicker in the images.
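A minimal NumPy sketch of the RGB standardisation is shown below. It is an illustration under assumptions: the function name is hypothetical, images are assumed to be 8-bit arrays, and RGBA images are alpha blended onto a white background, a colour that the thesis does not specify.

```python
import numpy as np


def to_rgb(image, background=(255, 255, 255)):
    """Convert a grayscale, RGB or RGBA uint8 image array to RGB."""
    img = np.asarray(image)
    if img.ndim == 2:
        # Grayscale: repeat the single channel three times.
        return np.stack([img] * 3, axis=-1).astype(np.uint8)
    if img.shape[-1] == 3:
        return img.astype(np.uint8)
    if img.shape[-1] == 4:
        # Alpha blending: weight the colour by alpha and the background by 1 - alpha.
        rgb = img[..., :3].astype(float)
        alpha = img[..., 3:4].astype(float) / 255.0
        bg = np.asarray(background, dtype=float)
        blended = alpha * rgb + (1.0 - alpha) * bg
        return np.round(blended).astype(np.uint8)
    raise ValueError("unsupported number of channels")


# One fully transparent and one fully opaque red pixel.
rgba = np.array([[[255, 0, 0, 0], [255, 0, 0, 255]]], dtype=np.uint8)
print(to_rgb(rgba))
```

The transparent pixel takes the background colour and the opaque pixel keeps its own colour, so no stray colour values from hidden pixels leak into the RGB image.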

ResNet (Section 4.2) and DenseNet [19] are often used as baseline networks in image-based ML research. The default input size of ResNet and DenseNet is 224x224 pixels. Therefore, square images of 224x224 pixels were also used as the standard for all the networks in this thesis, and images were resized to that size before being fed to the network. To maintain high image quality, at least one axis of the original image had to be 150 pixels or more; otherwise the image was discarded.

For downsampling, a Lanczos filter [20] was used. An important factor to consider when resizing images into squares is to maintain the aspect ratio. This was achieved by adding padding to images which were not originally square.

The following steps are done during resizing:

1. Search from the corners of the image towards the center for changes in pixel values. Stop when the change is above a certain threshold.

2. Save the positions of the pixels of the four corners.

3. Create a bounding box around the area that has been found.

4. Crop it.


5. Add padding of the colour found in the top left corner, so that the image becomes square.

6. Resize the now square image.
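Steps 1 to 5 above can be sketched with NumPy as follows. This is a simplified illustration, not the thesis implementation: instead of searching inwards from the corners, it compares every pixel against the top left pixel, and the threshold value of 10 and the function name are assumptions. The final resize (step 6) is left out, since it would be done by an image library with a Lanczos filter.

```python
import numpy as np


def crop_and_pad(image, threshold=10):
    """Crop an RGB image to its content and pad it to a square."""
    img = np.asarray(image, dtype=np.uint8)
    background = img[0, 0].astype(int)

    # Steps 1-2: find pixels that differ from the background by more than the threshold.
    diff = np.abs(img.astype(int) - background).max(axis=-1)
    content = diff > threshold

    # Steps 3-4: bounding box around the content, then crop.
    if content.any():
        rows = np.where(content.any(axis=1))[0]
        cols = np.where(content.any(axis=0))[0]
        cropped = img[rows[0]:rows[-1] + 1, cols[0]:cols[-1] + 1]
    else:
        cropped = img  # nothing stands out from the background

    # Step 5: pad with the top left colour so that the image becomes square.
    height, width = cropped.shape[:2]
    side = max(height, width)
    square = np.empty((side, side, 3), dtype=np.uint8)
    square[:] = img[0, 0]
    top = (side - height) // 2
    left = (side - width) // 2
    square[top:top + height, left:left + width] = cropped
    return square


# A white 10x20 image with a black 2x6 rectangle.
example = np.full((10, 20, 3), 255, dtype=np.uint8)
example[4:6, 3:9] = 0
print(crop_and_pad(example).shape)
```

The result is a 6x6 image in which the rectangle keeps its aspect ratio and is centred between white padding rows.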

The main reason for cropping, is to have as much focus on the product as possible.

The offer images vary in the placement and size of the object and in the colour and dimensions of the background. If only resizing were performed, information would be lost. The steps of the resizing are shown in Figure 13.

Figure 13: The 3 steps of preprocessing

5.7 Batch Generator

Three different types of batches were defined: random batches, category batches and subcategory batches. The type of batch is changed after each epoch according to the ratios 20:40:40 for random, category and subcategory batches. Random batches include classes across all categories of the training data. Category batches include classes only from the current category. Subcategory batches include classes only from the current subcategory.
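One way to realise such a schedule is to draw the batch type for each epoch with weights 20, 40 and 40. The sketch below is an assumption about how the ratios could be applied; the thesis does not state whether the type was sampled or rotated in a fixed order, and all names are hypothetical.

```python
import random

BATCH_TYPES = ["random", "category", "subcategory"]
BATCH_WEIGHTS = [20, 40, 40]  # ratios for random, category and subcategory


def batch_type_schedule(num_epochs, seed=0):
    """Draw one batch type per epoch according to the ratios above."""
    rng = random.Random(seed)
    return rng.choices(BATCH_TYPES, weights=BATCH_WEIGHTS, k=num_epochs)


schedule = batch_type_schedule(100)
for batch_type in BATCH_TYPES:
    print(batch_type, schedule.count(batch_type))
```

A fixed rotation with the same proportions would be equally consistent with the description.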

5.8 Triplet Generator

The triplet loss function is used to train the models in this thesis project. A triplet consists of an anchor, a positive and a negative image; see more details about the triplet loss function in Section 3.3.2. Conventionally, the only rule for a valid triplet is defined by Rule 1.


Rule 1 The anchor and the positive have to be distinct images of the same class.

The negative image has to be of a different class [5].

The image data used for this thesis can be divided into two different types of images: PR images and offer images (more information in Section 2.1). Each class comprises a set of PR images and a set of offer images - these two different types of images introduce an intra-class division. This intra-class division offers an extra layer of control over the triplets: it enables different combinations of valid triplets, by setting different combinations of PR and offer images for the anchor, positive and negative.

The intra-class division was exploited by adding an extra rule for a valid triplet that enabled controlling the triplet combination of PR and offer images, defined by Rule 2. A valid triplet for the triplet generator is defined as a triplet that satisfies Rule 1 and Rule 2.

Rule 2 The anchor has to be a PR image. The positive and negative have to be offer images.

The test and validation datasets only consist of offer images. The aim is to classify the offer images using PR images as reference images. For successful classification, the embeddings of the offer images have to be close enough to the PR reference embeddings, as explained in Section 3.3. In theory, controlling triplets by Rule 2 will cause the model to learn to map embeddings of PR (anchor) images and offer (positive) images of the same class closer together during training - while pushing offer (negative) images further away from PR (anchor) images that are of a different class.
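Rule 1 and Rule 2 can be expressed together as a boolean mask over all (anchor, positive, negative) index combinations in a batch. The NumPy sketch below is an illustration only; the function name and the `is_pr` flag array are hypothetical, and the thesis implementation uses TensorFlow tensor operations instead.

```python
import numpy as np


def valid_triplet_mask(labels, is_pr):
    """mask[a, p, n] is True when (a, p, n) satisfies Rule 1 and Rule 2."""
    labels = np.asarray(labels)
    is_pr = np.asarray(is_pr, dtype=bool)
    same_class = labels[:, None] == labels[None, :]
    distinct = ~np.eye(len(labels), dtype=bool)

    # Rule 1: anchor and positive are distinct images of the same class,
    # and the negative is of a different class.
    anchor_positive = same_class & distinct
    anchor_negative = ~same_class
    rule1 = anchor_positive[:, :, None] & anchor_negative[:, None, :]

    # Rule 2: the anchor is a PR image, the positive and negative are offer images.
    rule2 = is_pr[:, None, None] & ~is_pr[None, :, None] & ~is_pr[None, None, :]
    return rule1 & rule2


labels = [0, 0, 1, 1]
is_pr = [True, False, True, False]
mask = valid_triplet_mask(labels, is_pr)
print(np.argwhere(mask).tolist())
```

For this small batch only the triplets (0, 1, 3) and (2, 3, 1) remain, which shows how strongly Rule 2 reduces the number of valid triplets per batch.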

5.8.1 Implementation

TensorFlow provides an addon for the triplet loss function that also performs triplet mining [21] (more information about triplet mining in Section 3.3.3). With the TensorFlow triplet loss function, however, it is not possible to control the triplet combination of PR/offer images. Therefore, a custom triplet loss function with triplet mining had to be created. TensorFlow recommends [22] an open source implementation of the triplet loss function with triplet mining for tf.keras. This open source implementation [23] was used as a code base and was customised to allow controlling the combination of PR/offer images.


A tf.keras loss function only accepts a fixed set of two input parameters by default [15]: 1. the image embeddings of the current batch, 2. the labels of those embeddings. However, to be able to generate valid triplets during training according to Rule 1 and Rule 2, it is not only required to have the labels of the embeddings, but also to know whether an embedding was encoded from a PR or offer image. To clarify, our triplet loss function requires information about the class label and the intra-class division type (PR or offer image) for all embeddings.

Passing more information to a tf.keras loss function can be solved in multiple ways.

For the loss function used in this thesis project, it was solved by using bitwise operations to add a binary flag at the least significant bit (LSB) for every label (a label is a unique integer for each class). It works as follows:

For every image and its associated label:

1. Bitwise left-shift the label of the image to add a new LSB.

2. Set the new LSB:

- If the image is a PR image: set the LSB to 1

- Else if the image is an offer image: set the LSB to 0

This results in a single, multipurpose label for each image that both marks the image class and whether the image is a PR or offer image. To restore the original label, the bit-modified label is simply bitwise right-shifted. The bit-modified label enables passing an extra piece of information about each image embedding to the loss function.
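The label encoding described above can be sketched with plain Python integers as follows. The function names are hypothetical and only illustrate the bit operations.

```python
def encode_label(label, is_pr):
    """Left-shift the class label and store the PR/offer flag in the new LSB."""
    return (label << 1) | (1 if is_pr else 0)


def decode_label(encoded):
    """Return (original class label, True if the image is a PR image)."""
    return encoded >> 1, bool(encoded & 1)


pr_label = encode_label(5, is_pr=True)      # binary 101 becomes 1011
offer_label = encode_label(5, is_pr=False)  # binary 101 becomes 1010
print(pr_label, offer_label)                # 11 10
print(decode_label(pr_label))               # (5, True)
```

Because the flag only occupies the lowest bit, two images of the same class still share the same value after a right shift, so class comparisons inside the loss function remain possible.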

5.8.2 Optimisation

Due to long training times when working with large datasets of images, efficiency had to be optimised to enable faster training iterations. One approach was to make the triplet generator more efficient. This was accomplished by transcoding all of the necessary native Python data structures and logic for the triplet generation, into TensorFlow tensors and tensor operations [24]. This process mainly included replacing native Python loops iterating over arrays, with pure tensor operations applied on tensors.

Code examples are available in Appendix A, which includes two code snippets. The code snippets show the part of the triplet generator that is responsible for controlling the combination of PR/offer images of triplets. Listing 2 is the initial implementation and Listing 3 is the optimised implementation transcoded to tensors.

Transcoding Python control flow into tensor operations (matrix multiplication etc.) results in more efficient code. These changes reduced the training time by more than a factor of ten. Working with tensors instead of e.g. NumPy arrays also has the advantage that tensors can reside in accelerator memory, such as that of a GPU or TPU [24].

5.9 Classification

Siamese networks lead to a separation between classification and training. In our setting, only PR images are used as reference images. We generate embeddings of PR images to create an embedding space with all the reference embeddings in a vector space. In a production setting, the reference embeddings can be precomputed and stored in a database.

Once an unknown image or test/validation image is to be classified, it is preprocessed to the network default image size of 224x224 pixels and passed through the network, resulting in a test/validation embedding. The simple approach is to compare each test/validation embedding with every reference embedding by calculating the Euclidean distance, and to assign it to the class of the closest reference embedding. However, as the number of classes increases, calculating the distance to every reference embedding comes at a high computational cost.
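The simple exhaustive approach can be sketched in NumPy as below. The function name and the toy embeddings are assumptions for the example; real embeddings would come from the trained network.

```python
import numpy as np


def classify(query_embedding, reference_embeddings, reference_labels):
    """Return the label and distance of the closest reference embedding."""
    refs = np.asarray(reference_embeddings, dtype=float)
    query = np.asarray(query_embedding, dtype=float)
    # Euclidean distance from the query to every reference embedding.
    distances = np.linalg.norm(refs - query, axis=1)
    best = int(np.argmin(distances))
    return reference_labels[best], float(distances[best])


references = [[0.0, 0.0], [1.0, 1.0], [5.0, 5.0]]
labels = ["product_a", "product_b", "product_c"]
print(classify([0.9, 1.2], references, labels))
```

The cost of one query grows linearly with the number of reference embeddings, which is what motivates an approximate nearest neighbour index for larger embedding spaces.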

5.9.1 Efficiency

Efficiency has not been a priority during this thesis project. However, as the size of the dataset increased, efficiency had to be improved to enable faster training iterations and to obtain results faster. Apart from optimising the triplet generator as described in Section 5.8.2, the search in the embedding space during classification was also optimised. For this, Spotify’s open source library Annoy [25] for approximate nearest neighbour search was used.

5.10 Evaluation

To evaluate the results of the models, the metrics in Section 5.10.1 will be used to assess the following aspects:


1. How does the number of categories affect the training of the model?

◦ Do we need many models or can we use one general?

2. How does the number of categories in the embedding space affect the results?

◦ If it does affect the results, a category mapping may be needed beforehand, so that the embedding space only contains a single category.

3. How does the model handle new products?

◦ Can we handle new products without retraining the model?

◦ Do we need to retrain the model from scratch?

◦ Can we continue the training where we left off and reduce the training?

5.10.1 Evaluation Metrics

The following evaluation metrics will be used as a basis for result assessment of the models:

Top-1 Accuracy: the proportion of offers where the top prediction is the correct product (class)

Offer Coverage Recall: the proportion of offers that can be matched to a product (class)

Product Matching Precision: the proportion of the matched offers that are correct
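The three metrics can be computed as in the sketch below. It assumes that an offer counts as matched when the distance to its closest reference embedding is at or below a threshold; this matching rule, the function name and the example values are assumptions made for illustration.

```python
def evaluate(predicted, actual, distances, threshold):
    """Compute the three evaluation metrics for a list of offers."""
    total = len(actual)
    correct = [p == a for p, a in zip(predicted, actual)]
    # Assumption: an offer is matched if its nearest reference is close enough.
    matched = [d <= threshold for d in distances]
    n_matched = sum(matched)
    n_matched_correct = sum(c and m for c, m in zip(correct, matched))
    return {
        "top1_accuracy": sum(correct) / total,
        "coverage_recall": n_matched / total,
        "matching_precision": n_matched_correct / n_matched if n_matched else 0.0,
    }


result = evaluate(
    predicted=[1, 2, 3, 4],
    actual=[1, 2, 0, 4],
    distances=[0.1, 0.5, 0.2, 0.9],
    threshold=0.6,
)
print(result)
```

In this example three of the four offers are matched and two of those three matches are correct, so lowering the threshold trades recall for precision.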

6 Results and Analysis

All the results presented in this section are based on data from 17 categories within the Clothing and Accessories top-level hierarchy described in Section 2.1. Clothing and Accessories includes the mid-level hierarchies: Accessories, Clothing and Shoes. For each of those mid-level hierarchies, one category was selected for analysis: Watches, Children’s Clothing and Children’s Shoes.

The same network was used for each model and trained for 100 epochs. Depending on the set of categories used for the training, the training is referred to as top, meaning all 17 categories; mid, meaning all categories within a mid-level hierarchy; or solo, meaning a single category. The results of the models trained on different categories can be seen in Table 1.

                       Top Hierarchy   Mid Hierarchy   Solo
Category               Top-1           Top-1           Top-1
Watches                83%             86%             90%
Jewellery              77%             80%             -
Bags                   63%             63%             -
Wallets                59%             62%             -
Umbrellas              53%             52%             -
Sunglasses             53%             65%             -
Luggage                35%             39%             -
Children’s Clothing    81%             82%             84%
Masquerade             81%             84%             -
Work Gear              54%             62%             -
Women’s Clothing       52%             57%             -
Lingerie               34%             44%             -
Workout Clothes        49%             52%             -
Men’s Clothing         48%             51%             -
Children’s Shoes       58%             62%             69%
Shoe Accessories       59%             66%             -
Shoes                  45%             52%             -

Table 1: Results from different types of training

As seen in Table 1, the accuracy can differ considerably between categories. From our results, we could not detect a correlation between the number of images per class and accuracy. However, certain image attributes and trends were observed to affect the difference in accuracy between the categories. More information about image trends is given in Section 2.2.


6.1 Embedding Space and Training Effects

Figure 14 (panels a, b and c): How increasing categories affects accuracy

As can be seen in Figure 14, the accuracy decreases in all of the graphs when increasing the number of categories trained on. Without a category mapping there will be a mixed embedding space of different categories. This decreases the accuracy of the classification, but only if there are similar categories in the mixed embedding space. This can be seen in graphs 14a and 14b, where Children’s Shoes and Children’s Clothing both have similar categories, such as Shoes and Clothing for adults, in the mixed embedding space. The accuracy of Watches did not decrease when sharing the embedding space, as seen in graph 14c. This is probably due to Watches being a visually unique category within the Clothing and Accessories hierarchy.

For evaluation purposes, PriceRunner uses the metrics Offer Coverage Recall and Product Matching Precision defined in Section 5.10.1. For PriceRunner, it is important not to show inaccurate results; precision is therefore prioritised over recall.

For the precision vs. recall graphs in this section, the dots on the lines mark a change of the value of the threshold (information about the threshold in Section 3.3.2). The threshold takes values in the range from 0 to 0.9, in increments of 0.1.

At 95% precision or above, sharing the embedding space did not reduce the recall for any of the three categories, as seen in Figure 15.

Figure 15 (panels a, b and c): How the embedding space affects the precision and recall

However, the number of categories trained on did affect the recall. Generally, all three graphs in Figure 16 indicate that training the model with more categories reduces the recall. There is one exception, which indicates that at high precision the recall may improve slightly when training with similar categories: in graph 16a, the recall at 95% precision increased slightly for the orange line. The same result could not be seen in graph 16b. The threshold needed to achieve 95% precision also differs, from 0.1-0.2 for Children’s Shoes to 0.7-0.9 for Watches. This means that achieving high recall at a maintained precision requires a decision layer that selects the threshold depending on the category.


Figure 16 (panels a, b and c): Models trained on top-level hierarchy, mid-level hierarchy and solo categories

By training the model for more than 100 epochs, the trade-off between precision and recall could be improved, as seen in Figure 17. As can be seen in all three graphs of Figure 17, the recall is improved when increasing the epochs from 100 to 150.

However, increasing to 200 epochs resulted in a lower recall than 150 epochs, which indicates that the network starts to overfit.


Figure 17 (panels a, b and c): The top-level hierarchy Clothing and Accessories trained for 100, 150 and 200 epochs

6.2 Handling New Products

To evaluate how the models handle new products, we withheld 10000 classes of Watches from the training and only trained on 20000 classes of Watches.

On average, PriceRunner manually adds 5000 products across all 360 categories each week. Therefore, 10000 new classes for a single category is an extreme case, but it serves as a stress test of how the model handles new products. The dataset is far from representative, so the results only give an indication.

Graph 18a displays the difference between the 20000 classes of Watches that the model was trained on and the 10000 new, untrained classes of Watches. Even though the recall remains notably high for the untrained classes, it is significantly lower than for the trained classes. As can be seen in graph 18b, we explored how new products can be trained into the model. Continuing the training of the network for another 50 epochs resulted in an insignificant difference in recall compared to retraining from scratch for 100 epochs. This means that for new classes, the same result can be achieved in 50 epochs by continuing to train the network as with 100 epochs from scratch. In a production setting, this difference is significant since the use of computing resources can be reduced. In graph 18c, the result for the additional 10000 classes of Watches is displayed both untrained and after 50 epochs of continued training, which reveals the improvement in recall that can be achieved with only 50 epochs.

Figure 18: Handling of new products. (a) Unseen data vs trained. (b) Different type of training. (c) Training 50 epochs with added classes.

7 Discussion

7.1 Discussion of Methodology

7.1.1 Constraints

7.1.1.1 Hardware

To utilise the GPU fully, the data is stored directly in GPU memory during runtime. We were therefore limited by the 16GB of GDDR6 memory in the GPU, and the data had to be divided into small batches during training. This resulted in an upper limit of around 150 images per batch before the GPU ran out of memory.


7.1.1.2 Data

During training, FaceNet [5] uses a batch size of approximately 1800 images that averages around 40 separate classes with 45 images per class. In FaceNet [5], the authors mention the importance of ensuring that there are not too few images of any one class in a batch during training. As written in Section 2.1.1, the data used for this thesis includes a limited number of images per class. This was one of the major challenges during the thesis project. In order to generate valid triplets during training with a combination of PR and offer images, as stated by Rule 1 and Rule 2 in Section 5.8, at least one PR image and one offer image have to be present per class to create anchor-positive pairs. Due to the hardware constraint described in Section 7.1.1.1, the number of separate classes per batch was also limited. This resulted in few combinations of valid triplets per batch.

As explained in Section 2.1, there are two different types of images: PR images and offer images. Both image types are labelled and both are used for training. However, only PR images are allowed to be used as reference images to match against during classification. This constraint was set because the offer images have not been manually verified by PriceRunner and are thus deemed not reliable enough. Since offer images are not verified, they could potentially accumulate errors if they were used as reference images. This constraint further decreases the already limited number of reference images per class. Fewer reference images per class effectively decreases the capability of image recognition, since there are fewer image variations of each class to match against (fewer angles of an object etc.).

7.1.2 Model Architecture

Throughout the project, various network architectures were tested and evaluated.

The best results were achieved with the network architecture in Listing 1. ResNet (Section 4.2), DenseNet [19] and MobileNet [26] were also tested with various combinations of transfer learning, though none reached better results than the architecture in Listing 1. As stated in Section 4.2, deeper networks are more challenging to train. The poorer results might be due to too few images per class, or too few images in total, to train the more complex networks properly.


7.1.3 Preprocessing

Resizing was chosen as the preprocessing method after evaluating other options, such as random crop and center crop applied when building batches. No benefits were noticed from cropping larger images during training. However, removing the need for preprocessing would be cost-effective and should be considered.

Fewer than 5% of the images were discarded due to low quality in the categories we evaluated. This needs to be further analysed if more categories are used.

The background colour of the added padding was determined by the colour of the top left corner. With this approach, the worst case is padding whose colour differs visibly from the rest of the image background, and the best case is padding with the same colour as the background of the image. We did not notice any issues with this approach, even though it might not be the most reliable.

7.1.4 Batch Generator

As stated in Section 3.3.3, online triplet mining has been used for this thesis project. When performing online triplet mining, it has to be ensured that valid triplets can be formed for each batch. There have to be at least 2 images per class to be able to form anchor-positive pairs. The simple approach of generating batches is to randomly sample images into batches. However, this is not a feasible approach since it would not be possible to form valid triplets for most of the batches.

We have tried multiple combinations of how to generate triplets and have seen varying results depending on how we build the triplets. In the beginning, we used a TensorFlow package [21] for triplet mining where we had no control over how the triplets were generated. This meant we only controlled how the batches were generated and the triplet mining package took care of the rest.

However, since there are more offer images than PR images, this caused the network to be biased towards offer images. This bias was noticed when validating unseen data versus all the training data: when we created embeddings of the offer images, we saw a large accuracy increase compared to using only embeddings from PR images.

To tackle this problem we tried many different strategies, such as ensuring that there was the same number of PR and offer images in each batch, and taking all images from a class but using fewer classes in each batch. However, we saw no increase in the classification accuracy. The strategy that has been successful is controlling how the triplets are generated (further details in Section 5.8). We also saw promising results when using augmentations of the same image as both anchor and positive (more in Section 7.1.7).

7.1.4.1 Batch Generator With Many Categories

Some product categories had over a hundred thousand images, while other categories had just a few thousand. The variation in the number of classes and images per category hurt the small categories, while the better-performing categories still did not reach the potential we had seen earlier.

We decided to look into how to make harder triplets and started to take advantage of the category hierarchy from PriceRunner. We had seen a boost when training two categories together, so we knew that we wanted to keep some of that mix, but we also saw a decrease when adding too many categories.

We defined three types of batches: random batches, category batches and subcategory batches. After trial and error, we decided to change the type of batch after each epoch according to the ratios 20:40:40 for random, category and subcategory batches.

7.1.5 Triplet Generator

Rule 2 in Section 5.8 can be modified to produce different possible combinations of PR/offer images for a triplet. Out of the PR/offer combinations that were tested, the presented combination of Rule 2 yielded the best results. Controlling the triplet combination of PR/offer images by Rule 2 resulted in a significant increase in accuracy compared to not controlling the combination. Therefore, a valid triplet for the triplet generator was defined as a triplet that satisfies Rule 1 and the extra Rule 2.

7.1.6 Classification

To reduce the computational cost of the classification, a category mapping can be executed to produce a smaller candidate embedding space. Category mapping is already used in PriceRunner’s textual-based machine learning model. Category mapping was not used in this thesis project. Instead, another approach for reducing the computational cost of the classification was implemented: an approximate search [25] of the embedding space.

For approximate nearest neighbour search, Spotify’s open source library Annoy [25] was used. For the scope of our thesis, the Annoy library served its purpose of enabling us to collect results faster. However, an issue with the Annoy library is that once an index for the search space has been built, it cannot be updated. This means that when changes occur in the search space, the index has to be rebuilt. For a production setting at PriceRunner, a setup with Annoy would therefore not be optimal.

7.1.7 Augmentations

As stated earlier in Section 2.1, we had few images per class and we wanted to see if augmentations could counter this and improve the performance. We tried various combinations of the augmentations seen in Figure 19 from a recent paper [10]. The most successful combination was to use two augmented versions of the same image as anchor and positive, in combination with batches with one image per class.

In the paper [10], the authors concluded that a sequence of random crop with resize and random flip, random Gaussian blur and random colour distortion was the best combination. These augmentations can be seen in Figure 19 with the notations b, c, d, e, h and i. The best results with augmentations were reached when adjusting the colour distortion to be lower and the minimum crop to be higher than in the paper [10]. In the end, however, we achieved the best results when only controlling the combination of PR and offer images of the triplets as explained in Section 5.8, without using any augmentations.