Automatic Retail Product Identification System for Cashierless Stores

PAPER WITHIN: Software Product Engineering
AUTHOR: Shiting Zhong
TUTOR: Rachid Oucheikh
JÖNKÖPING, February 2021

This exam work has been carried out at the School of Engineering in Jönköping in the subject area of machine learning. The work is part of the two-year Master of Science programme in Software Product Engineering. The author takes full responsibility for the opinions, conclusions and findings presented.

Examiner: He Tan

Supervisor: Rachid Oucheikh

Scope: 30 credits


Summary

The purpose of this project is to design and build an end-to-end retail product identification system. The system provides cashierless stores with a smart checkout in which purchases are recognized automatically, without scanning barcodes or standing in long queues. It can also help self-checkout stores detect fraudulent customer behaviour, such as placing the barcode of a cheaper item on an expensive product in order to pocket the price difference.

The project aims to answer four research questions. First, what is the appropriate architecture of an end-to-end retail product identification system based on text analysis and classification? Second, with text classification as the basis of the system, which representation and features of the text are the most suitable input for the deep learning model? Third, which performance metrics should be used to evaluate the designed system? Finally, how can the performance of the system be boosted to achieve high efficiency? This last question involves retraining GloVe embeddings to obtain new embeddings that cover the full vocabulary of the project's dataset.

Design science research was chosen as the basic study method, and text classification was used as the fundamental technique for the automatic retail product identification task. A pretrained GloVe model served as the embedding method, and Word2Vec and Mittens were used as retraining tools to obtain a full representation of the text dataset. The proposed solution consists of text extraction via OCR, text preprocessing, model building and training, and the main component, the classifier. In the test and inference stage, the trained model predicts the labels of the test samples, and precision, recall, F1-score and accuracy, derived from the confusion matrix, are used to evaluate the performance of the system. In addition, the LSTM-based model was compared with RMDL from three perspectives: runtime, memory usage and accuracy.

Finally, it was concluded that RMDL yields better prediction performance, whereas the LSTM-based solution is clearly better than RMDL in terms of runtime and memory requirements. To boost the performance of the system and increase its accuracy, the GloVe embeddings were retrained using Word2Vec and Mittens.

Keywords

Text classification, retail product identification, Word Embedding, Neural Network, Word2Vec, GloVe.


Contents

Summary
1 Introduction
1.1 BACKGROUND
1.2 PURPOSE AND RESEARCH QUESTIONS
1.3 DELIMITATIONS
1.4 OUTLINE
2 Theoretical background
2.1 AUTOMATIC RETAIL PRODUCT IDENTIFICATION SYSTEM (ARPIS)
2.2 EXISTING SOLUTIONS FOR ARPIS
2.3 THE SOLUTION FOR ARPIS BASED ON A DEEP LEARNING MODEL
2.4 THE TECHNIQUES AS ASSISTANCE FOR BUILDING ARPIS
2.5 EVALUATION METRICS
3 Method and implementation
3.1 METHOD
3.3 IMPLEMENTATION
4 Findings and analysis
4.1 RESULTS
4.2 ANALYSIS
5 Discussion and conclusions
5.1 DISCUSSION OF METHOD
5.2 DISCUSSION OF FINDINGS
5.3 CONCLUSIONS


1 Introduction

1.1 Background

The retail sector is one of the most active and powerful industries in the world: it accounts for about 40% of the world's Gross Domestic Product (GDP) and is the biggest employer among economic sectors. However, the industry currently faces significant challenges that threaten its survival. In the U.S., for example, retail is struggling with increased competition from big players that have developed means of reaching consumers directly. This has compelled retailers to critically analyze and redesign their operations and marketing strategies in order to deal with the competition effectively (Perdikaki, 2009). To stay in the market, several retailers have differentiated themselves through improved in-store shopper experiences and various other tactics (Perdikaki, 2009). Furthermore, retailers constantly seek to improve operational quality, since extremely thin profit margins leave little room for incompetence and waste. The need to survive in an increasingly competitive market has pushed retailers to innovate and to employ technology in their operations to improve quality and efficiency.

Cashierless stores arise from this drive to employ technology to improve the shopping experience and to make operations as simple as possible. They lower labor costs, since little human surveillance is needed to catch fraudulent customer behaviour such as replacing a barcode with a lower-priced one, and they make shopping smooth: customers can simply grab what they want and walk out, with no need to scan barcodes or check out.

In cashierless stores, one of the biggest challenges is ensuring that products are identified properly, and various advanced tools and techniques are used for this purpose. The current solutions are based mainly on computer vision, treating the task as an image recognition problem: when a customer grabs a retail product, the system identifies the item from its appearance. This approach is not perfect and shows some shortcomings. The problems stem mainly from the fact that many products share the same color, shape and overall appearance, with only some small texts distinguishing them. For instance, two kinds of chips from the same brand may look almost identical, the only difference being the words "salted" and "unsalted" on their packaging. Therefore, a new solution is proposed here that converts the image recognition problem into a text classification problem. At the same time, there is no prior research on using Optical Character Recognition (OCR) and text classification to deal with image recognition, especially for retail product identification. This is the knowledge gap driving this research: to build a new artifact, an end-to-end retail product identification system, and to make it classify retail products as correctly as possible. In this report, the product identification system for cashierless stores is examined as one of the technologies employed by retailers. To identify retail products by recognizing the texts on their packaging, OCR is first used to scan the texts and transform them into a form that can be read and edited by a computer, so that they can be processed by a word processor or other software. This technology greatly helps product identification and lets cashierless stores keep track of product data and details. The next step, text classification, organizes texts into groups, categorizing a labeled dataset of document or text samples into disjoint sets. Through Natural Language Processing (NLP), texts can be analyzed automatically and assigned tags based on their content. Applications of text classification include spam detection, intent identification, sentiment interpretation and topic labelling. In product identification, text classification allows consumers to navigate through products easily. Word embedding and vectorization are two further important techniques; both are used for text representation, the numerical representation of unstructured text. While product identification is crucial to the performance of retailers in the market, the question that arises is: what is the appropriate architecture for a retailer's product identification system?
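To make the OCR-to-classifier hand-off concrete, the following is a minimal sketch of the kind of cleanup that raw OCR output typically needs before it can be fed to a text classifier. The function name and the specific normalization rules are illustrative assumptions, not the implementation used in this project.

```python
import re

def clean_ocr_text(raw: str) -> str:
    """Normalize raw OCR output before classification (illustrative rules)."""
    text = raw.lower()
    # Drop punctuation and stray symbols that OCR often misreads.
    text = re.sub(r"[^a-z0-9\s]", " ", text)
    # Collapse the whitespace runs introduced by line breaks on the packaging.
    text = re.sub(r"\s+", " ", text).strip()
    return text

print(clean_ocr_text("Salted  Chips!\n NET WT. 150g"))  # salted chips net wt 150g
```

In a real pipeline this step would sit between the OCR tool's text dump and the feature-extraction stage described in chapter 2.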

1.2 Purpose and research questions

The purpose of this research is to determine how to build an artifact of an end-to-end retail product identification system for cashierless stores and to examine how that system can achieve better performance. As indicated, a product identification system is key to cashierless stores because it allows them to effectively identify and track the products purchased by customers. It can also help them track inventory and determine which products run out quickly and what to stock. However, the system can fulfil these roles only if it has been designed in the right way. This study therefore highlights the architecture that allows such a system to be built.

Research Questions:

1. What is the appropriate architecture of an end-to-end retail product identification system using text analysis and classification?

Because no previous study uses text classification to recognize retail products for cashierless stores, the solution of combining OCR with text classification is proposed, which drives the building of the artifact of the retail product identification system. Some terms in research question one need explanation. "Appropriate architecture" refers to the performance of the system to be built: the system should be improved as much as possible within the limited time available for this research. "Architecture" itself is defined by the Cambridge Dictionary, for the IT field, as "the design and structure of a computer system, which controls what equipment can be connected to it and what software can operate on it" (Cambridge Dictionary). In this report, it means the structure of the retail product identification system, which recognizes different products from the texts on their packaging using OCR. Another term to clarify is "end-to-end": the system only needs input data, the texts on product packaging, and produces output, the classification of that product. In brief, "end-to-end" can here be treated as equivalent to "automatic". At the same time, the inner structure of the system is a black box: neither retailers nor customers need to know anything about it while using it. The whole system works as follows: cameras with an OCR scanning function scan retail product packaging repeatedly to extract texts, which are stored in txt files and fed into the system as input. The system generates a categorical class, the label identifying the retail product grabbed by the customer. Meanwhile, information about the identified product, such as quantity and price, is retrieved from the database so that the cashierless store can handle payment.

2. With text classification as the basis of the system, what are the best representation and features of the text to use as input to the deep learning model?

Since a deep learning model is used in this report to solve the text classification problem, it is important to know what kind of input suits the model. OCR extracts texts from product packaging and saves them in an editable form such as txt files. The question is then how to represent these txt files so that they can serve as input for the deep learning model, and which representation makes the system perform best.
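One common representation, sketched below under the assumption that texts have already been tokenized on whitespace, maps each token to an integer id and pads every document to a fixed length, which is the usual input shape for an embedding layer. The function names are illustrative, not from the project's code.

```python
def build_vocab(texts):
    """Map each distinct token to an integer id; 0 is reserved for padding."""
    vocab = {}
    for text in texts:
        for token in text.split():
            if token not in vocab:
                vocab[token] = len(vocab) + 1
    return vocab

def texts_to_padded_sequences(texts, vocab, maxlen):
    """Convert texts to fixed-length id sequences, padding/truncating to maxlen."""
    seqs = []
    for text in texts:
        ids = [vocab.get(tok, 0) for tok in text.split()][:maxlen]
        seqs.append(ids + [0] * (maxlen - len(ids)))
    return seqs

docs = ["salted potato chips", "unsalted potato chips"]
vocab = build_vocab(docs)
print(texts_to_padded_sequences(docs, vocab, 4))  # [[1, 2, 3, 0], [4, 2, 3, 0]]
```

An embedding layer would then turn each integer id into the dense vector discussed under GloVe in chapter 2.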

3. What metrics should be used to measure the system performance?

Because the purpose is to build an artifact of an end-to-end retail product identification system and make it perform as well as possible, finding proper metrics for measuring system performance is necessary and important, so that the results can be fed back to guide iterative improvement of the system.


4. How can Global Vectors for Word Representation (GloVe) embeddings be retrained to include the new vocabulary specific to the dataset in this project, particularly misspelled words caused by OCR imperfection?

Since OCR is used to extract texts from retail product packaging, it cannot produce exactly correct texts; its limited quality causes many misspelled words and even foreign-language words. According to a preliminary investigation, the 100-dimensional pretrained GloVe file is used as embeddings in this report; more details about the GloVe file and word embeddings are given in chapters 2 and 3. In short, they provide a word representation that captures similarity among words. However, the pretrained GloVe file is built from a word collection drawn from Wikipedia and the Gigaword corpus, so it is limited to the words in those sources. It is also impossible, and pointless, to include all the words needed for every project, as that would be a huge burden for projects requiring only a small vocabulary. The general solution is therefore to start from a pretrained GloVe file, which covers a huge number of common words found on the internet, and then add the new words that exist only in the given project, so that every word has a full representation and the system achieves better performance. Hence research question four arises.
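The shape of that solution can be sketched as follows. The first function parses GloVe's plain-text format (one word followed by its vector per line); the second assigns a small random placeholder vector to every out-of-vocabulary token, such as an OCR misspelling. The function names and the placeholder initialization are illustrative assumptions; in the project itself the new vectors are learned by retraining with Word2Vec and Mittens rather than left random.

```python
import random

def load_glove(lines):
    """Parse GloVe's plain-text format: one word followed by its vector per line."""
    embeddings = {}
    for line in lines:
        parts = line.split()
        embeddings[parts[0]] = [float(v) for v in parts[1:]]
    return embeddings

def extend_vocabulary(embeddings, corpus_tokens, dim, seed=0):
    """Give every out-of-vocabulary token a small random vector so the embedding
    table covers the whole dataset; retraining would refine these placeholders."""
    rng = random.Random(seed)
    for token in corpus_tokens:
        if token not in embeddings:
            embeddings[token] = [rng.uniform(-0.05, 0.05) for _ in range(dim)]
    return embeddings

# "sa1ted" stands in for an OCR misspelling of "salted".
pretrained = load_glove(["salted 0.1 0.2", "chips 0.3 0.4"])
full = extend_vocabulary(pretrained, ["salted", "sa1ted", "chips"], dim=2)
print(sorted(full))  # ['chips', 'sa1ted', 'salted']
```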

1.3 Delimitations

The main focus of this report is on building the structure of a retail product identification system using OCR and text classification; no user interface will be built. Development of a new OCR tool is not part of this project, as an existing OCR tool will be selected. Since the focus is the design of the identification system, other software concerns of cashierless stores, such as payment and the processes that follow product recognition, are not covered. The programming language is limited to Python, using its many existing deep learning libraries. The memory requirement for computers is at least 16 GB. Moreover, the report assumes readers with adequate knowledge of machine learning and deep learning.

1.4 Outline

This report will be organized in the following way.

In chapter 2, the theoretical background is introduced to pave the way for the implementation in chapter 3; it includes relevant previous work and an introduction to the theories, with an explanation of how the system works from a theoretical point of view. In chapter 3, the method and implementation are described, including the evaluation of the system to be built. The results and analysis are presented in chapter 4. In chapter 5, the work is discussed from different perspectives, such as the method and the findings, and the conclusions end the chapter. Finally, the references are listed in chapter 6.


2 Theoretical background

This section starts with a general introduction to retail product identification systems and the definition of an automatic retail product identification system used in this project. It then reviews existing solutions, describes the solution to be implemented together with the variety of supporting techniques that assist it, and ends with an introduction to the evaluation metrics that can be used for this system.

2.1 Automatic retail product identification system (ARPIS)

As the retail commerce industry grows at a very rapid pace, traditional modes of retail business are shifting towards more advanced forms that require high-technology advancements. Cashierless stores have therefore been brought to market, offering smooth operations for customers and reduced employment costs for retailers. The benefit of cashierless stores is not only saving customers' time by avoiding long checkout queues but also preventing fraudulent behaviour such as putting a false barcode on packaging to get a lower price (Martin, et al.).

However, customers only get a smooth, satisfying shopping experience if the technology used in cashierless stores recognizes retail products correctly; wrong identification causes unnecessary trouble for both customers and retailers, such as overcharging or undercharging. This draws attention to the retail product identification system, which uses technology to recognize and classify retail products so that more detailed information can be obtained for subsequent shopping processes such as payment. Naturally, everyone wants it to identify retail products as correctly as possible.

Since there is no common definition of a retail product identification system, it is defined in this report as a system that recognizes groceries whose information, such as price, quantity, ingredients, place of origin and brand, is stored in a database. The system is fed information related to a retail item and generates a distinctive label that links to the database, so that the relevant information is retrieved as soon as the item is recognized. Furthermore, an automatic retail product identification system is a retail product identification system that recognizes products without any human intervention; self-checkout points and pick-up stations that require customers to scan barcodes do not count as automatic. With an ARPIS, customers can stroll into a store, grab what they need and simply walk out, without scanning barcodes at a checkout or standing in long lines.

2.2 Existing solutions for ARPIS

At present, many different technologies are used for ARPIS. For instance, Upadhyay, Aggarwal, Bansal and Bhola (2020) proposed a product identification system combining the Single Shot MultiBox Detector architecture with a MobileNet classifier, an efficient convolutional neural network for mobile vision, for object detection, and further incorporated Augmented Reality. The Augmented Reality component displays product features to customers, supporting their purchase decisions without switching between different windows. Although they reported a good result of around 96% accuracy, only thirteen test cases of the important functions were passed by the system, which is too few to support their claims. Besides, customers are confined to a panel while shopping, which is not a free or smooth enough shopping experience.

Geng et al. (2018) suggested a mixed method of feature-based matching and one-shot deep learning to identify similar retail products. They first chose candidate regions by selecting similar features and gave them coarse labels, then used an attention map to amplify the detailed differences between those candidate regions. The result is a coarse-to-fine hybrid method with a Convolutional Neural Network based classifier for recognizing grocery products. They achieved an average mean precision between 0.46 and 0.94, depending on the dataset and the features chosen, which clearly shows an unstable performance.

Gundimeda, Murali, Joseph and Naresh Babu (2019) took another route, identifying food product images with improved quality using traditional computer vision algorithms, assisted by OCR and Natural Language Processing (NLP). Besides, Srivastava (2020) presented three tricks for improving the performance of convnet-based deep learning models for retail product image classification: a Local-Concepts-Accumulation layer gives consistently increased accuracy on all datasets; using both Instagram- and ImageNet-pretrained convnets gives better results than an ImageNet-pretrained convnet alone; and adding a Maximum Entropy loss also improves performance.

All of these are computer vision approaches that treat the task as image recognition, with Convolutional Neural Networks (CNN or ConvNet) (Rawat & Wang, 2017; Aloysius & Geetha, 2017) as the rationale behind them. The problem is that many retail products share the same appearance, the same shapes and the same colors; a retail product identification system based on computer vision struggles to distinguish them. Additionally, it is hard for Convolutional Neural Networks to deal with sequences of data (Sadr, Pedram, & Teshnehlab, 2019), such as audio or speech. Besides, a Convolutional Neural Network has a fully-connected layer covering all features of the dataset, which has high time complexity and long training time (Albawi, Mohammed, & AL-AZAWI, 2017).

2.3 The solution for ARPIS based on a deep learning model

Considering that the main problem for ARPIS is recognizing retail products with similar appearance, the idea proposed here is to convert image recognition into text classification: even if retail products look similar, different products always carry different text descriptions on their packaging. Based on this idea, Optical Character Recognition (OCR) is adopted to scan characters on retail product packaging and extract them, so that the texts can be stored in editable form for further use. Modi and Parikh (2017) made an exhaustive review of optical character recognition and concluded that a variety of reliable OCR techniques perform well on both handwriting and natural scene images.

Once the texts from retail product packaging are stored in editable form, they can serve as the dataset for a deep learning model. Since the core idea is to classify these packaging descriptions into distinctive labels so that the products can be recognized, it is important to choose proper text classification techniques.

2.3.1 Text classification and its techniques

Text classification refers to assigning categories, topics or labels to unstructured information, such as emails, chats, survey responses and social media texts, according to their content. It is one of the basic tasks in Natural Language Processing. Assigning labels, topics and categories makes it easier to structure and analyze texts, which also benefits business. Figure 1 (Kowsari, et al., 2019) shows the workflow of text classification. A typical text classification process can be summarized in four steps: feature extraction, dimension reduction, classifier selection and evaluation. In the feature extraction step, text preprocessing is crucial; a grouping of data following a certain rule is defined first and becomes the chosen feature applied to the whole dataset. Kowsari, Meimandi, Heidarysafa, Mendu, Barnes and Brown (2019) investigated text classification algorithms from three aspects, feature extraction, dimensionality reduction and evaluation methods, and summarized the limitations of each technique on real-world problems. Dimension reduction is an optional step that aims to reduce time and memory complexity.

Figure 1. Overview of Text Classification Pipeline. (Kowsari, et al., 2019)

Text classification is becoming an increasingly significant part of business because it lets companies easily draw understanding from data and automate business processes. Retailers use it to improve their customer experience, as it can classify different types of products even across different languages. Meanwhile, the most important part of the whole process is classifier selection or building, which involves many algorithms that affect performance. Common classification algorithms include the K-Nearest Neighbor (KNN) algorithm, Support Vector Machines, Decision Trees and Naïve Bayes. Aggarwal and Zhai (2012) surveyed various text classification algorithms and noted that neural networks have been a remarkable family with stunning performance in text classification. Indeed, neural networks, the architecture underlying deep learning, have been widely applied to sentiment analysis in recent years (Sadr, Pedram, & Teshnehlab, 2019).
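To make the classifier-selection step concrete, here is a toy nearest-neighbour text classifier using token-set overlap as its similarity measure. This is purely illustrative of the KNN idea mentioned above, not the classifier used in this project (which is neural), and the product labels are invented.

```python
def jaccard(a, b):
    """Token-set overlap: |A ∩ B| / |A ∪ B|, a crude text similarity."""
    sa, sb = set(a.split()), set(b.split())
    return len(sa & sb) / len(sa | sb)

def knn_predict(train, text):
    """1-nearest-neighbour: return the label of the most similar training text."""
    return max(train, key=lambda pair: jaccard(pair[0], text))[1]

train = [("salted potato chips", "chips-salted"),
         ("unsalted potato chips", "chips-unsalted"),
         ("sparkling water lemon", "water")]
print(knn_predict(train, "salted chips family pack"))  # chips-salted
```

Note how the single word "salted" versus "unsalted" decides the prediction, which is exactly the distinction that appearance-based computer vision struggles with.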

2.3.2 Neural networks

Neural networks, the structure of deep learning (Schmidhuber, 2015), are a more advanced technique than traditional text classification models, with superior performance on many tasks such as sentiment analysis and categorization. In traditional text classification algorithms the features are handcrafted, that is to say, defined or created by humans. With deep learning or neural network approaches, the features are extracted and recognized entirely by the computer itself while learning across layers. The rationale behind deep learning is that it simulates the human brain: with the nodes and layers chosen in a given model, information is fed into the model and the model generates outputs, which is how a computer learns and understands the information.

The simplest neural network is the Artificial Neural Network (Yegnanarayana, 2009), known as a model with three layers: an input layer, a hidden layer and an output layer. Information is fed into the model at the input layer, passes through the hidden layer and reaches the output layer. When there are many hidden layers between input and output, the result is a new type of network called a deep neural network (DNN) (Szegedy, Toshev, & Erhan, 2013), a more advanced, deeper neural network. We are interested in one kind of DNN, namely Recurrent Neural Networks.

2.3.3 Recurrent Neural Networks (RNN)

A Recurrent Neural Network (RNN) (Medsker & Jain, 2001) is a type of neural network that uses data as input for the current cell and generates outputs that become inputs for the same cell in the next step. This recurrence is why RNNs are good at dealing with sequential data: they can easily learn from speech texts, audio, video and so on (Graves, Mohamed, & Hinton, 2013). An RNN takes sequential data as input and produces outputs that are fed back as inputs to the same layer. In this project, the dataset of retail product descriptions is exactly this kind of sequential data, which makes the RNN family a natural choice for the ARPIS solution. Figure 2 shows the architecture of an RNN: the left structure is one state of the RNN, and the right structure shows the details inside that state. For instance, the network first takes X0 as input and generates h0 as output; it then takes both h0 and the new sequential input X1 for the next step and generates h1.

Figure 2. Architecture of RNN.
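The recurrence in Figure 2 can be sketched in a few lines of NumPy. This is a generic vanilla-RNN step under assumed dimensions (3-dim inputs, 4-dim hidden state) with randomly initialized weights, not trained code from the project:

```python
import numpy as np

def rnn_step(x_t, h_prev, W_in, W_rec, b):
    """One recurrence: the new hidden state mixes the current input with the
    previous state, so h_t carries (limited) context forward."""
    return np.tanh(W_in @ x_t + W_rec @ h_prev + b)

rng = np.random.default_rng(0)
W_in, W_rec, b = rng.normal(size=(4, 3)), rng.normal(size=(4, 4)), np.zeros(4)
h = np.zeros(4)
for x_t in rng.normal(size=(5, 3)):   # a sequence of five 3-dim inputs
    h = rnn_step(x_t, h, W_in, W_rec, b)   # same weights reused at every step
print(h.shape)  # (4,)
```

The key design point is that the same `W_in` and `W_rec` are reused at every time step, which is what lets the network handle sequences of arbitrary length.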

In this way, an RNN can remember context to a limited degree and can, for example, predict the next word in a text. However, it has a weakness: an RNN does not have a long enough memory to process very large sequential datasets. In other words, it cannot predict a word or a sentence accurately when given a huge context; it simply cannot remember the relationships among all that information. In addition, RNNs suffer from the vanishing gradient problem (Pascanu, Mikolov, & Bengio, 2013). The gradient is the value used for updating the parameters of a model. In an RNN, the backpropagated gradient is

$$\frac{\partial E_k}{\partial W} = \frac{\partial E_k}{\partial h_k}\,\frac{\partial h_k}{\partial C_k}\left(\prod_{t=2}^{k} \sigma'\!\left(W_{rec}\cdot c_{t-1} + W_{in}\cdot x_t\right)\cdot W_{rec}\right)\frac{\partial c_1}{\partial W}.$$

When $k$ is large, the gradient tends to vanish. As a result, the model loses its guideline for updating parameters and its predictions deteriorate. Hence a new neural network based on the RNN, called Long Short-Term Memory (LSTM), arises.
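The shrinking product in that gradient can be illustrated numerically: the sigmoid derivative never exceeds 0.25, so the product of many such factors collapses towards zero. The recurrent weight value below is an illustrative assumption.

```python
import math

def sigmoid_deriv(z):
    """σ'(z) = σ(z)(1 - σ(z)), which is at most 0.25 (attained at z = 0)."""
    s = 1.0 / (1.0 + math.exp(-z))
    return s * (1.0 - s)

# The backpropagated gradient contains a product of k-1 factors of the form
# σ'(·) · W_rec; even in the best case σ'(0) = 0.25 the product shrinks
# geometrically with the number of time steps.
w_rec = 0.9
grad_factor = 1.0
for _ in range(50):                           # 50 time steps back
    grad_factor *= sigmoid_deriv(0.0) * w_rec
print(grad_factor)
```

After only 50 steps the factor is on the order of 1e-33: the early time steps contribute essentially nothing to the weight update, which is the vanishing gradient problem in practice.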

2.3.4 Long Short-term memory (LSTM)

Compared with the RNN, the LSTM has a relatively long memory for handling large sequential datasets (Graves, 2013). The rationale behind the LSTM is an extra gate called the forget gate: where the RNN has only an input gate and an output gate, the forget gate lets the LSTM network decide which information to keep and which to discard. Figure 3 shows the three gates inside an LSTM. The input gate, described by equation (2.1), determines which input values should modify the memory: a sigmoid function lets values through in the range 0 to 1, and a tanh function weights the admitted values into the range -1 to 1. The forget gate, described by equation (2.2), determines which information to discard from the computation; it is a sigmoid that observes the previous state $h_{t-1}$ and the input $x_t$ and outputs a number between 0 and 1 for each entry of the cell state $C_{t-1}$. The last gate is the output gate, described by equation (2.3). In the output-gate block of Figure 3, the input and the memory together determine the output; the sigmoid and tanh functions play the same roles as in the input gate, and the tanh result finally multiplies the sigmoid output.

$$i_t = \sigma\!\left(W_i \cdot [h_{t-1}, x_t] + b_i\right), \qquad \tilde{C}_t = \tanh\!\left(W_C \cdot [h_{t-1}, x_t] + b_C\right) \tag{2.1}$$

$$f_t = \sigma\!\left(W_f \cdot [h_{t-1}, x_t] + b_f\right) \tag{2.2}$$

$$o_t = \sigma\!\left(W_o \cdot [h_{t-1}, x_t] + b_o\right), \qquad h_t = o_t * \tanh(C_t) \tag{2.3}$$
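Equations (2.1)-(2.3) translate directly into code. The sketch below is a generic LSTM step with randomly initialized weights under assumed dimensions (3-dim inputs, 4-dim hidden state), stacking the four gate weight matrices into one `W` for brevity; it is not the project's trained model.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def lstm_step(x_t, h_prev, c_prev, W, b):
    """One LSTM step following equations (2.1)-(2.3); W maps the concatenated
    [h_prev, x_t] to the four gate pre-activations stacked row-wise."""
    n = h_prev.size
    z = W @ np.concatenate([h_prev, x_t]) + b
    i = sigmoid(z[:n])                  # input gate            (2.1)
    c_tilde = np.tanh(z[n:2 * n])       # candidate memory      (2.1)
    f = sigmoid(z[2 * n:3 * n])         # forget gate           (2.2)
    o = sigmoid(z[3 * n:])              # output gate           (2.3)
    c = f * c_prev + i * c_tilde        # cell state: forget old, admit new
    h = o * np.tanh(c)                  # new hidden state      (2.3)
    return h, c

rng = np.random.default_rng(1)
n, d = 4, 3
W, b = rng.normal(size=(4 * n, n + d)), np.zeros(4 * n)
h, c = np.zeros(n), np.zeros(n)
for x_t in rng.normal(size=(5, d)):     # a sequence of five 3-dim inputs
    h, c = lstm_step(x_t, h, c, W, b)
print(h.shape, c.shape)  # (4,) (4,)
```

The line `c = f * c_prev + i * c_tilde` is where the forget gate acts: it scales the old cell state entrywise, which is exactly the mechanism that controls the gradient flow discussed below.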

In this way, LSTM uses this back-propagation way to keep longer memory for remembering huge datasets in series. To be clearer, LSTM takes both datasets and outputs from last state of cell unit


as current inputs to feed into the same cell unit, like neurons in human brains, and then generates outputs as an entity with memory. This entity is then fed into the same cell unit together with new data as inputs. As a result, LSTM networks can have a relatively long memory for huge sequential data. Besides, the gradient of the error in LSTM contains the forget gate's vector of activations, which means the network can keep better control of the gradient values through proper updates of the forget gate's parameters. In other words, LSTM can avoid the vanishing gradient problem by updating parameters according to the guidance from the forget gate.

Figure 3. LSTM Gates.

Yu, Si, Hu and Zhang (2019) examined the recurrent units of different LSTM-based networks. They divided LSTM networks into two coarse classes, LSTM-dominated networks and integrated LSTM networks, and discussed different applications based on LSTM. They concluded that variants of LSTM can perform better than the standard LSTM cell on some tasks, but none of them is superior to the standard LSTM cell in all facets. Besides, LSTM-dominated networks are good at handling the inner relationships inside LSTM cells, while integrated LSTM networks perform better when integrating superior features from other components with LSTM. Sak, Senior and Beaufays (2014) also investigated LSTM RNNs for speech recognition. Using a recurrent projection layer embedded in each LSTM layer (LSTMP RNN), they showed state-of-the-art performance for large-scale acoustic modeling, superior to both standard LSTM and DNN. To sum up, LSTM, with its suitability for sequential data in large contexts, is the more appropriate deep learning model for the texts extracted from retail product packaging in this project.


2.4 The techniques as assistance for building ARPIS

While the basic solution has been found, there are other tools and techniques useful for assisting the building of ARPIS and improving its performance. With regard to using a deep learning model for text classification, it is important to figure out how to make the inputs efficient for the model to process.

2.4.1 Natural Language Processing and its techniques

As for ARPIS, it is essential that the system understand the descriptions on retail product packaging so that it can assign correct categories to retail products; in other words, the system must be able to understand these descriptions in text form. This problem of making a system or a computer understand human language can generally be divided into syntactic analysis and semantic analysis (Moro & Navigli, 2013). Syntactic analysis analyzes the grammar of a language so that the computer can follow grammatical rules to gain a general understanding of texts. Semantic analysis is about digging out the meaning behind a sentence, a paragraph, or even a whole document for a computer to understand. Both of them are core to Natural Language Processing (NLP) (Nadkarni, Ohno-Machado, & Chapman, 2011), described as the process by which a computer understands and extracts useful information, according to defined rules, from huge unstructured raw datasets such as social media posts, documents, emails, and survey responses.

However, can a system or a computer really understand human language? In fact, it is hard for a computer to understand meanings and contextual information. In the field of NLP, there are several techniques that address this problem. Sun, Luo and Chen (2017) proposed a detailed review of NLP techniques used in opinion mining systems. Zeng, Shi, Wu and Hong (2015) applied NLP techniques in bioinformatics to predict protein structure and function and to detect noncoding RNA. Sangers, Frasincar, Hogenboom and Chepegin (2013) used NLP techniques for finding semantic Web services. All of them have shown effective performance for their goals. Regardless of the type of NLP technique, the rationale behind them is simple. If a document is too complicated for a system to understand, it can be sliced into sentences, which is called sentence segmentation; a sentence is much easier to understand than a whole document. A sentence can likewise be split up into words, which is called word tokenization, and a single word is easier still. Another technique is text lemmatization (Nadkarni, Ohno-Machado, & Chapman, 2011), which finds words with the same meaning and converts them into the same form. There are many other techniques for dealing with texts, such as stopword removal, dependency parsing, noun-phrase detection, named entity recognition, and coreference resolution.

In general, these techniques are helpful when preprocessing texts before they are fed into a deep learning model: removing irrelevant or unnecessary words lets the texts be processed more efficiently and saves computation time.
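A minimal sketch of the segmentation, tokenization, and stopword-removal steps just described, in plain Python; the sample sentence and the tiny stopword set are invented for illustration (real pipelines use lists such as NLTK's).

```python
# Toy text standing in for an OCR-extracted product description
text = "The avocado is ripe. It weighs 200 g."

# Sentence segmentation: split on sentence-final periods
sentences = [s.strip() for s in text.split(".") if s.strip()]

# Word tokenization: lowercase whitespace tokens
tokens = [w.lower() for s in sentences for w in s.split()]

# Stopword removal with a toy stopword set
stopwords = {"the", "it", "is", "a", "an"}
filtered = [w for w in tokens if w not in stopwords]
print(filtered)   # ['avocado', 'ripe', 'weighs', '200', 'g']
```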

2.4.2 Word Embedding

Word Embedding is a fundamental technique for capturing nuanced semantics without sentence structure, with excellent performance (Chen, Perozzi, Al-Rfou, & Skiena, 2013), which is important for the text classification in ARPIS. Yu, Wang, Lai and Zhang (2017) viewed word embeddings as a method that uses context from large text corpora to learn continuous vector representations of words in a low-dimensional space. Yu et al. also noted that word embeddings can be considered an unsupervised approach to extracting syntactic and semantic information from unstructured data.

By and large, word embedding is a way of representing words as vectors. A word can be represented in a multi-dimensional space, and the semantic and syntactic similarity between different words can be read off from those vectors. By now there are many convenient pretrained word embeddings available to researchers and developers in the machine learning field, such as Facebook's fastText, Stanford's GloVe, and Google's Word2Vec. In this project, GloVe and Word2Vec are chosen as the basic tools.

2.4.3 Word2Vec

Word2Vec (Mikolov, Chen, Corrado, & Dean, 2013) involves two efficient architectures, continuous bag-of-words (CBOW) and skip-gram, for computing vector representations of words that can later be used in NLP applications and other research. It has provided state-of-the-art word embeddings for researchers to use (Goldberg & Levy, 2014). Word2Vec takes a text corpus as input and generates word vectors as output: it builds a vocabulary from the training text data and learns vector representations of the words. As for the two algorithms inside it, continuous skip-gram handles infrequent words better but is slower to learn; in contrast, CBOW is faster and works well for frequent words but not as well for infrequent ones.

2.4.4 Global vectors for word representation (GloVe)

Pennington, Socher and Manning (2020) proposed an unsupervised learning algorithm called GloVe for representing words as vectors. Training is performed on word-word co-occurrence statistics gathered globally from a large corpus, and the resulting representations display striking linear substructures of the word vector space. In general, GloVe collects words from a huge text corpus and trains vectors for them such that interesting inner substructures appear linear in the vector space. What is used in this project is a pre-trained GloVe word-vector file called glove.6B.100d.txt. It contains a 100-dimensional vector for each word and was trained on a huge corpus of Wikipedia and English Gigaword, which is sufficient for the words in this project.
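The GloVe file format is simple: each line holds a word followed by its vector components. The sketch below parses two toy 4-dimensional lines in place of the real 100-dimensional glove.6B.100d.txt file; the words and values are invented.

```python
import numpy as np

# Two made-up lines in GloVe's text format (word, then vector components)
glove_lines = [
    "apple 0.1 0.2 -0.3 0.4",
    "milk -0.2 0.5 0.1 0.0",
]

embeddings = {}
for line in glove_lines:
    parts = line.split()
    embeddings[parts[0]] = np.asarray(parts[1:], dtype="float32")

print(embeddings["apple"].shape)   # (4,) here; (100,) for glove.6B.100d.txt
```

With the real file, the same loop would iterate over `open("glove.6B.100d.txt", encoding="utf-8")` and build a dictionary of 400,000 entries.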

2.4.5 Mittens

Mittens is a tool proposed by Dingwall and Potts (2018) which extends GloVe so that the GloVe representations can be updated with data from a specialized domain for different projects. Since a pretrained GloVe file is chosen as the word embeddings here, Mittens is very helpful for retraining the GloVe embeddings.

Mittens can be considered an update of GloVe with the specific vocabulary of a field, providing more precise meanings for the words of that field. The rationale behind it is a co-occurrence matrix that records word counts in higher dimensions. For instance, the co-occurrence of a pair of words is the number of times the two words appear together within a specific span, called the context window, of a given context. The size of the window and the direction of its movement are defined according to the situation. The co-occurrence matrix can then be built from the correlations of those word pairs, so that the similarity of words is reflected in the matrix.
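The co-occurrence counting just described can be sketched in a few lines; the token list and window size below are toy values, not the project's actual corpus.

```python
from collections import defaultdict

# Made-up tokens standing in for one OCR-extracted description
tokens = ["whole", "white", "mushrooms", "product", "of", "canada"]
window = 2                         # symmetric context window (assumed size)

cooc = defaultdict(int)
for i, w in enumerate(tokens):
    # every neighbour within `window` positions counts as a co-occurrence
    for j in range(max(0, i - window), min(len(tokens), i + window + 1)):
        if i != j:
            cooc[(w, tokens[j])] += 1

print(cooc[("mushrooms", "product")])   # 1: they are adjacent
print(cooc[("whole", "product")])       # 0: three positions apart, outside window
```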

2.5 Evaluation Metrics

Whether an ARPIS is good or bad must be judged by performance metrics. Hence, it is crucial to select suitable metrics for this system.

Joshi (2016) describes the confusion matrix as a table used to display the performance of a classification model on a test dataset against the true values (see Table 1). The matrix contains four elements: True Positive (TP), True Negative (TN), False Positive (FP), and False Negative (FN).

• TP means that the actual class is yes and the predicted class is also yes: a correctly predicted positive value.

• TN means that the actual class is no and the predicted class is also no: a correctly predicted negative value.

• FP means that the actual class is no but the predicted class is yes: an incorrectly predicted positive value.

• FN means that the actual class is yes but the predicted class is no: an incorrectly predicted negative value.

Table 1. Confusion Matrix.

                        Predicted Class
                        Class=Yes               Class=No
Actual    Class=Yes     True Positive (TP)      False Negative (FN)
Class     Class=No      False Positive (FP)     True Negative (TN)

These four elements can be used for calculating accuracy, precision, recall and F1 score.

➢ Accuracy = (TP+TN)/(TP+FP+FN+TN), the overall share of correct predictions; a higher accuracy usually indicates a better model.

➢ Precision = TP/(TP+FP); a high precision implies a low false positive rate.

➢ Recall = TP/(TP+FN), also called the True Positive Rate.

➢ F1 score = 2*(Recall*Precision)/(Recall+Precision), which is often more informative than accuracy because it involves both FP and FN through Recall and Precision.
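These four formulas can be checked with a small sketch; the confusion-matrix counts below are invented for illustration, not results from this project.

```python
# Toy confusion-matrix counts for a binary classifier
TP, FP, FN, TN = 40, 5, 10, 45

accuracy  = (TP + TN) / (TP + FP + FN + TN)     # (40+45)/100 = 0.85
precision = TP / (TP + FP)                       # 40/45
recall    = TP / (TP + FN)                       # 40/50 = 0.8
f1 = 2 * (recall * precision) / (recall + precision)

print(round(accuracy, 3), round(precision, 3), round(recall, 3), round(f1, 3))
```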

A drawback of the confusion matrix is that its results are only numbers, which makes the behavior of a model or system less obvious. Susmaga (2004) proposed a method for visualizing the confusion matrix to support research on multi-class classification problems, which has helped many researchers control and tune classifiers.


3 Method and implementation

This section first describes the research methods used in this report and the reasons they were chosen. It then goes deeper into the implementation, introducing how the data is collected and preprocessed and how the model is built and trained. In addition, it describes the prediction and evaluation of the system. After explaining the improvements to system performance, it ends with a comparison against another model called Random Multimodel Deep Learning.

3.1 Method

Before conducting this project, the design science research (DSR) method was chosen for the artifact to be built. Hevner and Chatterjee (2010) note that design science research pays attention to both the IT artifact and the relevant application domain, stressing the importance of improving the artifact's usage and effectiveness for dealing with real-world problems. It is an engineering research method that forms an iterable cycle between building design artifacts and evaluating them in order to reach higher performance. It was chosen as the main research method in this project because it is appropriate for designing and modelling a system while making continuous improvements. Pries-Heje, Baskerville and Venable (2008) analyzed evaluation strategies and developed them into a strategic DSR evaluation framework for evaluating the design process, which is useful for DSR researchers.

Following the DSR guideline, an architecture was first built to meet the basic objectives of this report, using the techniques and materials from this research. Then the performance of the incipient system was reviewed, and an iterative modification, retraining GloVe, was made to improve the performance of the system, fulfilling the goal of achieving better performance. The retraining of GloVe can be considered one iteration of this process. Although the pretrained GloVe file could be used as-is, with the subtle imperfection that misspelled words are missing from it, the retraining emphasizes pushing the system's performance as far as possible within the limited time of this project. With this method, the ARPIS is designed, its performance is evaluated, and its performance improves with each iteration; through such an iterable process, a system can be developed to be as good as the researchers want. Within this qualitative method, the most influential variables for reaching higher accuracy and lower average loss are evaluated, in order to determine the appropriate architecture of an end-to-end retail product identification system and to make the system achieve high performance.


3.3 Implementation

This chapter first gives an overview of how the implementation of ARPIS is conducted, then describes how the data is collected and how the metadata is preprocessed for further use in a model. It details model training after discussing model building, and then turns to the evaluation of system performance and the iterative improvements, using two techniques, for achieving better performance. Finally, it ends with a comparison with the Random Multimodel Deep Learning model.

3.3.1 Overview of implementation

As shown in figure 4, the first step in this project is data collection. The dataset used is available online on Github (Klasson, Zhang, & Kjellstrom, 2019), and the selected products with metadata are images of different retail product packaging, such as mushrooms, avocados, and eggs. These selected product images form the validation image datasets in figure 4. Then, an investigation of the most commonly used OCRs is conducted through literature research to determine the most suitable one for recognizing the texts on those images. Using the most accurate OCR, texts are extracted from the blocks identified on the images and pre-processed, so that the descriptions of retail products can be stored in an editable form such as txt files. To classify these texts into retail product categories, the extracted information, such as nutrition facts, weight, and brand name, is fed into text classifiers as input. These descriptions are saved in a database that serves for matching any purchased product. The matching process is based on text classification and outputs a score; if this score exceeds a threshold determined by retailers, the identification of the product is confirmed. In this project, as there is no third party playing the role of a retailer to determine the threshold above which a product counts as identified, evaluation metrics are used to measure the system performance. The second process in figure 4 is a simple display of the top process, showing how the metadata in images is processed and stored in the other dataset as textual descriptions (see figure 4).


3.3.2 Data Collection

Klasson, Zhang and Kjellstrom (2019) collected 5125 natural images from 81 different classes of retail products, all taken with a smartphone camera. These retail products can also be found in different grocery stores. Since the authors merged the 81 fine-grained classes into 42 coarse-grained classes, 39 classes were selected as metadata in this project. Two different brands of apples, for instance, may belong to the same coarse-grained class, whereas two different coarse-grained classes mean two different grocery types, such as eggs and milk. Figure 5 shows some examples of these natural images taken by Klasson, Zhang and Kjellstrom. In addition to these images, Klasson, Zhang and Kjellstrom also made iconic images with corresponding product descriptions for some products. For images of pure products only, like a bunch of apples, the corresponding descriptions are used in this project. Others, like images of a carton of milk, are used only as images, which means they proceed to OCR scanning.

Figure 5. Grocery images. (Klasson, Zhang, & Kjellstrom, 2019)

Optical Character Readers Selection

Gabasio (2013) compared the main optical character readers from both commercial and open-source software, covering the most popular ones: Tesseract, Ocrad, CuneiForm, GOCR, OCRopus, TOCR, Abbyy CLI OCR, Leadtools OCR SDK, and OCR API Service. By testing many images of varying quality, such as skewed images, underlined images, and images with pictures in the text, Gabasio concluded that TOCR is the best optical character reader, with the lowest mean error of 8.79%. Although it is commercial software, its performance compared with the others makes it worth choosing; besides, TOCR has a sample version that is free to use. Therefore, TOCR was chosen as the extraction tool for collecting texts from the retail product images.

The figure 6 shows some examples of two retail products’ descriptions in texts extracted by TOCR and stored in txt files. The three upper extracted descriptions refer to a kind of avocado of some brand. The two bottom descriptions relate to a type of baby spinach.

Figure 6. Retail products' descriptions in texts.

3.3.3 Data Preprocessing

After TOCR was used to extract the texts from the retail product images as the datasets, these datasets were divided into two groups: a training dataset and a test dataset. Each group contains 39 product folders, with several txt files inside each folder. The txt files in the same folder are extracted from the same retail product, but from different positions on the product or from different scanning passes of TOCR, because the OCR tool cannot extract completely correct text from an image in one pass and needs to scan many times to gather as much correct information as it can. Despite the repeated scans, the extracted texts do not always have completely correct spelling because of the quality of the OCR. In fact, it does not matter whether the texts are spelled correctly, as long as all the needed information is captured, such as the place of origin, brand, product name, net weight, nutrition information, and ingredients.


Before the datasets can be fed into the model of ARPIS, the description texts need to be extracted from the txt files and converted into a list of sub-lists of word tokens, meaning each sentence or paragraph is divided into words separated by commas. For example, one txt file is converted into a list like ['Pailenti', 'Chrie', 'WHOLE', 'MUSHROOMS', 'CHAMPIGNONS', 'ENTIERS', 'BLANCS', 'DOUX', 'ET', 'DEAICATS', '454', 'g', 'PRODUCT', 'OF', 'CANADA', 'PRODUIT', 'DU', 'CANADA', 'LOBLAWS', 'INC', 'MMCLOBLANS', 'NC']. Each txt file yields one such sub-list, and after extracting information from all txt files, a description list containing many sub-lists is generated. Secondly, the label identifying each product is extracted from each txt file. Since the name of each txt file is composed of "Text" and a number from 0 to 38, the number is extracted as the distinguishing mark. In this way a label list like [0,0,0…1,1,1…2,2,2…38,38,38…] is obtained. For each number in this list, its position index matches the position index of the corresponding description in the description list. That is to say, the txt files are split into two parts, a description list and a label list, with the same index value mapping their relationship.
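The pairing of description sub-lists with labels taken from the "Text<number>" file names can be sketched as follows. The file names and contents are simulated in a dictionary instead of being read from disk, and the file-name suffix "_2" is a hypothetical convention for a second scan of the same product.

```python
import re

# Simulated folder contents: file name -> OCR-extracted description
files = {
    "Text0.txt": "WHOLE MUSHROOMS 454 g PRODUCT OF CANADA",
    "Text0_2.txt": "CHAMPIGNONS ENTIERS BLANCS",
    "Text38.txt": "BABY SPINACH 142 g",
}

descriptions, labels = [], []
for name, content in files.items():
    descriptions.append(content.split())                     # one sub-list of tokens
    labels.append(int(re.match(r"Text(\d+)", name).group(1)))  # label from file name

print(labels)   # [0, 0, 38] -- index i of labels matches index i of descriptions
```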

Data Cleaning

Once the data was loaded, a function called text_to_wordlist was used to convert word tokens into lowercase and to remove stopwords. Stopwords are common words with such high frequency that they are not important for classification, so they are removed to reduce memory usage and computation time. The piece of code below shows how to remove stopwords in English.

if remove_stopwords:
    stops = set(stopwords.words("english"))
    text = [w for w in text if not w in stops]

Next, regex functions were used for cleaning the data. For instance, the following piece of code uses regex to remove noise such as punctuation and special characters that do not have a positive impact on the classification.

text = re.sub(r"[^A-Za-z0-9^,!.\/'+-=]", " ", text)
text = re.sub(r"what's", "what is ", text)
text = re.sub(r"\'s", " ", text)
text = re.sub(r"\'ve", " have ", text)

Besides, words were converted to their stems in order to reduce the size of the inputs more efficiently. Many words share the same stem, which is the basic form of a family of words; for example, "cats" is converted into its stem "cat".


if stem_words:
    text = text.split()
    stemmer = SnowballStemmer('english')
    stemmed_words = [stemmer.stem(word) for word in text]
    text = " ".join(stemmed_words)

The most important cleaning steps are cited here with pieces of code, for better explanation and understanding by both professional developers and general readers.

3.3.4 Pre-trained GloVe and Word Embedding Preparation

In this pre-trained txt file called GloVe, each line contains a word followed by its 100 vector components. Those components represent the word in 100 dimensions, and the file includes 400 thousand words. If two words have similar meanings or can be classified into one category, the cosine similarity of their vectors will be close to 1. In other words, the more similar two words are in meaning, the closer their positions are in the vector space.
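The similarity measure just described can be sketched directly; the 4-dimensional vectors below are toy values, not real GloVe entries, and the word choices are invented for illustration.

```python
import numpy as np

def cosine_similarity(u, v):
    # cos(theta) = u.v / (|u| |v|); close to 1 for similar words
    return float(np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v)))

# Toy stand-ins for 100-dimensional GloVe vectors
milk   = np.array([0.8, 0.1, 0.3, 0.0])
cream  = np.array([0.7, 0.2, 0.4, 0.1])
canada = np.array([-0.5, 0.9, -0.1, 0.2])

# "milk" should sit closer to "cream" than to "canada" in embedding space
print(cosine_similarity(milk, cream) > cosine_similarity(milk, canada))   # True
```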

Before adding the embedding layer, an embedding matrix has to be created first; it is multiplied with one-hot word vectors to get the feature vector of each word. To begin with, null pseudo-words are created for padding in the embedding matrix. The following piece of code shows the whole process in detail.

nb_words = min(MAX_NB_WORDS, len(word_index)) + 1
embedding_matrix = np.zeros((nb_words, EMBEDDING_DIM), dtype='float32')
for word, i in word_index.items():
    if word in word2vec.vocab:
        embedding_matrix[i] = word2vec.word_vec(word)
print('Null word embeddings: %d' % np.sum(np.sum(embedding_matrix, axis=1) == 0))

3.3.5 Division of Data

A parameter VALIDATION_SPLIT=0.2 was set at the beginning, meaning that 20% of the training dataset was used as the validation set and the remaining 80% as the actual training set. The random.permutation function from the NumPy library in Python was used to shuffle the data randomly, so that a randomly drawn 20% of the training dataset became validation data while the rest remained training data. The piece of code is cited below for legibility.

perm = np.random.permutation(len(data))
idx_train = perm[:int(len(data) * (1 - VALIDATION_SPLIT))]
idx_val = perm[int(len(data) * (1 - VALIDATION_SPLIT)):]
data_train = data[idx_train]
labels_train = label[idx_train]
data_val = data[idx_val]
labels_val = label[idx_val]


3.3.6 Definition of Model Structure

With everything prepared, it is time to focus on the model structure, demonstrated in figure 7. A word embedding layer was defined with parameters such as the number of words, the embedding dimension, the initial weights, and the input length. Then an LSTM layer was added with parameters such as the number of LSTM units in the layer, the dropout rate, and the recurrent dropout rate. After the data was put into sequences, the sequential data passed as input through the embedding layer and then through the LSTM layer. Some nodes in the LSTM layer are dropped out randomly to make the results more reliable. In figure 7, a purple arrow shows the backpropagation characteristic of LSTM for processing sequential data. Next, batch normalization was used for feature scaling, and a Dense layer was added subsequently; the same procedures of randomly dropping out nodes and normalizing batch data were applied to the Dense layer. Finally, an output layer with softmax activation was added to the model because of the multiple classes in this project. Figure 7 displays the different layers in red squares: the word embedding layer, LSTM layers, Dense layers, and output layer. The red crosses mean the dropout of random nodes in the layers. The whole structure is based on recurrent neural networks, indicated by green arrows.


3.3.7 Model Training

Next, the model was trained and fed with the validation data split off earlier at a rate of 0.2 from the training dataset. categorical_crossentropy was chosen as the loss function because the datasets have multiple classes; it requires the to_categorical function as a tool for converting the label numbers into binary form. As the optimizer, adam was chosen for its ability to adapt to different situations by itself. The model was trained for 200 epochs, each epoch being one iteration over the datasets, and a batch size of 128 was set based on empirical knowledge. To make training more efficient, the EarlyStopping function was used to stop training when a monitored quantity stops improving, saving training time. The ModelCheckpoint function was also applied to save the model after every epoch together with the epoch number and the validation loss, in case of any error or computer breakdown. All of the functions mentioned above come from the Keras library.

3.3.8 Prediction and Evaluation

Finally, the well-trained model can be used to predict on the test datasets and produce results. Since the model.predict function was used for this multi-class case, the predicted results are encoded values. To make the results easy to read, the argmax function in NumPy was used to convert the predicted results into label numbers, as shown in the following piece of code.

preds = model.predict(test_data)
y_pred_bool = np.argmax(preds, axis=1)

Once the results were generated by the model, its performance needed to be evaluated to judge whether the model is good or not. Firstly, the evaluate function was used to measure the loss value and accuracy on the test data.

results = model.evaluate(test_data, test_ids, batch_size=128, verbose=1)

Secondly, the confusion matrix is used for model evaluation. Since Scikit-learn provides a library called sklearn.metrics for performance measurement, it is easy to use the confusion_matrix and classification_report functions from that library to obtain every measured metric for this project, such as precision, recall, F1-score, and accuracy. This detailed information is generated automatically after calling the classification_report function, which makes it effortless for researchers to use.
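As a runnable sketch of those two sklearn.metrics calls, with invented labels for three product classes in place of the project's real predictions:

```python
from sklearn.metrics import classification_report, confusion_matrix

# Toy ground-truth and predicted labels for three classes (0, 1, 2)
y_true = [0, 0, 1, 1, 2, 2]
y_pred = [0, 1, 1, 1, 2, 0]

# Rows are true classes, columns are predicted classes
print(confusion_matrix(y_true, y_pred))

# Per-class precision, recall, F1-score, and overall accuracy
print(classification_report(y_true, y_pred, zero_division=0))
```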


3.3.9 Challenges – GloVe Retraining

While using the pre-trained GloVe file as the embedding in this project, a common challenge arises with the training dataset. Although the pre-trained GloVe file includes vector representations for 400000 words, these trained word embeddings are still not precise enough for every project: the training datasets of different projects have their own special terms from special fields. For instance, in this project there are words with wrong English spelling caused by the quality of the OCR scanning, and some foreign-language words are scanned by the OCR as well. These words are particular to this project, and they should be represented as vectors for the later prediction stage. They are so significant to the system's recognition of products that they cannot be ignored; they are also the breakthrough point for further improvement of system performance. To solve this problem, the GloVe file has to be retrained using the datasets of this project, so that those special words can be given vector representations.

Use of Word2Vec

At first, the Word2Vec function was chosen to give vectors to the special words in this project. Because its first parameter, sentences, should be a list of sub-lists of tokens, the dataset was processed to comply with this requirement. All words in the GloVe file and their corresponding vectors were put into a Python dictionary, with each word as key and the word's vector as value. The Word2Vec parameter sg was set to 0, which means continuous bag of words (CBOW) was used rather than the skip-gram algorithm, because the accuracy was slightly higher than with sg=1. The following piece of code shows the Word2Vec function with its parameters.

em_model = Word2Vec(text_data, size=100, window=1, min_count=1, workers=2, sg=0)

After training on the datasets with Word2Vec, a model called em_model was produced, mapping word vectors to the word tokens of this project's datasets. Then a second Python dictionary was used to hold the words of this project's datasets, following the same scheme as the first dictionary. By comparing with the second dictionary's keys, the first dictionary was updated with the new words that are not included in glove.6B.100d.txt but appear only in this project's datasets. Finally, this retrained model was written to a txt file of words and their 100-dimensional word vectors, called the retrained GloVe file.
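A minimal sketch of that dictionary update, with toy 3-dimensional vectors standing in for the real GloVe and Word2Vec outputs; the words are invented examples (one OCR misspelling from the dataset excerpt shown earlier):

```python
# First dictionary: words and vectors loaded from glove.6B.100d.txt (toy values)
glove_vectors = {"mushroom": [0.1, 0.2, 0.3]}

# Second dictionary: words and vectors from Word2Vec retraining on this dataset
project_vectors = {
    "mushroom": [0.4, 0.0, 0.1],
    "deaicats": [0.2, 0.5, 0.1],   # OCR misspelling, absent from GloVe
}

# Add only the words GloVe does not know; keep the pretrained vectors otherwise
for word, vec in project_vectors.items():
    if word not in glove_vectors:
        glove_vectors[word] = vec

print(sorted(glove_vectors))   # ['deaicats', 'mushroom']
```

Each entry of the merged dictionary would then be written out as one "word v1 v2 … v100" line to produce the retrained GloVe file.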

Use of Mittens

Moreover, the other tool, Mittens, was used. Before using Mittens, the pretrained model needed to be loaded with the glove2dict function, and glove.6B.100d.txt was still a good choice for this project. Meanwhile, the training dataset also needed preprocessing, such as removing stopwords. Some words, called out-of-vocabulary (OOV) words, are not included in the pretrained GloVe file; they are what is used to build the co-occurrence matrix for this project's dataset in the next step. Because of the large space complexity of O(n²), some very rare words among the OOVs had to be filtered out. Then CountVectorizer was used to convert the documents into a word-document matrix. After that, the fit function could be used to train the model instantiated by Mittens. Finally, the new embedding was produced and could be used for training.

3.3.10 Comparison with Random Multimodel Deep Learning (RMDL) Model

RMDL (Kowsari, Heidarysafa, Brown, Meimandi, & Barnes, 2018) stands for Random Multimodel Deep Learning, an ensemble of three different deep learning architectures: Deep Neural Networks (DNN), Recurrent Neural Networks (RNN), and Convolutional Neural Networks (CNN). In this project, RMDL was compared with the LSTM-based model structure to demonstrate the two models' performance for ARPIS from different perspectives.

While running RMDL on the datasets of this project, it was found that the model requires a very large amount of memory. A cloud service called MistGPU therefore had to be used to run it. MistGPU is a cloud service that rents out GPUs online to deep learning users. In this project, a Tesla V100 server with 96 GB of memory was chosen as the GPU.

First, the same data preprocessing procedures were carried out, including data loading and data cleaning. In addition, the array function in NumPy was used to convert both the test data and the training data into array form, to make them suitable for the RMDL model.
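As a small illustration of this conversion (with toy data in place of the project's actual lists):

```python
import numpy as np

# Convert Python lists of texts and integer labels into NumPy arrays,
# the input form expected by the RMDL model (toy data shown here).
train_texts = ["fresh red apple", "organic banana", "ripe avocado"]
train_labels = [0, 1, 2]

x_train = np.array(train_texts)
y_train = np.array(train_labels)

print(x_train.shape, y_train.shape)  # (3,) (3,)
```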

In order to use this multiple-model architecture, the RMDL.Text_Classification function has to be called. This function has many parameters, all of which have default values. For the dataset in this project, the parameters were set as follows. EMBEDDING_DIM was set to 100, indicating that the embedding dimension is 100. The absolute path and file name of the retrained GloVe file were given to the parameters GloVe_dir and GloVe_file. Since the label list was integer-encoded, sparse_categorical was set to 1; if the labels had been binary or one-hot encoded, sparse_categorical would have been set to 0. Next, the DNN, RNN and CNN components were each configured with three models by setting the parameter random_deep to the list [3, 3, 3]. Another list, [30, 50, 30], was used to define the epochs, representing the number of epochs for the DNN, RNN and CNN models respectively. Since RMDL is a mature tool, the results are generated automatically once the necessary parameters are set, without any need to know its internal details.
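The parameter settings described above can be collected as follows. The GloVe path is a placeholder, and the commented-out invocation reflects RMDL's documented entry point rather than verified code:

```python
# Parameter settings for RMDL's text classification as described in the text.
# The GloVe directory below is a placeholder, not the project's actual path.
rmdl_params = {
    "EMBEDDING_DIM": 100,          # dimension of the retrained GloVe vectors
    "GloVe_dir": "/path/to/",      # directory holding the retrained GloVe file (placeholder)
    "GloVe_file": "retrained_glove.txt",
    "sparse_categorical": 1,       # 1: integer-encoded labels; 0: binary/one-hot labels
    "random_deep": [3, 3, 3],      # number of DNN, RNN and CNN models respectively
    "epochs": [30, 50, 30],        # epochs for DNN, RNN and CNN respectively
}

# Hypothetical invocation (training needs a large GPU, so it is not run here):
# from RMDL import RMDL_Text
# RMDL_Text.Text_Classification(x_train, y_train, x_test, y_test, **rmdl_params)
```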


4 Findings and analysis

In this findings and analysis section, the results of the work are first presented in tables for clear display and understanding. These tables show how the system's performance was evaluated from several aspects. Then the results, obtained with the classification_report function from the sklearn.metrics library in Python, are analyzed. This paves the way for the discussion in the next chapter, where the research questions of this report are addressed.

4.1 Results

4.1.1 LSTM-based Model Results

From the prediction process of the LSTM-based model, the loss value on the test dataset was 0.407 and the accuracy was around 0.849. Since the classification_report function provided by Scikit-learn was used, the results could be obtained directly and automatically, from code to visualization, down to the precision, recall, F1-score and accuracy of each product. The products listed in Table 2 and Table 3 are specific products such as mushroom and avocado. To give a clear view of the results, they have been tabulated in a classification report (see Table 2). All the required evaluation metrics are included: precision, recall, F1-score and accuracy. In addition, the macro average and weighted average of these metrics are reported. The macro average of a metric (for example, the macro average of precision) gives every class the same weight, regardless of how many samples the class contains. In contrast, the weighted average weights each class by its support, i.e. by the number of true samples of that class, which is why the weighted average is better suited to imbalanced classification problems. The "support" column in the table shows this number of true samples per class. Clearly, the classes do not all contain the same number of samples, which means the weighted average is appropriate for the imbalanced classification problem in this project.
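The difference between macro and weighted averaging can be illustrated with three made-up classes (the F1-scores and supports below are invented for the example, not taken from the project's results):

```python
# Macro vs. weighted averaging of per-class F1-scores,
# illustrated with three toy classes as (name, f1, support).
classes = [
    ("product A", 0.90, 100),
    ("product B", 0.50, 10),
    ("product C", 0.70, 40),
]

# Macro average: every class gets the same weight regardless of its size.
macro_f1 = sum(f1 for _, f1, _ in classes) / len(classes)

# Weighted average: each class is weighted by its support (the number of
# true samples of that class), so large classes dominate the result.
total_support = sum(s for _, _, s in classes)
weighted_f1 = sum(f1 * s for _, f1, s in classes) / total_support

print(round(macro_f1, 3))     # → 0.7
print(round(weighted_f1, 3))  # → 0.82
```

Note how the large, well-classified "product A" pulls the weighted average above the macro average; with an imbalanced dataset like the one in this project, the two can differ noticeably.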

4.1.2 Model Results after Retraining GloVe

Results of Using Word2Vec

After retraining GloVe with the new word vectors produced by Word2Vec, the accuracy on the test dataset rose slightly to 0.862 and the loss value was around 0.379. The classification report, generated automatically by classification_report, shows the detailed metrics for each predicted product, as depicted in Table 3: the precision, recall and F1-score of every product can be read directly from the table. The average accuracy was 0.86, a slight increase compared with the initial model.

Table 2. Classification Report

precision recall f1-score support

product 0    0.80  0.87  0.84   150
product 1    0.00  0.00  0.00    13
product 2    1.00  1.00  1.00    77
product 3    1.00  0.99  0.99    76
product 4    0.91  0.91  0.91    64
product 5    1.00  0.99  1.00   151
product 6    0.92  0.93  0.92   126
product 7    1.00  1.00  1.00   152
product 8    0.89  0.95  0.92    78
product 9    0.74  0.77  0.75   175
product 10   0.99  0.96  0.97   315
product 11   1.00  1.00  1.00   201
product 12   0.88  0.86  0.87   178
product 13   0.76  0.74  0.75    78
product 14   0.77  0.75  0.76   198
product 15   0.46  0.76  0.57   100
product 16   1.00  0.88  0.94    25
product 17   0.97  0.84  0.90   224
product 18   0.90  0.78  0.84   317
product 19   0.59  0.78  0.67   122
product 20   0.95  0.98  0.96   322
product 21   0.93  0.96  0.94   276
product 22   0.81  0.72  0.76    58
product 23   0.99  0.99  0.99   295
product 24   0.99  0.99  0.99   311
product 25   0.87  0.85  0.86   535
product 26   0.78  0.81  0.80   369
product 27   0.90  0.41  0.56    22
product 28   0.93  0.05  0.10   248
product 29   0.63  0.98  0.77   774
product 30   0.95  0.43  0.59   361
product 31   0.98  0.99  0.99   159
product 32   0.93  0.95  0.94   134
product 33   0.99  0.97  0.98   237
product 34   0.98  0.92  0.95   162
product 35   0.64  0.79  0.70   113
product 36   0.68  0.57  0.62   114
product 37   0.86  0.84  0.85   129
product 38   1.00  0.94  0.97   136
accuracy                 0.85  7575


weighted avg 0.87 0.85 0.84 7575

Table 3. Classification Report after Retraining.

precision recall f1-score support

product 0    0.86  0.85  0.86   150
product 1    0.00  0.00  0.00    13
product 2    0.99  0.99  0.99    77
product 3    0.97  0.97  0.97    76
product 4    0.92  0.88  0.90    64
product 5    1.00  0.99  1.00   151
product 6    0.92  0.90  0.91   126
product 7    0.99  1.00  1.00   152
product 8    0.97  0.87  0.92    78
product 9    0.73  0.76  0.74   175
product 10   0.99  0.96  0.98   315
product 11   0.99  1.00  0.99   201
product 12   0.87  0.86  0.86   178
product 13   0.66  0.71  0.68    78
product 14   0.82  0.77  0.79   198
product 15   0.46  0.82  0.59   100
product 16   1.00  0.80  0.89    25
product 17   0.95  0.86  0.90   224
product 18   0.94  0.76  0.84   317
product 19   0.60  0.87  0.71   122
product 20   0.95  0.97  0.96   322
product 21   0.94  0.95  0.95   276
product 22   0.88  0.76  0.81    58
product 23   1.00  0.99  0.99   295
product 24   0.99  1.00  1.00   311
product 25   0.86  0.86  0.86   535
product 26   0.80  0.83  0.81   369
product 27   0.92  0.55  0.69    22
product 28   0.65  0.46  0.54   248
product 29   0.74  0.86  0.79   774
product 30   0.77  0.66  0.71   361
product 31   0.99  0.99  0.99   159
product 32   0.96  0.96  0.96   134
product 33   1.00  0.97  0.98   237
product 34   0.98  0.90  0.94   162
product 35   0.67  0.59  0.63   113
product 36   0.63  0.76  0.69   114
product 37   0.90  0.88  0.89   129
product 38   0.98  0.95  0.97   136
accuracy                 0.86  7575

References
