Mobile Object Detection using TensorFlow Lite and Transfer Learning


Academic year: 2021



Master in Computer Science
Date: August 27, 2018
Supervisor: Pawel Herman
Examiner: Danica Kragic

Swedish title: Objektigenkänning i mobila enheter med Tensorflow Lite


With the advancement in deep learning in the past few years, we are able to create complex machine learning models for detecting objects in images, regardless of the characteristics of the objects to be detected. This development has enabled engineers to replace existing heuristics-based systems in favour of machine learning models with superior performance. In this report, we evaluate the viability of using deep learning models for object detection in real-time video feeds on mobile devices in terms of object detection performance and inference delay as either an end-to-end system or feature extractor for existing algorithms. Our results show a significant increase in object detection performance in comparison to existing algorithms with the use of transfer learning on neural networks adapted for mobile use.



Developments in deep learning in recent years mean that we are able to create more complex machine learning models for identifying objects in images, regardless of the attributes or character of the objects. This development has enabled researchers to replace existing heuristics-based algorithms with machine learning models with superior performance. This report aims to evaluate the use of deep learning models for performing object detection in video on mobile devices with respect to performance and execution time. Our results show a significant increase in performance relative to existing heuristics-based algorithms when using deep learning and transfer learning in artificial neural networks.


1 Introduction
  1.1 Problem statement
  1.2 Scope
  1.3 Thesis outline
2 Background
  2.1 History of Computer Vision
  2.2 Definitions
    2.2.1 Classification
    2.2.2 Object Detection
    2.2.3 Real-Time Object Detection
    2.2.4 Training and inference
      Mean Average Precision
    2.2.5 Precision and Recall
    2.2.6 Cost function
    2.2.7 Hyperparameters
  2.3 Relevant Theory
    2.3.1 Artificial Neural Networks
      Architecture
      Feed-forward Neural Networks
      Deep Neural Networks
      Activation function
      Rectified Linear Units
      Softmax
      Learning
      Algorithms
      Generalisation
      Regularisation
    2.3.2 Convolutional Neural Networks
    2.3.3 Transfer learning
    2.3.4 Sliding Window Detector
    2.3.5 Existing heuristics based algorithm
  2.4 Related Work
    2.4.1 R-CNN, Fast R-CNN & Faster R-CNN
    2.4.2 SSD
    2.4.3 YOLO, YOLOv2, YOLOv3 & Tiny YOLO
    2.4.4 MobileNets
    2.4.5 Inception
    2.4.6 ResNet
  2.5 Tools and Utilities
    2.5.1 TensorFlow
    2.5.2 TensorFlow Mobile
    2.5.3 TensorFlow Lite
    2.5.4 CUDA and cuDNN
3 Method and experiments
  3.1 Data
    3.1.1 Existing data
    3.1.2 Data gathering and processing
      Instagram-scraper
      RectLabel
  3.2 Hardware
  3.3 Choice of base models
  3.4 Model training
    3.4.1 Tiny YOLO
    3.4.2 SSD and Faster R-CNN
  3.5 Hyperparameter selection
  3.6 Hyperparameter optimisation
  3.7 Data augmentation
  3.8 Measuring and evaluating model performance
    3.8.1 Evaluating model inference time
    3.8.2 Evaluating heuristic model versus Machine Learning (ML) model
4 Results
  4.1 mAP Performance
  4.2 Inference time
  4.5 Heuristics vs Machine Learning
5 Discussion
  5.1 Performance/latency payoff
  5.2 Augmented network performance
  5.3 Deep learning vs heuristics
  5.4 Quality of data
  5.5 Hyperparameter tuning
  5.6 Sustainability and ethics
6 Conclusions

AI Artificial Intelligence.

ANN Artificial Neural Networks.

CNN Convolutional Neural Networks.

CPU Central Processing Unit.

CV Computer Vision.

DL Deep Learning.

DNN Deep Neural Networks.

FLOPS Floating Point Operations Per Second.

FPS Frames Per Second.

GPU Graphics Processing Unit.

mAP Mean Average Precision.

ML Machine Learning.

RAM Random-Access Memory.

ReLU Rectified Linear Units.

TF TensorFlow.

TFL TensorFlow Lite.

TFM TensorFlow Mobile.


With the advancement in Deep Learning (DL) in the past few years, we are able to create complex ML models for detecting objects in images, regardless of the characteristics of the objects to be detected. This development has enabled engineers to replace existing heuristics-based systems in favour of ML models with superior performance [37].

As people are using their mobile phones to a larger extent, and also expect increasingly advanced performance [43] from their mobile applications, the industry needs to adopt more advanced technologies to meet these expectations. One such adaptation could be the use of ML algorithms for object detection.

ML is commonly divided into two phases, namely the training phase and the inference phase. Training is the phase where a model, usually a neural network, is trained to behave a certain way based on given datasets. This step can easily be carried out in the cloud, and the trained models can then be distributed to mobile devices, where they are used for inference on previously unknown data.

When applying more advanced technologies and algorithms in a mobile environment, one of the challenges is the limited computational power of the mobile hardware. As inference is computationally expensive, it is crucial that operations are optimised for mobile devices. By using the mobile version of TensorFlow (TF) [30], namely TensorFlow Mobile (TFM) [22], and the updated mobile framework TensorFlow Lite (TFL) [21], developers are able to use pre-trained models on mobile devices for inference with optimisation for mobile hardware.

The goal of this thesis is to evaluate the feasibility of using DL models for detecting Post-it® notes on mobile devices in comparison to the current heuristic-based models for detecting Post-it® notes from a live camera feed.


Problem statement

The task of detecting Post-it® notes is challenging as their geometrical shape is similar to many other objects, and when obscured this shape is altered to a large extent. Without any other distinguishing characteristics, the note is easily mistaken for another object, and vice versa.

Specifically, this thesis aims to examine whether DL models can be used on mobile devices to outperform the existing heuristic-based visual object detection algorithms in terms of recall performance. Furthermore, this thesis examines the constraints in delay and computational time in the inference phase for the use of DL models on mobile devices. The examination is limited to the development and evaluation of the DL models and does not cover the implementation and deployment of the DL models in any end-user mobile applications, but solely the development of an Android application for measuring ML model inference time.



The assignment entails the development of an ML model running on a mobile device capable of detecting Post-it® notes in real-time from a video feed. The following challenges have been identified.

1. Computational cost during the inference (recall) phase of such a model, as it should be capable of running on a mobile device with limited computational power. As multiple objects might exist in a single frame, the frame must be divided into a grid of multiple cells, where each cell is analysed independently, which inherently increases the computational cost.

2. The need to distinguish between similar objects.

3. Identification of multiple objects in a single frame, where some objects might be only partially visible, and others are overlapping.


[...] will focus heavily on the use of Convolutional Neural Networks (CNNs) [25, 17], as CNNs have proven useful for Computer Vision.


Thesis outline

This report follows a standard academic outline for research papers and is divided into multiple chapters. Chapter 1 (Introduction) introduces the research subject and the question to be explored as well as brief information regarding the stakeholders of the paper. Chapter 2 (Background) consists of an overview of the research subject and its corresponding definitions, as well as relevant theory and previous work. Furthermore, the chapter outlines tools and utilities to be used in later chapters. The primary purpose of chapter 2 is to present the reader with the theoretical knowledge required to follow the arguments and conclusions in later chapters and to grasp the results and their interpretation. Chapter 3 outlines the experiment in terms of model construction, model tuning, gathering of data as well as evaluation methods for the ML models and the heuristic model comparison. Chapter 4 presents the results from the experiments. Chapter 5 contains discussions and reflections on the achieved results. Chapter 6 contains the final interpretations of the results and the discussion that followed from them, as well as suggestions for further research, refinements and improvements.




History of Computer Vision

Computer Vision (CV) emerged in the late 1960s as a subset of Artificial Intelligence (AI), where scientists intended to mimic the functionality of the human vision system. It was believed that processing data from digital images in order to achieve a high-level understanding of it, and unravelling symbolic data from the image data, was an easy task, namely the "visual input" problem [52]. It did not take long until researchers grasped the complexity of transforming retina input into symbolic information.

As the field evolved in the 1980s, researchers focused on mathematical models and techniques to analyse images, like edge and contour detection. During this time, researchers noticed correlations between various algorithms in CV, and algorithms were unified to a higher extent [52].

In the late 2000s, the domain of ML models for visual recognition emerged, and it is currently dominating the field of CV. With the rise of large amounts of labelled data, sophisticated algorithms and increasing computational power, these ML models are able to categorise objects without human supervision [52]. The MNIST [28] dataset (Figure 2.1) was commonly used to evaluate the performance of ML models for visual recognition [56].

Currently the most commonly used algorithms for object detection (section 2.2.2) in CV are CNNs [46], which have been shown to surpass human-level performance on image classification [14] (section 2.2.1).

The characteristics of Artificial Neural Networks (ANN) computations are similar to the characteristics of real-time graphics computations in video game rendering, where operations such as matrix multiplications and division are executed per pixel in parallel. Around 2005, researchers realised the potential advantages of performing ANN computations on Graphics Processing Units (GPUs) rather than Central Processing Units (CPUs), resulting in faster computations and higher performance. This enabled researchers to add more layers to ANNs, also known as deeper ANNs, and to use more data while maintaining a reasonable execution time [12].

Figure 2.1: Example images from the MNIST data with the corresponding correct labels. The dataset consists of thousands of images of handwritten digits.





The process of specifying which of the k possible categories some input x belongs to is referred to as a classification problem. This is described as producing a function $f : \mathbb{R}^n \to \{1, \dots, k\}$. The output could be the predicted class y, or a vector Y with the probability distribution over all k classes [12]. Image classification is the task of determining the category to which the object in the image belongs.


Object Detection

Object detection is the process of detecting objects in an image by applying a recognition algorithm to all sub-windows of the original image, covering one or multiple classes of objects [52].


Object detection could, for example, be used to detect faces, pedestrians or cars in an image. In Figure 2.2 the YOLO network is used to detect objects in an image, where the process of dividing the image into multiple sub-windows is visualised [40, 39]. Object detection requires localisation of the objects within the image, which classification does not [11].

Figure 2.2: The left image shows how the YOLO architecture splits the image into a grid, and the middle image displays how this grid is used to evaluate multiple sub-windows of the image. The right image displays the original image with the corresponding ground-truth (GT) boxes [40].


Real-Time Object Detection

When videos consisting of multiple images (frames) per second are analysed and the detection is executed in real-time, the object detection algorithm is considered a real-time object detection algorithm. Analysing multiple images per second puts heavy emphasis on efficient algorithms, as the computational power required is increased.


Training and inference

In ML, the training phase is where the model parameters θ are optimised to minimise the cost function, whereby the model learns the mapping function f∗ from input to output.

The inference phase of an ML model is when the fully trained model is shown some input x and produces some output y derived from the learnt function composition.


A commonly used performance metric in object detection is Mean Average Precision (mAP), as defined by PASCAL VOC [6]. Better performance is indicated by a higher mAP value given the ground-truth boxes and given classes for an object detection task.

In order to use mAP in object detection, all predicted boxes and classes are sorted in decreasing order of probability and matched with ground-truth boxes and classes. If the classes of the prediction and the ground truth match, and their Intersection over Union (IoU, also known as the Jaccard index) (Figure 2.3) is greater than or equal to 0.5 (0.5IOU), the prediction is considered a match. The match is counted as a true positive if and only if the ground-truth box has not previously been used, to mitigate duplicate detections of objects [4].

The Average Precision is computed as the area under the precision/recall curve by numerical integration, and the mAP is obtained by calculating the mean of the Average Precision of all classes.
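The matching procedure and the area computation described above can be sketched as follows. This is a minimal illustration under stated assumptions, not the PASCAL VOC reference implementation: boxes are assumed to be `(x_min, y_min, x_max, y_max)` tuples, the area under the precision/recall curve is approximated with a simple rectangle rule rather than the interpolated precision used by PASCAL VOC, and the function names are hypothetical.

```python
def iou(box_a, box_b):
    """Intersection over Union of two (x_min, y_min, x_max, y_max) boxes."""
    inter_w = max(0.0, min(box_a[2], box_b[2]) - max(box_a[0], box_b[0]))
    inter_h = max(0.0, min(box_a[3], box_b[3]) - max(box_a[1], box_b[1]))
    inter = inter_w * inter_h
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


def average_precision(predictions, ground_truths, iou_threshold=0.5):
    """Average Precision for a single class.

    predictions: list of (confidence, box); ground_truths: list of boxes.
    """
    if not ground_truths:
        return 0.0
    used = set()  # indices of ground-truth boxes that are already matched
    tp, fp = 0, 0
    curve = []    # (recall, precision) after each prediction
    # Visit predictions in decreasing order of confidence.
    for _, box in sorted(predictions, key=lambda p: p[0], reverse=True):
        best_iou, best_idx = 0.0, None
        for idx, gt_box in enumerate(ground_truths):
            overlap = iou(box, gt_box)
            if overlap > best_iou:
                best_iou, best_idx = overlap, idx
        if best_idx is not None and best_iou >= iou_threshold and best_idx not in used:
            used.add(best_idx)  # first match of this ground-truth box
            tp += 1
        else:
            fp += 1             # poor overlap or duplicate detection
        curve.append((tp / len(ground_truths), tp / (tp + fp)))
    # Rectangle-rule approximation of the area under the curve.
    ap, previous_recall = 0.0, 0.0
    for recall, precision in curve:
        ap += (recall - previous_recall) * precision
        previous_recall = recall
    return ap


def mean_average_precision(ap_per_class):
    """mAP: the mean of the per-class Average Precision values."""
    return sum(ap_per_class.values()) / len(ap_per_class)


ground_truths = [(0, 0, 10, 10), (20, 20, 30, 30)]
predictions = [
    (0.9, (0, 0, 10, 10)),    # true positive
    (0.8, (0, 0, 10, 10)),    # duplicate of the first box, false positive
    (0.7, (20, 20, 30, 30)),  # true positive
]
example_ap = average_precision(predictions, ground_truths)
```

In this example the duplicate detection lowers the precision at full recall to 2/3, which is why the resulting Average Precision is below 1 even though both ground-truth boxes are found.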

Figure 2.3: Intersection over Union as a similarity metric for object detection [44]. As seen in the right image, the GT bounding box and the predicted bounding box have a high percentage of overlap, which results in a large IoU value.


Precision and Recall

Precision and recall are commonly used metrics in pattern recognition and information retrieval. The precision metric represents the fraction of relevant documents of all the retrieved documents, and the recall metric represents the fraction of relevant documents that have been retrieved out of all relevant documents.


To calculate the precision and recall we use the number of true positives (tp), false positives (fp), true negatives and false negatives (fn). Precision and recall are calculated as in equation 2.1 and equation 2.2.

$$\text{Precision} = \frac{tp}{tp + fp} \tag{2.1}$$

$$\text{Recall} = \frac{tp}{tp + fn} \tag{2.2}$$
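As a small worked example of equations 2.1 and 2.2, the sketch below computes both metrics from raw counts. The counts are made up for illustration, and returning 0 when a denominator is zero is an assumption rather than part of the definitions.

```python
def precision(tp, fp):
    """Equation 2.1: fraction of the predictions that are correct."""
    return tp / (tp + fp) if (tp + fp) > 0 else 0.0


def recall(tp, fn):
    """Equation 2.2: fraction of the relevant objects that were found."""
    return tp / (tp + fn) if (tp + fn) > 0 else 0.0


# Hypothetical counts: 8 correct detections, 2 spurious detections
# and 4 objects that were missed.
example_precision = precision(tp=8, fp=2)  # 8 / 10
example_recall = recall(tp=8, fn=4)        # 8 / 12
```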


Cost function

The cost function is used to evaluate the performance of the model. During the training phase, our model is constructed so as to minimise the cost function and therefore increase the performance of the model. Many ML algorithms are trained with maximum likelihood, which naturally leaves us with a cost function represented as the negative log-likelihood as described in equation 2.3. The primary cost function is often combined with a regularisation term [12].

$$J(\theta) = -\mathbb{E}_{x, y \sim \hat{p}} \log p_{\text{model}}(y \mid x) \tag{2.3}$$
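For a classifier, equation 2.3 can be estimated as the mean negative log-probability that the model assigns to the correct class. The sketch below is a minimal illustration, assuming the model output is given as a list of class probabilities per sample; the function name and the example values are hypothetical.

```python
import math


def negative_log_likelihood(predicted_probs, labels):
    """Mean of -log p_model(y | x) over all samples.

    predicted_probs[i] holds the class probabilities for sample i and
    labels[i] is the index of its correct class.
    """
    total = 0.0
    for probs, label in zip(predicted_probs, labels):
        total -= math.log(probs[label])
    return total / len(labels)


# A confident, correct model yields a lower cost than an uncertain one.
confident_cost = negative_log_likelihood([[0.9, 0.1], [0.2, 0.8]], [0, 1])
uncertain_cost = negative_log_likelihood([[0.5, 0.5], [0.5, 0.5]], [0, 1])
```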



The goal of training an ML model is to learn the model parameters θ that minimise the cost function (section 2.2.6). These parameters are derived during the training phase, but there are other parameters in the algorithm, hyperparameters, that are not optimised during the training phase but have to be set before the learning process begins.

For example, the learning rate (η) in a mini-batch stochastic gradient descent algorithm specifies the speed of the weight updates in the learning phase, as described in equation 2.4 [45] where J is the cost function.

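$$\theta = \theta - \eta \cdot \nabla_\theta J(\theta; x^{(i:i+n)}; y^{(i:i+n)}) \tag{2.4}$$

Another hyperparameter example is momentum (γ) as described in equation 2.5, which aims to dampen oscillations and therefore accelerate learning [45].

$$\upsilon_t = \gamma \upsilon_{t-1} + \eta \nabla_\theta J(\theta), \qquad \theta = \theta - \upsilon_t \tag{2.5}$$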
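The update rules in equations 2.4 and 2.5 can be sketched on a one-dimensional toy problem. The cost function J(θ) = (θ − 3)², its analytic gradient, the hyperparameter values and the function names are assumptions made for illustration; a real mini-batch implementation would compute the gradient from a batch of training samples instead.

```python
def sgd_momentum(grad_fn, theta, learning_rate=0.1, momentum=0.9, steps=200):
    """Gradient descent with momentum on a scalar parameter theta."""
    velocity = 0.0
    for _ in range(steps):
        # Equation 2.5: accumulate a decaying sum of past gradients.
        velocity = momentum * velocity + learning_rate * grad_fn(theta)
        theta = theta - velocity
    return theta


def toy_gradient(theta):
    """Gradient of the toy cost J(theta) = (theta - 3)^2."""
    return 2.0 * (theta - 3.0)


# With momentum set to 0 the update reduces to equation 2.4.
theta_plain = sgd_momentum(toy_gradient, theta=0.0, momentum=0.0)
theta_momentum = sgd_momentum(toy_gradient, theta=0.0, momentum=0.9)
```

Both runs approach the minimum at θ = 3; the learning rate and the momentum are set before training starts, which is what makes them hyperparameters.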



Relevant Theory

The literature study will focus on the use of DL for real-time object detection, using CNNs [42, 40, 29, 25, 58, 27, 11, 10]. Furthermore, the literature study will cover existing literature on the implementation of DL algorithms with the use of TFL on mobile devices [48, 57, 1, 17].

The necessary background knowledge and knowledge regarding current state-of-the-art systems will be obtained by analysing the resources identified in the literature study and implementing these in TF.


Artificial Neural Networks

Architecture

Feed-forward Neural Networks

The Feed-Forward ANN is the most commonly used deep learning model and serves as the basis for most neural network models such as the CNN. The network aims to approximate some function f∗, which for example could represent the mapping of input x to category y as a classifier function y = f∗(x). This mapping is represented as y = f (x; θ), where the θ parameters are learnt during training [12].

The Feed-Forward ANN is represented as a composition of functions between layers as described in equation 2.6, where each layer in the neural network represents one function in the composition. The length of this chain of function compositions is referred to as the depth of the model [12].
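The composition of layer functions can be sketched in a few lines. This is a toy illustration: the weights are arbitrary, each layer is assumed to apply max(0, ·) as its non-linearity, and the helper names `dense` and `compose` are hypothetical.

```python
def dense(weights, biases):
    """Return a layer function x -> max(0, W x + b), applied element-wise."""
    def layer(x):
        outputs = []
        for row, bias in zip(weights, biases):
            z = sum(w * value for w, value in zip(row, x)) + bias
            outputs.append(max(0.0, z))
        return outputs
    return layer


def compose(*layers):
    """Chain layer functions so that compose(f1, f2, f3)(x) = f3(f2(f1(x)))."""
    def network(x):
        for layer in layers:
            x = layer(x)
        return x
    return network


f1 = dense([[1.0, -1.0], [0.5, 0.5]], [0.0, 0.0])
f2 = dense([[1.0, 1.0]], [-1.0])
f3 = dense([[2.0]], [0.5])
network = compose(f1, f2, f3)  # a model of depth 3
output = network([2.0, 1.0])
```

The number of functions passed to `compose` is the depth of the model in the sense described above.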

$$f(x) = f^{(3)}(f^{(2)}(f^{(1)}(x))) \tag{2.6}$$

Deep Neural Networks

An ANN is considered a Deep Neural Network (DNN) when it consists of multiple hidden layers between the input and the output layer. This corresponds to the depth of the function composition described above.

Activation function

Rectified Linear Units

The Rectified Linear Unit (ReLU) was proposed as a non-saturating alternative to the activation functions tanh and sigmoid. The ReLU has been shown to converge faster than the saturating nonlinearities [31, 25] and is described in equation 2.7 as the max function of 0 and the given value.

$$g(z) = \max\{0, z\} \tag{2.7}$$

Softmax

The softmax function, also known as the normalised exponential function, is a multiclass generalisation of the logistic function. It takes a K-dimensional vector z as input and returns a K-dimensional vector σ(z) where $\sum_{j=1}^{K} \sigma(z)_j = 1$ and where the value of every element in σ(z) is between 0 and 1 [3]. The softmax function is commonly used as the last layer in neural networks for multiclass classification, as the vector σ(z) represents the probabilities for the K different classes. The function is described in equation 2.8.

$$\sigma(z)_j = \frac{e^{z_j}}{\sum_{k=1}^{K} e^{z_k}} \quad \text{for } j = 1, \dots, K \tag{2.8}$$

Learning

Algorithms

During the learning phase of the ML model, the objective is to minimise the cost function by making repeated improvements. This is achieved using partial derivatives: we analyse the gradient at a point x, since the function f(x) decreases fastest from x when moving in the direction of the negative gradient of f(x), as described in equation 2.9 [12], which enables us to update the parameters θ. By repeating this process iteratively, the algorithm makes small improvements towards a minimum. However, the algorithm is not guaranteed to converge to the global minimum, as it is prone to getting stuck in local minima, as visualised in Figure 2.4.


Figure 2.4: Local and global minima [12]. There are multiple extreme points corresponding to minima as seen in the figure, which is a challenge when training an ANN. As we want to avoid local minima, multiple strategies exist to avoid them.

Generalisation

Generalisation is the capability of an ML model to perform well on previously unseen data. During the training process of an ML model, the cost on the training set, the training cost, is used to optimise the parameters θ of the model so as to minimise the cost function. When the training phase is complete, an optimal algorithm performs just as well on a dataset not used during the training phase, namely the test set, which measures how well the algorithm generalises to previously unseen data. The error on the test set is called the generalisation error or test error [12].

The goal of the training phase is to minimise the training loss, inherently learning as much as possible from the training data. Furthermore, we would like to minimise the gap between training loss and test loss. In order to increase the performance on the training set, one could increase the complexity of the ML model, for example by increasing the depth of an ANN. This process is challenging from the viewpoint of generalisation: as the model complexity increases and the training loss decreases, the risk increases that the model mimics the training data rather than learns from it, which has a negative impact on the test loss as the model performs worse on unseen data [12].


[...] the training data, the model is said to suffer from overfitting. In the same way, a model that is not complex enough to learn the underlying patterns of the training data is said to suffer from underfitting. Overfitting, underfitting and appropriate model complexity are visualised in Figure 2.5.

Figure 2.5: Three different models subject to underfitting, overfitting and appropriate complexity [12]. When a model suffers from underfitting as in the left image, the model is unable to learn the underlying patterns of the data, and if the model suffers from overfitting as in the right image, the model simply learns to mimic the data.

Regularisation

L1 and L2 regularisation are two common regularisation techniques used to penalise large weights. For example, equation 2.10 describes logistic regression with a regularisation term R(θ) controlled by a penalty parameter α [32].

The difference between L1 and L2 regularisation is the regularisation term R(θ), where L1 regularisation uses $R(\theta) = \|\theta\|_1 = \sum_{i=1}^{n} |\theta_i|$ and L2 regularisation uses $R(\theta) = \|\theta\|_2^2 = \sum_{i=1}^{n} \theta_i^2$ [32].

$$\arg\max_{\theta} \sum_{i=1}^{m} \log p(y^{(i)} \mid x^{(i)}; \theta) - \alpha R(\theta) \tag{2.10}$$

When training ANNs, our goal is to minimise the objective function or loss function. The unregularised objective function depends only on the weights of the network, as described in equation 2.11 [26].


$$L(w) = \sum_{i=1}^{N} \ell_i(y(X_i; w, \gamma, \beta)) \tag{2.11}$$

where the sum runs over every sample i.

When adding regularisation to the training of ANNs, a regularisation term is added to the unregularised objective function, just as described for the logistic regression in equation 2.10. This revised objective function with the added L2 regularisation is described in equation 2.12. The λ parameter in equation 2.12 corresponds to the α parameter in equation 2.10, and adjusts the degree of penalty from the regularisation [26].

$$L_\lambda(w) = L(w) + \lambda \|w\|_2^2 \tag{2.12}$$

In order to lower the degree of co-adaptation between neurons, dropout is commonly performed on feed-forward ANNs as a measure to lower the degree of overfitting. It can be interpreted as killing random neurons in the ANN with a probability p, inherently lowering the degree to which connected neurons depend on each other by rendering the presence of each neuron unreliable. Dropout has been shown to help networks generalise better to unseen data and increase network performance [16, 56, 50].
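The penalty terms and the dropout operation can be sketched as follows. This is a minimal illustration under stated assumptions: the weights are given as a flat list, the function names are hypothetical, and the surviving activations are rescaled by 1/(1 − p) during training ("inverted dropout"), a common convention that is not described in the text above.

```python
import random


def l1_penalty(weights):
    """R(theta) = ||theta||_1, the sum of absolute values."""
    return sum(abs(w) for w in weights)


def l2_penalty(weights):
    """R(theta) = ||theta||_2^2, the sum of squares."""
    return sum(w * w for w in weights)


def regularised_loss(unregularised_loss, weights, lam):
    """Equation 2.12: L_lambda(w) = L(w) + lambda * ||w||_2^2."""
    return unregularised_loss + lam * l2_penalty(weights)


def dropout(activations, p, rng):
    """Zero each activation with probability p during training.

    Survivors are scaled by 1 / (1 - p) so that the expected activation
    is unchanged (inverted dropout, an assumption of this sketch).
    """
    if p >= 1.0:
        return [0.0 for _ in activations]
    keep_scale = 1.0 / (1.0 - p)
    return [0.0 if rng.random() < p else a * keep_scale for a in activations]


weights = [1.0, -2.0, 3.0]
penalised = regularised_loss(0.5, weights, lam=0.1)  # 0.5 + 0.1 * 14
dropped = dropout([1.0, 2.0, 3.0], p=0.5, rng=random.Random(0))
```

A larger λ pushes the optimiser towards smaller weights, while dropout makes each neuron's presence unreliable so that neighbouring neurons cannot depend on it.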


Convolutional Neural Networks

The CNN was first introduced in 1989 [27] and proved effective for digit recognition. Following this success, the interest in CNNs increased, and they have been shown to perform very well in more challenging image recognition tasks, outperforming other ML models. The success of CNNs can be attributed to the availability of larger datasets, increased computational power and improved regularisation techniques [58].

The underlying idea of the CNN is based on previous work in visual pattern recognition, where it has been demonstrated to be useful to extract and combine local features into more abstract higher-order features [27].

The first layer of a CNN is a convolutional layer, which consists of one or multiple Y by Y (Y ∈ Z) filters convolving over the original input image. As each filter convolves over the original image, element-wise multiplication of the filter values and the pixel values of the corresponding sub-part of the original image (the receptive field) is performed and the results are summed. This results in one or more N by N (N ∈ Z) feature maps of the original image, where every unit in the resulting feature map is a result of the operations on a Y by Y neighbourhood of the original image [27]. This convolution operation is visualised in Figure 2.6.

Figure 2.6: An example of a 2-D CNN Convolution [12]. The figure displays how the sub-part of the original input is multiplied element-wise with the filter values to construct new output.
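The convolution operation described above can be sketched in plain Python. This is a minimal illustration assuming a single-channel image stored as a list of rows, a square filter, a stride of 1 and no padding; the function name and the example values are hypothetical. As in most DL frameworks, the filter is not flipped, so the operation is strictly speaking a cross-correlation.

```python
def convolve2d(image, kernel):
    """Slide a Y-by-Y filter over the image and sum the element-wise products."""
    size = len(kernel)
    out_height = len(image) - size + 1
    out_width = len(image[0]) - size + 1
    feature_map = []
    for i in range(out_height):
        row = []
        for j in range(out_width):
            total = 0.0
            # Receptive field: the Y-by-Y neighbourhood starting at (i, j).
            for di in range(size):
                for dj in range(size):
                    total += image[i + di][j + dj] * kernel[di][dj]
            row.append(total)
        feature_map.append(row)
    return feature_map


image = [
    [1, 2, 3],
    [4, 5, 6],
    [7, 8, 9],
]
kernel = [
    [1, 0],
    [0, -1],
]
feature_map = convolve2d(image, kernel)  # 2-by-2 output for a 3-by-3 input
```

Under these assumptions an input of width W and a filter of width Y give a feature map of width W − Y + 1.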

Pooling layers are commonly inserted between successive convolutional layers as a means to reduce the spatial size and therefore the number of parameters in the network, and they also serve as a natural way to reduce overfitting. Max pooling is commonly used in the pooling layer, which selects the maximum value inside the receptive field [23] as visualised in Figure 2.7. The summarised neighbourhoods from the pooling operation usually do not overlap as the stride is equal to the width of the filter, but overlapping pooling, where the stride is smaller than the width of the filter, has proven useful [25]. The depth of the data remains unchanged, and only the width and height dimensions are altered during the pooling operation [23].


Figure 2.7: CNN Max Pooling operation. By choosing the maximum value in each sampled filter area, the input space is downsampled [23]. As seen in the figure, the data is downsampled from a 4x4 to 2x2.
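Max pooling can be sketched in the same style. The function name, the default window size and the example values are assumptions for illustration; setting the stride smaller than the window size gives the overlapping variant mentioned above.

```python
def max_pool2d(feature_map, size=2, stride=2):
    """Keep the maximum value of each size-by-size window."""
    pooled = []
    for i in range(0, len(feature_map) - size + 1, stride):
        row = []
        for j in range(0, len(feature_map[0]) - size + 1, stride):
            window = [
                feature_map[i + di][j + dj]
                for di in range(size)
                for dj in range(size)
            ]
            row.append(max(window))
        pooled.append(row)
    return pooled


feature_map = [
    [1, 3, 2, 4],
    [5, 6, 7, 8],
    [3, 2, 1, 0],
    [1, 2, 3, 4],
]
pooled = max_pool2d(feature_map)                          # 4x4 -> 2x2
overlapping = max_pool2d(feature_map, size=2, stride=1)   # 4x4 -> 3x3
```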

As seen in Figure 2.8, the patterns that the CNN layers learn to distinguish are increasingly complex. The first layer learns to identify simple lines and shapes, whereas the second layer identifies more complex patterns, which naturally follows from the non-linearity between layers. This progression continues through all layers, and more complex patterns are constructed as we reach the top layers in the CNN. This shows the importance of depth in a CNN, as without depth the network is unable to distinguish and classify more complex patterns and images [58].

Figure 2.8: Evolution of CNN layers. The complexity of the layers increases as data flows from the bottom layers to the top layers [58]. As seen in the figure, the bottom layers learn to identify crude geometrical shapes whereas the top layers identify more advanced shapes.

Naturally, CNNs are built deep with multiple convolutional layers and pooling layers. It is common practice to add one or multiple dense layers at the top of the network for further complexity and non-linearity, and to use a softmax layer as the top layer for image classification. A complex CNN is displayed in Figure 2.9 as an example.
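The role of the softmax top layer can be illustrated with a short sketch that turns the raw outputs of the last dense layer into class probabilities, following equation 2.8. Subtracting the maximum value before exponentiating is a numerical-stability measure assumed here; it does not change the result. The function name and the example scores are hypothetical.

```python
import math


def softmax(z):
    """Map a vector of raw scores to probabilities that sum to 1."""
    largest = max(z)
    exponentials = [math.exp(value - largest) for value in z]
    total = sum(exponentials)
    return [e / total for e in exponentials]


scores = [2.0, 1.0, 0.1]         # hypothetical outputs of the last dense layer
probabilities = softmax(scores)  # one probability per class
predicted_class = probabilities.index(max(probabilities))
```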


Figure 2.9: Example CNN network architecture. The network contains convolutional layers and pooling layers, as well as dense layers followed by a softmax layer for image classification [25].


Transfer learning

It is rare to train a CNN from scratch, as it requires very large quantities of data to perform well, and training a CNN with such large amounts of data could take weeks on GPU clusters. Because of this, it is common practice to use pre-trained networks as an initialisation or feature extractor for the task to be implemented. Re-using a pre-trained network is known as Transfer Learning, and implies transferring knowledge from one domain to another. The base network is commonly trained on a dataset such as the ImageNet dataset, which holds 1.2 million images over 1000 categories. This network then serves as the base network used in the transfer learning.

When the pre-trained network is used as a feature extractor, the last fully connected layer in the base CNN is removed, and two new adaptation layers are added to the network. During training, all layers but the two new last layers remain fixed, and the only weights that are adjusted are the weights of these two fully connected layers. Consequently, the rest of the CNN serves as a fixed feature extractor as visualised in Figure 2.10.

The pre-trained network could also, but not as commonly, be used to initialise all the weights in the network where all the weights are adjusted during training. When applying this approach in transfer learning, it is common to leave some of the earlier layers fixed as they represent generic features.
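The feature-extractor variant can be illustrated with a deliberately small toy sketch. Everything here is an assumption made for illustration: `frozen_base` is a hypothetical stand-in for a pre-trained CNN with its last layer removed, the new head is a single logistic layer rather than the two adaptation layers described above, and the data is synthetic. The point is only that the base is evaluated but never updated, while the head is trained.

```python
import math


def frozen_base(x):
    """Stand-in for a pre-trained network; its parameters are never updated."""
    return [x[0] + x[1], x[0] - x[1]]


def train_head(samples, labels, learning_rate=0.1, epochs=100):
    """Train only the new head with full-batch gradient descent."""
    features = [frozen_base(x) for x in samples]  # fixed feature extraction
    weights = [0.0] * len(features[0])
    bias = 0.0
    count = len(samples)
    for _ in range(epochs):
        grad_w = [0.0] * len(weights)
        grad_b = 0.0
        for feature, label in zip(features, labels):
            z = sum(w * v for w, v in zip(weights, feature)) + bias
            error = 1.0 / (1.0 + math.exp(-z)) - label
            grad_w = [g + error * v for g, v in zip(grad_w, feature)]
            grad_b += error
        weights = [w - learning_rate * g / count for w, g in zip(weights, grad_w)]
        bias -= learning_rate * grad_b / count
    return weights, bias


def predict(x, weights, bias):
    """Classify x using the frozen base followed by the trained head."""
    z = sum(w * v for w, v in zip(weights, frozen_base(x))) + bias
    return 1 if z > 0 else 0


# Synthetic task: label 1 when the first coordinate is the larger one.
samples = [(2.0, 1.0), (3.0, 0.0), (1.0, 2.0), (0.0, 3.0)]
labels = [1, 1, 0, 0]
head_weights, head_bias = train_head(samples, labels)
```

In the fine-tuning variant described above, the base parameters would also receive gradient updates, typically with the earliest layers still kept fixed.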


Figure 2.10: Example of transfer learning where the last fully connected layer is removed, and two new adaptation layers are added to the network. These two layers are later fine-tuned to the new task [35].


Sliding Window Detector

A historical approach to object detection in images is the sliding window detector, where boxes slide over the whole frame in incremental steps in order to analyse each window for objects, as visualised in Figure 2.11. As objects could be of different scales in the image, the sliding window detector creates an image pyramid by resizing the image to various scales and sliding the fixed-size window detector over each resized image. Nevertheless, the sliding window detector is computationally expensive and struggles with aspect ratios [8].
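The enumeration of windows over an image pyramid can be sketched as follows. The window size, step and scale factor are illustrative assumptions, the function names are hypothetical, and the classifier that would be applied to each window is left out; the sketch only shows how quickly the number of windows to analyse grows.

```python
def pyramid_sizes(width, height, window, scale=1.5):
    """Yield the image size at each pyramid level while the window still fits."""
    while width >= window and height >= window:
        yield (width, height)
        width, height = int(width / scale), int(height / scale)


def sliding_windows(width, height, window, step):
    """Yield (x_min, y_min, x_max, y_max) for every window position."""
    for y in range(0, height - window + 1, step):
        for x in range(0, width - window + 1, step):
            yield (x, y, x + window, y + window)


def count_windows(width, height, window=64, step=16, scale=1.5):
    """Total number of windows a detector would have to classify."""
    return sum(
        1
        for level_width, level_height in pyramid_sizes(width, height, window, scale)
        for _ in sliding_windows(level_width, level_height, window, step)
    )


small_example = count_windows(128, 128, window=64, step=32, scale=2.0)
video_frame = count_windows(640, 480)
```

Every one of these windows requires a separate classifier evaluation, which is the source of the computational cost noted above.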


Existing heuristics based algorithm

The existing algorithm used to detect Post-it® notes performs a series of image pre-processing steps prior to performing the object detection task on the processed image. These processing steps are primarily performed with the use of the OpenCV library [53].

The algorithm is built to detect edges in the image corresponding to the edges of Post-it® notes, and thereafter applies a set of filters to the detected edges in order to evaluate if the detected edges are edges of actual notes or not. This process consists of four steps, where the overlap, boundary, shape and size filters are applied in order. As these filters are applied, the algorithm calculates various metrics that are used to make the decision to classify the detected edges as a Post-it® note or not.

Figure 2.11: Sliding Windows over an image [8]. When using the sliding window approach, a fixed-sized window slides across the entire input space. This approach is therefore computationally expensive and does not capture objects of varying scale.

Figure 2.12 displays the process of applying the preprocessing and filters to an image and how the edges in an image are approximated.


Related Work


R-CNN, Fast R-CNN & Faster R-CNN

As described in section 2.3.4, the sliding window detector is computationally expensive, and because of this it would not be feasible to apply CNNs to such an approach [11]. To solve this, R-CNN utilises an object proposal algorithm named selective search, intended to limit the number of bounding boxes and therefore the number of windows to analyse. The selective search algorithm combines exhaustive search and segmentation and uses cues such as texture and colour to pinpoint all possible locations of objects [55].

The regions selected by the selective search algorithm are then each converted to a fixed-size feature vector using a CNN and fed to a Support Vector Machine (SVM) for classification of the object inside the region. All the generated boxes are resized to a fixed size (224x224 for VGG) before being fed to the CNN [11].

Figure 2.12: The current heuristic algorithm applying multiple steps of preprocessing and calculations on two images. As seen in the left figure, the algorithm classifies the object as a false positive Post-it® note.

Despite the use of selective search as an object proposal algorithm for limiting the number of regions to analyse, the average number of proposed regions is 2403 [11]. This heavily limits the speed of the architecture, as all of the 2403 regions have to be analysed individually, which serves as a bottleneck for the Frames Per Second (FPS) of the algorithm.

The Spatial Pyramid Pooling Network (SPP-net) was developed on the basis of R-CNN, with the intention of increasing the speed of the network. This was achieved by calculating the CNN feature vector for the entire image once, and then deriving the CNN representation for each region from the already calculated feature vector, as seen in Figure 2.13. However, back-propagation was non-trivial with the SPP-net [15].

Fast R-CNN was released as an improvement to both R-CNN and SPP-net, as it incorporates the improvements of SPP-net into R-CNN, with the possibility of training the network end-to-end. The Fast R-CNN network is 9x faster than the R-CNN network. Furthermore, Fast R-CNN added the bounding box regression to the training of the network. As a result, training for classification and localisation is not required to be performed independently [10].

Figure 2.13: Network using the Spatial Pyramid Pooling layer [15]. The CNN feature vector is calculated once, as displayed by the black and white layers, and the SPP layer is thereafter applied to calculate the CNN representation for each region.

Faster R-CNN is an improvement of the Fast R-CNN network and performs 10x faster by replacing the Selective Search and Edge Boxes part of the Fast R-CNN network, which served as the performance bottleneck. The replacement for the Selective Search algorithm is a CNN called the Region Proposal Network (RPN) [42].

The RPN is faster than traditional region proposal algorithms as it shares the full-image convolutional features with the network responsible for the detection of objects, minimising the cost of region proposal. The RPN can be trained end-to-end together with the Fast R-CNN detection network [42].

In the Faster R-CNN network, the edge boxes algorithm is modified to further improve the architecture's capability to identify objects with various aspect ratios and scales. The network uses three anchor box scales, each combined with the aspect ratios 1:1, 2:1 and 1:2, which in total gives 9 boxes to be analysed by the RPN [42].
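The construction of the 9 anchor boxes can be sketched as follows. The scale values 128, 256 and 512 are the defaults reported in the Faster R-CNN paper and are an assumption here, as they do not appear in the text above; the function name is hypothetical.

```python
import math


def make_anchors(cx, cy, scales=(128, 256, 512), ratios=(1.0, 2.0, 0.5)):
    """Return (x_min, y_min, x_max, y_max) anchors centred on (cx, cy).

    Each ratio is width / height, and every anchor keeps an area of scale**2.
    The default scales are assumed values, not taken from this thesis.
    """
    anchors = []
    for scale in scales:
        for ratio in ratios:
            w = scale * math.sqrt(ratio)
            h = scale / math.sqrt(ratio)
            anchors.append((cx - w / 2, cy - h / 2, cx + w / 2, cy + h / 2))
    return anchors


anchors = make_anchors(300, 200)
print(len(anchors))
```

Three scales combined with three aspect ratios give the 9 anchors per position mentioned above.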

The TensorFlow Object Detection API [20] provides multiple implementations of the Faster R-CNN model built on both the Inception V2 [51] and ResNet [13] models.



SSD predicts a fixed set of default bounding boxes using convolutional filters applied to feature maps. The network makes predictions on feature maps of different scales in order to achieve high accuracy. The SSD network utilises anchor boxes just as YOLO does (section 2.4.3). At the time of prediction, the network creates box adjustments to match the object shape and produces probabilities for the existence of each classification label in the box [29].

The fact that SSD combines predictions from several feature maps results in an increased number of detections per class and image, and the varying resolution of these feature maps leads to an increased capability of detecting objects of different sizes. At the time of development, SSD aimed to outperform the state-of-the-art Faster R-CNN network in terms of mAP [29].

The Faster R-CNN network executed slowly, at about 7 FPS, using the same hardware as the SSD network, which runs at 59 FPS. This is due to the removed need to re-sample pixels or features, which lowered the number of computations per detection while maintaining high performance. The SSD network was both faster than YOLO and superior in detection performance. A model comparison between SSD and YOLO is displayed in Figure 2.14.

As mentioned, the increased speed in comparison to the Faster R-CNN model was due to the elimination of bounding box proposals and subsampling of the image. The performance increase was partially due to the small changes to the network listed below [29].

• Small Convolutional Filters were used to predict object labels and bounding box offsets.

• Multiple Feature Maps & Prediction at Multiple Scales increased performance on objects of all sizes, but especially smaller objects, which could be more challenging to detect. These predictions were separated by aspect ratio.

Figure 2.14: Comparison between the SSD and the YOLO single shot detection models for object detection [29]. The primary difference between the two architectures is that the YOLO architecture utilises two fully connected layers, whereas the SSD network uses convolutional layers of varying size.

During training of the network, default boxes of varying aspect ratios are evaluated and box offsets are predicted as well as prediction confidence for each category and compared to the ground truth (GT) boxes as displayed in Figure 2.15. The model loss during training is a weighted sum of the confidence loss (label) and the localisation loss (box) as described in equation 2.13 [29].

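\[
L(x, c, l, g) = \frac{1}{N}\left(L_{conf}(x, c) + \alpha L_{loc}(x, l, g)\right) \tag{2.13}
\]

In the described loss equation, N is the number of matched default boxes, and the localisation loss is a Smooth L1 loss between the GT box (g) and the predicted box (l). The offsets for the center (cx, cy) of the default bounding box (d) and its width (w) and height (h) are regressed similarly to Faster R-CNN, as described in equations 2.14 and 2.15 below [29].

\[
\begin{aligned}
L_{loc}(x, l, g) &= \sum_{i \in Pos}^{N} \sum_{m \in \{cx, cy, w, h\}} x_{ij}^{k}\, \mathrm{smooth}_{L1}\left(l_{i}^{m} - \hat{g}_{j}^{m}\right) \\
\hat{g}_{j}^{cx} &= \left(g_{j}^{cx} - d_{i}^{cx}\right) / d_{i}^{w}, \qquad \hat{g}_{j}^{cy} = \left(g_{j}^{cy} - d_{i}^{cy}\right) / d_{i}^{h} \\
\hat{g}_{j}^{w} &= \log\left(\frac{g_{j}^{w}}{d_{i}^{w}}\right), \qquad \hat{g}_{j}^{h} = \log\left(\frac{g_{j}^{h}}{d_{i}^{h}}\right)
\end{aligned}
\tag{2.14}
\]

\[
L_{conf}(x, c) = -\sum_{i \in Pos}^{N} x_{ij}^{p} \log\left(\hat{c}_{i}^{p}\right) - \sum_{i \in Neg} \log\left(\hat{c}_{i}^{0}\right), \quad \text{where} \quad \hat{c}_{i}^{p} = \frac{\exp\left(c_{i}^{p}\right)}{\sum_{p} \exp\left(c_{i}^{p}\right)} \tag{2.15}
\]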
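The box encoding and the localisation loss of equation 2.14 can be illustrated for a single matched pair of boxes. This is a minimal sketch under the assumption that boxes are given as (cx, cy, w, h) tuples; the function names are hypothetical, and the Smooth L1 definition used is the common one from Fast R-CNN (quadratic below 1, linear above).

```python
import math


def smooth_l1(x):
    """Smooth L1: 0.5 * x^2 when |x| < 1, otherwise |x| - 0.5."""
    ax = abs(x)
    return 0.5 * x * x if ax < 1.0 else ax - 0.5


def encode_box(g, d):
    """Encode GT box g relative to default box d, both as (cx, cy, w, h)."""
    return (
        (g[0] - d[0]) / d[2],   # centre x offset, scaled by default width
        (g[1] - d[1]) / d[3],   # centre y offset, scaled by default height
        math.log(g[2] / d[2]),  # log width ratio
        math.log(g[3] / d[3]),  # log height ratio
    )


def loc_loss_single(predicted_offsets, g, d):
    """Localisation loss for one matched default box (one term of eq. 2.14)."""
    target = encode_box(g, d)
    return sum(smooth_l1(l - t) for l, t in zip(predicted_offsets, target))


default_box = (0.5, 0.5, 0.2, 0.2)
gt_box = (0.5, 0.5, 0.2, 0.2)
print(loc_loss_single((0.0, 0.0, 0.0, 0.0), gt_box, default_box))
```

When the default box already coincides with the GT box, all target offsets are zero, so a prediction of zero offsets yields zero loss.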

Figure 2.15: During the training of the SSD network, boxes and the corresponding box offsets are predicted and compared versus the GT boxes [29]. The confidence loss (label) and the localisation loss (box) are used to increase the performance of the network.

The TensorFlow Object Detection API [20] provides multiple implementations of the SSD model built on the MobileNet V1 [17], MobileNet V2 and Inception V2 models.



As visualised in Figure 2.2, the YOLO network divides images into a grid with GxG cells, and each grid cell then generates N predictions for bounding boxes (GxGxN boxes in total). Each bounding box is limited to having only one class at the time of prediction, which restricts the network from finding smaller objects. YOLO unifies the task of object detection and the framing of the detected objects, as the spatial location of the bounding boxes is treated as a regression problem. As a result, the entire process of calculating class probabilities and predicting bounding boxes is executed in one single ANN, which enables optimised end-to-end training of the network and enables the YOLO network to perform at a high FPS [39].

As networks such as Fast R-CNN, Faster R-CNN and SSD were released, the YOLO network was improved to YOLOv2 (YOLO9000), which included some of the techniques adopted in these networks. The aim of YOLO9000 was to release a better version of the YOLO network, and some of the changes in YOLOv2 are the following [40].

• Batch Normalisation. Helps to regularise the model and improves training convergence.

• High-Resolution Classifier. Fine tunes the classification network at a higher resolution.

• Convolution With Anchor Boxes. YOLO uses bounding boxes for framing the objects, whereas YOLOv2 is inspired by Faster R-CNN and utilises Anchor Boxes instead, predicting offset and confidences for these.

• Multi-Scale Training. By training on various input dimensions, the network is forced to perform well on various input dimensions. This also gives the network a good trade-off between accuracy and speed depending on the requirements of the application.

• Joint Classification and Detection. Utilising WordTree (Figure 2.16) the authors of YOLOv2 were able to combine multiple datasets in hierarchies, and therefore train hierarchies of objects such as animal families. This enables the network to train on more data and to further refine the detection of objects.

The development of YOLOv3 built upon YOLOv2 but introduced changes such as multi-scale predictions, an improved backbone classifier and a new network for feature extraction [41].

As YOLO, YOLOv2 and YOLOv3 require large amounts of computational power for inference, the Tiny YOLO network was developed and optimised for use on embedded systems and mobile devices. The Tiny YOLO networks are inferior to the full YOLO networks in terms of mAP but run at significantly higher FPS.

Figure 2.16: Combined datasets using a WordTree hierarchy [40]. Using this approach, multiple datasets could be combined in a hierarchical manner, which increased the amount of training data.

As seen in Figure 2.17, YOLOv3 executes faster than both Faster R-CNN and SSD but does not perform as well in terms of classification accuracy. As computational power is a limiting factor for mobile devices, the Tiny YOLO networks will be reviewed in depth when constructing the network in this thesis.

All YOLO networks are executed in darknet [38], which is an open-source ANN library written in C. These networks can be exported to a common .pb format, which is supported by TensorFlow. As a result, networks trained in darknet can be used on all platforms supported by TensorFlow.



MobileNets are based on a streamlined architecture that uses depthwise separable convolutions to build the CNN. This results in a lightweight DNN that restricts not only model size but primarily model latency (inference time). This set of models allows for simple optimisation of hyperparameters to adjust the latency/accuracy trade-off depending on the problem at hand. MobileNets have been proven effective in a wide range of applications, including object detection, classification, facial attribute recognition and large-scale geolocalisation [17].

Figure 2.17: YOLOv3 performance on the COCO dataset [41]. As seen in the figure, YOLOv3 outperforms all other networks on the COCO dataset.

MobileNet alters the standard convolution described in chapter 2.3.2 by factorising it into a depthwise convolution and a 1x1 pointwise convolution, as displayed in Figure 2.18. The depthwise convolution applies a single filter to each input channel, and the pointwise convolution is used to combine the outputs of the depthwise convolution. The standard convolution generates new output by filtering and combining the inputs in one step, whereas the depthwise separable convolution splits this into a filtering layer and a combining layer, which reduces model size and computations. Using 3x3 depthwise separable convolutions, the computation of MobileNet is reduced to between 1/8 and 1/9 of that of a standard convolution [17].
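The size of this reduction can be checked with the multiply-add counts given in the MobileNet paper [17]. The sketch below assumes a kernel of size k, m input channels, n output channels and a square feature map of side f; the layer dimensions in the example are illustrative.

```python
def standard_conv_cost(k, m, n, f):
    """Multiply-adds of a standard k x k convolution on an f x f feature map."""
    return k * k * m * n * f * f


def separable_conv_cost(k, m, n, f):
    """Multiply-adds of a depthwise k x k convolution plus a 1x1 pointwise one."""
    depthwise = k * k * m * f * f   # one k x k filter per input channel
    pointwise = m * n * f * f       # 1x1 convolution combining the channels
    return depthwise + pointwise


# Illustrative layer: 3x3 kernel, 512 -> 512 channels, 14x14 feature map.
ratio = separable_conv_cost(3, 512, 512, 14) / standard_conv_cost(3, 512, 512, 14)
print(round(ratio, 4))
```

The ratio simplifies to 1/n + 1/k², so for 3x3 kernels and a large number of output channels it approaches 1/9, in line with the reduction stated above.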

In the task of object detection, MobileNet has been shown to be a useful base network. As shown in Figure 2.19, the reported results for MobileNet on the COCO dataset are promising. The MobileNet architecture remains the smallest and least computationally expensive base network in comparison to VGG and Inception for both the SSD and Faster R-CNN frameworks while having similar or superior mAP.


Figure 2.18: The MobileNets convolution architecture [17]. By factorising the standard convolutional filters into depthwise and 1x1 pointwise convolutions, the number of computations required is reduced.




The Inception network shares the same goal as MobileNet, namely to limit the model size and computational cost in environments such as mobile vision and big-data scenarios. The Inception network is successfully scaled up by factorising convolutions and adding regularisation [51].

In general, the Inception network is built upon a set of design principles that constrain how the network should be constructed. One of these design principles is to avoid bottlenecks and allow information to flow through the network in a direct manner. This is achieved by gently decreasing the size of the representations in the network from input to output [51].

Figure 2.19: Object detection results on the COCO dataset using varying frameworks and models [17]. As seen in the table, the number of parameters and operations varies greatly between the models. From this data, it is clear that the MobileNet model architecture is less computationally expensive than the others.

By increasing the activations per tile, the Inception network trains faster as it allows for more disentangled features, and by performing spatial aggregation over lower dimensional embeddings the network remains less complex without losing large quantities of representational power. Furthermore, the Inception network emphasises the importance of balancing the width and depth of the network in order to reach optimal performance and create higher-quality networks [51].

The Inception network uses Inception Modules, which use convolutions of varying sizes per layer, executing these in parallel and concatenating the results for the following layer. As these modules more or less serve as models inside of a model, they are called Inception Modules. To increase the speed of the network, larger convolutions are replaced by multiple smaller ones, as seen in Figure 2.21, which reduces the number of parameters due to the weight sharing between adjacent tiles [51].
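The saving from replacing one 5x5 convolution with two stacked 3x3 convolutions can be verified by counting weights. The sketch below ignores biases and assumes, for illustration only, that the number of channels stays at 64 throughout.

```python
def conv_weights(kernel, in_channels, out_channels):
    """Number of weights in a kernel x kernel convolution (biases ignored)."""
    return kernel * kernel * in_channels * out_channels


channels = 64  # assumed channel count, chosen only for illustration
single_5x5 = conv_weights(5, channels, channels)
two_3x3 = 2 * conv_weights(3, channels, channels)

print(single_5x5, two_3x3, two_3x3 / single_5x5)
```

Two 3x3 convolutions cover the same 5x5 receptive field with 18 instead of 25 weights per channel pair, a reduction of 28%.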



The key idea behind ResNet is to ease the increased difficulty of training as networks become deeper by reformulating the neural network layers as residual functions with reference to the layer inputs, whereas normal networks learn unreferenced functions. This enables training of deeper networks, and the ResNet network was able to train at a depth that was 8x deeper than other successful networks when it was presented, while still having lower complexity [13].

Figure 2.20: The original Inception module [51].

Figure 2.21: Inception module where the 5x5 convolution is replaced with two 3x3 convolutions [51].

The complexity of training very deep ANNs is solved in ResNet by adding identity mappings as shortcut connections between layers, as seen in Figure 2.22. Rather than hoping that stacked ANN layers will directly fit an underlying mapping, the layers are explicitly made to fit a residual mapping. These identity mappings add no extra complexity or parameters to the network [13].
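The shortcut connection can be written as y = F(x) + x, where F is the residual function learned by the stacked layers. The following toy sketch uses plain Python lists instead of tensors, and the function names are hypothetical.

```python
def residual_block(x, residual_fn):
    """Return F(x) + x, where residual_fn plays the role of the stacked layers."""
    fx = residual_fn(x)
    # The identity shortcut adds the input back without any extra parameters.
    return [f + xi for f, xi in zip(fx, x)]


def zero_residual(x):
    """A residual function that has learned nothing yet (outputs zeros)."""
    return [0.0 for _ in x]


print(residual_block([1.0, 2.0, 3.0], zero_residual))
```

If the residual function outputs zeros, the block reduces to the identity mapping, which is why adding such blocks does not make the underlying mapping harder to represent.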


Tools and Utilities



A commonly used software library for ML tasks is TF. It is widely adopted as it provides an interface to express common ML algorithms and executable code of the models. Models created in TF can be ported to heterogeneous systems with little or no change, with devices ranging from mobile phones to distributed servers. TF was created by and is maintained by Google, and is used internally within the company for ML purposes. TF expresses computations as a stateful data flow graph, as seen in Figure 2.23, enabling easy scaling of ANN training with parallelisation and replication [30]. As the model described in this paper is to be trained on computational servers and later ported to mobile devices, TF is highly suitable.

Figure 2.22: Residual learning identity mapping [13]. By adding these identity mappings, shortcut connections are created between layers, inherently making the network easier to train.


TensorFlow Mobile

During design, Google developed TF to be able to run on heterogeneous systems, including mobile devices. This was due to the problems of sending data back and forth between devices and data centres when computations could be executed on the device instead. TensorFlow Mobile (TFM) enabled developers to create interactive applications without network round-trip delays for ML computations [22].

As ML tasks are computationally expensive, model optimisation is used to improve performance. The minimum hardware requirements of TFM in terms of Random-Access Memory (RAM) size and CPU speed are low, and the primary bottleneck is the calculation speed of the computations, as the desired latency for mobile applications is low. For example, a mobile device with hardware capable of 10 Giga Floating Point Operations Per Second (GFLOPS) can run a model that requires 5 giga floating point operations per inference at no more than 2 FPS, which might impede the desired application performance.
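The arithmetic of this example can be made explicit. The sketch below computes an upper bound only, under the assumption that floating point throughput is the sole limiting factor; memory bandwidth, thermal throttling and pre-processing are ignored, and the function names are hypothetical.

```python
def max_fps(device_gflops, model_gflop_per_frame):
    """Upper bound on frames per second if compute were the only limit."""
    if model_gflop_per_frame <= 0:
        raise ValueError("model cost must be positive")
    return device_gflops / model_gflop_per_frame


def frame_latency_ms(fps):
    """Time budget per frame in milliseconds at a given frame rate."""
    return 1000.0 / fps


fps = max_fps(10.0, 5.0)
print(fps, frame_latency_ms(fps))
```

For the figures used in the text, the bound is 2 FPS, which corresponds to 500 ms per inference.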


Figure 2.23: Example of a Computational Graph in TensorFlow CNN. As seen in the figure, all operations and variables are connected and split according to how they relate to each other.


TensorFlow Lite

TensorFlow Lite (TFL) is the evolution of TFM, which already supports deployment on mobile and embedded devices. As there is a trend to incorporate ML in mobile applications, and as users have higher expectations of their mobile applications in terms of camera and voice, there is a strong incentive to further optimise TFM for lightweight mobile use [21].

Some of the optimisations included in TFL are hardware acceleration through the silicon layer, frameworks such as the Android Neural Network API and mobile-optimised ANNs such as MobileNets [17] and SqueezeNet [19]. TF-trained models are converted to the TFL model format automatically by TF [21].


CUDA and cuDNN

CUDA was developed by NVIDIA to create an interface for parallel computing on CUDA-enabled GPUs. The platform functions as a software layer for general calculations that developers can utilise to execute virtual instructions, and supports many programming languages [33].


cuDNN enables GPU-accelerated training and inference of deep neural networks for common routines and operations in ML. As ML heavily depends on access to computational power, this is crucial when training larger networks or training on high-dimensional data such as images. The cuDNN library offers great support for low-level GPU performance tuning, enabling ML developers to focus on the implementation of the networks. cuDNN supports and accelerates operations in TensorFlow, which is the deep learning framework used in this thesis [34].


This section explains the procedure and rationale behind the conducted experiments and the choice of method. The goal of the conducted experiments was to assess the performance of the object detection algorithms and to evaluate their viability in this computationally limited environment. The evaluated networks are R-CNN, SSD and YOLO, as these networks provide state-of-the-art performance and are widely favoured. Ultimately, the desired outcome was to identify one or multiple networks that outperformed the existing heuristics-based model under the constraint of performing inference in a reasonable time.

Training of all networks was conducted on a desktop computer as described in section 3.2.




Existing data

Bontouch has a dataset of pre-processed images of Post-it® notes that can be used for evaluating the deep learning model in comparison to the existing heuristics-based algorithm. This dataset is limited to a couple of hundred entries, and more data is required for the training process. To extend the dataset, we gathered images of Post-it® notes from social media websites and web directories. Thousands of high-quality images of Post-it® notes were easily accessible.



Data gathering and processing


Instagram-scraper

In order to gather more images of Post-it® notes, the command-line Python application instagram-scraper [36] was used. The application enables efficient scraping of images from user profiles and hashtags on Instagram [7], which is one of the largest social media platforms focusing on photo-sharing with 233 million active users.

In order to find suitable images of Post-it® notes, the social media platform was queried and browsed manually. As the platform was explored, multiple users were identified with a large quantity of high-quality Post-it® images, such as the image in Figure 3.1.

Figure 3.1: Post-it® note uploaded by Instagram user instachaaz.

The manual search for suitable accounts and the use of instagram-scraper resulted in a training set of over 3000 images of Post-it® notes, of which 1842 images containing 2436 Post-it® notes remained after manually removing non-relevant images from the dataset. This amount of data is sufficient for tuning the models, and adding more images in transfer learning tasks does not increase mAP performance significantly as the training suffers from diminishing returns [18].


RectLabel [24] is a tool for image annotation available on the Mac App Store, which eases the process of labelling images with bounding boxes. The tool enables the user to easily draw bounding boxes and annotate each box with a pre-defined label. The annotations of the bounding boxes created by RectLabel follow the PASCAL VOC [6] format, as seen in Figure 3.3, which is commonly used in object detection.

All images gathered from the scraping of Post-it® notes from Instagram were manually annotated with bounding boxes in RectLabel, where one or multiple notes were annotated in each image as seen in Figure 3.2.

Figure 3.2: RectLabel bounding box annotation interface [24].

Figure 3.3: Example of the bounding box annotation in the commonly used PASCAL VOC format [6].
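A PASCAL VOC annotation is an XML file with one object element per bounding box, so it can be read with the Python standard library. The sketch below is illustrative only: the file name, the label postit and the coordinates in the sample are hypothetical and are not taken from the dataset used in this thesis.

```python
import xml.etree.ElementTree as ET

# Hypothetical annotation in the PASCAL VOC layout.
SAMPLE = """<annotation>
  <filename>note_0001.jpg</filename>
  <size><width>640</width><height>480</height><depth>3</depth></size>
  <object>
    <name>postit</name>
    <bndbox><xmin>100</xmin><ymin>120</ymin><xmax>260</xmax><ymax>280</ymax></bndbox>
  </object>
</annotation>"""


def parse_voc(xml_text):
    """Return (filename, [(label, xmin, ymin, xmax, ymax), ...])."""
    root = ET.fromstring(xml_text)
    boxes = []
    for obj in root.findall("object"):
        bndbox = obj.find("bndbox")
        coords = [int(bndbox.findtext(tag)) for tag in ("xmin", "ymin", "xmax", "ymax")]
        boxes.append((obj.findtext("name"), *coords))
    return root.findtext("filename"), boxes


print(parse_voc(SAMPLE))
```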



The networks were trained on a computational server with the following hardware specifications.

• GPU: Nvidia GTX 1080 Ti (11GB)

• CPU: Intel i7-7700K @ 4.20GHz

• RAM: 32GB 1600MHz



Choice of base models

As deep CNNs require intensive training, networks such as SSD [29], R-CNN [11] and YOLO [40, 39] were evaluated as the detection frameworks for the CNN to be extended and fine-tuned, and MobileNet [17], Inception [51] and ResNet [13] were evaluated as base networks.

The selection of the base model for the object detection algorithms amongst these state-of-the-art models depends heavily on the trade-off between speed and performance. As visualised in Figure 2.17, the speed and accuracy vary heavily between the different models, as does the mAP between the networks. According to this data, Tiny YOLO is faster than object detection models built with SSD but is inferior in detection performance, which is also the case between Faster R-CNN and SSD, where SSD is the faster but lower-performing network.

In a mobile environment, the FPS at which the mobile device is able to perform computations is crucial for the mobile experience. With a too complex and computationally expensive network, the device would struggle to run the application at more than 1 FPS. According to professional developers at Bontouch, a minimum of around 2 FPS is required for a smooth mobile experience, which limited the base network to either SSD or YOLO, as Faster R-CNN has inferior speed on mobile devices. Nevertheless, all mentioned architectures were evaluated and assessed.


Model training



The Tiny YOLO network was trained using the open-source library darkflow [54], as it enabled convenient transfer learning from the base model to our specific model by loading the pre-trained model and dropping the last two layers. Darkflow also provides an API to predict bounding boxes for detection of objects, and to export these predictions in JSON format. Furthermore, darkflow enabled us to export the darknet model to TensorFlow for deployment on mobile devices, which was required to evaluate the viability of the model in terms of FPS on mobile devices.


The official TensorFlow [30] object detection library [20] contains the object detection frameworks SSD and Faster R-CNN built on top of MobileNet and Inception. These pre-trained models were downloaded and trained in the same manner as Tiny YOLO.


Hyperparameter selection

The DNN models that were used had existing hyperparameter values that had been tuned to give the best performance for each model during training on their respective datasets, and these values were used as the default values for the transfer learning task. These default hyperparameter values were later optimised as described in chapter 3.6.


Hyperparameter optimisation

When training a DNN, finding the most suitable hyperparameters for the given dataset increases the probability of optimal performance. The initial networks were trained using grid search for hyperparameter optimisation, performing an exhaustive search over the hyperparameter space [5].

As the top-performing networks were identified during the first round of network experiments, the hyperparameters of these networks were further optimised using random search over a discrete space, as it has been shown to outperform grid search when applied to a small number of hyperparameters [2].
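The difference between the two search strategies can be sketched as follows. The hyperparameter names and candidate values are assumptions chosen for illustration and do not reflect the search space actually used in the experiments.

```python
import itertools
import random

# Hypothetical discrete search space.
SPACE = {
    "learning_rate": [0.0001, 0.001, 0.01],
    "batch_size": [8, 16, 24],
}


def grid_search(space):
    """Enumerate every combination of the candidate values."""
    keys = sorted(space)
    combos = itertools.product(*(space[k] for k in keys))
    return [dict(zip(keys, values)) for values in combos]


def random_search(space, n_trials, seed=0):
    """Sample n_trials configurations uniformly from the discrete space."""
    rng = random.Random(seed)
    return [{k: rng.choice(space[k]) for k in sorted(space)} for _ in range(n_trials)]


print(len(grid_search(SPACE)), len(random_search(SPACE, 4)))
```

Grid search grows multiplicatively with each added hyperparameter, whereas the budget of random search is fixed by the number of trials.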


Data augmentation

The TensorFlow Object Detection API image preprocessor provides multiple data augmentation steps in the preprocessing pipeline, as shown in appendix A. Applying these augmentation steps to the dataset could increase the network's ability to generalise, as more training data is generated with variation and modification from the original data.

To further improve performance, the top-performing networks were fine-tuned and trained further on augmented data. Augmentation techniques such as RandomBlackPatches are particularly interesting, as the primary characteristic of the Post-it® note is the square form with four corners, which could be altered when the black patches are applied to the original image.
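The idea behind black patch augmentation can be illustrated with a toy implementation on a greyscale image stored as a list of rows. This is not the implementation in the TensorFlow Object Detection API; the function name, patch count and patch size are assumptions made for illustration.

```python
import random


def add_black_patches(image, n_patches=3, patch_size=2, seed=0):
    """Return a copy of image with n_patches square regions set to 0."""
    rng = random.Random(seed)
    height, width = len(image), len(image[0])
    out = [row[:] for row in image]  # copy so the original stays untouched
    for _ in range(n_patches):
        top = rng.randrange(0, height - patch_size + 1)
        left = rng.randrange(0, width - patch_size + 1)
        for y in range(top, top + patch_size):
            for x in range(left, left + patch_size):
                out[y][x] = 0
    return out


white = [[255] * 8 for _ in range(8)]
augmented = add_black_patches(white)
print(sum(row.count(0) for row in augmented))
```

A patch that lands on a corner of a note hides part of its square outline, which forces the network to rely on the remaining edges.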


Measuring and evaluating model


The performance of the detection networks was evaluated using mAP as defined in chapter 2. The final network was evaluated in comparison to the heuristics-based model in terms of recall as described in chapter 2.2.5. The mAP value served as a direct indicator of detection performance in terms of both class prediction and bounding box prediction, and as the base metric for evaluating model performance.

As mAP is a commonly used metric, there are multiple open-source libraries and software packages available to evaluate the network mAP performance and TensorFlow has built-in support for calculating the mAP metric via TensorBoard during training and evaluation.

After evaluating the various alternatives, the open-source project mAP [4] was selected due to its functionality to convert the darkflow prediction JSON format and the PASCAL VOC ground truth XML format to a common format for evaluation, which enabled us to evaluate the various formats effortlessly.

The recall was measured by conducting a set of experimental object detection tasks for the heuristics-based model and the DL model and evaluating the results manually. By performing a series of experiments, enough data was gathered to evaluate the performance of the models.
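Although the results in this thesis were evaluated manually, the underlying calculation can be sketched in code. The example below assumes boxes given as (x_min, y_min, x_max, y_max) and uses an intersection over union (IoU) threshold of 0.5, which is the PASCAL VOC convention and an assumption here; the greedy matching is a simplification.

```python
def iou(a, b):
    """Intersection over union of two (x_min, y_min, x_max, y_max) boxes."""
    iw = max(0, min(a[2], b[2]) - max(a[0], b[0]))
    ih = max(0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = iw * ih
    union = (a[2] - a[0]) * (a[3] - a[1]) + (b[2] - b[0]) * (b[3] - b[1]) - inter
    return inter / union if union > 0 else 0.0


def precision_recall(predictions, ground_truth, threshold=0.5):
    """Greedy matching where each ground truth box is matched at most once."""
    unmatched = list(ground_truth)
    true_positives = 0
    for pred in predictions:
        best = max(unmatched, key=lambda gt: iou(pred, gt), default=None)
        if best is not None and iou(pred, best) >= threshold:
            unmatched.remove(best)
            true_positives += 1
    precision = true_positives / len(predictions) if predictions else 0.0
    recall = true_positives / len(ground_truth) if ground_truth else 0.0
    return precision, recall


gt = [(0, 0, 10, 10), (20, 20, 30, 30)]
preds = [(0, 0, 10, 10), (50, 50, 60, 60)]
print(precision_recall(preds, gt))
```

In this example one of two predictions is correct and one of two notes is found, so both precision and recall are 0.5.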


Evaluating model inference time

As the final model is to be deployed on a mobile phone, the inference time of the model has to be low in order for the application to run smoothly. As mentioned in section 3.3, around 2 FPS is enough for a sufficiently good mobile experience. The trained models were evaluated based on their execution time, and inherently FPS, as this metric is crucial for the area of use of the trained model.

The trained models were deployed on a OnePlus 5T Android mobile device, where the inference task was executed on a separate thread. The execution time of each inference was measured and logged in a text file in the internal storage of the mobile device, as shown in Figure 3.4. The inference time was calculated as seen in listing 3.1. This data was later analysed and assessed to evaluate the viability of the model in terms of FPS performance.

Listing 3.1: Capture inference execution time in ms

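// Timestamp in ms since boot, taken right before the inference call.
final long startTime = SystemClock.uptimeMillis();
// Run the detector on the cropped camera frame.
final List<Classifier.Recognition> results = detector.recognizeImage(croppedBitmap);
// Elapsed inference time in ms for this frame.
lastProcessingTimeMs = SystemClock.uptimeMillis() - startTime;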

Figure 3.4: Mobile object detection inference architecture and inference time capturing. The mobile device launches a video thread and an inference thread, where the video thread serves data from a video feed to the inference thread for object detection. The inference thread logs the execution time to a log file on the device.



Evaluating heuristic model versus ML model

As stated in section 1.1, this paper aims to investigate if the ML model is superior to the heuristics-based model in terms of recall, as our primary goal is to increase the number of correctly identified Post-it® notes in the image. These metrics were calculated by performing object detection with both the heuristics-based model and the ML model on the same dataset, and the results were evaluated manually. The precision and recall metrics were calculated as described in chapter 2.2.5.



mAP Performance

In order to find the best performing model for the task of detecting Post-it® notes, the Tiny YOLO V2, SSD MobileNet V1, SSD MobileNet V2, SSD Inception V2, Faster RCNN Inception V2 and Faster RCNN ResNet50 ML models were trained and evaluated in terms of mAP performance. The performance of the trained models is displayed in Table 4.1, where all models have been trained on the Post-it® training data and tested on the test data as described in chapter 3.1. The calculation of the mAP was carried out as described in chapter 3.8.

As seen in the table, the mAP value varies heavily between the various models, which follows naturally from the complexity of the models. A more complex model such as Faster RCNN ResNet50 consists of considerably more parameters and operations than a less complex model such as SSD MobileNet V1. As the complexity of the models increases, the models' capability to capture spatial relationships increases, as there are more non-linearities in the network. As a result, the performance difference between these models is natural and expected.

During the training, the mAP performance of all models increased fast initially, to later stagnate around 20k to 30k steps, as seen in Figure 4.1. From the given results, this many steps are required to fine-tune the network to detect the spatial features of the Post-it® note. Nevertheless, all models were trained for 100k steps or finished earlier due to early stopping in order to optimise model performance.


Model                       Base model train data    mAP
Tiny Yolo V2                VOC 2007+2012            87.57%
SSD MobileNet V1            COCO trainval            91.16%
SSD MobileNet V2            COCO trainval            91.90%
SSD Inception V2            COCO trainval            96.82%
Faster RCNN Inception V2    COCO trainval            96.69%
Faster RCNN ResNet50        COCO trainval            99.33%

Table 4.1: The mAP performance of the trained models. The performance of the models varies greatly, and the more complex Faster RCNN ResNet50 model reaches near-optimal performance, whereas the less complex Tiny Yolo V2 performs significantly worse.

Figure 4.1: The AP of the SSD Inception V2 model has a near-logarithmic growth and reaches a plateau after 30k steps. At this point, the performance of the model barely increases.


Inference time

As mentioned in chapter 3.3, the inference time and inherently FPS of the model is crucial for a smooth experience in a mobile app environment. In order to capture the inference time of each model and frame, the code as described in chapter 3.8 was executed as a wrapper around each inference.

The inference time of the various models varied greatly, as seen in Figure 4.2. This boxplot describes the inference time in ms for each of the evaluated models.


The inference time of the Faster RCNN ResNet50 & Faster RCNN Inception models is multiple times slower than that of the other models. As the inference time of the RCNN models was too long, these models were dropped from further fine-tuning and evaluation. Furthermore, these models will be left out of the following graphs in order to keep the scale of the graphs comparable. The execution time on a non-logarithmic scale for all models is displayed in Table 4.2.

Figure 4.2: Frame inference time per model measured in log(ms). The Faster RCNN ResNet50 model is significantly slower than the other models, which is not surprising given the complexity of the model.

To visualise the boxplot results in a more graspable manner, the measured inference times of the Tiny YOLO, SSD MobileNet and SSD Inception models are displayed on a non-logarithmic scale in Figure 4.3. As seen in the figure, the inference times of the models are quite similar, except for the SSD Inception V2 model, which was significantly slower than the others. Furthermore, it is clear that the SSD Inception V2 and Tiny Yolo V2 models had significantly more inference time outliers than the SSD models based on MobileNet, which resulted in a larger standard deviation of the inference time, as seen in Figure 4.5. This behaviour was traced to the launch of the Android app: the application required significant resources during launch, and the outliers did not affect the overall performance of the model once the app was fully loaded. The mean inference times of the four models are visualised in Figure 4.4.
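The relationship between launch-time outliers, the mean and the standard deviation can be made concrete with a small sketch. It assumes the common boxplot convention of flagging values more than 1.5 times the interquartile range outside the quartiles; the function name, the `warmup` parameter and the sample values are hypothetical and do not reproduce the measurements of this work.

```python
import statistics


def summarise_inference_times(times_ms, warmup=0):
    """Summarise per-frame inference times given in ms.

    warmup: number of initial measurements to discard, e.g. frames
            captured while the app is still launching (an assumption).
    """
    data = list(times_ms)[warmup:]
    # Quartiles as used for a boxplot; quantiles() returns [Q1, median, Q3].
    q1, _, q3 = statistics.quantiles(data, n=4)
    iqr = q3 - q1
    lower, upper = q1 - 1.5 * iqr, q3 + 1.5 * iqr
    return {
        "mean": statistics.mean(data),
        "std": statistics.stdev(data),
        "outliers": [t for t in data if t < lower or t > upper],
    }


# Made-up measurements: one slow first frame followed by steady frames.
sample = [400.0, 100.0, 102.0, 98.0, 101.0, 99.0, 100.0, 103.0, 97.0]
with_launch = summarise_inference_times(sample)
steady_state = summarise_inference_times(sample, warmup=1)
```

In this illustration a single slow frame inflates both the mean and the standard deviation, whereas discarding the warm-up frame leaves a much tighter distribution, which mirrors the observation that the outliers did not affect performance once the app was fully loaded.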

Figure 4.3: Inference time per model of the Inception and MobileNet based models, measured in ms. The models vary in inference time, and it is clear that the SSD Inception V2 and Tiny YOLO V2 models suffer from outliers that the MobileNet based models do not.


Figure 4.4: Mean inference time per model measured in ms. The models based on MobileNet execute faster than the others, and the SSD Inception V2 model is significantly slower than the others.


Detection experiments

As the purpose of the networks was to serve as real-time object detection algorithms for mobile devices, the trained networks were evaluated by users on mobile devices. In order to evaluate the performance of the models beyond the mAP metric and their viability in terms of inference latency, detection experiments were conducted in an environment as similar as possible to the end use case, which is to detect one or multiple notes gathered on a plain surface.

Figure 4.5: Standard deviation of the inference time per model measured in ms. The standard deviation of the SSD Inception V2 model is larger than that of the other models, whereas SSD MobileNet V2 and Tiny YOLO have the smallest standard deviation.

These experiments were conducted by end users, who utilised the Android application to detect Post-it® notes in real time. As the users were given the task of using the application to detect notes in varying environments, the recall accuracy of Post-it® notes was used as the metric to evaluate model performance.
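For reference, recall can be computed as in the following minimal sketch. It assumes the standard definition, the share of notes actually present that the model detects; the function name and the counts in the usage line are hypothetical and are not results from the experiments.

```python
def recall(true_positives, false_negatives):
    """Recall = TP / (TP + FN): detected notes over all notes present."""
    total = true_positives + false_negatives
    if total == 0:
        raise ValueError("recall is undefined when no notes are present")
    return true_positives / total


# Hypothetical scene: nine of ten notes on the surface are detected.
example_recall = recall(true_positives=9, false_negatives=1)
```

Recall ignores false positives, so it measures how many notes are found rather than how many detections are correct.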

The detection experiments were conducted using the top performing networks in terms of accuracy and latency, namely SSD Inception V2 and SSD MobileNet V2. As seen in Table 4.1, SSD Inception V2 has the highest mAP of the networks that are viable in terms of inference speed, whereas SSD MobileNet V2 performs somewhat worse in terms of mAP but has superior inference speed, as seen in Figure 4.3.

The initial experiments included detection of one to three notes on a plain surface. During these experiments, both networks performed with close to 100% recall, which is not surprising given the near-optimal mAP value of both networks as displayed in Table 4.1. In this setting, the two networks were approximately equivalent in terms of


