
Deep Learning for Object Detection and Retrieval with Intel's NCS - as part of Autonomous Wheelchair Navigation

Georgios Tsiatsios

Master of Science in Computer Science (Specialization in Robotics and Control), 30 ECTS

Umeå University (UMU)

Supervisor: Ulrik Söderström, Associate Professor

In partial fulfillment of the requirements for the degree of Master's Programme in Robotics and Control, 120 hp


Declaration


Acknowledgements

Firstly, I would like to express my sincere gratitude to my supervisor at Umeå University, Professor Ulrik Söderström, for the continuous support of my thesis project, his patience, motivation, immense knowledge and constant guidance throughout the semester.

Besides my supervisor, I would like to thank my dear friend Chaitanya Ganesh Kudaka for the countless hours we spent studying and the experiences we shared together.


Abstract


Contents

1 Introduction
  1.1 Problem Statement
  1.2 Aim
  1.3 Research Questions
  1.4 Delimitation

2 Related Work
  2.1 Deep Learning
    2.1.1 Convolutional Neural Networks
    2.1.2 Object Detection
  2.2 Deep Learning on Edge Devices
  2.3 Content Based Image Retrieval (CBIR)

3 Theory
  3.1 Hardware
    3.1.1 Intel® Movidius™ Neural Compute Stick
    3.1.2 Intel® RealSense™ D435 Camera
  3.2 Convolutional Neural Networks
  3.3 Object Detection
      3.3.1.1 MobileNet
      3.3.1.2 Single Shot Multibox Detector
    3.3.2 Intersection over Union
    3.3.3 Modified Non-maximum Suppression
    3.3.4 Transfer Learning
    3.3.5 Finetuning
  3.4 Content Based Image Retrieval
    3.4.1 Feature Extraction with MobileNet
    3.4.2 Visual Similarity
    3.4.3 Principal Component Analysis

4 Methodology
  4.1 Dataset/Database
    4.1.1 Dataset
    4.1.2 CBIR Database
  4.2 SSD-Mobilenet Training
  4.3 Model Deployment to NCS
  4.4 Techniques for System Improvement
    4.4.1 Sliding Windows for Frame splitting
    4.4.2 Blur Detection
    4.4.3 Bounding Box Filtering
  4.5 CBIR System
  4.6 Camera - Object Distance
  4.7 Pipeline
  4.8 System Evaluation
    4.8.1 Precision - Recall Curve
    4.8.2 Average Precision
      4.8.2.2 CBIR AP

5 Experimental Results
  5.1 Testing Data & Specifications
  5.2 Object Detection Evaluation
  5.3 CBIR Evaluation
  5.4 Camera-Object Distance Evaluation

6 Discussion & Conclusions
  6.1 Discussion
  6.2 Conclusions
    6.2.1 Conclusions
    6.2.2 Future Work


List of Figures

3.1 NCS workflow.
3.2 Ground-truth (red) vs. predicted (green) bounding boxes.
3.3 IoU calculation with visual explanation for the numerator and denominator.
4.1 Loss during training.
4.2 Detection accuracy.
4.3 A frame that is classified as not blurry, Blurriness = 41.15.
4.4 A frame that is classified as blurry, Blurriness = 25.58.
4.5 Query image (first) and top 5 results retrieved after CBIR.
4.6 RGB (left) and depth frame (right) with pixel perfect precision after alignment.
4.7 Final pipeline.
5.1 Precision-Recall Curve for plant, AP = 58.12%.
5.2 Precision-Recall Curve for bottle, AP = 41.28%.
5.3 Precision-Recall Curve for can, AP = 39.92%.
5.4 Precision-Recall Curve for plant, AP = 74.94%.
5.5 Precision-Recall Curve for bottle, AP = 70.81%.
5.6 Precision-Recall Curve for can, AP = 55.15%.
5.7 Precision-Recall Curve for plant, AP = 87.07%.
5.8 Precision-Recall Curve for bottle, AP = 76.31%.
5.9 Precision-Recall Curve for can, AP = 62.06%.
5.10 Precision-Recall Curve for plant, AP = 83.38%.
5.11 Precision-Recall Curve for bottle, AP = 75.32%.
5.12 Precision-Recall Curve for can, AP = 50.46%.
5.13 Occluded Objects.
5.14 No frame splitting.
5.15 3 frame splitting.
5.16 6 frame splitting.
5.17 Blurry Frames.
5.18 mAP comparison at each depth K for every class.
5.19 mAP comparison at each class for every depth K.
5.20 mAP comparison at each depth K for every class after blur detection.
5.21 mAP comparison at each class for every depth K after blur detection.
5.22 1 meter distance with retrieved distances for objects from left to right (m): 1.06, 1.09, 1.13, 1.08, 1.09, 1.07.
5.23 2 meter distance with retrieved distances for objects from left to right (m): 2.01, 1.99, 2.09, 2.13, 2.14, 2.04.
5.24 2 meter distance with retrieved distances for objects from left to right (m): 1.62, 1.90, 1.91, 1.53, 1.87, 1.81.
5.25 2 meter distance with retrieved distances for objects from left to right (m): 1.50, 1.88, 1.91, 1.92, 1.86, 1.84.
5.26 3 meter distance with retrieved distances for objects from left to right (m): 2.91, 2.95, 3.12, 2.91, 2.93.
5.27 3 meter distance with retrieved distances for objects from left to right (m): 2.97, 2.99, 2.99, 3.04, 2.93, 2.91.
5.29 Small (1.09 m) and large plant (2.10 m).
5.30 Water bottle (0.59 m), Jolt cola can (1.05 m) and Coca cola can (2.12 m).
5.31 Water bottle (0.66 m), small plant (0.97 m), large plant (2.39 m)

List of Tables

4.1 Number of images for each object in the database.
5.1 mAP & mmAP results for CBIR.
5.2 mAP & mmAP results for CBIR after blur detection.
6.1 Average FPS.

Chapter 1

Introduction

The development of autonomous systems, such as self-driving cars and service robots, is increasing in interest and popularity since it can change the way that people live, work, interact and entertain themselves [1]. The goal of an autonomous system is to be able to execute complex commands and actions with limited or no human intervention. For instance, a self-driving car should ideally be able to receive the destination choice as human input and perform necessary actions such as optimal path planning, obstacle avoidance, compliance with the road traffic laws and safe driving to the desired destination, just like a human would do.

This thesis report shows the work that was conducted in the field of computer vision with sub-fields in deep learning and image analysis. More specifically, an electrical wheelchair was designed that aims to allow the user to navigate to different items of choice using object detection and object retrieval.

concepts and techniques that were used for the implementation part (Chapter 3). In the next chapter (Chapter 4), the actual implementation, including a detailed pipeline diagram, is presented. Afterwards, the experimental results (Chapter 5) are shown, where the evaluation of the final system occurs. Finally, we discuss the results and the overall conclusions (Chapter 6) of this project, including current limitations and possible future work.

1.1 Problem Statement

The overall idea with the designed electrical wheelchair is to allow the user to choose a desired object to navigate to, based on an image of this object, using a simple interface. The chosen object is compared with the objects found in the frame captured by a camera that is installed on the wheelchair. To achieve this, a deep learning network capable of performing object detection is required, combined with a system that can classify the desired object (in image form) based on its contents, also known as Content Based Image Retrieval (CBIR). Even though CBIR has been present in research since the beginning of the 1980s, interest in it has increased recently with the versatility and popularity of Convolutional Neural Networks (CNNs), and it can be a viable option when hardware resources are limited.

1.2 Aim

The aim of this research is to implement a system that consists of all the necessary tools to achieve object detection and object retrieval. Since this application will be used in real-life scenarios, there are numerous challenges concerning the object detection part (which can also affect the object retrieval), which are listed below.

1. Blurry frames (motion blur).

2. Occluded objects (objects that are not 100% present in the frame).

3. Changing size of objects (close or distant objects).

4. Similar objects (e.g. two similar bottles with different liquid inside).

Furthermore, in order to evaluate the final system, Precision-Recall Curves and mean Average Precision (mAP) metrics are calculated for object detection, and Precision@K, which leads to Mean mAP, is used for the CBIR evaluation. The robustness and precision of the system depend on various factors, e.g. the choice of the object detection model, the training dataset and the image database for the CBIR system, as well as the challenges enumerated above, which are presented in the Experimental Results chapter (Chapter 5).

Finally, using Intel's RealSense D435 camera 1 as a resource, it is possible to capture RGB and RGBD (depth) images, which can be used to examine the accuracy obtained regarding the distance between the object and the camera. It shall be noted that this small part of the project was considered an extra task.

1.3 Research Questions

This thesis project aims to answer the following research questions:

1. How effectively can a CBIR system combined with deep learning recognize objects and be used as input for path planning?

2. How does the system behave in real-life scenarios, where several challenges as mentioned in Section 1.2 appear?

3. Due to resource limitations, are Transfer Learning and Finetuning effective ways of obtaining sufficient results?

1.4 Delimitation

The overall project related to the autonomous electrical wheelchair can be divided into three main parts:

1. Design of the interface.

2. Path planning and autonomous navigation.

3. Real-time object detection and object retrieval.

Chapter 2

Related Work

2.1 Deep Learning

Deep learning is a family of machine learning methods that utilizes many layers in a hierarchical manner, and its aim is to extract patterns and high-level features for various applications [2]. Due to recent advances in processing capabilities, such as the development of high-end GPUs, there has been a significant increase in the popularity of deep learning research. It is used in plenty of applications involving self-driving cars [3], speech and image recognition [4][5], advertising [6], finance [7], healthcare [8], etc. There are several deep neural network architectures, with major ones being recurrent, recursive and convolutional neural networks. This project makes use of the latter, which is further explained in Section 2.1.1.

2.1.1 Convolutional Neural Networks

their depth, which was limited by the available computational power. It was not until 2006 that GPUs were utilized for speeding up CNNs [10]. One of the first CNNs was LeNet-5, which was published in 1998 by Yann LeCun et al. [11]. After that, numerous CNN architectures were developed, a few of them being AlexNet [12], VGG16 [13], Inception v1 [14], ResNet-50 [15], Xception [16], MobileNets [17], etc.

The main applications of CNNs are related to computer vision with fields in image recognition, image classification, medical image analysis as well as in the field of natural language processing. A general architecture overview of a CNN is described in Section 3.2.

2.1.2 Object Detection

Object detection is one of the central tasks in the field of computer vision. It involves the necessary actions to perform two main tasks, object localization and classification, in single images or video sequences [18]. A greedy way to perform object detection is to consider all possible locations and scales; however, this immediately raises unnecessary computational demands on the hardware [19]. The latest state-of-the-art detectors utilize CNNs and deliver methods and techniques for performing object detection in a more clever way. One of the key advantages of CNNs in object detection is the fact that the feature extraction is developed in a hierarchical manner, i.e. from pixel-level features to high-level semantics [20].

R-CNN [25] and Mask R-CNN [26]. As for regression/classification frameworks, the well-known YOLO (v1 [27], v2 [28], v3 [29]), SSD [30] and Multibox [31] are amongst the most famous. Their main difference is that the first method is a two-step process: initially, selective search is used to generate regions where objects are likely to appear, which are then fed into the CNN for region classification. In contrast, the second method is a one-step process that only requires one pass through the CNN, predicting the bounding boxes (usually fixed in number and scale) and classifying them. The latter method is generally faster in both training and inference, while the first is considered more accurate but slower.

It should be noted that the object detection task is broad and there can be plenty of discussion around it. In this section, we just referred to work that was done in combination with deep learning and more specifically with CNN architectures.

2.2 Deep Learning on Edge Devices

In a paper by Alsheikh et al. [34], a framework that combines deep learning and Apache Spark for IoT data analytics was proposed, which follows similar principles to the edge device used for this project, the Neural Compute Stick. In short, the inference execution occurs on the mobile devices, while the heavy data training occurs in the cloud. In research by Liu et al. [35], a food recognition system using edge computing was designed to overcome high system latency and the low battery life of mobile devices. They managed to reduce the response time and lower the energy consumption, while surpassing previous works in food recognition accuracy.

The NCS can substantially speed up inference on IoT devices. In [36], the researchers developed a surveillance system for person detection and deployed it on a Raspberry Pi. The addition of the NCS increased the frame rate from 0.4 to 5.3 FPS, making it ideal for an affordable, yet effective solution. Another example is the work of Hossain et al. [37], where they used the NCS to speed up inference on a drone capable of multi-person detection and tracking. It can be concluded that edge devices, including the NCS, provide an efficient way of deploying deep learning networks at a low cost (under $100), with low power consumption (1 W for over 100 GFLOPS) and in a compact size (USB form factor).

2.3 Content Based Image Retrieval (CBIR)

low-level features. However, CNNs are proven to learn higher-level features and generalize better, making them more effective feature extractors, as shown in the research works [42], [43], [44]. In addition, a study by Wan et al. [45] showed that CNNs pre-trained on a large dataset can achieve promising results and obtain high semantic information from raw pixels. This method was followed in this project, as explained in detail in Section 3.4.1.

Regarding similarity measurements, we first need to define how the images are represented. If we suppose that the feature extraction comes from a CNN, then the features of an image are aggregated into a feature vector. Computing a sufficient number of feature vectors creates a robust database in which each feature vector represents its corresponding image. These vectors can then be compared with the feature vector of a query (test) image to examine the similarity between the query and the images in the database. To compare the similarity between vectors, various metrics can be used. Dengsheng Zhang and Guojun Lu [46] evaluate different similarity measurements for image retrieval. Inspired by that, we implemented a function that calculates the similarity using the Euclidean distance, with an option of choosing the Cosine distance, which is further explained in Section 3.4.2.

Chapter 3

Theory

3.1 Hardware

3.1.1 Intel® Movidius™ Neural Compute Stick

Intel's Neural Compute Stick (NCS) 1 is a low-powered neural network accelerator in a USB stick form factor. The NCS can be utilized for prototyping, tuning, validating and deploying Deep Neural Networks (DNNs) at the edge. It features the Intel Movidius Myriad vision processing unit (VPU) 2 that is used in drones, surveillance cameras, and other low-power autonomous products.

The USB form factor of the NCS enables the easy attachment to IoT devices such as the Raspberry Pi running Raspbian Stretch OS or a PC running Ubuntu 16.04 (recommended version). At the time of this writing, there are two versions of NCS devices available, Intel’s NCS and Intel’s NCS 2. For this project, the first version was utilized.

1 https://software.intel.com/en-us/movidius-ncs, Intel's Neural Compute Stick

In order to deploy an optimized deep learning model compatible with the NCS and perform inference, three main steps are required, as presented in Figure 3.1. Firstly, the deep learning model is trained using standard resources, i.e. CPU, GPU or cloud GPU (Google Colaboratory); it shall be noted that the NCS is not required for this step. Secondly, after the model is trained, it needs to be optimized and compiled into a compatible form (a binary graph file) for the NCS to read. All the libraries and tools that are necessary for profiling, checking and generating the graph are available in the Neural Compute SDK (NCSDK), a GitHub open source project 1. Finally, after the graph generation, the optimized model can be loaded onto the NCS and, while connected to an inference host, the model can be run according to the aim of the specific project (e.g. classification, object detection, etc.).

Figure 3.1: NCS workflow.

3.1.2 Intel® RealSense™ D435 Camera

Intel’s RealSense D435 device consists of a pair of depth cameras, an RGB camera and an infrared projector. The device uses stereo vision [47] in order to calculate depth, making it capable for including depth perception to project development. Based on its datasheet 2, the camera is meant for both indoor and outdoor

envi-ronments and the output resolution and frame rate can be up to 1280 x 720 and 90 frames per second (fps) respectively for depth stream, while the RGB sensor

1

https://github.com/movidius/ncsdk, Neural Compute SDK

2

(25)

3.2 Convolutional Neural Networks

can be up to 1920 x 1080 at 30 fps. Furthermore, the minimum depth distance is 0.11 meters (m) and the maximum range is 10 m.

For this project, the purpose of the camera was to acquire depth information and perform the necessary adjustments in order to calculate the distance between the objects and the camera. A more detailed description of the distance calculation is presented in Section 4.6.

3.2 Convolutional Neural Networks

Convolutional Neural Networks are one of the most popular types of deep neural networks and are involved in several technological advances. They have taken the place of Multi-layer Perceptrons (MLPs) that were previously used in computer vision tasks. MLPs are considered insufficient for modern computer vision problems, since one weight is needed for each input (where one input of an image is one pixel). For example, if we assume that we use an MLP for a 300x300 RGB input image (as for SSD-MobileNet, Section 3.3.1), then the number of trainable weights for a single neuron of the first layer becomes 300 × 300 × 3 = 270,000. In addition, MLPs are not invariant to translation, which means that, e.g. for a person classification task, the person would be treated as if it always appeared at a specific image position, and the weights would be adjusted accordingly.

called f represented as a 2D matrix, the 2D convolution is calculated by sliding the filter over the image, performing element-wise multiplication and summing the results into a single output pixel value, as presented in Equation 3.1:

Conv2D(x, y) = h * f(x, y) = \sum_{s=-\alpha}^{\alpha} \sum_{t=-\beta}^{\beta} h(s, t) \times f(x - s, y - t)    (3.1)
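As an illustration of Equation 3.1, a minimal NumPy sketch of a single-channel, valid-padding 2D convolution could look as follows; the function and variable names are illustrative only and not taken from the thesis implementation.

import numpy as np

def conv2d(image, kernel):
    """Valid 2D convolution: slide the flipped kernel over the image,
    multiply element-wise and sum into one output pixel (Equation 3.1)."""
    kh, kw = kernel.shape
    ih, iw = image.shape
    k = np.flip(kernel)  # flip so the sliding product matches convolution, not correlation
    out = np.zeros((ih - kh + 1, iw - kw + 1))
    for y in range(out.shape[0]):
        for x in range(out.shape[1]):
            out[y, x] = np.sum(image[y:y + kh, x:x + kw] * k)
    return out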

Each filter that passes over the image creates a feature map. These feature maps (convolution layer outputs) are usually accompanied by an activation function (almost exclusively Rectified Linear Units (ReLU) [48] are used, to account for non-linear relationships), which decides whether the examined feature has appeared in a specific part of the image, and by a pooling layer (almost exclusively Max-Pooling) for reducing the dimensionality of the feature maps while retaining the important information and creating translation invariance.

3.3 Object Detection

3.3.1 SSD-MobileNet

3.3.1.1 MobileNet

MobileNet [17] is a CNN architecture which is ideal for mobile vision applications where size and speed matters. The model is mainly based on depth-wise separable convolutions that, based on [17], were firstly introduced in [49]. Depth-wise separable convolution is a depth-wise convolution which is applied to each input channel of an image (e.g. an RGB image has three channels) followed by a point-wise convolution which aims to make a linear combination of the result of the depth-wise layer. After each convolution, Batch Normalization (BN) [50] and ReLU layers are used. These convolutions have the effect of reducing the model size and computations resulting in fast executions.
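A minimal tf.keras sketch of one such depthwise separable block (depthwise convolution, BN and ReLU, followed by a 1x1 pointwise convolution, BN and ReLU) is shown below; MobileNet itself uses ReLU6 and specific per-layer strides and filter counts, so this only illustrates the structure.

from tensorflow.keras import layers, models

def depthwise_separable_block(x, pointwise_filters, stride=1):
    # Depthwise convolution: one 3x3 filter applied per input channel.
    x = layers.DepthwiseConv2D(3, strides=stride, padding="same", use_bias=False)(x)
    x = layers.BatchNormalization()(x)
    x = layers.ReLU()(x)
    # Pointwise convolution: 1x1 linear combination across channels.
    x = layers.Conv2D(pointwise_filters, 1, padding="same", use_bias=False)(x)
    x = layers.BatchNormalization()(x)
    return layers.ReLU()(x)

# Example: apply one block to a 300x300 RGB input.
inputs = layers.Input(shape=(300, 300, 3))
outputs = depthwise_separable_block(inputs, pointwise_filters=64)
model = models.Model(inputs, outputs)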

The MobileNet can be used as an effective base network for object detection tasks. By the term base network, it is meant that this network provides high level features for classification or detection tasks. Since our objective is to perform object detection, the last fully connected layer that is used for classification is replaced by a detection network, in our case the Single Shot Multibox Detector (SSD) (described in Section 3.3.1.2), a popular network combination for object detection projects.

3.3.1.2 Single Shot Multibox Detector

pixels/features for each box and applying a high-quality classifier. The approach here is that bounding box proposals and the subsequent resampling of pixels/features inside the bounding box are eliminated, using a combination of techniques, i.e.:

• Adding a convolutional filter to predict categories of objects and offset in the locations of the bounding boxes using separate filters for different aspect ratio.

• Activating those filters in a later stage of the network in order to detect boxes at multiple scales.

With this approach, they achieved the following:

1. A network that is less computationally intensive, being ideal for mobile real-time applications.

2. A significant increase in execution speed.

3. Comparable detection accuracy to other state-of-the-art methods.

Based on their experiments, for 300 x 300 input images (SSD300), they managed to get 59 fps, versus (vs.) 7 fps for Faster R-CNN and 45 fps for YOLO, while the mean Average Precision (mAP) (explained in Section 4.8.2) was 74.3% vs. 73.2% vs. 63.3% respectively when tested on the VOC2007 test dataset. It should be noted that the authors used VGG16 as a base network in their experiments, which is about 30 times slower than MobileNet [17].

The drawbacks of this method are associated with 1) similar-object confusion, due to the fact that the locations of multiple categories are shared, and 2) bounding box size sensitivity, which means that the performance for small objects is much poorer than for large objects. This was indeed our experience (among other reasons that will be discussed) during testing, as shown in Chapter 5, since we used objects that are considered small (soft-drink cans), medium (bottles) and large (plants).

3.3.2 Intersection over Union

In object detection, the Jaccard index, also known as Intersection over Union (IoU), is an evaluation metric that compares the similarity between two shapes which, in our case, are two bounding boxes (rectangles). IoU is invariant to scale due to encoding of properties i.e. width, height and location of the compared bounding boxes into region properties which produces a normalized measure with focus on the area [51].


Figure 3.2: Ground-truth (red) vs. predicted (green) bounding boxes.

The IoU can be determined with the formula presented in Figure 3.3. The value of the IoU is in the range of [0, 1] where 0 denotes that there is no overlap between the boxes and the prediction is classified as False Positive (FP) and 1 corresponds to a perfect match between the ground-truth and predicted box.

Figure 3.3: IoU calculation with visual explanation for the numerator and denominator.
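As an illustration, the IoU of two axis-aligned boxes given as (x1, y1, x2, y2) corner coordinates can be computed as in the following sketch (a generic helper, not code taken from the thesis implementation):

def iou(box_a, box_b):
    """Intersection over Union of two (x1, y1, x2, y2) boxes."""
    # Corners of the intersection rectangle.
    ix1, iy1 = max(box_a[0], box_b[0]), max(box_a[1], box_b[1])
    ix2, iy2 = min(box_a[2], box_b[2]), min(box_a[3], box_b[3])
    inter = max(0, ix2 - ix1) * max(0, iy2 - iy1)
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0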

3.3.3 Modified Non-maximum Suppression

Non-Maximum Suppression (NMS) is a key step in the object detection procedure in order to filter out redundant bounding boxes [52]. It is a popular component that is included in numerous object detection algorithms, including the SSD. Let us suppose that B = b1, ..., bn are the initial bounding boxes (before filtering) and S = s1, ..., sn are their confidence scores; then, for a threshold Nt, the NMS procedure is as presented in Algorithm 1.

Algorithm 1 Non-Maximum Suppression
Input: B, S, Nt
D ← {}
while B ≠ empty do
    m ← argmax S
    M ← bm
    D ← D ∪ {M}
    B ← B − M
    for bi ∈ B do
        if IoU(M, bi) ≥ Nt then
            B ← B − bi
            S ← S − si
        end
    end
end
Output: D, S

For this project, the purpose of using NMS was to filter out redundant boxes of the same class that resulted from different frame parts of the same image due to the sliding window technique (Section 4.4.1). It was observed that the standard NMS algorithm was not providing a satisfactory result, since the desired objective was to have as much information as possible or, in other words, the biggest bounding box. Because the bounding box containing the object was used as input to the CBIR system, we realised that selecting the largest bounding box provided sufficiently better results. The pseudo-code of our modified NMS is presented in Algorithm 2, where all variables are the same as in Algorithm 1, with the addition of C = c1, ..., cn, which represents the class names of the corresponding bounding boxes.

Algorithm 2 Modified Non-Maximum Suppression
Input: B, C, Nt
D ← {}
while B ≠ empty do
    M ← bm
    D ← D ∪ {M}
    B ← B − M
    for bi ∈ B do
        if IoU(M, bi) ≥ Nt and ci == cm then
            if area(bi) ≥ area(bm) then
                B ← B − bi
                C ← C − ci
            else
                B ← B − bm
                C ← C − cm
            end
        end
    end
end
Output: D
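A minimal Python sketch of this modified suppression, assuming the boxes are (x1, y1, x2, y2) tuples and reusing the iou() helper sketched in Section 3.3.2, could look as follows; it illustrates the idea of keeping the largest overlapping same-class box rather than reproducing the exact project implementation.

def modified_nms(boxes, classes, iou_threshold=0.2):
    """Keep, for each group of overlapping same-class boxes, only the
    largest one (by area), mirroring the intent of Algorithm 2."""
    def area(b):
        return max(0, b[2] - b[0]) * max(0, b[3] - b[1])

    keep = []
    candidates = list(zip(boxes, classes))
    while candidates:
        box, cls = candidates.pop(0)
        merged = True
        while merged:
            merged = False
            for other, other_cls in list(candidates):
                if other_cls == cls and iou(box, other) >= iou_threshold:
                    if area(other) > area(box):
                        box = other          # keep the larger of the two boxes
                    candidates.remove((other, other_cls))
                    merged = True
        keep.append((box, cls))
    return keep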

3.3.4 Transfer Learning

Transfer Learning is a technique in machine learning where a model that has been trained for a specific task is used for another related task. According to Sinno Jialin Pan and Qiang Yang, a unified definition of transfer learning is "Given a source domain DS and learning task TS, a target domain DT and learning task TT, transfer learning aims to help improve the learning of the target predictive function fT(·) in DT using the knowledge in DS and TS, where DS ≠ DT, or TS ≠ TT." (p. 1347, 2010) [53].

transfer learning works well when a model has been trained on a lot of data and has generalized well, e.g. a model trained on ImageNet 1, a dataset that consists of millions of images over more than 1000 classes. Consequently, using a pre-trained model via transfer learning for a task where the available dataset is small (e.g. a few thousand images) can be beneficial in terms of performance and training time. This is the method that we followed in this project for training the object detection network, in combination with finetuning (Section 3.3.5), due to resource limitations.

1 http://www.image-net.org/, ImageNet

3.3.5 Finetuning

Generally, finetuning is the process of freezing the weights of the layers of a pre-trained model except the final layer. More specifically, we keep the same network architecture that the model has been trained on, and all the weights that have been adjusted for a task on a large dataset, and use our desired smaller dataset to train the final layer. In our case, we used the SSD-MobileNet that was pre-trained on the VOC0712 dataset 2, which consists of 20 classes, and finetuned it for 3 classes (bottle, plant, can). It should be noted that the bottle and plant classes are also included in the VOC0712 dataset but not the can, which makes the main dataset and ours partially related.

3.4 Content Based Image Retrieval

3.4.1 Feature Extraction with MobileNet

In order to perform CBIR, the images in the database as well as the inference images must be represented as feature vectors. This can be achieved by using the result of the penultimate layer as a feature vector. MobileNet not only has one of the fastest inference speeds compared to other well-known architectures, but is also fully supported by the NCS; consequently, it was chosen as the feature extractor for the CBIR system. In order to do that, we implemented the following procedure:

1. Firstly, we used transfer learning with the pre-trained weights from the ImageNet dataset.

2. Afterwards, we removed the softmax layer which is used for classification (last layer of the network) and maintained the penultimate layer (global average pooling) as final output for feature extraction.

3. Finally, for each image in our database, we extracted their features in vector form, with shape (1, 1024), and after performing PCA (Section 3.4.3), we serialized the array using pickle and stored it in a single file which contains the feature vectors, the image paths and the PCA object. The PCA object is needed to transform the inference image in the same way as the images in the database. A minimal sketch of this feature extraction is given below.
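The following tf.keras sketch illustrates steps 1 and 2: an ImageNet-pretrained MobileNet whose global average pooling output is used as a (1, 1024) feature vector. The exact preprocessing and layer handling in the project implementation may differ.

import numpy as np
from tensorflow.keras.applications.mobilenet import MobileNet, preprocess_input
from tensorflow.keras.preprocessing import image

# MobileNet without its softmax classifier; 'avg' pooling exposes the
# global average pooling output as a 1024-dimensional vector.
extractor = MobileNet(weights="imagenet", include_top=False,
                      pooling="avg", input_shape=(224, 224, 3))

def extract_features(img_path):
    img = image.load_img(img_path, target_size=(224, 224))
    x = preprocess_input(np.expand_dims(image.img_to_array(img), axis=0))
    return extractor.predict(x)  # shape (1, 1024)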

3.4.2 Visual Similarity

Euclidean distance [55]: This distance is commonly used due to its efficiency and effectiveness. The distance can be calculated as:

d(Q, D) = \sqrt{\sum_{i=1}^{n} (|Q_i - D_i|)^2}

Cosine distance [55]: Cosine distance is another common choice for distance similarity measurements. It is defined as:

d(Q, D) = \sum_{i=1}^{n} \frac{Q_i \cdot D_i}{\|Q_i\| \cdot \|D_i\|}

and represents the angle between the two feature vectors. It should be noted that the cosine measure expresses similarity between the feature vectors, whereas the Euclidean distance expresses dissimilarity. In other words, the higher the Euclidean distance, the more dissimilar the images are, and the higher the cosine value (range [0, 1]), the more similar the images are.

It was noticed that either of the distance metrics produced similar results. As a consequence, we used the Euclidean distance as the default, but the implemented function allows a future developer to choose the cosine distance as an alternative.
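As an illustration, both measures can be computed with SciPy; this is a sketch, not the exact function implemented in the project.

import numpy as np
from scipy.spatial import distance

def rank_database(query_vec, db_vecs, metric="euclidean"):
    """Rank database feature vectors against a query feature vector.
    Lower scores are better for 'euclidean', higher for 'cosine'."""
    if metric == "euclidean":
        scores = [distance.euclidean(query_vec, v) for v in db_vecs]
        order = np.argsort(scores)          # ascending dissimilarity
    else:
        scores = [1.0 - distance.cosine(query_vec, v) for v in db_vecs]
        order = np.argsort(scores)[::-1]    # descending similarity
    return order, scores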

3.4.3 Principal Component Analysis


2) dimensionality reduction, 3) simplified dataset description and 4) to extract the most useful information.

As it was mentioned before, the main purpose of a CBIR system is to extract useful features that represent an image. In our case, our features are extracted from a neural network. Depending on the network, the number of features can vary in length e.g. for a VGG16 model the extracted features of one image have the shape of (1, 4096) while for a MobileNet model (the one we used) the shape is (1, 1024). Suppose that we have a database that consists of 1000 images, then the features of the database have a shape of (1000, 1024). If we use as an example the previous feature matrix, then the main steps for calculating the PCA features for a CBIR database are:

Step 1: We compute the mean for every image i, for i = 1, 2, ..., 1000.
Step 2: We compute the covariance matrix of the database.
Step 3: We compute the eigenvectors with their corresponding eigenvalues.
Step 4: We sort the eigenvectors by decreasing eigenvalues in order to have the highest variance in the first eigenvectors.
Step 5: We choose a number of eigenvectors n with the highest eigenvalues and form a matrix with shape (1000, n).
Step 6: We use the newly formed matrix to transform the original features into the PCA subspace.
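In practice, these steps can be delegated to scikit-learn; the sketch below assumes a (1000, 1024) feature matrix, and the number of components and output file name are hypothetical choices.

import pickle
from sklearn.decomposition import PCA

def build_pca_database(features, image_paths, n_components=128, out_file="cbir_db.pkl"):
    """Fit PCA on the feature matrix, project the features and serialize
    the projected features, image paths and PCA object for inference."""
    pca = PCA(n_components=n_components)
    reduced = pca.fit_transform(features)   # shape (num_images, n_components)
    with open(out_file, "wb") as f:
        # The PCA object is stored so query images can later be projected
        # into the same subspace as the database images.
        pickle.dump({"features": reduced, "paths": image_paths, "pca": pca}, f)
    return pca, reduced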

Chapter 4

Methodology

4.1 Dataset/Database

4.1.1 Dataset

Our training dataset was collected using Intel's RealSense D435 camera. The camera provided output videos at 1280x720 pixels. The objects in every frame were hand-labeled for the object detection task using LabelImg 1. The objects of interest included in the dataset were two bottles (water bottle and coffee bottle), two soft-drink cans (Coca-cola and Jolt cola) and two potted plants (small plant and large plant). It shall be noted that, when labeling the training dataset, there was the option of classifying a detection as "difficult", which means that, when testing, if the object is not detected it would not count as a missed detection; this could increase the AP greatly, as we had a lot of objects that were far away. Detecting difficult objects was a crucial part of this project and consequently, we did not use this option.

The dataset consists of 4300 images, where 1581 of them contain at least one of the objects, while the rest are empty (negative) images, i.e. images with no objects present, which are treated as background. We decided to use negative images to reduce the FP rate that was apparent at the early stages of training. In Chapter 5, we provide results with two datasets, one with only a few negative images and one with plenty, in order to show that with the addition of negative images the false positive rate indeed decreases and, consequently, the Average Precision (AP) increases.

In addition, the quality of the data greatly affects the performance of neural networks, so we focused on creating a dataset that was as diverse as possible so as to tackle different real-life scenarios. As a result, we included images where the objects were occluded, far away, extremely close, with different angles, different ratios and blurry.

Furthermore, as mentioned earlier, the inputs of the SSD-MobileNet model are images resized to 300x300 pixels, and since our dataset includes small objects, we decided to take random 400x400 crops from the original images, making sure that the objects are present in the crop. Had we kept the images as they were (1280x720 pixels), there would have been a high probability of poor performance on small and far away objects. For example, if we assume that an object is at a far distance from the camera and occupies an area of 50x50 pixels, then after resizing the object would occupy approximately 11x20 pixels. Such an object would not contain sufficient information for meaningful feature extraction and, furthermore, the aspect ratio gets distorted when the width and height of the original image are not equal.

4.1.2 CBIR Database

As a result, we ended up with a database that has the outline presented in Table 4.1. The nature of the data collection (video frames) makes the data imbalanced, which means that there is no equal, or close to equal, number of object images per class. Data imbalance is one of the reasons that affect the overall per-class AP, as shown in the experimental results (Chapter 5).

Object          No. of images
water bottle    318
coffee bottle   432
Coca-cola can   177
Jolt cola can   190
large plant     308
small plant     247
Total           1568

Table 4.1: Number of images for each object in the database.

4.2 SSD-Mobilenet Training

The object detection model was implemented, trained and tested using the Caffe framework [59]. The summarized process of training the model is described below.

Step 1: We collected and hand-labeled 4300 images from videos as described in Section 4.1.1. The labels are stored as XML files and each image has one corresponding XML file.

Step 2: Secondly, we randomized and split the data into a training (80%) and a validating (20%) set.

Step 3: Then, we converted the dataset into LMDB format, a file format that Caffe can read and process.

Step 4: As mentioned earlier, we used the pre-trained SSD-MobileNet, trained on VOC0712, which can be found in 1.

Step 5: Afterwards, we generated the training, testing and deploy model prototxts. These files contain the full network architecture.

Step 6: Finally, we started training the model until the loss and accuracy reached adequate values and became stable.

1https://github.com/chuanqi305/MobileNet-SSD, SSD-MobileNet Caffe model with

The model was trained on Google's Colaboratory GPUs. Google offers free 12-hour sessions for people to develop and train their models. The GPUs used for training were NVIDIA's Tesla T4 and NVIDIA's Tesla K80 (depending on availability in the current session). We used the default multistep learning rate policy with base learning rate = 0.0005, weight decay = 0.00005 and batch size = 32 for the training set and 8 for the validation set. The loss and detection accuracy on the validation set are presented in Figures 4.1 and 4.2 respectively. From the figures, it can be deduced that the loss reached a minimum of ≈ 0.7 while the detection accuracy peaked at ≈ 90%. Furthermore, it needs to be noted that, since the dataset contains images of objects being far away, the loss graph inevitably illustrates some fluctuation.

Figure 4.1: Loss during training.

Figure 4.2: Detection accuracy.

4.3 Model Deployment to NCS


library for generating the deployment model after fusion and, with the following command, we successfully created an inference binary graph:

$ mvNCCompile SSD_MobileNet_deploy.prototxt \
      -w SSD_MobileNetSSD_deploy.caffemodel \
      -s 12 -is 300 300 -o object_detection_graph

where mvNCCompile is the command that compiles the model into a graph, SSD_MobileNet_deploy.prototxt is the model architecture after fusing the BN and scale layers, SSD_MobileNetSSD_deploy.caffemodel is the actual fused model that is deployed, -s 12 specifies the number of SHAVE cores (12 is the maximum for the NCS), -is 300 300 states the input image size and object_detection_graph is the name of the output binary graph file. The same method was followed for converting the CBIR model into a supported computational graph.
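A sketch of loading the compiled graph and running one inference, assuming the NCSDK v1 Python API (mvnc); the file names and the input scaling constants (mean 127.5, scale 0.007843) are the ones commonly used with this Caffe SSD-MobileNet and may need adjusting for a specific setup.

import cv2
import numpy as np
from mvnc import mvncapi as mvnc

devices = mvnc.EnumerateDevices()            # find attached NCS devices
device = mvnc.Device(devices[0])
device.OpenDevice()

with open("object_detection_graph", mode="rb") as f:
    graph = device.AllocateGraph(f.read())   # load the compiled binary graph

img = cv2.resize(cv2.imread("frame.jpg"), (300, 300)).astype(np.float32)
img = (img - 127.5) * 0.007843               # SSD-MobileNet input scaling

graph.LoadTensor(img.astype(np.float16), "user object")
output, _ = graph.GetResult()                # flat tensor with the detections

graph.DeallocateGraph()
device.CloseDevice()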

4.4 Techniques for System Improvement

4.4.1 Sliding Windows for Frame splitting

As mentioned in Section 3.3.1.2, one of the drawbacks of the SSD architecture is its poor performance on small or distant objects. In addition, the output resolution of the camera is 1280x720 pixels, while the input image size that the SSD accepts is 300x300 pixels. For these reasons, we were motivated to split the original frame into smaller parts to increase the accuracy of the object detection model, at the cost of a sacrifice in execution speed. The sliding window technique is, as its name suggests, a fixed width/height rectangular box that "slides" across an image in an overlapping manner.

provided increasing AP values for every class, with an overall mAP increase from 46.44% for no split → 66.97% for 3 frame splitting → 75.15% for 6 frame splitting. The sliding window algorithm can easily be customized for even more splitting parts; however, it was noticed that the execution speed then became quite slow, beyond acceptable terms for real-time processing.
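A minimal sketch of such a window generator; the window size and overlap below are illustrative values, not necessarily the ones used in the thesis.

def sliding_windows(frame, win_w=400, win_h=400, overlap=0.5):
    """Yield (x, y, crop) tuples covering the frame with overlapping,
    fixed-size windows; each crop is fed to the detector separately."""
    h, w = frame.shape[:2]
    step_x = max(1, int(win_w * (1 - overlap)))
    step_y = max(1, int(win_h * (1 - overlap)))
    for y in range(0, max(h - win_h, 0) + 1, step_y):
        for x in range(0, max(w - win_w, 0) + 1, step_x):
            yield x, y, frame[y:y + win_h, x:x + win_w]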

4.4.2 Blur Detection

While capturing videos for training and evaluation purposes, it was observed that several frames, while in motion, became too blurry and were not worth running detection on. For that reason, we decided to use a simple blur detection algorithm to classify the frame parts as blurry or not blurry based on a threshold. By doing that, we were able to skip frames that were too blurry, which allowed us to speed up execution time, since the object detection model and CBIR system do not perform any inference on the blurry frames.

To detect the blurriness of an image, based on [60], we firstly transformed an input RGB image into grayscale, we then convolved the grayscale image with the Laplacian filter as presented in Equation 4.1 and, finally, we computed the variance of the response, which produces a single floating point value, i.e. our blurriness value.

Laplacian Kernel = \begin{pmatrix} 0 & 1 & 0 \\ 1 & -4 & 1 \\ 0 & 1 & 0 \end{pmatrix}    (4.1)

A comparison between two back to back frames where one of them is affected by motion blur is presented in Figures 4.3 and 4.4, by comparing the blurriness with a threshold value of 40.0. If the value is below our threshold, it is considered as blurry, whereas if it is greater than that, it is not.


computes the second derivative of an image and is consequently often used for edge detection. The hypothesis is that if the variance is high, then we have a wide spread of responses, meaning that there are a lot of edges in the image. However, when the variance is low, the spread of responses is not high and the amount of edges in the image is minuscule, as in blurry images.
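A sketch of this check with OpenCV, using the 40.0 threshold quoted above:

import cv2

def is_blurry(frame_bgr, threshold=40.0):
    """Return (blurry, blurriness), where blurriness is the variance of
    the Laplacian response on the grayscale frame."""
    gray = cv2.cvtColor(frame_bgr, cv2.COLOR_BGR2GRAY)
    blurriness = cv2.Laplacian(gray, cv2.CV_64F).var()
    return blurriness < threshold, blurriness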

Figure 4.3: A frame that is classified as not blurry, Blurriness = 41.15.

Figure 4.4: A frame that is classified as blurry, Blurriness = 25.58.

4.4.3 Bounding Box Filtering

The use of the sliding window approach (Section 4.4.1) to improve detection accuracy raises the possibility that two or more frame parts detect the same object (or different parts of the same object). As a result, we needed to filter out bounding boxes that belong to the same class and overlap with each other to some degree.

To choose the most dominant bounding box, which results in a higher probability of classifying the correct object when used as input to the CBIR system, we propose the following procedure. First and foremost, for each frame part, we stored the top left corner coordinates in order to transform the relative bounding box coordinates into whole-frame (global) bounding box coordinates. If we assume that for a frame part the top left coordinates are (TL_x, TL_y) and one set of detected bounding box coordinates in that part is (x_1, y_1) and (x_2, y_2) for the top left and bottom right corners respectively, then the transformation is as shown in Equation 4.2:

\begin{pmatrix} G_{1x} \\ G_{1y} \\ G_{2x} \\ G_{2y} \end{pmatrix} = \begin{pmatrix} x_1 + TL_x \\ y_1 + TL_y \\ x_2 + TL_x \\ y_2 + TL_y \end{pmatrix}    (4.2)

where (G_{1x}, G_{1y}), (G_{2x}, G_{2y}) represent the bounding box top left and bottom right coordinates respectively in the global frame. Finally, we compared all the bounding boxes in the global frame using our modified NMS (Section 3.3.3) with an IoU threshold of 0.2 (this number was found through trial and error) and extracted the final filtered bounding box that is used as input to the CBIR system and for detection display purposes.
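A one-line helper implementing Equation 4.2 (the names are illustrative):

def to_global(box, top_left):
    """Translate a (x1, y1, x2, y2) box detected inside a frame part whose
    top-left corner sits at top_left = (TLx, TLy) into whole-frame coordinates."""
    x1, y1, x2, y2 = box
    tlx, tly = top_left
    return (x1 + tlx, y1 + tly, x2 + tlx, y2 + tly)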

4.5 CBIR System

The object detection model was trained to detect three classes: can, bottle and plant objects. In addition, the objects of interest that are examined are similar in per class comparison and consequently, it is necessary to distinguish between two objects of the same class when considering limited resource availability.


vector while preserving as much information as possible for accurate results. The final features were saved in a database that consists of 1569 images. The total execution time for extraction and transformation of the features for all images on an Intel Core i5-7200 CPU was ≈ 9 minutes and 37 seconds.

The final step was to transform the query images into the same PCA subspace as the database and calculate their similarity based on distance metrics (Section 3.4.2). The output of the CBIR system based on the top 5 images retrieved is illustrated on Figure 4.5 where the left image is the query image and the next 5 are the result images obtained.

Figure 4.5: Query image (first) and top 5 results retrieved after CBIR.

The query images derive from the object detection model. The bounding box coordinates are used to crop the specific part of the frame and use it as query image to compare it to the database for object retrieval. Finally, to examine the stability of the CBIR system, we retrieved the top 1, 5, 10, 20, 50 result images and the AP and mAP are shown in Chapter 5.

4.6 Camera - Object Distance

taking advantage of Intel's RealSense SDK 2.0 1, RGB and depth frames can be captured and processed in a convenient way. Also, by combining the object detection model, we can project the bounding box coordinates of a detected object onto the depth image and enclose the depth information of the object. The process of calculating the distance between the camera and the object is presented below.

Step 1: We collect RGB and depth frames that can come either from a live stream or from a file in .bag format that has stored all the required messages. It shall be noted that, to process a single frame, the set of RGB and depth images must be coherent.

Step 2: The initial depth frames are represented as black and white images. As a result, we need to colorize the depth image in order to create the colorized depth map.

Step 3: The colorized depth map at this stage does not have the same viewpoint as the RGB image, since the sensors have different locations on the hardware system. By using a built-in function of the SDK, we can align the two frames and create a single RGBD image with pixel perfect precision. An example of an RGB image and its coherent depth frame after alignment is illustrated on Figure 4.6.

Step 4: After the alignment, we can use the object detection model to generate the bounding boxes with the location of the objects. Since we have the aligned depth frame, we project the bounding box of interest to the depth frame and enclose the data inside.

Step 5: Finally, by getting the depth scale from the device properties and converting it into meters, we can calculate the distance between the camera and the depth data enclosed in the bounding box as:

Distance = MEAN(depth data × scale)    (4.3)

Every pixel inside the bounding box is multiplied by the converted scale, and each multiplication result in Equation 4.3 corresponds to a distance value. Consequently, by taking the mean we can approximate the average distance of the object. For better precision, we decided to resize the bounding box to half of its original size; from our testing, this provided more accurate results.
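A minimal pyrealsense2 sketch of steps 1-5 (capture, align depth to the RGB viewpoint, average the scaled depth inside a detected box); the stream settings and the bounding box values are illustrative, and the zero-depth filter is an extra guard against invalid pixels rather than part of the described method.

import numpy as np
import pyrealsense2 as rs

pipeline = rs.pipeline()
config = rs.config()
config.enable_stream(rs.stream.depth, 1280, 720, rs.format.z16, 30)
config.enable_stream(rs.stream.color, 1280, 720, rs.format.bgr8, 30)
profile = pipeline.start(config)

depth_scale = profile.get_device().first_depth_sensor().get_depth_scale()
align = rs.align(rs.stream.color)            # align depth to the RGB viewpoint

frames = align.process(pipeline.wait_for_frames())
depth = np.asanyarray(frames.get_depth_frame().get_data())

x1, y1, x2, y2 = 400, 200, 520, 380          # bounding box from the detector
roi = depth[y1:y2, x1:x2].astype(np.float64)
distance_m = np.mean(roi[roi > 0] * depth_scale)  # ignore invalid (zero) depth pixels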

The algorithm was tested on several videos where all the objects of interest were placed at the same marked distances (from 1 to 3 meters), in order to examine whether the distance measurements are equivalent. The resulting images are illustrated in Chapter 5. It shall be noted that this part of the project was considered an extra task and, consequently, the research conducted around it was not extensive. From our limited research, we did not find a state-of-the-art evaluation metric to examine the precision of the algorithm. The way we evaluated it was by marking distances measured with a ruler, placing the objects at these distances and comparing the obtained results.

4.7 Pipeline

Figure 4.7: Final pipeline.

Now that all the necessary parts of the theory and methodology have been explained in detail, the final pipeline that connects everything can be seen in Figure 4.7. Based on the figure, we firstly select the object that we want to find through a simple choice in the command prompt and then the stream starts. The stream can either start from a camera for real-time object detection and retrieval or from a video file. Additionally, the configuration makes it possible to have a stream that only captures RGB frames. However, with this setting, the distance between the camera and the object cannot be calculated since the depth images (RGBD) are further required.


object detection model. Object detection is performed and before displaying the results with the final bounding boxes, labels and confidence scores, filtering starts. During filtering, IoU (Section 3.3.2) is performed as part of our modified NMS (Section 3.3.3) to filter out redundant bounding boxes. The final bounding boxes are used as input to the camera-object distance algorithm (Section 4.6) and the final boxes with their corresponding distances, classes and confidences are stored for visualization.

The final part of the pipeline is related to the CBIR system in order to retrieve the object of interest that was initially selected. From the previous part, the final bounding box coordinates are used in order to crop the frame area that is being occupied and that is used as input (query image) in order to classify if the object is the one to be retrieved. Consequently, every query image is being pre-processed based on the MobileNet standards, followed by feature extraction. The extracted features, with shape (1, 1024), are being transformed into the PCA subspace in the same way as the database images were transformed. Finally, we calculate the similarity between the query image and all the database images using distance metrics (Section 3.4.2), we display the final results and continue processing the next captured frame.

4.8 System Evaluation


retrieval. Regarding CBIR, it is not wise to evaluate the classification based only on the first retrieved image but based on top-K retrieved results.

Considering all of the above, two popular evaluation metrics were used for object detection: the Precision-Recall Curves, from which the (mean) Average Precision ((m)AP) across classes is calculated, and the IoU (thoroughly explained in Section 3.3.2). Regarding CBIR, Precision@K, which leads to mAP, was used for one class, and mean mAP across all classes (Section 4.8.2.2).

4.8.1 Precision - Recall Curve

To derive the mathematical equations of precision and recall, we define the following acronyms:

• True Positive (TP): A correct detection which is accepted when the classification is correct and IOU ≥ threshold.

• True Negative (TN): A correct mis-detection e.g. when the background is correctly classified as background and no bounding boxes occur.

• False Positive (FP): A detection that is either wrongly classified, wrongly detected or both, e.g. a person detected as a dog or part of the background is detected as a class of interest.

• False Negative (FN): A detection that should be detected and correctly classified but is not i.e. a missed detection.

The precision of a model is its ability to identify only relevant objects, i.e. the proportion of its detections that are correct. The calculation is given by Equation 4.4:

Precision = \frac{TP}{TP + FP} = \frac{TP}{\text{all detections}}    (4.4)

The recall of a model is its ability to find all the relevant cases, i.e. all ground-truth objects. The calculation is given by Equation 4.5:

Recall = \frac{TP}{TP + FN} = \frac{TP}{\text{all ground truths}}    (4.5)

The precision-recall curve is a popular method for evaluating the performance of an object detection model. The performance on one class is considered good if, as recall increases, the precision remains high; in other words, the model retrieves only relevant objects (0 FP) across all ground-truth objects (0 FN). The higher the number of FP, the lower the precision, and the higher the number of FN, the lower the recall. The range of both precision and recall is [0, 1].

4.8.2 Average Precision

4.8.2.1 Object Detection AP

AP is another convenient way to compare the performance of an object detector on a class. It is represented by a single numerical value in the range [0, 1], usually expressed as a percentage, and it is the area under the precision-recall curve. The way of calculating AP proposed by the PASCAL VOC challenge [61] since 2010 is to interpolate over all data points in the precision-recall graph, compared to previous years where the interpolation used 11 equally spaced points. We decided to follow their recent evaluation implementation (interpolation using all data points), whose formula is illustrated in Equation 4.6:

AP = \sum_{r=0}^{1} (r_{n+1} - r_n) \, \rho_{interp}(r_{n+1})    (4.6)

where

\rho_{interp}(r_{n+1}) = \max_{\tilde{r} : \tilde{r} \ge r_{n+1}} \rho(\tilde{r})

is the interpolated precision value and \rho(\tilde{r}) the precision at recall \tilde{r}. It shall be noted that the IoU threshold throughout testing was set to a default value of 0.5: if IoU ≤ 0.5, the detection is considered an FP, and if IoU > 0.5, the detection is considered a TP.
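A sketch of the all-point interpolation in Equation 4.6, assuming the precision and recall arrays are ordered by descending detection confidence; this mirrors the commonly used VOC-style computation rather than the exact code of the thesis.

import numpy as np

def average_precision(precisions, recalls):
    """All-point interpolated AP: make precision monotonically
    non-increasing, then sum precision times the recall steps."""
    p = np.concatenate(([0.0], precisions, [0.0]))
    r = np.concatenate(([0.0], recalls, [1.0]))
    for i in range(len(p) - 2, -1, -1):
        p[i] = max(p[i], p[i + 1])           # interpolated precision
    idx = np.where(r[1:] != r[:-1])[0] + 1   # points where recall changes
    return float(np.sum((r[idx] - r[idx - 1]) * p[idx]))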

4.8.2.2 CBIR AP

As mentioned in Section 4.8, we considered it wise to evaluate the CBIR system at various depths, where depth is the number of images retrieved. To calculate the AP for CBIR, we followed the "Precision at K" method as explained in [62]. If we assume that K is the depth, Precision@K the precision at depth K, hit_i a correctly retrieved image at position i, H the total number of hits, T the total number of images that belong to a specific class, C the total number of classes and class_i the specific class examined, then:

AP = \frac{1}{H} \times \sum_{i=1}^{K} hit_i    (4.7)

where AP represents the AP of a specific query image,

mAP_{class_i} = \frac{1}{T} \times \sum_{j=1}^{T} AP(image_j)    (4.8)

which is the mean average precision of one class, and finally,

mmAP = \frac{1}{C} \times \sum_{i=1}^{C} mAP(class_i)    (4.9)

is the mean mAP across all classes (the entire CBIR system).
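The sketch below uses plain Precision@K as the per-query score and then averages it per class and across classes, in the spirit of Equations 4.8 and 4.9; note that Equation 4.7 normalizes by the number of hits H, so the per-query value used in the thesis may differ from this simplification.

import numpy as np

def precision_at_k(retrieved_labels, query_label, k):
    """Fraction of the top-k retrieved images whose label matches the query."""
    top_k = retrieved_labels[:k]
    return sum(1 for lbl in top_k if lbl == query_label) / float(k)

def class_map_at_k(per_query_retrievals, query_label, k):
    """Mean of the per-query scores over the T query images of one class (cf. Eq. 4.8)."""
    return float(np.mean([precision_at_k(r, query_label, k) for r in per_query_retrievals]))

def mmap_at_k(per_class_maps):
    """Mean of the per-class mAP values across all C classes (cf. Eq. 4.9)."""
    return float(np.mean(per_class_maps))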


Chapter 5

Experimental Results

5.1 Testing Data & Specifications

To test the implementation of the entire system, good-quality data were required that covered all the cases that needed to be examined (occluded and blurry objects, varying sizes and object similarities). We captured and carefully hand-labeled (in order to compare the ground-truth and predicted detections) two videos that consist of a total of 1687 images (frames) and made them as diverse as possible to evaluate all the requirements in equal terms. A few of the testing images do not include any object of interest, since we wanted to examine the FP rate when no object is present in the frame.


Finally, we provide sample images for occluded and blurry object detection as well as a comparison in performance when using more splits for a frame.

In Section 5.3, we present the mAPs and mmAPs for each and all classes respectively for different top-K images retrieved, for K = 1, 5, 10, 20, 50 and visualize them with clustered bar charts, including results with and without blur detection. Finally, we evaluate our camera-object distance algorithm (Section 5.4) where we captured videos from various pre-defined distances and test if they correspond to the output of the algorithm.

5.2 Object Detection Evaluation

No frame splitting, mAP = 46.44%.

Figure 5.1: Precision-Recall Curve for plant, AP = 58.12%.
Figure 5.2: Precision-Recall Curve for bottle, AP = 41.28%.
Figure 5.3: Precision-Recall Curve for can, AP = 39.92%.

3 frame splitting, mAP = 66.97%.

Figure 5.4: Precision-Recall Curve for plant, AP = 74.94%.
Figure 5.5: Precision-Recall Curve for bottle, AP = 70.81%.
Figure 5.6: Precision-Recall Curve for can, AP = 55.15%.

6 frame splitting, mAP = 75.15%.

Figure 5.7: Precision-Recall Curve for plant, AP = 87.07%.
Figure 5.8: Precision-Recall Curve for bottle, AP = 76.31%.
Figure 5.9: Precision-Recall Curve for can, AP = 62.06%.

6 frame splitting with few negative images, mAP = 69.72%.

Figure 5.10: Precision-Recall Curve for plant, AP = 83.38%.
Figure 5.11: Precision-Recall Curve for bottle, AP = 75.32%.
Figure 5.12: Precision-Recall Curve for can, AP = 50.46%.

Occluded Sample Objects.

(a) Occluded large plant. (b) Occluded coffee bottle.

(c) Occluded Jolt cola can. (d) Occluded large plant far away.

(e) Occluded small & large plant. (f) Occluded Coca cola can.


Detection results for the same frame for different numbers of splits.

Figure 5.14: No frame splitting.

Figure 5.15: 3 frame splitting.

Figure 5.16: 6 frame splitting.

Detection results on blurry frames.

(a) Blurry large plant. (b) Blurry and distant objects.

(c) Blurry and occluded. (d) Blurry plants.

(e) Blurry coffee bottle. (f) Blurry Jolt cola can.


5.3 CBIR Evaluation

Class                     mAP@1     mAP@5     mAP@10    mAP@20    mAP@50    mmAP ∀ K
large plant               98.83%    98.87%    98.50%    98.27%    97.92%    98.48%
small plant               90.68%    91.76%    90.98%    90.22%    88.76%    90.48%
coffee bottle             79.75%    82.87%    82.09%    81.29%    80.19%    81.24%
water bottle              94.75%    96.11%    95.61%    94.55%    91.80%    94.56%
Jolt cola can             88.51%    90.92%    89.80%    88.52%    85.68%    88.69%
Coca cola can             62.91%    67.91%    65.56%    63.19%    59.33%    63.78%
mmAP across all classes   85.91%    88.07%    87.09%    86.01%    83.95%

Table 5.1: mAP & mmAP results for CBIR.


Figure 5.19: mAP comparison at each class for every depth K.

Class                     mAP@1     mAP@5     mAP@10    mAP@20    mAP@50    mmAP for all K
large plant               98.65%    98.98%    99.01%    98.98%    98.78%    98.88%
small plant               91.23%    92.39%    91.90%    91.56%    90.72%    91.56%
coffee bottle             89.15%    91.38%    90.96%    90.34%    89.42%    90.25%
water bottle              100.00%   99.86%    99.35%    98.75%    97.03%    99.00%
Jolt cola can             91.33%    94.24%    94.03%    93.45%    91.60%    92.93%
Coca cola can             56.25%    59.71%    58.60%    57.25%    53.96%    57.15%
mmAP across all classes   87.77%    89.43%    88.97%    88.39%    86.92%

Table 5.2: mAP & mmAP results for CBIR with blur detection.


Figure 5.20: mAP comparison at each depth K for every class after blur detection.


5.4 Camera-Object Distance Evaluation

Pre-defined distance measurement samples.


Figure 5.23: 2 meter distance with retrieved distances for objects from left to right (m): 2.01, 1.99, 2.09, 2.13, 2.14, 2.04.


Figure 5.25: 2 meter distance with retrieved distances for objects from left to right (m): 1.50, 1.88, 1.91, 1.92, 1.86, 1.84.


Figure 5.27: 3 meter distance with retrieved distances for objects from left to right (m): 2.97, 2.99, 2.99, 3.04, 2.93, 2.91.

Distance measurement samples on the testing videos.


Figure 5.29: Small (1.09 m) and large plant (2.10 m).


Chapter 6

Discussion & Conclusions

6.1 Discussion

Several observations can be made from the results obtained with our implemented system; these can be separated into 1) NCS, 2) object detection, 3) CBIR and 4) camera-object distance observations.

1) NCS

a) The NCS allowed us to perform inferences approximately 10 times faster than on the lab PC's CPU.

b) For object detection, with input images of 300 × 300 pixels, the average inference time is 76 ms.

c) For CBIR, with input images of 224 × 224 pixels, the average inference time is 39 ms.


the average FPS with 0 and 1 CBIR activation is presented in Table 6.1. Certainly, there can be more than one CBIR activation per frame.

Frame split number   FPS w/ CBIR   FPS w/o CBIR
0                    8.69          13.15
3                    3.28          4.38
6                    2.02          2.19

Table 6.1: Average FPS.
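As a rough sanity check of our own (not part of the measurement procedure), these values are consistent with the per-inference times reported above if the frame time is approximated by the detector inferences plus any CBIR inferences:

FPS \approx \frac{1}{n_{det} \cdot t_{det} + n_{CBIR} \cdot t_{CBIR}}, \qquad t_{det} \approx 0.076\,\mathrm{s}, \quad t_{CBIR} \approx 0.039\,\mathrm{s},

where n_{det} is the number of detector inferences per frame (1, 3 or 6). For example, 1/0.076 ≈ 13.2 FPS with no splitting and no CBIR, 1/(0.076 + 0.039) ≈ 8.7 FPS with one CBIR activation, and 1/(6 × 0.076) ≈ 2.2 FPS with six splits, all close to the measured values; the larger gap for three splits with CBIR (3.28 measured vs. ≈ 3.7 predicted) is consistent with more than one CBIR activation per frame on average.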

e) Because some layers are not supported by the NCS, we were limited to choosing either SSD-MobileNet or TinyYOLO for the object detection task.

f) The MobileNet, which was used for CBIR, had the quickest inference speed on the NCS compared to other popular neural network architectures and was chosen as the fixed feature extractor.

2) Object Detection

a) From the precision-recall curves presented in Section 5.2, we can see that as the number of frame splits increases, the recall increases for every class, which indicates that the FN rate generally decreases, i.e., there are fewer missed detections.

b) Based on their precision, the classes can be sorted in descending order as bottle, plant, can for all cases.

c) As the number of frame splits increases, the precision generally decreases when comparing the per-class curves, which indicates an increase in the FP rate, i.e., there are more false detections.


Six-way frame splitting can be computationally expensive for some applications, since for each frame at least 6 inferences occur (plus a CBIR inference, if any) and the FPS rate drops significantly. However, it was observed that with six splits the far-away objects were accurately detected (addressing the changing size of objects).

Frame split number transition   Plant AP increase (%)   Bottle AP increase (%)   Can AP increase (%)
0 → 3                           16.82                   29.53                    15.23
3 → 6                           12.13                   5.50                     6.91

Table 6.2: AP increase for each class with the increase of frame splits.
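To make the cost explicit, the sketch below outlines how such frame splitting can be implemented (an assumed structure for illustration, not the code of Section 4.4.1; the frame is assumed to be a NumPy/OpenCV image array and `detector` is a hypothetical callable returning boxes in crop coordinates): the frame is cropped into a grid of slightly overlapping tiles, the detector runs once per tile, and the boxes are shifted back into full-frame coordinates, so the per-frame cost grows linearly with the number of tiles.

def split_frame(frame, rows=2, cols=3, overlap=0.2):
    # Yields (crop, x_offset, y_offset) tiles that cover the frame with some overlap.
    h, w = frame.shape[:2]
    tile_h, tile_w = h // rows, w // cols
    pad_h, pad_w = int(tile_h * overlap), int(tile_w * overlap)
    for r in range(rows):
        for c in range(cols):
            y0, x0 = max(0, r * tile_h - pad_h), max(0, c * tile_w - pad_w)
            y1, x1 = min(h, (r + 1) * tile_h + pad_h), min(w, (c + 1) * tile_w + pad_w)
            yield frame[y0:y1, x0:x1], x0, y0

def detect_with_splitting(frame, detector, rows=2, cols=3):
    # One detector inference per tile; duplicate boxes from overlapping tiles
    # would still need non-maximum suppression (Section 3.3.3).
    detections = []
    for crop, x0, y0 in split_frame(frame, rows, cols):
        for (xmin, ymin, xmax, ymax, label, score) in detector(crop):
            detections.append((xmin + x0, ymin + y0, xmax + x0, ymax + y0, label, score))
    return detections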

e) By including more negative images during training, we successfully reduced the FP rate and increased both the per-class AP and the mAP across all classes.

f) In general, occluded objects (up to 70% occluded) are correctly detected with decent confidence thresholds. Since we hand-labeled the testing videos, it was difficult to extract only the images that contain occluded objects, because most of the images contain multiple objects, not all of which are occluded; hence, no precision-recall curve, AP or mAP is presented for occlusion specifically.

g) We were able to obtain accurate detection results on blurry frames; moreover, in combination with blur detection, which results in frame skipping, the FPS can be increased dramatically.
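For reference, a common way to implement such a blur check is the variance of the Laplacian (a minimal OpenCV sketch in the spirit of Section 4.4.2; the threshold value here is illustrative only, not the one used in this work):

import cv2

def is_blurry(frame_bgr, threshold=30.0):
    # Low variance of the Laplacian means few sharp edges, i.e. a blurry frame,
    # which can then simply be skipped to avoid wasted inferences.
    gray = cv2.cvtColor(frame_bgr, cv2.COLOR_BGR2GRAY)
    blurriness = cv2.Laplacian(gray, cv2.CV_64F).var()
    return blurriness < threshold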


combination of the two methods, we also saved valuable time in labeling the training and testing datasets, as well as in implementing the CBIR system, which did not require any training.

3) CBIR

a) In general, the results of the CBIR system look promising, considering that only feature extraction from a pre-trained model was performed, without any training at all. For mAP@5, we achieved an mmAP of 89.43% across all classes when blur detection was enabled.

b) Before acquiring the results, we expected that the object with the most images in the database (coffee bottle) would get the best mAP and the object with the fewest (Coca cola can) the worst result (Table 4.1), the reasoning being that with more database images the objects would show more variety in terms of angles, distances and points of view. Instead, we suspect that since the “plant” class had the highest AP in object detection (it includes the large plant, which was the biggest of the objects of interest), the large plant obtained the best detections overall and therefore provided the best input to the CBIR system, resulting in the best output with a mmAP ∀ K of 98.48% (Figure 5.18).

c) The Coca cola can, having the smallest number of database images (177) combined with the worst object detection results in the “can” class, had overall the worst mmAP ∀ K at 63.78% (Figure 5.18).


bottle and Jolt cola can respectively. Regarding the Coca cola can, there was a surprising decrease of 6.63% after blur detection, which was the only decrease overall. A reason for this could be that only 128 Coca cola can objects remained after blur detection, causing fluctuations in the mAP and mmAP (a wrong retrieval could lower the mAP and mmAP substantially, and vice versa).

e) Overall, from Figures 5.18 and 5.20 it can be seen that for all objects, without blur detection, mAP@5 provides the best results compared to the other mAP@K, whereas this no longer holds once blur detection is applied.

f) Finally, it should be noted that the database size requires some consideration, since the query image is compared for similarity against the whole database. An excessive number of images can cause execution delays in the similarity measurements, and too few images could lower the precision significantly.
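To illustrate this point, a brute-force similarity search over pre-computed feature vectors might look as follows (a sketch assuming cosine similarity over MobileNet features; see Section 3.4.2 for the measure actually used). Every query is compared against every stored vector, so the retrieval time grows linearly with the database size:

import numpy as np

def top_k_similar(query_feat, db_feats, k=5):
    # query_feat: 1-D feature vector; db_feats: array of shape (n_images, feat_dim).
    q = query_feat / np.linalg.norm(query_feat)
    db = db_feats / np.linalg.norm(db_feats, axis=1, keepdims=True)
    sims = db @ q                      # cosine similarity against the whole database
    return np.argsort(-sims)[:k]       # cost is O(n_images * feat_dim) per query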

4) Camera-Object Distance

a) From the pre-defined object placements presented in Figures 5.22, 5.23, 5.24, 5.25, 5.26 and 5.27, we can observe small differences compared to the pre-defined distances (1, 2 and 3 meters), and in some (fewer) cases deviations of up to 0.5 meters. As mentioned, there was little prior research in this area and no extensive testing was performed to collect a wide variety of data; hence, no obvious cause for the deviations in accuracy can be identified.

b) Overall, the results are mediocre. The results can be utilized, but more
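For context, one straightforward way to read the distance for a detected object from the D435 is to sample the depth frame at the bounding-box centre with pyrealsense2, as in the minimal sketch below; this is an assumed outline for illustration and not necessarily the exact procedure of Section 4.6 (the sampling point, averaging and filtering may differ), and `detected_box` is a hypothetical detection in depth-frame pixel coordinates.

import pyrealsense2 as rs

pipeline = rs.pipeline()
config = rs.config()
config.enable_stream(rs.stream.depth, 640, 480, rs.format.z16, 30)
pipeline.start(config)

def distance_at_box(depth_frame, box):
    # box = (xmin, ymin, xmax, ymax); get_distance() returns metres at the sampled pixel.
    cx, cy = int((box[0] + box[2]) / 2), int((box[1] + box[3]) / 2)
    return depth_frame.get_distance(cx, cy)

frames = pipeline.wait_for_frames()
depth_frame = frames.get_depth_frame()
# e.g. distance_at_box(depth_frame, detected_box) for every detected object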


6.2 Conclusions

6.2.1 Conclusions

This thesis project utilizes the NCS to perform fast inferences in real-life scenarios, with the aim of detecting, recognizing and retrieving objects and their distances. An SSD-MobileNet model was used as the only reliable choice for object detection (being fully supported by the NCS), and a MobileNet v1 model was used for feature extraction in the CBIR system. The combination of these models and various supporting techniques provided encouraging results, especially for the CBIR system, where, by taking full advantage of transfer learning, a success rate close to 90% was obtained. Regarding real-life scenarios, detection and retrieval of occluded and blurry objects were satisfactory; changing object sizes were handled adequately, considering the chosen models' properties and limitations; and similarity retrieval worked surprisingly well, with the exception of the “can” class due to data imbalance and poor data quality. Finally, the extra task of calculating the distance between the object and the camera worked with auto camera settings and small offsets; however, testing of it was limited and we cannot draw convincing conclusions about the robustness of this algorithm.

6.2.2 Future Work


greater importance should be given to objects related to indoor navigation, e.g. office doors. In addition, we propose the addition of object tracking in order for the object recognition part to provide the complete input parameters for wheelchair navigation. Finally, we suggest the evaluation and finetuning of other popular object detection models that can be used with the latest version of the NCSDK, which is supported by the second version of the NCS.


References

[1] J. Kim and J. Canny, “Interpretable learning for self-driving cars by visualizing causal attention,” in The IEEE International Conference on Computer Vision (ICCV), Oct 2017.

[2] L. Deng, “A tutorial survey of architectures, algorithms, and applications for deep learning,” APSIPA Transactions on Signal and Information Processing, vol. 3, p. e2, 2014.

[3] B. Huval, T. Wang, S. Tandon, J. Kiske, W. Song, J. Pazhayampallil, M. Andriluka, P. Rajpurkar, T. Migimatsu, R. Cheng-Yue, F. A. Mujica, A. Coates, and A. Y. Ng, “An empirical evaluation of deep learning on highway driving,” CoRR, vol. abs/1504.01716, 2015.

[4] L. Deng, G. Hinton, and B. Kingsbury, “New types of deep neural network learning for speech recognition and related applications: an overview,” in 2013 IEEE International Conference on Acoustics, Speech and Signal Processing, pp. 8599–8603, May 2013.

[5] D. C. Ciresan, U. Meier, and J. Schmidhuber, “Multi-column deep neural networks for image classification,” CoRR, vol. abs/1202.2745, 2012.

[6] S. Zhai, K.-h. Chang, R. Zhang, and Z. M. Zhang, “Deepintent: Learning

