
DEGREE PROJECT IN MATHEMATICS, SECOND CYCLE, 30 CREDITS

STOCKHOLM, SWEDEN 2019

Object Detection in Object Tracking System for Mobile Robot Application

ALESSANDRO FOÀ

KTH ROYAL INSTITUTE OF TECHNOLOGY


Object Detection in Object Tracking System for Mobile Robot Application

ALESSANDRO FOÀ

Degree Projects in Mathematical Statistics (30 ECTS credits)

Master's Programme in Applied and Computational Mathematics (120 credits)
KTH Royal Institute of Technology, year 2019

Supervisor at Volvo Construction Equipment: Torbjörn Martinsson
Supervisor at KTH: Timo Koski

Examiner at KTH: Timo Koski


TRITA-SCI-GRU 2019:090
MAT-E 2019:46

Royal Institute of Technology
School of Engineering Sciences (KTH SCI)

SE-100 44 Stockholm, Sweden
URL: www.kth.se/sci


Object Detection In Object Tracking System For Mobile Robot Application

Abstract

This thesis work takes place at the Emerging Technologies department of Volvo Construction Equipment (CE), in the context of a larger project which involves several students. The focus is a mobile robot built by Volvo for testing AI features such as Decision Making, Natural Language Processing, Speech Recognition and Object Detection. This thesis will focus on the latter.

During the last 5 years researchers have built very powerful deep learning object detectors in terms of accuracy and speed. This has been possible thanks to the remarkable development of Convolutional Neural Networks as feature extractors for Image Classification. The purpose of this report is to give a broad view of the state-of-the-art literature on Object Detection, in order to choose the best detector for the robot application Volvo CE is working on, considering that the robot's real-time performance is a priority goal of the project. After comparing the different methods, YOLOv3 appears to be the best choice. This framework will be implemented in Python and integrated with an object tracking system which returns the 3D position of the objects of interest. The whole system will be evaluated in terms of speed and precision of the resulting detection of the objects.


Object Detection for an Object Tracking System Applied to a Service Robot

Abstract

This work is carried out at Emerging Technologies at Volvo Construction Equipment (CE), within a large project involving several students. The focus of the work is to use a robot built by Volvo to test different AI techniques such as decision making, natural language processing, speech recognition and object detection.

This thesis treats the last of these techniques.

During the last 5 years, research has shown that it is possible to build powerful deep learning object detectors, in terms of both correctly identifying objects and detecting them quickly. All of this is possible thanks to the Convolutional Neural Network framework acting as a feature extractor for image classification. The goal of this report is to give a general overview of the state of the art in object detection, in order to choose the most suitable method to implement on a robot at Volvo CE. Taking real-time performance into account is one of the goals of the project. After evaluating different methods, YOLOv3 was chosen. This framework was implemented in Python and integrated with an object tracking system which returns a position in three dimensions. The whole system will be evaluated with respect to speed and precision.


Acknowledgment

I want to thank my supervisor at Volvo Construction Equipment, Torbjörn Martinsson, for the support he provided during the last 5 months. I also want to thank my supervisor at KTH, Timo Koski.


Contents

1 Introduction
 1.1 Project
 1.2 Premise: Object Detection and Point Cloud
 1.3 Thesis goal

2 Neural Network theory
 2.1 Artificial Neural Network
 2.2 Loss function and back-propagation
 2.3 Feature extraction
 2.4 Convolutional Neural Network
  2.4.1 Convolution Intuition
  2.4.2 Convolutional Layer
  2.4.3 Pooling Layer
  2.4.4 Fully connected layer

3 Object Detection State Of The Art
 3.1 Region Proposals
  3.1.1 Faster R-CNN
 3.2 YOLO
  3.2.1 YOLOv3
 3.3 SSD

4 Comparison
 4.1 Dataset
 4.2 Performance
  4.2.1 Faster R-CNN performance
  4.2.2 YOLOv3 performance
  4.2.3 SSD performance
 4.3 Why YOLOv3

5 Methodology
 5.1 Point cloud and 3D position
 5.2 Pykinect2
 5.3 Tensorflow
 5.4 System overview
 5.5 Integration in the Robot Software

6 Evaluation
 6.1 Test scenarios and settings
  6.1.1 Tuning parameter
 6.2 Results
  6.2.1 Threshold and Speed
  6.2.2 Scenario 1
  6.2.3 Scenario 2
  6.2.4 Scenario 3

7 Conclusion
 7.1 Speed
 7.2 Precision

8 Future Work


1 Introduction

This project takes place at Volvo Construction Equipment (CE), which is a Volvo Group subsidiary. Its core activity is the development, production and marketing of equipment for construction vehicles such as excavators, wheel loaders, soil compactors et cetera.

Recently, the field has been changing very rapidly: Artificial Intelligence (AI) and Machine Learning (ML) have had a strong impact on research and development in the automotive and robotics industries. The general trend sees tomorrow's machines getting more and more independent, able to perform their tasks on their own, progressively reducing human involvement.

In this context, the department of Emerging Technologies at Volvo CE has developed a mobile robot for testing basic ideas and concepts of the new generation of construction vehicles they would like to launch. One of the main features such machines will have is AI.

The robot, named "Butler", has fully autonomous low-level control and can be programmed or manipulated remotely; however, it does not have any "intelligent soul", i.e. it has no AI functions. The Butler robot is approximately 1.40 meters tall and weighs 65 kg. Its three main components are:

• Bottom - it contains battery, wheels and motors. It has 8 degrees of freedom. It can move at a maximum speed of 4 m/s.

• Body - it connects arm and bottom, contains the computer which runs the software, has a camera on the top.

• Arm - it can grasp small objects through a nipper at the end of it, the whole arm has 6 degrees of freedom.


Figure 1: Butler robot

1.1 Project

As mentioned before, the Butler robot has been built for testing AI concepts: the idea is to make it perform a simple task which includes features such as decision making, object detection and speech recognition. In particular, the task is comparable to a daily task for waiters: a customer will order a coffee and the robot should get the order, grab the coffee and bring it back to the customer. The environment of the project is a 4x4 meter room with a table on one side, with the coffee mug on it. The robot will get the order standing on the other side and will walk to the table to accomplish the task. The whole task must be done in a time comparable to human time. By "comparable", in this project, it is meant 70% of human time.


Figure 2: Coffee task scenario

1.2 Premise: Object Detection and Point Cloud

My role in the project, assigned by Volvo CE, concerns the vision part of the robot. I am required to design, engineer and implement an object tracking system in order to detect the object of interest, i.e. a coffee mug. By object tracking it is meant processing a frame/video which contains a coffee mug and returning its 3D position. Such a task can be divided into object detection, which consists in finding the object's position within the digital image, and point cloud, which means computing the real-world 3D position of the object from its position in the digital image. The point cloud part is performed by the stereo camera available for the project; therefore, this part of the vision system is not subject to any research. Hence, although point cloud is still part of this work, the thesis focus will be on the object detection part.

1.3 Thesis goal

The aim of this thesis is to analyze and compare the state-of-the-art literature and techniques for object detection in order to design and engineer a vision system for object tracking, according to the needs of this project: since the global goal of the project is reaching 70% of human time, the vision system, and therefore the chosen object detector, should perform in real time. Once the system is built, it will be evaluated in terms of time and precision of the detection.

2 Neural Network theory

Object detection is the capability of classifying and locating an object in a digital image.

Image classification is the process of taking an image as input and outputting the class to which the object belongs (e.g. "car", "bicycle", "dog", etc.). Locating, instead, means drawing a bounding box around the object in order to state its position in the image. In this section I will introduce the mathematical framework behind the state-of-the-art object detection techniques. In particular, Convolutional Neural Networks will be studied in depth, as they are the fundamental approach for the classification part of object detection.

Figure 3: Object detection = Classification + Location

2.1 Artificial Neural Network

An Artificial Neural Network (ANN) is a biologically inspired computational model for signal processing, forecasting and clustering. It consists of processing elements (called neurons) and connections between them, with coefficients (weights) bound to the connections. This mathematical framework is one of the most used in the machine learning field.


In mathematical terms, an ANN can be viewed as a simple mathematical model which represents a function f : X → Y. Such a function f(x) can be defined as a composition of other functions a_l(x), which are called the layers of the network. We define x as the input of the network and ŷ as the output.

If we set the number of layers to L, the relation between the input and the output of the network is:

ŷ = a_L(a_{L-1}(a_{L-2}(... a_1(x) ...)))    (1)

In the l-th layer, a_l(x) assumes the following generic form:

a_l(x) = σ_l(W_l a_{l-1}(x) + b_l)    (2)

where W_l, b_l and σ_l are respectively the matrix of weights, the bias vector and the activation function of the l-th layer. For simplicity we will write a_l := a_l(x).

If we plug (2) into (1) and expand the vectors and matrices, we get the following equation, in which the output of a layer becomes the input of the next layer:

ŷ = a^L_n(x) = [ σ( Σ_m w^L_{nm} [ ... [ σ( Σ_j w^2_{kj} [ σ( Σ_i w^1_{ji} x_i + b^1_j ) ]_j + b^2_k ) ]_k ... ]_m + b^L_n ) ]_n    (3)

The network takes an input x and produces an output ŷ by propagating it through the layers according to the above equation. This is called forward propagation and points out how the input gets processed through the whole network, layer by layer.
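To make the forward propagation of equations (1)-(3) concrete, here is a minimal NumPy sketch; the layer sizes, random weights and the ReLU activation are illustrative choices, not taken from any network used in this project.

```python
import numpy as np

def relu(z):
    # Example activation function sigma: element-wise non-linearity.
    return np.maximum(0.0, z)

def forward(x, weights, biases, activation=relu):
    """Forward propagation: a_l = sigma(W_l a_{l-1} + b_l), as in equation (2)."""
    a = x
    for W, b in zip(weights, biases):
        a = activation(W @ a + b)   # the output of layer l becomes the input of layer l+1
    return a

# Illustrative two-layer network: 4 inputs -> 5 hidden units -> 3 outputs.
rng = np.random.default_rng(0)
weights = [rng.normal(size=(5, 4)), rng.normal(size=(3, 5))]
biases = [np.zeros(5), np.zeros(3)]

x = rng.normal(size=4)               # network input
y_hat = forward(x, weights, biases)  # network output, equation (1)
print(y_hat)
```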

2.2 Loss function and back-propagation

In supervised learning, the objective of training an ANN is to modify the parameters of the network (weights and biases) in order to produce an output ŷ as close as possible to y = y(x), where y is the true response variable function. Hence, we need to measure the inconsistency between these two quantities. This is done by introducing the supervised loss function L(y, ŷ), which can assume different forms depending on the type of minimization problem. For simplicity we will use the mean square loss function, which is defined as follows:

L(y, ŷ) = (1/n) Σ_{x ∈ Training set} ( y(x) − a^L_n(x) )²    (4)

To minimize this function, ANNs use gradient descent algorithms, which involve the differentiation of L(y, ŷ) with respect to each weight and bias. Such a process can be computationally very heavy. The back-propagation algorithm has the purpose of making this gradient computation efficient throughout the network by reusing previously computed gradients. Let us define the input sum of a neuron k in the l-th layer:

z^l_k = Σ_j w^l_{kj} a^{l−1}_j + b^l_k    (5)


By applying the activation function we get:

a^l_k = σ(z^l_k)    (6)

Now we can calculate the input sum of a neuron m in layer l+1.

z^{l+1}_m = Σ_k w^{l+1}_{mk} a^l_k + b^{l+1}_m    (7)

Using the chain rule of calculus [6], the derivative of the loss function with respect to a single weight can be written as:

∂L/∂w^l_{kj} = (∂L/∂z^l_k) (∂z^l_k/∂w^l_{kj}) = (∂L/∂a^l_k) (∂a^l_k/∂z^l_k) (∂z^l_k/∂w^l_{kj}) =    (8)

= [ Σ_m (∂L/∂z^{l+1}_m) (∂z^{l+1}_m/∂a^l_k) ] (∂a^l_k/∂z^l_k) (∂z^l_k/∂w^l_{kj}) = [ Σ_m (∂L/∂z^{l+1}_m) w^{l+1}_{mk} ] σ′(z^l_k) a^{l−1}_j    (9)

From this expression we can see that the value of the derivative of the loss function with respect to the weights in the l-th layer depends on contributions from the (l+1)-th layer.

Now let us define the error signal of a neuron k in layer l as how much the total error changes when the input sum of the neuron is changed:

δ^l_k ≡ ∂L/∂z^l_k    (10)

It is possible to notice that the term on the right-hand side of equation (10) has already been expanded through equations (8) and (9). We can then rewrite:

∂L/∂z^l_k = [ Σ_m (∂L/∂z^{l+1}_m) w^{l+1}_{mk} ] σ′(z^l_k)    (11)

Now it is easy to see that a recursive formula for the error signal holds:

δ^l_k = [ Σ_m δ^{l+1}_m w^{l+1}_{mk} ] σ′(z^l_k)    (12)

In order to be able to use this formula, we have to compute the initial error signal, which is that of the final-layer neurons:

δ^L_j = ∂L/∂z^L_j = (∂L/∂a^L_j) (∂a^L_j/∂z^L_j) = (∂L/∂a^L_j) σ′(z^L_j)    (13)

Therefore, the only derivative that needs to be computed explicitly is the derivative of the loss function with respect to the final-layer activations, which follows directly from the chosen loss function (4).


2.3 Feature extraction

The most important component of any object recognition and image classification system is Feature Extraction (FE). A qualitative description of this concept is given here, since it is necessary for understanding the following section.

In Computer Vision, FE is a dimensionality reduction process for extracting relevant information from an image. It consists in mapping the image pixels into a feature space: an image is represented through a set of vectors that correspond to its attributes. The idea is that images of objects with similar attributes are similar, i.e. they probably contain the same object. The various contents of an image, such as color, texture, shape etc., are used to represent and index an image or an object.

Figure 4: Feature extraction


2.4 Convolutional Neural Network

The Convolutional Neural Network (CNN) is probably the most powerful ANN for image-related problems [19], since it represents a very effective technique for performing feature extraction. This is why, in the context of object detection, CNNs are also called feature extractors. In object detection, the CNN architecture is the core component and takes care of the classification aspect.

The interest in these models began with the image classification network AlexNet [1] in 2012 and has grown very much in the following years. In just three years, researchers progressed from the 8-layer AlexNet to the 152-layer ResNet [16]. Nowadays the CNN is the best deep learning solution for basically every image-related problem, both in terms of precision and speed.

The CNN architecture consists of 2 types of layers that alternate repeatedly (convolutional layer and pooling layer) and terminates with a fully connected layer. The input of the network is usually a tensor, since digital images have 3 dimensions, called channels: Red, Green, Blue. The output is the class to which the image belongs, i.e. dog, boat, person etc.

2.4.1 Convolution Intuition

According to its very general definition, a convolution is a mathematical operation between two functions f and g which returns a third function that shows how the first function has shaped the second. In image classification, a convolution is performed between the input and a kernel which should represent a feature. In particular, the kernel is convolved with many submatrices of the input. In other words, it is possible to quantify how the feature kernel shapes each sublocation of the input image. This means that through this approach a feature can be detected locally, by checking in which part of the image that feature is present. For this reason the resulting convolution matrix is called a feature map.

2.4.2 Convolutional Layer

The first layer which processes the input is the convolutional layer. In this layer, the input is convolved with a kernel (or filter), which is a tensor having the same depth as the input, but smaller width and height.

Let us define the input x_l of the l-th layer as a tensor of dimensions H_l × W_l × D_l and a feature kernel as a tensor of dimensions H × W × D_l. Convolutional layers have several filters in order to detect different features. Let us define D as the number of kernels in the l-th layer. Then the set of filters is a tensor of dimensions H × W × D_l × D. This tensor will be called f.

If we call H_{l+1} = H_l − H + 1, W_{l+1} = W_l − W + 1 and D_{l+1} = D, the output of the convolution between the input x_l and the filter set f will have dimensions H_{l+1} × W_{l+1} × D_{l+1}.


For simplicity we explain the convolutional layer through an example in 2 dimensions and using 1 filter, assuming that the input, and consequently the filter, have depth equal to 1.

Table 1: Input

1 0 1 1 2

0 1 1 1 0

2 1 0 1 1

1 0 1 1 2

0 2 1 2 1

Table 2: Filter

1 0 1

0 1 0

1 0 1

The convolution operation is performed by sliding this filter over the input.

At every location, element-wise matrix multiplication is done and the results are summed. If we define y as the feature map matrix, m its width and height, x the input matrix and f the convolutional filter matrix, the following formula holds:

y_{i,j} = Σ_{k=1}^{m} Σ_{l=1}^{m} f_{k,l} x_{i+k−1, j+l−1}    (15)

As we can see from the matrices below, in this case the first element of the feature map is given by the sum of the element-wise multiplication of the top-left elements of the input with the kernel elements, i.e.:

y_{1,1} = 1×1 + 0×0 + 1×1 + 0×0 + 1×1 + 1×0 + 2×1 + 1×0 + 0×1 = 5    (16)

Table 3: First convolution iteration

1×1 0×0 1×1 1 2
0×0 1×1 1×0 1 0
2×1 1×0 0×1 1 1
1   0   1   1 2
0   2   1   2 1

Table 4: First resulting element

5

The actual value which goes into the feature map, however, is not simply the sum resulting from the convolution, but the value which comes from passing the sum through the ReLU (Rectified Linear Unit) activation function:

R(y) = max(0, y)    (17)

The rectifier activation function is used to add non-linearity to the network; otherwise the network would only ever be able to compute a linear function.
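The whole example can be reproduced with a few lines of NumPy (an illustrative sketch written for the input and filter of Tables 1 and 2, applying equation (15) followed by the ReLU of equation (17)):

```python
import numpy as np

x = np.array([[1, 0, 1, 1, 2],   # input of Table 1
              [0, 1, 1, 1, 0],
              [2, 1, 0, 1, 1],
              [1, 0, 1, 1, 2],
              [0, 2, 1, 2, 1]])

f = np.array([[1, 0, 1],         # filter of Table 2
              [0, 1, 0],
              [1, 0, 1]])

m = f.shape[0]
out = x.shape[0] - m + 1         # H_{l+1} = H_l - H + 1 = 3
feature_map = np.zeros((out, out))

for i in range(out):
    for j in range(out):
        # Equation (15): element-wise product of the filter with the current
        # sub-matrix of the input, then sum over all its elements.
        window = x[i:i + m, j:j + m]
        feature_map[i, j] = np.sum(f * window)

feature_map = np.maximum(0, feature_map)   # ReLU, equation (17)
print(feature_map[0, 0])                   # 5.0, as in equation (16)
print(feature_map)
```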


2.4.3 Pooling Layer

The second characteristic layer of a CNN is called the pooling layer. The goal of this layer is to downsample the feature maps by reducing their width and height while at the same time preserving their information. There are 2 ways to do that:

max pooling and average pooling. Referring to the notation introduced in the previous section, we denote by x_l the input of the layer, which this time is a pooling layer. If we define the spatial extent of the pooling layer as (H × W), then the output y of the layer will be a tensor with dimensions H_{l+1} × W_{l+1} × D_{l+1}, where H_{l+1} = H_l / H, W_{l+1} = W_l / W and D_{l+1} = D_l.

The relation between the output y and the input x_l is as follows, depending on which kind of pooling is applied: equation (18) represents the max pooling layer, equation (19) the average pooling layer.

y_{i_{l+1}, j_{l+1}, d} = max{ x^l_{i_{l+1}·H + i, j_{l+1}·W + j, d} : 0 ≤ i < H, 0 ≤ j < W }    (18)

y_{i_{l+1}, j_{l+1}, d} = (1/(H·W)) Σ_{0 ≤ i < H, 0 ≤ j < W} x^l_{i_{l+1}·H + i, j_{l+1}·W + j, d}    (19)

Again, it is easier to understand the pooling layer through a 2-dimensional example. The max pooling operation, which is used much more often than average pooling, outputs at a given position the maximum value of the input that falls within the kernel. If we denote the output of the pooling operation by the matrix y, the input feature map by x, and the extent of the pooling layer by (m, m), the following equation holds:

y_{i,j} = max{ x_{i+k−1, j+l−1} : 1 ≤ k ≤ m, 1 ≤ l ≤ m }    (20)

Average pooling, instead, returns the average of all the values from the portion of the image covered by the kernel:

y_{i,j} = (1/m²) Σ_{k=1}^{m} Σ_{l=1}^{m} x_{i+k−1, j+l−1}    (21)

From the tables below we can see a simple example of both the pooling approaches:

Table 5: Input

12  20  30   0
 8  12   2   0
34  70  37   4
112 100 25  12

Table 6: Max pooling

20  30
112 37

Table 7: Average pooling

13  8
79  19.5
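The same example in code (an illustrative NumPy sketch of the 2x2 max and average pooling of Tables 5-7):

```python
import numpy as np

x = np.array([[ 12,  20, 30,  0],   # input feature map of Table 5
              [  8,  12,  2,  0],
              [ 34,  70, 37,  4],
              [112, 100, 25, 12]])

H = W = 2                            # spatial extent of the pooling layer

def pool(x, H, W, op):
    out = np.zeros((x.shape[0] // H, x.shape[1] // W))
    for i in range(out.shape[0]):
        for j in range(out.shape[1]):
            # Take the (H x W) block starting at (i*H, j*W), as in eqs. (18)-(19).
            out[i, j] = op(x[i * H:(i + 1) * H, j * W:(j + 1) * W])
    return out

print(pool(x, H, W, np.max))    # [[ 20.  30.] [112.  37.]]   (Table 6)
print(pool(x, H, W, np.mean))   # [[ 13.   8.] [ 79.  19.5]]  (Table 7)
```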


Pooling has two main advantages: it reduces the computational cost and it extracts dominant features which are rotationally and positionally invariant, thus helping the model to be trained effectively.

Convolutional and pooling layers are repeated alternately several times and together they perform feature extraction. They form the core part of a CNN.

Figure 5: CNN architecture

2.4.4 Fully connected layer

The sequence of convolutional and pooling layers produces a large number of high-quality features, each with small dimensions compared to the initial input image. Those layers compute their outputs using only a small number of elements from the previous layer; a fully connected layer, instead, uses all of them. The purpose of the fully connected layer is to learn non-linear combinations of the previously produced features. For this reason it usually has the softmax activation function, in order to add non-linearity. Let us define the input vector elements for the softmax activation function as y_1 ... y_K. The activation function operates as follows:

σ(y_i) = e^{y_i} / Σ_{j=1}^{K} e^{y_j}    (22)
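A minimal sketch of equation (22) in NumPy (the shift by max(y) is a standard numerical-stability detail, not something discussed in the text; it leaves the result unchanged):

```python
import numpy as np

def softmax(y):
    # Equation (22): exponentiate and normalise so that the outputs sum to 1.
    e = np.exp(y - np.max(y))
    return e / np.sum(e)

print(softmax(np.array([2.0, 1.0, 0.1])))   # e.g. class scores turned into probabilities
```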


3 Object Detection State Of The Art

In this section the most used object detection methods are described.

All the following methods involve the use of a CNN for classification, as mentioned before. They mostly differ in the location part of the detection, i.e. in how they decide where to apply the CNN in the image. The methods are presented in the chronological order in which they were published.

3.1 Region Proposals

R-CNN [3], Fast R-CNN [5] and Faster R-CNN [20] are methods developed by Ross Girshick et al. between 2013 and 2016. They have a 2-step approach, i.e. they do classification and location in 2 different models: to detect an object, such systems take a classifier for that object and evaluate it at various locations and scales in a test image. In this way the network does not look at the complete image, but scans a lot of regions, called region proposals, trying to understand which one contains the object. The drawback of such methods is that a CNN is required to classify the presence of the object within each region proposal. Since one might want to detect objects of different sizes, this approach leads to a lot of region proposals. Therefore the computation time for predictions using these methods is rather high and they cannot perform in real time. The accuracy of these techniques, however, is very high.

Figure 6: Faster R-CNN


3.1.1 Faster R-CNN

Faster R-CNN is the latest released method within this region proposal approach.

The novelty compared to the previous region proposal methods is the introduction of the Region Proposal Network (RPN), a fully convolutional network which generates region proposals. The key is that the RPN and Fast R-CNN share a set of convolutional layers. A mini-network is applied to an n × n sliding window on the feature maps output by the last shared layer. Such a mini-network, consisting of 2 fully connected layers, outputs the region proposals.

This output, combined with the feature maps from Fast R-CNN, gives the object detection. Such architecture is shown in the image below.

Figure 7: Region Proposals methods

The efficiency of the region proposal generation by the mini-network comes from its use of anchor boxes. For every sliding window, a number k of proposals is generated. Those proposals are centered at the sliding window and have different scales and aspect ratios; their default number is 9. With this approach, the number of proposals can be reduced from the 2000 of R-CNN to 300.

3.2 YOLO

YOLO [2] - You Only Look Once - is an object detection algorithm which differs greatly from the region-based techniques seen above. Indeed, it uses a 1-step approach: a single convolutional neural network predicts the bounding boxes and the class probabilities for these boxes. The input image is divided into an S×S grid; if the center of an object falls into a grid cell, that grid cell is responsible for detecting that object. Each grid cell produces B bounding boxes and their relative confidence scores (from 0 to 1). The total number of bounding boxes per image is 98, which is far fewer than the 2000 of R-CNN and also fewer than the 300 of Faster R-CNN. Hence, a very good advantage is that YOLO can perform in real time. Its prediction time is very small, since it can detect objects at up to 45 frames per second. Also, during the prediction, YOLO sees the entire image. R-CNN, for instance, confuses some background patches for objects because it cannot see the larger context.

Figure 8: YOLO bounding boxes generation

3.2.1 YOLOv3

The YOLOv3 paper [14] was released in 2018. It is the improved version of YOLO and YOLOv2 [15]. The method uses as its base network for feature extraction a CNN called Darknet-53, named after its total number of 53 convolutional layers. Such a network has been developed ad hoc by the YOLO researchers.

YOLOv3 predicts bounding boxes at 3 different scales, i.e. at 3 different stages of the network. For each of the 3 scales, 3 boxes are predicted. This combination of feature maps at different scales leads to a more meaningful detection. Once the boxes are computed, each box predicts the classes using multilabel classification.


Figure 9: Darknet-53

3.3 SSD

Single Shot MultiBox Detector (SSD) [22] is a technique developed in 2016 based on a convolutional network called VGGNet [17]. The main focus of this approach is to speed up the prediction time compared to Faster R-CNN, while at the same time keeping an accuracy level rather high compared to YOLO (first version, 2015).

The SSD method divides the image into a grid at different scales and produces 4 fixed boxes for each grid location, as shown in the picture below. The boxes have different orientations in order to be suitable for different objects: a vertical box would be more suitable for a person, while a horizontal box would be more suitable for a car.

Figure 10: SSD bounding boxes generation


As mentioned before, SSD uses a known high-quality classifier, VGGNet, as its base network: all of the feature extraction part of this CNN is used, i.e. the convolutional and pooling layers. The final classification layers, instead, are truncated and replaced with extra convolutional layers which predict detections at multiple scales, as shown in figure 11.

Figure 11: SSD architecture

4 Comparison

What defines a good object detector are essentially two parameters: speed and accuracy. By speed it is meant the frame processing time; speed is usually measured in frames per second (FPS). By accuracy it is meant how good the detector is at determining whether an object is present in the image and at locating it. Accuracy is measured using the mean Average Precision (mAP) index. These two quantities are linked to each other by a trade-off: in general, by increasing the computational time of the prediction (lower speed), it is possible to reach a better accuracy; in the same way, if the computational time is reduced, so the speed is higher, the technique loses in accuracy. This is an intrinsic characteristic of the CNN architecture: more layers can lead to a better feature extraction, but at the same time they make the forward propagation computationally heavier, so the prediction time gets longer.

4.1 Dataset

Testing such algorithms requires a large amount of labelled data. This is why researchers test object detectors on well-known data sets such as "PASCAL Visual Object Classes" (VOC [8]) and "Common Objects in Context" (COCO).

COCO is larger and has 80 object classes: it has more than 200K labeled images with approximately 1.5M objects.

4.2 Performance

4.2.1 Faster R-CNN performance

In this section there are the performance results of Faster R-CNN.

Figure 12: Faster R-CNN mAP performance on VOC 2012

Figure 13: Faster R-CNN mAP performance on COCO

Figure 14: Timing on a K40 GPU in milliseconds with the PASCAL VOC 2007 test set.


4.2.2 YOLOv3 performance

In this section there are the performance results of YOLOv3.

Figure 15: Yolov3 performance on VOC 2007

Figure 16: YOLOv3 performance on COCO


4.2.3 SSD performance

In this section there are the performance results of SSD.

Figure 17: SSD performance on VOC 2007, VOC 2012, COCO

Figure 18: Faster R-CNN mAP performance on COCO

Figure 19: SSD performance on COCO.


4.3 Why YOLOv3

The tables above are the performance results as published in the original papers of YOLOv3, SSD and Faster R-CNN. For accuracy, the comparison is not immediate, since researchers have tested several versions of the methods and performed the experiments in very different settings. These versions differ in input image resolution, number of region proposals, batch size, the adopted feature extractors (i.e. which CNN to use as base network for the algorithm) and other settings. Furthermore, they have been trained on different data sets and evaluated at different ratios of precision. Ultimately, the tables have been published at different times, so they do not take into consideration the latest versions of each of the other methods. A direct comparison using these tables, therefore, would be unfair.

However, we have to remember the context of this research work: designing and engineering of a vision system for a mobile robot. A severe constraint of the project is that the robot must be able to perform in real time. This means that the chosen object detection method should work while the robot camera is streaming a video. The camera used in the project, Kinect v2, which I will describe later in the report, streams video at 30 frames per second. In order to be real time, the detector should process a number of frames as close as possible to 30 frames per second.

Speed, therefore, assumes a much larger importance than accuracy in this context. From figure 16, which is the latest released, it is clear that YOLOv3 is the fastest object detector, reaching 1000/22 ≈ 45 frames per second.

Referring to the already mentioned speed-accuracy trade-off, we know that a higher speed gives a worse accuracy. This is the case for YOLOv3 on COCO, with 28.3 mAP. This, however, is not a big concern: the robot will stream and process approximately 30 frames per second. In the meantime it will move towards the object at a speed of 4 m/s, which is rather low. During the locomotion we do not expect the system to detect the object of interest in most of the frames; we would be fine with a small percentage of successfully detected frames.

In the end, such considerations lead to YOLOv3 as the final choice for the vision system of the Butler robot application.


5 Methodology

5.1 Point cloud and 3D position

For the point cloud and the 3D position I need to refer to the available hardware for this project, i.e. the Microsoft Kinect for Windows v2 camera [9]. It is a stereo camera, meaning it has a depth sensor. The Kinect v2 streams both in color and depth mode at a speed of 30 frames per second. Its color frame resolution is 1920x1080, while its depth frame is 512x424. Once the camera depth sensor computes the depth Z of an object, it is easy to calculate the other two spatial coordinates X and Y using trigonometry.

Figure 20: Kinect for Windows v2.

Before doing that, though, it is necessary to calibrate the camera, which enables computing parameters such as the focal length (f_x, f_y) and the optical center (c_x, c_y), making clear the relation between the pixel coordinates (u, v) and the real-world coordinates (X, Y).

If we denote by S the depth output by the camera sensor, the following equation holds:

[X, Y, Z]ᵀ = S · [ (u − c_x)/f_x , (v − c_y)/f_y , 1 ]ᵀ    (23)

To perform the camera calibration, I used the Python package OpenCV [12].
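As an indication of how such intrinsics can be obtained, the following is a standard OpenCV chessboard-calibration sketch; the board size and the image folder are assumptions, since the exact calibration procedure used in the project is not documented here.

```python
import glob
import cv2
import numpy as np

pattern = (9, 6)                                     # inner corners of an assumed chessboard
objp = np.zeros((pattern[0] * pattern[1], 3), np.float32)
objp[:, :2] = np.mgrid[0:pattern[0], 0:pattern[1]].T.reshape(-1, 2)

obj_points, img_points = [], []
for fname in glob.glob("calibration_images/*.png"):  # assumed folder of chessboard photos
    gray = cv2.cvtColor(cv2.imread(fname), cv2.COLOR_BGR2GRAY)
    found, corners = cv2.findChessboardCorners(gray, pattern)
    if found:
        obj_points.append(objp)
        img_points.append(corners)

# K is the 3x3 camera matrix: fx, fy on the diagonal, (cx, cy) in the last column.
ret, K, dist, rvecs, tvecs = cv2.calibrateCamera(
    obj_points, img_points, gray.shape[::-1], None, None)
fx, fy, cx, cy = K[0, 0], K[1, 1], K[0, 2], K[1, 2]
```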

5.2 Pykinect2

The original Software Development Kit (SDK) for the Kinect v2 camera is written in C++. A Python wrapper called PyKinect2 [10] is used in order to access all the Kinect functions from Python, since that is the programming language in which the rest of the project is developed.
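A minimal sketch of how color and depth frames can be read through this wrapper (based on the wrapper's published examples; attribute names should be checked against [10]):

```python
from pykinect2 import PyKinectRuntime, PyKinectV2

# Open both the color and the depth streams of the Kinect v2.
kinect = PyKinectRuntime.PyKinectRuntime(
    PyKinectV2.FrameSourceTypes_Color | PyKinectV2.FrameSourceTypes_Depth)

while True:
    if kinect.has_new_color_frame() and kinect.has_new_depth_frame():
        # Frames arrive as flat arrays; reshape them to image form.
        color = kinect.get_last_color_frame().reshape((1080, 1920, 4))  # BGRA pixels
        depth = kinect.get_last_depth_frame().reshape((424, 512))       # depth in millimeters
        break
```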

5.3 Tensorflow

TensorFlow [18] is an open-source machine learning framework developed by the Google Brain team in 2015. It is written in Python, C++ and CUDA.

A few TensorFlow implementations of YOLOv3 were released on GitHub in 2018 and 2019. Among them, [11] is used, together with the tensorflow-gpu Python library. The weights of YOLOv3 trained on COCO are taken from the YOLO website [13].

5.4 System overview

The whole vision system can be divided into 3 main parts: object detection, point cloud and positioning. Object detection takes as input an RGB frame (1920x1080) from the Kinect camera. It then processes the input with the YOLOv3 architecture and outputs the pixel position of the detected object's bounding box. By default, YOLOv3 outputs the pixel coordinates of the corners of each bounding box (Top Left (x, y), Bottom Right (x, y)). Since the objective of the project is to return a single position of the object, a small modification is made so that the algorithm returns the center (u, v) of the bounding box, which is computed as:

u = ( TopLeft(x) + BottomRight(x) ) / 2    (24)

v = ( TopLeft(y) + BottomRight(y) ) / 2    (25)

The second building block, point cloud, takes as input the pixel position (u, v) and, through the Kinect v2 camera depth sensor, returns the depth S of that pixel. Ultimately, the third block takes as input both the depth S and the pixel position (u, v) and returns the real-world, relative-to-camera position (X, Y, Z) of the object. These 3 steps are set in a loop so that, while streaming a video, it is possible to output the 3D position of the detected object for each frame in real time. Compared to the other 2 components, object detection is the core part of the vision system, both in complexity and in computational time.

In the following figure it is possible to see how the vision system processes the frames until it obtains the position of the object of interest.


Figure 21: Vision system diagram representation (Input: RGB frame → Object Detection → (u, v) → Point Cloud → (u, v) + depth → 3D Position → Output: 3D position)
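In code, the per-frame loop of Figure 21 can be sketched as follows. This is pseudocode-level Python: detect, get_depth_at and the frame source are placeholders for the YOLOv3 call, the Kinect depth lookup and the camera stream actually used in the robot software.

```python
def bounding_box_center(top_left, bottom_right):
    """Equations (24)-(25): center pixel (u, v) of a detected bounding box."""
    u = (top_left[0] + bottom_right[0]) / 2
    v = (top_left[1] + bottom_right[1]) / 2
    return u, v

def track_object(frames, detect, get_depth_at, fx, fy, cx, cy, target="cup"):
    """Yield the 3D position of the object of interest for every frame."""
    for frame in frames:                       # RGB frames streamed by the Kinect
        boxes = detect(frame)                  # 1. object detection (YOLOv3)
        for top_left, bottom_right, label, confidence in boxes:
            if label != target:
                continue
            u, v = bounding_box_center(top_left, bottom_right)
            S = get_depth_at(u, v)             # 2. point cloud: depth of pixel (u, v)
            X = S * (u - cx) / fx              # 3. positioning, equation (23)
            Y = S * (v - cy) / fy
            yield (X, Y, S)
```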


5.5 Integration in the Robot Software

The process described above has been integrated into the "soul" software of the robot. The scheme is repeated for every frame of a video: the algorithm loops through each of the 3 components in order to achieve real-time performance. Now, an evaluation of the vision system applied to the Butler robot use case is possible.

6 Evaluation

The goal of this section is to evaluate the performance, in terms of speed and precision, of the implemented vision system for the use case of this project.

6.1 Test scenarios and settings

Before explaining the test settings, we remind the reader of the environment in which the robot operates. It is a square 4x4 meter room. The coffee mug is placed on a table at one side of the room; the robot starts its task, and therefore the detection, from the other side. The robot must be able to find the cup, then to move towards it and grasp it.

In this context, the robot's locomotion with respect to the cup will change several times during the whole task. Thus, for a complete evaluation, this test session considers 3 different scenarios, as follows:

• Robot moving towards the cup, starting at a distance of 4 meters and stopping at 50 centimeters from the cup. The moving speed is 4 m/s. This scenario will be called "MOVING TOWARDS".

• Robot moving horizontally facing the cup, at a distance of 4 meters from the cup. The moving speed is 4 m/s. This scenario will be called "MOVING HORIZONTALLY".

• Robot standing at a distance of 50 centimeters from the cup. This scenario will be called "GRASPING".

In each scenario there will be a precision evaluation of the detection. The following variables will be recorded:

• Correct detection: percentage of frames in which the cup gets detected correctly

• Multiple detection: percentage of frames in which an object which is not a cup gets detected as a cup

• No detection: percentage of frames in which no cup is detected


6.1.1 Tuning parameter

In YOLOv3 it is possible to set the confidence threshold parameter t for running the detection. We remind that this threshold refers to the confidence level of the prediction: 0 means that the algorithm is totally unsure about the prediction, 1 means totally sure. In this context this parameter assumes a large importance:

too high a threshold can result in YOLOv3 missing the object, since it will report it only when it is totally sure. On the other hand, a low threshold can lead to some misdetections, meaning that the algorithm detects as cups some objects which are not cups, due to its uncertainty. For this reason the test is run at different confidence levels, in order to state which is the best for this specific case.
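As a simple illustration of the role of this parameter, consider the sketch below; the detection format is simplified with respect to the actual YOLOv3 implementation [11], and the confidence values are made up.

```python
def filter_detections(detections, t):
    """Keep only the bounding boxes whose confidence score reaches the threshold t."""
    return [d for d in detections if d["confidence"] >= t]

# Hypothetical raw output of the detector for one frame.
detections = [{"label": "cup",    "confidence": 0.82},
              {"label": "bottle", "confidence": 0.35},
              {"label": "cup",    "confidence": 0.15}]

print(len(filter_detections(detections, t=0.1)))   # 3: a low threshold keeps almost everything
print(len(filter_detections(detections, t=0.9)))   # 0: a high threshold discards all boxes
```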

6.2 Results

To run this test an NVIDIA GeForce RTX 2070 GPU was used. The precision results are percentages. The number of evaluated frames for each line of the following tables is approximately 600.

6.2.1 Threshold and Speed

Speed at different threshold levels

Threshold level    Frames per second
0.1                11.1
0.3                13.6
0.5                14.5
0.7                16.9
0.9                18.7

6.2.2 Scenario 1

Moving Towards scenario

Threshold level    % Correct detection    % No detection    % Multiple detection
0.1                22.1                   0.7               77.2
0.3                49.2                   13.7              35.1
0.5                45.4                   42.3              10.3
0.7                39.2                   52.5              8.4
0.9                26.3                   74.6              0


6.2.3 Scenario 2

Moving Horizontally scenario

Threshold level    % Correct detection    % No detection    % Multiple detection
0.1                58.5                   27.6              13.9
0.3                32.3                   66.7              0
0.5                11.1                   88.9              0
0.7                2.3                    97.7              0
0.9                0                      100               0

6.2.4 Scenario 3

Grasping scenario

Threshold level        % Correct detection    % No detection    % Multiple detection
0.1, 0.3, 0.5, 0.7     100                    0                 0
0.9                    97.6                   2.4               0

7 Conclusion

7.1 Speed

The first table shows the speed of the vision system for each confidence threshold. We can observe that the speed increases as the confidence threshold increases. This result is explainable by referring to the architecture of the vision system. Once YOLOv3 predicts the bounding boxes, the vision system loops over them in order to check whether they are cups or not. A low confidence level makes YOLOv3 predict a high number of bounding boxes, and therefore the loop is computationally heavier. On the other hand, with a high confidence level a small number of bounding boxes is predicted, and the loop over the boxes is computationally lighter. This is why the highest speed (18.7 FPS) corresponds to confidence level 0.9 and the lowest (11.1 FPS) to confidence level 0.1.

7.2 Precision

Scenario 1 is the most "complete" of the three, since the robot performs the detection at different distances from the object. It is interesting to notice that this context offers a perfect example of the confidence threshold trade-off mentioned in section 6.1.1. In order to improve the understanding of the data from the table in section 6.2.2, we plot one variable at a time against the confidence levels. As the confidence increases, the number of frames in which there is no cup detection at all gets higher (from 0.7% to 75.6%). Instead, the number of misdetections, initially 77.2% at the 0.1 confidence level, decreases down to 0% when the confidence level is 0.9.

(Plot: % No detection vs. confidence threshold, Scenario 1)

(Plot: % Misdetection vs. confidence threshold, Scenario 1)


The result of this trade-off is that the variable representing the number of correct detections is lower when the threshold is either too low or too high. From the table in section 6.2.2 we have the highest amount of correct detections (49.2%) at the 0.3 confidence level.

(Plot: % Correct detection vs. confidence threshold, Scenario 1)

The scenario 2 evaluation enables us to understand whether the robot can start its task or not, since, if there is no detection from 4 meters away from the cup, it does not know where to go. It is possible to see that here the number of correct detections is a decreasing function, going from 58.5% for confidence = 0.1 to 0% for confidence = 0.9. The opposite trend sees the "No detection" variable being an increasing function, going from 27.6% to 100%. The misdetection variable is always 0 except for confidence = 0.1. In this scenario, thus, it is not possible to observe the trade-off cited above. This is due to the bad performance of YOLOv3 on small objects compared to medium and large objects [14], as can be seen in figure 15. Indeed, from 4 meters away the cup occupies a rather small area of the digital image.


(Plot: % No detection vs. confidence threshold, Scenario 2)

(Plot: % Correct detection vs. confidence threshold, Scenario 2)


(Plot: % Misdetection vs. confidence threshold, Scenario 2)

Scenario 3 concerns the grasping part, as the robot will be standing in front of the table until the cup is grasped. We can see that the algorithm works very well for all the confidence levels, with 100% correct detections for confidence levels 0.1, 0.3, 0.5, 0.7 and 97.6% for level 0.9. The reason is that when the camera is just 50 centimeters away from the object, the latter is rather big in the digital image and the YOLOv3 performance is very good.


8 Future Work

The vision system has been presented as divided into object detection and point cloud. The first part is performed on RGB frames and returns a 2D position, while the point cloud adds the depth dimension through the camera depth sensor. In this work the expression "point cloud" has been slightly abused, as it was referring to the single point output by the camera depth sensor.

In general, however, a point cloud is a set of points in space which can be created digitally by a scanner. Since 2016 [4], researchers have developed deep learning methods for object classification, part segmentation and other processing on point clouds in their raw form. An investigation of this topic would be very interesting in order to state how applicable that kind of approach is to this specific use case. Those methods, indeed, allow one, for instance, to understand the orientation, the precise borders and the volume of objects in 3D space. Such features would be very useful for interacting with objects with a structure more complex than a cup.

Furthermore, taking a broader view, we remind the reader that the Butler robot is just a test platform for AI features. The tested techniques will be involved in future Volvo CE construction machines. In this context, deep learning point cloud methods seem to guarantee a full understanding of the environment. Point cloud networks could be very important in areas such as interaction between vehicles, interaction with humans, the security field and many others.


References

[1] Alex Krizhevsky et al. ImageNet Classification with Deep Convolutional Neural Networks. arXiv. 2012.

[2] Joseph Redmon et al. You Only Look Once: Unified, Real-Time Object Detection. arXiv. 2015.

[3] Ross Girshick et al. Rich feature hierarchies for accurate object detection and semantic segmentation. arXiv. 2013.

[4] Charles R. Qi, Hao Su et al. PointNet: Deep Learning on Point Sets for 3D Classification and Segmentation. arXiv. 2016.

[5] Ross Girshick. Fast R-CNN. arXiv. 2015.

[6] Ian Goodfellow, Yoshua Bengio, and Aaron Courville. Deep Learning. Chapter 6.5.2. MIT Press, 2016. http://www.deeplearningbook.org.

[7] COCO dataset. http://cocodataset.org/home.

[8] The PASCAL Visual Object Classes Homepage. http://host.robots.ox.ac.uk/pascal/VOC/.

[9] Kinect - Windows app development - Microsoft Developer. https://developer.microsoft.com/en-us/windows/kinect.

[10] Wrapper to expose Kinect for Windows v2 API in Python. https://github.com/Kinect/PyKinect2.

[11] Tensorflow implementation of YOLOv3. https://github.com/YunYang1994/tensorflow-yolov3.

[12] OpenCV. https://opencv.org/.

[13] YOLO: Real-Time Object Detection - Joseph Redmon. https://pjreddie.com/darknet/yolo/.

[14] Joseph Redmon and Ali Farhadi. YOLOv3: An Incremental Improvement. arXiv. 2018.

[15] Joseph Redmon and Ali Farhadi. YOLO9000: Better, Faster, Stronger. arXiv. 2016.

[16] Kaiming He, Xiangyu Zhang et al. Deep Residual Learning for Image Recognition. arXiv. 2015.

[17] Karen Simonyan and Andrew Zisserman. Very Deep Convolutional Networks for Large-Scale Image Recognition. arXiv. 2014.

[18] Martín Abadi et al. TensorFlow: Large-Scale Machine Learning on Heterogeneous Systems. Software available from tensorflow.org. 2015. URL: http://tensorflow.org/.

[19] Keiron O'Shea and Ryan Nash. An Introduction to Convolutional Neural Networks. arXiv. 2015.

[20] Shaoqing Ren, Kaiming He, Ross Girshick, and Jian Sun. Faster R-CNN: Towards Real-Time Object Detection with Region Proposal Networks. arXiv. 2015.

[21] Tsung-Yi Lin, Michael Maire et al. Microsoft COCO: Common Objects in Context. arXiv. 2014.

[22] Wei Liu, Dragomir Anguelov et al. SSD: Single Shot MultiBox Detector. arXiv. 2016.


