
DEGREE PROJECT IN THE FIELD OF TECHNOLOGY ENGINEERING PHYSICS
AND THE MAIN FIELD OF STUDY COMPUTER SCIENCE AND ENGINEERING,
SECOND CYCLE, 30 CREDITS

STOCKHOLM, SWEDEN 2017

Vision based indoor object detection for a drone

LINNEA GRIP

KTH ROYAL INSTITUTE OF TECHNOLOGY

SCHOOL OF COMPUTER SCIENCE AND COMMUNICATION


Vision based indoor object detection for a drone

LINNEA GRIP

Master in Computer Science
Date: June 8, 2017
Supervisor: Patric Jensfelt
Examiner: Hedvig Kjellström
Swedish title: Bildbaserad detektion av inomhusobjekt för drönare
School of Computer Science and Communication


Abstract

Drones are a very active area of research and object detection is a crucial part in achieving full autonomy of any robot. We investigated how state-of-the-art object detection algorithms perform on image data from a drone. For the evaluation we collected a number of datasets in an indoor office environment with different cameras and camera placements. We surveyed the object detection literature and selected the algorithm R-FCN (Region-based Fully Convolutional Network) for the evaluation. The performances on the different datasets were then compared, showing that using footage from a drone may be advantageous in scenarios where the goal is to detect as many objects as possible. Further, it was shown that the network, even if trained on normal angled images, can be used for detecting objects in fish eye images, and that usage of a fish eye camera can increase the total number of detected objects in a scene.


Sammanfattning

Drönare är ett mycket aktivt forskningsområde och objektigenkänning är en viktig del för att uppnå full självstyrning för robotar. Vi undersökte hur dagens bästa objektigenkänningsalgoritmer presterar på bilddata från en drönare. Vi gjorde en litteraturstudie och valde att undersöka algoritmen R-FCN (Region based Fully Convolutional Network).

För att evaluera algoritmen spelades flera dataset in i en kontorsmiljö med olika kameror och kameraplaceringar. Prestandan på de olika dataseten jämfördes sedan och det visades att användningen av bilder från en drönare kan vara fördelaktig då målet är att hitta så många objekt som möjligt. Vidare visades att nätverket, även om det är tränat på bilder från en vanlig kamera, kan användas för att hitta objekt i vidvinklade bilder och att användningen av en vidvinkelkamera kan öka det totala antalet detekterade objekt i en scen.


Contents

1 Introduction
  1.1 Research Question and Hypotheses
  1.2 Limitations
  1.3 Report Outline

2 Background
  2.1 Convolutional Neural Networks
  2.2 Other object detection methods
  2.3 Common Datasets
  2.4 Metrics

3 Related work
  3.1 Drones
  3.2 Object detection

4 The Object Detection Algorithm

5 Method
  5.1 Experiment Design
  5.2 Evaluation

6 Experiments
  6.1 Fish Eye Camera
    6.1.1 Experimental Setup
    6.1.2 Results
    6.1.3 Analysis
  6.2 Distance to Objects
    6.2.1 Experimental Setup
    6.2.2 Results
    6.2.3 Analysis
  6.3 Camera Angle
    6.3.1 Experimental Setup
    6.3.2 Results
    6.3.3 Analysis

7 Real Drone
  7.1 Setup
  7.2 Results and Analysis

8 Summary and Discussion
  8.1 Conclusions
  8.2 Error Sources
  8.3 Connection to Other Research
  8.4 Future Work
  8.5 Summary

Bibliography

A Social Aspects
  A.1 Sustainability
  A.2 Ethics
  A.3 Society


Chapter 1

Introduction

Object detection is important for reaching higher level autonomy for robots. It is a very active area of research in robotics, applied computer vision and machine learning. Unmanned Aerial Vehicles (UAVs), or drones, are being used more and more as robotic platforms. It is therefore of interest to see how methods developed in computer vision and machine learning, and used on other robot embodiments, can be applied to drones.

The objective of this degree project is to determine how an existing object detection method can be used on image data from a drone. One of the advantages of using a drone to detect objects in a scene may be that the drone can move close to objects compared to, for example, a wheeled robot. The drone may therefore be able to detect more "small" objects.

Here "small" objects are defined as objects that can easily be held in one hand, such as cups, cell phones and bottles. We examine whether the distance from which objects are viewed by a camera makes a difference in object detection performance.

When a drone navigates a building in search of objects, it is of interest for the drone to be able to view as much of its surroundings as possible. To achieve a large field of view the camera could be mounted on a tilting mechanism on the drone. However, this would add weight to the drone, so a wide-angle (fish eye) camera is used instead. Images taken by a fish eye camera are distorted and quite different from images taken by a normal camera. Therefore, it cannot be assumed that object detection algorithms normally used on "normal" images perform well on fish eye images. Part of the study is to investigate how algorithms widely used on normal images perform on fish eye images.

Previous works ([1], [2]) stress that the images captured by a drone are often different from those available for training, which are typically taken by a hand-held camera. Depending on what type of images the network is trained on, difficulties in detecting objects in data from a drone may arise because of how the camera is positioned compared to images taken by a human. Therefore, different ways of positioning the drone and the camera with respect to objects will be evaluated.


1.1 Research Question and Hypotheses

How can the best performance of an object detection algorithm in an indoor scene be obtained using the flexibility of a drone when the goal is to detect as many objects as possible?

After a literature study the algorithm that is currently best suited for indoor detection of objects is chosen. The chosen algorithm is then evaluated on different datasets in order to determine whether there are any benefits and/or drawbacks in using data acquired by a drone when trying to detect objects in an indoor scene, what type of camera to use, and how to take advantage of the flexibility of the drone.

Several hypotheses will be addressed, including the following.

1. The chosen algorithm can be used on image data acquired by a drone.

2. The chosen algorithm, trained on images from a normal camera, can be used to some extent on images from a fish eye camera.

3. More objects can be detected in data from a fish eye camera than from a normal camera, because of the larger field of view.

4. More objects can be detected from a closer viewpoint.

5. The number of detected object instances depends on the angle of the camera.

1.2 Limitations

It will be assumed that a drone equipped with an RGB camera sends a continuous stream of images to a computer, which then performs the computations off board the drone. It is not part of the project to perform lightweight object detection on board the drone. The drone will navigate (not part of the project) an indoor, office-like environment and encounter and try to detect objects. It is expected to be able to detect objects such as chairs, screens and people. However, detection of smaller objects such as mugs and cell phones will also be attempted.

1.3 Report Outline

In Chapter 2 relevant theory of object detection is outlined. Chapter 3 touches on important works in the areas of object detection and drones. Chapter 4 briefly describes the algorithm used for detecting objects throughout the project.

The general method used for performing experiments and evaluating performance of the object detection algorithm on different datasets is described in Chapter 5.

The three sections of Chapter 6 each present an experiment. They contain first a brief motivation of why the experiment was important, then a description of the experimental setup, the results obtained and lastly a short analysis of the results. These three experiments were carried out without using a real drone.


Chapter 7 then shows data acquired by a camera mounted on a real, flying drone, and the detections as predicted by the algorithm.

The results are discussed in Chapter 8, which also proposes future work. In particular, Section 8.5 contains a summary of the report, and Appendix A presents a brief discussion of social aspects of using drones and object detection.


Chapter 2

Background

In this chapter important theory and concepts that may not be common knowledge are explained.

Object detection entails detecting instances of predefined object classes in images. Each detected object instance should also be localized with a so-called bounding box, a box containing the object in the image. There are many ways of performing object detection, each method with different strengths and weaknesses.

2.1 Convolutional Neural Networks

Convolutional Neural Networks (CNNs) are special types of Neural Networks that are especially well suited for use on images. Exploiting the structure of image data allows the architecture to be optimized so that the number of parameters of the network is reduced, compared to a regular Neural Network, and the method made more efficient [3].

A CNN generally consists of two main parts: convolutional layers followed by fully connected layers. CNNs are trained end-to-end, that is, from pixels to final classification, without needing to introduce any particular feature extractor, which makes CNNs a good choice for various general object detection tasks. However, training a CNN requires very large sets of images compared to other object detection methods [4]. There are several CNN-based methods available, and state-of-the-art object detection of today builds on CNNs, as will be described in Chapter 3.

Convolutional layers

The convolutional layers of a CNN perform sliding window operations and output feature maps. Each convolutional layer of a CNN represents a certain type of feature, and each corresponding output feature map is a spatial activation image indicating where the responses to the feature of interest are strongest. For example, a convolutional layer applied to an image of a box could output a feature map showing strong activations at the positions of the corners of the box. In Figure 2.1, the depth of the dotted box represents the number of these feature layers.


In the learning process the weights of the convolutional layers are tuned so that features that help minimize the prediction error are weighted more heavily than less helpful features. The convolutional layers do not require any specific image size.

Figure 2.1: Schematic figure of a CNN. Figure copied from [3].

Fully-connected layers

Fully-connected layers are often added on top of the convolutional layers to perform the actual classification, since the feature maps output by the convolutional layers are still low level. The fully-connected layers are built in the same way as regular Neural Networks and have full connectivity, since all neurons of each layer are connected to all outputs of the previous layer. Often, the fully-connected layers consist of regular Neural Networks or other classifiers, such as support vector machines (e.g. [5], [6]). These layers take the feature maps as input and classify objects in the image depending on which features are activated in the feature maps. The fully-connected layers require fixed-size input vectors - a property that used to cause problems when images were of differing sizes (e.g. [5], [7]) but has since been addressed (e.g. [6], [8]), as will be mentioned in Chapter 3.
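To make the structure of convolutional layers followed by fully-connected layers concrete, the sketch below defines a minimal, hypothetical CNN classifier in PyTorch. The layer sizes, the 10-class output and the 64x64 input resolution are illustrative assumptions; they are not the network used in this thesis, which relies on a pre-trained ResNet-101 backbone (see Chapter 4).

```python
import torch
import torch.nn as nn

class TinyCNN(nn.Module):
    """Minimal CNN: convolutional feature extractor + fully-connected classifier."""
    def __init__(self, num_classes: int = 10):
        super().__init__()
        # Convolutional part: slides learned filters over the image and
        # outputs feature maps that activate where the learned features occur.
        self.features = nn.Sequential(
            nn.Conv2d(3, 16, kernel_size=3, padding=1), nn.ReLU(),
            nn.MaxPool2d(2),   # 64x64 -> 32x32
            nn.Conv2d(16, 32, kernel_size=3, padding=1), nn.ReLU(),
            nn.MaxPool2d(2),   # 32x32 -> 16x16
        )
        # Fully-connected part: requires a fixed-size input vector,
        # which is why the input resolution is fixed in this sketch.
        self.classifier = nn.Sequential(
            nn.Flatten(),
            nn.Linear(32 * 16 * 16, 128), nn.ReLU(),
            nn.Linear(128, num_classes),
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.classifier(self.features(x))

# Example: a batch of two 64x64 RGB images -> class scores.
scores = TinyCNN()(torch.randn(2, 3, 64, 64))
print(scores.shape)  # torch.Size([2, 10])
```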

Region Proposals

Many object detection methods of today rely on some type of region proposal algorithm, which can be integrated with (e.g. [8], [9]) or separate from (e.g. [5], [7], [6]) the CNN itself. The region proposal algorithm suggests regions, or bounding boxes, in the image that are likely to contain objects, so that the rest of the computations (classification and finer localization) only need to be made in these probable regions.

2.2 Other object detection methods

There are several ways, apart from CNNs, to perform object detection. The different methods have different strengths and weaknesses, such as different computation times, accuracy or performance on different types of objects. For example, methods based on HOG [10] or SIFT [11] may be more suitable for on-board classification (for example on a drone), because they require less memory and run on a CPU. However, as of today CNNs are the primary approach to most object detection problems [12], with outstanding performance.


2.3 Common Datasets

Several datasets are widely used in the object detection community to train and to evaluate the performance of different methods and networks on standard images. Some of the largest are the 20-category PASCAL Visual Object Classes (VOC) challenge [13]; ImageNet [14], with millions of classified images and at least one million images with corresponding bounding boxes; and Microsoft COCO [15], a dataset of more than 300,000 images and 80 labeled categories, including smaller objects such as fruit, cell phones and computer mice in natural, everyday scenes.
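As an illustration of how such a dataset is typically consumed, the sketch below reads COCO-style annotations with the pycocotools API. The annotation path is a placeholder, and the snippet is not part of the thesis pipeline; it only shows how category ids, image ids and ground-truth boxes relate.

```python
from pycocotools.coco import COCO

# Placeholder path; point it at a real COCO annotation file.
coco = COCO("annotations/instances_val2017.json")

# Look up the category id for "cup" and all images containing cups.
cup_id = coco.getCatIds(catNms=["cup"])[0]
img_ids = coco.getImgIds(catIds=[cup_id])

# Ground-truth boxes are stored as [x, y, width, height] per annotation.
first_img = coco.loadImgs(img_ids[:1])[0]
anns = coco.loadAnns(coco.getAnnIds(imgIds=first_img["id"], catIds=[cup_id]))
for ann in anns:
    print(first_img["file_name"], ann["bbox"])
```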

2.4 Metrics

There are some common methods for measuring performance of object detection.

Intersection over Union

Introduced in [13], the Intersection over Union (IoU) is a metric commonly used in object detection for evaluating correctness of a bounding box. IoU is computed by

\[ \text{IoU} = \frac{\text{intersection area}}{\text{union area}} \tag{2.1} \]

where the intersection area is the area of the intersection between the predicted bounding box and the true bounding box (their overlap), and the union area is the area of their union. A predicted bounding box close to the true bounding box yields an IoU close to 1.
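A minimal sketch of this computation for axis-aligned boxes is shown below; the (x1, y1, x2, y2) corner convention is an assumption made for illustration.

```python
def iou(box_a, box_b):
    """Intersection over Union of two axis-aligned boxes (x1, y1, x2, y2)."""
    # Intersection rectangle.
    x1 = max(box_a[0], box_b[0])
    y1 = max(box_a[1], box_b[1])
    x2 = min(box_a[2], box_b[2])
    y2 = min(box_a[3], box_b[3])
    inter = max(0.0, x2 - x1) * max(0.0, y2 - y1)
    # Union = sum of the two areas minus the overlap.
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0

print(iou((0, 0, 10, 10), (5, 5, 15, 15)))  # 25 / 175 ≈ 0.143
```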

Precision and Recall

Precision of a classifier on a dataset is defined as the number of true positives over the total number of detected positives, that is

\[ \text{Precision} = \frac{\text{number of true positives}}{\text{number of true positives} + \text{number of false positives}} \tag{2.2} \]

Here, a true positive is a detection of an instance that is actually present in the image, and a false positive is a detection of an instance that is not present in the image. That is, the number of true positives is the number of objects correctly classified as a certain class, and the number of false positives is the number of objects incorrectly classified as that class. When no false positives are detected the precision is 1, regardless of whether there are any true positives. A precision of 1 means that all detected objects were true, but says nothing about how many of the objects actually present were missed.

In the same way, the recall of a classifier on a dataset is defined as the number of true positives over the true number of instances, that is

\[ \text{Recall} = \frac{\text{number of true positives}}{\text{number of true positives} + \text{number of false negatives}} \tag{2.3} \]


Here, a false negative is an instance of an object that is present in the image but not detected. When there are no false negatives the recall is 1, regardless of whether there are any true positives or not. A recall of 1 only means that no objects that should have been detected were left out, and says nothing about the quality of the actual predictions made.

It is desirable to maximize both precision and recall, so that few instances are wrongly classified while at the same time few instances that should have been classified are left out.

F1 score

The F1 score is a way to summarize precision and recall in one number to evaluate the overall performance of a classifier. The F1 score is defined as

\[ F_1 = \frac{2 \cdot \text{precision} \cdot \text{recall}}{\text{precision} + \text{recall}} \tag{2.4} \]
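The three quantities above follow directly from counts of true positives, false positives and false negatives. The sketch below is a straightforward illustration, not code from the thesis; the handling of the zero-denominator cases mirrors the convention stated in the text (no false positives gives precision 1, no false negatives gives recall 1).

```python
def precision_recall_f1(tp: int, fp: int, fn: int):
    """Precision, recall and F1 from true-positive, false-positive and
    false-negative counts, following equations (2.2)-(2.4)."""
    precision = tp / (tp + fp) if (tp + fp) > 0 else 1.0  # no detections -> no false alarms
    recall = tp / (tp + fn) if (tp + fn) > 0 else 1.0     # nothing to find -> nothing missed
    f1 = (2 * precision * recall / (precision + recall)
          if (precision + recall) > 0 else 0.0)
    return precision, recall, f1

# Hypothetical counts: 59 correct detections, 4 false detections, 90 missed objects.
print(precision_recall_f1(59, 4, 90))  # ≈ (0.94, 0.40, 0.56)
```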

Mean Average Precision

The Average Precision of a category is related to the area under its precision-recall curve, that is, precision plotted against recall. It is desirable for this area to be large, so that both precision and recall are maximized. The mean Average Precision (mAP) is the Average Precision averaged over all class categories in a dataset and is a common way of evaluating how well an object detection method performs.
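A simplified sketch of this idea is given below: it approximates AP as the trapezoidal area under a sampled precision-recall curve and averages the per-class values to obtain mAP. Benchmarks such as PASCAL VOC use an interpolated variant of AP, so this is only illustrative, and the sample curves are hypothetical.

```python
import numpy as np

def average_precision(recall_pts, precision_pts):
    """Approximate AP as the trapezoidal area under a precision-recall curve."""
    order = np.argsort(recall_pts)
    r = np.asarray(recall_pts, dtype=float)[order]
    p = np.asarray(precision_pts, dtype=float)[order]
    # Sum of trapezoids between consecutive recall points.
    return float(np.sum((r[1:] - r[:-1]) * (p[1:] + p[:-1]) / 2.0))

def mean_average_precision(per_class_curves):
    """mAP: the per-class AP values averaged over all categories."""
    return float(np.mean([average_precision(r, p) for r, p in per_class_curves]))

# Hypothetical precision-recall samples for two object classes.
curves = [([0.0, 0.5, 1.0], [1.0, 0.8, 0.4]),
          ([0.0, 0.4, 0.8], [1.0, 0.7, 0.5])]
print(mean_average_precision(curves))
```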


Chapter 3

Related work

In this chapter previous work related to the project is briefly described. First, research related to drones and how computer vision has been used on drones is surveyed. Secondly, research in the area of object detection is described, followed by a short description of fine tuning of a CNN.

3.1 Drones

Drones are platforms capable of flying, e.g. small unmanned helicopters. A drone, like other robots, can be programmed to different levels of autonomy, from being radio controlled to being fully autonomous. To achieve full autonomy, a well developed navigation and perception system is required. Drones are very flexible compared to ground based robots, as they can fly over and around things and thus view objects from a larger variety of angles. However, there is a limit to how much weight one can put on a drone, which in turn limits the number of sensors, the on-board computational power and so on. Data can, however, be streamed to a larger computer and processed there.

In 2014 imagery from a drone was used to count animals in images of natural environments [1]. They used imagery taken from high altitude (10-100 meters) at a skewed angle compared to "human" photos, which are usually taken from the front at an altitude of about 1-2 meters. Since their goal was to perform object detection on board the drone, GPU-requiring CNN methods were not applicable at the time, and a HOG [10] based method was used. [1] stresses that most object detection algorithms are trained and tested on images taken from a "human" perspective, that is, from a certain height and angle, and can thus not be assumed to perform well on other types of images.

Drones have also been used for tracking objects on the ground, as in [16] where color thresholding was used to detect a colored rectangle to follow. In this case, no classification of the object was made.

Further, [2] used an RGB camera together with a heat camera to detect humans from on board a drone. They first found human-temperature silhouettes and then used a cascade of boosted classifiers with Haar-like features on the corresponding position in the RGB image to confirm the presence of a human.


Also here it is stressed that the images of interest are very different from images generally used in computer vision (which have a "human" perspective), since they are taken from a large height and thus at a skewed angle.

3.2 Object detection

Already in 1989 the first deep learning approach to object detection was proposed in [17], where supervised back-propagation networks were used to detect handwritten digits in zip codes. However, until 2012 methods based on feature extraction, such as SIFT [11] and HOG [10], were in focus, and performance on the PASCAL VOC challenge improved slowly.

In 2012, [18] reintroduced the usage of Convolutional Neural Networks in object detection and won the ImageNet Large-Scale Visual Recognition Challenge [14] with their network called AlexNet. This was the starting point for a lot more research on CNNs in object detection.

[5] combined AlexNet with region proposals in 2013 (using Selective Search [19]) and thus improved performance on PASCAL VOC significantly (from the previous best result of 35.1% mAP [19] to 53.7% mAP). The method was named R-CNN (Regions with CNN features) since it first generates region proposals for the input image and then extracts a feature vector for each proposed region using a CNN. Lastly, each region is classified using a Support Vector Machine (SVM).

In 2014, [12] used the features extracted by a CNN called OverFeat [4] in various recognition tasks such as image classification and scene recognition. They achieved astounding results compared to the then state-of-the-art methods in all tasks on various datasets, including PASCAL VOC [13], and thus showed that deep learning with CNNs should be considered the primary approach in any visual recognition task.

Spatial Pyramid Pooling networks (SPPnets [6]) addressed the problem of earlier CNNs requiring fixed-size input images in 2015 by adding an SPP layer between the last convolutional layer and the first fully-connected layer. In this way, the need to crop or warp images in order to run them through a CNN was eliminated. Further, SPPnets sped up R-CNN by sharing computation across regions; that is, in SPPnets the features of an image are computed only once instead of separately for each region of interest. SPPnets proved to be 24-102x faster than R-CNN while performing better or comparably [6].

Also in 2015, Fast R-CNN [7] improved on R-CNN [5] further by proposing a network that can simultaneously be trained to classify objects and to refine their spatial locations, leading to a significant increase in training speed (9x faster than R-CNN [5] and 3x faster than SPPnet [6]) while also achieving better accuracy on PASCAL VOC (66% mAP).

ResNet [20] introduced a deep residual learning framework at the end of 2015, which allowed networks to grow much deeper than before. The network layers were reformulated as learning residual functions with reference to their inputs instead of learning unreferenced functions, and it was shown that the residual mappings can be optimized more easily than the original mappings.

In 2016 the R-CNN algorithm was developed even further by integrating Fast R-CNN


[7] with a Region Proposal Network (RPN), resulting in Faster R-CNN [9]. Until [9], the main bottleneck in object detection was the region proposals, which were often time consuming. The RPN of [9] shares convolutional layers with the object detection networks [7], [6] and simultaneously regresses region bounds and the probability that the region contains an object, at each location on a grid over the image. Usage of RPNs ensures nearly cost-free region proposals and also improves the accuracy of the proposed regions.

Later in 2016, [8] proposed the Region-based Fully Convolutional Network (R-FCN), which improved object detection performance by further centralizing the method. While the previous methods had, to different extents, performed some computations several times for different regions of the image, R-FCN is fully convolutional, with almost all computation shared across the whole image. To date, R-FCN is considered a state-of-the-art method for object detection, and the work of this project is therefore based on R-FCN.


Chapter 4

The Object Detection Algorithm

Based on the findings of the previous chapter, R-FCN [8], being one of the best object detection frameworks of today with competitive accuracy and fast computations, is used in this project. The details of R-FCN can be found in the paper [8], but a brief overview of the architecture used is given here.

Figure 4.1: Key idea of R-FCN for object detection. Figure copied from [8].

Figure 4.1 shows the overall architecture of R-FCN. The first "white box" consists of a backbone network, in this case ResNet-101 [20]. ResNet-101 is a residual network with 100 convolutional layers followed by a pooling layer and a fully connected classification layer. Here, the two last layers are removed and the 100 convolutional layers are used to compute feature maps.

From these feature maps, k × k × (C + 1) position-sensitive score maps are computed (the last "plate" in Figure 4.1). Here C is the number of object categories (+1 for background) and k is the dimension of the position-sensitive score maps (3 × 3 in the figure).


These score maps are each activated at a specific position relative to a certain object category, for example the top-left or bottom-right part of the object. For each object category there are k² score maps. An example showing how the position-sensitive score maps work is shown in Figure 4.2.

Figure 4.2: Illustration of the position-sensitive score maps of R-FCN, with k = 3. The figure is copied from [8].

Simultaneously, Regions of Interest (RoIs) are extracted from the same output feature maps using the Region Proposal Network (RPN) of [9]. A pooling layer then generates (C + 1)-channel score maps for each RoI, using the information from the position-sensitive score maps. Finally, the categories and bounding boxes are computed using a Softmax function [21] and a box regression convolutional layer, respectively.
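A simplified NumPy sketch of the position-sensitive pooling step is shown below. It assumes score maps of shape (k·k·(C+1), H, W) and an RoI already expressed in feature-map coordinates, and it uses plain average pooling per bin; the real R-FCN implementation differs in details such as bin rounding and the separate box-regression branch, so this only conveys the idea.

```python
import numpy as np

def ps_roi_pool(score_maps, roi, k=3, num_classes=80):
    """Position-sensitive RoI pooling (simplified).

    score_maps: array of shape (k*k*(C+1), H, W), where C+1 = num_classes + 1.
    roi: (x1, y1, x2, y2) in feature-map coordinates.
    Returns softmax class scores of shape (C+1,).
    """
    c1 = num_classes + 1
    x1, y1, x2, y2 = roi
    bin_w = (x2 - x1) / k
    bin_h = (y2 - y1) / k
    scores = np.zeros(c1)
    for i in range(k):          # bin row (top ... bottom)
        for j in range(k):      # bin column (left ... right)
            ys = int(np.floor(y1 + i * bin_h))
            ye = max(ys + 1, int(np.ceil(y1 + (i + 1) * bin_h)))
            xs = int(np.floor(x1 + j * bin_w))
            xe = max(xs + 1, int(np.ceil(x1 + (j + 1) * bin_w)))
            # Each bin reads only its own group of (C+1) channels.
            group = score_maps[(i * k + j) * c1:(i * k + j + 1) * c1, ys:ye, xs:xe]
            scores += group.mean(axis=(1, 2))
    scores /= k * k             # vote by averaging over the k*k bins
    e = np.exp(scores - scores.max())
    return e / e.sum()          # softmax over the C+1 categories

maps = np.random.rand(3 * 3 * 81, 40, 40)
print(ps_roi_pool(maps, roi=(5, 8, 29, 33)).shape)  # (81,)
```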

The network used in this project is pre-trained on an 80-class dataset from Microsoft COCO [15]. Several of the classes present in the dataset are "small", as defined in Chapter 1.


Chapter 5

Method

The hypotheses stated in Section 1.1 are addressed in three different experiments. In this chapter, the general method of the experiments is described. Each experiment is described in more detail in Chapter 6.

5.1 Experiment Design

In all of the experiments a hand-held camera is used instead of a camera mounted on a flying drone. Not using a real drone facilitates the experiments greatly, since controlling a drone is difficult. Further, images obtained by hand are assumed to be very similar to the corresponding images that would have been obtained using a drone. In Chapter 7, detections made on images recorded from a real drone are displayed to show that this assumption holds.

The procedure of each experiment includes the following steps:

1. Record various image sequences and extract a number of images (between 20 and 25), equally spaced in time.

2. Manually annotate ground truth bounding boxes to the images.

3. Input the images to R-FCN and save the resulting bounding boxes.

4. Compare the bounding boxes from R-FCN with the ground truth bounding boxes.

The evaluation method is described in Section 5.2.

In step 1 between 20 and 25 images are extracted from the image sequences. In each of these images several object instances are generally present so that the total number of objects in each dataset is larger than the number of images.
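One hypothetical way to pull such equally spaced frames from a recorded sequence with OpenCV is sketched below; the file name and the frame count are placeholders, and the thesis does not specify the exact extraction tooling.

```python
import cv2
import numpy as np

def extract_frames(video_path, n_frames=25):
    """Return n_frames frames sampled at equal temporal spacing from a video."""
    cap = cv2.VideoCapture(video_path)
    total = int(cap.get(cv2.CAP_PROP_FRAME_COUNT))
    frames = []
    for idx in np.linspace(0, total - 1, n_frames).astype(int):
        cap.set(cv2.CAP_PROP_POS_FRAMES, int(idx))  # seek to the frame index
        ok, frame = cap.read()
        if ok:
            frames.append(frame)
    cap.release()
    return frames

# Placeholder path; each extracted frame can then be annotated and fed to R-FCN.
frames = extract_frames("office_sequence.mp4", n_frames=25)
print(len(frames))
```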

In step 2 the ground truth bounding boxes were manually annotated in the images. All objects that could be identified by looking at an image were annotated with a bounding box. That is, even objects that took up only a small number of pixels in an image were annotated, as long as they could be identified. This holds for objects close to the edges of the images as well.


Two of the experiments are designed to directly address some of the hypotheses stated in Section 1.1.

In one experiment the numbers of detected objects in three different datasets (one recorded with a normal camera, that is, a non fish eye camera; one recorded with a fish eye camera; and one recorded with a fish eye camera and then rectified) are compared in order to determine with which type of camera most objects can be detected (hypotheses 2 and 3 in Section 1.1).

In another experiment the numbers of detected objects in four different datasets, recorded from different horizontal and vertical distances to a table with objects on it, are compared in order to determine from what distance most objects can be detected (hypothesis 4 in Section 1.1).

The third and last experiment compares the number of detected objects in three datasets recorded with different camera tilt angles, in order to determine how to mount the camera on the drone.

5.2 Evaluation

The experiments in Chapter 6 each contain at least two different datasets. The performance of R-FCN on the different datasets is compared, rather than defining a threshold for a "good" or "bad" performance. That is, since each experiment is designed to show in what way most objects can be detected, it is of more interest to see on which of the datasets R-FCN performs better than to state whether it performs well on each individual dataset.

To evaluate performance, precision, recall and F1 scores are computed both for individual class categories and as an average over all categories present in a dataset. The procedure for computing these values is described in Chapter 2. Further, since the goal is to detect as many objects as possible, as stated in Section 1.1, the total number of correctly detected objects as well as the total number of objects actually present in the images are counted for each dataset.

When computing precision and recall, what counts as a correct classification, or a true positive, needs to be defined. Here an IoU threshold of IoU > 0.5 for a true positive is used, as shown in Figure 5.1. This is the standard IoU threshold of PASCAL VOC [13] and is also used in, for example, [8] and [9].
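A sketch of how detections can be matched against ground-truth boxes under this IoU > 0.5 rule is given below. The greedy one-to-one matching and the (x1, y1, x2, y2) box format are assumptions made for illustration, not the exact evaluation script used in the thesis.

```python
def iou(a, b):
    """IoU of two (x1, y1, x2, y2) boxes."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0, ix2 - ix1) * max(0, iy2 - iy1)
    union = ((a[2] - a[0]) * (a[3] - a[1]) +
             (b[2] - b[0]) * (b[3] - b[1]) - inter)
    return inter / union if union > 0 else 0.0

def match_detections(detections, ground_truths, thr=0.5):
    """Greedily match detections to ground truths of the same class.

    detections / ground_truths: lists of (class_name, (x1, y1, x2, y2)).
    Returns (true_positives, false_positives, false_negatives).
    """
    unmatched = list(ground_truths)
    tp = fp = 0
    for cls, box in detections:
        best, best_iou = None, thr
        for gt in unmatched:
            if gt[0] == cls and iou(box, gt[1]) > best_iou:
                best, best_iou = gt, iou(box, gt[1])
        if best is not None:
            unmatched.remove(best)   # each ground truth may be matched only once
            tp += 1
        else:
            fp += 1                  # wrong class or IoU not above 0.5
    return tp, fp, len(unmatched)    # leftover ground truths are false negatives

dets = [("cup", (10, 10, 30, 30)), ("cup", (100, 100, 120, 120))]
gts = [("cup", (12, 11, 31, 32)), ("keyboard", (50, 50, 90, 70))]
print(match_detections(dets, gts))   # (1, 1, 1)
```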


(a) IoU > 0.5, positive.

(b) IoU < 0.5, negative.

Figure 5.1: Illustration of the IoU requirement for a true positive.


Chapter 6

Experiments

This chapter contains three sections which each describe one experiment. They start with a short motivation of why the experiment was performed followed by a description of the experimental setup, the results and finally a short analysis of the results.

6.1 Fish Eye Camera

The goal of this experiment was to show whether a network trained on non fish eye images can be used on fish eye images with satisfactory results. To the best of the author's knowledge this has not been tested before, and the results are used in the choice of camera for the remainder of the project. This experiment addresses hypotheses 2 (The chosen algorithm, trained on images from a normal camera, can be used to some extent on images from a fish eye camera.) and 3 (More objects can be detected in data from a fish eye camera than from a normal camera, because of the larger field of view.) of Section 1.1.

6.1.1 Experimental Setup

A fish eye camera (with a field of view close to 180 degrees) and a normal angled camera were mounted close to each other (the fish eye camera's lens about 3 cm above the normal camera's lens), facing the same way, as shown in Figure 6.1.


Figure 6.1: The setup of the cameras used in the fish eye experiment. The circle with an F represents the lens of the fish eye camera and the circle with an N represents the lens of the normal camera.

Image sequences were recorded simultaneously with the two cameras while walking around an office room containing various objects. A third image sequence was created by rectifying the fish eye images. 25 images, equally spaced in time, were extracted from each image sequence and input to R-FCN. Since the goal of this experiment was to compare performances on the three datasets, rather than to determine how "well" the network performs on a global scale, this relatively small number of images was sufficient.
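One hypothetical way to produce such rectified images with OpenCV's fish eye camera model is sketched below. The intrinsic matrix K and the distortion coefficients D are placeholder calibration values, and the thesis does not state which rectification tool was actually used.

```python
import cv2
import numpy as np

# Placeholder calibration for a fish eye lens: in practice K and D
# would come from a real calibration of the camera.
K = np.array([[280.0, 0.0, 320.0],
              [0.0, 280.0, 240.0],
              [0.0, 0.0, 1.0]])
D = np.array([[0.05], [-0.01], [0.002], [0.0]])

def rectify(fisheye_img):
    """Undistort a fish eye frame into a pinhole-like (rectified) image."""
    h, w = fisheye_img.shape[:2]
    map1, map2 = cv2.fisheye.initUndistortRectifyMap(
        K, D, np.eye(3), K, (w, h), cv2.CV_16SC2)
    return cv2.remap(fisheye_img, map1, map2, interpolation=cv2.INTER_LINEAR)

rectified = rectify(cv2.imread("fisheye_frame.png"))  # placeholder file name
```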

Further, as mentioned in Chapter 5, each image generally contains more than one object, so the number of objects in each dataset is larger than the number of images.

The three datasets were also manually annotated with bounding boxes for the evaluation. Then, the performances on the three datasets were evaluated, comparing the annotated ground truths with the detection results from R-FCN for all datasets.

(a) Example image from the normal camera.

(b) Example image from the fish eye camera.

(c) Example of a rectified image from the fish eye camera.

Figure 6.2: Examples of the images used in the fish eye experiment.

Example images from the three datasets can be seen in Figure 6.2. It can be seen that the image quality of the two cameras is not exactly the same. That is, comparing Figures 6.2a and 6.2b there are some differences other than the field of view. For example, Figure 6.2a is darker than Figure 6.2b and this fact may affect the detection performance slightly.

However, the training data [15] also comes from different cameras of varied quality, so the differences in image quality should not affect the results too much.


6.1.2 Results

Table 6.1 summarizes the results for all three datasets in the experiment. For the normal angled camera, the fish eye camera and the rectified fish eye images, it shows the total number of ground truth instances and correct detections over all present classes, as well as the precision, recall and F1 score averaged over all present object classes.

The average precision for the fish eye camera was 1.0, which means that there were no false detections in that dataset. Further, the average recall of the fish eye camera was lower than that of the normal camera, which means that a larger fraction of the present objects were not detected. The F1 score, which summarizes precision and recall, was slightly lower for the fish eye camera than for the normal camera, suggesting lower performance.

Camera            Ground truths   Correct detections   Incorrect detections   Avg. precision   Avg. recall   Avg. F1 score
Normal camera     149             59                   4                      0.902            0.453         0.505
Fish eye camera   290             102                  0                      1.0              0.264         0.471
Rectified image   251             78                   3                      0.917            0.228         0.414

Table 6.1: Number of ground truth instances, number of correctly detected instances, number of incorrectly detected instances, and precision, recall and F1 score averaged over all classes, for one dataset recorded with a normal camera, one recorded with a fish eye camera and one with rectified images recorded with a fish eye camera.

The lowest performance was that of the rectified fish eye images. Fewer objects were also detected in this dataset than in the fish eye dataset. The total number of ground truths in the rectified dataset is lower than in the fish eye dataset, since some parts of the images are lost in the rectification process.

The total number of correct detections was highest for the fish eye camera, nearly twice the number of correct detections in the normal camera dataset.

(a) Example image from the normal camera with bounding boxes.

(b) Example image from the fish eye camera with bounding boxes.

(c) Example of a rectified image from the fish eye camera with bounding boxes.

Figure 6.3: Examples of the bounding boxes generated by R-FCN in the fish eye experiment.

Figure 6.3 shows examples of the bounding boxes found by R-FCN in the three datasets.


Tables 6.2, 6.3 and 6.4 show the results of the fish eye experiment for each present class.

They show that some object classes are more easily detected than others. For example, no knives were detected in any of the datasets, while many bottles, cups and keyboards were detected. This is probably because the distance and viewing angle were better suited (more similar to those of the training data) for the latter objects. Table 6.3 shows a precision of 1.0 for all classes, which is because no false positives were detected in that dataset.

class         Ground truths   Correct detections   Incorrect detections   Precision   Recall   F1 score
apple         12              0                    0                      1.0         0.0      0.0
banana        14              6                    0                      1.0         0.43     0.6
bottle        8               2                    0                      1.0         0.25     0.4
cell_phone    13              4                    0                      1.0         0.31     0.47
chair         2               2                    0                      1.0         1.0      1.0
cup           24              17                   0                      1.0         0.71     0.83
diningtable   0               0                    1                      0.0         1.0      0.0
fork          11              1                    0                      1.0         0.09     0.17
keyboard      10              7                    1                      0.88        0.7      0.78
knife         10              0                    0                      1.0         0.0      0.0
mouse         9               7                    0                      1.0         0.78     0.88
orange        9               2                    0                      1.0         0.22     0.36
tvmonitor     27              11                   2                      0.85        0.41     0.55

Table 6.2: Number of ground truth instances, number of correctly detected instances, number of incorrectly detected instances, precision, recall and F1 score for each class in a dataset recorded with a normal camera.

class         Ground truths   Correct detections   Incorrect detections   Precision   Recall   F1 score
apple         16              1                    0                      1.0         0.06     0.12
banana        26              8                    0                      1.0         0.31     0.47
bottle        24              8                    0                      1.0         0.33     0.5
cell_phone    16              5                    0                      1.0         0.31     0.48
chair         23              2                    0                      1.0         0.09     0.16
cup           39              14                   0                      1.0         0.36     0.53
fork          12              0                    0                      1.0         0.0      0.0
keyboard      23              16                   0                      1.0         0.7      0.82
knife         9               0                    0                      1.0         0.0      0.0
laptop        7               0                    0                      1.0         0.0      0.0
mouse         21              12                   0                      1.0         0.57     0.73
orange        20              1                    0                      1.0         0.05     0.1
tvmonitor     54              35                   0                      1.0         0.65     0.79

Table 6.3: Number of ground truth instances, number of correctly detected instances, number of incorrectly detected instances, precision, recall and F1 score for each class in a dataset recorded with a fish eye camera.

class         Ground truths   Correct detections   Incorrect detections   Precision   Recall   F1 score
apple         14              0                    0                      1.0         0.0      0.0
banana        24              7                    0                      1.0         0.29     0.45
bottle        23              13                   0                      1.0         0.57     0.72
cell_phone    14              0                    0                      1.0         0.0      0.0
chair         16              1                    0                      1.0         0.06     0.12
cup           33              7                    0                      1.0         0.21     0.35
fork          11              0                    0                      1.0         0.0      0.0
keyboard      23              14                   0                      1.0         0.61     0.76
knife         6               0                    0                      1.0         0.0      0.0
laptop        8               0                    1                      0.0         0.0      0.0
mouse         16              10                   0                      1.0         0.63     0.77
orange        17              1                    0                      1.0         0.06     0.11
tvmonitor     46              25                   2                      0.93        0.54     0.68

Table 6.4: Number of ground truth instances, number of correctly detected instances, number of incorrectly detected instances, precision, recall and F1 score for each class in a dataset recorded with a fish eye camera and then rectified.

Comparing Tables 6.2, 6.3 and 6.4, there are some differences in which object categories are present. For example, only Table 6.2 contains the category "diningtable", but on the other hand it does not contain the category "laptop". There are different reasons for these differences. The diningtable class is present in Table 6.2 because an incorrect detection of a diningtable was made on that dataset. The laptop class is not present because the laptop seen by the fish eye camera could not be seen by the normal camera (see Figures 6.2 and 6.3, where a laptop can be seen on the right hand side of the fish eye and rectified images).


6.1.3 Analysis

The total number of correct detections with the fish eye camera was higher than with the normal camera, strengthening the hypothesis that overall more objects can be detected using a fish eye camera. Further, the F1 score was slightly lower for the fish eye camera than for the normal camera, but not by much. It can thus be said that it is advantageous to use a fish eye camera for object detection with a network trained on normal images, if the goal is to maximize the total number of detected objects. Of course, the reason for this advantage is the wider field of view and not that it is easier to detect objects in fish eye images. However, since the fish eye camera performed well, it is used in the remainder of the project.

Surprisingly, objects were detected not only in the center of the fish eye images but also at the distorted borders. Figure 6.4 shows an example of this. This speaks for the advantage of using a fish eye camera to detect many objects: some of the "extra" objects detected compared to the normal camera are actually outside of the normal camera's field of view, so the numbers cannot be due only to, for example, different image quality.

Figure 6.4: An example of detections on the borders of a fish eye image.

6.2 Distance to Objects

The goal of this experiment was to examine from what distance "object clusters" should be viewed in order to detect as many objects as possible. More objects are expected to be detected when the camera is closer to the objects than when it is further away. The experiment shows whether this is true or not. This experiment addresses hypothesis 4 (More objects can be detected from a closer viewpoint.) of Section 1.1.

6.2.1 Experimental Setup

An office environment similar to the one in the previous experiment (Section 6.1) was viewed with the fish eye camera (since it performed best in detecting as many objects as possible).

More specifically, a table with some objects on it was viewed from different horizontal and vertical distances. The distances were measured from the front edge of the table. The camera was facing forward.

An image sequence was recorded from each combination of distance and height from the table edge as the camera was moved along the table. 20 images, equally spaced in time, were extracted from each image sequence and run through R-FCN. As in Section 6.1, this relatively small number of images is sufficient, since the goal of the experiment is to compare datasets of equal sizes rather than to determine how good the performance of the network is on a more global scale. Bounding boxes for objects in the images were also manually annotated, and the results were compared as explained in Section 5.2.

In order to determine from what distance most objects can be detected, two different horizontal distances and two different vertical distances were examined. First, a horizontal distance of 0 cm between the camera and the table edge was used as a "close" distance. Then, 50 cm was used as a "far away" distance. Note that a typical ground robot would often have difficulties getting even this close to objects. Further, the closest vertical distance was chosen to be 15 cm (not 0 cm, because it would not be possible to fly a drone that close to the table, and a camera is typically not mounted on the lower parts of a drone). The "far away" vertical distance was chosen to be 35 cm, from where many objects were still present in the image. That is, if the camera was moved even higher, few objects remained in the image because the camera was facing forward. Figure 6.5 illustrates the different camera positions with respect to the table and Figure 6.6 shows example images from each dataset.

Figure 6.5: Illustration of the distances in the experiment. The table is seen from the side. The dots represent the different camera positions. For each camera position, the camera was moved along the table, out of the paper.


(a) 0 cm horizontal distance, 15 cm vertical distance.

(b) 0 cm horizontal distance, 35 cm vertical distance.

(c) 50 cm horizontal distance, 15 cm vertical distance.

(d) 50 cm horizontal distance, 35 cm vertical distance.

Figure 6.6: Example images from different distances.

6.2.2 Results

Table 6.5 shows the results of the distance experiment. While more ground truths are present in the two datasets recorded from a 50 cm horizontal distance, the number of correct detections is larger in the 0 cm horizontal distance datasets. This means that the average recalls and the F1 scores of these datasets are higher.

Horizontal distance [cm]   Vertical distance [cm]   Ground truths   Correct detections   Incorrect detections   Avg. precision   Avg. recall   Avg. F1 score
0                          15                       154             56                   0                      1.0              0.293         0.469
0                          35                       155             52                   0                      1.0              0.262         0.442
50                         15                       172             35                   3                      0.944            0.127         0.302
50                         35                       194             50                   0                      1.0              0.153         0.344

Table 6.5: Number of ground truth instances, number of correctly detected instances, number of incorrectly detected instances, and precision, recall and F1 score averaged over all classes, for datasets recorded with different horizontal and vertical distances to a table.


Figure 6.7 shows examples of the bounding boxes found by R-FCN for the different distance datasets.

(a) 0 cm horizontal distance, 15 cm vertical distance.

(b) 0 cm horizontal distance, 35 cm vertical distance.

(c) 50 cm horizontal distance, 15 cm vertical distance.

(d) 50 cm horizontal distance, 35 cm vertical distance.

Figure 6.7: Example images showing the resulting bounding boxes for different distances.

Tables 6.6, 6.7, 6.8 and 6.9 show the results for each present object class in the datasets.

class       Ground truths   Correct detections   Incorrect detections   Precision   Recall   F1 score
apple       18              0                    0                      1.0         0.0      0.0
banana      11              0                    0                      1.0         0.0      0.0
bottle      20              4                    0                      1.0         0.2      0.33
cup         15              3                    0                      1.0         0.2      0.33
keyboard    16              15                   0                      1.0         0.94     0.97
mouse       16              8                    0                      1.0         0.5      0.67
scissors    7               0                    0                      1.0         0.0      0.0
tvmonitor   51              26                   0                      1.0         0.51     0.68

Table 6.6: Number of ground truth instances, number of correctly detected instances, number of incorrectly detected instances, precision, recall and F1 score for each class in the dataset recorded 0 cm away from and 15 cm above the table.


class       Ground truths   Correct detections   Incorrect detections   Precision   Recall   F1 score
apple       20              0                    0                      1.0         0.0      0.0
banana      12              0                    0                      1.0         0.0      0.0
bottle      20              8                    0                      1.0         0.4      0.57
cup         14              9                    0                      1.0         0.64     0.78
keyboard    20              11                   0                      1.0         0.55     0.71
mouse       15              0                    0                      1.0         0.0      0.0
scissors    6               0                    0                      1.0         0.0      0.0
tvmonitor   48              24                   0                      1.0         0.5      0.67

Table 6.7: Number of ground truth instances, number of correctly detected instances, number of incorrectly detected instances, precision, recall and F1 score for each class in the dataset recorded 0 cm away from and 35 cm above the table.

class       Ground truths   Correct detections   Incorrect detections   Precision   Recall   F1 score
apple       19              0                    0                      1.0         0.0      0.0
bottle      20              2                    0                      1.0         0.1      0.18
chair       3               0                    0                      1.0         0.0      0.0
cup         11              1                    0                      1.0         0.09     0.17
keyboard    16              7                    0                      1.0         0.44     0.61
laptop      3               0                    0                      1.0         0.0      0.0
mouse       20              3                    3                      0.5         0.15     0.23
scissors    20              0                    0                      1.0         0.0      0.0
tvmonitor   60              22                   0                      1.0         0.37     0.54

Table 6.8: Number of ground truth instances, number of correctly detected instances, number of incorrectly detected instances, precision, recall and F1 score for each class in the dataset recorded 50 cm away from and 15 cm above the table.

class       Ground truths   Correct detections   Incorrect detections   Precision   Recall   F1 score
apple       20              0                    0                      1.0         0.0      0.0
banana      12              0                    0                      1.0         0.0      0.0
bottle      20              6                    0                      1.0         0.3      0.46
chair       5               0                    0                      1.0         0.0      0.0
cup         12              1                    0                      1.0         0.08     0.15
keyboard    19              12                   0                      1.0         0.63     0.77
laptop      6               0                    0                      1.0         0.0      0.0
mouse       20              0                    0                      1.0         0.0      0.0
scissors    20              0                    0                      1.0         0.0      0.0
tvmonitor   60              31                   0                      1.0         0.52     0.68

Table 6.9: Number of ground truth instances, number of correctly detected instances, number of incorrectly detected instances, precision, recall and F1 score for each class in the dataset recorded 50 cm away from and 35 cm above the table.

6.2.3 Analysis

The experiment showed, as expected, that small objects can be more easily detected from a closer horizontal distance. The results for the vertical distance were not as clear, which could be because the change in vertical distance was smaller than the change in horizontal distance (20 cm compared to 50 cm).

Some of the objects in Tables 6.6, 6.7, 6.8 and 6.9 are of extra interest, for example the mice. The F1 score for mice in Table 6.6 is much larger than in Table 6.8 (0.67 compared to 0.23), which indicates that mice are more easily detected from a closer distance. The two other tables (6.7 and 6.9) show an F1 score of 0.0 for mice, which can be explained by the mice being very far out at the borders in the 0 cm horizontal and 35 cm vertical distance dataset (see Figure 6.8 for an example), and by the distance being large in the 50 cm horizontal and 35 cm vertical distance dataset. Even though the results of Section 6.1 showed that objects can be detected at the distorted borders of fish eye images, we still expect detection performance at the borders to be lower than in the middle of the images.

Figure 6.8: An example from the 0 cm horizontal and 35 cm vertical distance dataset where the mouse is close to the border of the image.

The TV-monitors, as opposed to the mice, show a similar F1 score in all of Tables 6.6, 6.7, 6.8 and 6.9. Since the TV-monitors are much larger, they are apparently less affected by the change in distance to the camera.

It can further be seen that of the two 0 cm horizontal distance datasets, which performed better than the other two overall, the 15 cm vertical distance dataset shows a larger recall for keyboards than the 35 cm vertical distance dataset, while it is the other way around for bottles and cups. All this indicates that each type of object has an "optimal" viewing distance which needs to be adjusted to in order to detect that type of object. That is, in general smaller objects need to be viewed from a closer distance, while larger objects are not affected as much (although we expect difficulties in detecting large objects from too close a distance, as from a certain point the borders of the object would no longer be visible in the image).

In any case, it is possible to conclude that being close to small objects increases the chance of detecting them, while larger objects may need a larger distance for best performance.

6.3 Camera Angle

In this experiment three different camera angles are tested. The goal is to determine how to mount the camera on the drone for best object detection performance. One of the advantages of using a drone for detecting objects in a room is that it can fly over large objects, such as tables, in order to get a different kind of view than, for example, a ground robot can. Therefore, in this experiment the camera is moved above and along a table. The experiment addresses hypothesis 5 (The number of detected object instances depends on the angle of the camera.) of Section 1.1.


6.3.1 Experimental Setup

It is of interest to be as close to the objects as possible; however, a drone cannot fly too close to things (and again, a camera is generally not mounted on the lower parts of a drone). In an office environment, the fish eye camera was moved from one side of a table with objects on it to the other, about 0.4 meters above the table.

The distance of 0.4 meters was chosen to keep the camera as close as possible to the table, because of the results of the distance experiment in Section 6.2. However, because in this experiment the camera was moved along the table, and since there were objects on the table, it was not possible to keep a closer distance.

The camera was moved along the table three times: first with a 0 degree camera angle, then with a 45 degree angle and lastly with a 90 degree angle. What is meant by the different angles is illustrated in Figure 6.9. Each time, an image sequence was recorded, and 20 images were extracted and run through R-FCN. Examples from the three datasets are shown in Figure 6.10. The images were also manually annotated with bounding boxes and the results compared.

(a) 0 degree camera. (b) 45 degree camera. (c) 90 degree camera.

Figure 6.9: Illustration of the different camera angles. The green arrows show the directions in which the cameras were moved.

(a) Example image from the 0 degree dataset.

(b) Example image from the 45 degree dataset.

(c) Example image from the 90 degree dataset.

Figure 6.10: Example images from the different camera angle datasets.


6.3.2 Results

Table 6.10 shows the results of the angle experiment. The F1 score is a lot higher for the 90 degree dataset, which is expected since most training data images were probably taken from a close to 90 degree perspective.

Angle [deg]   Ground truths   Correct detections   Incorrect detections   Avg. precision   Avg. recall   Avg. F1 score
0             110             15                   2                      0.889            0.222         0.152
45            157             31                   0                      1.0              0.201         0.279
90            123             43                   4                      0.881            0.300         0.458

Table 6.10: Number of ground truth instances, number of correctly detected instances, number of incorrectly detected instances, and precision, recall and F1 score averaged over all classes, in datasets from different angles.

Figure 6.11 shows the bounding boxes predicted by R-FCN in example images from the three datasets in the experiment.

(a) Example image from the 0 degree dataset with bounding boxes.

(b) Example image from the 45 degree dataset with bounding boxes.

(c) Example image from the 90 degree dataset with bounding boxes.

Figure 6.11: Example images from the different camera angle datasets, with bounding boxes from R-FCN.

Tables 6.11, 6.12 and 6.13 show the results for each object category present in the different datasets. It can be seen that the higher F1 score of the 90 degree dataset compared to the other two is mostly due to a higher recall of large objects, such as TV-monitors and chairs.


class        Ground truths   Correct detections   Incorrect detections   Precision   Recall   F1 score
banana       13              0                    0                      1.0         0.0      0.0
bottle       7               0                    0                      1.0         0.0      0.0
cell_phone   5               0                    0                      1.0         0.0      0.0
chair        22              0                    0                      1.0         0.0      0.0
cup          8               1                    0                      1.0         0.13     0.22
keyboard     16              14                   0                      1.0         0.88     0.93
laptop       0               0                    2                      0.0         1.0      0.0
mouse        19              0                    0                      1.0         0.0      0.0
tvmonitor    20              0                    0                      1.0         0.0      0.0

Table 6.11: Number of ground truth instances, number of correctly detected instances, number of incorrectly detected instances, precision, recall and F1 score for each class in the dataset recorded with a 0 degree camera angle.

class        Ground truths   Correct detections   Incorrect detections   Precision   Recall   F1 score
banana       17              0                    0                      1.0         0.0      0.0
bottle       4               0                    0                      1.0         0.0      0.0
cell_phone   2               0                    0                      1.0         0.0      0.0
chair        71              14                   0                      1.0         0.2      0.33
cup          7               4                    0                      1.0         0.57     0.73
keyboard     15              12                   0                      1.0         0.8      0.89
mouse        15              0                    0                      1.0         0.0      0.0
tvmonitor    26              1                    0                      1.0         0.04     0.07

Table 6.12: Number of ground truth instances, number of correctly detected instances, number of incorrectly detected instances, precision, recall and F1 score for each class in the dataset recorded with a 45 degree camera angle.

class       Ground truths   Correct detections   Incorrect detections   Precision   Recall   F1 score
banana      11              0                    0                      1.0         0.0      0.0
bottle      5               2                    3                      0.4         0.4      0.4
chair       67              29                   1                      0.97        0.43     0.6
cup         4               1                    0                      1.0         0.25     0.4
keyboard    11              6                    0                      1.0         0.55     0.71
mouse       12              0                    0                      1.0         0.0      0.0
tvmonitor   21              11                   0                      1.0         0.52     0.69

Table 6.13: Number of ground truth instances, number of correctly detected instances, number of incorrectly detected instances, precision, recall and F1 score for each class in the dataset recorded with a 90 degree camera angle.

As in Section 6.1, not all object categories are present in all of Tables 6.11, 6.12 and 6.13. The reasons are the same: either misclassifications were made, or object instances were outside of the field of view in some of the datasets.

6.3.3 Analysis

The performance on the 0 degree dataset is very low compared to the other two, which is logical since many objects look very different from this point of view compared to the training data. This fact was mentioned in Chapter 3, as others ([1], [2]) working on drones had already stressed the difficulties in detecting objects in images that are different from training data.


For an example of how different objects may look from above, see Figure 6.12 and note how the cup looks almost completely round as compared to what a cup looks like from the side.

Figure 6.12: An example of an image from the 0 degree dataset.

Yet, comparing Tables 6.11, 6.12 and 6.13, there is one object category that displays a different behavior: "keyboard". The F1 scores for keyboards are 0.93, 0.89 and 0.71 for the 0 degree, 45 degree and 90 degree cameras respectively. That is, most keyboards could be detected with the camera looking straight down from above, and the 0 degree dataset, which generally performed worst, performed best on keyboards.

This is likely because keyboards are often seen from above (from the side they look almost two dimensional) and many pictures of keyboards in the training data were taken from a similar angle.

Another object category that shows interesting behavior is "cup". Most cups were detected in the 45 degree dataset (F1 score of 0.73, Table 6.12) and not in the 90 degree dataset (F1 score of 0.4, Table 6.13) as expected. The reason is likely that the cup was placed close to the edge of the table and thus ended up in the distorted border of the images in the 90 degree dataset. Further, as the camera was moved forward the cup quickly went out of the field of view and was not present in more than 4 images in the 90 degree dataset, which increases the impact of each missed detection on the computed recall and F1 score.

The rest of the object categories show a uniform behavior, where few objects were detected in the 0 degree dataset, more in the 45 degree dataset and most in the 90 degree dataset. This is in line with the fact that the overall F1 score of the forward facing (90 degree) camera dataset is much larger than that of the other two datasets.

The results suggest that the camera should be mounted facing forward in order to detect as many objects as possible of different categories. However, as can be seen in Table 6.13, one of the reasons for the superior performance of the 90 degree dataset is that more large objects (chairs) were detected. Many of these chairs were in the background of the images (Figure 6.13 is an example of this) and thus distorted in the other datasets.

Therefore, depending on the environment where the drone will move, how it is meant to fly and what objects it is meant to detect, it may be reasonable to mount the camera slightly tilted.

(a) 45 degree camera angle. (b) 90 degree camera angle.

Figure 6.13: Example image showing detection of chairs in the background.

In general, it can be seen that object detection performance depends on the angle from which the training data images were taken. Assuming that objects are distributed in a "normal" way in the room, it was concluded above that a camera facing forward, or slightly tilted downward, is best for detecting many objects of different categories from the Microsoft COCO dataset. However, if the objects had been placed in "unnatural" ways, the camera would likely need to be directed differently for best performance. For example, had there been cups and bottles lying on a table, it would likely be better to mount the camera facing downward.
