
DEGREE PROJECT IN COMPUTER SCIENCE AND COMMUNICATION, SECOND LEVEL

STOCKHOLM, SWEDEN 2015

Combining RGB and Depth Images for Robust Object Detection using Convolutional Neural Networks

JESPER THÖRNBERG

KTH ROYAL INSTITUTE OF TECHNOLOGY


Combining RGB and Depth Images for Robust Object Detection using Convolutional Neural Networks

JESPER THÖRNBERG

Master Thesis at CSC CVAP

Supervisor: Mårten Björkman <celle@kth.se>

Examiner: Danica Kragic <dani@kth.se>


Abstract

We investigated the advantage of combining RGB images with depth data to get more robust object classifications and detections using pre-trained deep convolutional neural networks. We relied upon the raw images from publicly available datasets captured using Microsoft Kinect cameras. The raw images varied in size, and therefore required resizing to fit our network. We designed a resizing method called “bleeding edge” to avoid distorting the objects in the images. We present a novel method of interpolating the missing depth pixel values by comparing to similar RGB values. This method proved superior to the other methods tested. We showed that a simple colormap transformation of the depth image can provide close to state-of-art performance. Using our methods, we can present state-of-art performance on the Washington Object dataset and we provide some results on the Washington Scenes (V1) dataset. Specifically, for the detection, we used contours at different thresholds to find the likely object locations in the images. For the classification task we can report state-of-art results using RGB and RGB-D images, while depth data alone gave close to state-of-art results. For the detection task we found the RGB-only detector to be superior to the other detectors.


Referat (Swedish abstract)

Combining RGB and depth images for robust object detection using convolutional neural networks

We investigated the advantage of combining RGB and depth images to obtain robust image classifications and object detections using pre-trained deep convolutional neural networks. We used raw images from publicly available databases created with Microsoft Kinect cameras. The raw images varied in size and therefore had to be resized to fit our network. We designed a resizing method, which we named “bleeding edge”, that reduces the risk of distorting the appearance of the objects. We present a new method for interpolating the missing depth pixel values by comparing similar RGB values. This method proved superior to the other methods tested. We showed that a simple colormap transformation of the depth image can come close to state-of-art performance. With our methods we can present state-of-art performance on the Washington Object database, and we give some results from the Washington Scenes (V1) database. Specifically for the detection problem, we used contours at different thresholds to find the locations in the images where objects are likely to be. For the classification task we can report state-of-art results with RGB and RGB-D images; pure depth data came close to state-of-art performance but did not quite reach it. For the detection task we found the RGB detector to be superior to the other detectors.


Contents

1 Introduction
1.1 Thesis Objective
1.2 Methodology
1.3 Delimitations
1.4 Ethical Considerations
1.5 Outline

2 Related Work
2.1 Background of Deep Networks
2.2 Contemporary Works in Computer Vision
2.2.1 Pooling Schemes
2.2.2 Classification with Deep Convolutional Neural Networks
2.2.3 Pre-trained CNNs for Semi-Related Tasks
2.2.4 Understanding CNNs
2.2.5 Using RGB and Depth Images
2.2.6 Using Stereo Vision and Multiple Viewpoints
2.2.7 Datasets for Various Tasks
2.2.8 Augmentation of Data

3 Theory
3.1 Fully-Connected Layers
3.2 Activation Function
3.3 Convolutional Layers
3.3.1 Elementary Features
3.3.2 Weight Vector
3.3.3 Local Receptive Field
3.3.4 Feature Map
3.4 Pooling Layers
3.5 Training the Network
3.5.1 Amount of Training Data
3.5.2 Dropout
3.6 Support Vector Machines
3.6.1 Linearly Separable Data
3.6.2 Non-Linearly Separable Data

4 Experimental Setup
4.1 Hardware
4.2 Software
4.3 Data
4.3.1 Classification
4.3.2 Detection

5 Classification
5.1 Method and Implementation
5.1.1 Missing Depth Information
5.1.2 Encoding 8-bit Depth into 8-bit RGB
5.1.3 Resizing Images
5.1.4 Training and Testing
5.2 Results
5.2.1 RGB
5.2.2 Depth
5.2.3 RGB-D

6 Detection
6.1 Method and Implementation
6.1.1 Finding Regions of Interest
6.1.2 Training
6.1.3 Testing
6.2 Results
6.2.1 Small-Scale
6.2.2 Large-Scale
6.2.3 Evaluation of Candidate Regions

7 Conclusions and Future Work
7.1 Conclusions
7.2 Future Work

Bibliography

A Object Classes

B Class Accuracy and Confusion Matrix
B.1 RGB
B.2 Depth
B.3 RGB-D


Chapter 1

Introduction

Humans use vision every day to navigate, find objects and interact with the environment, all without actively thinking about it. Robots of different types are becoming more and more common, e.g. autonomous cars, UAVs, bin-picking and surgery robots, to name a few. These robots will need large sets of sensors to safely operate in our environment, and computers will process and make decisions based on the information provided by the sensors. To give an idea of the demands on these robots, we can compare them to what our brains do every day.

The brain is extremely complex, with connections to multiple “sensors” such as stereo RGB vision with depth perception, stereo auditory awareness, audio and visual communication accessories, scent and odor detectors, flavor sensors, and touch-sensitive skin covering the body. Our brain processes inputs and outputs from all these sensors at all times, while at the same time controlling a tall body with a small contact surface and a total of 360 different joints; the amount of information the brain receives, processes and transmits every second is staggering!

Computers are still far behind when it comes to this amount of processing and making good decisions from the information they receive. The computer vision research community is actively trying to make computers process visual information more efficiently and improve how they understand the world.

1.1 Thesis Objective

In the thesis we will investigate the following questions:

• How can depth data be used to support CNN-based image classification and how shall missing depth data preferably be treated?

• How can a CNN-based classification framework be applied for object detection, assuming a limited computational budget?

In short, CNNs are biologically inspired neural networks that have recently become widely used for image classification and object detection. They consist of layers of convolutions and pooling, typically followed by two or three layers with all-to-all connectivity. This architecture makes it possible to extract the low-, mid- and high-level features needed for these kinds of tasks. Using our methodology (see below) we compare different strategies for filling in missing depth data and converting the depth data from grayscale to RGB.

1.2 Methodology

We work with datasets gathered with Microsoft Kinect cameras in order to get both color and depth images of the objects and scenes. To take advantage of the current state-of-art we use the VGG [44] network to convert the images to feature vectors. We then use these vectors as inputs to a Support Vector Machine (SVM) classifier.

For the image classification task specifically, we introduce a resizing method we call “bleeding edge”, inspired by Howard [20], which creates an artificial border to avoid cropping or distorting the object geometries. We then interpolate the missing depth values by assuming the color is locally constant for the objects. Next, we show that a naive colormap transformation of the normalized grayscale depth image gives close to state-of-art performance. Lastly, by concatenating the RGB and depth features extracted from the VGG network we can improve the classification accuracy.
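To make these preprocessing steps concrete, the snippet below is a minimal sketch of how the resizing and depth-to-RGB colormap steps might look using OpenCV. The 224 × 224 target size, the border-replication reading of “bleeding edge” and the JET colormap are illustrative assumptions on our part; the exact procedures are defined in Chapter 5.

```python
import cv2
import numpy as np

def bleeding_edge_resize(img, target=224):
    """Resize the longest side to `target` while keeping the aspect ratio,
    then pad by replicating the edge pixels ("bleeding" the border) so the
    object is neither cropped nor geometrically distorted."""
    h, w = img.shape[:2]
    scale = target / max(h, w)
    resized = cv2.resize(img, (int(round(w * scale)), int(round(h * scale))),
                         interpolation=cv2.INTER_LINEAR)
    pad_h = target - resized.shape[0]
    pad_w = target - resized.shape[1]
    return cv2.copyMakeBorder(resized,
                              pad_h // 2, pad_h - pad_h // 2,
                              pad_w // 2, pad_w - pad_w // 2,
                              borderType=cv2.BORDER_REPLICATE)

def depth_to_rgb(depth):
    """Normalize a single-channel depth map to 8 bits and apply a colormap,
    so it can be fed to a network pre-trained on RGB images."""
    d = cv2.normalize(depth.astype(np.float32), None, 0, 255,
                      cv2.NORM_MINMAX).astype(np.uint8)
    return cv2.applyColorMap(d, cv2.COLORMAP_JET)  # colormap choice is an assumption
```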

For the object detection task we use the same techniques as for the classification task, with the exception of filling in the missing depth values. Further, we design a method to find good object regions in different scenes which (for the Scenes dataset) provides fewer than 100 possible object locations per image. This method is based on contours from both the color and the depth image.
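As an illustration of the contour-based proposal step, the sketch below (assuming OpenCV 4.x) thresholds a single grayscale image at a few levels and returns contour bounding boxes as candidate regions. The thesis combines contours from both the color and the depth image and tunes the thresholds per dataset; the levels and minimum area used here are placeholders.

```python
import cv2

def candidate_regions(gray, thresholds=(60, 120, 180), min_area=500):
    """Threshold the image at several levels, extract external contours and
    return their bounding boxes (x, y, w, h) as candidate object locations."""
    boxes = []
    for t in thresholds:
        _, binary = cv2.threshold(gray, t, 255, cv2.THRESH_BINARY)
        contours, _ = cv2.findContours(binary, cv2.RETR_EXTERNAL,
                                       cv2.CHAIN_APPROX_SIMPLE)
        for c in contours:
            if cv2.contourArea(c) >= min_area:
                boxes.append(cv2.boundingRect(c))
    return boxes
```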

For the image classification task we evaluate our method by comparing our results to the current state-of-art on the chosen dataset. However, for the object detection task we could not compare our method to other state-of-art methods, since no such results exist using all six possible classes. Results do exist, but only for a maximum of four different classes. Therefore, we evaluate our detector by giving some common numerical results such as precision and recall rates.

1.3 Delimitations

Due to the time limitations of the thesis we limit ourselves to using one CNN for converting the images to feature vectors. For the same reason we also use the network as it is, with no fine-tuning of the weights for either the RGB or the converted depth images.

1.4 Ethical Considerations

The ethics surrounding object detection and the relevant datasets can be discussed at length, in particular the publicly available datasets often used for training. In these datasets it is sometimes possible to spot other people that might unknowingly have become a part of the dataset. No dataset that we have come across states whether the people inside are aware that they are now part of a publicly available dataset that anyone can access. On the other hand, in the datasets we have been looking at, the people are always anonymous; there has never been a name attached to a face, hand or other body part. For some people this might be enough to feel that their integrity has not been violated; others might want more protection, or not to appear in the dataset at all. It would be good to get the permission of the people who end up in the dataset, especially if they can be easily identified.

Moving on to the possible use cases for an object detector, we can find some other ethical dilemmas that should be discussed and considered. Systems working in human environments might collect personal information about the preferences and habits of the humans they observe. For example, a service robot working in your home might need to be taught to distinguish your favorite type and brand of chips from other chips. This raises questions: how safe is this information with the robot? Will it forward this information to the manufacturer of the robot? And if it is forwarded, will the information be anonymized first? There are many more ethical aspects, which could easily fill several other theses.

1.5 Outline

The aim of this chapter is to give a short introduction to the field of computer vision, state the objective and delimitations, and discuss some ethical aspects of this thesis. In Chapter 2 we give an overview of the work that has been done in the area. In Chapter 3 we briefly go through some theory about Convolutional Neural Networks and Support Vector Machines that is good to know. Chapter 4 contains our experimental set-up with hardware, software and data. In Chapter 5 we explain what we did for the classification task and show our results, while in Chapter 6 we do the same for the detection task. Finally, we end with conclusions and future work in Chapter 7.


Chapter 2

Related Work

The computer vision research field is almost 50 years old now. As a result, there are thousands and thousands of articles covering the topic, some better than others. In this chapter, we will first fast-track through the most significant contributions historically (2.1), and then take a closer look at more recent work (2.2), most of which has been published after 2011. Note that this chapter assumes the reader has prior knowledge of how Convolutional Neural Networks work and of the terms commonly used by the research community; we explain this, along with Support Vector Machines, in Chapter 3 (Theory).

2.1 Background of Deep Networks

We have to go back in time to 1962 to find the foundation of deep Convolutional Neural Networks (CNNs) in the paper they are ultimately based upon: “Receptive Fields, Binocular Interaction and Functional Architecture in the Cat’s Visual Cortex” by Hubel and Wiesel [21]. Their experiments provided new insights into how the brain sees objects and things around us. The experiments were conducted on a sedated cat: they shone light onto the cat’s eyes with electrodes connected to its brain. The authors made some very important findings about how the brain interprets visual stimuli. First of all, they found two different cell categories, “simple” and “complex” cells. Secondly, they noticed the existence of multiple, distinctive receptive fields. Thirdly, a diffuse light produced very small responses, while large responses occurred when the light was focused into long and narrow stripes; the shape, position and orientation of the stripes all affected the intensity of the response. Finally, they observed that complex cells were activated by the same type of light as simple cells, with the difference that they were less dependent on the spatial position.

Kunihiko Fukushima used this model when he designed the Neocognitron [9] in 1980. It used multiple layers consisting of both simple excitatory (S) and inhibitory (Vs) cells with the corresponding complex (C and Vc) cells. These S- and C-cells were organized in subgroups, where each subgroup consisted of only S- or C-cells and detected one particular feature. The only difference between the cells in a single subgroup is a parallel shift in position of the incoming connections (see Figure 2.1a). The layers were connected in the order Us - Uc - Us - Uc - Us - Uc, where Us is a layer consisting of multiple subgroups of S-cells and Uc is the corresponding layer of C-cells (see Figure 2.1b). The network was self-taught by repeatedly being shown the same types of patterns without being given any information about them. In the end, the network started to categorize similar patterns in the same way humans did, even when the position was shifted.

Figure 2.1: Neocognitron illustrations. (a) “Illustration showing the input interconnections to the cells within a single cell-plane.” (b) “Schematic diagram illustrating the interconnections between the layers in the neocognitron.” Both figures and captions taken from [9].

Almost 20 years later, Poggio and Riesenhuber came up with the Hierarchical MAX model [41] (commonly known as HMAX). Instead of using average pooling they utilize Max Pooling. They argue that the max operation could be biologically plausible, since it automatically and locally selects a relevant subset of the data.

Both previous networks (Neocognitron and HMAX) attempted to simulate how the brain classifies and detects objects in as biologically correct a way as possible. The rest of the networks discussed in this thesis are inspired by biology, but they are not biologically possible, for various reasons (e.g. feed-forward architecture and weight sharing).

One year before Riesenhuber’s HMAX, Yann LeCun invented the Convolutional Neural Network [29], as we know it, for optical character recognition. In the theory chapter we will further explain how and why it works, since it is of vital importance for this thesis.

2.2 Contemporary Works in Computer Vision

We divide the related modern work into multiple categories for an easier overview. These categories include: pooling schemes (2.2.1), recent classification results using deep CNNs (2.2.2), pre-trained networks used for semi-related tasks (2.2.3), understanding of how and why CNNs work (2.2.4), classification and detection using both color and depth images (2.2.5), classification and detection using stereo vision or multiple viewpoints (2.2.6), different image datasets (2.2.7) and augmentation of data (2.2.8).

2.2.1 Pooling Schemes

Pooling (a.k.a. subsampling) is an important part of CNNs. Nowadays, Max Pooling [41] is the most commonly used pooling technique. Earlier, when LeCun introduced the CNN, Average Pooling [28] was normally used. They both work as their names imply, and we explain them in more detail in Section 3.4, where we also show some examples.

Another pooling method is “Stochastic Pooling” [57], introduced by Zeiler and Fergus. It maintains the advantages of Max Pooling but reduces the overfitting that Max Pooling is prone to. To get an idea of how stochastic pooling works, imagine you are using Max Pooling, but you have multiple copies of the input image, all with some slight variation in the pixel values.

All networks mentioned in this thesis require all input images to be of the same size. This can hurt the accuracy of the network, since information is usually cropped out or distorted by subsampling the images to the desired size. To combat this problem, Kaiming He introduced “Spatial Pyramid Pooling” [17] (SPP), which has previously been used with shallow models and is known as “Spatial Pyramid Matching” (SPM). It works by partitioning the images into different scales, from fine to coarse, and then aggregating the local features for all scales. This produces a fixed-length vector independent of the input size of the image. Therefore, they can place the SPP layer right before the fully connected layers in the network, enabling the use of differently sized images. This also means that they only need to perform the convolution step once for each image and scale, giving a significant (24 to 64 times overall) speed boost to the computation time for object detection tasks. They set a new state-of-art on the VOC07 challenge with 59.2% mAP using a single network, while being 24 times faster.
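To make the SPP idea concrete, the following is a minimal numpy sketch (not the implementation from [17]) that max-pools a C × H × W feature map over a 1 × 1, 2 × 2 and 4 × 4 grid and concatenates the results into a vector whose length is independent of H and W.

```python
import numpy as np

def spatial_pyramid_pool(feature_map, levels=(1, 2, 4)):
    """Pool a (C, H, W) feature map over a pyramid of grids and concatenate
    the per-cell maxima; output length is C * sum(n * n for n in levels).
    Assumes H and W are at least max(levels)."""
    c, h, w = feature_map.shape
    pooled = []
    for n in levels:
        ys = np.linspace(0, h, n + 1, dtype=int)   # bin edges along height
        xs = np.linspace(0, w, n + 1, dtype=int)   # bin edges along width
        for i in range(n):
            for j in range(n):
                cell = feature_map[:, ys[i]:ys[i + 1], xs[j]:xs[j + 1]]
                pooled.append(cell.max(axis=(1, 2)))
    return np.concatenate(pooled)
```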

2.2.2 Classification with Deep Convolutional Neural Networks

Recently, a lot of progress has been made in the area of image classification and object detection with the help of deep CNNs. It all really took off when Krizhevsky and Hinton crushed the previous state-of-art Top-5 error rate on the ImageNet [6] challenge by 10.9% (absolute) compared to the second best entry in the competition, using the methods described in [25]. Deep CNNs are prone to overfitting, meaning they do not generalize very well and easily fit to the training data. The authors overcame this problem by using Dropout, a method first described in [19] and later in [46]; we describe this method in the Theory section 3.5.2.

Since then the results on the competition have continued to improve. Goodfellow [11] came up with an idea to enhance the effects of Dropout using a new type of activation unit that acts as a learned function approximator. In practice this means that each neuron can learn which activation function it should use. This causes different neurons to use different activation functions, all self-taught.

The winner of ILSVRC (ImageNet Large Scale Visual Recognition Challenge) 2013 was Overfeat [42] with a very similar network to Hinton’s with three small changes: no contrast normalization, overlapping pooling and a smaller stride of two instead of four.

At ILSVRC14 the winner was GoogLeNet [48], achieving an impressive 6.7% Top-5 error rate, with VGG [44] (Oxford) coming in second with a 7.3% error rate. They both used multiple very deep CNNs to vote and in that way improve the classification rate. We say they are very deep since they reached down to 22 layers; this can be compared to the previous year’s winner (Overfeat), which used only eight layers. Google and Oxford differed in the way their methods worked. Oxford’s VGG used purely 3 × 3 convolutions down to a maximum depth of 19 layers, while GoogLeNet used 7 × 7, 5 × 5, 3 × 3 and 1 × 1 convolutions. From these convolutions (and pooling layers) GoogLeNet created a module that was reused multiple times along the network, reaching a total depth of 22 layers in the end.

The idea of using 1 × 1 convolutions originated with [32] where the authors showed that 1 × 1 convolutions correspond to a multilayer perceptron producing a more advanced function approximator.

2.2.3 Pre-trained CNNs for Semi-Related Tasks

Multiple researchers [7, 10, 40, 43] have used networks pre-trained on the ImageNet dataset (consisting of ∼1.3 million images). Normally they remove the final classification layer, replacing it with an SVM classifier fine-tuned for their application, or they remove all the fully-connected layers and replace them with an SVM or MLP layer that they tune for their data.

In [7] Donahue and Jia investigate how well features extracted from a CNN trained on ImageNet generalize to other datasets and how well they perform at various depths. They do this by visualizing the separation between different categories in the first and sixth layer, showing greater separation in the deeper layers. They also show that for an eight-layer network the three final fully-connected layers are the most expensive layers when it comes to computational time. By tuning the pre-trained network separately for fine-grained bird classification, domain adaptation and scene labeling they outperform state-of-art models in these categories and thereby prove that the high-level semantic features generalize very well.

Similarly Oquab [40] outperformed state-of-art object detectors on the VOC07 and VOC12 datasets by replacing the last classification layer with a Rectified Linear Unit (ReLU) and Softmax classification layer.

Some researchers have made it even easier by simply taking state-of-art classifiers trained on the ImageNet dataset and extracting the features generated before the final classification layer. Razavian (in [43]) used the Overfeat network with the last classification layer removed. He then told the network to classify the training images; the output was 4096 features for each of the training images, and these features were used for training an SVM classifier. The results presented were created by first putting the test images through the adapted Overfeat network and then sending them into the SVM classifier trained on their specific datasets. Using this method he outperformed state-of-art methods on a series of datasets, among them: Pascal VOC07 Classification, MIT-67 Indoor scene classification, Oxford 102 Flower fine-grained classification, and the Paris 6k, Holidays and UKbench Visual Instance Retrieval datasets.

Szegedy in [49] attempted to do object detection on the VOC07 dataset by removing the final layer of the pre-trained classifier and replacing it with a regression layer. This method did fairly well with a mean Average Precision (mAP) of 30.5%, but it did not outperform the state-of-art.

On the other hand, in [10] Girshick uses “Recognition using Regions” for object detection on the Pascal VOC 2010 and 2012 datasets. He achieved state-of-art results on both, with 58.5% and 53.7% mAP respectively. The method comes from [12], where the authors used regions for object detection for two major reasons: first, regions encode shape and scale information naturally and, secondly, they are robust to clutter from other regions. This proved to be good for object detection and, combined with selective search [55], achieved great results on the VOC object detection datasets.

2.2.4 Understanding CNNs

Some researchers have also tried to understand how and why CNNs work. Among them is Szegedy in [50], where he showed that small perturbations of the images the network has been trained on can cause 100% misclassification. He also showed that the perturbations are quite general, as they significantly decrease the performance of networks trained with different numbers of layers and on different training data.

In [58] Zeiler and Fergus visualize which features are activated by a particular object using deconvolutional neural networks. With the knowledge they gained, they improved the state-of-art on the Caltech-101/256 dataset.

Chatfield [5], on the other hand, shows that it actually is the depth of CNNs that is the major cause of the significant leap in performance, and not differences in the surrounding methods used, e.g. cropping, the use of data augmentation, and GPU implementations.

2.2.5 Using RGB and Depth Images

With the introduction of affordable cameras able to record both RGB and depth information, like the Microsoft Kinect [34], researchers started using them for object classification, recognition and detection, as well as pose estimation. One of the best systems that used RGB-D data before the introduction of CNNs was from Kevin Lai and Dieter Fox in [27], where they treated the depth map like an image and used HOG descriptors. With this method they achieved state-of-art results on the RGB-D Object Dataset [26] with an accuracy of 85.4% on the category classification task.

With the resurgence of CNNs, Socher [45] trained two different CNNs, one for the RGB part and one for the depth part, and then connected both to a shared Recursive Neural Network (RNN). The CNNs and the RNNs were two layers deep each, giving a total depth of four layers. The idea was to use the CNNs to extract low-level features while the RNNs extracted the high-level features. He outperformed all previous methods on the RGB-D Object Dataset with the exception of [2], which takes up five times as much memory while being 0.7% better.

Recently, significant (32.4% versus 21.8% mAP without segmentation) improvements have been made on the NYUD2 [35] dataset by Gupta using CNNs, as reported in [13]. To achieve this result he used a network pre-trained on the ImageNet dataset as initialization weights for his network. Instead of using the original depth channel he used three different transformations on it to create three new channels, consisting of horizontal disparity, height above ground and the angle the pixel surface makes with the inferred gravity direction. This produced a total of six (red, green, blue and three converted depth) channels. Due to the lack of training data in the NYUD2 dataset he created synthetic data from various viewpoints to reduce overfitting and allow better generalization.

Deep learning has also been used to teach robots how to grasp objects of different shapes and sizes. This was done in [31], where the authors used a cascade of two CNNs: the first roughly detects where the object is located, and a slower network with more features then finds it more precisely. They also used RGB-D cameras, but with a total of seven input channels to the networks. Instead of using the RGB channels directly, they used YUV for greater invariance to illumination. They use not only depth but also the X-, Y- and Z-surface normal components, which gives four more channels. Finally, they proposed a novel “multimodal group regularization” for the first input layer of the CNN. All in all, they achieved state-of-art performance when testing on the Cornell grasping dataset.

2.2.6 Using Stereo Vision and Multiple Viewpoints

Researchers have previously looked into the advantages of using multiple views for detection, localization and recognition in images or video. In [18] the authors show that more views of the same object yield better classification and localization rates. However, they did not use CNNs for this task, but instead relied on deformable parts models (DPMs), which were used for the state-of-art image classifiers and detectors before CNNs became popular again.

Sudowe and Leibe designed an extremely fast sliding-window object detector by using multiple viewpoints in video feeds (e.g. CCTV cameras) [47]. The detector requires information about two variables to work properly: it needs to know the height of the target’s bounding box and a range of values for the real-world target size (e.g. human height [1, 2.3] m). It is unable to handle large perspective distortions, when the cameras are placed too far apart. What it is good at is speed, especially when implemented on GPUs, where it can achieve a staggering 222 fps.

The Facebook AI group together with Lior Wolf created an astounding face detector in [51] using deep neural networks. They trained their network on the Social Face Classification (SFC) dataset with more than four million labeled faces and tested it on Labeled Faces in the Wild (LFW), where they achieved a 97.35% accuracy, 0.18% shy of human performance. To reach this incredible performance they used 3D alignment of the faces and three locally connected layers between the last pooling layers and the fully connected layers.

2.2.7 Datasets for Various Tasks

There exist multiple datasets for different (and sometimes the same) kinds of tasks. The five most important ones for this thesis are:

• ImageNet [6] containing over 14 million RGB images with one million of them containing bounding box annotations of 1000 different classes.

• Pascal VOC2012 [8] with 20 different classes represented on 11,530 images used for training and validation. There is a total of 27,450 bounding box annotations as well as 6929 segmentations on the images. This dataset consists solely of RGB images.

• NYUD2 [35] is one of the largest RGB-D datasets for object detection, with 1449 labeled images and over 400,000 images in total. This dataset focuses on larger objects such as sofas, walls and chairs.

• RGB-D Object Dataset [26] is another large RGB-D dataset with over 250,000 images that was created using a spinning table to get a 360° view of each ob- ject, also with bounding box annotations. All objects in this dataset can easily fit on a small table making them interesting for robotic grasping applications.

• RGB-D Scenes Dataset (V1) [26] is another dataset created by Lai that is intended for object detection using RGB-D data. It consists of eight different scenes containing various objects.

2.2.8 Augmentation of Data

Since CNNs tend to perform better the more data there is, researchers typically want to augment the data they have and in this way, artificially, create more data.

With this in mind, Howard [20] came up with a new strategy for creating data. Instead of cropping (or subsampling) images down to 256 × 256, he suggests that one should only subsample the shortest edge down to 256, giving either N × 256 or 256 × N. This way one keeps approximately 30% more information (depending on the aspect ratio of the original image). Since the network used required images of size 224 × 224, he cropped the resulting N × 256 (or 256 × N) image five times. He also flipped the images and used three different scales and views, creating a total of 90 images and predictions for each image. He also created a greedy algorithm to only use the best 10 or 15 predictions in order to not slow down the system, making sure it can run in real time.
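A minimal Python sketch (using Pillow) of the five-crops-plus-flips part of this augmentation scheme; the additional scales and views that bring the total to 90 images per input are omitted, and the 256/224 sizes follow the text above.

```python
from PIL import Image

def howard_crops(img, short_side=256, crop=224):
    """Resize the shortest edge to `short_side` (keeping the aspect ratio),
    then take the four corner crops and the centre crop of size `crop`,
    each together with its horizontal flip: 10 images per input."""
    w, h = img.size
    scale = short_side / min(w, h)
    img = img.resize((round(w * scale), round(h * scale)), Image.BILINEAR)
    w, h = img.size
    corners = [(0, 0), (w - crop, 0), (0, h - crop), (w - crop, h - crop),
               ((w - crop) // 2, (h - crop) // 2)]
    out = []
    for x, y in corners:
        patch = img.crop((x, y, x + crop, y + crop))
        out.append(patch)
        out.append(patch.transpose(Image.FLIP_LEFT_RIGHT))
    return out
```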


Chapter 3

Theory

Convolutional Neural Networks (CNNs), as we know them, were first introduced by Yann LeCun [29] in 1998 for Optical Character Recognition (OCR), where they showed impressive performance. At this point we should note that CNNs are not just used for image-related tasks; they are also commonly used for signal and language recognition, audio spectrograms, video, and volumetric images. The most significant difference between these uses is the input dimension of the signals: for signals and language the input is normally 1D, for images and audio spectrograms it is 2D, and for video and volumetric images it is 3D [30]. In this chapter we will focus on the theory behind 2D CNNs for image-related tasks. We base the theory on [14, 15, 25, 29]. In particular, we explain fully-connected layers in section 3.1, convolutions in section 3.3, pooling in section 3.4 and training of a CNN in section 3.5. Support Vector Machines (SVMs) for classification are explained in section 3.6. In the following sections we use italics for concepts that will be explained in upcoming sections.

3.1 Fully-Connected Layers

Fully-connected layers refer to the final layers in the full CNN model (an example is seen in Figure 3.1). Fully-connected layers operate as a Multi-Layer Perceptron (MLP) with normally either two or three hidden layers and one classification layer.

Figure 3.1: An example of a Convolutional Neural Network with all stages.


An MLP with two hidden layers can approximate any function, assuming it has enough hidden neurons. Normally, the number of neurons in the hidden layers is constant, with 4096 being a common number for deep networks with large input images (similar to the ones used in ILSVRC). The inputs to the first hidden layer originate from all neurons in the previous layer (either a pooling or a convolutional layer). In other words, each neuron in the previous layer is connected to each and every neuron in the first hidden layer. Outputs from the first hidden layer are connected to each and every neuron in the second hidden layer; it is fully connected, and this characteristic is the origin of its name. Outputs from the last hidden layer are then fully connected to the final classification layer. The size of the final layer depends on the number of classes the network is trained to separate between. For example, if you want to separate between 78 different bird species, you set the number of neurons (i.e. the size) of the final layer to 78.

Looking at a single neuron j in the MLP, it is described mathematically by the following equations

$$v_j(n) = \sum_{i=0}^{m} w_{ji}(n)\, y_i(n) \qquad (3.1)$$

$$y_j(n) = \varphi_j\big(v_j(n)\big) \qquad (3.2)$$

where $y_j(n)$ is the output of neuron $j$ at iteration $n$. The weight between neuron $i$ (in the previous layer) and neuron $j$ (in the current layer) is $w_{ji}$. From this follows that $v_j(n)$ is the weighted sum of all neurons in the previous layer, while $\varphi_j(\cdot)$ is the activation function of neuron $j$ and $m$ is the total number of inputs (neurons in the previous layer). The bias is accounted for with $i = 0$ and $y_0(n) = 1$.
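A minimal numpy sketch of equations (3.1)-(3.2) for a single fully-connected layer; the layer sizes and the small Gaussian initialization are illustrative (the latter follows the convention mentioned in Section 3.3.2).

```python
import numpy as np

def relu(v):
    return np.maximum(0.0, v)

def fully_connected(y_prev, W, b, activation=relu):
    """One fully-connected layer: v_j = sum_i w_ji * y_i (bias written
    separately) as in (3.1), followed by y_j = phi_j(v_j) as in (3.2).
    Shapes: y_prev (m,), W (n_out, m), b (n_out,)."""
    v = W @ y_prev + b
    return activation(v)

# tiny usage example with randomly initialized weights
rng = np.random.default_rng(0)
y0 = rng.standard_normal(4096)                  # e.g. output of the last pooling layer
W1 = rng.standard_normal((100, 4096)) * 0.01    # 100 hidden neurons, toy size
b1 = np.zeros(100)
y1 = fully_connected(y0, W1, b1)
```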

3.2 Activation Function

The activation function plays a central and very important role in how CNNs work and how well they perform. Every neuron in every convolutional and fully-connected layer has a specific non-linear activation function. The type of non-linear function varies, but most networks nowadays use ReLUs (Rectified Linear Units), mathematically described by ϕ(x) = max(0, x). The major reason for this is the ease and speed with which the derivative can be calculated, compared to other, previously commonly used activation functions like the logistic ϕ(x) = 1/(1 + e^−x) and hyperbolic tangent ϕ(x) = tanh(x) functions. In Figure 3.2 we can see the significant speed boost ReLUs provide when compared to sigmoid functions. We should also mention the possibility of self-trained activation functions from [11], where they are explained in detail. By using self-taught activation functions it is possible to keep the non-saturating properties of ReLUs while being able to tailor each neuron’s activation function to the specific classification task.

Figure 3.2: Difference in the number of iterations it takes for the training error to reach 25% (on the CIFAR-10 dataset) using ReLU (solid) and tanh (dashed) activation functions. Figure taken from [25].

Furthermore, in Figure 3.3 we can view the difference between the most common non-linear functions. It is important to note that ReLUs are non-saturating, unlike the logistic and hyperbolic tangent functions; this is thought to contribute greatly to the reduction in training time. The reason for using non-linear rather than linear activation functions is that they allow us to solve non-trivial problems using a much smaller number of neurons than with linear functions. A simple example is the XOR classification problem. It consists of two classes, A and B. The points belonging to class A are p_A = [(0, 0), (1, 1)] and the corresponding points for B are p_B = [(0, 1), (1, 0)]. This set of points is obviously impossible to separate using a single straight line; a minimum of two is needed. Using curved lines or ellipses, however, we can separate them with a single boundary, and non-linear functions allow us to create such curves. This means that with non-linear activation functions the problem can be solved with only one hidden neuron, whereas with linear activation functions a minimum of two hidden neurons is needed to solve the XOR problem.
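As a concrete check of the XOR argument above, the sketch below solves XOR with a single hidden neuron using a non-linear step activation. It assumes direct input-to-output connections in addition to the hidden neuron (the usual construction behind this claim), and the weights are hand-picked rather than learned.

```python
def step(x):
    # a simple non-linear (Heaviside step) activation
    return 1.0 if x > 0 else 0.0

def xor_net(x1, x2):
    """XOR with one hidden neuron plus direct input-to-output connections:
    h fires only for (1, 1), so y = x1 + x2 - 2*h is 0 for (0,0) and (1,1),
    and 1 for (0,1) and (1,0)."""
    h = step(x1 + x2 - 1.5)
    return x1 + x2 - 2.0 * h

for a, b in [(0, 0), (0, 1), (1, 0), (1, 1)]:
    print((a, b), "->", xor_net(a, b))   # 0.0, 1.0, 1.0, 0.0
```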

Figure 3.3: Different activation functions that are commonly used: the hyperbolic tangent tanh(x), the ReLU max(0, x) and the logistic function 1/(1 + e^−x).

3.3 Convolutional Layers

The convolutional layers are a more restricted version of the MLP, adapted to take 2D inputs instead of 1D. The idea behind convolutional layers is to take elementary features such as edges, corners and endpoints, and combine them over multiple layers into high-level features that might describe a complete object, e.g. a book, glass or chair. This type of architecture operates in the following manner: the first layer f_1(·) contains the most elementary features, the second layer f_2(f_1(·)) is a function of the elementary features in the previous layer, the third layer is a function f_3(f_2(f_1(·))) of the features in the second layer, and so on. An architecture like this is desirable, as very high-level features describing a face, a chair or a car can be generated with far fewer features than if they were combined linearly.

A very important and valuable attribute is the shift invariance that convolutional layers provide. That is, if the input to the first layer is shifted, then the output of the first layer is also shifted by the same amount. This is true for every convolutional layer in the network.

3.3.1 Elementary Features

Elementary features detect the simplest of shapes, such as edges and corners. The Sobel operator, seen in Figure 3.4, is an example of a simple edge detector; it approximates a derivative of the pixel values. This type of detector follows the basic assumption that is the foundation of CNNs: elementary features useful in one part of the image are likely to be useful elsewhere in the image. From this assumption it follows that all units in one feature map are forced to have identical weight vectors, even though their (the neurons’) local receptive fields cover different sections of the image. This implies that every unit (a.k.a. neuron) in each feature map has the exact same weight vector. By sharing the weights this way one gains three significant advantages:

1. Generalization ability of the network increases as the number of free param- eters is reduced. This means that the network is less likely to overfit to the training data.

2. There is a reduction in training time as there are fewer free parameters to tune.

3. It allows for parallelization (both on GPUs and CPUs), which also helps to significantly reduce the training and testing times.

3.3.2 Weight Vector

A weight vector can be viewed as a filter. Its task is to perform a specific operation on the part of the image covered by the corresponding local receptive field. It is very important to note that all weights are learned by the network itself; they are not tuned manually by the user. The user only initializes the weights, normally by randomly picking values from a Gaussian distribution with mean zero and variance 0.01. The weights are shared within each feature map, and therefore every neuron that belongs to a feature map has exactly the same weights as every other neuron in that feature map.

In the example in Figure 3.4 this is a horizontal and a vertical derivative across the whole image. The weight vector (for the horizontal Sobel, w̄ = [0, −1, −2, −1, 0, 0, 0, 1, 2, 1]) is described by the 3 × 3 matrix (corresponding to a 3 × 3 local receptive field) and a bias (0 in this case). As seen, the resulting images after applying the horizontal and vertical Sobel operators correspond to two different feature maps in the same layer.
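The Sobel example written out as code: a small sketch (using scipy) that applies the horizontal and vertical 3 × 3 kernels to a grayscale image, producing two feature maps in the same layer. Note that convolution flips the kernel, so the result differs from plain correlation with the weight vector above only by a sign.

```python
import numpy as np
from scipy.signal import convolve2d

# Horizontal and vertical Sobel kernels; each corresponds to one shared
# weight vector (plus a zero bias) applied over a 3x3 local receptive field.
sobel_h = np.array([[-1, -2, -1],
                    [ 0,  0,  0],
                    [ 1,  2,  1]], dtype=float)
sobel_v = sobel_h.T

def sobel_feature_maps(image):
    """Apply both filters to a 2-D grayscale image. 'valid' mode shrinks
    each feature map by two pixels per dimension, as in Section 3.3.3."""
    return (convolve2d(image, sobel_h, mode="valid"),
            convolve2d(image, sobel_v, mode="valid"))
```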

3.3.3 Local Receptive Field

Every neuron has a local receptive field from which features are extracted. In Figure 3.5 the unit’s local receptive field is 3 × 3 pixels (more common is 5 × 5 and sometimes 7 × 7, which allows for a larger variety of elementary features with fewer layers). Typically, the receptive field increases with the input image size. The local receptive field is the input to the neuron; this allows for 10 trainable coefficients for each feature map (one for each of the nine input pixels and one for the bias, assuming a 3 × 3 receptive field). Let us say that the first neuron has a local receptive field covering the top-left corner of the image. Sliding this receptive field across the image gives the second neuron a local receptive field (same 3 × 3) covering the top-left corner shifted one pixel to the right of the first neuron’s receptive field (assuming a stride of one is used). This gives an overlap of 3 × 2 pixels with only 3 × 1 new pixels. The window continues to slide across the whole image until every pixel has been covered at least once. The resulting feature map will be smaller than the previous layer by two pixels (for 3 × 3) in width and height, since putting the center on an edge pixel is considered an invalid operation, because a part of the mask falls outside the previous feature map (or image).

Figure 3.4: The Sobel operator is an elementary kernel able to extract horizontal and vertical changes in the image. Looking closely, one can see the difference between the horizontal and vertical Sobel operator: using the horizontal Sobel gives clear horizontal edges, while the vertical ones are invisible. The figure is an edited version of [36].

3.3.4 Feature Map

Each layer consists of multiple feature maps, where each feature map is the result of performing the same operation (applying the same weight vector over a local receptive field) across the previous layer to detect a certain feature. If a layer consists of eight feature maps, it means the receptive field has eight different weight vectors. Each pixel in position (k, l) in feature map m is the result of applying a weight vector with a receptive field centered at (k, l) on feature map m − 1, as seen in Figure 3.5. By sharing weights in this way the chances of overfitting to the training data are reduced. It also reduces the number of free parameters and allows for detecting multiple features of different kinds over the same part of the image. See Figure 3.6 for an example of how the feature maps are built up, and Figure 3.4 for an example of what a feature map might look like after a simple edge detector has been applied to the input image.


Figure 3.5: Visual explanation of how the convolutional operation works. Here a 3 × 3 receptive field is used with the weight vector [0, 4, 0, ..., 0, −4] (w_0 = 0 is the bias weight). Figure from [22].

3.4 Pooling Layers

Pooling (a.k.a. subsampling) is a simple way of reducing the precision of the position at which distinctive features are located in the feature map. This can be done since the exact position of a feature is irrelevant; only its position in relation to the other features is of importance, especially for classification tasks. In other words, we do not care where an edge or a corner is located in the image, we only care about its position relative to the other corners and edges in the image.

For example, we can separate between a passenger jet and a bus by looking at the contours. If seen from the side, they both have straight and curvy edges as well as endpoints. But from the way the contours are linked together we are able to separate between the two vehicles.

In the end you are simply reducing the resolution of the feature maps either by averaging or maximizing (see Figure 3.8 for an example of both) over an M × M local receptive field. Of the two pooling techniques maximizing is most commonly used nowadays and is referred to as max-pooling. It uses a receptive field of M × M where the maximum value is forwarded to the next layer.

The size of the receptive field is critical; most commonly used is 2 × 2, due to the significant information loss that occurs when using larger receptive fields. For instance, if a 2 × 2 receptive field is used, four pixels are turned into one. Increasing the receptive field to 3 × 3 would mean nine pixels are turned into one, and if 4 × 4 is used, 16 pixels are turned into one. To put it into perspective, imagine you have a 12 × 12 image (or look at Figure 3.7), giving a total of 144 pixels. A receptive field of 2 × 2 would leave you a 6 × 6 image with 36 pixels. You have now lost 75% of the information, but the object might still be recognizable. If instead you use 3 × 3 you only get a 4 × 4 image with 16 pixels, leaving only 11% of the original information. Now it might still be possible to recognize the original shape, but if you are unlucky you have lost too much information. If you push it even further, 4 × 4 pooling will get you a 3 × 3 image with a total of nine pixels; you have now lost 94% of all the information in the image, and it is most likely impossible to identify the original shape or figure.
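A small numpy sketch of non-overlapping max-pooling matching the 12 × 12 example above; the receptive field size M is the only parameter.

```python
import numpy as np

def max_pool(feature_map, size=2):
    """Non-overlapping MxM max-pooling. Assumes the map dimensions are
    divisible by `size`; a 12x12 map becomes 6x6 for size=2 (75% of the
    values discarded) and 4x4 for size=3."""
    h, w = feature_map.shape
    blocks = feature_map.reshape(h // size, size, w // size, size)
    return blocks.max(axis=(1, 3))

fm = np.arange(144, dtype=float).reshape(12, 12)
print(max_pool(fm, 2).shape)   # (6, 6)
print(max_pool(fm, 3).shape)   # (4, 4)
```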

By subsampling a feature map we can extract features that are much larger than the actual size of the convolution we use. Given that we subsample a feature map using a 2 × 2 receptive field, we will extract features that are four times as large without changing the size of the convolution. Let us say we have a 10 × 10 feature map that we want to turn into one pixel. Without subsampling we would need a 10 × 10 convolution for this, with 101 trainable coefficients. If we first subsample the feature map using a 2 × 2 receptive field, we get a feature map of size 5 × 5; then a 5 × 5 convolution with only 26 trainable coefficients would be enough. This provides a significant speed boost when it comes to both training and testing, and hopefully we did not lose any crucial information. We should note that some loss of information is good, as it allows the network to generalize better and avoid overfitting to the training data; this is another reason why pooling is important.

Figure 3.7: Shows the significant information loss when using a large receptive field for pooling. In this case max-pooling was used; observe that one pixel ranges from x.5 to (x + 1).5.

Figure 3.8: Visual representation of the two most common pooling methods. Nowadays max-pooling is almost always used. The figure is an edited version of a figure from [24].

3.5 Training the Network

All parts of the network are trained with error back-propagation using stochastic gradient descent in batch mode. This means that during training a batch of images is forwarded through the network until each of the images has been classified into one of the possible classes using the network’s current weights. Then every image classification is compared to the ground truth for that image. The error, if there is one, is then back-propagated through the network, changing the weights one by one in the direction that would minimize the error, also known as the steepest gradient descent direction.

Mathematically we can describe the gradient of neuron j at iteration n with the following equations

$$\delta_j(n) = e_j(n)\,\varphi'_j\big(v_j(n)\big) \qquad (3.3)$$

where $e_j(n)$ is the error signal at neuron $j$ and $\varphi'_j(v_j(n))$ is the derivative of the neuron output (3.2). This means that if $j$ is an output neuron (classification neuron), $e_j(n)$ corresponds directly to the classification error. If, on the other hand, $j$ is a hidden neuron, the gradient can be rewritten as

$$\delta_j(n) = \varphi'_j\big(v_j(n)\big) \sum_{k} \delta_k(n)\, w_{kj}(n) \qquad (3.4)$$

where $k$ is a neuron in the layer after $j$ (assuming we are going from the output to the input of the network, i.e. backwards); in other words, outputs from $j$ are inputs to $k$. This has a major implication on the maximum depth a network can have, since the power of the error signal decreases exponentially for each layer until it becomes meaningless, as shown by Bengio in [1].
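To tie (3.3) and (3.4) together, here is a minimal numpy sketch of one forward and backward pass for a tiny two-layer MLP with sigmoid activations and error signal e = d − y. The layer sizes, learning rate and error definition are illustrative assumptions, not the setup used in the thesis.

```python
import numpy as np

def sigmoid(v):
    return 1.0 / (1.0 + np.exp(-v))

def sigmoid_prime(v):
    s = sigmoid(v)
    return s * (1.0 - s)

rng = np.random.default_rng(0)
x = rng.standard_normal(8)                 # input pattern
d = np.array([1.0])                        # desired output for one output neuron
W1, b1 = rng.standard_normal((5, 8)) * 0.01, np.zeros(5)
W2, b2 = rng.standard_normal((1, 5)) * 0.01, np.zeros(1)
lr = 0.1

# forward pass, equations (3.1)-(3.2)
v1 = W1 @ x + b1
y1 = sigmoid(v1)
v2 = W2 @ y1 + b2
y2 = sigmoid(v2)

# output neuron: delta_j = e_j * phi'(v_j)                     (3.3)
delta2 = (d - y2) * sigmoid_prime(v2)
# hidden neurons: delta_j = phi'(v_j) * sum_k delta_k * w_kj   (3.4)
delta1 = sigmoid_prime(v1) * (W2.T @ delta2)

# steepest-descent weight update: w_ji <- w_ji + lr * delta_j * y_i
W2 += lr * np.outer(delta2, y1); b2 += lr * delta2
W1 += lr * np.outer(delta1, x);  b1 += lr * delta1
```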

3.5.1 Amount of Training Data

The amount of training data plays a huge role in the performance of CNNs. When training a network from scratch, a few thousand annotated images will not be enough. One needs tens of thousands or hundreds of thousands, preferably millions, of images. This amount of annotated data is hard to come by; ImageNet [6] is a commonly used dataset for training, with over 1,000,000 annotated images covering over 1000 object classes.

A way around this problem with the huge amounts of annotated data is to use networks that have already been trained on ImageNet. There are currently two common ways of doing this. The first one assumes that features from the pre-trained network are very general and work well for other types of data. This has proved to be a valid assumption, especially considering that the network has been trained on 1000 different classes. In the first method, the final classification layer is removed and then the images in our, much smaller, dataset are sent through the network as is done normally when images are being classified. But instead of getting a vector of length 1000 (for each image) containing the probability for each class, a vector of size 4096 (normally) is returned for each image. In practice, one can view this as converting each image to a feature vector of size 4096 using the pre-trained network. It is very important to note that no other changes are made to the pre-trained network other than removing the final classification layer. Feature vectors generated from the training images are then used to train a Support Vector Machine (SVM) classifier. This classifier is adapted specifically to our training data, since the feature vectors generated from the feature extraction are high-level representations of our images. Evaluation of the SVM classifier is simple: we take the feature vectors from the test images and run them through the SVM. It will then predict which class each of the images belongs to, or provide a probability estimate for each class. This relatively simple method has proven itself by providing outstanding results [43] on several datasets.
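A sketch of this first recipe, using torchvision's pre-trained VGG-16 as a stand-in feature extractor and scikit-learn's linear SVM. The thesis extracts features with the VGG network [44], but the specific libraries, the layer kept and the SVM settings here are assumptions, and train_images, train_labels and test_images are hypothetical placeholders.

```python
import torch
import torchvision.models as models
import torchvision.transforms as T
from sklearn.svm import LinearSVC

# Pre-trained VGG-16 with the final classification layer removed, so that
# each image is mapped to a 4096-dimensional feature vector.
vgg = models.vgg16(weights=models.VGG16_Weights.IMAGENET1K_V1)
vgg.classifier = torch.nn.Sequential(*list(vgg.classifier.children())[:-1])
vgg.eval()

preprocess = T.Compose([T.Resize((224, 224)), T.ToTensor(),
                        T.Normalize(mean=[0.485, 0.456, 0.406],
                                    std=[0.229, 0.224, 0.225])])

@torch.no_grad()
def extract_features(pil_images):
    batch = torch.stack([preprocess(im) for im in pil_images])
    return vgg(batch).numpy()                    # shape (n_images, 4096)

# hypothetical usage:
# features = extract_features(train_images)
# clf = LinearSVC(C=1.0).fit(features, train_labels)
# predictions = clf.predict(extract_features(test_images))
```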

The second method has the prospect of providing even better results at the cost of longer training times, as it focuses on fine-tuning the weights/features for your particular task. This method is quite similar to the previous method, with two small, yet highly influential, differences. First, using exactly the same architecture up to the final classification layer is not required. This means that the architecture of the network can be altered but, as in the previous method, the final classification layer is normally replaced with an SVM or softmax classification layer. The second difference is that the pre-trained weights are used for initializing the network, instead of using randomly initialized weights as is normally done when training from scratch. When the network’s weights have been initialized, the network is trained in the same way as when you train it from scratch, but using a much smaller learning rate in order to fine-tune the weights for your task. The learning rate controls how much the weights are changed when an image is misclassified.

3.5.2 Dropout

The dropout method [19, 46] does not have to be used, but in practice it is always used, since it improves the network’s generalization capabilities immensely. The essence of the method is to drop neurons randomly at every iteration during training. In the fully-connected layers around 50% of the neurons are dropped, while for the input layer around 20% are dropped. This gives 2^N possible networks with shared weights, where N is the number of neurons. The chance of a specific network being trained is therefore slim, implying that no single network becomes specialized to the training data. This makes each neuron less dependent on the other neurons for correctly classifying an image.

During testing all the neurons are present, but their weights are multiplied by the probability of being active during training. For example, suppose neuron A was active 4 out of 10 iterations; its weight w_A is then multiplied by p = 0.4 during testing, while neuron B, active 7 out of 10 iterations, will have its weight w_B multiplied by 0.7. This gives the same expected output during training and testing. By randomly dropping neurons in the fully-connected layers at every iteration, the neurons are forced to provide more robust classifications by themselves and to rely less on the other neurons in the network for correctly classifying the object. In Figure 3.9 the effect of dropout in the fully-connected layers is clearly visible.

Figure 3.9: Shows the effect dropout has on a network. Left: Standard fully-connected, two-layer neural network without dropout. Right: The same neural network after dropout has been applied; crossed circles represent dropped neurons. Figure taken from [46].
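A minimal numpy sketch of the dropout behaviour described above, applied to a vector of activations. It scales the activations at test time rather than the stored weights, which is equivalent; the drop probabilities are the approximate values from the text.

```python
import numpy as np

rng = np.random.default_rng(0)

def dropout_train(y, drop_prob=0.5):
    """Training time: drop each neuron independently with probability
    drop_prob (around 50% in the fully-connected layers, 20% at the input)."""
    mask = rng.random(y.shape) >= drop_prob
    return y * mask

def dropout_test(y, drop_prob=0.5):
    """Test time: keep all neurons but scale by the probability of having
    been active during training, giving the same expected output."""
    return y * (1.0 - drop_prob)
```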

3.6 Support Vector Machines

A Support Vector Machine (SVM) is a binary feedforward neural network that can be used for pattern classification given both linearly and non-linearly separable data.

Given the simplest scenario with two classes that are linearly separable the main idea of SVMs can be summarized as “Given a training sample, the support vector machine constructs a hyperplane as the decision surface in such a way that the margin of separation between positive and negative examples is maximized.” [16]

This idea can be extended to non-linearly separable patterns as well with some adaptations.

3.6.1 Linearly Separable Data

We start with the simplest scenario, when the data is linearly separable. Imagine a training set {¯ x

i

, d

i

}

Ni=0

consisting of an input pattern ¯ x

i

, where i is the ith example, and d

i

± 1 (desired output) is the corresponding class. A hyperplane separating this training set is described by

w ¯

T

x ¯

i

+ b = 0 (3.5)

where ¯ w is an adjustable weight vector and b is the bias term. This means that classes belonging to d

i

= 1 are described by

w ¯

T

x ¯

i

+ b ≥ 0 (3.6)

(33)

3.6. SUPPORT VECTOR MACHINES

and equivalently for d

i

= −1

w ¯

T

x ¯

i

+ b < 0 (3.7)

This method will only provide a solution and is know as a perceptron, it is by no means guaranteed to be optimal. In figure 3.10a we can view an example of possible solutions that different perceptrons might come up with.

A better way would be to add a margin to each side to act as a buffer zone to the closest data points of each class, even better is to maximize this margin. Assume the minimum distance between the two classes is 2ρ. This results in the optimal hyperplane having the margin ρ to each class. After normalizing ρ to 1, the optimal hyperplane is described by

\[
\bar{w}_0^T \bar{x}_i + b_0 \geq 1 \quad \text{for } d_i = 1
\]
\[
\bar{w}_0^T \bar{x}_i + b_0 \leq -1 \quad \text{for } d_i = -1 \qquad (3.8)
\]
where $\bar{w}_0$ and $b_0$ are the weight vector and bias of the optimal hyperplane, respectively. This formulation is equivalent to
\[
d_i(\bar{w}_0^T \bar{x}_i + b_0) \geq 1 \qquad (3.9)
\]
It can be proven that $\rho = \frac{1}{\|\bar{w}_0\|}$, so maximizing the margin $\rho$ is equal to minimizing the norm of the weights $\|\bar{w}_0\|$. This gives the following constrained quadratic optimization problem.

Minimize:
\[
\frac{1}{2} \bar{w}^T \bar{w} \qquad (3.10)
\]
subject to:
\[
d_i(\bar{w}^T \bar{x}_i + b) \geq 1 \qquad (3.11)
\]
Since both $\bar{w}$ and $b$ are unknown, the problem normally solved is the dual. This is much less complicated, but a lot less intuitive. The dual is given by

Maximize:
\[
\sum_{i=1}^{N} \alpha_i - \frac{1}{2} \sum_{i,j} \alpha_i \alpha_j d_i d_j K(\bar{x}_i, \bar{x}_j) \qquad (3.12)
\]
subject to:
\[
\sum_{i=1}^{N} \alpha_i d_i = 0, \qquad \alpha_i \geq 0 \qquad (3.13)
\]
where $K(\bar{x}_i, \bar{x}_j) = \bar{x}_i^T \bar{x}_j$ is a kernel function and the $\alpha_i$ are Lagrange multipliers. The optimal weight vector is then given by the optimal multipliers $\alpha_{0,i}$ according to
\[
\bar{w}_0 = \sum_{i=1}^{N_s} \alpha_{0,i} d_i \bar{x}_i \qquad (3.14)
\]


where $N_s$ is the total number of support vectors. The bias is
\[
b_0 = 1 - \sum_{i=1}^{N_s} \alpha_{0,i} d_i \bar{x}_i^T \bar{x}^{(s)} \qquad (3.15)
\]
where $\bar{x}^{(s)}$ are the support vectors, i.e. those for which the Lagrange multipliers $\alpha_{0,i} \neq 0$. The advantage of maximizing the margin is clearly visible in Figure 3.10b.

Figure 3.10: Illustrates the difference between two different solutions to the same problem. Left (a): A few possible hyperplanes separating the training data, i.e. possible solutions found by a perceptron. Right (b): The optimal separating hyperplane with the maximum margin, as found by an SVM; the filled squares and circle are support vectors (i.e. $\alpha_{0,i} \neq 0$). Both figures from [54].
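To connect the dual solution to equation (3.14), the following scikit-learn sketch (a tooling assumption for illustration, not part of the thesis pipeline) fits a linear SVM on two separable point clouds and recovers $\bar{w}_0$ from the stored products $\alpha_{0,i} d_i$ and the support vectors; the very large cost approximates the hard-margin case discussed here.

\begin{verbatim}
import numpy as np
from sklearn.svm import SVC

# Two linearly separable point clouds.
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(-2, 0.5, (20, 2)), rng.normal(2, 0.5, (20, 2))])
d = np.hstack([-np.ones(20), np.ones(20)])

svm = SVC(kernel="linear", C=1e6)   # very large C ~ hard margin
svm.fit(X, d)

# Equation (3.14): w0 = sum over support vectors of alpha_{0,i} d_i x_i.
# scikit-learn stores the products alpha_i * d_i in dual_coef_.
w0 = svm.dual_coef_ @ svm.support_vectors_
print(np.allclose(w0, svm.coef_))   # True: same weight vector
print(svm.intercept_)               # the bias b0
\end{verbatim}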

3.6.2 Non-Linearly Separable Data

In most cases, however, the data will not be linearly separable; see an example in Figure 3.11a. There are three common methods for handling such cases. The basic ideas are listed below and then explained in more depth later in the section.

1. Allow points within the margin.

2. Linearly separate the data in a higher dimensional space.

3. Combine the two previous methods.

Allow points within the margin

Figure 3.11: Illustrates the difference between using a soft and a hard margin. Left (a): A hard-margin SVM would not converge for this case; the figure illustrates that even with zero margin it is an impossible task for a linear, hard-margin SVM to solve. Figure taken from [39]. Right (b): A soft margin, where some points are allowed to be misclassified; for this case the SVM would converge. Figure taken from [53].

When points are allowed to be within the margin (also known as a soft margin) they can violate the condition (3.9) in two different ways: either the point is correctly classified within the margin, or it is misclassified within the margin. To handle both these cases the non-negative slack variable $\xi_i$ is introduced, measuring the point's deviation from the correct margin. Inserting the slack variable into (3.9) gives the new formulation

\[
d_i(\bar{w}_0^T \bar{x}_i + b_0) \geq 1 - \xi_i \qquad (3.16)
\]
where $i = 1, 2, \ldots, N$ as in (3.9). Points that are correctly classified but lie within the margin are in the region $0 < \xi_i \leq 1$, with points on the hyperplane corresponding to $\xi_i = 1$ (see Figure 3.11b for an example). Points on the wrong side are in the region $\xi_i > 1$.
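To make the different $\xi_i$ regions concrete, the small sketch below uses the usual explicit form of the slack, $\xi_i = \max(0,\, 1 - d_i(\bar{w}_0^T\bar{x}_i + b_0))$, on an assumed toy hyperplane (not one estimated in the thesis): one point ends up outside the margin, one correctly classified inside it, and one misclassified.

\begin{verbatim}
import numpy as np

def slacks(X, d, w0, b0):
    # xi_i = max(0, 1 - d_i * (w0 . x_i + b0)), cf. (3.16)
    return np.maximum(0.0, 1.0 - d * (X @ w0 + b0))

w0, b0 = np.array([1.0, 0.0]), 0.0               # toy hyperplane x1 = 0
X = np.array([[2.0, 0.0], [0.5, 1.0], [-0.3, 0.0]])
d = np.array([1.0, 1.0, 1.0])
print(slacks(X, d, w0, b0))   # [0.  0.5 1.3]
# 0.0: outside the margin, 0.5: inside the margin, 1.3: misclassified
\end{verbatim}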

What this means for the dual problem (3.12) is that an upper bound (a cost) $C$ is added to $\alpha_i$ in the dual constraints (3.13), giving the new formulation.

Maximize:
\[
\sum_{i=1}^{N} \alpha_i - \frac{1}{2} \sum_{i,j} \alpha_i \alpha_j d_i d_j K(\bar{x}_i, \bar{x}_j) \qquad (3.17)
\]
subject to:
\[
\sum_{i=1}^{N} \alpha_i d_i = 0, \qquad 0 \leq \alpha_i \leq C \qquad (3.18)
\]

If $C \to \infty$ the problem approaches the hard margin (a smaller margin), allowing almost no points inside the margin. On the other hand, if $C \to 0$, the margin grows as more points are allowed inside it.
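The effect of the cost parameter can be illustrated with the short scikit-learn sketch below (an assumed toy setup, not an experiment from the thesis): on overlapping data, a small $C$ yields a wide margin with many support vectors, while a large $C$ approaches the hard-margin behaviour.

\begin{verbatim}
import numpy as np
from sklearn.svm import SVC

# Two overlapping classes, so a hard margin is impossible.
rng = np.random.default_rng(1)
X = np.vstack([rng.normal(-1, 1.0, (50, 2)), rng.normal(1, 1.0, (50, 2))])
d = np.hstack([-np.ones(50), np.ones(50)])

for C in (0.01, 1.0, 100.0):
    svm = SVC(kernel="linear", C=C).fit(X, d)
    margin = 2.0 / np.linalg.norm(svm.coef_)   # margin width 2*rho = 2/||w||
    n_sv = svm.support_vectors_.shape[0]       # points with alpha_i > 0
    print(f"C={C:>6}: margin width {margin:.2f}, {n_sv} support vectors")
\end{verbatim}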


Figure 3.12: Visualization of the different kernels, normalized to the range [0, 1]. Solid blue: sigmoid kernel (tanh); dashed red: polynomial kernel of degree 2; dotted magenta: radial basis kernel.

Separate in a higher dimensional space

For linearly separable data, the linear kernel $K(\bar{x}_i, \bar{x}_j) = \bar{x}_i^T \bar{x}_j$ is enough to separate the data. For non-linearly separable data the kernel can be changed to a non-linear kernel that fulfills Mercer's theorem. This allows the data to be linearly separated in a higher dimensional space. Since this is done implicitly, without ever calculating the coordinates of the data in the higher dimensional space, it is known as the “kernel trick”. The most commonly used kernels are listed below.

• Polynomial kernel $(\bar{x}^T \bar{x}_i + 1)^p$, where $p$ (the power) is set by the user.

• Radial basis kernel $\exp\!\left(-\frac{1}{2\sigma^2}\|\bar{x} - \bar{x}_i\|^2\right)$, where $\sigma^2$ (the width) is set by the user.

• Sigmoid kernel $\tanh(\beta_0\, \bar{x}^T \bar{x}_i + \beta_1)$, where only some values of $\beta_0$ and $\beta_1$ fulfill Mercer's theorem; they are set by the user.

Observe that for the polynomial and radial basis kernels Mercer's theorem is always satisfied, so they can always be used. The shapes of the different kernels can be seen in Figure 3.12.


Figure 3.13: Displays the power of the kernel trick. Something that is very complex to separate in one dimension can easily be separated in a higher dimension. Figure from [33].
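As a toy illustration of the listed kernels and of the one-dimensional situation in Figure 3.13, the sketch below (with assumed parameter values, not code from the thesis) implements the three kernel functions and uses a precomputed radial basis Gram matrix to separate a problem that no linear hard-margin SVM could solve.

\begin{verbatim}
import numpy as np
from sklearn.svm import SVC

def poly_kernel(x, xi, p=2):
    return (x @ xi.T + 1.0) ** p

def rbf_kernel(x, xi, sigma2=1.0):
    d2 = ((x[:, None, :] - xi[None, :, :]) ** 2).sum(-1)
    return np.exp(-d2 / (2.0 * sigma2))

def sigmoid_kernel(x, xi, beta0=0.5, beta1=-1.0):
    return np.tanh(beta0 * x @ xi.T + beta1)

# 1D problem as in Figure 3.13: one class inside an interval, one outside.
x = np.linspace(-3, 3, 200).reshape(-1, 1)
d = np.where(np.abs(x.ravel()) < 1.0, 1, -1)

# A linear SVM cannot separate this, but an RBF Gram matrix can.
svm = SVC(kernel="precomputed", C=10.0)
K_train = rbf_kernel(x, x, sigma2=0.5)
svm.fit(K_train, d)
print((svm.predict(K_train) == d).mean())   # essentially 1.0
\end{verbatim}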

Combination

Of course, the two above-mentioned techniques can be combined, and usually are. By combining them, a small degree of margin violation is permitted while keeping a relatively large margin, which allows for better generalization than if the methods were used separately.

A good way of thinking about it is that the cost parameter $C$ controls the size of the margin, while the kernel function controls how precisely the hyperplane fits the training data. This means that the better the training data represents the test set, the higher the cost $C$ that can be used, thereby relying more on the kernel function $K$.
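In practice the cost $C$ and the kernel parameters are therefore chosen together, typically by cross-validation. The brief scikit-learn sketch below (an assumed tooling choice, not part of the thesis experiments) illustrates such a joint search on noisy, non-linearly separable toy data.

\begin{verbatim}
from sklearn.datasets import make_moons
from sklearn.model_selection import GridSearchCV
from sklearn.svm import SVC

# Noisy, non-linearly separable toy data.
X, d = make_moons(n_samples=200, noise=0.25, random_state=0)

# Jointly tune the cost C and the RBF kernel parameter gamma (inverse width).
param_grid = {"C": [0.1, 1, 10, 100], "gamma": [0.01, 0.1, 1, 10]}
search = GridSearchCV(SVC(kernel="rbf"), param_grid, cv=5)
search.fit(X, d)
print(search.best_params_, search.best_score_)
\end{verbatim}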

