DEGREE PROJECT IN INFORMATION AND COMMUNICATION TECHNOLOGY,
SECOND CYCLE, 30 CREDITS
STOCKHOLM, SWEDEN 2018

Improving the Accuracy of 2D On-Road Object Detection Based on Deep Learning Techniques


Abstract

This paper focuses on improving the accuracy of detecting on-road objects, including cars, trucks, pedestrians, and cyclists. To meet the requirements of the embedded vision system and maintain a high detection speed in the advanced driving assistance system (ADAS) domain, the neural network model is designed based on single-channel images as input from a monocular camera.

In the past few decades, the forward collision avoidance system, a sub-system of ADAS, has been widely adopted in vehicular safety systems for its great contribution to reducing accidents. Deep neural networks, as the state-of-the-art object detection technique, can be realized in this embedded vision system with efficient computation on FPGA and high inference speed. Aiming to detect on-road objects with high accuracy, this paper applies an advanced end-to-end neural network, the single-shot multi-box detector (SSD).


Sammanfattning

This thesis focuses on improving the accuracy of detecting on-road objects, including cars, trucks, pedestrians, and cyclists. To meet the requirements of the embedded vision system and maintain a high detection speed in the ADAS (advanced driving assistance system) domain, the neural network model is designed based on single-channel images as input from a monocular camera. In recent decades, the forward collision avoidance system, a sub-system of ADAS, has been widely adopted in vehicular safety systems for its great contribution to reducing accidents. Deep neural networks, as the state-of-the-art object detection technique, can be realized in this embedded vision system with efficient computation on FPGA and high inference speed. Aiming to detect on-road objects with high accuracy, we apply an advanced end-to-end neural network, the single-shot multi-box detector (SSD).


Acknowledgment

I would like to thank my internship company Bitsim AB and business manager Mr. Sivard at Bitsim, who offered me the chance of doing such an interesting and challenging thesis project. During the past five months, I have obtained a lot of industrial experience and practical knowledge which will help me greatly in my future career. My supervisor Andreas Gustafsson at Bitsim has given me much valuable advice on researching the right path to solve problems, and Hanwei Wu, as my supervisor at KTH, kindly offered his help when I encountered technical doubts. Professor Markus Flierl, as my examiner at KTH, has been supportive all the time and given me more faith in this project. My colleague Andrea Leopardi, who works together with me at Bitsim, has been a great listener and a helper in discussing the crucial problems with me and finding the right answers.


Contents

1 Introduction
  1.1 Background and motivation
  1.2 Overview of the work
2 Literature Review
  2.1 Artificial Neural Networks
  2.2 Convolutional Neural Networks
    2.2.1 Convolution
    2.2.2 Activation
    2.2.3 Pooling
  2.3 2D Object Detection
    2.3.1 Region-proposal methods
    2.3.2 End-to-end learning systems
3 Methodology
  3.1 Dataset preprocessing
    3.1.1 Grayscale image from UYVY format
    3.1.2 Data augmentation
  3.2 Caffe framework
  3.3 Network Architecture
    3.3.1 Modified input layer
    3.3.2 Multi-scale feature maps
    3.3.3 Default box selection
    3.3.4 Hard negative mining
  3.4 Training
    3.4.1 Loss function
    3.4.2 Propagation
    3.4.3 Weight update
    3.4.4 Regularization
  3.5 Testing
    3.5.1 Non-Maximum Suppression
  3.6 Fine-tuning
  3.7 Evaluation metrics
4 Experimental Results and Analysis
  4.1 Datasets
    4.1.1 KITTI 2D object detection dataset
    4.1.2 Udacity Annotated Datasets
  4.2 Pre-trained model testing
    4.2.1 Performance on KITTI dataset
    4.2.2 Performance on Udacity dataset
  4.3 Fine-tuning based on the original design
    4.3.1 Hyperparameter selection
    4.3.2 Performance for both datasets
  4.4 Enhancement on detection accuracy
  4.5 K-means evaluation on default boxes
    4.5.1 Analysis on KITTI dataset
    4.5.2 Analysis on Udacity dataset
    4.5.3 Algorithm validation
  4.6 Color information analysis
  4.7 Inference time
5 Conclusions and Future Work
  5.1 Ethical and social impacts
  5.2 Limitations and future work
    5.2.1 Small object detection
    5.2.2 Long-time training
    5.2.3 Limited generality
    5.2.4 Distance measurement


Chapter 1

Introduction

The advanced driving assistance system (ADAS) has drawn growing attention in the global autonomous driving market for its great value in improving traffic efficiency and safety. The concept is a general term for several high-tech sub-systems which are already widely used, such as anti-lock braking systems, parking sensors, automotive navigation systems, etc. The collision avoidance system (pre-crash system), one promising technique among them, can dramatically reduce road fatalities through accurate detection of forward collisions and in-time reactions. Recent advances in electronics, control systems, processors and communications now allow for the design of collision avoidance systems with increased sophistication, reduced cost and high reliability [30]. A typical collision avoidance system detects and recognizes vehicles, pedestrians, traffic signs and the like around a vehicle using radar, laser radar, GPS, cameras, etc. It provides a warning signal for the driver or takes timely actions by steering and/or braking autonomously when there is an impending collision. Among the above four sensors, the camera has the cheapest material cost but can interpret scenes better than the rest by seeing color. A pair of forward-looking cameras, also called stereo vision, is able to provide 3D depth perception of the environment ahead. However, a fine-resolution camera collects a massive amount of data (millions of pixels in each frame), which requires intensive computation and complex algorithms for processing. Recent advances in hardware computation efficiency have enabled the rapid development of software algorithms, especially for embedded vision systems.


cars in a certain grayscale image taken with daytime light can be localized and recognized with a high accuracy.

Figure 1.1: Vehicle detection in a grayscale image.

1.1 Background and motivation

Collision avoidance systems have greatly decreased injury and death rates in accidents. With the development of both software algorithms and hardware components, cameras are getting more precise at detecting multiple objects in real-time scenarios using DNNs. But real-world industrial cases are more demanding and complex than simple implementations in an experimental, personal-computer environment. Practical issues such as color channel transforms, power consumption, processing speed and accuracy are always under discussion to ensure the adoption of DNNs in industrial applications. As DNNs involve a huge amount of matrix calculations and other operations which can be massively parallelized, graphics processing units (GPUs) are able to complete this task better than central processing units (CPUs), with their many computational units and higher memory bandwidth. For industrial purposes, there are usually two main phases involved in DNN implementations: the training phase and the inference phase. DNN model training with high throughput is carried out on one or multiple GPUs, taking from hours to weeks to complete. However, the inference of DNNs in real-world applications requires high-speed processing and power efficiency, thus more specialized hardware such as FPGA-based embedded vision platforms is applied. The adoption of DNNs on embedded systems is out of our scope. This paper focuses on the accuracy performance of the DNN model in the driving context under the condition of real-time processing.


implemented with grayscale images. To maintain a high speed for real-time detection, the Y pixel format has been used, which may degrade the detection results due to the loss of color information. Therefore, our research explores whether the loss of color information influences the performance of the selected DNN algorithm, reducing the precision and recall of detecting multiple objects in the test set.

1.2 Overview of the work

This chapter introduces the motivation and the thesis work briefly. In chapter 2, a background description is given, including the basic knowledge of DNNs and some advanced state-of-the-art methods in 2D object detection. There are two main branches of DNN models, achieving the highest speed and accuracy performance respectively. Among those DNN models, this paper selects the Single Shot MultiBox Detector (SSD), a one-stage learning system, as the basic research methodology and provides justifications for this choice.

Chapter 3 elaborates the preparation work, the main structure of the SSD model and the evaluation metrics in detail. It also provides reasons for these design choices and analyzes potential problems with the chosen method. Through the steps of the training and testing process, several algorithms and techniques such as weight update and regularization are explained. Notably, the models are fine-tuned from pre-trained models rather than trained from scratch.


Chapter 2

Literature Review

This chapter reviews the history of DNNs, recent advances in DNN approaches for object detection tasks and descriptions of other DNN models comparable with SSD. The base network models for object classification tasks are also introduced, as they are closely related to our object detection task.

2.1 Artificial Neural Networks

As introduced before, an Artificial Neural Network (ANN) is a popular machine learning algorithm consisting of a collection of connected units. Though ANNs have existed for years, attempts at training deep ANN architectures failed until Geoffrey Hinton's breakthrough work of the mid-2000s. In addition to algorithmic achievements, the increase in computing capabilities using GPUs and the collection of larger datasets are all boosting the recent surge of ANN development.


Figure 2.1: Example of a three-layered ANN model.

2.2 Convolutional Neural Networks

A Convolutional Neural Network is a branch of deep, feed-forward neural networks, usually designed for visual imagery problems. Usually, when we talk about deep learning, it refers to deep convolutional neural networks rather than deep reinforcement learning. Krizhevsky et al. [29] reached a milestone in 2012 with AlexNet, an 8-layer deep convolutional neural network (CNN) that achieved 63.8% accuracy in the ImageNet challenge ILSVRC 2012 [42], improving by about 10% over the prior year's best efforts [24]. As figure 2.2 shows below, as the structure of AlexNet goes deeper, the object features it can interpret become more complex. The first convolutional layer is only able to distinguish color and simple textures, while the fifth layer is able to extract complex and key features such as a sunflower shape.

With the success of AlexNet, a growing number of computer vision and machine learning researchers became focused on using successively more intricate CNNs to solve image classification and other problems. Many new model structures have been proposed and verified, updating the accuracy records of several challenging recognition competitions, such as ImageNet [42] and MS COCO [35].

Figure 2.2: Extracted features of the first 5 convolutional layers of AlexNet, which classifies 1000 classes of images (ImageNet).


layer. We will not go into much detail about the fully connected layer since our chosen model is fully convolutional.

2.2.1 Convolution

The convolution function is the core building block of a CNN. The one-dimensional convolution function can be denoted as equation 2.1. It is a discrete operation on a one-dimensional array x and a one-dimensional filter w. With 2D images as our input signals, the convolutional filter (kernel) should also be 2-dimensional, and it needs to slide along the width and height axes of the input, as presented in equation 2.2.

Besides width and height, images have another dimension, "depth", with value 1 for grayscale images and 3 for RGB images. If we assume the size of an input is W_I × H_I × D_I and the size of the convolutional filter is W_f × H_f × D_f, then for each layer D_f must equal D_I. The convolution output of the filter across the input is often called a feature map, which represents the cross-correlation between the pattern within the filter and local features of the input. With its translation-invariant property, a CNN layer can detect the same features in different parts of the image. The depth of this feature map only depends on the number of filters in the previous layer. The W_I and H_I of the feature map are determined by the W_I and H_I of the input respectively, the stride of the filter and the number of padding zeroes.

s(t) = (x \ast w)(t) = \sum_{a=-\infty}^{\infty} x(a) \, w(t-a)    (2.1)

S(i, j) = (I \ast K)(i, j) = \sum_{m} \sum_{n} I(i-m, j-n) \, K(m, n)    (2.2)

Besides, Levine and Shefner (1991) defined a receptive field (RF) as "an area in which stimulation leads to a response of a particular sensory neuron" [33]. For an object detection CNN, the RF can be explained as the region in the input that a CNN feature, after one or more convolutional layers, is looking at. As depicted in figure 2.3, after cascading two 3×3 convolutional filters, each feature in the second layer has a receptive field of size 5×5 in the input space.

The concept of the RF is important for object detection tasks since it provides insight into which region in the input a default box in a certain layer is targeting. In general, the RF l_k of layer k is defined as equation 2.3, where l_{k-1} is the RF of layer k-1, f_k is the filter size (the height and the width are the same) and s_i is the stride of layer i. Equation 2.3 calculates the RF from bottom to top intuitively.

l_k = l_{k-1} + (f_k - 1) \prod_{i=1}^{k-1} s_i    (2.3)
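To make equation 2.3 concrete, the following minimal Python sketch (an illustration, not code from the thesis) accumulates the receptive field of a stack of convolutional layers; the two-layer 3×3 example reproduces the 5×5 receptive field from figure 2.3.

```python
def receptive_field(layers):
    """Compute the receptive field after each layer using equation 2.3.

    `layers` is a list of (filter_size, stride) tuples, ordered from the
    input towards the output: l_k = l_{k-1} + (f_k - 1) * prod(s_1..s_{k-1}).
    """
    rf = 1      # a single input pixel sees itself
    jump = 1    # product of the strides of all previous layers
    fields = []
    for f, s in layers:
        rf = rf + (f - 1) * jump
        jump *= s
        fields.append(rf)
    return fields

# Two stacked 3x3 convolutions with stride 1 -> receptive fields [3, 5]
print(receptive_field([(3, 1), (3, 1)]))
```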


Figure 2.3: Receptive field example.

2.2.2 Activation

The activation function is a critical part of a CNN. It introduces a non-linear property into the CNN, which is necessary for learning complex functional mappings from input data, especially unstructured data such as images, videos and speech. Without the activation component, the neural network would become a linear regression model with limited power in describing complex features.

In modern neural networks, the popular recommendation for the activation function is the Rectified Linear Unit (ReLU) [25], defined as g(z) = max{0, z}. Though it yields a non-linear transformation, ReLU remains linear on each of the two pieces z < 0 and z ≥ 0, and therefore still preserves the capability of generalizing linear models [14].

2.2.3 Pooling

Max-pooling layers of size 2×2 with a stride of 2 have been used in many popular network architectures, such as the VGG16 network depicted in figure 2.4. They basically extract the maximum value in each 2×2 block of the output of the previous convolutional layer. In this way, even if there is some small translation of the input, the pooling outputs stay the same or change only slightly, so the final output is not affected. Pooling can also improve the statistical efficiency of the network model [14].

2.3 2D Object Detection

As CNNs represent the state of the art, we applied a deep, feed-forward CNN model for our 2D multiple object detection task. CNN models are designed to imitate the behavior of a visual cortex. Compared to other object detection methods, a CNN needs little work in image preprocessing, which saves a lot of time and effort on the extraction of hand-crafted features, and it is easier to train since it has much fewer parameters than fully connected networks with the same number of hidden units.


Detection DNNs can be broadly divided into region-proposal methods with two-stage learning and end-to-end learning systems. Table 2.1 presents some of the state-of-the-art DNNs for object detection tasks; the results are evaluated on the VOC2012 test set.

Method                data     mAP   FPS
Fast R-CNN            07++12   68.4  0.5
Faster R-CNN VGG-16   07++12   70.4  7
YOLO                  07++12   57.9  45
YOLOv2 544            07++12   73.4  40
Faster R-CNN ResNet   07++12   73.8  5
SSD300                07++12   72.4  46
SSD512                07++12   74.9  19

Table 2.1: VOC2012 test results for different detection frameworks.

2.3.1 Region-proposal methods

Region-proposal methods usually contain two stages: region proposal generation and object classification. After the achievements in the image classification area, it is not difficult to associate it with detection by applying the classifier to sliding windows that vary in size and location in one image. But such an exhaustive search to find the target object is costly in computation and not efficient. Thus, researchers came up with region proposal generators, such as selective search [48] or edge boxes [51], which suggest promising windows that are more likely to contain an object.

In 2013, NYU published the Overfeat algorithm for using deep learning in object detection, which won the localization task of ILSVRC2013 [43]. They introduced a novel method for object localization and classification by accumulating predicted bounding boxes, integrated with a single CNN. Quickly after Overfeat, regions with CNN features (R-CNN) [13] was published, which brought an almost 50% improvement on the object detection challenge. It combines Selective Search, CNN, and SVMs as a three-stage method. However, R-CNN runs the CNN for each object proposal independently without sharing computation, leading to expensive training in space and time and resulting in low-speed detection as well. Later on, the spatial pyramid pooling network (SPP-Net), introduced by He et al. [19], was proposed to speed up R-CNN by sharing computation. It generates a convolutional feature map on the whole input image and classifies each object proposal by extracting a feature vector from this map, thus avoiding repeated evaluation of the DNN on each proposal.


map with a feed-forward network for classification and bounding box regression, which, when ignoring the time for generating region proposals, improves both accuracy and speed by a large extent. After that, the object classification part can run nearly in real time, but the proposals are relatively time-consuming, which remains the computational bottleneck in object detection tasks. After Simonyan et al. created the successful DNN VGGNet [44], using only 3×3 convolutions with up to 19 layers, Ren et al. introduced Faster R-CNN [41], upgraded from Fast R-CNN and based on a VGG-16 net architecture. The most impressive step they made is applying a region proposal network (RPN) instead of selective search. It achieves a frame rate of 5 fps (including all steps) on a GPU. The RPN makes a large breakthrough in the detection speed for regions of interest.

In summary, there are various ways of multi-scale object localization. One of them is to generate image/feature pyramids of multiple scales, applied in CNN-based methods like Overfeat, SPPNet and Fast R-CNN, which turns out to be effective but costly in computation. Another method is using multi-scale sliding windows on the feature map, which effectively changes the size of the filter. For instance, Faster R-CNN fixes a single scale for the feature map and a single size for the sliding window, and uses a novel method, a pyramid of anchors. The bounding box regression and classification rely only on these anchors of various scales and aspect ratios, thereby dramatically reducing the number of parameters in the output layer.

2.3.2 End-to-end learning systems


padding" with padding 1 and stride 1. In each convolutional layer, there is one convolution operation followed by one ReLU activation function.

SSD reuses the computation from the VGG16 model, thereby saving a lot of time. It discards the fully-connected layers in VGG16, which are mainly used for the object classification output, and appends a set of convolutional layers to the end of the truncated VGG, which enables extracting features at multiple scales and progressively decreases the size of the input to each subsequent feature map. Inspired by Szegedy's work on MultiBox [47], SSD associates default boxes varying in scales and aspect ratios with the extracted feature maps of different resolutions. Moreover, over 80% of the inference time is spent on the image classification base network (VGG16), which implies that an improvement in the base network will also boost the detection speed of SSD.

There are many improved versions of SSD, such as the Deconvolutional Single Shot Detector (DSSD) [10] and RainbowSSD (R-SSD) [26]. DSSD applies deconvolutional layers to the multi-scale feature maps and a ResNet feature extractor instead of VGGNet. It improves the accuracy performance, especially for small objects, but increases the time latency. Other attempts such as R-SSD have optimized the SSD model based on a different dataset, but have not succeeded in upgrading the overall performance by a large extent.

For these end-to-end learning systems, the selection of the base network as a classifier is also important, as it affects the inference time and classification scores directly. In recent years, more advanced base feature extractors have been released and image classification has gained several significant improvements, thereby boosting the speed and accuracy performance of object detection models as well.


Chapter 3

Methodology

3.1 Dataset preprocessing

In the ADAS context, the video stream used in the DSP (Digital Signal Processor) controller in the embedded vision system is in UYVY pixel format, which is totally different from the RGB color format. Although it is feasible to convert UYVY to RGB in experiments, by first unpacking UYVY to planar YUV and then converting to RGB, the total processing time would increase, so it is not the best solution for a real-time system. To maintain the speed of the DNN in this real-time scenario, this paper explores how to improve object detection accuracy on grayscale images (the Y channel).

3.1.1 Grayscale image from UYVY format

Before explaining the details of data preprocessing, this paper introduces the color format UYVY. UYVY is a packed YUV 4:2:2 format, in which the luminance component (Y) is sampled at every pixel, and the chrominance components (U and V) are sampled at every second pixel horizontally on each line [23]. The YUV formats can be divided into two groups: the packed formats, where the Y, U and V components are packed together into macropixels stored in a single array, and the planar formats, where each component is stored in a separate array and the final image is a fusion of the three separate planes. The reason why YUV formats are preferable for displaying digital video signals is that they were developed to provide compatibility between color and black/white analog television systems and imitate human vision. They allow reduced bandwidth for the chrominance components, so that transmission errors can be masked efficiently by human perception. Figure 3.1 shows one example image and figure 3.2 shows its separate visualizations of the RGB and YUV channels.
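As an illustration of how the Y channel can be pulled straight out of a packed UYVY frame without a full RGB conversion, here is a minimal NumPy sketch (a hypothetical helper, not code from the thesis); it assumes a tightly packed frame with no row padding.

```python
import numpy as np

def uyvy_to_gray(raw: bytes, width: int, height: int) -> np.ndarray:
    """Extract the luminance plane from a packed UYVY 4:2:2 frame.

    Each pair of pixels is stored as [U0, Y0, V0, Y1], i.e. 2 bytes per pixel
    with the Y samples at odd byte offsets, so slicing with step 2 yields a
    (height, width) grayscale image.
    """
    frame = np.frombuffer(raw, dtype=np.uint8).reshape(height, width * 2)
    return frame[:, 1::2]
```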

3.1.2 Data augmentation


Figure 3.1: An original image example.

image contains pixels whose values vary from 0 to 255 in digital formats. The value 0 means black, with the weakest intensity, and the value 255 means white, with the strongest intensity. Grayscale images have been widely used in medical imaging, monitoring systems, etc.

When using a DNN to detect objects in RGB images, data augmentation is often applied to the original dataset to make the model more robust and enhance the detection accuracy. Obviously, there is more information, such as hue and saturation, included in RGB images than in grayscale images. It has been proven that in object detection tasks using computer vision methods, the additional use of color information performs better than only using shape information [28]. However, some research claims that color information is not used efficiently in deep learning models [17]; in other words, adding color information to the input of the model has a negligible effect on the results. To get a clearer understanding of the importance of color information, this paper explores it in section 4.6.

Moreover, there are different conditions when people are driving vehicles, such as rain, weak lighting, fog, etc., which lead to significant changes in the brightness and contrast in the camera view. To enhance the performance of SSD, more distortions have been added to the contrast and brightness of the grayscale training set, and random flipping is applied to handle object detection on different sides.

Brightness  Brightness is a relative term, showing how bright the image appears compared to another reference image, based on our visual perception. For grayscale images, it is the mean pixel value intensity that can be used to change the brightness. Higher brightness corresponds to weather conditions with more sunlight reflection. In our experiments, this global attribute has been chosen as a data augmentation method for all experiments. This paper applies a brightness change to the intensities of all pixels in each image, within the range of [−32, 32], with a probability of 0.5.


(a) R channel (b) Y channel

(c) G channel (d) U channel

(e) B channel (f) V channel

Figure 3.2: Separate RGB channel and YUV channel visualizations.

Contrast  For grayscale images, contrast is determined by the difference in brightness between the object and other objects. The human visual system is more sensitive to contrast than to absolute luminance. We scale the contrast of each image within the range of [0.5, 1.5] with a 0.5 probability. The brightness change and contrast change can be represented as equation 3.1. The mean value is the mean pixel value intensity over all of the images in the dataset, not calculated individually.

f(x) = contrast_factor × (x − mean_value) + mean_value + brightness    (3.1)

Flipping  We apply random flipping on the dataset with a probability of 0.5. It can efficiently solve the problem of detecting objects on different sides.
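A minimal sketch of the augmentation described above (the parameter names are illustrative assumptions; the ranges follow the text: brightness shift in [−32, 32], contrast factor in [0.5, 1.5], each applied with probability 0.5, plus random horizontal flipping):

```python
import numpy as np

def augment(img: np.ndarray, mean_value: float, rng=np.random) -> np.ndarray:
    """Apply equation 3.1 plus random horizontal flipping to a grayscale image."""
    out = img.astype(np.float32)
    brightness = rng.uniform(-32, 32) if rng.rand() < 0.5 else 0.0
    contrast = rng.uniform(0.5, 1.5) if rng.rand() < 0.5 else 1.0
    out = contrast * (out - mean_value) + mean_value + brightness
    if rng.rand() < 0.5:
        out = out[:, ::-1]          # flip left-right
    return np.clip(out, 0, 255).astype(np.uint8)
```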

3.2 Caffe framework


contributors now. Caffe supports seamless switching between CPU and CUDA-capable GPU by simply setting a single flag. It offers not only model definitions and optimization settings but also pre-trained weights in the form of ".caffemodel" binaries in the Caffe model zoo. Caffe can be accessed through C++, Python, and MATLAB APIs. We use Caffe with the Python API in our experiments.
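For reference, a typical way to load a trained SSD model and run one forward pass with the Caffe Python API looks roughly like the sketch below. The file names are placeholders, the blob names follow the reference SSD Caffe release, and preprocessing details such as resizing and mean subtraction are omitted.

```python
import caffe
import numpy as np

caffe.set_mode_gpu()
caffe.set_device(0)

# Deploy definition and trained weights (placeholder file names).
net = caffe.Net('deploy.prototxt', 'ssd_grayscale.caffemodel', caffe.TEST)

# Single-channel 300x300 input, already mean-subtracted, shape (1, 1, 300, 300).
image = np.zeros((1, 1, 300, 300), dtype=np.float32)
net.blobs['data'].reshape(*image.shape)
net.blobs['data'].data[...] = image

# Each detection row: [image_id, label, score, xmin, ymin, xmax, ymax]
detections = net.forward()['detection_out']
```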

3.3 Network Architecture

Based on SSD300 and SSD512 with the VGG16 feature extractor [36], we offer a new design for fine-tuning a pre-trained model targeted at grayscale image input. The computational complexity of the whole neural network is not degraded by our structure change.

3.3.1 Modified input layer

SSD usually uses RGB images as input and can reach high accuracy, so the question for us is how to modify the network structure to take only one-channel input images and maintain the performance. A naive method of solving this problem is to duplicate the grayscale channel of the image twice and merge the three channels together to get a "3-channel" grayscale image. This solution leads to more disk space usage for storing the extended channels and more useless computation in the input layer as well.

Figure 3.3: Different convolutional filters in conv1_1.

Another method is to change the input convolutional layer, the conv1_1 layer of the base network. Each 3×3 filter in conv1_1 slides over the input image with height 300 and width 300 (convolution operation) to produce a same-sized output. There is a concept called depth, which is the number of 3×3 filters used in each layer, 64 as shown in figure 3.3. Besides, the number of conv1_1 outputs equals the depth of the input layer. Here we specify that the convolutional filter has a third dimension, also called channel, which should always be the same as the number of channels of the input from the previous layer. If the number of input channels changes from 3 to 1, the filter channel should also change to 1. In this way, we lose the information from the different color channels, so we assume the performance of the model will also degrade by a certain degree due to the loss of color information. Previously, the mean values of an RGB image were commonly set to [104, 117, 123], based on the Pascal VOC dataset [8]. Now we apply 96 as the mean value for our grayscale images and [93, 98, 95] for our color images, which are the mean pixel values of the images in our datasets, explained in detail in section 4.1.
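The thesis fine-tunes from RGB-pretrained weights, but it does not spell out how the pretrained 3-channel conv1_1 filters become 1-channel filters. One common option, shown purely as an assumption-laden sketch, is to sum (or average) the pretrained filter weights over the channel axis so that a grayscale input produces roughly the same initial responses:

```python
import numpy as np

def collapse_rgb_filters(w_rgb: np.ndarray) -> np.ndarray:
    """Collapse pretrained conv1_1 weights of shape (64, 3, 3, 3) to (64, 1, 3, 3).

    Summing over the channel axis keeps the response to a gray input close to
    what the RGB filters would give to that gray value replicated on all three
    channels; averaging is an equally valid convention.  This is an assumed
    initialization strategy, not necessarily the one used in the thesis.
    """
    return w_rgb.sum(axis=1, keepdims=True)

w_rgb = np.random.randn(64, 3, 3, 3).astype(np.float32)   # stand-in for pretrained weights
print(collapse_rgb_filters(w_rgb).shape)                  # (64, 1, 3, 3)
```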

3.3.2 Multi-scale feature maps

There are several methods of bounding box prediction in DNN models. An earlier method is using pyramids of images and feature maps. The feature maps are generated from each image of various scales in the image pyramid, which may take a long time. A more advanced method is using a single feature map with different scales and aspect ratios of bounding boxes in fixed grids, e.g. Overfeat [43] and YOLO [39], which is faster but loses accuracy.

SSD makes a good compromise by adding extra convolutional layers to the end of the truncated base network and efficiently reusing the computation from the base network. The truncated base network is generated by removing all fully connected layers and dropout layers from the original VGG16 model, shown in figure 2.4. The original network extracts features of the targets for image classification purposes. Multiple scales of feature maps from the additional convolutional layers, called the "feature pyramid" in figures 3.4 and 3.5, can be used for bounding box prediction, which maintains the detection speed and improves the precision at the same time. Here, we call the bounding boxes used for matching the objects in images "default boxes".


Figure 3.4: SSD300 architecture.


3.3.3 Default box selection

As mentioned in section 3.3.2, aspect ratios and scales need to be manually assigned to the default boxes in each feature map, which directly determines the total number of default boxes in each image. Using more default boxes directly increases the inference time for each frame/image. It is critical to set the default scales and aspect ratios since they directly affect the matching efficiency during the training process, i.e. the number of matched positive samples. The more positive samples there are in the training set, the more robust a model we can obtain. Below we introduce our choice and the reasons behind it.

Scales

Feature maps generated from different levels of layers have different receptive fields [50], thereby targeting objects of different sizes. We assume the number of feature maps is m and the scales are in [s_min, s_max] ([0.2, 0.9]); then the scale S_k of the default boxes of the k-th feature map is computed by:

S_k = S_min + \frac{S_max - S_min}{m - 1} (k - 1),  k \in [1, m]    (3.2)

Taking SSD300 as an example, the highest-resolution feature map, from the conv4_3 layer, has the minimal scale 0.2, the lowest-resolution feature map, from the conv9_2 layer, has the maximal scale 0.9, and the other scale values are evenly placed in this range. Thus the scales of the feature maps do not correspond to their receptive field sizes. This scale range is generally used for large datasets; however, it can be adjusted according to the distribution of object scales in the dataset.
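A one-line check of equation 3.2 for SSD300 (m = 6 prediction layers, scale range [0.2, 0.9]), shown purely as an illustration of the formula:

```python
s_min, s_max, m = 0.2, 0.9, 6
scales = [round(s_min + (s_max - s_min) / (m - 1) * (k - 1), 2) for k in range(1, m + 1)]
print(scales)   # [0.2, 0.34, 0.48, 0.62, 0.76, 0.9]
```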

Aspect ratios

The aspect ratio is used for describing different object shapes. In the original paper, it is assigned as a_r ∈ {1, 2, 3, 1/2, 1/3} for all feature maps. The width and height of a default box are then s_k·√a_r and s_k/√a_r respectively. When the aspect ratio is 1, another scale √(S_k S_{k+1}) of bounding box is also applied, so the maximal number of default boxes at each cell is 6. To obtain a higher detection speed, the first and the last two feature maps are assigned 4 bounding boxes, with aspect ratios 1, 2, 1/2.

With a larger input size, SSD512 is able to detect more objects of various scales. We list the number of default boxes at each feature map in the original SSD300 and SSD512 models in table 3.1. For example, the first convolutional feature map, of size 38×38, uses 4 default boxes in each cell; therefore, the total number of default boxes in this map is 38×38×4. We calculated the number of boxes for the remaining feature maps and added them together. In the original design, SSD300 uses 8732 default boxes and SSD512 uses 23574.

As seen in figure 3.6, there are 6 default boxes in each cell of the 10×10 feature map, detecting smaller objects by comparison, while there are 4 boxes per cell in the 5×5 feature map in figure 3.7, detecting relatively bigger objects. Each detected box outputs the confidence scores for all classes (c_1, c_2, ..., c_p) and 4 location offsets


# of box positions   38x38  19x19  10x10  5x5  3x3  1x1  total boxes
SSD300               4      6      6      6    4    4    8732

# of box positions   64x64  32x32  10x10  8x8  3x3  1x1  1x1  total boxes
SSD512               4      6      6      6    6    4    4    23574

Table 3.1: The number of default boxes for each classifier layer and the number of total boxes. 4 and 6 are the numbers of different default boxes per cell.
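The totals in table 3.1 can be reproduced by multiplying each feature-map area by its boxes-per-cell count and summing, e.g. for SSD300:

```python
ssd300 = [(38, 4), (19, 6), (10, 6), (5, 6), (3, 4), (1, 4)]   # (map side, boxes per cell)
total = sum(side * side * boxes for side, boxes in ssd300)
print(total)   # 8732 = 38*38*4 + 19*19*6 + 10*10*6 + 5*5*6 + 3*3*4 + 1*1*4
```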

Figure 3.6: 10×10 feature map.  Figure 3.7: 5×5 feature map.

Though SSD300 and SSD512 can reach very high accuracy and speed at the same time, there still exist some bottlenecks. For instance, it has been found that the lower the feature map is, the more complex semantic features it can interpret [37, 18]. In higher feature maps, the default boxes have smaller scale but lack strong semantic features, therefore the classification confidence decreases. This problem can be tackled by finding the optimal scales and aspect ratios. Another novel method, which changes the total structure, is the "Feature Pyramid Network" [34], in which a top-down architecture with lateral connections is developed to obtain strong semantic feature maps at all scales. In our experiments, we optimized SSD using the first method to get a high accuracy. The experimental results are discussed in sections 4.4 and 4.5.3.

A potential issue in selecting the optimal scales and aspect ratios is that they should be re-designed for different datasets. For on-road objects, the aspect ratios have a different distribution than the 20 object classes in the VOC dataset. If the scales and aspect ratios we choose improve the matching between default boxes and ground truth boxes, there are more positive training samples in the training set and the training loss converges to a smaller value.

3.3.4 Hard negative mining


boxes. We call the matched boxes "prior boxes" and all the prior boxes "positive training samples". As there are only a few objects in each image/frame, the number of negative training samples would be disproportionate compared to the positive training samples. By "positive", we mean the default boxes whose IoUs are larger than 0.01. To restrict the total number of training samples, only the top 400 positive samples with the highest IoU are selected.

Then, instead of using all negative predictions, we keep a ratio of negative to positive examples of around 3:1. As there is a background class, which represents the incorrect detections, the negative samples are picked by sampling negative predictions and keeping those with the highest confidence scores, i.e. the hardest negatives. By learning from the positive and negative collections, the model becomes more robust to background interference.
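A rough sketch of this selection (an illustration of the 3:1 rule, not the thesis code): negatives are ranked by their background confidence loss and only the top 3 × (number of positives) are kept for the loss.

```python
import numpy as np

def hard_negative_mining(neg_conf_loss: np.ndarray, num_pos: int, ratio: int = 3) -> np.ndarray:
    """Return indices of the hardest negatives, capped at `ratio` * num_pos.

    `neg_conf_loss` holds the confidence loss of every unmatched (background)
    default box; a larger loss means a harder negative.
    """
    num_neg = min(ratio * num_pos, len(neg_conf_loss))
    return np.argsort(-neg_conf_loss)[:num_neg]

losses = np.array([0.1, 2.3, 0.7, 1.5, 0.05])
print(hard_negative_mining(losses, num_pos=1))   # -> [1 3 2], the three hardest negatives
```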

3.4 Training

During the training phase, the DNN model learns representative features from the training data so that it generalizes to predict the ground truths of new data. First, we define the loss function to measure how far the results are from the ground truths. Then, there are mainly two steps, propagation and weight update, for learning the parameters of the neural network iteratively. Another related problem introduced here is regularization, which prevents the model from overfitting. To train the network model quickly and efficiently, fine-tuning is used in our experiments.

3.4.1 Loss function

In the deep learning scenario, the cost function or loss function is used for describing the difference between the current network output and the expected output. For an object classification task, only the confidence loss of the object category needs to be considered, while for an object detection task both a confidence loss (conf) and a localization loss (loc) must be handled. The loss function is calculated through the forward pass of the DNN. In the original paper, the loss function inspired by [7] has been extended to fit the multi-class task, computed as:

L(x, c, l, g) = \frac{1}{N} \left( L_{conf}(x, c) + \alpha L_{loc}(x, l, g) \right)    (3.3)

where N is the number of detected default boxes. It applies a softmax loss for the confidence loss over all classes and a smooth L1 loss for the localization loss between the detected boxes and the ground truth boxes, as equation 3.4 states. "Detected" specifically means that the confidence score is more than 0.1.

L_{conf}(x, c) = -\sum_{i \in Pos}^{N} x_{ij}^{p} \log(\hat{c}_i^{p}) - \sum_{i \in Neg} \log(\hat{c}_i^{0}),  where  \hat{c}_i^{p} = \frac{\exp(c_i^{p})}{\sum_p \exp(c_i^{p})}

L_{loc}(x, l, g) = \sum_{i \in Pos}^{N} \sum_{m \in bbox} x_{ijk} \, smooth_{L1}(l_i^{m} - \hat{g}_j^{m})    (3.4)
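To make the two loss terms concrete, here is a small NumPy sketch of the softmax confidence loss and smooth L1 localization loss for already-matched boxes. It is a simplified illustration; the real SSD layer additionally handles the matching indicator x and hard negative mining.

```python
import numpy as np

def smooth_l1(diff: np.ndarray) -> np.ndarray:
    """Smooth L1: 0.5*d^2 for |d| < 1, |d| - 0.5 otherwise."""
    absd = np.abs(diff)
    return np.where(absd < 1.0, 0.5 * diff ** 2, absd - 0.5)

def ssd_loss(cls_logits, cls_targets, loc_preds, loc_targets, alpha=1.0):
    """Equation 3.3 over N matched (positive) boxes.

    cls_logits: (N, num_classes) raw scores, cls_targets: (N,) class indices,
    loc_preds/loc_targets: (N, 4) box offsets of the matched boxes.
    """
    n = len(cls_targets)
    z = cls_logits - cls_logits.max(axis=1, keepdims=True)        # numerically stable softmax
    log_softmax = z - np.log(np.exp(z).sum(axis=1, keepdims=True))
    l_conf = -log_softmax[np.arange(n), cls_targets].sum()
    l_loc = smooth_l1(loc_preds - loc_targets).sum()
    return (l_conf + alpha * l_loc) / n
```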


3.4.2 Propagation

The process of passing inputs forward through the neural network is called forward propagation, and its output is the class, the confidence and the bounding box coordinates in this case. With those outputs, the loss function explained in section 3.4.1 can be computed. The question then is how to minimize the loss function.

Backpropagation is a conceptually simple and computationally efficient neural network learning algorithm [32]. It is based on the gradient descent algorithm, in which we need to compute the gradients of the loss with respect to the weight of each hidden unit in the DNN. There are many hidden layers with thousands of nodes, leading to a complex gradient computation that has to go backward from the output to the target node using the chain rule. Equation 3.5 shows the simplest form of the chain rule applied to two functions. It can be used to compute local gradients by multiplying the Jacobians of each node backward through the network until the target node and adding the products from all of the different paths backward from the output loss. If y = f(u) and u = g(w), equation 3.5 can be written as equation 3.6.

(f \circ g)'(w) = f'(g(w)) \cdot g'(w)    (3.5)

\frac{dy}{dw} = \frac{dy}{du} \cdot \frac{du}{dw}    (3.6)

3.4.3 Weight update

After getting the local gradient of each node, we apply the gradient descent algorithm to update the weights. As equation 3.7 states, the weight w_{k+1} at the current state k+1 is obtained by changing the weight w_k from the previous state k by a certain amount. Here one needs to manually choose a learning rate to decide the size of the change in all the weights for each iteration. It is denoted as \eta and is often set to 10^{-2} or 10^{-3}. The change in weights reflects the influence on L of an increase or decrease in w_k.

w_{k+1} \leftarrow w_k - \eta \frac{\partial L}{\partial w_k}    (3.7)

After training with a given learning rate for a certain number of iterations, the loss function may get stuck within a small range, which means the learning rate might be too big for finding the minimum. Hence, the learning rate is decreased by multiplying it by a gamma, which is often set to 0.1. The next training period then uses 0.1 × the initial learning rate as the new learning rate. In this way, the training can get rid of the fluctuation and converge the loss to a smaller value. Usually, three periods with 80000, 40000 and 20000 iterations respectively are used to converge the loss function. This is called the "multistep" training strategy.
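A compact sketch of this update rule and the multistep schedule (plain SGD shown for clarity; the experiments in chapter 4 additionally use 0.9 momentum and 0.0005 weight decay, which are omitted here):

```python
def multistep_lr(iteration, base_lr=1e-4, gamma=0.1, steps=(80000, 120000)):
    """Drop the learning rate by `gamma` each time a step boundary is passed.

    Boundaries at 80000 and 120000 iterations correspond to training periods
    of 80000 and 40000 iterations before the final period.
    """
    lr = base_lr
    for boundary in steps:
        if iteration >= boundary:
            lr *= gamma
    return lr

def sgd_update(weights, grads, lr):
    """Equation 3.7: w_{k+1} = w_k - eta * dL/dw_k, applied element-wise."""
    return [w - lr * g for w, g in zip(weights, grads)]
```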

3.4.4 Regularization


section 3.1.2. Hence, regularization has been added for optimization during the training process. One popular form of regularization is the L2 loss, also known as weight decay. This loss, shown as equation 3.8, is parameterized by the constant \lambda and added to the loss function. The inspiration behind this L2 loss is that weight matrices with lower and uniformly distributed values perform better at exploiting all the input data than sparse weight matrices with a few high, concentrated values [15].

L_2(\omega) = \frac{\lambda}{2} \|\omega\|^2    (3.8)

Another form of regularization is dropout [45], which deactivates units or neurons by setting their output to 0 with a certain probability. By randomly cutting off neurons during the training process, it can keep the most robust features of the training set, but it also takes 2-3 times longer to train than a standard neural network of the same architecture. For each epoch, it trains a different random architecture. It has been considered an effective method for improving the performance of neural nets and preventing the overfitting problem in a wide variety of application domains. However, SSD removes the dropout layers from the VGG16 base network, which means dropout is not used in this method.

3.5 Testing

During the training process, all of the weights and biases of the network are saved to a snapshot periodically. When testing a network, the values are restored and applied to the input images in the testing set. Following the loss described in section 3.4.1, the algorithm first calculates the confidence of each detection, represented by the product of the object confidence score and the classification score. Then the top k (set to 400) predictions with confidence above 0.01 are retained in each image. Among those, it is likely that multiple bounding boxes are assigned to the same object; thus Non-Maximum Suppression (NMS), explained below [21], is applied to these detections within each image and class with a threshold of 0.5. In the end, this filtering algorithm returns the bounding boxes, confidences, and classes for the final detection in each image.

3.5.1 Non-Maximum Suppression


and their confidences are more than 0.01. An example is shown in figure 3.8.

Figure 3.8: Non-Maximum Suppression example. Top: image after detection before NMS processing; Bottom: result image after NMS processing.
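A minimal greedy NMS sketch matching the description above (boxes as [xmin, ymin, xmax, ymax]; the 0.5 IoU threshold follows section 3.5; purely illustrative):

```python
import numpy as np

def nms(boxes: np.ndarray, scores: np.ndarray, iou_thresh: float = 0.5):
    """Greedy NMS: keep the highest-scoring box, drop boxes that overlap it too much."""
    order = np.argsort(-scores)
    keep = []
    while order.size > 0:
        best = order[0]
        keep.append(best)
        rest = order[1:]
        # Intersection of `best` with every remaining box
        x1 = np.maximum(boxes[best, 0], boxes[rest, 0])
        y1 = np.maximum(boxes[best, 1], boxes[rest, 1])
        x2 = np.minimum(boxes[best, 2], boxes[rest, 2])
        y2 = np.minimum(boxes[best, 3], boxes[rest, 3])
        inter = np.clip(x2 - x1, 0, None) * np.clip(y2 - y1, 0, None)
        area_b = (boxes[best, 2] - boxes[best, 0]) * (boxes[best, 3] - boxes[best, 1])
        area_r = (boxes[rest, 2] - boxes[rest, 0]) * (boxes[rest, 3] - boxes[rest, 1])
        iou = inter / (area_b + area_r - inter)
        order = rest[iou <= iou_thresh]
    return keep
```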

3.6 Fine-tuning

There are two main approaches to training a neural network model: training from scratch and transfer learning. By training, we mean executing backpropagation and optimizing the parameters of the neural network model to lower the loss function. As there are limited computational resources and the network model is large, training the model parameters from scratch would take a long time; hence transfer learning has been applied to train the models.

Transfer learning is a machine learning technique in which a model trained on one task can be re-purposed on another related task. As Emilio Olivas stated (2009), "Transfer learning is the improvement of learning in a new task through the transfer of knowledge from a related task that has already been learned." In fact, most prevalent object detection approaches have adopted a transfer learning strategy, such as Overfeat [43] and Fast R-CNN [12], which make use of AlexNet [29] (trained with ImageNet). Moreover, most object detection models partially reuse classification models. Thus, this paper adopted the transfer learning method by taking VGG16 (pretrained with the VOC2007 and VOC2012 datasets) as the pretrained model. The weights in the truncated VGG16 have already been trained to classify object classes with high accuracy on the VOC2007 and VOC2012 datasets.


and create 6 and 7 extra layers for bounding box prediction in SSD300 and SSD512 respectively.

3.7 Evaluation metrics

In the object detection research field, DNN models can be trained to fit different problems, in which different objects are detected and their distribution is not uniform, so a simple precision metric may not be comprehensive enough for evaluation. Besides, detected targets are defined with a certain threshold and associated with a confidence score, which also needs to be involved in the evaluation metric. Therefore, the mAP (mean average precision) metric has been widely used for evaluating object detection models and is accepted for several object detection competitions, such as Pascal VOC [8] and ImageNet [42]. To understand mAP, we first need to introduce the Precision and Recall of a classifier. As equation 3.9 shows, the precision score for category c is the ratio of the number of true detections of object c to the total number of detected c objects, while the recall (equation 3.10) is the ratio of the number of true detections of object c to the number of ground truth boxes of object c in all examples. Both need a threshold to define which object is considered "predicted" or "detected". For example, there would probably be more objects detected with a threshold of 0.2 than with a threshold of 0.8. The threshold we use here is the Intersection over Union (IoU) threshold.

Precision_c = \frac{N(\text{true positives})_c}{N(\text{all detections})_c}    (3.9)

Recall_c = \frac{N(\text{true positives})_c}{N(\text{all ground truths})_c}    (3.10)

As shown in equation 3.11, area(B_p \cap B_gt) refers to the overlap between the ground truth box and the predicted output box, and area(B_p \cup B_gt) is their union. IoU is the ratio of the two, ranging in [0, 1]. For the Pascal VOC challenge, a prediction with IoU equal to or greater than 0.5 is regarded as a positive prediction. Precision and recall vary with the strictness of the classifier's threshold: if we choose a larger IoU threshold, the precision shrinks fast at a smaller recall rate. The maximum value of recall in each graph is the recall over all prediction results. In our experiments, we set 0.5 as the IoU threshold, the same as the threshold in Pascal VOC2007.

IoU = \frac{area(B_p \cap B_{gt})}{area(B_p \cup B_{gt})}    (3.11)

Average Precision is used to describe the precision of detection for one class of object [9]. It summarizes the shape of the precision/recall curve by sampling precision at a set of eleven equally spaced recall levels, Recall_i = [0, 0.1, 0.2, ..., 1.0].

Recall_i = [0, 0.125, 0.250, ..., 1.0], and the AP is calculated following equation 3.12.

AP = \frac{1}{41} \sum_{Recall_i} Precision(Recall_i)    (3.12)
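For illustration, a small sketch of IoU (equation 3.11) and an interpolated AP over equally spaced recall levels in the spirit of equation 3.12; the number of recall points is left as a parameter, since the text mentions both the 11-point VOC sampling and a denser grid.

```python
import numpy as np

def iou(a, b):
    """IoU of two boxes given as [xmin, ymin, xmax, ymax]."""
    ix = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    iy = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = ix * iy
    area = lambda r: (r[2] - r[0]) * (r[3] - r[1])
    return inter / (area(a) + area(b) - inter)

def interpolated_ap(recalls: np.ndarray, precisions: np.ndarray, num_points: int = 11) -> float:
    """Average the best precision achievable at each sampled recall level."""
    ap = 0.0
    for r in np.linspace(0.0, 1.0, num_points):
        mask = recalls >= r
        ap += precisions[mask].max() if mask.any() else 0.0
    return ap / num_points
```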


Chapter 4

Experimental Results and Analysis

In this chapter, we introduce the implementations of SSD model training and testing. Two datasets are used in this work and their data analysis is presented in section 4.1. By analyzing the drawbacks of SSD models for detecting all classes of objects in grayscale images, we have adjusted our model structure to fit the selected dataset, thereby efficiently optimizing the models and improving the accuracy. We also implemented two trials for exploring the effects of color input on SSD model performance.

4.1 Datasets

There are many open-source datasets of images taken on the road, containing labels for multiple objects and targeting autonomous driving problems. Some datasets among them target single object classes such as pedestrians [5, 6] or cars [1, 3]. For our task, we target multiple objects on the road. To meet the requirements of the forward collision avoidance system, we have chosen three datasets, two Udacity labeled datasets and the KITTI vision benchmark suite [11]; the statistics of the object labels are listed in table 4.1.

Dataset    Images   Car     Truck   Pedestrian   Cyclist
KITTI      7481     28742   1094    4487         1627
CrowdAI    9423     62570   3819    5675         -


4.1.1 KITTI 2D object detection dataset

The KITTI 2D object detection dataset was collected by the autonomous driving platform Annieway and published in the KITTI Vision Benchmark Suite, a project of Karlsruhe Institute of Technology and Toyota Technological Institute at Chicago, introduced in their 2012 publication [11]. The images are extracted from videos recorded on the streets of Karlsruhe, with large lighting variations and extensive occlusions. It is unique among these datasets because of the resolution of its images, 1242×375, as seen in figure 4.1. Though it has 7481 images as the training set and 7518 images as the testing set, only the training set is available, not the testing set. Hence, we take the training set as our whole dataset and select the images that contain cars, pedestrians, cyclists, and trucks. The mean pixel value for these images is 96.2. We randomly select 5237 images as the training set and 2244 images as the testing set.

Figure 4.1: Sample images from the KITTI 2D object detection dataset.

The KITTI dataset offers 8 classes of objects: Tram, Misc, Cyclist, Person (sitting), Pedestrian, Truck, Car, Van. For each object, the coordinates in the image are also provided. Car, pedestrian and cyclist objects are clearly the majority, leading to an imbalance problem between the different classes. In our research, we take 4 classes, car, pedestrian, cyclist and truck, as our research targets. Due to the dataset imbalance problem, we assumed that during the training process the model would tend to decrease the loss of the larger classes more than the smaller classes, so the AP of car and pedestrian detection would be greater than the AP of truck and cyclist. However, this assumption was shown to be wrong, as described in section 4.3.2, and it is not the main reason for poor performance.

4.1.2 Udacity Annotated Datasets


Figure 4.2: Object size distributions for two categories ”cars” and ”pedestrians” in KITTI dataset.

dataset includes four classes of object: car, person, and truck. The images are collected from Point Grey research cameras running at a resolution of 1920×1200 at 2 Hz during driving in Mountain View, California and the neighboring cities in daylight conditions. For each object in the images, it provides the class label and the coordinates of the corresponding ground truth bounding box, [xmin, ymin, xmax, ymax].

The Autti dataset has the same context and the same image resolution as the CrowdAI dataset but includes two extra classes, traffic lights and bikers. In our case, we take bikers into consideration but not traffic lights. Since the contexts of the Autti and CrowdAI datasets are similar to each other, we merge these two datasets and call the result the "Udacity dataset" in our paper. At the same time, from table 4.1 we notice that the merged dataset has an imbalance problem, with many cars and few bikers, which may pose a problem for training. Enlarging the small classes, like the person class, may help ease the problem.

We have randomly sampled the dataset into 70% (17107 images) for the training set and 30% (7327 images) for the testing set. The calculated mean value for the training set is 95.8. Figure 4.3 shows several example images from the Udacity dataset, which have already been converted into grayscale images. Since there is no height or width of reference objects in the images, we cannot compute the actual widths or heights of the ground truth boxes in the Udacity dataset.

4.2 Pre-trained model testing


Figure 4.3: Sample images from the Udacity dataset.

respectively for training and testing set preparation.

It is remarkable that for different detection purposes, the data and its labels are totally different. Though general large datasets include hundreds or thousands of object categories, they do not fit well with other industrial applications, such as on-road detection for ADAS. For on-road object detection, it is very important to use a power-efficient system with fast and robust detection in unconstrained environments. To enhance the performance for this requirement, the collection of a large amount of data is necessary but also costly in manpower and resources.

4.2.1 Performance on KITTI dataset

Figure 4.4 shows that neither SSD300 nor SSD512 performs robustly in detecting human objects. By comparison, both can reach a much higher AP for detecting cars. The underlying reason might be that cars have larger scales than persons in the images, so they are detected by the default boxes in the lower feature maps, which have strong semantic interpretation power. It is also remarkable that the SSD512 model reaches around 100% higher accuracy for both person and car detection compared to the SSD300 model.


Figure 4.4: P-R graphs of pre-trained SSD300 and SSD512 models for ”car” and ”person” classes.

within a range of [0, 1.4], compared to the aspect ratios of boxes in other datasets such as Pascal VOC 2012, where the aspect ratios of objects are in the range [0, 9].

Figure 4.5: Recall vs. Average of each class precision graph for pre-trained SSD300 and SSD512 models.

4.2.2 Performance on Udacity dataset


Figure 4.6: P-R graphs of pre-trained SSD300 and SSD512 models for ”car” and ”person” classes.


hence we cannot evaluate the data quality of both datasets.

4.3 Fine-tuning based on the original design

In this section, the experimental results of the SSD300 and SSD512 models on grayscale images are presented, following the original design in the paper [36]. By exploring multiple ways of implementing the models, we gained more insight into this one-stage learning DNN. The purpose of this part of the experiments is to get a brief understanding of how to select hyperparameters and how many epochs are usually required to make the loss function converge.

Since the training process for DNN models involves the compute-intensive task of matrix multiplication and other operations that can take advantage of a GPU's massively parallel architecture, we use one GPU (Nvidia Tesla P100) with 16 GB of memory on an Ubuntu 16.04 system to conduct the experiments. For testing, we also use the same GPU to generate comparable results.

4.3.1 Hyperparameter selection

Most of the hyperparameters for the training and testing phases in our experiments are set following the original paper [36]. The "multistep" learning rate policy is adopted to decrease the learning rate after reaching a certain number of iterations in each step, by multiplying by a gamma of 0.1 or 0.3. The step sizes in the "multistep" policy are chosen according to the number of iterations it takes for the loss function to fluctuate within a small range. Our initial learning rate is set to 0.0001. Besides, in each iteration we take a batch of images for training. We choose the batch size among 16, 32 and 64 depending on the size of the GPU memory. A higher batch size leads to a longer time per iteration as there is more computation. To speed up our learning algorithm, we apply stochastic gradient descent with 0.9 momentum. We choose 0.0005 as the weight decay to specify regularization in the neural network.

4.3.2 Performance for both datasets


Dataset                    Method   mAP(%)   Car    Cyclist   Person   Truck
Udacity (CrowdAI, Autti)   SSD300   44.2     66.9   24.3      26.5     59.2
Udacity (CrowdAI, Autti)   SSD512   59.9     80.5   47.7      38.1     73.4
KITTI                      SSD300   45.1     67.4   26.4      25.0     61.6
KITTI                      SSD512   58.5     77.8   45.7      38.9     71.4

Table 4.2: Model performance trained and tested on grayscale images.

# of box positions    38x38  19x19  10x10  5x5  3x3  1x1  total boxes
SSD300                4      6      6      6    4    4    8732
SSD300 (Enhanced)     6      6      6      6    4    4    11620
SSD300 (More boxes)   8      6      6      6    6    6    14528

# of box positions    64x64  32x32  10x10  8x8  3x3  1x1  1x1  total boxes
SSD512                4      6      6      6    6    4    4    23574
SSD512 (Enhanced)     6      6      6      6    6    4    4    31766

Table 4.3: The default box statistics after modifying SSD300 and SSD512.

4.4 Enhancement on detection accuracy

As the performance of the SSD models is fairly poor, with overall mAP less than 60%, we experimented further on finding a better set of scales and aspect ratios to enhance the performance. Here we only verify this method on the KITTI dataset, for both the SSD300 and SSD512 models.

By analyzing the statistics of the KITTI dataset in figure 4.2, it is obvious that the width and length of pedestrians are both mostly in the range [0, 1], while the width of cars is mostly in the range [1.3, 2.2] and the length in [3.0, 5.0] (unit: meter). To increase the performance in detecting pedestrians and cyclists, one method is to increase the variation of default boxes in the multi-scale feature maps. As the highest-resolution feature map, generated by conv4_3 in both SSD300 and SSD512, detects the smallest objects such as pedestrians, we change its aspect ratio set from a_r ∈ {1, 2, 1/2} to a_r' ∈ {1, 2, 3, 1/2, 1/3}. The total number of default box predictions per image therefore increases compared to the numbers in table 3.3.3. The enlarged versions, named SSD300 (enhanced) and SSD512 (enhanced) respectively, are shown in table 4.3.
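For reference, in the SSD formulation a default box with scale s and aspect ratio a_r has width s·sqrt(a_r) and height s/sqrt(a_r), plus one extra box with aspect ratio 1 at an intermediate scale. The sketch below illustrates how the extended aspect ratio set adds two extra box shapes per position; the scale values used here are only illustrative, not the ones used in training.

from math import sqrt

def default_box_shapes(scale, aspect_ratios, next_scale=None):
    """Return (width, height) pairs of the default boxes at one position,
    following the SSD convention: w = s*sqrt(ar), h = s/sqrt(ar), plus one
    extra box with aspect ratio 1 and scale sqrt(s_k * s_{k+1})."""
    boxes = [(scale * sqrt(ar), scale / sqrt(ar)) for ar in aspect_ratios]
    if next_scale is not None:
        s_extra = sqrt(scale * next_scale)
        boxes.append((s_extra, s_extra))
    return boxes

# Original configuration: 4 boxes per position.
print(default_box_shapes(0.1, [1, 2, 0.5], next_scale=0.2))
# Enhanced configuration: 6 boxes per position, adding a taller (1/3) and a
# wider (3) shape that better cover pedestrians and flat vehicles.
print(default_box_shapes(0.1, [1, 2, 3, 0.5, 1/3], next_scale=0.2))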

To verify the effect of adding extra default boxes in the highest-resolution feature map, we apply the modified SSD300 and SSD512 models to the KITTI dataset. From the results in figure 4.8, we can conclude that this effectively improves the accuracy for each class in the KITTI dataset. For "car" and "truck", the SSD512 (enhanced) model reaches a high accuracy of around 90%, while for "person" and "cyclist" it reaches 58.3% and 72.6% respectively. It is obvious that the model performance for the person class is still not robust. Figure 4.9 shows that the overall mAP has increased by 26.1% for the SSD300 model and 20.2% for the SSD512 model. In other words, by adding more aspect ratios for the default boxes, the performance has been improved to a large extent.

Figure 4.8: P-R graphs of SSD300, SSD512 and their enhanced models on KITTI grayscale dataset for all classes.

Figure 4.9: Recall vs. average of each class precision graph for SSD300, SSD512, and their corresponding enhanced models.

For the SSD300 (More boxes) configuration in table 4.3, the aspect ratio set {1, 2, 3, 1/2, 1/3} was extended to further feature maps. However, the resulting mAP is 71.4%, which is not a dramatic increase compared to the 71.2% mAP of the SSD300 (enhanced) model. Therefore, in section 4.5, we explore the choice of scales and aspect ratios for the KITTI dataset and the Udacity dataset.

4.5 K-means evaluation on default boxes

To design an optimal set of scales and their corresponding aspect ratios, we applied the K-means clustering algorithm. In the end, we choose the centroids of the large clusters of aspect ratios with regard to all scales. The selected aspect ratios can therefore represent the ground truth bounding boxes in the datasets better, although this will not enhance the generality of our models. Here we have chosen 6 scale values, corresponding to the 6 feature maps in the SSD300 model; the experiments for the SSD512 model are left for future work.

4.5.1 Analysis on KITTI dataset

The experiments with SSD300, SSD512, and their enhanced models have shown that prior knowledge of the scales and aspect ratios in the target dataset is important for improving the efficiency of the training process. More default box matches provide more positive samples, thereby enhancing the accuracy. To analyze the scales and aspect ratios of the ground truth bounding boxes, we apply k-means clustering on both datasets. K-means is an unsupervised machine learning algorithm that partitions the data into k clusters; the goal is to find the cluster centroids such that each data point is closest to the centroid of its own cluster. In this way, we can find the most representative aspect ratios for each scale.

The scale range we chose before, [0.2, 0.9], may not fit the dataset well. First, we define the scale of a bounding box as in equation 4.1. After evaluating the KITTI dataset, we found that the scale of the ground truth boxes is in the range [0.01, 0.7]. For SSD300, the scales are evenly spaced in this range and thus defined as [0.065, 0.17, 0.285, 0.395, 0.505, 0.605].

Scale = sqrt( (width_bbox × height_bbox) / (width_img × height_img) )    (4.1)

Then, we apply k-means clustering for each scale to find the mean aspect ratios. The clustering is one-dimensional since we fix the scale factor and only cluster the aspect ratios. We select k by trial and error. Besides, the aspect ratio is not based on the original image resolution of 1242 × 375 but on the transformed 300 × 300 training sample resolution. Figure 4.10 shows that the aspect ratios only range in [0, 1.4]; therefore large aspect ratios such as 2 and 3 actually have little influence on the performance, while the added aspect ratio 1/3 lies within this range and is therefore more relevant.
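The per-scale clustering described above can be sketched as follows; scikit-learn's KMeans is used here only as a stand-in for the actual implementation, and the scale bin edges and k values are illustrative.

import numpy as np
from sklearn.cluster import KMeans

def cluster_aspect_ratios(scales, aspect_ratios, bin_edges, ks):
    """Run 1-D k-means on the aspect ratios of all boxes that fall into each
    scale bin, and report cluster centroids with their share of all boxes.

    scales and aspect_ratios are arrays over all ground-truth boxes; bin_edges
    delimits the scale bins and ks gives the number of clusters per bin
    (chosen by trial and error, as described in the text)."""
    scales = np.asarray(scales)
    aspect_ratios = np.asarray(aspect_ratios)
    results = {}
    for lo, hi, k in zip(bin_edges[:-1], bin_edges[1:], ks):
        ars = aspect_ratios[(scales >= lo) & (scales < hi)]
        if len(ars) < k:
            continue
        km = KMeans(n_clusters=k, n_init=10).fit(ars.reshape(-1, 1))
        share = np.bincount(km.labels_, minlength=k) / len(aspect_ratios)
        results[(lo + hi) / 2] = sorted(
            zip(km.cluster_centers_.ravel(), share), key=lambda c: -c[1])
    return results

# Example call with illustrative bin edges and cluster counts:
# cluster_aspect_ratios(scales, ars, bin_edges=[0.01, 0.13, 0.23, 0.34, 0.45, 0.56, 0.70],
#                       ks=[6, 4, 3, 2, 1, 1])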

Scale   Aspect ratio (percentage of data)
0.065   0.40 (17.61%)   0.31 (13.96%)   0.12 (11.00%)   0.52 (9.21%)   0.68 (8.44%)   0.87 (5.18%)
0.17    0.42 (7.98%)    0.16 (4.96%)    0.63 (4.78%)    0.84 (3.90%)
0.285   0.50 (4.61%)    0.29 (2.42%)    0.78 (1.50%)
0.395   0.65 (2.24%)    0.45 (1.63%)
0.505   0.38 (0.47%)
0.615   0.37 (0.13%)

Table 4.4: K-means result of aspect ratios at different scales in the KITTI dataset, run for all ground truth bounding boxes.

Table 4.4 shows the percentage of bounding boxes in each aspect ratio cluster for the different scales. This suggests that 0.40 could be another aspect ratio worth adding, since it accounts for a large percentage of the data points at both scales 0.065 and 0.17. Other aspect ratios, such as 0.68 and 0.87, could also be experimented with further.

Figure 4.10: K-means result of aspect ratios at different scales in KITTI dataset.

To explore more details for each class in the KITTI dataset, we select two representative classes, "car" and "person". Cars have flatter shapes, so their aspect ratios are in general larger than those of persons. As seen in figure 4.11, the scatter plots of aspect ratio versus scale of the bounding boxes differ in both attributes for "car" and "person". Persons in general occupy a smaller region of the image, so their largest scale is only 0.45, while the maximal scale of cars in this dataset is 0.53. It should be noted that the original aspect ratio of images in the KITTI dataset is higher; as we resize the images from 1224 × 370 to 300 × 300, the aspect ratio is recalculated as in equation 4.2. However, since the width_bbox to width_img ratio and the height_bbox to height_img ratio remain the same after resizing, the scales are unchanged.

Aspect ratio = (width_bbox × width_resize / width_img) / (height_bbox × height_resize / height_img)    (4.2)
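Equations 4.1 and 4.2 can be written as small helper functions; the KITTI-sized example box below is purely illustrative.

from math import sqrt

def box_scale(w_bbox, h_bbox, w_img, h_img):
    # Equation 4.1: relative scale of a ground-truth box.
    return sqrt((w_bbox * h_bbox) / (w_img * h_img))

def box_aspect_ratio_after_resize(w_bbox, h_bbox, w_img, h_img,
                                  w_resize=300, h_resize=300):
    # Equation 4.2: aspect ratio after resizing the image to the
    # training resolution (300 x 300 for SSD300).
    return (w_bbox * w_resize / w_img) / (h_bbox * h_resize / h_img)

# Illustrative example: a 60 x 160 pixel pedestrian box in a 1242 x 375 image.
print(box_scale(60, 160, 1242, 375))                      # ~0.14
print(box_aspect_ratio_after_resize(60, 160, 1242, 375))  # ~0.11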

Scale   Aspect ratio (percentage of data)
0.065   0.40 (19.6%)    0.31 (13.4%)    0.49 (10.91%)   0.73 (8.32%)   0.60 (7.55%)   0.92 (4.56%)
0.17    0.45 (8.69%)    0.65 (5.54%)    0.85 (4.67%)    0.26 (2.22%)
0.285   0.51 (5.40%)    0.33 (2.51%)    0.80 (1.77%)
0.395   0.67 (2.32%)    0.51 (2.32%)
0.505   0.54 (0.17%)

Table 4.5: K-means result of aspect ratios at different scales in the KITTI dataset, run for "car" ground truth bounding boxes.

Scale   Aspect ratio (percentage of data)
0.065   0.11 (28.6%)    0.08 (16.07%)   0.17 (5.27%)    0.13 (17.35%)   0.31 (0.91%)
0.17    0.14 (11.91%)   0.11 (9.74%)    0.17 (5.43%)    0.31 (1.04%)
0.285   0.17 (2.44%)    0.32 (0.81%)
0.395   0.31 (0.38%)

Table 4.6: K-means result of aspect ratios at different scales in the KITTI dataset, run for "person" ground truth bounding boxes.

Figure 4.11: K-means result of aspect ratios at different scales in KITTI dataset for car and person objects.

Table 4.5 and Table 4.6 provide the distributions of aspect ratios in each cluster for the different scales, for the classes "car" and "person" respectively. It is remarkable that both of them have their majority in the lowest scale interval. To adapt to the data points, we assigned a different number of clusters to each scale by trial and error: we defined k as [6, 4, 3, 2, 1] for the first 5 scale values for cars and [5, 4, 2, 1] for the first 4 scale values for persons.

Combined with the overall K-means result, aspect ratios beyond the set {1, 2, 3, 1/2, 1/3} could therefore be considered for the KITTI dataset.

Scale   Aspect ratio (percentage of data)
0.07    0.75 (22.91%)   0.58 (17.83%)   0.92 (16.32%)   0.29 (13.74%)   1.22 (8.99%)   1.77 (4.78%)
0.19    0.73 (4.14%)    0.35 (1.64%)    1.01 (3.54%)    1.66 (1.02%)
0.31    0.77 (1.98%)    0.42 (0.60%)    1.32 (0.46%)
0.43    0.79 (0.10%)    1.97 (0.05%)    0.44 (0.03%)
0.55    0.78 (0.48%)    1.88 (0.02%)
0.67    1.07 (0.008%)

Table 4.7: K-means result of aspect ratios at different scales in the Udacity dataset.

4.5.2 Analysis on Udacity dataset

Different from the KITTI dataset, the images in the Udacity dataset have a smaller aspect ratio of 1.6 (1920/1200), so the aspect ratios of the ground truth bounding boxes after resizing are higher, in the range [0, 6]. This is actually closer to common camera resolutions such as 1920 × 1080. Figure 4.12 presents the redesigned scale values [0.07, 0.19, 0.31, 0.43, 0.55, 0.67] and their corresponding aspect ratio clusters. It is also noted that the pre-defined aspect ratio of 2 is helpful for describing the centroids of clusters in several scale intervals in most general cases.

Figure 4.12: K-means result of aspect ratios at different scales in Udacity dataset.

After several different trials, we determined k as [6, 4, 3, 3, 2, 1] for the 6 scales. We notice that 84.57% of the data points have their scales in the range [0.01, 0.13], with centroid scale 0.07, as presented in table 4.7. By defining the aspect ratios according to these 6 k-means results, the final accuracy is likely to be enhanced. However, for objects of small scale, it is always costly in time to iterate over all grid cells in the highest-resolution feature map. Thus, we can add one representative value, 0.75, to the pre-defined aspect ratio set.


Scale   Aspect ratio (percentage of data)
0.07    0.79 (24.75%)   0.64 (21.29%)   0.96 (16.19%)   1.27 (9.06%)   0.40 (7.74%)   1.80 (4.99%)
0.19    0.75 (4.42%)    1.02 (3.76%)    0.38 (1.39%)    1.63 (1.08%)
0.31    0.79 (2.13%)    0.49 (0.68%)    1.29 (0.50%)
0.43    0.75 (1.07%)    1.03 (0.22%)    0.45 (0.21%)
0.55    0.71 (0.28%)    0.94 (0.24%)
0.67    1.00 (0.01%)

Table 4.8: K-means result of aspect ratios at different scales in the Udacity dataset, run for "car" ground truth bounding boxes.

Scale   Aspect ratio (percentage of data)
0.07    0.22 (24.0%)    0.27 (22.18%)   0.32 (19.26%)   0.17 (16.10%)   0.42 (11.02%)
0.19    0.14 (1.64%)    0.17 (0.92%)    0.43 (0.40%)
0.31    0.22 (0.37%)    0.35 (0.26%)
0.43    0.23 (0.13%)    0.34 (0.11%)
0.55    0.33 (0.06%)

Table 4.9: K-means result of aspect ratios at different scales in the Udacity dataset, run for "person" ground truth bounding boxes.

The majority of bounding boxes for the person class lies in the first scale interval [0.01, 0.13], accounting for 92.56% of the data. By adding the smaller aspect ratio 0.22 ± 0.05, the matching between default boxes and ground truth boxes would be more precise, thereby enhancing the accuracy for the person class.

Another issue is that the bounding boxes for the person class mostly lie in a small scale range, so in the lower-resolution feature maps there are very few true training samples for persons. In a real-world scenario, if a person appears from one side of the car and is very close to the vehicle, it needs to be detected either by the camera or by other hardware components of the vehicle. For the former case, more person objects of larger scale need to be added to the training data. For the latter case, other components such as Ladar (laser radar) or Radar should be applied.

Figure 4.13: K-means result of aspect ratios at different scales in Udacity dataset for car and person objects.


4.5.3 Algorithm validation

In this section we verify that the K-means clustering results can improve the accuracy of the SSD model on the KITTI dataset. From the scale and aspect ratio distribution of the images in the KITTI dataset, we set up the scale range [0.05, 0.75] for the 6 feature maps, and the numbers of default boxes per position are {8, 6, 6, 6, 4, 4}. Based on the SSD300 (enhanced) model, we added {10, 0.1} as new aspect ratios for the default boxes with the smallest scale, as 0.1 better describes the shape of person objects. We name this model the "SSD300 (enhanced + 0.1 AR)" model. As shown in figure 4.14, the AP of the person class obtains the most significant improvement, about 10%, greater than that of the other classes. It also boosts the overall mAP from 71.2% to 76.8%, as shown in figure 4.15.
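For illustration, a per-layer aspect ratio configuration consistent with the {8, 6, 6, 6, 4, 4} boxes per position could look as follows, under the usual SSD convention that each position gets one box per aspect ratio plus one extra box at aspect ratio 1. The exact sets assigned to the intermediate layers here are an assumption, not the thesis configuration.

# Hypothetical per-layer aspect ratio sets for the "SSD300 (enhanced + 0.1 AR)"
# variant; boxes per position = len(aspect_ratios) + 1 (the extra ar = 1 box).
layer_aspect_ratios = [
    [1, 2, 3, 1/2, 1/3, 10, 0.1],  # conv4_3: smallest scale, person-shaped 0.1 added
    [1, 2, 3, 1/2, 1/3],
    [1, 2, 3, 1/2, 1/3],
    [1, 2, 3, 1/2, 1/3],
    [1, 2, 1/2],
    [1, 2, 1/2],
]
boxes_per_position = [len(ars) + 1 for ars in layer_aspect_ratios]
print(boxes_per_position)  # [8, 6, 6, 6, 4, 4]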

Figure 4.14: P-R graphs of SSD300, SSD300 (enhanced), and SSD300 (enhanced + 0.1 AR) model for all classes.


Figure 4.15: Recall vs. average of each class precision graph for SSD300, SSD300 (enhanced), and SSD300 (enhanced + 0.1 AR) model.

The same analysis could be applied to the Udacity dataset, which has a different image resolution from the KITTI dataset. We have not trained a model on the Udacity dataset due to limited time and resources. However, the experiments in this section have verified that a better choice of aspect ratios and scales improves the performance of the SSD model. Training on a larger dataset such as Udacity is left for future work.

4.6 Color information analysis

To verify the impact of color information on our detection task, we implemented two trials on the KITTI dataset with the SSD300 model. We trained two models, for grayscale input and RGB input respectively, named the grayscale model and the color model, under the condition that the training and testing sets are the same images with different color channels and the data augmentation strategy is the same, with contrast and brightness distortion only. The APs for each class are evaluated and presented in figures 4.16 and 4.17.

As color distortion in hue and saturation can still be added in the data augmentation step, we trained another model on the same training and testing sets, with data augmentation in contrast, brightness, hue, saturation, and flipping. The hue distortion within [-18, 18] and the saturation distortion in the range [0.5, 1.5] are each applied to the RGB images with a probability of 0.5. The result is shown in figure 4.18: the mAP has increased from 70.8% to 73.7%, which is also slightly greater than the mAP of the grayscale model.
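A minimal sketch of this photometric augmentation is given below, using OpenCV. The brightness and contrast ranges are assumptions (the text only specifies the hue and saturation ranges and the probability of 0.5), and the order of the operations may differ from the actual training pipeline.

import random
import cv2
import numpy as np

def photometric_distort(img_bgr, p=0.5):
    """Randomly distort brightness, contrast, hue and saturation of a BGR image,
    each with probability p; hue in [-18, 18] and saturation in [0.5, 1.5]
    follow the text, the other ranges are assumed."""
    img = img_bgr.astype(np.float32)
    if random.random() < p:                       # brightness shift (assumed range)
        img += random.uniform(-32, 32)
    if random.random() < p:                       # contrast scaling (assumed range)
        img *= random.uniform(0.5, 1.5)
    img = np.clip(img, 0, 255).astype(np.uint8)

    hsv = cv2.cvtColor(img, cv2.COLOR_BGR2HSV).astype(np.float32)
    if random.random() < p:                       # hue distortion in [-18, 18]
        hsv[..., 0] = (hsv[..., 0] + random.uniform(-18, 18)) % 180
    if random.random() < p:                       # saturation scaling in [0.5, 1.5]
        hsv[..., 1] = np.clip(hsv[..., 1] * random.uniform(0.5, 1.5), 0, 255)
    img = cv2.cvtColor(hsv.astype(np.uint8), cv2.COLOR_HSV2BGR)

    if random.random() < p:                       # horizontal flip
        img = img[:, ::-1]
    return img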


Figure 4.16: P-R graphs of grayscale model and color model for all classes.
