
DEGREE PROJECT IN COMPUTER SCIENCE AND ENGINEERING, SECOND CYCLE, 30 CREDITS

STOCKHOLM, SWEDEN 2018

Surround Vision Object Detection Using Deep Learning

YUAN GAO

KTH ROYAL INSTITUTE OF TECHNOLOGY

SCHOOL OF ELECTRICAL ENGINEERING AND COMPUTER SCIENCE


Surround Vision Object Detection Using Deep Learning

YUAN GAO

Master in Computer Science
Date: July 4, 2018
Supervisor: John Folkesson
Examiner: Mårten Björkman
Principal: Zenuity AB

School of Electrical Engineering and Computer Science


Abstract

The thesis first develops an object detection framework for front view camera images in a surround vision data set. With the goal of reducing the amount of annotated data as much as possible, various domain adaptation methods are then applied to train on the other camera images based on the pretraining of a baseline model. Relevant data analysis is performed to reveal useful information about the object distribution over all cameras. Regularization techniques involving dropout, weight decay and data augmentation are attempted to lower the complexity of the training model. Experiments with ratio reduction are also carried out to find the relationship between model performance and the amount of training data. It is shown that 30% of the training data for the left rear and left front view cameras can be removed without hurting the model performance significantly. In addition, the thesis plots the errors with respect to vehicle locations as heatmaps, which is useful for further study. Overall, the results of these extensive experiments indicate that the model trained by domain adaptation is effective, as expected.


Sammanfattning

Avhandlingen börjar med att utveckla ett ramverk för objektdetektering i bilder från den framåtriktade kameran i surroundvision-data. Med målet att minska mängden annoterad data så mycket som möjligt tillämpas olika metoder för domänanpassning för att träna på andra kamerabilder baserat på en basmodell. Relevant dataanalys utförs som avslöjar användbar information i objektdistributioner för alla kamerorna. Regulariseringstekniker som innefattar Dropout, viktsönderfall och data-augmentering testas för att reducera träningsmodellens komplexitet. Experiment med kvotreduktion utförs också för att hitta förhållandet mellan modellens prestanda och mängden träningsdata. Det påvisas att 30% av träningsdata för den vänstra bakåtriktade och den vänstra framåtriktade kameran kan reduceras utan att modellens prestanda minskar väsentligt. Dessutom visas i avhandlingen felen angående fordonens placeringar genom värmekartor, vilket är användbart för vidare studier. Sammantaget indikerar resultaten i dessa omfattande experiment att modellen tränad med domänanpassning är, som förväntat, effektiv.


Contents

1 Introduction
  1.1 Background
  1.2 Thesis objective
  1.3 Delimitations
  1.4 Related work
  1.5 Ethical aspects and social impact
  1.6 Thesis outline

2 Relevant theory
  2.1 Deep feedforward networks
    2.1.1 Backpropagation
    2.1.2 Activation functions
    2.1.3 Convolutional neural networks
  2.2 Object detection
    2.2.1 Classification, localization and detection
    2.2.2 CNN based object detection

3 Methods
  3.1 Surround vision dataset
    3.1.1 Description
    3.1.2 Data analysis
  3.2 Model architecture
    3.2.1 Encoder
    3.2.2 Decoder
  3.3 Technical details
    3.3.1 Pre-processing
    3.3.2 Input and output
    3.3.3 Multi-task learning
    3.3.4 Loss function
  3.4 Inference
    3.4.1 Detection
    3.4.2 Evaluation
  3.5 Transfer learning
  3.6 Domain adaptation
    3.6.1 Direct training
    3.6.2 Re-initialization
    3.6.3 Freeze encoder
    3.6.4 Different learning rates
    3.6.5 Freezeout
    3.6.6 Horizontal flipping
    3.6.7 Combination
  3.7 Regularization
    3.7.1 Weight decay
    3.7.2 Dropout
    3.7.3 Data augmentation

4 Experiments & results
  4.1 Benchmark
  4.2 Inference
  4.3 Domain adaptation
  4.4 Regularization
    4.4.1 Dropout & weight decay
    4.4.2 Data augmentation
  4.5 Ratio reduction
  4.6 Error analysis

5 Discussion
  5.1 Conclusion
  5.2 Future work

Bibliography


Chapter 1

Introduction

This chapter will first introduce the background of the thesis, and then clarify the thesis objective and the delimitations. Previous work and developments regarding object detection and autonomous driving vehicles will also be presented. In the end, the social impact and ethical issues in autonomous driving will be discussed.

1.1 Background

Recent developments in hardware and software are pushing the frontiers of autonomous driving. A fully self-driving car will avoid the accidents caused by human error and thus save lives, which contributes to a revolution in transportation and mobility. In order to perform better than human drivers, the vehicles will likely need to perceive and understand the surrounding information on the road.

This can be achieved by adding more sensors to the vehicles, such as radar, ultrasonic sensors and surround vision cameras. In the setting of the current research at the company, a basic model was developed to detect objects including vehicles, pedestrians and two-wheelers with public datasets (e.g. KITTI [10]). However, the analysis of single-view camera images is not sufficient for the perception system to give accurate and instant responses and ensure people's safety to the maximum extent.

It is therefore necessary to build an integrated model such that the various objects moving all around an ego-vehicle can be recognized. Since the algorithm requires a large amount of annotated data, which is both expensive and time-consuming to produce, one approach to speeding up the development cycle for a multi-camera system is to first develop a model for front view data and then generalize the model using domain adaptation to detect objects in the other camera views.

1.2 Thesis objective

The purpose of the thesis is to develop a general object detection framework to detect vehicles in the in-house surround vision dataset. It mainly involves three tasks.

• An object detection model for the front view camera is developed so that it can precisely detect most of the specified target objects in that view.

• Since the in-house surround vision dataset was recently collected and is not as refined as public datasets such as KITTI, a detailed analysis of the data distribution is performed in the thesis. In fact, many statistics and data properties need to be known in order to make decisions for the relevant experiments.

• A variety of domain adaptation techniques are applied and compared across the different cameras in terms of detection performance and resource utilization, with the purpose of reducing the cost of data annotation. The aim is to validate the assumption that the front view images contain sufficient image-feature information for the other camera views.

1.3 Delimitations

The thesis is limited to detecting target vehicles on the road in the in-house surround vision camera images, which are treated as independent between time frames. The algorithms selected from the relevant literature are in line with the decisions made at the company.

The collected images and corresponding annotations were continuously improved during the thesis; therefore, some experimental results could vary between stages with regard to data amount and quality.

The evaluation criteria are specific to the situations given within the company and thus might not conform to general object detection benchmarks.


1.4 Related work

Feature extraction transforms input images into abstract features that are useful in object detection tasks. The goal is to remove noise from the input and increase the accuracy of models by compactly extracting salient features from the input data [16]. A commonly used feature extractor is the residual network (ResNet) [15]. It provides a deep representation and effectively addresses the degradation problem in deep neural networks by applying the structure of residual mapping.

Building on the strength of ResNet, the inception module [33] combined with ResNet contributes to the effectiveness of ResNeXt [34], which is intended to solve the problem of diminishing feature reuse in very deep ResNets that go deeper or wider when increasing model capacity. Another variation of ResNet is the Dilated ResNet [35]. Its dilation property increases the resolution of the output feature maps without reducing the receptive field of each individual neuron. Thus the spatial information in the image features is preserved, which is useful for the detection of multi-scale objects.

As region-based convolutional neural networks (R-CNN) [12] have recently come to prominence, Girshick proposed Fast R-CNN [11] with the insight of Region of Interest (ROI) pooling and model combination. Instead of extracting features from each selected region with a convolutional network as in R-CNN, Fast R-CNN shares the computation across all proposals by running the network only once. In order to approach real-time detection, Faster R-CNN [29] was presented to cut down the computational expense by replacing selective search with region proposal networks. The fully convolutional network [30] in Faster R-CNN outputs a bounding box per anchor (reference box) and a score representing how likely the image in the anchor is estimated to be an object.

The YOLO (You Only Look Once) paper [28] takes an entirely different approach, applying a single neural network to the full image, which is extremely fast. YOLO relies on the global context of an image to make predictions, unlike systems such as R-CNN that require many inference passes per image. R-FCN [6] introduced the concept of position-sensitive score maps, aiming for a compromise between location invariance for classification and location variance for detection of objects. Its fully convolutional network learns feature representations and makes predictions based on local spatial input, which means that its region-based detector enables the computation to be shared throughout the entire network.

Object distortion is an issue that commonly occurs in wide-view and fisheye images. In [6], Dai et al. designed a deformable convolution to effectively model geometric transformations by learning the shapes of the convolutional filters conditioned on the input feature maps, which shows impressive performance especially for pixel-wise prediction tasks.

In general, vision-based vehicle detection relies on monocular cameras facing the front view of the ego-vehicle to detect preceding and oncoming objects [31]. Cameras mounted on the side-view mirrors to display the rear view of the vehicle are able to capture its blind spots, as presented in [32], which achieved lateral driving assistance by combining optical-flow estimation and road plane segmentation. Furthermore, a common way to provide a 360-degree panoramic view of the surroundings of the ego-vehicle is to mount an omnidirectional camera on top of the vehicle, yielding a full surround analysis of the road scene, as implemented in [9]. Instead of analyzing the surround vision information of the vehicle through panoramas, the thesis uses eight separate cameras mounted around an ego-vehicle to perceive the road scene from different views.

1.5 Ethical aspects and social impact

Self-driving cars help eliminate traffic accidents that are mainly caused by human error and lead to less traffic congestion. Also, self-driving taxis can greatly increase efficiency in terms of better traffic flow and less fuel consumption [5]. However, they come with a highly controversial social dilemma, the so-called trolley problem: certain crashes require these cars to make difficult ethical decisions, which poses a formidable challenge. Thus, the algorithms that control self-driving cars have to embed moral principles guiding their decisions in cases of inevitable harm [3]. One possible solution to the problem is to follow the typical ethical theory of utilitarianism, which minimizes casualties and maximizes utility.

Moreover, to achieve the goals of Smart Cities and the Internet of Things (IoT), self-driving cars are required to provide their internal data in order to interact with surroundings such as road signs or other vehicles, which might violate data and privacy protection, especially when the information is accessed without the consent of the people involved [17]. The public worries about how the obtained data is used and where it will be distributed. Therefore, relevant legislation such as the General Data Protection Regulation (GDPR) (EU) 2016/679 should be applied to regulate data use and protect privacy. As for the thesis, the images collected by the ego-vehicle contain many other vehicles and pedestrians on the road. Thus, to meet the requirement of protecting personal data, the image visualizations in the thesis occlude sensitive information such as vehicle number plates.

1.6 Thesis outline

Chapter 2 will introduce relevant theory regarding deep learning and object detection. Chapter 3 will focus on the methods specific to the thesis, involving the description of the surround vision data, the model architecture, domain adaptation and some regularization approaches. Chapter 4 will introduce each experiment in detail and present the results in tables and figures with related analysis. Chapter 5 will conclude the results of the thesis and list possible directions for future work.


Chapter 2

Relevant theory

This chapter will mainly give the theoretical background of the thesis, including deep neural networks, activation functions, backpropagation and convolutional neural networks. Finally, general knowledge regarding object detection will be detailed and compared.

2.1 Deep feedforward networks

A feedforward network lets information flow from the input of the network to the output without any feedback: the result from one pass is not fed backwards into the next. The goal is to approximate a target function f̂ in the case of regression or classification. For example, an input x is mapped by the function f̂ to an output corresponding to a category t. The feedforward network defines such a mapping t = f(x, θ) that finally yields the best function approximation through the training of the network parameters [14].

Deep learning aims to learn a feature hierarchy in which features at higher levels of the hierarchy are formed by stacking features from lower levels [13]. A deep feedforward network appears when a chain of network layers is composed together, where each set of connections between two adjacent layers approximates a function. The layers have specific names depending on where they are positioned. For example, Figure 2.1 below presents a simple feedforward network with three fully connected layers.


Figure 2.1: feedforward neural network, reprinted from

The input is formed by three neurons in the first layer, which is called the input layer. It is then fed into the next layer via a set of network parameters or network weights, and any such layer other than the final output layer is called a hidden layer. In fact, the connection between two layers is implemented by a weighted sum of the neuron outputs followed by an appropriate activation function. Moreover, neighboring layers are fully connected because every neuron in one layer is associated with every neuron in the other layer.

2.1.1 Backpropagation

The backpropagation algorithm introduced by [7] has prevailed since 1986. It is used to minimize the difference (cost) between the actual output and the estimated output by repeatedly adjusting the weights in the connections of the network layers, which eventually enables the hidden layers to represent the internal features of the input data domain.

The cost function or error function for a single data example x_i can be expressed by the Euclidean distance between the estimated output vector y_i and the actual output vector t_i as,

J = \frac{1}{2} \lVert y_i - t_i \rVert^2    (2.1)

Therefore the total error over N data examples is,

J = \frac{1}{2N} \sum_{i=1}^{N} \lVert y_i - t_i \rVert^2    (2.2)


Backpropagation applies the errors to calculate the gradient of the loss function, which is the partial derivative \partial J / \partial w of the cost function J with respect to the network weights w. The expression tells how fast the cost changes as the network weights are adjusted, giving a deep insight into how a weight change affects the behaviour of the overall network [26]. Behind the gradient calculation there is an optimization step with two phases: forward propagation and weight update. An input vector is propagated through the network layer by layer until the final output layer, and a cost function then determines the error values, which are propagated backwards through the network. The gradients calculated from these error terms for each neuron are used to update the network weights for cost minimization [2]. The network weights updated after backpropagation can be expressed by the following equation, where η is the learning rate parameter and w' is the updated weight.

w' = w - \eta \frac{\partial J}{\partial w}    (2.3)
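As a minimal illustration of equations (2.2) and (2.3), the NumPy sketch below performs one gradient-descent update of a single weight matrix for a linear model. The data, learning rate and single-layer model are hypothetical and only serve to make the update rule concrete.

```python
import numpy as np

# Hypothetical data: N examples, 3 input features, 2 output targets.
rng = np.random.default_rng(0)
x = rng.normal(size=(100, 3))      # inputs
t = rng.normal(size=(100, 2))      # actual (target) outputs
w = rng.normal(size=(3, 2))        # network weights of a single linear layer
eta = 0.1                          # learning rate

y = x @ w                                       # forward pass: estimated outputs
cost = 0.5 / len(x) * np.sum((y - t) ** 2)      # total error, equation (2.2)
grad = x.T @ (y - t) / len(x)                   # dJ/dw for the linear model
w_new = w - eta * grad                          # weight update, equation (2.3)
```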

2.1.2 Activation functions

An activation function converts the input of a neuron unit to its output through a non-linear mapping, increasing the complexity of the model and resulting in a more accurate approximation of target functions. In principle, there are three widely used non-linear activation functions for neural networks: the Sigmoid function, the hyperbolic tangent function (Tanh) and the rectified linear unit function (ReLU). Figure 2.2 shows plots of the Sigmoid, Tanh and ReLU activation functions.

Figure 2.2: activation functions, reprinted from [25]

• Sigmoid function


The Sigmoid function yields values in the range (0, 1), which bounds the outputs instead of the infinite range of linear functions. It has a smooth gradient and is regarded as a common activation function for training classifiers because of its S-shaped curve. Another advantage of Sigmoid is that when the input x lies in the central region of the input space, small changes in x produce significant changes in the output activations.

The Sigmoid activation function and its derivative are,

f(x) = \sigma(x) = \frac{1}{1 + e^{-x}}    (2.4)

f'(x) = f(x)\,(1 - f(x))    (2.5)

However, the property of such an S-shaped function gives rise to the vanishing gradient problem that often appears in gradient-based learning methods such as backpropagation. Sigmoid maps real numbers into the relatively small range (0, 1), so the change in the network output is barely noticeable for inputs that are not in the central region of the input space. In other words, if a change in the value of a parameter leads to a very small change in the network output, the network will not be able to learn the parameter effectively [23].

• Tanh function

The Tanh function is also S-shaped and looks similar to the Sigmoid function. However, the optimization of a model using Tanh as activation function is much easier because its output range (−1, 1) is zero-centered [1]. Also, negative inputs are mapped to negative outputs by Tanh, unlike Sigmoid, whose outputs are all above zero. Another difference from Sigmoid is that the gradient of Tanh is relatively larger due to a steeper curve in the center of the input space.

The Tanh function and its derivative can be expressed as,

f(x) = \tanh(x) = \frac{e^{x} - e^{-x}}{e^{x} + e^{-x}}    (2.6)

f'(x) = 1 - f(x)^2    (2.7)

• ReLU function


Most deep learning applications in recent years replace other activation functions with ReLU because of a few remarkable advantages. For example, the ReLU activation function addresses the vanishing gradient issue that exists in the Tanh activation function. It also makes the network lightweight, not only because of its mathematical simplicity relative to other computationally expensive functions, but also because of its sparsity: only part of the neurons are 'fired' during an activation. However, the use of the ReLU function is limited to the hidden layers of a model and needs the collaboration of a Softmax function to construct a classifier. There is also a potential dying ReLU problem, where the zero gradients of some neurons might cause these neurons never to be activated again. An alternative that avoids this problem is the variation called Leaky ReLU, which defines y = 0.01x for inputs below 0.

The standard ReLU function and its derivative are given by,

f(x) = \begin{cases} 0, & x < 0 \\ x, & x \geq 0 \end{cases}    (2.8)

f'(x) = \begin{cases} 0, & x < 0 \\ 1, & x > 0 \end{cases}    (2.9)
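The three activation functions and their derivatives from equations (2.4)–(2.9), together with the Leaky ReLU variant, can be written down directly. The following NumPy sketch (element-wise over arrays) is given for reference only.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))           # equation (2.4)

def sigmoid_grad(x):
    s = sigmoid(x)
    return s * (1.0 - s)                       # equation (2.5)

def tanh(x):
    return np.tanh(x)                          # equation (2.6)

def tanh_grad(x):
    return 1.0 - np.tanh(x) ** 2               # equation (2.7)

def relu(x):
    return np.maximum(0.0, x)                  # equation (2.8)

def relu_grad(x):
    return (x > 0).astype(float)               # equation (2.9)

def leaky_relu(x, slope=0.01):
    return np.where(x > 0, x, slope * x)       # Leaky ReLU variant
```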

2.1.3 Convolutional neural networks

Convolutional neural networks (ConvNets or CNNs) are a category of deep feedforward neural networks that have proved very effective in computer vision tasks such as image classification and semantic segmentation [19]. The inputs of a CNN are usually RGB images in the form of a pixel matrix with three channels. The design of a CNN allows useful properties to be encoded into the architecture, which is basically a composition of a sequence of convolution layers and activation functions.

Convolutional layer

Convolutional layers are the building blocks of a CNN. To generate the first convolutional layer, an image of size w×h×3 is convolved with a kernel (filter) of size r×r×3 that has the same depth as the image. By sliding a small window right and down across the image, general features such as a vertical line or a curve on objects inside the image are collected. A convolution layer is generally followed by an activation function to introduce non-linearity, which decides which parts of the convolution output can be activated. The output of this non-linear mapping is called a feature map or activation map. When multiple kernels act on an input simultaneously, as many feature maps as there are kernels are gradually created and combined to capture high-level features composed of low-level features; this process of evolving complex features is clearly illustrated in [36].

Pooling layer

It is common to add a pooling layer between certain successive convolution layers. Pooling operates over each response or activation map independently and progressively reduces the spatial size of the learned representation, which reduces the amount of computation in the network and mitigates overfitting [21]. A widely used pooling method is max pooling with a filter of size 2x2 and a stride of 2, which takes the maximal value over each 2x2 receptive field; therefore 75% of the activations are dropped. The process of max pooling is illustrated in Figure 2.3. It is worth noting that the depth dimension of the feature map is not changed by the pooling operation.

Figure 2.3: Max pooling with one response map, reprinted from [21]
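A max-pooling layer with a 2x2 filter and stride 2, as described above, can be reproduced with PyTorch in a few lines. The input tensor below is arbitrary and only illustrates that the spatial size is halved while the depth stays unchanged.

```python
import torch
import torch.nn as nn

pool = nn.MaxPool2d(kernel_size=2, stride=2)

x = torch.randn(1, 64, 32, 32)   # (batch, channels, height, width)
y = pool(x)
print(y.shape)                   # torch.Size([1, 64, 16, 16]): depth kept, 75% of activations dropped
```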


2.2 Object detection

2.2.1 Classification, localization and detection

In general, there are three main tasks involved in computer vision applications regarding the recognition of target objects in images: classification, localization and detection.

The classification task is the simplest. For example, given an image containing only one target object marked with either label1 or label2, the task aims to assign the image to the correct label. Based on the classification result, the localization task tries to find the exact position of the predicted object. As introduced in [24], there has to be a good match between the predicted position and the ground truth by a specified rate, for example 50% in the PASCAL [8] criterion of intersection over union (IoU), as well as a label representing the correct class. The IoU score is commonly used to evaluate the accuracy of an object detection model. It requires the area of overlap and the area of union between the ground truth bounding boxes and the bounding boxes predicted by the model. The calculation of IoU can be formulated as follows,

IoU = \frac{\text{Area of Overlap}}{\text{Area of Union}}    (2.10)

Thus the label prediction and the position prediction of an object are closely associated. Different from the above tasks, the detection task focuses on the classification and localization of multiple target objects in a single image, such that the task should predict a class as well as a precise position for each object simultaneously. The number of objects varies considerably per image and may even be zero.
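A straightforward implementation of equation (2.10) for two axis-aligned boxes in (x_min, y_min, x_max, y_max) form is sketched below; it is a plain reference implementation, not the evaluation code used in the thesis.

```python
def iou(box_a, box_b):
    """Intersection over union of two boxes given as (xmin, ymin, xmax, ymax)."""
    # Coordinates of the intersection rectangle.
    x1 = max(box_a[0], box_b[0])
    y1 = max(box_a[1], box_b[1])
    x2 = min(box_a[2], box_b[2])
    y2 = min(box_a[3], box_b[3])

    inter = max(0.0, x2 - x1) * max(0.0, y2 - y1)          # area of overlap
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    union = area_a + area_b - inter                        # area of union

    return inter / union if union > 0 else 0.0

print(iou((0, 0, 10, 10), (5, 5, 15, 15)))   # 25 / 175 ≈ 0.143
```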

2.2.2 CNN based object detection

A bounding box refines the position of an object found by the localization task within an image and also determines the size of the detected object. A bounding box is usually represented by four numbers: (x_0, y_0, width, height) or (x_0, y_0, x_max, y_max), where x_0 and y_0 correspond to the coordinate of the top left pixel of the bounding box and x_max and y_max correspond to the coordinate of its bottom right pixel. Whichever representation is chosen, the location and size of a detected object can be obtained from these four numbers. The localization is in fact implemented by bounding box regression, which returns a real value for each number in the bounding box coordinates instead of a class label.

There are many approaches to performing a CNN based object detection task, but an intuitive and efficient way is to attach a classification head and a regression head onto the final feature map produced by a CNN acting as feature extractor. The classification and regression heads are also composed of a couple of convolutional layers to generate better predictions of object classes and locations. More technical details specific to the thesis are explained in Chapter 3. A general framework for the structure introduced here is shown in Figure 2.4.

Figure 2.4: An object detection framework, reprinted from [20]


Chapter 3

Methods

This chapter will introduce and explain the methods used in the thesis in detail. First, it will give a description and analysis of the dataset. The model architecture and the technical details involved will then be presented in text and figures. The main methods in domain adaptation will be listed, which closely relate to the experiments in the next chapter. In the end, the regularization techniques including dropout, weight decay and data augmentation will be described.

3.1 Surround vision dataset

3.1.1 Description

The data set for the thesis is the high-resolution surround vision data, including approximately 25000 images along with annotations, captured by cameras mounted around an ego car. There are eight camera views covering front (F), rear (R), left front (LF), left side (LS), left rear (LR), right front (RF), right side (RS) and right rear (RR), as illustrated in Figure 3.1. Every image has a 1920 × 1208 resolution and is named by a timestamp plus the camera position it belongs to. Every corresponding annotation file primarily contains information about the marked target objects, such as position and class type. Although specific vehicle types are labeled and explained in the annotations, the thesis regards all vehicles (bus, van, truck, etc.) as one class. Different from the common object annotation rules for public data sets (e.g. KITTI), the surround vision data set defines a set of exclusive annotation guidelines. Small objects that are far away from the ego-vehicle are not counted as ground truths. The image background is also a target class, so there are in total three classes to be detected: Vehicle, Don't care and Background.

Figure 3.1: The positions for eight cameras

3.1.2 Data analysis

Because the data set had been collected only recently at the time of writing, a great deal of analysis work is performed to learn the statistical properties of the data distribution, which helps decision making for model training. Part of the data analysis results are plotted in Figures 3.2, 3.3, 3.4 and 3.5.

Figure 3.2: The distribution of objects across all camera positions


Figure 3.3: The distribution of truncated objects across all camera positions

Figure 3.4: The distribution of the height of bounding boxes for all camera positions with bin size 10


Figure 3.5: The distribution of the width of bounding boxes for all camera positions with bin size 10

The first bar plot shows the object distribution over all eight camera views. It can be seen that the counts for the front and rear views are the highest, while the left side and right side views have the fewest images. The second bar plot indicates the total number of objects and the number of truncated objects in the eight camera views, where each entire bar shows all objects and the lower bars in yellow are the truncated objects. 'Truncated' here simply means objects that are incomplete due to the limit of the image size. It is interesting to see that the truncation rates of the front and rear views are both around 10%. For the other side view cameras, the rates range approximately from 20% to 30%, and the average truncation rate over all cameras is 18.12%. The next two figures display the size distribution of the object bounding boxes in terms of height and width over all cameras, where the Y-axis denotes cumulative counts for each bin of 10 pixels. It can be seen that the front and rear views capture far more small objects (10–40 pixels) compared to the other side views. Also, the peaks of the height distribution are more pronounced than those of the width distribution. These statistics, together with other results not shown, are useful guidance for how to perform the experiments on object detection and domain adaptation.


3.2 Model architecture

3.2.1 Encoder

The thesis experiments with three particular feature extractors (encoders): ResNet with 34 convolutional layers, Dilated ResNet with 26 convolutional layers, and ResNeXt with 101 convolutional layers, out of which the Dilated ResNet outperforms the others and is chosen as the feature extractor for the baseline model. To extract representative features from the image, the thesis removes the last fully connected layer from the original structure of the 26-layer Dilated ResNet; the modified network is illustrated in Figure 3.6,

Figure 3.6: The model structure of feature extractor

where batch normalization and ReLU are performed right after each convolutional layer. Apart from the normal residual connections, drawn as solid lines across every two convolutional layers, the dashed lines indicate image downsampling with a filter of size 1×1. In total, there are 512 feature maps after the extraction and the image size becomes half of the input size, which needs to be recovered by upsampling during the decoding process.

3.2.2 Decoder

The modular design allows additional network components to be added onto the primary model structure and achieves multi-task learning, which is appropriate and effective for object detection. This component is named the decoder because it translates abstract image features into specific object classes. The decoder is composed of two tasks: classification and regression. The classification task generates predicted probability scores for each class and the regression task generates predicted coordinates of bounding boxes for the detected objects. The structure of the two tasks is illustrated in Figure 3.7,

Figure 3.7: The structure of decoder

The output of the feature extractor is fed directly into each connected task, where two convolutional layers plus one pixel shuffle operation for upsampling transform the extracted features into predictions with 3 channels for classification and 4 channels for regression.
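A minimal PyTorch sketch of such a decoder head is given below, assuming the 512 encoder feature maps mentioned in Section 3.2.1 and a pixel-shuffle upsampling factor of 2; the intermediate channel width (256) and the feature-map size are hypothetical choices, not values taken from the thesis.

```python
import torch
import torch.nn as nn

class DetectionHead(nn.Module):
    """Two convolutions plus a pixel-shuffle upsampling, ending in `out_channels` maps."""
    def __init__(self, in_channels=512, mid_channels=256, out_channels=3, upscale=2):
        super().__init__()
        self.head = nn.Sequential(
            nn.Conv2d(in_channels, mid_channels, kernel_size=3, padding=1),
            nn.ReLU(inplace=True),
            # Produce out_channels * upscale^2 maps so that the pixel shuffle
            # rearranges them into out_channels maps at twice the resolution.
            nn.Conv2d(mid_channels, out_channels * upscale ** 2, kernel_size=3, padding=1),
            nn.PixelShuffle(upscale),
        )

    def forward(self, features):
        return self.head(features)

features = torch.randn(1, 512, 72, 120)           # encoder output (hypothetical size)
cls_head = DetectionHead(out_channels=3)           # 3 classes: Background, Don't care, Vehicle
reg_head = DetectionHead(out_channels=4)           # 4 bounding-box offset channels
print(cls_head(features).shape, reg_head(features).shape)
# torch.Size([1, 3, 144, 240]) torch.Size([1, 4, 144, 240])
```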

3.3 Technical details

3.3.1 Pre-processing

In order to keep the convolutional neural network within a limited size and save GPU memory, a batch of images extracted from the surround vision data is downsized by a factor of 2 in both height and width. The preprocessing then normalizes these images by subtracting the mean values [0.485, 0.456, 0.406] and dividing by the standard deviation values [0.229, 0.224, 0.225]. These normalization statistics come from ImageNet, which also serves as the source of the pre-trained model for the thesis. It makes sense to perform this step because the input features contained in the images are not on the same scale; as mentioned in [21], normalization gives the input features approximately equal importance with regard to the learning algorithm. Finally, the normalized batch of images is trimmed from the right and bottom sides so that the height and width are divisible by 32 for full utilization of the computing power of the available GPU nodes. After this series of pre-processing operations, the size of the image used as network input becomes 576 × 960.
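The pre-processing pipeline described above (halving the resolution, ImageNet normalization, and trimming so that both sides are divisible by 32) can be sketched with torchvision as follows; the exact implementation used in the thesis is not shown, so treat this as an illustrative approximation.

```python
import torch
from PIL import Image
from torchvision import transforms

to_tensor_and_normalize = transforms.Compose([
    transforms.Resize((604, 960)),                      # halve the 1208 x 1920 input
    transforms.ToTensor(),                              # HWC uint8 -> CHW float in [0, 1]
    transforms.Normalize(mean=[0.485, 0.456, 0.406],    # ImageNet statistics
                         std=[0.229, 0.224, 0.225]),
])

def preprocess(image: Image.Image) -> torch.Tensor:
    x = to_tensor_and_normalize(image)
    # Trim from the bottom and right so that height and width are divisible by 32.
    h = (x.shape[1] // 32) * 32      # 604 -> 576
    w = (x.shape[2] // 32) * 32      # 960 -> 960
    return x[:, :h, :w]              # final network input of size 3 x 576 x 960
```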

Another purpose of pre-processing before the actual training is to add the information from the image annotations as ground truth input, mainly the coordinates of the objects marked with rectangular bounding boxes inside every image. A bounding box can be represented as (x_min, y_min, x_max, y_max), such that the coordinates of the top left and bottom right pixels of the box are (x_min, y_min) and (x_max, y_max) respectively. The coordinate system following the annotation guideline sets the origin with an offset towards the top left, allowing padding to be added around an image for explicit annotation of truncated objects. The system therefore has to be adjusted by shifting each bounding box value by the corresponding padding offset, so that the origin coincides with the top left pixel of the image, which ensures a common coordinate system for the object detection network. Also, the bounding boxes are downsized by the same factor as in the image pre-processing for consistency of scale.

3.3.2 Input and output

Mask

The ground truth network input and the network output are represented as pixel masks with a downsampling stride of 4, such that the size of a pixel mask is (img_height/4, img_width/4). The mapping between the original coordinate system and the new mask coordinate system can be simplified by constructing two indexing arrays, one for the X and one for the Y direction, of the same size as the mask. The indexing map for the X direction is an array whose row values run from 0 up to img_width with a step equal to the stride of 4; similarly, the column values of the other indexing map run from 0 up to img_height with a step of 4. To map from the mask coordinate system back to the real system, the original coordinates can be recovered by directly reading the respective indexing values at the corresponding locations of the two indexing maps.
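The two indexing maps can be built with NumPy as below (a minimal sketch using the network input size and stride stated in the thesis); each mask location then reads its original image coordinate directly from the corresponding entries.

```python
import numpy as np

img_height, img_width, stride = 576, 960, 4

# X map: each row runs 0, 4, 8, ... along the image width.
# Y map: each column runs 0, 4, 8, ... along the image height.
x_map, y_map = np.meshgrid(np.arange(0, img_width, stride),
                           np.arange(0, img_height, stride))

print(x_map.shape)                        # (144, 240) == (img_height / 4, img_width / 4)

# Recover the original image coordinate of mask location (row, col):
row, col = 10, 20
print(x_map[row, col], y_map[row, col])   # 80 40
```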

Network input

• input of regression

Each target object, represented as a rectangular bounding box in an image, is covered by an object mask. In practice, the dimension of the mask is not always equal to the size of the bounding box around the object: a shrink factor is applied to the object mask in all four directions when the class type of the object is not Don't care. The mask thus looks like a smaller region located in the middle of the object; in the following, the object mask is shrunk to 20% of its original width and height. The plots in Figure 3.8 show ground truth objects, where the green rectangle is the bounding box and the filled region in the center is the shrunk object mask.

Figure 3.8: Objects marked with ground truth and mask

The width and height of the shrunk object mask in the mask system are described as,

mask_width = (x_max − x_min) × shrink_factor / stride    (3.1)

mask_height = (y_max − y_min) × shrink_factor / stride    (3.2)

So the converted coordinates of the object mask in the mask system become,

mask_xmin = mask_center[0] − mask_width / 2    (3.3)

mask_ymin = mask_center[1] − mask_height / 2    (3.4)

mask_xmax = mask_center[0] + mask_width / 2    (3.5)

mask_ymax = mask_center[1] + mask_height / 2    (3.6)

where mask_center is calculated from the location of the center point of the original bounding box.

Now that the coordinates of the object bounding box are available in the mask system, the next step is to create a four-channel object mask in which each channel relates to one value of the bounding box coordinates (i.e. xmin, ymin, xmax, ymax). The region where the mask resides in every channel is filled with the corresponding real coordinate value of the bounding box. Eventually, an encoding step is required in order to obtain a set of offset values as the final ground truth input of regression, which will be detailed later.

• input of classification

There is only one channel for classification mask where the re- gion of the object is filled by the code of class type (i.e. 0 for background, 1 for Don’t care, 2 for car/vehicle)

In principle, the pre-processed image, represented as an array, is sent to the network as training input, and the masks of classification and regression are wrapped into a ground truth input that is passed along with the network input and used for the loss computation.

Network output

• output of regression

The regression head for predicting the location of 2D bounding boxes yields residuals to box offsets that are encoded relative to the coordinates of the output pixel of the model, as shown in Figure 3.9,


Figure 3.9: The regression output

Specifically, the output of regression consists of four channels, in which each channel of size (img_height, img_width) contains offset values at every pixel location in the mask system. A location selected during the detection process is marked with a small box rectangle: ∆x1 is the offset from the selected point to the left margin of the bounding box, ∆y1 is the offset to the top margin, ∆x2 is the offset to the right margin and ∆y2 is the offset to the bottom margin. These four offsets are used for recovering the final estimated values of the box coordinates. It is worth noting that the regression output the model predicts is compared with the regression ground truth directly for the calculation of the network loss, without being converted into real coordinates, since the regression ground truth is also represented as offset values.

• output of classification

The classification head yields probability scores via a Softmax function for each class type, so the number of channels of the output array equals the number of classes the model should predict.

The model generates raw outputs for classification and regression, which are used for the loss calculation; these results are then screened in further steps in order to obtain an individual prediction per object as the final detection result.


3.3.3 Multi-task learning

The object detection framework has two tasks: classification and regression. These tasks involve recognizing the object type and locating the exact bounding box around the target objects in an image, which is called multi-task learning. According to [18], a combined model consisting of various tasks shares representations among these tasks and thus reduces computation in terms of time and resources. Moreover, there is an agreement between the separate tasks when generating the outputs of the model, which ensures that the result is not biased toward any single task.

When combining the losses between the network output and the ground truth for the whole network, an intuitive and common way is to calculate a weighted linear sum of the individual task losses. However, tuning these hyper-parameters is painful and time-consuming, and the model performance can be highly sensitive to the carefully selected weights. The thesis employs the method from [18], which proposes a principled way of weighting multi-task losses to simultaneously learn different tasks of varying quantities and units using homoscedastic task uncertainty, formulated as follows,

L(W, \sigma_1, \sigma_2, \ldots, \sigma_i) = \sum_i \frac{1}{2\sigma_i^2} L_i(W) + \log \sigma_i^2    (3.7)

where L denotes the total weighted loss over all tasks for one batch, L_i denotes the actual loss for task i on the batch, and σ_i is the weight trained by the network for task i. The actual loss L_i is thus combined with the weight σ_i into a new loss \frac{1}{2\sigma_i^2} L_i + \log \sigma_i^2 for task i. As mentioned in [18], the weights converge to zero quickly when they are used in a simple linear sum, whereas with this new loss function, which is smoothly differentiable, the task weights never converge to zero. Overall, the method enables automatic learning of the relative weights from the data itself and is not vulnerable to weight initialization.
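A compact PyTorch rendering of equation (3.7) is sketched below. It parameterizes each task weight as s_i = log σ_i² for numerical stability (a common choice, not necessarily the thesis implementation), and the task losses passed in are placeholders.

```python
import torch
import torch.nn as nn

class UncertaintyWeightedLoss(nn.Module):
    """Homoscedastic-uncertainty weighting of task losses, following equation (3.7)."""
    def __init__(self, num_tasks=2):
        super().__init__()
        # s_i = log(sigma_i^2), learned jointly with the network weights.
        self.log_vars = nn.Parameter(torch.zeros(num_tasks))

    def forward(self, *task_losses):
        total = 0.0
        for loss_i, s_i in zip(task_losses, self.log_vars):
            # (1 / (2 sigma_i^2)) * L_i + log sigma_i^2
            total = total + 0.5 * torch.exp(-s_i) * loss_i + s_i
        return total

criterion = UncertaintyWeightedLoss(num_tasks=2)
cls_loss = torch.tensor(0.7)   # placeholder classification loss
reg_loss = torch.tensor(1.3)   # placeholder regression loss
print(criterion(cls_loss, reg_loss))
```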

3.3.4 Loss function

Cross-entropy loss is selected as the loss function for the classification task because the task is a multi-class estimation and the output of the task is a probability value between 0 and 1. The network should not take whatever it predicts in the Don't care region into account, so the output for the class Don't care is removed during the computation of the classification loss. In PyTorch [27], this can be implemented by setting ignore_index to the label index of the class Don't care, which is 1 in the thesis. The cross-entropy loss function provided in PyTorch first converts the network outputs into probability scores with a Softmax function and then computes the cross-entropy value for each element in a batch, which can be described as,

loss(x, c) = -\log\left(\frac{\exp(x[c])}{\sum_j \exp(x[j])}\right) = -x[c] + \log \sum_j \exp(x[j])    (3.8)

where x[c] denotes the output for class label c, and x[j] denotes any other output that is not associated with label c. In fact, this representation of cross-entropy loss combines the negative log likelihood loss and the Softmax function. Eventually, all loss values are averaged across the elements in each batch.

Smooth L1, a variation of the L1 loss, which represents the least absolute deviation, is used as the loss function for the regression task. The loss value for one batch is simply the average over all elements inside it, formulated as follows,

loss(x, y) = \frac{1}{n} \sum_i z_i    (3.9)

The error for element i is shown below, where x_i and y_i denote the estimated value and the target value respectively.

z_i = \begin{cases} 0.5\,(x_i - y_i)^2, & \text{if } |x_i - y_i| < 1 \\ |x_i - y_i| - 0.5, & \text{otherwise} \end{cases}    (3.10)

When the absolute element-wise difference between the target value and the estimated value falls below 1, the function turns into a form of L2 loss (mean squared error), which computes a squared error instead of an absolute error. The reason why the L2 loss function is not used directly for all errors is that the L2 loss becomes unexpectedly large in the presence of outliers, by which the loss is easily affected, while the L1 loss is generally less sensitive to outliers. In the end, the total loss of the model is a combination of the two task losses.


3.4 Inference

3.4.1 Detection

The raw regression output from the network, expressed as coordinate offsets, needs to be decoded into real coordinates for the final detection, which is simply implemented by adding or subtracting the respective offsets from the indexing maps in the X and Y directions. After that, the classification scores along with the decoded regression output go through two sifting processes to become the final detection result.

First, the classification outputs with scores above a specified confidence threshold are kept and the others are discarded, and the regression outputs are then filtered according to the remaining indices of the classification output.

Non-maximum suppression then performs an iterative and more rigorous screening step. The bounding boxes are first ranked by their filtered classification scores and the box with the highest score is compared against the rest of the predicted boxes. If the IoU of a remaining box with respect to the highest ranked box is larger than the non-maximum suppression threshold, it is removed from the list of predicted boxes. A new list of predicted bounding boxes is created once all remaining boxes have been compared, and the iterative process continues until the last element in the list is reached, which produces the final detection result of bounding boxes. Figure 3.10 shows the entire detection process.

Figure 3.10: The process for final detection
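A reference implementation of the greedy, score-sorted non-maximum suppression step described above is sketched below; the production code path in the thesis may differ.

```python
import numpy as np

def nms(boxes, scores, iou_threshold=0.5):
    """Greedy non-maximum suppression on (xmin, ymin, xmax, ymax) boxes."""
    boxes = np.asarray(boxes, dtype=float)
    scores = np.asarray(scores, dtype=float)
    areas = (boxes[:, 2] - boxes[:, 0]) * (boxes[:, 3] - boxes[:, 1])
    order = np.argsort(scores)[::-1]             # highest score first
    keep = []
    while order.size > 0:
        best, rest = order[0], order[1:]
        keep.append(int(best))
        # IoU of the best box against all remaining boxes.
        x1 = np.maximum(boxes[best, 0], boxes[rest, 0])
        y1 = np.maximum(boxes[best, 1], boxes[rest, 1])
        x2 = np.minimum(boxes[best, 2], boxes[rest, 2])
        y2 = np.minimum(boxes[best, 3], boxes[rest, 3])
        inter = np.maximum(0.0, x2 - x1) * np.maximum(0.0, y2 - y1)
        ious = inter / (areas[best] + areas[rest] - inter)
        order = rest[ious <= iou_threshold]       # drop overlapping lower-scored boxes
    return keep

boxes = [(0, 0, 10, 10), (1, 1, 11, 11), (50, 50, 60, 60)]
scores = [0.9, 0.8, 0.7]
print(nms(boxes, scores, iou_threshold=0.5))      # [0, 2]
```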

3.4.2 Evaluation

For the moment, the thesis only evaluates the vehicle class after every training epoch, regardless of the detection performance on the Background or Don't care classes. The meanings of the evaluation metrics for the object detection thesis are explained as follows,

• true positive: ground truth vehicles that are correctly detected.


• false positive: detected vehicles that are actually not ground truth.

• false negative: ground truth vehicles that are not detected.

Each image frame is evaluated separately within the same epoch to obtain key metrics such as true positives and false positives, which are the components for calculating precision and recall. If M and N denote the number of detected vehicle objects and ground truth objects respectively, then the overlap between detections and ground truth for each image frame can be constructed as a 2D matrix of size M×N based on the bounding boxes of the network output as well as the ground truth input, where each element is the IoU score for a pair of detection and ground truth boxes.

With the goal of matching each detected object with a single ground truth object, the overlap matrix is then processed iteratively by extracting the highest score one at a time and no longer considering the detection and ground truth box to which that highest score belongs. The number of iterations depends on the smaller of the numbers of detections and ground truth objects, resulting in a score vector of size M×1. In a similar way, a Don't care score vector of the same size is obtained from the comparison between detected cars and Don't care objects. This score list is then converted into a boolean Don't care mask whose elements are true as long as they are above a specific threshold. This ensures that a detected vehicle that has more than, for instance, 10% overlapping area with a Don't care object (e.g. a group of parked cars) is regarded as a Don't care object instead of an individual vehicle.

A specified threshold on the score vector decides which detections are the final true positives, generating a boolean mask for true positives. The mask for false positives is calculated by a series of logical operations, as shown below.

false_positive = (not true_positive) and (not dont_care_mask)    (3.11)

Therefore, the final IoU score of an image frame is the sum of the matching scores filtered by a true positive threshold, and the numbers of true positives and false positives can be obtained from their respective boolean masks. After all image frames within an epoch are evaluated in terms of true positives and false positives, the mean IoU metric becomes the accumulated IoU score divided by the number of true positives. The precision and recall can also be generated from the available evaluation metrics, which are formulated specifically in the thesis as,

precision = \frac{\text{true positive}}{\text{true positive} + \text{false positive}}    (3.12)

recall = \frac{\text{true positive}}{\text{ground truth}}    (3.13)

It is also noted that

ground truth = true positive + false negative    (3.14)

Finally, the F1 score determined by precision and recall is used as the overall evaluation metric for the detection performance on vehicles at one epoch. In the next chapter, the F1 score for every epoch is plotted to demonstrate the trend of the model's detection performance over time.
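The per-epoch metrics follow directly from the accumulated counts above; a minimal sketch of equations (3.12)–(3.14) and the F1 score is given below.

```python
def detection_metrics(true_positive, false_positive, false_negative):
    """Precision, recall and F1 from accumulated per-epoch detection counts."""
    ground_truth = true_positive + false_negative                          # equation (3.14)
    detections = true_positive + false_positive
    precision = true_positive / detections if detections else 0.0          # equation (3.12)
    recall = true_positive / ground_truth if ground_truth else 0.0         # equation (3.13)
    f1 = 2 * precision * recall / (precision + recall) if (precision + recall) else 0.0
    return precision, recall, f1

print(detection_metrics(true_positive=80, false_positive=20, false_negative=40))
# (0.8, 0.666..., 0.727...)
```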

3.5 Transfer learning

Transfer learning is used to generate a baseline for each camera. Instead of training the entire convolutional neural network from scratch, the encoder of the object detection model is pretrained on the ImageNet data set (which has 1.2 million images in 1000 categories) as an initialization, so that it is able to extract features with high-level information from the input images. What is actually transferred through the model is a set of network weights used to initialize the encoder network. Compared to the ImageNet data set, the surround vision data is relatively small, so pretraining on a larger data set like ImageNet is a good way to improve the generalization performance on a new data set. Furthermore, the subsequent fine-tuning allows the learned features to become associated with the target surround vision data set and incorporates the properties of equivariance and invariance into the classification and regression tasks.


3.6 Domain adaptation

The model initially focuses on training on only a single camera, namely the front view camera, which has the largest number of diverse target objects. This is based on the assumption that the model trained on front-view images is generalizable enough to serve as a starting point for further training on the other cameras.

Similar to the operation in transfer learning, a baseline model pretrained on the front view camera is used as an initialization for training on the other seven cameras. Thus, the encoder network in the pretrained model now contains high-level image features that mostly come from the surround vision data set.

This becomes a process of domain adaptation in which the front view camera is the source domain, and the other seven cameras, including the rear, left-front, left-side, left-rear, right-front, right-side and right-rear views, are target domains.

Various empirical fine-tuning techniques are added to the training on the other seven cameras, with the goal of finding the most appropriate domain adaptation method for each camera. The pretrained model used for domain adaptation is the baseline model of the front view camera because it shows the best performance on the source domain.

3.6.1 Direct training

The first domain adaptation method is simply to load the well-trained model of the front view camera, which is saved in a checkpoint file containing all network weights, as an initialization, and then to continue training on each of the seven cameras for 100 epochs. The validation set for each camera remains the same as the one used in the respective baseline training, so that the effect of the evaluation on the validation set is comparable. Therefore, all of the weights of the encoder, decoder and task-weight components start training with meaningful values instead of random initializations.

3.6.2 Re-initialization

This method is similar to the first one apart from a change in the weight initialization. The idea is that the encoder focuses on finding distinctive features of vehicles, while the other model components, such as the decoder, handle the object classification and localization tasks. It might be easier for the network to make classification predictions than to learn significant features, and training the decoder with weights initialized from a different domain could lead the learning in a deviated direction. Without the contribution of the well-trained decoder weights, the pretrained model essentially serves as a feature extraction mechanism.

3.6.3 Freeze encoder

• Freeze encoder

One extreme way to avoid over-training of the object features is to freeze the encoder during the whole training. The encoder weights are then not affected by the backpropagation of the network weights, so they stay unchanged at their initialized values. In PyTorch, this can be achieved by controlling the requires_grad flag, which excludes the affected nodes from the gradient computation (a short code sketch is given after this list). The goal here is to observe how representative the features learned from the pretrained model are, which gives an indication of how to perform the following domain adaptation methods with partial training.

• Freeze by epoch

Freezing the encoder only for a few epochs rather than for the whole training can be a more reasonable alternative. The fine-tuning phase, in which the encoder weights are again allowed to receive gradient updates, begins as soon as the model is unfrozen. As a result, the model makes detections based on the features learned from the front view camera until they are improved iteratively by the fine-tuning. The number of epochs for which the model is frozen can vary in the experiments.

• Freeze by layer

Another option is to freeze the weights of the initial layers and fine-tune the rest in every epoch. The idea behind this is that not every convolutional layer serves the same purpose in the encoder network, especially for the encoder in the baseline model, which is the Dilated ResNet. The choice of a suitable cut-off can be tricky; however, in the next chapter it can be seen that how many layers are frozen depends on the dilation property of the encoder.
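A sketch of the three freezing variants, using the requires_grad flag mentioned in the first item, is shown below; the encoder module and the notion of "child blocks" are hypothetical stand-ins, since the thesis model code is not shown.

```python
import torch.nn as nn

def freeze_encoder(encoder: nn.Module):
    """Freeze the whole encoder: its weights are excluded from gradient updates."""
    for param in encoder.parameters():
        param.requires_grad = False

def unfreeze_encoder(encoder: nn.Module):
    """Used by 'freeze by epoch': call after the chosen number of frozen epochs."""
    for param in encoder.parameters():
        param.requires_grad = True

def freeze_first_layers(encoder: nn.Module, num_frozen: int):
    """'Freeze by layer': keep the first `num_frozen` child blocks fixed, fine-tune the rest."""
    for i, child in enumerate(encoder.children()):
        for param in child.parameters():
            param.requires_grad = i >= num_frozen
```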

3.6.4 Different learning rates

The common purpose of the methods above is to keep the model from over-learning on the extracted features by temporarily disabling weight updates. It is worth noting that using two different learning rates for the encoder and the other components can also achieve this goal. A learning rate larger than the previous single rate is selected for the other components, which accelerates the training of the classification and regression heads. On the other hand, the learning rate for the encoder is kept relatively small because the feature extractor is expected to be trained at a slow pace, so that the weights of the encoder network are not updated as aggressively as before, which alleviates the over-learning issue to some extent.
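With PyTorch optimizers this is done by passing parameter groups with separate learning rates; the stand-in model and the two rate values below are illustrative only.

```python
import torch
import torch.nn as nn

# Minimal stand-in for the detection model: an encoder and a decoder sub-module.
model = nn.ModuleDict({
    "encoder": nn.Conv2d(3, 512, kernel_size=3, padding=1),
    "decoder": nn.Conv2d(512, 7, kernel_size=3, padding=1),
})

optimizer = torch.optim.Adam([
    {"params": model["encoder"].parameters(), "lr": 1e-5},   # slow updates for the feature extractor
    {"params": model["decoder"].parameters(), "lr": 1e-4},   # faster updates for the task heads
])
```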

3.6.5 Freezeout

Freezeout (Brock et al. [4]) is a way of substantially reducing training time by gradually freezing layers at the cost of a slight drop in model performance. Different from the previous methods that directly prevent layers from being updated in backpropagation, Freezeout maintains a set of layer-wise learning rates and changes them over time, which is a type of learning rate annealing. Basically, the full model is trainable at first; after some iterations, a few layers are frozen when their learning rates reach zero, while the rest continue to be trained, and so on. A layer is set to inference mode and excluded from backpropagation once frozen, leading to an immediate per-iteration speedup proportional to the layer's computational expense. Thus, layers are progressively stopped from training as the number of iterations increases. Moreover, there are two user choices with regard to Freezeout: t_0, which indicates how far into training the first layer is frozen, and the annealing strategy. Generally, Freezeout has four strategies resulting from varying the scaling mode and the duration mode, and in the experiments several strategy combinations are presented to show the effect of the speedup in training time under different conditions.


It is worth noting that, as stated in the paper, the method only works efficiently for networks with residual connections, which fits the network used in the thesis.

3.6.6 Horizontal flipping

Making use of the potential symmetry between the left side and right side cameras, horizontal flipping addresses the issue of data shortage, in terms of efficient annotation, to some degree. It is a novel attempt to validate the model on the left side camera while training it on right side camera images that are horizontally flipped into simulated left side images as network input. In the case where the input for the left side data is constructed from the available right side data along with its object annotations, the amount of images and annotations needed for the left side camera can be greatly reduced. Conversely, right side images can also be generated from left side ones by flipping them horizontally. To eliminate the discrepancy between the two types of cameras as much as possible, the network input can be constructed from some left side data that is flipped into the right side view, mixed with a portion of annotated right side data.
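Flipping an image together with its annotations is straightforward: the image tensor is mirrored along the width axis and the box x-coordinates are reflected. The sketch below assumes boxes in (xmin, ymin, xmax, ymax) pixel coordinates.

```python
import torch

def hflip_with_boxes(image: torch.Tensor, boxes: torch.Tensor):
    """Mirror a CHW image tensor and its (N, 4) boxes given as (xmin, ymin, xmax, ymax)."""
    _, _, width = image.shape
    flipped_image = torch.flip(image, dims=[2])           # flip along the width axis

    flipped_boxes = boxes.clone()
    flipped_boxes[:, 0] = width - boxes[:, 2]              # new xmin from old xmax
    flipped_boxes[:, 2] = width - boxes[:, 0]              # new xmax from old xmin
    return flipped_image, flipped_boxes

image = torch.randn(3, 576, 960)
boxes = torch.tensor([[100.0, 200.0, 300.0, 400.0]])
_, new_boxes = hflip_with_boxes(image, boxes)
print(new_boxes)    # tensor([[660., 200., 860., 400.]])
```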

3.6.7 Combination

Because some of the domain adaptation methods elaborated above are compatible with each other, it is interesting to see whether there is any performance enhancement when the methods are integrated in some way.

Three combinations are carefully picked based on the comparison of experiment results. The first is to freeze a specific number of layers of the dilated residual network for all epochs while initializing the weights of the model components other than the encoder before training. Another combination is to freeze a specific number of layers of the encoder for only several epochs. The last combination still freezes some layers but uses a smaller learning rate for the encoder. Overall, these experiments are based on rational decisions in the sense that they take advantage of the strengths of each domain adaptation method.


3.7 Regularization

3.7.1 Weight decay

Residual networks are more prone to overfitting than other neural networks, which can be seen in the domain adaptation experiments. In spite of an increasing trend in the validation F1 score, the validation loss stops decreasing after a certain epoch at which it reaches its minimum and then starts to rise. Therefore, regularization techniques are needed here to lower the complexity of the training model and mitigate overfitting.

The first regularization method used in the thesis is weight decay, which penalizes large weights. It is equivalent to L2 regularization, although the subtle difference is that weight decay is an extra term added to the update rule for the network weights. If L denotes the current loss, η the learning rate and w_i a weight parameter, the update rule with weight decay can be expressed as

w_i = (1 - \eta \lambda)\, w_i - \eta \frac{\partial L}{\partial w_i} \qquad (3.15)

where λ controls the trade-off between the amount of weight decay and the current cost L. Since the optimizer for the training model is Adam, the weight decay value is passed to Adam as an optimizer-specific option. Several decay values are tried in order to find a good penalization; there is no weight decay for the baseline model by default.
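In PyTorch, for instance, the decay value can be passed directly to Adam; the value below is only an example, not a setting from the experiments. Note that PyTorch's Adam implements this as an L2 penalty added to the gradient rather than the decoupled update in equation (3.15).

```python
import torch
import torch.nn as nn

model = nn.Linear(10, 2)  # stand-in for the detection model

# Weight decay passed as an optimizer-specific option; lambda = 1e-4 is illustrative only.
optimizer = torch.optim.Adam(model.parameters(), lr=1e-4, weight_decay=1e-4)
```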

3.7.2 Dropout

The building blocks of dilated residual networks do not contain dropout layers, so one regularization idea is to modify the original network. Different from weight decay, which affects all components of the training model, this method only focuses on the feature extractor by randomly dropping a few units in the residual blocks. As a result, the incoming and outgoing connections of a dropped unit are cut off and the unit is removed temporarily, which stops its weights from being updated by backpropagation through the extractor network.

A new building block with one dropout layer is shown in Figure 3.11a. The dropout layer is located in-between the two convolutional layers with a specified probability, for example 50%, which means that 50% of the unit activations are zeroed out, re-sampled on every forward pass. Instead of increasing the dropout rate of a single dropout layer, one alternative is to add another dropout layer after the second convolutional layer with a lower rate. The corresponding building block structure is shown in Figure 3.11b.

Figure 3.11: Building blocks of the dilated residual network with one dropout layer (a) and two dropout layers (b)
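As a concrete illustration (a sketch only, not the thesis implementation, and with illustrative channel sizes rather than the Dilated ResNet configuration), a residual building block with dropout between its two convolutions could look as follows.

```python
import torch.nn as nn

class BasicBlockWithDropout(nn.Module):
    """Residual block with dropout in-between the two convolutions (cf. Figure 3.11a)."""
    def __init__(self, channels: int, drop_rate: float = 0.5):
        super().__init__()
        self.conv1 = nn.Conv2d(channels, channels, kernel_size=3, padding=1, bias=False)
        self.bn1 = nn.BatchNorm2d(channels)
        self.dropout = nn.Dropout2d(p=drop_rate)   # activations are zeroed anew on every forward pass
        self.conv2 = nn.Conv2d(channels, channels, kernel_size=3, padding=1, bias=False)
        self.bn2 = nn.BatchNorm2d(channels)
        self.relu = nn.ReLU(inplace=True)

    def forward(self, x):
        out = self.relu(self.bn1(self.conv1(x)))
        out = self.dropout(out)   # Figure 3.11b would add a second, lower-rate dropout after conv2
        out = self.bn2(self.conv2(out))
        return self.relu(out + x)  # residual connection
```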

3.7.3 Data augmentation

Considering the similarity of images within consecutive timestamps and the number of available images, data augmentation techniques are added to the training process to increase the size and diversity of the training images, which becomes an alternative way to avoid overfitting in this case.

One could prepare augmented images beforehand in a folder and load them along with the original images during training. However, due to limits on memory and other resources, the training images are instead randomly selected at every batch for dynamic augmentation on the fly before further preprocessing. The total number of training images thus remains unchanged; in other words, the images that are augmented differ from epoch to epoch.

There are three data augmentation methods in the thesis: basic image processing, horizontal flipping and random cropping.

• Basic image processing

The first operation is to add Gaussian blur with a random sigma value between 0 and 0.5 per image with a probability of 50%.

Image contrast is then strengthened or weakened by a factor between 0.75 and 1.5 per image. Also, white noise sampled from a Gaussian distribution with mean 0 and variance 0.05 × 255 is added, once per pixel for 50% of all images and channel-wise for the rest, which changes the brightness as well as the color of an image. Another way to adjust these properties is to directly multiply all pixel intensities by a factor in the range 0.8 to 1.2, sampled per image for 80% of the images and per channel for the remaining 20%. A code sketch of such a processing pipeline is given after this list.

• Horizontal flipping

Images are flipped horizontally along with the coordinates of the ground truth bounding box for each object, and each image has a 50% chance of being flipped. This operation is useful when the model is trained only on left side view cameras and validated on right side view cameras, under the assumption that the two side views are symmetric to some extent. Hence, the amount of annotation work for the right side view camera can be reduced in terms of annotation time and cost.

• Random cropping

One method is to start cropping from the right and bottom sides of an image. A random number smaller than 100 is sampled as the width of the cropped area on the right, and the cropping value for the bottom side is calculated according to the aspect ratio of the image. The cropped image is then resized to the original dimensions by an upsampling factor, along with the bounding boxes, which are recomputed into new positions.

The cropped region can be further randomized in all four directions. Let X1 and X2 denote the top left and top right points of the region, with coordinates (xmin, ymin) and (xmax, ymin) respectively, and let the width and height of the image containing this cropped region be img_width and img_height. The value of xmin is sampled in the range (1, img_width/2) and ymin in the range (1, img_height/2), so that point X1 is limited to the upper left quarter of the image, which ensures the central part of the image is always inside the cropped region. The width of the region, given by xmax − xmin, is at least half the image width img_width/2, so that the randomly selected cropped region cannot be too small; otherwise few or no objects would appear in the augmented image. The value of ymax is calculated based on the aspect ratio as above. The image is resized after cropping by linear interpolation, and the bounding boxes of objects that are not completely cropped away are readjusted to cover the upsampled objects. If no target object is left in the cropped region, this augmentation operation is disregarded for the image. A simplified code sketch of this cropping procedure is also given after the list.
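The operations described under basic image processing closely match what the imgaug library provides; the following is a minimal sketch under the assumption that imgaug (not named in the text above) is the augmentation backend, with the parameters mirrored from the description.

```python
import imgaug.augmenters as iaa

basic_processing = iaa.Sequential([
    iaa.Sometimes(0.5, iaa.GaussianBlur(sigma=(0.0, 0.5))),      # blur half of the images
    iaa.ContrastNormalization((0.75, 1.5)),                      # strengthen or weaken contrast
    iaa.AdditiveGaussianNoise(loc=0, scale=(0.0, 0.05 * 255),    # white noise, channel-wise
                              per_channel=0.5),                  # for half of the images
    iaa.Multiply((0.8, 1.2), per_channel=0.2),                   # adjust brightness and color
])
# augmented = basic_processing.augment_images(list_of_uint8_images)
```

The random cropping step can be sketched as below; the function and variable names are illustrative only, and the sampling details are a simplification of the four-direction scheme described above.

```python
import numpy as np
import cv2

def random_crop_with_boxes(image: np.ndarray, boxes: np.ndarray, rng=np.random):
    """image: HxWxC array; boxes: Nx4 array of (xmin, ymin, xmax, ymax)."""
    img_height, img_width = image.shape[:2]
    xmin = rng.randint(1, img_width // 2)                 # keep X1 in the upper left quarter
    ymin = rng.randint(1, img_height // 2)
    xmax = rng.randint(xmin + img_width // 2, img_width)  # crop width >= img_width / 2
    aspect = img_width / img_height
    ymax = min(img_height, ymin + int(round((xmax - xmin) / aspect)))

    crop = image[ymin:ymax, xmin:xmax]
    resized = cv2.resize(crop, (img_width, img_height), interpolation=cv2.INTER_LINEAR)

    # Shift boxes into crop coordinates, clip them, then rescale to the original size.
    scale_x = img_width / (xmax - xmin)
    scale_y = img_height / (ymax - ymin)
    new_boxes = boxes.astype(np.float32)
    new_boxes[:, [0, 2]] = np.clip(new_boxes[:, [0, 2]] - xmin, 0, xmax - xmin) * scale_x
    new_boxes[:, [1, 3]] = np.clip(new_boxes[:, [1, 3]] - ymin, 0, ymax - ymin) * scale_y
    keep = (new_boxes[:, 2] > new_boxes[:, 0]) & (new_boxes[:, 3] > new_boxes[:, 1])
    if not keep.any():
        return image, boxes        # no target object left: skip the augmentation
    return resized, new_boxes[keep]
```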

The images in Figure 3.13 show three examples of data augmentation on front, right side and rear right view images. The positions of the ground truth bounding boxes, in green, are recalculated and drawn as rectangles on top of the image.


Figure 3.13: (a) the original downsampled image, (b) the basic image processing applied to the original image, (c) the horizontally flipped image, (d) the cropped image. The locations of the bounding boxes in (c) and (d) are recalculated and drawn on top of the image

Chapter 4

Experiments & results

This chapter presents the results of the experiments regarding the benchmark, domain adaptation, regularization, ratio reduction and error analysis. Figures and graphs are used to intuitively display the effects of the various methods, and extensive analysis is presented to discuss the reasons for and the relationships between the different results.

4.1 Benchmark

All trainings, regardless of camera direction, are based on a data set split of 60% for training and 40% for validation. The image identifiers for all cameras are saved in advance in respective data lists, together with the information about where the data is stored on the server, so that the model can load different data sets through the specified data loader before training and dynamically extract data according to the identifiers in the corresponding data list. This eases the work of data extraction and separates it from the core object detection task. Once new data arrives on the server, the only task is to add to or modify the original data list, which keeps data loading flexible and diverse.
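As an illustration only (the list file format and names are hypothetical, not the thesis code), such list-driven loading could be organized as a small PyTorch dataset:

```python
from torch.utils.data import DataLoader, Dataset
from PIL import Image

class CameraListDataset(Dataset):
    """Loads images from a per-camera data list of '<image_id> <path_on_server>' rows."""
    def __init__(self, list_file: str):
        with open(list_file) as f:
            self.entries = [line.split() for line in f if line.strip()]

    def __len__(self):
        return len(self.entries)

    def __getitem__(self, index):
        image_id, path = self.entries[index]
        return image_id, Image.open(path).convert("RGB")

# "front_train.lst" is a hypothetical file name for the front view camera training list.
# loader = DataLoader(CameraListDataset("front_train.lst"), batch_size=8, shuffle=True)
```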

Table 4.1 lists the evaluation results after training for 100 epochs on the front view camera using three different feature extractor networks. The third experiment outperforms the other two and is considered the baseline model for the front view camera, which is later used as the pretrained model in domain adaptation. The baseline model is obtained after extensive experiments with various combinations of hyperparameters and other variable conditions.


Table 4.1: Performance on validation set using three encoders for front view camera

Encoder          F1     Precision   Recall   IoU
ResNet           0.569  0.678       0.489    0.797
ResNeXt          0.545  0.634       0.478    0.795
Dilated ResNet   0.623  0.659       0.591    0.799

The first two plots in Figure 4.1 below are the training and validation loss curves. The thin line shows how the weighted loss decreases as training goes on, whereas the bold line shows a moving average of the final loss for smoothing purposes. It can be seen that the model is trained in the right direction, since there is an obvious decline in the training loss over the epochs, and the validation loss keeps decreasing in spite of a slight fluctuation in the later stages. As for the evaluation, both the training and validation F1 scores rise quickly and then plateau at a relatively high score. In general, these plots imply more accurate predictions by the model as training time increases.


Figure 4.1: Training and validation results for the baseline model. (a) shows the training loss, (b) shows the validation loss, (c) shows the training F1 score, (d) shows the validation F1 score

Carefully picked, representative detection results for the front view camera are given in Figure 4.2. The green masks covering objects are ground truths and the green boxes around objects are detection results for vehicles. The first picture (a) presents normal detections on a highway where the majority of the target objects are successfully detected with bounding boxes of an appropriate size. Apart from the rear face of vehicles, other faces such as the sides of vehicles can also be detected by the front view camera.
