Depth prediction by deep learning

VALENTIN FIGUÉ

KTH ROYAL INSTITUTE OF TECHNOLOGY

SCHOOL OF ELECTRICAL ENGINEERING AND COMPUTER SCIENCE

Depth prediction by deep learning

VALENTIN FIGUÉ

Double Degree in Computer Science
Date: November 8, 2018

Supervisor: Mårten Björkman
Examiner: Danica Kragic

Swedish title: Djupförutsägelse genom deep learning
School of Electrical Engineering and Computer Science


Abstract

Knowing the depth information is of critical importance for scene understanding in several industrial applications, such as self-driving cars. While depth inference from a single still image has taken a prominent place in recent studies with the advent of deep learning methods, practical settings often offer useful additional information that should be considered early in the design of the architecture in order to improve the quality and robustness of the estimates.

Hence, this thesis proposes a deep fully convolutional network that exploits the information of either stereo pairs or monocular temporal sequences, along with a novel training procedure that takes multi-scale optimization into account.

Indeed, this thesis found that using multi-scale information throughout the network is of prime importance for accurate depth estimation and greatly improves performance, allowing new state-of-the-art results on both synthetic data, using Virtual KITTI, and real images, with the challenging KITTI dataset.


Sammanfattning

Knowing the depth in an image is of crucial importance for scene understanding in several industrial applications, for example self-driving cars. Depth estimation from single images has taken an increasingly prominent role in recent studies, thanks to the development of deep learning. In many practical cases, additional highly useful information is available, which should be taken into account when designing an architecture in order to improve the quality and robustness of the depth estimates.

This thesis therefore presents a deep fully convolutional network that allows information from temporal sequences to be exploited both monocularly and in stereo, together with new ways of optimally training the networks at multiple scales.

The thesis finds that information from multiple scales is of particular importance for accurate depth estimation and for considerably improved performance, which has resulted in new state-of-the-art results on synthetic data from Virtual KITTI as well as on real images from the challenging KITTI dataset.


Contents

1 Introduction
1.1 Context
1.2 Societal Impact
1.3 Ethical consideration
1.4 Objectives
1.5 Validation of the results

2 Theoretical Background
2.1 Definition of the problem
2.2 3D vision
2.2.1 Mathematical Notations
2.2.2 Stereoscopic vision
2.2.3 Temporal monocular vision
2.3 Deep Learning
2.3.1 Composition of a deep convolutional neural network
2.3.2 Experimental implementation

3 Related Work
3.1 ResNet
3.2 Prediction from a single image
3.2.1 Eigen Network
3.2.2 Laina Network
3.3 Stereoscopic depth inference
3.3.1 Kuznietsov network
3.3.2 Godard Network
3.4 Sequential inference
3.4.1 Vijayanarasimhan Network
3.5 Extension to similar problems
3.5.1 FlowNet
3.6 Synthesis

4 Approach
4.1 MSDOS-Net Architecture
4.1.1 Multi-Scale Coarse-to-Fine (MSCF) module
4.1.2 Coarse-to-Fine Inference
4.2 Multi-Scale Training Approach

5 Experimentation and results
5.1 Implementation Details
5.2 Evaluation Metrics
5.3 Virtual KITTI
5.4 KITTI
5.4.1 Comparison with the State-of-the-art
5.4.2 Generalization to temporal sequences

6 Conclusion

Bibliography


Chapter 1

Introduction

1.1 Context

During the last few years, many new industrial projects have arisen, in areas such as robotics or self-driving cars, where the core of the problem is to retrieve spatial information from a single image, a pair of images, or a sequence of images. Recovering dense depth information from a pair of images is a non-trivial yet essential task in computer vision that has been explored for several decades [7], [17]. However, all these classical approaches need a pair of stereoscopic images and achieve only intermediate performance.

The advent of deep learning methods has improved this field of research in terms of performance. Indeed, thanks to the availability of large amounts of RGB-D (color and depth) data collected with dedicated depth sensors, and to the emergence of synthetic data, several deep networks nowadays achieve impressive results in depth prediction. The effectiveness of deep learning approaches even allows solving, to some extent, ill-posed problems such as depth prediction from a single image.

Indeed, most current deep architectures do not use a pair of images as classical approaches do, but infer depth from a single image. However, multiple-image acquisition tends to become a standard in vision-based applications and systems. Personal and public image collections are continuously growing, with ever more redundancy and overlap between images. Even shooting a simple photograph with a smartphone often implies acquiring multiple frames before combining them, sometimes from different sensors. This is why this thesis focuses on developing a novel deep architecture for depth prediction from a pair of images.

1.2 Societal Impact

Improving depth prediction can have a huge societal impact. Indeed, it can easily be adapted for self-driving cars or even for improving conventional cars. With one or two cameras mounted on a car, the solution developed in this thesis can build the depth of the scene in front of the vehicle. This depth map can be used to spatially localize obstacles such as pedestrians, other cars, or even dogs crossing the road. Thanks to this detection, the trajectory of the vehicle can be adapted to avoid collisions, which can help reduce the number of accidents. The solution can also be adapted to help blind or partially sighted persons in everyday life.

1.3 Ethical consideration

One of the keys of deep learning approaches lies in the amount of data and in its accuracy for the given problem. However, if we consider the self-driving car problem for example, in order to detect pedestrians crossing the road efficiently, there must be enough representations of pedestrians in the data. From an ethical point of view, the collection of such data can only be done with the agreement of the pedestrians. This is why one should always verify, when using a dataset, that all the agreements have been collected. If they have not, the faces can be blurred out.

1.4 Objectives

Due to the performance of recent deep learning methods and their very recent emergence, this thesis first reviews the different preexisting approaches for depth prediction from a single image and from a pair or a sequence of images. This review aims to compare the different architectures and to highlight how they differ from classical approaches, in order to understand from a qualitative point of view why they are so efficient.

Once those approaches have been synthesized, this thesis proposes a new deep architecture. The model tackles two different depth prediction problems:

• Recovering depth from two stereoscopic images.

• Recovering depth from two sequential images.

1.5 Validation of the results

In order to compare the efficiency of our approach with preexisting methods, its performance will be evaluated on two different academic datasets:

• The real, sparse RGB-D dataset KITTI

• The synthetic, dense RGB-D dataset Virtual KITTI

One of the main advantages of those two datasets is that they provide images from car driving sequences, which illustrates how well these methods can be applied to self-driving cars.


Chapter 2

Theoretical Background

This chapter presents the different formulas and notations necessary to understand the deep neural networks introduced later.

2.1 Definition of the problem

This master thesis aims to propose a model to predict depth from RGB images. Mathematically speaking, the depth prediction of an image by a given model follows this scheme:

$$F(I, Supp) = Z \qquad (2.1)$$

where F represents the function of the model, I the image for which the depth needs to be predicted, Supp the supplementary images that help the prediction, and Z the predicted depth map. The ideal depth prediction model for a given error criterion fulfills the following condition:

$$F_{ideal} = \arg\min_{F} \sum_{i \in S} E\{F(I_i, Supp_i), Z'_i\} \qquad (2.2)$$

with E the error criterion, S the dataset of images on which we want to test the model, and Z' the ground-truth depth.

Depending on the data provided, three different situations will be explored. These situations differ in how the supplementary images are constituted:

• In most deep learning approaches, only a single image is used to predict the depth. In this case Supp = {}. This is defined as monocular prediction.

• Most classical approaches use a single supplementary image from a different, calibrated camera. I_l and I_r represent the images provided by the left and right cameras respectively. In this case Supp = I_l or I_r, and this is named stereoscopic prediction.

• The last case represents the situation where the supplementary images come from the same camera but at a different time. I_t will represent the image of interest, on which we compute the prediction, and I_{t-1}, ..., I_{t-n} the n previous frames. This will be called sequential prediction.

2.2 3D vision

Throughout this master thesis, several geometric operations will be performed. To be as clear as possible, this section introduces the different mathematical notions that will be used.

2.2.1 Mathematical Notations

A point in a given image will be represented by two values x and y along the two usual axes. For instance, I(x, y) represents the pixel located at coordinates (x, y) in the image I.

The coordinates of a 3D point will be denoted by the capital letters X, Y and Z, where Z represents the depth.

One can convert 3D point coordinates to image coordinates through the usual projection matrix, denoted P, by the following formula:

$$\begin{pmatrix} x \\ y \\ 1 \end{pmatrix} = k P \begin{pmatrix} X \\ Y \\ Z \\ 1 \end{pmatrix} \qquad (2.3)$$

where k represents a scaling factor. The projection matrix P contains all the intrinsic parameters of the camera, such as the focal length. We do not detail the exact composition of the matrix P, as it will not be necessary in the following.

In the case where the camera space origins and axes of the two different cameras are not aligned, the rotation and translation between the two geometrical spaces need to be taken into account. R will represent the rotation matrix and T the translation vector between those two. In this situation, the conversion from 3D point coordinates to image coordinates contains another term:

$$\begin{pmatrix} x \\ y \\ 1 \end{pmatrix} = k P \begin{pmatrix} & & & t_x \\ & R_{3\times 3} & & t_y \\ & & & t_z \\ 0 & 0 & 0 & 1 \end{pmatrix} \begin{pmatrix} X \\ Y \\ Z \\ 1 \end{pmatrix} \qquad (2.4)$$

where k still represents a scaling factor.

2.2.2 Stereoscopic vision

The stereoscopic case, that is to say when the images come from different horizontal positions, presents a specific notion. This notion is called disparity and is defined by the following implicit equation:

$$I_l(x, y) = I_r(x - \rho(x, y), y) \qquad (2.5)$$

where ρ represents the disparity function. The main advantage of using the disparity in our case is the formula that links the disparity with the depth. Indeed, one can deduce from the spatial transformation that:

$$\rho(x, y) = \frac{f\, b}{Z(x, y)} \qquad (2.6)$$

where f is the focal length of both cameras (one of the intrinsic parameters) and b the distance between the two cameras. We can see from equations 2.5 and 2.6 that we have an indirect way to solve the depth prediction problem by first estimating the disparity.
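As a concrete illustration of equation 2.6, depth can be recovered from an estimated disparity map with a single element-wise division. The focal length and baseline below are hypothetical, KITTI-like values, not ones used in this thesis:

```python
import numpy as np

def disparity_to_depth(disparity, focal_length, baseline):
    """Convert a disparity map to a depth map using Z = f * b / rho (eq. 2.6).

    disparity: array of disparities in pixels (valid where > 0)
    focal_length: camera focal length in pixels
    baseline: distance between the two cameras in meters
    """
    depth = np.full_like(disparity, np.inf, dtype=np.float64)  # zero disparity = infinitely far
    valid = disparity > 0
    depth[valid] = focal_length * baseline / disparity[valid]
    return depth

# Hypothetical values: f ~ 721 px, b ~ 0.54 m
disparity = np.array([[10.0, 50.0], [0.0, 100.0]])
print(disparity_to_depth(disparity, focal_length=721.0, baseline=0.54))
```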

2.2.3 Temporal monocular vision

A similar concept exists in the sequential situation, known as optical flow. The optical flow links two images from the same camera at two different time stamps. As with the disparity, the optical flow is defined by an implicit equation:

$$I_{t+1}(x, y) = I_t(x + V_x(x, y), y + V_y(x, y)) \qquad (2.7)$$

where V_x and V_y represent the optical flow along the two axes.

The first difference between those two concepts is dimensional. The calibration of the cameras in the stereoscopic case limits equation 2.5 to a single axis, instead of the two dimensions of equation 2.7. This is why it is easier to estimate the disparity than the optical flow. The second difference is that there exists no direct formula linking optical flow and depth.

The study of the optical flow can seem useless, but as both images come from the same camera, given the displacement of the camera (that is to say, the rotation R and translation T), it is possible to link the optical flow to the 3D coordinates of the points, which depend on the depth.

2.3 Deep Learning

This section aims to provide a quick overview of how a deep neural network is built.

2.3.1 Composition of a deep convolutional neural network

Microscopic approach

A convolutional neural network can be decomposed as a succession of three different kinds of operations. This stack of operations composes what is called a layer of the network. Mathematically speaking, if F represents the function of the network and f_i the function of each layer, the following equation describes the behavior of the network:

$$F(X) = f_n \circ \dots \circ f_1(X) \qquad (2.8)$$

where n represents the total number of layers of the network and X its input. Most often, the input is a 3-dimensional matrix representing an image. The first dimension is called the depth or channels; the two others are the height and width of the input image. Each 2-dimensional matrix for each channel value is called a feature map.

The width and height of those feature maps are most of the time reduced throughout the neural network. This reduction can be performed by different operations of the network (convolution, max pooling).


The max pooling operation consists in keeping the maximum value of each square of size 2x2 of the feature map. This results in a reduction by a factor of 2.
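For instance, a 2x2 max pooling halves the spatial resolution of a feature map; a one-line check in PyTorch (the library used later in this thesis), with an arbitrary example shape:

```python
import torch
import torch.nn as nn

pool = nn.MaxPool2d(kernel_size=2)   # keeps the max of each 2x2 square
x = torch.randn(1, 64, 32, 32)       # (batch, channels, height, width)
print(pool(x).shape)                 # torch.Size([1, 64, 16, 16])
```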

The first generic operation of a layer is the convolution. It aims to filter the input to extract the useful information. The explicit formula of the convolution is the following:

$$(X * C(k, s))(z, x, y) = \sum_{n=1}^{N} \sum_{i=1}^{k} \sum_{j=1}^{k} X(n, s\cdot x - i, s\cdot y - j)\, C_z(n, i, j) \qquad (2.9)$$

The convolution has different parameters. The first one is the kernel size k, which can be seen as the size of the receptive field. The second one, s, is called the stride. Usually s = 1, but when s = 2 the output width and height are halved. The depth of the output is the last parameter that can be set.

The second operation of the layer is batch normalization, introduced by Ioffe and Szegedy [9]. This operation aims to normalize and center the coefficients of the input tensor. It has two parameters: the batch mean, called m, and the batch standard deviation, called σ. The operation is the classic normalization:

$$BN(X) = \frac{X - m}{\sigma} \qquad (2.10)$$

where BN represents the batch normalization function. This operation is not compulsory but is present most of the time to increase either the performance or the convergence speed of the network.

The last operation of the layer is a non-linear operation. It can be performed by different functions such as:

• ReLU: f(x) = max(0, x)

• Sigmoid: f(x) = (1 + e^{-x})^{-1}

• Hyperbolic tangent: f(x) = tanh(x)

The ReLU function is the most used due to its effectiveness relative to its computational cost.
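Putting the three generic operations together, one such layer can be sketched in PyTorch; the channel sizes and kernel parameters below are arbitrary examples, not those of a specific network from this thesis:

```python
import torch
import torch.nn as nn

def conv_layer(in_channels, out_channels, kernel_size=3, stride=1):
    """One generic layer: convolution, then batch normalization, then ReLU."""
    return nn.Sequential(
        nn.Conv2d(in_channels, out_channels, kernel_size,
                  stride=stride, padding=kernel_size // 2),
        nn.BatchNorm2d(out_channels),
        nn.ReLU(inplace=True),
    )

layer = conv_layer(3, 64, kernel_size=7, stride=2)  # stride 2: halves height and width
x = torch.randn(1, 3, 192, 480)
print(layer(x).shape)  # torch.Size([1, 64, 96, 240])
```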

Macroscopic structure

Most often, the structure of a deep neural network is very similar regardless of its goal. The first part of the network usually aims to encode the input images. This encoding results in spatially small features, whose width and height can be, for example, 32 times smaller than those of the original image. However, we usually see a high number of feature maps at this level of encoding, containing useful information on the input with a reduced total number of dimensions.

Then, depending on the function the network aims to fulfill, it performs a decoding procedure. This decoding decreases the number of channels and transforms the features into the desired form of output. For classification, for instance, it results in a probability vector over the classes. In the depth prediction case, it results in an image representing the depth map.

Training procedure

In order to achieve its goal, a network needs to be trained for a specific task on a dataset. Initially, all the parameters of the network (the coefficients of the convolutions and batch normalizations) are initialized randomly. Then the network is trained by minimizing a loss function, which can be formulated as:

$$Loss = \sum_{(X, Z') \in S} Error(F(X), Z') \qquad (2.11)$$

where X represents the input image tensor, Z' the ground truth, and S the dataset. Usually the norm 1 or 2 is chosen as the Error function, but more complex functions can also be used.

The minimization of this function is performed by stochastic gradient descent. All the experiments conducted during this thesis used the Adam optimizer, introduced by Kingma and Ba [10], which is one of the most efficient stochastic gradient methods for training neural networks. The gradient is computed by backpropagation, first introduced for neural networks by LeCun et al. [13].
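A minimal training loop matching this description (a norm-1 error, the Adam optimizer, gradients by backpropagation) might look as follows; the model and data here are placeholders, not the networks studied in this thesis:

```python
import torch
import torch.nn as nn

model = nn.Conv2d(3, 1, kernel_size=3, padding=1)   # placeholder network F
optimizer = torch.optim.Adam(model.parameters(), lr=1e-4)
criterion = nn.L1Loss()                             # norm-1 error on the depth

# Placeholder dataset: pairs (X, Z') of input image and ground-truth depth
dataset = [(torch.randn(1, 3, 64, 64), torch.rand(1, 1, 64, 64)) for _ in range(8)]

for epoch in range(2):
    for X, Z_gt in dataset:
        optimizer.zero_grad()
        loss = criterion(model(X), Z_gt)  # Error(F(X), Z') from eq. 2.11
        loss.backward()                   # gradients by backpropagation
        optimizer.step()                  # one Adam update
```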

2.3.2 Experimental implementation

One of the main drawbacks of deep neural network methods is their computational cost. Specific libraries such as Torch, TensorFlow, Caffe, and PyTorch have been developed over the last five years. All the experiments performed during this thesis were implemented with PyTorch. This young Python library, created by Paszke et al. [16], was released in early 2017. It implements all the operations discussed in the previous section efficiently on GPUs to speed up training.


Chapter 3

Related Work

3.1 ResNet

Before reviewing the different preexisting deep neural networks for depth prediction, it is essential to briefly explain one of the main recent architectures that revolutionized the image classification challenge: ResNet, first introduced by He et al. [8]. This network is essential for us because all the networks explained in this chapter reuse its architecture in their encoding part.

The main contribution of this network is the creation of a new module with a skip connection: an additional branch performing the identity function and skipping some convolutions. The feature maps resulting from the skipping branch are then added to the feature maps of the main branch.

This module exists in two forms: a skip module (Figure 3.1) and a skip projection module (Figure 3.2). The only difference is the presence or not of a convolution in the skip connection; this convolution is present only to make sure that the features have the same size before the addition.

Figure 3.1: Illustration of ResNet Skip, one of the residual modules introduced by He et al. [8]. This module does not change the number of channels or the resolution of the features.

Figure 3.2: Illustration of ResNet Proj, one of the residual modules introduced by He et al. [8]. This module is used to increase the number of features and decrease the resolution.

The basic idea behind this module is to ease the convergence of the network. Due to the large number of layers, the gradient tends to vanish during backpropagation. The skip connection limits this effect by providing "shortcut" paths during backpropagation, thus allowing deeper networks to be trained and further increasing the achieved performance.

Nowadays, most state-of-the-art networks, no matter the task, encode the input with ResNet modules before applying task-specific modules.

Different versions of this network exist depending on the number of layers; ResNet50, for instance, refers to the ResNet with 50 layers.
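As an illustration, a minimal residual block is sketched below in PyTorch. The actual ResNet50 modules use a three-convolution bottleneck design, so this two-convolution variant only illustrates the skip-connection idea, not the exact module:

```python
import torch
import torch.nn as nn

class SkipBlock(nn.Module):
    """Residual block: the input bypasses the convolutions and is added back."""
    def __init__(self, channels):
        super().__init__()
        self.body = nn.Sequential(
            nn.Conv2d(channels, channels, 3, padding=1),
            nn.BatchNorm2d(channels),
            nn.ReLU(inplace=True),
            nn.Conv2d(channels, channels, 3, padding=1),
            nn.BatchNorm2d(channels),
        )
        self.relu = nn.ReLU(inplace=True)

    def forward(self, x):
        return self.relu(self.body(x) + x)  # skip connection: identity + residual

x = torch.randn(1, 256, 24, 24)
print(SkipBlock(256)(x).shape)  # unchanged: torch.Size([1, 256, 24, 24])
```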

3.2 Prediction from a single image

Most recent deep learning approaches tackle the problem of depth prediction using only a single image, instead of two images as is the case with classical approaches.

3.2.1 Eigen Network

The first network to tackle the problem of depth prediction was published by Eigen, Puhrsch, and Fergus [2]. Figure 3.3 details the composition of this network. It can be decomposed into two small networks: the first one, in blue in Figure 3.3, predicts a coarse depth map; the second one, in orange, refines the coarse result.

Figure 3.3: Architecture of the Eigen network with all the different operations performed.

Several differences can be observed between those two. The first one is the presence, in the coarse network, of two fully connected layers. These layers perform operations that link every coefficient of the features: local specificity is lost, but the global structure is captured. The second network performs only a few convolutions with small kernels to refine the information locally.

This network was the first major contribution to monocular depth prediction using deep neural networks.

3.2.2 Laina Network

The state-of-the-art network for monocular depth prediction was implemented by Laina et al. [12]. Figure 3.4 illustrates its microscopic composition. It presents a classical encoder-decoder structure, as explained in the previous section.

The encoding operation is performed by ResNet50.

The decoding operation is specific to this network. The basic idea is to reduce the number of features during upsampling. To do so, Laina et al. [12] introduce a new module called up-projection. This module first performs an unpooling to increase the width and height of the features, and then convolutions to decode the information by reducing the number of features. The unpooling operation consists in placing the values of a feature map into a twice larger feature map otherwise filled with zeros: the coefficients fill only one column out of two and one row out of two.
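The unpooling step described above can be written directly; this sketch covers only the zero-filling step, not the full up-projection module:

```python
import torch

def unpool(x):
    """Place each value of x in the top-left corner of a 2x2 cell of zeros."""
    b, c, h, w = x.shape
    out = torch.zeros(b, c, 2 * h, 2 * w, dtype=x.dtype, device=x.device)
    out[:, :, ::2, ::2] = x   # fill one row out of two, one column out of two
    return out

x = torch.arange(4.0).reshape(1, 1, 2, 2)
print(unpool(x))  # 4x4 map with the original values interleaved with zeros
```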

Figure 3.5 illustrates the microscopic structure of this module. One can notice the presence of a skip connection identical to the one of the residual module of ResNet; it was designed to achieve the same goal.

Figure 3.4: Architecture of the Laina network with all the different operations performed.

Figure 3.5: Illustration of the up-projection module introduced by Laina et al. [12]. The number of channels is decreased at the end of this module and the resolution of the features is multiplied by 2.

This network architecture is quite simple in the sense that it inputs a single image and outputs its depth map. It achieves state-of-the-art results on several academic datasets, such as the NYU depth dataset [15], and justifies the encoding-decoding approach for depth prediction. All the following networks use a similar architecture.

3.3 Stereoscopic depth inference

Very recently, some deep neural networks have integrated stereoscopic information. Two publications have tackled this problem.

Figure 3.6: Illustration of the macro architecture of the network introduced by Kuznietsov, Stückler, and Leibe [11].

3.3.1 Kuznietsov network

Kuznietsov, Stückler, and Leibe [11] published an article in which they introduce a network achieving state-of-the-art performance on the KITTI dataset. Its structure is similar to the Laina network; the only difference lies in the loss function.

Indeed, Kuznietsov, Stückler, and Leibe [11] use the stereoscopic image of the pair, that is to say the image issued from the other calibrated camera, to add a term to the error. According to equation 2.6, if the depth is known, one can retrieve the disparity and, using equation 2.5, reconstruct the other image of the pair. The loss function they introduce is the following:

$$Loss = \| F(I_l) - Z' \| + \| I_r(x - \rho(x, y), y) - I_l \| \qquad (3.1)$$

where F represents the network function, Z' the ground truth, and ρ the disparity computed from the predicted depth. This loss function forces the model not only to predict depth with pixel values close to the ground truth, but also depth that leads to a consistent stereo reconstruction.

It is interesting to notice that one can keep only the second term of the loss function, ||I_r(x - ρ(x, y), y) - I_l||. By doing so, the ground truth Z' is no longer necessary to train the model. The model can then be


trained with a dataset containing only stereoscopic images. This is called unsupervised learning.
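As an illustration of such a reconstruction term, the sketch below warps the right image with a predicted disparity map using bilinear sampling (PyTorch's grid_sample) and takes an L1 error. The sign convention follows equation 2.5, and all shapes and values are placeholders rather than those of the actual networks:

```python
import torch
import torch.nn.functional as F

def reconstruct_left(right, disparity):
    """Warp the right image with a disparity map: I_l(x, y) ~ I_r(x - rho(x, y), y)."""
    b, _, h, w = right.shape
    ys, xs = torch.meshgrid(torch.arange(h, dtype=torch.float32),
                            torch.arange(w, dtype=torch.float32), indexing="ij")
    xs = xs.unsqueeze(0) - disparity.squeeze(1)   # shift sampling positions by -rho
    ys = ys.unsqueeze(0).expand_as(xs)
    # grid_sample expects (x, y) coordinates normalized to [-1, 1]
    grid = torch.stack((2 * xs / (w - 1) - 1, 2 * ys / (h - 1) - 1), dim=-1)
    return F.grid_sample(right, grid, align_corners=True)

left = torch.rand(1, 3, 64, 128)                  # placeholder left image
right = torch.rand(1, 3, 64, 128)                 # placeholder right image
disp = torch.full((1, 1, 64, 128), 4.0)           # placeholder: constant 4-pixel disparity
loss = (reconstruct_left(right, disp) - left).abs().mean()  # second term of eq. 3.1
```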

3.3.2 Godard Network

Another network, published by Godard, Mac Aodha, and Brostow [6], uses stereoscopic images in an unsupervised way to train for depth prediction.

Figure 3.7: Illustration of the training procedure of the networks introduced by Godard, Mac Aodha, and Brostow [6].

This network has a very similar structure: an encoding part performed by ResNet50 modules followed by a specific decoding. The training procedure is slightly different because it takes into account the predictions from both the left and the right image to make the depth predictions more consistent. This network is trained without depth ground truth. Figure 3.7 illustrates how the training procedure works. First, the network outputs two disparities, d_l and d_r, which are used to reconstruct the right image from the left image and vice versa. The loss function is composed of three terms. The first one is an error on the two stereo reconstructions, defined by:

$$Loss_{reconstruction} = \| I_r(x - d_r(x, y), y) - I_l \| + \| I_l(x + d_l(x, y), y) - I_r \| \qquad (3.2)$$

(25)

The second term enforces consistency between the two disparities:

$$Loss_{disparities} = \| d_l(x, y) - d_r(x + d_l(x, y), y) \| \qquad (3.3)$$

The loss function also contains a smoothness cost, which aims to align the gradients of the depth prediction with the gradients of the image I:

$$Loss_{smoothness} = \| \partial_x d \| \, e^{-|\partial_x I|} + \| \partial_y d \| \, e^{-|\partial_y I|} \qquad (3.4)$$

The total loss function is the sum of the three terms for both the left and the right disparities.
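A sketch of the smoothness term of equation 3.4, with finite differences as the gradients and the image gradient averaged over the color channels (an assumption; the exact reduction is not specified above):

```python
import torch

def smoothness_loss(disp, image):
    """Edge-aware smoothness (eq. 3.4): |dx d| e^{-|dx I|} + |dy d| e^{-|dy I|}."""
    dx_d = (disp[:, :, :, 1:] - disp[:, :, :, :-1]).abs()
    dy_d = (disp[:, :, 1:, :] - disp[:, :, :-1, :]).abs()
    # average the image gradients over the color channels
    dx_i = (image[:, :, :, 1:] - image[:, :, :, :-1]).abs().mean(1, keepdim=True)
    dy_i = (image[:, :, 1:, :] - image[:, :, :-1, :]).abs().mean(1, keepdim=True)
    return (dx_d * torch.exp(-dx_i)).mean() + (dy_d * torch.exp(-dy_i)).mean()

loss = smoothness_loss(torch.rand(1, 1, 32, 32), torch.rand(1, 3, 32, 32))
```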

This network is nowadays the most efficient one for unsupervised depth prediction on the KITTI dataset.

3.4 Sequential inference

Following the same idea, very recent networks have tried to use sequential information to increase their performance.

3.4.1 Vijayanarasimhan Network

Vijayanarasimhan et al. [18] published a network this year which uses sequential information to improve its predictions. Figure 3.8 presents its architecture, which is quite different from the previous networks. A first network predicts the depth from a single image. A second network inputs a sequential pair of images and predicts an object mask as well as the motion of the camera and of each object between the two frames. The idea is to compute the optical flow from the depth and motion estimates, in order to add a reconstruction error to the loss function, as was already done in the stereoscopic situation with the disparity.

To do so, one can compute the 3D coordinates of each pixel, given the parameters of the camera, using equation 2.3. From those 3D points, one can use equation 2.4 to project them onto the image at time t + 1. The rotation and translation matrices are obtained by composing the object and camera motion predictions. Once the projection onto the image at t + 1 is performed, one can compute the optical flow by comparing the displacement of each pixel between the two frames.

Figure 3.8: Architecture of the network introduced by Vijayanarasimhan et al. [18].

The loss function contains a term based on the reconstruction error using the optical flow. In this case it takes the following form:

$$Loss = \| F(I_t) - Z' \| + \| I_{t+1}(x + V_x(x, y), y + V_y(x, y)) - I_t \| \qquad (3.5)$$

where V corresponds to the estimated optical flow, Z' to the ground truth, and I_t to the image at time t. As with both previous networks, this network can be trained in an unsupervised way, namely without depth ground truth, if the first term of the loss function is removed.

Another network, introduced by Zhou et al. [19] this year, presents a similar architecture and training procedure. The only difference is the absence of the object mask and object motions. This network requires that the only motion between two instants be the camera motion, unlike the network of Vijayanarasimhan et al., which also integrates object motions.

3.5 Extension to similar problems

All the networks of the previous sections have one common point: they only input a single image to predict depth. To the best of our knowledge, there is no state-of-the-art network using a pair of images as input. This is why this thesis aims to propose a novel architecture to do so.

Figure 3.9: Architecture of FlowNet, introduced by Dosovitskiy et al. [1].

However, we found some networks that take a pair of images as input but are not designed for depth prediction.

3.5.1 FlowNet

For instance, Dosovitskiy et al. [1] introduced a network to predict optical flow from a sequential pair. This network presents two specificities in its architecture, which is illustrated in Figure 3.9. The first one is the presence of a correlation module: both images are first encoded independently and are then correlated (yellow operations in Figure 3.9). This correlation operation will be explained further in the next chapter; roughly, it can be understood as a scalar product between features along the depth dimension. The correlation features are then encoded deeper and injected into a refinement module, illustrated in Figure 3.10. The second specificity is that the network outputs four different optical flow maps at different resolutions. Each optical flow map, once predicted, is re-injected into the network to help the decoding.

Figure 3.10: Architecture of the refinement module introduced by Dosovitskiy et al. [1].

3.6 Synthesis

This section synthesizes the core elements of the previous networks in order to build the most efficient network possible.

• All the recent and most efficient networks presented in the previous sections are based on an encoding-decoding architecture, where the encoding is performed by layers coming from ResNet50. In order to perform best, the network proposed in this thesis reuses this scheme.

• Several comparisons can be made. Both the Eigen network and FlowNet use a global-to-fine approach, even though they do not fulfill the same function. This is similar to preexisting disparity estimation methods.

• The performance of the prediction can be increased if temporal or stereoscopic information is available. Most of the time, this information is integrated via the loss function. FlowNet, on the other hand, inputs the temporal information directly and performs a specific operation: a correlation. As the goal of this thesis is to build a model that inputs a pair of images, the correlation module should be part of the network design.


Chapter 4

Approach

This chapter presents a novel network for two-image inference: MSDOS-Net (Multi-Scale Depth Optimization Strategy Network). This network predicts depth from a pair of images, either sequential or stereoscopic. Its design is inspired by the previous state-of-the-art networks. MSDOS-Net introduces three contributions over preexisting networks, in order to address their drawbacks: a pyramidal structure inspired by classical disparity estimation, a new decoding module called EnF-DED, and a new training strategy.

4.1 MSDOS-Net Architecture

The overall MSDOS-Net architecture is presented in Figure 4.1. The model can be decomposed into three separate macroscopic modules, detailed in the sections hereafter.

The first module transposes a classical pyramidal correlation approach into several encoders operating on different image resolutions. The encoded images are first concatenated, then correlated with the features of the second image using the correlation layer first introduced by FlowNet (Dosovitskiy et al. [1]). This module is explained in the following section. The resulting features are concatenated with a part of the encoded images, so that some monocular information moves forward in the network.

In order to enforce the robustness of the correlation, the same operation is performed on feature maps picked up at different resolutions in our encoding network pyramid (shown in red and green in Figure 4.1). Each output of this Multi-Scale Coarse-to-Fine (MSCF) module is processed by a Depth Encoder-Decoder (DED) component, which predicts a depth map for the corresponding level of detail. Inference is performed sequentially, in a coarse-to-fine manner: the outputs of the first DED module initialize the second one, and so on.

Figure 4.1: MSDOS-Net overall architecture: a coarse-to-fine depth map prediction from a pyramidal left-right encoding. Correlations are performed at multiple resolutions (in blue, green and red in the figure) and integrated successively in the corresponding Expand and Fuse modules.

Between consecutive DEDs lies the last key component of our architecture: an up-sampling module which doubles the size of the prior depth map and features, then concatenates these two with both the corresponding correlation result and a down-sampled instance of the reference input image (typically the left one in our stereoscopic study). This module is referred to as the Expand and Fuse (EnF) module.

A last EnF-DED-like sequence refines the depth map to half the resolution of the input images.

For clarity of explanation, the pyramidal decomposition is limited to three levels in this thesis, but the proposed approach could easily be adapted to any input resolution.

4.1.1 Multi-Scale Coarse-to-Fine (MSCF) module

Stereoscopic or sequential inputs are first down-sampled following a pyramidal decomposition scheme, each level being half the resolution of the previous one.

The resulting images are encoded down to the desired feature map size. This encoding part is partially composed of modules from a ResNet50 pre-trained on the ImageNet dataset, to ease convergence, and is very similar to the preexisting networks detailed in the previous chapter. Table 4.1 shows the specific architecture for each input resolution. One can notice that deeper encodings are performed at higher resolutions, so as to obtain the same final dimensions.

Considering the left and right inputs separately, feature maps of equal resolutions are picked from the down-sampled encoding branches and concatenated. The number of these samples is limited in the following way: every encoder contributes to the coarsest level (in blue in Figure 4.1), while two out of three provide mid-size features and only one the finest output.

Figure 4.2: Detail of the correlation layer, which consists in the inner product of the left and right encoded images. The number of outputs is fixed to 473 features, no matter the size of the correlation; the extra features come from a single convolution of the left image.

The left and right aggregated features are then matched at each resolution. A correlation layer repeats the principle detailed in FlowNet [1], introducing a module that performs multiplicative patch comparisons by convolving left and right data; thus, it has no trainable weights. Dosovitskiy et al. [1] detail the correlation of two patches of size Ω, centered at x_1 and x_2 in the feature maps f_1 and f_2, as follows:

$$C(x_1, x_2) = \sum_{o \in \Omega} f_1(x_1 + o) \cdot f_2(x_2 + o) \qquad (4.1)$$

This operation is repeated all around x_2, in a neighborhood expected to contain the effective matching displacement. The prospected area should depend on both the considered application and the down-sampling level. Indeed, working on stereoscopic pairs implies prior knowledge of the disparity direction, while temporal overlap does not provide such a constraint.

The output of the correlation layer is thus sized by the explored neighborhood. For instance, computing the correlation over a 7 x 7 pixel area results in 49 correlation features. In the multi-scale framework, the number of these outputs is fixed to 473 features, no matter the size of the correlation; the extra features come from a single convolution on the encoded images (see Figure 4.2). Table 4.2 summarizes the effective correlation parameters used to train on Virtual KITTI for temporal inter-frame consistency.
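A naive, loop-based version of the correlation of equation 4.1 is sketched below for illustration. FlowNet-style implementations use an optimized GPU kernel, and the unconstrained square neighborhood here ignores the stereo prior on the disparity direction discussed above:

```python
import torch
import torch.nn.functional as F

def correlation(f1, f2, max_disp=3):
    """Inner product of f1 with shifted copies of f2 (eq. 4.1), one output
    channel per displacement: (2 * max_disp + 1) ** 2 channels in total."""
    b, c, h, w = f1.shape
    f2_pad = F.pad(f2, [max_disp] * 4)   # zero-pad height and width
    out = []
    for dy in range(2 * max_disp + 1):
        for dx in range(2 * max_disp + 1):
            shifted = f2_pad[:, :, dy:dy + h, dx:dx + w]
            out.append((f1 * shifted).sum(dim=1, keepdim=True))
    return torch.cat(out, dim=1)

corr = correlation(torch.randn(1, 64, 12, 40), torch.randn(1, 64, 12, 40))
print(corr.shape)  # torch.Size([1, 49, 12, 40]) for a 7x7 neighborhood
```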


Image 1:1 branch: Conv^7_2(3, 64) -> Conv^5_2(64, 64) -> [high-res corr.] -> Conv^5_2(64, 64) -> Resnet1(64, 256) -> [mid-res corr.] -> Resnet2(256, 512) -> [low-res corr.]

Image 1:2 branch: Conv^7_2(3, 64) -> Conv^5_2(64, 64) -> [mid-res corr.] -> Conv^5_2(64, 64) -> Resnet1(64, 256) -> [low-res corr.]

Image 1:4 branch: Conv^7_2(3, 64) -> Conv^5_2(64, 64) -> [low-res corr.]

Table 4.1: Architecture of the encoders for the different input resolutions, where Conv^k_s(channels_in, channels_out) represents a convolution of kernel size k and stride s which takes channels_in channels and returns channels_out. All convolutions are followed by a batch normalization step and a ReLU non-linearity. Resnet_i represents the i-th layer of ResNet50, which is composed of 4 global layers. Bracketed entries mark where the feature maps are tapped for the correlations.

Input res. | Corr. size | Corr. feat. | Mono. feat.
Low-res | (7, 7) | 49 | 424
Mid-res | (11, 11) | 121 | 352
Full-res | (21, 21) | 441 | 32

Table 4.2: Size of every correlation and of the output features with respect to the input resolution.


4.1.2 Coarse-to-Fine Inference

The basic idea that led to the design of this network is multi-scale inference.

As explained above, the successively down-sampled inputs and their correlations at different levels of encoding realize the multi-scale encoder. On the other hand, the refinement module, i.e. the decoding part of our network, also integrates a multi-scale strategy with a coarse-to-fine prediction, inspired by Dosovitskiy et al. [1].

After a first encoding-decoding step, the network outputs a coarse depth map at low resolution, which is then refined with the help of the next correlation result, and so on. In total, four different depth maps are predicted.

The results of the next two correlations are successively injected into the refinement, concatenated with the prior depth map and with the reference input image down-sampled to the right resolution. As all this information comes from different sources, a short encoder network has been added to our refinement model each time new information is incorporated. This module aims to homogenize the information in order to increase the quality of our prediction.

In practice, the correlated data at every resolution are encoded deeper, decoded, and up-scaled by alternating DED and EnF modules.

The first part of a DED, a ResNet projection shown in Figure 3.2, increases the number of features and lowers the resolution. Two ResNet skips (Figure 3.1) are then added to increase the depth. The DED decodings follow the original up-projection scheme proposed by Laina et al. [12] (Figure 3.5).

While the DED does not change the overall resolution of the feature maps, it adds a dedicated output branch that transforms the features into a depth map.

The Expand and Fuse (EnF) module initializes every higher-resolution prediction. To do so, it combines three functional parts. First, an up-scaling layer re-sizes the prior depth map by a factor of two. Another up-projection, from Laina et al. [12], multiplies the resolution of the input feature maps by two. Lastly, a fusion stage concatenates the up-scaled depth map and the up-projected features with both the corresponding MSCF outputs and the previously down-sampled reference input image.
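Under the description above, the data flow of an EnF stage could be sketched as follows. This is a rough functional sketch under stated assumptions: the up-projection of Laina et al. is replaced here by bilinear upsampling followed by a convolution, and all shapes are illustrative, so it should not be read as the exact module:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class ExpandAndFuse(nn.Module):
    """Sketch of EnF: upscale the prior depth map and features, then fuse them
    with the corresponding correlation output and the down-sampled image."""
    def __init__(self, feat_in, feat_out):
        super().__init__()
        self.up_feat = nn.Conv2d(feat_in, feat_out, 3, padding=1)

    def forward(self, depth, feats, corr, image):
        depth_up = F.interpolate(depth, scale_factor=2, mode="bilinear",
                                 align_corners=False)
        feats_up = self.up_feat(F.interpolate(feats, scale_factor=2,
                                              mode="bilinear",
                                              align_corners=False))
        return torch.cat([depth_up, feats_up, corr, image], dim=1)

enf = ExpandAndFuse(512, 256)
fused = enf(torch.rand(1, 1, 8, 8), torch.rand(1, 512, 8, 8),
            torch.rand(1, 121, 16, 16), torch.rand(1, 3, 16, 16))
print(fused.shape)  # torch.Size([1, 381, 16, 16]) = 1 + 256 + 121 + 3 channels
```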

Figure 4.3: Detail of the Expand and Fuse (EnF) module.

The overall sequence is as follows. From the lowest correlation layer, a first DED module provides a coarse depth map. Then, three additional EnF-DED couples refine this prediction up to half the input image resolution. It should be noted that the last EnF module does not aggregate any correlation data.

Table 4.3 summarizes the different operations performed in the refinement.

Module type | Channels in | Channels out | Scale
Up-projection | 1024 | 512 | x2
Up-projection | 512 | 256 | x4
Concatenation | 256 | 516 | x4
Resnet Proj | 516 | 1032 | x2
Resnet Skip | 1032 | 1032 | x2
Resnet Skip | 1032 | 1032 | x2
Up-projection | 1032 | 516 | x4
Up-projection | 516 | 128 | x8
Concatenation | 128 | 196 | x8
Resnet Proj | 196 | 392 | x4
Resnet Skip | 392 | 392 | x4
Resnet Skip | 392 | 392 | x4
Up-projection | 392 | 196 | x8
Up-projection | 196 | 64 | x16
Concatenation | 64 | 68 | x16
Resnet Proj | 68 | 136 | x8
Resnet Skip | 136 | 136 | x8
Resnet Skip | 136 | 136 | x8
Up-projection | 136 | 68 | x16
Convolution | 68 | 1 | x16

Table 4.3: Summary of the different operations in the refinement and of the evolution of the number of channels.

4.2 Multi-Scale Training Approach

For the training procedure, this thesis proposes a new approach which consists in training the network to learn each output resolution sequentially, starting with the lowest one, instead of training all the predictions together or having a single full-resolution target. This approach is inspired by the pyramidal technique commonly used in classical image processing (for contrast enhancement, for instance) and aims to first train our network to predict the global structure of the depth map and then to gradually refine it, thanks to the images, intermediate depth map predictions, and correlations injected into the refinement.

To implement this multi-scale training, this thesis introduces an evolutive loss that depends on the resolution of the output being trained:


$$L_i = \sum_{k=1}^{i} \frac{1}{2^{2(i-k)}} \left\| Z_k - Z'_k \right\|_2^2 \qquad (4.2)$$

where Z'_k is the ground-truth depth at scale k, Z_k our prediction, and i the resolution index: i = 1 denotes the lowest resolution and i = 4 the highest. The coefficients are chosen so as to put the same weight on the error of every pixel of every map. The training phase begins with 2 epochs using L_1, then 2 epochs with L_2, 5 epochs with L_3, and finally L_4.
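A sketch of this evolutive loss in PyTorch; the per-scale weights follow equation 4.2 as written above, and the tensors are placeholders:

```python
import torch

def multiscale_loss(preds, gts, i):
    """Evolutive loss L_i (eq. 4.2): train the i lowest-resolution outputs,
    down-weighting coarser maps relative to the current finest scale."""
    loss = 0.0
    for k in range(1, i + 1):
        weight = 1.0 / 2 ** (2 * (i - k))
        loss = loss + weight * ((preds[k - 1] - gts[k - 1]) ** 2).sum()
    return loss

# Four predictions, each scale doubling the resolution of the previous one
preds = [torch.rand(1, 1, 24 * 2 ** s, 80 * 2 ** s) for s in range(4)]
gts = [torch.rand_like(p) for p in preds]
print(multiscale_loss(preds, gts, i=2))  # L_2, used in the second training stage
```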


Chapter 5

Experimentation and results

The training approach and the network architecture have been evaluated on two different datasets: the raw sequences of KITTI, introduced by Geiger et al. [4], and the synthetic sequences of Virtual KITTI, designed by Gaidon et al. [3]. KITTI, one of the most used datasets for depth prediction, is composed of several stereo RGB sequences from different environments such as suburbs, cities, and highways. Along with them, the dataset provides 3D laser measurements from a LiDAR, which result in sparse depth maps for each image. The training, validation and test split is the one proposed by Eigen, Puhrsch, and Fergus [2].

Virtual KITTI, on the other hand, provides perfect dense depth maps for each image of its synthetic monocular sequences. One sequence has been isolated to build the test set, and the four others constitute the training set.

Experimenting with the model on those two datasets explores several settings, such as training on sparse and dense labels and inferring from stereo and temporal inputs, in order to prove the efficiency of this network in different contexts.

5.1 Implementation Details

All the residual layers of the encoding part of the network are initialized with a pre-trained ResNet-50. The decoding part is initialized with the commonly used Xavier initialization (Glorot and Bengio [5]). The optimization has been performed with the Adam optimizer, with parameters 0.9 and 0.99. The training used an Nvidia GTX 1080 with 8 GB of memory. The learning rate has been fixed to 0.0001 and exponentially decayed after 15 epochs.

For KITTI, crops of size 480x192 were used for training, and crops of size 320x256 for Virtual KITTI. For testing, the images were not re-sized for either dataset and are thus inferred at full resolution.

Every training run has been realized according to the proposed multi-scale training approach.

5.2 Evaluation Metrics

The proposed model was evaluated on each dataset according to the following criteria:

$$RMSE: \sqrt{\frac{1}{T}\sum_{i=1}^{T} \left\| Z(x_i) - Z'(x_i) \right\|_2^2} \qquad (5.1)$$

$$ARD: \frac{1}{T}\sum_{i=1}^{T} \frac{\left\| Z(x_i) - Z'(x_i) \right\|_1}{Z'(x_i)} \qquad (5.2)$$

$$SRD: \frac{1}{T}\sum_{i=1}^{T} \frac{\left\| Z(x_i) - Z'(x_i) \right\|_2^2}{Z'(x_i)} \qquad (5.3)$$

$$Accuracy: \frac{\left|\left\{ i \in \{1, \dots, T\} \;\middle|\; \max\!\left(\frac{Z(x_i)}{Z'(x_i)}, \frac{Z'(x_i)}{Z(x_i)}\right) < thr \right\}\right|}{T} \qquad (5.4)$$

where T represents the total number of pixels where ground truth is available, that is, all the pixels of the image if the ground truth is dense, or only a small subset if it is sparse; Z represents the prediction made by our model and Z' the correct depth. ARD stands for absolute relative difference and SRD for squared relative difference. All depths have been capped to 80 m.
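These criteria translate directly into code. Below is a sketch with NumPy, assuming sparse ground truth is marked by zero-valued pixels and applying the 80 m cap; the random inputs are placeholders:

```python
import numpy as np

def depth_metrics(pred, gt, thr=1.25):
    """RMSE, ARD, SRD and threshold accuracy over pixels with ground truth.

    pred: predicted depth map Z; gt: ground-truth depth map Z' (0 = no label).
    """
    valid = gt > 0                      # keep only labeled pixels
    pred = np.minimum(pred[valid], 80.0)
    gt = np.minimum(gt[valid], 80.0)
    rmse = np.sqrt(np.mean((pred - gt) ** 2))
    ard = np.mean(np.abs(pred - gt) / gt)
    srd = np.mean((pred - gt) ** 2 / gt)
    acc = np.mean(np.maximum(pred / gt, gt / pred) < thr)
    return rmse, ard, srd, acc

print(depth_metrics(np.random.uniform(1, 80, (100, 100)),
                    np.random.uniform(1, 80, (100, 100))))
```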

5.3 Virtual KITTI

To evaluate the efficiency of the model on perfect dense depth maps, it was first trained on the Virtual KITTI dataset. As stereo images are not provided, two consecutive frames of the same sequence were used to feed the network. For comparison, the state-of-the-art network for monocular depth prediction [12] was also trained on this specific dataset.

Figure 5.1: Illustration of the multi-scale approach on Virtual KITTI. From top to bottom: one of the input images, the ground truth, and the four predictions, from the lowest resolution to the highest.


Figure 5.2: Illustration of our best model on Virtual KITTI. The top left image is one of the RGB images of the sequential pair. The bottom left image is the prediction of the state-of-the-art network. The top right image is the dense ground truth, capped to 80 m. The bottom right image is the MSDOS-Net prediction.

Approach | RMSE | ARD | SRD
Laina et al. [12] | 14.110 | 0.627 | 8.036
Ours | 8.612 | 0.459 | 4.153

Table 5.1: Quantitative results of our method and of the state-of-the-art method on the test set of Virtual KITTI. The ground-truth depths have been limited to 80 m to be comparable with the KITTI dataset.


Approach | RMSE | ARD | SRD
Eigen, Puhrsch, and Fergus [2] fine | 7.156 | 0.215 | -
Liu et al. [14] | 6.986 | 0.217 | 1.841
Godard, Mac Aodha, and Brostow [6] | 5.849 | 0.141 | 1.369
Kuznietsov, Stückler, and Leibe [11], supervised only | 4.815 | 0.122 | 0.763
Kuznietsov, Stückler, and Leibe [11], semi-supervised | 4.621 | 0.113 | 0.741
MSDOS, stereo input | 3.102 | 0.098 | 0.520
MSDOS, stereo input + Virtual KITTI | 3.050 | 0.091 | 0.472
MSDOS, sequential input | 4.612 | 0.157 | 1.082

Approach | < 1.25 | < 1.25^2 | < 1.25^3
Eigen, Puhrsch, and Fergus [2] fine | 0.692 | 0.899 | 0.967
Liu et al. [14] | 0.647 | 0.882 | 0.961
Godard, Mac Aodha, and Brostow [6] | 0.818 | 0.929 | 0.966
Kuznietsov, Stückler, and Leibe [11], supervised only | 0.845 | 0.957 | 0.987
Kuznietsov, Stückler, and Leibe [11], semi-supervised | 0.862 | 0.960 | 0.986
MSDOS, stereo input | 0.914 | 0.969 | 0.987
MSDOS, stereo input + Virtual KITTI | 0.921 | 0.968 | 0.987
MSDOS, sequential input | 0.813 | 0.932 | 0.972

Table 5.2: Quantitative results of our method and previous state-of-the-art methods on the test set of KITTI, according to the split of Eigen, Puhrsch, and Fergus [2].

Table 5.1 summarizes the scores obtained on the test set of Virtual KITTI. Since the dataset contains no stereo pairs, and for fair comparison, the model was trained and tested only with sequential information. Predictions on the test set with both models are illustrated in Figure 5.2. The results show an improvement of almost 40% between the two networks. Visually, the depth predictions of the proposed model are much sharper than those inferred by the state-of-the-art network [12]. Thin details, such as the ramifications of the trees, appear in the proposed model's predictions and cannot be seen in the others.

5.4 KITTI

MSDOS-Net has also been evaluated on the KITTI dataset, where only sparse ground truth is available. This evaluation can be performed in two different ways, as KITTI provides both stereo pairs and sequential information.

5.4.1 Comparison with the State-of-the-art

Most state-of-the-art deep neural networks for depth prediction use a single image for inference and use its stereo counterpart in a semi-supervised loss to achieve more consistent predictions. In order to compare our performance with these networks, MSDOS-Net was first trained with stereo pairs. Two different initialization strategies were tried: the same strategy as described for training on Virtual KITTI, and another using the coefficients of the model pre-trained on Virtual KITTI.

Table 5.2 summarizes the performance obtained by the MSDOS-Net model and by state-of-the-art methods for depth prediction. The thesis model pre-trained on Virtual KITTI achieves a new state-of-the-art score on every evaluation metric for depth prediction.

Figure 5.3 illustrates the predictions of the MSDOS model on the test set. To visualize the depth ground truth, the sparse ground truth has been densified by triangulation and linear interpolation; a bilateral filter is then applied to increase sharpness. One interesting thing to notice in the results is that the predictions from MSDOS-Net seem to fit the images better than the reconstruction from the ground truth, although only the sparse points were used for training.

5.4.2 Generalization to temporal sequences

In order to assess the performance of the MSDOS model in the case where the input images come from the same camera, the model was also trained on KITTI with sequential inputs. The scores obtained in this situation can be found in Table 5.2.

Once again, the MSDOS architecture achieves good results, even if they are weaker than those obtained in the stereoscopic case. This can be explained by the fact that more 3D information is contained in a stereo pair than in a sequential one.


Figure 5.3: Illustration of our best model on KITTI, that is to say pre-trained on Virtual KITTI with stereo inputs. The top left image is one of the RGB images of a stereo pair. The bottom left image is the sparse depth ground truth. The top right image is the dense ground truth obtained by Delaunay triangulation of the sparse distribution and linear interpolation, followed by a bilateral filter to increase sharpness. The bottom right image is the MSDOS-Net prediction.


Chapter 6

Conclusion

The huge amount of data available nowadays and the development of new depth sensors make it possible to design algorithms that predict depth with great accuracy. However, most of them only use images from a single camera, which makes the problem ill-posed and consequently hard to solve.

Using several images taken at two different times by the same camera, or a pair of images from stereo cameras (most embedded vision systems provide stereo cameras), should improve the accuracy of the prediction. This is why the core of this thesis is to propose a new method for depth prediction from several inputs and to compare it with single-image-based algorithms.

To do so, this thesis introduces a novel architecture, MSDOS-Net, which achieves new state-of-the-art results for multi-ocular depth prediction on two well-known datasets for self-driving cars: KITTI and Virtual KITTI. The results show an improvement in both accuracy and visual quality.

To achieve such results, several modules of its architecture were inspired by state-of-the-art networks (the correlation module, the coarse-to-fine scheme, the multi-resolution prediction), and several new components were introduced, such as the pyramidal architecture, the decoding module, and the training procedure.

However, since few multi-input depth prediction systems exist yet, it is hard to estimate the performance gap between them and single-input systems.

These contributions can easily be adapted to other tasks, such as semantic segmentation or optical flow estimation. Hence, we think that one of the main extensions of this thesis is to adapt a similar architecture to other deep learning challenges.

Another possible extension could be to explore the potential of the MSDOS model in unsupervised settings, as is the case with several recent deep neural networks for depth prediction.


Bibliography

[1] Alexey Dosovitskiy et al. "FlowNet: Learning Optical Flow with Convolutional Networks". In: 2015 IEEE International Conference on Computer Vision (ICCV 2015), Santiago, Chile, December 7-13, 2015. 2015, pp. 2758-2766. DOI: 10.1109/ICCV.2015.316.

[2] David Eigen, Christian Puhrsch, and Rob Fergus. "Depth Map Prediction from a Single Image using a Multi-Scale Deep Network". In: Advances in Neural Information Processing Systems 27: Annual Conference on Neural Information Processing Systems 2014, December 8-13 2014, Montreal, Quebec, Canada. 2014, pp. 2366-2374.

[3] A. Gaidon et al. "Virtual Worlds as Proxy for Multi-Object Tracking Analysis". In: CVPR. 2016.

[4] Andreas Geiger et al. "Vision meets Robotics: The KITTI Dataset". In: International Journal of Robotics Research (IJRR) (2013).

[5] Xavier Glorot and Yoshua Bengio. "Understanding the difficulty of training deep feedforward neural networks". In: Proceedings of the Thirteenth International Conference on Artificial Intelligence and Statistics. Ed. by Yee Whye Teh and Mike Titterington. Vol. 9. Proceedings of Machine Learning Research. Chia Laguna Resort, Sardinia, Italy: PMLR, 13-15 May 2010, pp. 249-256.

[6] Clément Godard, Oisin Mac Aodha, and Gabriel J. Brostow. "Unsupervised Monocular Depth Estimation with Left-Right Consistency". In: CVPR. 2017.

[7] Richard Hartley and Andrew Zisserman. Multiple View Geometry in Computer Vision. Cambridge University Press, 2003.

[8] Kaiming He et al. "Deep Residual Learning for Image Recognition". In: arXiv preprint arXiv:1512.03385 (2015).

[9] Sergey Ioffe and Christian Szegedy. "Batch Normalization: Accelerating Deep Network Training by Reducing Internal Covariate Shift". In: International Conference on Machine Learning. 2015, pp. 448-456.

[10] Diederik Kingma and Jimmy Ba. "Adam: A Method for Stochastic Optimization". In: arXiv preprint arXiv:1412.6980 (2014).

[11] Yevhen Kuznietsov, Jörg Stückler, and Bastian Leibe. "Semi-Supervised Deep Learning for Monocular Depth Map Prediction". In: CoRR abs/1702.02706 (2017).

[12] Iro Laina et al. "Deeper Depth Prediction with Fully Convolutional Residual Networks". In: 3D Vision (3DV), 2016 Fourth International Conference on. IEEE, 2016, pp. 239-248.

[13] Yann LeCun et al. "Backpropagation Applied to Handwritten Zip Code Recognition". In: Neural Computation 1.4 (1989), pp. 541-551.

[14] Fayao Liu et al. "Learning Depth from Single Monocular Images Using Deep Convolutional Neural Fields". In: IEEE Trans. Pattern Anal. Mach. Intell. 38.10 (2016), pp. 2024-2039. DOI: 10.1109/TPAMI.2015.2505283.

[15] Nathan Silberman, Derek Hoiem, Pushmeet Kohli, and Rob Fergus. "Indoor Segmentation and Support Inference from RGBD Images". In: ECCV. 2012.

[16] Adam Paszke et al. PyTorch. 2017.

[17] Daniel Scharstein and Richard Szeliski. "A Taxonomy and Evaluation of Dense Two-Frame Stereo Correspondence Algorithms". In: International Journal of Computer Vision 47.1-3 (2002), pp. 7-42.

[18] Sudheendra Vijayanarasimhan et al. "SfM-Net: Learning of Structure and Motion from Video". In: CoRR abs/1704.07804 (2017).

[19] Tinghui Zhou et al. "Unsupervised Learning of Depth and Ego-Motion from Video". In: arXiv preprint arXiv:1704.07813 (2017).


TRITA EECS-EX-2018:698

www.kth.se
