
Real-Time Semantic Stereo Matching

PIER LUIGI DOVESI

KTH ROYAL INSTITUTE OF TECHNOLOGY

SCHOOL OF ELECTRICAL ENGINEERING AND COMPUTER SCIENCE


Real-Time Semantic Stereo Matching

PIER LUIGI DOVESI

Master in Systems, Control and Robotics
Date: June 10, 2020

Academic supervisor: Hedvig Kjellström
Company supervisor: Alessandro Pieropan
Examiner: Elena Troubitsyna

School of Electrical Engineering and Computer Science
Host company: Univrses AB

Swedish title: Semantisk stereomatchning i realtid


Abstract

Scene understanding is paramount in robotics, self-navigation, augmented reality, and many other fields. To fully accomplish this task, an autonomous agent has to infer the 3D structure of the sensed scene (to know where it is looking) and its content (to know what it sees). To tackle the two tasks, deep neural networks trained to infer semantic segmentation and depth from stereo images are often the preferred choices. Specifically, semantic stereo matching can be tackled either by standalone models trained for the two tasks independently or by joint end-to-end architectures. Nonetheless, as proposed so far, both solutions are inefficient, requiring two forward passes in the former case or suffering from the complexity of a single network in the latter, even though jointly tackling both tasks is usually beneficial in terms of accuracy. In this work, we propose a single compact and lightweight architecture for real-time semantic stereo matching. Our framework relies on coarse-to-fine estimations in a multi-stage fashion, allowing: i) very fast inference even on embedded devices, with marginal drops in accuracy compared to state-of-the-art networks; ii) trading accuracy for speed, according to the specific application requirements.

Experimental results on high-end GPUs as well as on an embedded Jetson TX2 confirm the superiority of semantic stereo matching compared to standalone tasks and highlight the versatility of our framework on any hardware and for any application.

The work described in this thesis is also available in [1], ICRA 2020.


Sammanfattning

Scene understanding plays an important role in robotics, self-navigation, augmented reality, and many other fields. To fully accomplish this task, an autonomous agent must be able to understand the 3D structure of its surroundings (to know where the things it looks at are) and their content (to know what it sees). To solve these tasks, the preferred choice is often to train deep neural networks to estimate semantic segmentation and pixel depth from stereo images. Specifically, semantic stereo matching can be handled either through standalone models trained to perform the two tasks independently of each other, or through a joint end-to-end architecture. As proposed so far, however, both solutions are inefficient, since two forward passes are required in the former case and because of the complexity of the merged network in the latter, even though joint training of both tasks is usually beneficial in terms of accuracy. In this work, we propose a compact and computationally light architecture for joint real-time semantic stereo matching. Our framework builds on estimating the scene model in several stages, from coarse to fine, which allows: i) very fast inference even on embedded devices, with a minimal decrease in accuracy compared to modern networks; ii) trading speed against accuracy according to the specific application requirements. Experimental results on high-performance graphics cards as well as on an embedded Jetson TX2 confirm the superiority of semantic stereo matching compared to standalone networks and highlight the versatility of our framework on any hardware and for any field of application.

The contents of this thesis are also described in [1], ICRA 2020.


Contents

1 Introduction
  1.1 Background
    1.1.1 Stereo depth regression
    1.1.2 Multitask Learning
    1.1.3 Semantic segmentation
  1.2 Research problem
  1.3 Objectives
  1.4 Contribution
  1.5 Ethical considerations
  1.6 Overview

2 Related work
  2.1 Stereo depth: traditional methods
  2.2 Stereo depth: deep learning methods
    2.2.1 The very first approach
    2.2.2 DispNet: the first end-to-end architecture
    2.2.3 GC-Net: the new stereo baseline
    2.2.4 PSM-Net and other state-of-the-art methods
  2.3 Stereo depth: alternative approaches
    2.3.1 Online adaptation
    2.3.2 Real-time performances
    2.3.3 Multi-task learning in stereo depth
  2.4 Semantic Segmentation
    2.4.1 U-Net: the traditional approach
    2.4.2 State of the art of semantic segmentation
    2.4.3 Fast segmentation networks

3 Methods
  3.1 Concept
  3.2 Architecture Overview
  3.3 Siamese Bifid U-Net module
    3.3.1 Network Structure
    3.3.2 Operational Block
  3.4 Synergy disparity refinement module
    3.4.1 Network Structure
    3.4.2 Operational Block
  3.5 Objective Functions
    3.5.1 Hierarchical loss weighting
    3.5.2 Dynamic Weight Average
    3.5.3 Segmentation losses
  3.6 Implementation and training

4 Experiments
  4.1 Datasets
    4.1.1 Synthetic datasets
    4.1.2 Real-World dataset
  4.2 Metrics
  4.3 Experiments and ablation study
    4.3.1 Training schedule search
    4.3.2 Effect of Multi-task Learning
    4.3.3 Effect of the Synergy Module
  4.4 Comparison with baseline and run-time experiments
    4.4.1 Architecture alternatives and benchmarks
    4.4.2 Anytime settings
    4.4.3 KITTI online benchmark

5 Conclusions
  5.1 Results and contributions
  5.2 Future Works

Bibliography

A Qualitative maps


1 Introduction

Deep neural networks have proved to be an exceptional tool in the fields of computer vision and robotics. They systematically outperform traditional approaches in almost all vision tasks, such as semantic segmentation, object detection, depth regression, and scene flow [2]. This success is largely motivated by their versatility, the high modularity of their core building blocks, and the robustness of the resulting models. As we will illustrate in this work, network architectures designed for a specific task often prove beneficial for other objectives as well [3]; this correlates with the intrinsic generality of the features extracted during the training process [4]. Recently, though, traditional methods have regained vast research interest in computer vision [5]: their core geometric principles proved to be extraordinary leverage for deep neural networks. Within this framework, this work focuses on three specific challenges of geometry applied to deep learning. First, we will focus on the task of stereo depth regression, i.e. predicting a depth map given a pair of rectified images. As a second objective, we will address the challenge of handling multiple tasks with a single network. In particular, we will study and empirically prove that, with a suitable architecture, performing multiple tasks together can be transformed from an additional burden into an opportunity to reach even higher performance. In this case, the two tasks are depth regression and semantic segmentation; this field of deep learning is commonly referred to as multi-task learning [6]. Third, we will tackle another systematic problem of deep neural networks that still precludes their application to mobile robotics and real-world settings: the high computational load and slow inference time. In this work, we aim to find an architecture able to run with real-time performance on low-power embedded devices.



1.1 Background

1.1.1 Stereo depth regression

Depth estimation is an essential task for 3D scene reconstruction and scene understanding. Its fields of applicability span from AD navigation [7] to mobile robotics [8] and virtual and augmented reality [9]. Thus, several depth regression techniques have been developed, exhibiting different deployability and levels of effectiveness. Among them, stereo vision has proven to be one of the most promising. While LiDAR solutions can offer much higher accuracy, stereo cameras provide a dense depth estimation with much lower equipment costs and power requirements, in a smaller and lighter package [10]. Given a rectified stereo image pair, the stereo baseline, and the focal length, the goal of depth estimation is to compute the disparity for every single pixel in the reference image. With disparity, we refer to the horizontal displacement between a pair of corresponding pixels in the left and right frames.

Given a pixel (x, y) in the left frame, if the corresponding pixel is in (x − d, y) in the right image, then the depth of this pixel is computed as

$$\text{depth} = \frac{f \times B}{d} \qquad (1.1)$$

where f is the focal length of the camera and B is the baseline, i.e. the distance between the two cameras. Therefore, given a constant camera setup, depth and disparity are inversely proportional.
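As a minimal illustration of Equation 1.1, the conversion from a disparity map to a depth map is a single element-wise operation; the sketch below uses hypothetical, roughly KITTI-like calibration values.

```python
import numpy as np

def disparity_to_depth(disparity, focal_length, baseline, eps=1e-6):
    """Convert a disparity map (pixels) to depth (same unit as the baseline)."""
    # depth = f * B / d; eps guards against division by zero on invalid pixels
    return focal_length * baseline / np.maximum(disparity, eps)

# Hypothetical setup: f ~ 721 px and B ~ 0.54 m
depth = disparity_to_depth(np.array([[36.0]]), focal_length=721.0, baseline=0.54)
# a disparity of 36 px maps to roughly 10.8 m
```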

Figure 1.1: Example of disparity regression. Qualitatively, we can notice the simple inverse correlation between disparity (highlighted in the image) and depth. Image taken from [11].


1.1.2 Multitask Learning

In deep learning, we usually focus on one single task, whether it is a set of correlated metrics or a benchmark. This problem is commonly addressed with a single network trained to maximize or minimize the metric of interest. This is, for example, the case for a generic classifier network, which simply aims to minimize the prediction error, or for a depth regression network whose objective function, although it might embed different components, keeps each of them strictly focused on describing one single goal and one single task. This "laser-focus" approach usually achieves acceptable performance, but in adopting it we ignore a lot of other information that might help us reach even higher performance. Training a model on different, but intrinsically related, tasks can enable us to extract more generalized features. This approach takes the name of Multi-Task Learning (MTL). A good definition of multi-task learning is offered by Caruana [12]: "Multi-task learning improves generalization by leveraging the domain-specific information contained in the training signals of related tasks."

Figure 1.2: Multi-task learning architecture presented by Alex Kendall in [13]: a single encoder is shared among several task-specific decoders.
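A minimal sketch of this hard-parameter-sharing idea, assuming PyTorch (all module names and sizes are hypothetical): one trunk is trained by the gradients of both task losses, so its features must serve both objectives.

```python
import torch.nn as nn

class SharedEncoderMTL(nn.Module):
    """Hard parameter sharing: one shared trunk, one head per task."""
    def __init__(self, channels=32, num_classes=19):
        super().__init__()
        self.encoder = nn.Sequential(           # shared features for both tasks
            nn.Conv2d(3, channels, 3, padding=1), nn.ReLU(),
            nn.Conv2d(channels, channels, 3, padding=1), nn.ReLU(),
        )
        self.seg_head = nn.Conv2d(channels, num_classes, 1)  # segmentation logits
        self.depth_head = nn.Conv2d(channels, 1, 1)          # depth regression

    def forward(self, x):
        feats = self.encoder(x)
        return self.seg_head(feats), self.depth_head(feats)
```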

1.1.3 Semantic segmentation

Semantic segmentation refers to the operation of labeling every pixel in an image with its corresponding class. The classes are usually related to a semantic understanding of the image, such as cars, street signs, or road conditions in self-driving applications, or cells and body parts in biomedical applications. Semantic segmentation can also be considered image classification expanded to the pixel level. Besides semantic segmentation, two other segmentation processes are often used. Instance segmentation usually focuses on only one or a few classes and aims to assign each pixel to the correct instance; this can be used, for example, to identify, individually segment, and count how many people are in a crowd, how many cars are on a street, or how many pieces pass on an industrial conveyor. Finally, panoptic segmentation [14] has recently been introduced to solve both tasks together, i.e. to correctly identify the class of each object and to recognize each object as a distinct instance. Semantic segmentation is usually addressed through encoder-decoder or multi-branch convolutional neural networks: it is indeed essential to extract both local details and global context to perform a correct pixel classification.

Figure 1.3: Example of semantic segmentation annotations (right) and corresponding images (left). Image taken from the ADE20K dataset [15].

The network output usually takes the form of a class probability volume with dimensions H × W × C, where H and W are the image height and width, while C is the number of classes. The output segmentation is then obtained by performing a simple argmax over the classes, resulting in a one-channel map. Similarly to classification networks, the objective function is the cross-entropy:

$$\text{loss}(x, \text{class}) = \text{weight}[\text{class}] \left( -x[\text{class}] + \log \sum_{j} \exp(x[j]) \right) \qquad (1.2)$$

where x and class identify the pixel and its class, while weight represents a vector of static weights assigned to each class. The introduction of this element turns out to be essential in the case of heavily unbalanced datasets [16].
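In PyTorch, the class volume, the argmax readout, and the weighted cross-entropy of Equation 1.2 map directly onto built-ins; the class count and weight values below are placeholders.

```python
import torch
import torch.nn as nn

num_classes, H, W = 19, 4, 4
logits = torch.randn(1, num_classes, H, W)         # network output: B x C x H x W
target = torch.randint(0, num_classes, (1, H, W))  # ground-truth label map

# static per-class weights to counter dataset imbalance (placeholder values)
class_weights = torch.ones(num_classes)
class_weights[0] = 0.1   # e.g. down-weight an over-represented class

loss = nn.CrossEntropyLoss(weight=class_weights)(logits, target)
segmentation = logits.argmax(dim=1)                # one-channel map: B x H x W
```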


1.2 Research problem

The research problem addressed in this work is manifold:

• Multi-task learning: We want to address if and how the auxiliary task of semantic segmentation can improve stereo depth regression. In particular: how should we design a suitable architecture for a multi-task setup? What approaches are possible for full exploitation of the auxiliary output? How effective are these techniques?

• Real-time on mobile devices: We want to design an architecture able to perform both tasks on small embedded devices, which implies limitations both in computational power and in available memory. Moreover, the model has to operate with real-time performance. The research problem therefore concerns which methods and techniques should be adopted to obtain high-accuracy output under tight computational constraints.

• Training dataset: Finally, as a supplementary problem, we address, given a specific deployment scenario (such as street scenes for ADAS purposes), which datasets can be exploited for training. Are synthetic datasets [17] ready to reach competitive performances on real test sequences (without employing domain adaptation techniques other than fine-tuning)?

1.3 Objectives

In summary, our objective is to design a novel architecture able to successfully produce high-quality depth regression and semantic segmentation in real time in mobile settings (using the common benchmark platform NVIDIA Jetson TX2). We aim for competitive performances, even reaching state-of-the-art results among comparable fast-inference methods. Our goal is to prove that, with a suitable design, the additional burden of adding an auxiliary task can be overcompensated by much higher accuracy.


1.4 Contribution

Our contributions are:

• A novel feature extraction Siamese network able to produce satisfactory semantic segmentation as well as disparity features at high speed (up to 13FPS on mobile devices).

• A novel refinement module able to vastly improve disparity regression accuracy by exploiting semantic cues at almost no extra computational burden.

• We prove that the two modules can successfully cooperate in a single end-to-end architecture for disparity regression, obtaining an error of 3.2% with an output rate of 6.25FPS on mobile devices.

1.5 Ethical considerations

Obtaining fast and reliable depth prediction and semantic segmentation represents a key step towards an "autonomous future": the vision and challenge of an AI technology able to understand and interact with the real world beyond the walls of scientific laboratories. The ethical implications of such technology are meaningful and need to be properly addressed. On one side, intelligent mobile robotics and other AI systems can have a tremendous impact on our society, for example by substituting humans in dangerous or arduous tasks. Moreover, reliable vision systems could improve driving safety, enable smart cities, and even help people with impaired vision. On the other hand, we must keep in mind that the power of such systems can easily be adapted for abuses of privacy and civil rights. It is concerning to see that among the highest-tier sponsors of the most renowned computer vision and machine learning conferences there are many companies whose core business is rooted in public surveillance activities [18]. Making it worse is the fact that these companies usually operate mainly in countries where basic human rights and the freedoms of speech, organization, and press are often violated [19].

1.6 Overview

In the following chapter, Chapter 2 - Related work, we will thoroughly explain, analyze, and discuss the most important publications regarding stereo depth regression. We will start with a short introduction to traditional methods and how the traditional geometric approach is still embedded in the newest architectures. We will also go through some specific methods that successfully tackled the challenge of improving the results with auxiliary tasks or managed to obtain good performance at low inference time. Afterward, in Chapter 3 - Methods, we will explain step by step the novel architecture we are proposing and our contributions. In Chapter 4 - Experiments, we will present the extensive hyperparameter and architecture-variation search. We will also empirically assess how our changes improve performance in terms of accuracy and inference time. Moreover, we will compare our architecture against the original baseline and the state of the art in stereo depth and semantic segmentation. Finally, in Chapter 5 - Conclusions, we will summarize our findings and contributions, and the perks and flaws of our methods.


2 Related work

The work of this thesis relies on recent improvements in the fields of depth regression, semantic segmentation, and multitask learning. Besides the extremely rapid improvements in accuracy, our project benefits greatly from new advancements in network compression and fast inference. Both of these factors are crucial for an effective deployment in mobile robotics and autonomous driving (AD) scenarios. In this chapter, we will first address our main task, stereo disparity regression. We will start our analysis from the traditional approaches (Section 2.1), then discuss how deep learning methods radically changed the field up to the current state of the art (Section 2.2). A specific section is dedicated to alternative deep learning methods for stereo depth regression, focused not only on accuracy but also on low inference time, the domain shift problem, and multi-task learning (Section 2.3). Finally, the last section focuses on deep learning methods for semantic segmentation (Section 2.4).

2.1 Stereo depth: traditional methods

Before the advent of machine learning, several stereo depth estimation approaches were explored, and several of the proposed stages and techniques still represent key components of modern state-of-the-art approaches. An exhaustive taxonomy and evaluation of these traditional approaches has been presented by Scharstein and Szeliski [20]. In this context, we need to provide a more formal definition of the aforementioned concept of disparity. The goal of every stereo algorithm is to compute a univalued disparity map function d(x, y) with respect to a reference image (usually one of the input images). It is also worth mentioning that most stereo algorithms consider disparity only along the horizontal dimension, but a vertical disparity is also possible with other setups. Another central concept that requires a proper introduction is the disparity space image, or DSI. In general, Scharstein and Szeliski define the DSI as any image or function defined over a continuous or discretized version of the disparity space (x, y, d). In practice, the DSI represents the confidence, log-likelihood, or cost of a particular match implied by a specified disparity d(x, y). This formulation allows us to redefine the goal of a stereo algorithm as the identification of the surface embedded in the DSI that minimizes the cost function. We stress that under this formulation the minimized cost function can consider more elements than just the stereo matching cost (such as smoothness or other constraints); the ultimate goal is indeed to find the disparity map function that best describes the shape of the surfaces in the scene. In particular, traditional stereo matching algorithms comprise the following steps:

1. Matching cost computation: the matching cost can be computed on raw pixels or on robustified measures. It can involve cost functions such as the mean squared error (MSE) and the mean absolute difference.

2. Cost aggregation: local and window-based methods are employed to perform cost aggregation over the DSI. Gaussian 2D and 3D convolutions were usually employed for this process.

3. Disparity computation: the disparity can be computed with both local and global methods. Local methods emphasize the minimization of the matching cost, while global methods take a holistic approach in which multiple cost functions are considered. Usually, there are at least two components: a measure of how well the disparity function agrees with the stereo input, and a smoothness assumption. These losses usually need to take the discontinuity preservation of the image into consideration, avoiding a naive satisfaction of the smoothness assumption by simply blurring the image. Besides local and global approaches, other methods have been explored, such as dynamic programming [21] and cooperative algorithms [22].

4. Disparity refinement: this step summarizes many different postprocessing methods that aim to improve the provided disparity map. These may include sub-pixel refinement and the management of occluded areas.
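A minimal NumPy sketch of steps 1 to 3 for a purely local method (SAD matching cost, box-filter aggregation, winner-take-all disparity); the window size and disparity range are assumptions.

```python
import numpy as np
from scipy.ndimage import uniform_filter

def block_matching(left, right, max_disp=64, window=9):
    """Local stereo on grayscale float images: SAD cost (step 1),
    box aggregation (step 2), winner-take-all disparity (step 3)."""
    H, W = left.shape
    dsi = np.full((max_disp, H, W), np.inf)            # disparity space image
    for d in range(max_disp):
        diff = np.abs(left[:, d:] - right[:, :W - d])  # absolute difference
        dsi[d, :, d:] = uniform_filter(diff, size=window)  # window aggregation
    return dsi.argmin(axis=0).astype(np.float32)       # winner-take-all
```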

Finally, other methods have been considered and tested through the years, and many of them are still sources of useful inspiration today. Worth mentioning are iterative refinement approaches based on image warping, and hierarchical approaches that start from an initial coarse estimate and refine it based on previous computations [24].

Figure 2.1: Traditional stereo estimation pipeline divided into: (a) input stereo pair, (b) matching cost computation, (c) cost aggregation, (d) disparity computation, and (e) disparity refinement. Image taken from [23].

2.2 Stereo depth: deep learning methods

2.2.1 The very first approach

The first use of convolutional networks aimed to solve the matching cost computation step. In particular, the very first successful approach was presented by Zbontar and LeCun in [11], which scored as the top-performing method on the stereo leaderboard of the popular street-scenario dataset KITTI [25] (August 2014). Zbontar and LeCun trained a convolutional neural network to predict how well two image patches match (step 1) and then applied cross-based cost aggregation and semi-global matching. Finally, they also applied a left-right consistency check to eliminate errors due to occlusions. The approach was based on exploiting the matching cost as the supervision signal of a supervised learning problem. The matching cost C(p, d), where p is the image position and d the disparity, is defined as:

$$C_{AD}(p, d) = \sum_{q \in N_p} |I_L(q) - I_R(qd)| \qquad (2.1)$$

where I_L(q) and I_R(q) are the left and right image intensities at position q, and N_p is the set of locations within a fixed rectangular window centered at p. Here p and q represent 2D image coordinates, and appending d after a coordinate denotes subtracting the disparity d from its x coordinate. Equation 2.1 is thus a cost measure associated with matching a patch from the left image, centered at position p, with a patch from the right image centered at position pd.

Figure 2.2: Zbontar and LeCun's convolutional architecture. Note the novel introduction of Siamese networks and a concatenation layer for disparity estimation. Image taken from [11].

The network architecture illustrated in Figure 2.2 is composed of a Siamese convolutional network followed by fully connected layers that concatenate the two branches, terminating with two labels fed through a softmax that produces a probability distribution over the two classes: good and bad matches. The network has been trained uniquely on the KITTI dataset. The subsequent steps follow the traditional stages presented previously.
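A compact PyTorch sketch of this patch-comparison design: Siamese convolutional branches with shared weights, concatenation, fully connected layers, and a two-class softmax. The layer sizes are illustrative, not those of [11].

```python
import torch
import torch.nn as nn

class PatchMatchNet(nn.Module):
    """Siamese branches score whether two 9x9 grayscale patches match."""
    def __init__(self):
        super().__init__()
        self.branch = nn.Sequential(            # weights shared by both patches
            nn.Conv2d(1, 32, 3), nn.ReLU(),
            nn.Conv2d(32, 32, 3), nn.ReLU(),
            nn.Flatten(),
        )
        self.head = nn.Sequential(              # concatenation + FC layers
            nn.Linear(2 * 32 * 5 * 5, 128), nn.ReLU(),
            nn.Linear(128, 2),                  # logits: good vs. bad match
        )

    def forward(self, left_patch, right_patch):
        f = torch.cat([self.branch(left_patch), self.branch(right_patch)], dim=1)
        return self.head(f).softmax(dim=1)      # probability over the two classes
```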

2.2.2 DispNet: the first end-to-end architecture

The pioneering work presented by Zbontar and LeCun was severely limited by the shortage of available data fully annotated with ground truth. This limitation is intrinsic in the case of real-scene datasets like KITTI, where the ground truth is annotated by exploiting 3D LiDARs and then fitted with CAD models. This implies intrinsic constraints on the quality of the ground truth, and the large amount of annotation work required further limits the quantity of available data. A notable breakthrough was presented by Mayer et al. with the fully annotated synthetic dataset Scene Flow and the first end-to-end neural network for stereo regression, DispNet [26]. Scene Flow contains more than 39000 stereo frames at 960x540 pixel resolution, rendered from various synthetic sequences. The disparity ground truth directly describes how pixels move between the two views of a stereo frame. It is a formulation of depth which is independent of camera parameters (even though it still depends on the baseline). The dataset is divided into three sections:

• FlyingThings3D: the main part of the dataset, a collection of everyday objects flying along random 3D trajectories. The camera is also moving. The objects do not interact, and their dimensions and textures are randomized.

• Monkaa: the second part of the dataset is made from the original 3D rendering of the animated short film Monkaa. Monkaa contains non-rigid and softly articulated motion as well as visually challenging fur. This, of course, implies a much lower naturalism of the scenes.

• Driving: largely inspired by the popular KITTI 2015 dataset, this is the most realistic part of Scene Flow. It simulates driving scenes in a simplified rendered city. The camera intrinsics have been chosen to roughly match the KITTI 2015 setup.

Such a large dataset allowed the design and training of an end-to-end architecture, DispNet. This network can be seen as an adaptation for stereo disparity regression of FlowNet by Dosovitskiy et al. [27]. FlowNet is a deep convolutional architecture designed for the estimation of optical flow. Optical flow and stereo depth estimation are very similar tasks, in the sense that both require solving the disparity regression problem: optical flow represents the regression of movements in the image, computed as the displacement between sequential frames.

FlowNet (and consequently DispNet) is a convolutional neural network with an hourglass architecture, i.e. a down-convolutional part for feature extraction followed by an up-convolutional section for disparity regression. The network also contains skip connections between corresponding layers of the convolutional and deconvolutional segments. Moreover, each deconvolutional layer also produces a sub-sampled disparity map. Summarizing, in the deconvolutional part each layer receives three different inputs: the lower-level features, the previous disparity map, and the corresponding feature map from the convolutional segment. Worth mentioning is that, thanks to the skip connections, FlowNet might also be considered a U-Net [28] ante litteram.

Figure 2.3: Example image from the Scene Flow dataset, Monkaa section. On the top, the RGB image; on the bottom, from the left, the ground truth of optical flow, disparity, and disparity change. Image taken from [26].

Both FlowNet and DispNet receive two RGB inputs: two sequential frames for FlowNet and a stereo pair for DispNet. Dosovitskiy et al. presented two options for extracting information from the pair: FlowNetSimple and FlowNetCorr. FlowNetSimple just stacks the two images into a 6-channel input (H × W × 2C). This option is more general and straightforward, but the network might struggle to find a satisfactory optimum. Similarly to the architecture by Zbontar and LeCun, a step towards a more optimized architecture is represented by two identical and parallel processing streams for the two images, merged at a later stage.

Figure 2.4: Down-convolutional architectures of FlowNet (and consequently DispNet): on the top, the 6-channel input option (FlowNetSimple); on the bottom, the optimized version with Siamese networks (FlowNetCorr). Note the skip connections between down- and up-convolutions. Image taken from [27].

The combination of the two branches is performed through a correlation layer that performs multiplicative patch comparisons between the two feature maps. In particular, given two multi-channel feature maps f_1, f_2 : R^2 -> R^c, with w, h and c being their width, height, and number of channels, the correlation layer allows the network to compare each feature patch from f_1 with each patch from f_2. Considering only a single comparison operation, the correlation of two patches centered at x_1 in the first map and at x_2 in the second map is defined as:

$$c(x_1, x_2) = \sum_{o \in [-k,k] \times [-k,k]} \langle f_1(x_1 + o),\, f_2(x_2 + o) \rangle \qquad (2.2)$$

for a square patch of size K = 2k + 1. Worth mentioning is that Equation 2.2 corresponds to one step of a convolutional layer, but in this case the convolution is applied from data to other data, so the operation involves no trainable parameters. Since the whole correlation would include w² · h² operations, it would be too computationally expensive for the backward pass, and a limit on the maximum displacement is therefore introduced. In contrast to FlowNet, DispNet considers disparities only in one dimension. This difference allows larger disparities to be considered, since the correlation operation is one order of magnitude smaller and grows linearly (not quadratically) with the maximum disparity. Note that the disparity here can even be computed in just one direction: for example, given the left image and looking for correspondences within the right image, all disparity displacements are to the left. Finally, while FlowNetSimple and FlowNetCorr had comparable performances, in DispNet the variant with the correlation layer (DispNetCorr) performs systematically better than the one without it (DispNetSimple).
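A sketch of a one-dimensional (stereo) correlation layer, i.e. Equation 2.2 restricted to k = 0 and leftward displacements only; as noted above, the operation is parameter-free.

```python
import torch

def correlation_1d(f_left, f_right, max_disp=40):
    """Parameter-free 1D correlation: B x C x H x W -> B x (max_disp+1) x H x W."""
    B, C, H, W = f_left.shape
    out = f_left.new_zeros(B, max_disp + 1, H, W)
    for d in range(max_disp + 1):
        # match left pixel (x, y) against right pixel (x - d, y)
        out[:, d, :, d:] = (f_left[:, :, :, d:] *
                            f_right[:, :, :, :W - d]).mean(dim=1)
    return out
```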


Figure 2.5: Up-convolutional architectures of FlowNet (and consequently DispNet). Note that each deconvolution layer receives three inputs (lower-level feature maps, down-sampled disparity, skip connection from the down-convolution) and returns two outputs (disparity map and feature map). Image taken from [27].

2.2.3 GC-Net: the new stereo baseline

A breakthrough in stereo depth regression is represented by GC-Net by Kendall et al. [29], which set the new state-of-the-art benchmark on the KITTI dataset ranking (13 March 2017). The network architecture offers several novelties motivated by a geometrical understanding of the problem. The main contributions consist of new ways to incorporate context directly from the data, employing 3D convolutions over the cost volume, and the definition of a novel soft-argmin function for extracting the disparity from the cost volume.

Figure 2.6: Overall GC-Net architecture. Image taken from [29].

Similarly to previous architectures [11, 26], the feature extraction relies on 2D Siamese convolutional networks forming unary features. In the second stage, the processing for the creation of the cost volume begins. This is initially done by concatenating each feature with the corresponding one from the opposite stereo frame, forming two volumes with dimensions height × width × (max_disparity + 1) × feature_size. Differently from other approaches, which employed differences, distance measures, or dot-product operations [26, 10], this concatenation allows the network to learn to incorporate context operating over the feature unaries. This gives the architecture the capacity to learn semantics and not just relative representations. Afterward, several 3D convolutions in an encoder-decoder architecture are applied to the volumes, finally obtaining the regularized cost volume with dimensions height × width × disparity. On the final cost volume, Kendall et al. introduce a new soft-argmin function that offers the advantage of being fully differentiable:

$$\mathrm{softargmin} := \sum_{d=0}^{D_{max}} d \times \sigma(-c_d) \qquad (2.3)$$

where σ is the softmax operation, c_d the predicted cost, and d the disparity. Finally, the mean absolute difference (MAD) is employed as the loss.
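A PyTorch sketch of the two key operations just described, the concatenation-based cost volume and the fully differentiable soft-argmin of Equation 2.3; the regularizing 3D convolutions between the two are omitted.

```python
import torch
import torch.nn.functional as F

def concat_cost_volume(f_left, f_right, max_disp):
    """GC-Net-style volume: B x C x H x W features -> B x 2C x (D+1) x H x W."""
    B, C, H, W = f_left.shape
    vol = f_left.new_zeros(B, 2 * C, max_disp + 1, H, W)
    for d in range(max_disp + 1):
        vol[:, :C, d, :, d:] = f_left[:, :, :, d:]       # left features at x
        vol[:, C:, d, :, d:] = f_right[:, :, :, :W - d]  # right features at x - d
    return vol

def soft_argmin(cost, max_disp):
    """Differentiable disparity readout from a B x (D+1) x H x W cost map."""
    prob = F.softmax(-cost, dim=1)                       # low cost -> high weight
    disps = torch.arange(max_disp + 1, dtype=cost.dtype,
                         device=cost.device).view(1, -1, 1, 1)
    return (prob * disps).sum(dim=1)                     # expected disparity map
```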

2.2.4 PSM-Net and other state-of-the-art methods

The works presented by Kendall et al. [29] and Mayer et al. [26] greatly influenced the research community towards new architectures. Some approaches focused on larger networks with iterative accuracy refinement, such as Cascade Residual Learning (CRL) by Pang et al. [30]. CRL-Net is indeed a composition of two (modified) DispNet stages: it first computes a dense half-resolution disparity map and then refines the result using another DispNet, which this time computes only the residual disparity error between one frame and the opposite frame warped using the newly computed disparity. The second stage can then focus only on the highly non-linear residual errors, relieving the "learning burden" of the first stage. The loss is computed as an average of the MAD at each stage of the deconvolution. CRL-Net outperforms GC-Net in most of the benchmarks on KITTI 2015, ranking 1st on July 30, 2018 (see the updated KITTI 2015 ranking online). As we will explain in further detail in the next section, the "cascade residual" approach will unexpectedly turn out to be extremely useful for real-time approaches [31].

Figure 2.7: Overall CRL-Net architecture. Image taken from [30].

Another extremely versatile and influential work is the Pyramid Stereo Matching Network (PSM-Net) by Chang et al. [3]; this method also ranked first in the KITTI 2012 and 2015 leaderboards before March 18, 2018. PSM-Net embraces the experience of semantic segmentation studies, introducing a pyramid pooling module to incorporate global context information into image features. Moreover, it presents a stacked-hourglass 3D CNN structure to extend the regional support of context information in the cost volume. The Spatial Pyramid Pooling (SPP) module was designed to remove the fixed-size constraints of CNNs and to incorporate hierarchical context information, learning relationships between objects and sub-regions. SPP is based on adaptive average pooling that compresses the features into four scales, followed by a 1x1 convolution to reduce the feature dimension; eventually, the low-dimensional feature maps are up-sampled to the size of the original feature map and concatenated. The cost volume is created with the same approach used in GC-Net. Finally, to aggregate the feature information along the disparity dimension as well as the spatial dimensions, two architectures are proposed: basic and stacked hourglass. The basic option consists of a simple network composed of residual blocks with a final bilinear up-sampling to the original image dimension and a regression for the disparity estimation. The second option is more sophisticated: similarly to other works such as CRL, it employs several stages. In this case, the network is composed of three stacked encoder-decoder structures, each of which generates a disparity map. During training, the loss is computed as the average over the outputs, while at test time only the third output is considered.

Worth mentioning is that PSM-Net strengthened with the CSPN module proposed by Cheng et al. [32] reaches even higher accuracy, and it currently ranks 1st on the KITTI 2015 stereo leaderboard (2019).


Figure 2.8: Overview of the PSM-Net architecture. On the top, the end-to-end pipeline; on the bottom left, the SPP module; on the bottom right, the two cost volume regularizations. Image taken from [3].

2.3 Stereo depth: alternative approaches

GC-Net, CRL-Net, and PSM-Net have reached outstanding accuracy results: less than 2% of outliers averaged over all the ground truth pixels (D1-all) and less than 3% of outliers averaged only over foreground regions (D1-fg). However, with the current technology, those models are far from being usable in any kind of real-time or mobile application: their inference time on the KITTI dataset is higher than 0.4 seconds, tested on powerful desktop GPUs. The inference time, the energy requirements, and the amount of equipment make these solutions unsuitable for real-time applications in AD or mobile robotics scenarios. Another, subtler problem is represented by the generalization capabilities of these networks. All of them have been largely trained on Scene Flow and finally fine-tuned for KITTI. This raises questions about their robustness in other domains: will they reach the same performances in diverse scenarios, such as different weather conditions, landscapes, or street objects? Recently, numerous works have analyzed one or more of these aspects, proposing several solutions.


2.3.1 Online adaptation

Regarding domain adaptation in stereo matching, two relevant works have been proposed by Tonioni et al. [33, 34]. In Unsupervised Adaptation for Deep Stereo [33], the key idea is to enable online adaptation by leveraging confidence measures that assess the reliability of the disparity estimation. Where the confidence is above a certain threshold, the disparities produced by a traditional method such as AD-CENSUS or SGM are used as the supervision signal of an unsupervised loss. The overall loss is L = C_L + λS, with C_L the confidence guided loss, S a smoothing loss, and λ a hyperparameter. Given a reliable confidence measure C(p) of a traditional disparity estimate D(p), where p is a spatial location, and given the disparity estimate D_net(p) of a network (DispNet in this case), the confidence guided loss is defined as (with t an adjustable threshold hyperparameter):

$$C_L = \frac{1}{|P|} \sum_{p \in P} \ell(p) \qquad (2.4)$$

$$\ell(p) = \begin{cases} C(p)\,|D_{net}(p) - D(p)| & \text{if } C(p) \geq t \\ 0 & \text{otherwise} \end{cases} \qquad (2.5)$$
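A sketch of Equations 2.4 and 2.5 in PyTorch; the threshold value is an assumption.

```python
import torch

def confidence_guided_loss(d_net, d_trad, confidence, t=0.95):
    """Penalize deviations from the traditional estimate only where trusted."""
    mask = (confidence >= t).float()        # keep confident pixels only
    per_pixel = confidence * (d_net - d_trad).abs() * mask
    return per_pixel.mean()                 # 1/|P| sum over all locations
```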

Another interesting step towards adaptive and real-time performances has been proposed in Real-time self-adaptive stereo [34], which presents a novel Modular ADaptive architecture (MAD-Net). MAD-Net is divided into two parts: the first is a pair of Siamese networks composed of 6 modules, each consisting of two down-convolutional layers. The second part is composed of 6 modules, each consisting of a concatenation layer similar to GC-Net, 5 up-convolutional layers, and finally a bilinear up-sampling. The output disparity is then both sent to the following decoder block and used for computing the photometric consistency, i.e. the reconstruction error between a real frame and a warped one obtained through the re-projection of the opposite frame using the disparity (a method similar to the one implemented by Godard [35]).
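A sketch of such a photometric re-projection, assuming PyTorch's grid_sample as the differentiable bilinear sampler: the right image is warped into the left view through the predicted disparity, and the reconstruction error can then supervise the network without ground truth.

```python
import torch
import torch.nn.functional as F

def warp_right_to_left(right, disparity):
    """Reconstruct the left view by sampling the right image at x - d(x, y)."""
    B, _, H, W = right.shape
    # base sampling grid in grid_sample's normalized [-1, 1] coordinates
    ys, xs = torch.meshgrid(
        torch.linspace(-1, 1, H, device=right.device),
        torch.linspace(-1, 1, W, device=right.device),
        indexing="ij",
    )
    xs = xs.expand(B, H, W) - 2.0 * disparity.squeeze(1) / (W - 1)  # shift by d
    grid = torch.stack([xs, ys.expand(B, H, W)], dim=-1)  # B x H x W x 2, (x, y)
    return F.grid_sample(right, grid, align_corners=True)

# photometric consistency: L1 error between real and reconstructed left image
# loss = (left - warp_right_to_left(right, disparity)).abs().mean()
```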

The online adaptation capabilities rely on the modularity of the structure: indeed, it is possible to train each encoder-decoder module pair independently. This is interesting since, while computing a full back-propagation would be computationally too expensive for real-time performance, a "partial back-prop" is instead possible. To choose which part of the network has the highest impact if trained, they developed the Modular ADaptation (MAD) algorithm that, after each update, assesses the network improvement and stochastically guides the network to train some blocks more than others.


Figure 2.9: Overview of the MAD-Net architecture. On the left, the modular pyramidal architecture; on the right, the cost volume and the photometric re-projection. Image taken from [34].

2.3.2 Real-time performances

Despite the large performance improvement obtained by MAD-Net (1.3 Hz on Jetson TX2), the output rate might still be limiting for real-time applications where a fast response is required. A work that successfully tackles this issue has recently been proposed by Wang et al. [31]. In particular, they propose AnyNet, a modular architecture able to trade off computation and accuracy at inference time. The network can work in anytime settings, i.e. the model can produce valid results even if a shorter time budget is allocated. The inference frequency spans from a minimum of 10FPS to a maximum of 35FPS on an NVIDIA Jetson TX2. Wang et al. also propose an interesting evaluation of the complexity scaling of depth estimation algorithms: the complexity scales linearly with the maximum considered disparity and cubically with the image resolution. Leveraging these facts, they refine the depth map successively while always ensuring a minimal computational time.

As a first stage, a U-Net feature extractor is employed to obtain feature maps at 1/16, 1/8, and 1/4 of the original image resolution. From the first stage features, a traditional disparity cost volume is created and then regularized through 3D convolutions (similarly to GC-Net). The disparity is finally obtained through Kendall's soft-argmin function, producing a first, low-resolution disparity map. If more time is allocated, the network proceeds to the following stage. From stage 2 onward, the network does not recompute the disparity volume from scratch but instead proceeds with an operation very similar to what we have seen in CRL by Pang et al.: the disparity volume is computed between the left frame's scaled feature map and the features obtained by warping the right feature map with the up-scaled stage 1 disparity. The residual disparity errors are then obtained and summed with the up-sampled stage 1 disparity. The same operation is repeated in the third stage, while in the fourth and final stage an SPNet [36] is employed to provide an even sharper disparity map.

Figure 2.10: Overview of AnyNet with highlight on the U-Net feature extractor and the subsequent network stages. Image taken from [31].
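A sketch of the residual update performed from stage 2 onward, reusing the warp_right_to_left sketch given earlier; residual_net stands for a hypothetical module that builds a small cost volume over a few disparity offsets and regresses a correction, in the spirit of [31].

```python
import torch.nn.functional as F

def refine_stage(f_left, f_right, disp_prev, residual_net, scale=2):
    """One AnyNet-style refinement stage operating on residual disparities."""
    # up-sample the previous disparity; disparity values scale with resolution
    disp_up = F.interpolate(disp_prev, scale_factor=scale,
                            mode="bilinear", align_corners=True) * scale
    f_right_warped = warp_right_to_left(f_right, disp_up)  # sketch from Sec. 2.3.1
    residual = residual_net(f_left, f_right_warped)        # hypothetical module
    return disp_up + residual                              # corrected disparity
```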

2.3.3 Multi-task learning in stereo depth

Recently, several methods proved that leveraging a multi-task setup can lead to better performances. A multi-task network usually involves an initial architecture segment where the tasks share the same parameters, for example the feature extractor. The shared embeddings are then fed into two specialized networks: one for classic disparity regression and one focused on the auxiliary task. In a following stage, the results of the two networks are merged again (usually directly concatenated) into a "hybrid volume" that, after an additional regularization, provides the final disparity. Regarding the objective function, usually both task-specific and multi-task loss functions are employed. Many methods have analyzed the effect of multi-task learning on depth regression from a mono camera, but just a few have managed to improve the depth performances in a stereo setup.

To the best of our knowledge, only four methods have successfully achieved this result: SegStereo [37] and DispSegNet [38] leverage segmentation as the auxiliary task, AMNet [39] background-foreground segmentation, and EdgeStereo [40] edge cues. To better understand the proposed contributions, we further explain the two methods based on semantic segmentation.

Figure 2.11: Architecture overview of SegStereo. Image taken from [37]

SegStereo proposes a network divided into three parts. At first, the stereo pair is processed by a ResNet-50 Siamese feature extractor. The extracted features are then processed by three different modules:

• Correlation layer: similarly to DispNetC, a correlation layer is employed to extract disparity information.

• Convolution block: the left features are processed by a convolution block to preserve RGB cues.

• Segmentation network: both left and right features are processed by a segmentation network, obtaining the corresponding semantic maps.

In a third phase, the three left embeddings are concatenated into a hybrid volume. This volume is then processed by an encoder-decoder architecture, finally obtaining the left disparity map. Regarding the objective functions, the network can be trained both in an unsupervised and in a supervised fashion. Worth mentioning is the unsupervised objective function, which involves both the photometric re-projection of the left RGB onto the right RGB (task-specific) and the re-projection of the right segmentation onto the left ground truth (multi-task). A similar approach has been presented in DispSegNet, which instead employs a backbone composed of a 3D-convolution U-Net structure for the disparity estimation. The introduction of RGB embeddings is kept for a final refinement module that, with a convolutional residual structure, allows a further improvement of the disparity map. The segmentation is computed in parallel using a module similar to the one adopted in PSPNet [41].

Figure 2.12: Architecture overview of DispSegNet. Image taken from [38]

2.4 Semantic Segmentation

In this work, we leverage semantic segmentation as an auxiliary task to improve disparity prediction in a real-time embedded setup. This leads us to the following requirements: a light-weight architecture, satisfactory pixel accuracy and mean intersection over union (meanIOU), and a modular architecture.

2.4.1 U-Net: the traditional approach

The U-Net [28] is an extremely popular architecture introduced by Ronneberger et al. in 2015 for semantic segmentation of biomedical images. The extreme versatility of the architecture, however, made it a common standard in research. A U-Net is composed of a backbone convolutional neural network that progressively reduces the feature dimension (the encoder) and a symmetrical network that enlarges the bottleneck embedding back to the original dimension (the decoder). Moreover, a fundamental feature of the U-Net is the presence of skip connections between the corresponding resolutions of the encoder and the decoder. This implies that the bottleneck of the U-Net does not need to compress the full input into a single embedding, but just the residuals. This highly residual structure allows the U-Net to work with large inputs using a relatively small number of parameters. In our case, these properties make the U-Net an excellent baseline for our segmentation.

Figure 2.13: Structure of the original U-Net. The blue blocks represent convolutional layers, while the grey arrows are skip connections. Channel and feature dimensions are written alongside the blocks. Image taken from [28]
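A minimal PyTorch sketch of the encoder-decoder-with-skips pattern described above, reduced to a single down-sampling level; the channel counts are illustrative.

```python
import torch
import torch.nn as nn

class TinyUNet(nn.Module):
    """One-level U-Net: the skip lets the decoder reuse encoder detail."""
    def __init__(self, c=16, num_classes=19):
        super().__init__()
        self.enc = nn.Sequential(nn.Conv2d(3, c, 3, padding=1), nn.ReLU())
        self.down = nn.Sequential(nn.MaxPool2d(2),
                                  nn.Conv2d(c, 2 * c, 3, padding=1), nn.ReLU())
        self.up = nn.ConvTranspose2d(2 * c, c, 2, stride=2)
        self.dec = nn.Sequential(nn.Conv2d(2 * c, c, 3, padding=1), nn.ReLU(),
                                 nn.Conv2d(c, num_classes, 1))

    def forward(self, x):
        skip = self.enc(x)                    # high-resolution features
        bottleneck = self.down(skip)          # compressed representation
        up = self.up(bottleneck)              # back to input resolution
        return self.dec(torch.cat([up, skip], dim=1))  # skip concatenation
```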

2.4.2 State of the art of semantic segmentation

Recently, several methods proved the importance of multi-scale image parsing to exploit image-level context and detailed boundaries together. Pyramid Scene Parsing (PSPNet) [41] originally proposed the Pyramid Pooling Module. This module successfully incorporates context information through a pyramidal pooling of the feature map at several different resolutions. Each resolution is then processed individually and, finally, the different volumes are up-sampled to the original scale and concatenated together with the original volume. Worth mentioning is that this method inspired the backbone module of the popular stereo method PSMNet [3]. More recently, DeepLab [42] proposed the Atrous Spatial Pyramid Pooling module, able to effectively enlarge the field of view and incorporate more context. New approaches also exploit edge cues to improve the segmentation boundaries, i.e. the part that is usually most prone to error [43].

Figure 2.14: Overview of PSPNet with highlight on the Pyramid Pooling Module architecture. Image taken from [41]

2.4.3 Fast segmentation networks

All the aforementioned networks, however, achieve high accuracy at the expense of speed: none of those methods can run in real time, even on the most powerful GPUs. Moreover, we have to consider that in real applications semantic segmentation should be able to run faster than real time, since it usually represents only a preprocessing step for other time-critical tasks. Furthermore, segmentation is usually employed in mobile robotics, which also implies the need for good performances on low-power devices. These requirements justify the great research interest in fast and computationally efficient approaches. Among them, we consider two of the fastest designs: Residual Pyramid Learning (RPNet) [44] and two-branch architectures (ContextNet [45] and Fast-SCNN [46]). Both will be further considered for our segmentation network. To reach high inference efficiency, RPNet proposes a single-shot architecture. This means that instead of having a classic encoder-decoder architecture (such as U-Net or SegNet [47]), the only convolutional branch is the encoder, while the decoder is composed of simple additions of residuals. The encoder produces three different predictions, at shallow, mid, and deep levels, that are recomposed for the final segmentation output. Moreover, the full network operates at half resolution to further decrease the computational cost. This results in an inference rate on the NVIDIA Jetson TX2 of 5FPS with a meanIOU of 67% on CityScapes [48].

Figure 2.15: Overall structure of RPNet. As described in the illustration, the decoding of the segmentation is obtained with simple residual additions and is therefore computationally very light. Image taken from [44]

An alternative, but not radically different, approach for fast segmentation has been proposed by Rudra PK Poudel et al. in the two works ContextNet and Fast-SCNN. They propose different kinds of "two-branch" networks: a deep encoder that processes very low-resolution features to include more context, in parallel with a second, much shallower branch that encodes high non-linearities such as edges and small objects. The two branches are then added at a late stage of the network to produce the final class volume. They prove that the deep branch is beneficial for extracting context information, and they also empirically show that deeper architectures improve accuracy at the expense of time. This second problem is then tackled with the introduction of depth-wise separable convolutions, which significantly reduce the computational cost. They do not provide results on embedded platforms, but they reached more than 100FPS on the desktop GPU NVIDIA Titan Xp.

Figure 2.16: Architecture overview of Fast-SCNN. The network structure is composed of two parallel branches: one deep branch at low resolution and one shallow branch at high resolution. Image taken from [46]
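The depth-wise separable convolution mentioned above factorizes a standard convolution into a per-channel spatial filter and a 1x1 channel mixer, cutting the per-pixel multiplications roughly from k² · C_in · C_out to k² · C_in + C_in · C_out; a sketch in PyTorch:

```python
import torch.nn as nn

class DepthwiseSeparableConv(nn.Module):
    """Factorized convolution used by fast segmentation networks."""
    def __init__(self, c_in, c_out, k=3):
        super().__init__()
        # depthwise: one spatial filter per input channel (groups = c_in)
        self.depthwise = nn.Conv2d(c_in, c_in, k, padding=k // 2, groups=c_in)
        # pointwise: 1x1 convolution mixes channels
        self.pointwise = nn.Conv2d(c_in, c_out, 1)

    def forward(self, x):
        return self.pointwise(self.depthwise(x))
```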

In the following chapter, we will thoroughly explain our novel architecture and show where our design adopts and embeds many of the aforementioned ideas and contributions.


3 Methods

In this chapter, we will illustrate the applied methodologies and the motivations that drove our design strategies. We will first introduce the concept of the network (Section 3.1) to specify the desired output, and then provide a general overview of the proposed convolutional neural network (Section 3.2). The chapter continues with one specific section on each of the two novel modules we designed: the Siamese Bifid U-Net module for feature extraction and semantic segmentation (Section 3.3) and the Synergy disparity refinement module (Section 3.4). Following that, we will discuss the employed objective functions (Section 3.5), and finally we will describe the implementation details (Section 3.6).

3.1 Concept

Given the goal of achieving disparity regression and semantic segmentation in a multi-task fashion, we organize the network into the following logical components:

• A common feature extractor able to provide shared embeddings both for the disparity cost volumes and for the segmentation network.

• A segmentation network that receives the encoded shared embeddings and focuses on the specific task of semantic segmentation.

• A stereo disparity network that uses the extracted features to compose cost volumes, regularize them, and finally extract the stereo disparity.

• A final refinement module that merges disparity and semantic embeddings to further improve the disparity maps by leveraging segmentation cues.



3.2 Architecture Overview

As the backbone disparity architecture, we adopted the model proposed in AnyNet. AnyNet indeed has several perks that make it a suitable baseline for our development:

• AnyNet is the fastest published method able to perform stereo disparity regression, reaching an inference rate of 10FPS at the highest resolution and up to 35FPS at the lowest. The accuracy, though, is significantly worse (more than double the error) than most of the other available models, with a D1 error over 6% at the highest resolution and more than 10% at the lowest.

• AnyNet is an end-to-end fully convolutional network. This makes its structure very flexible and versatile to significant changes in several parts of the architecture.

• AnyNet presents a highly residual structure. In particular, the Siamese feature extractor is a U-Net with pyramidal outputs, and the disparity network is fully residual as well. Indeed, the "full disparity" is regressed only at the lowest resolution, while the remaining higher-resolution stages use cost volumes composed of disparity residuals between the left and the warped-right features. This smart use of photometric re-projection forces the cost volumes to only include the residual errors of the previous disparity. The previous disparity is then up-sampled and added to the new residuals to get the higher-resolution output (AnyNet is explained in further detail in Chapter 2, Section 2.3). The residual architecture represents a great perk in terms of computational burden and will allow us to design a symmetric network for the semantic segmentation and disparity refinement modules.

Our first intuition starts from the consideration that AnyNet already uses a Siamese U-Net to extract pyramidal residual features for the disparity decoder, and the U-Net [28] has proved itself a solid base for semantic segmentation (Section 2.4). This suggests partially preserving the existing feature extractor architecture, while also imposing the design of an additional decoder uniquely focused on semantic segmentation. This changes the feature extractor to a "bifid" shape, since from a common encoder we branch two parallel decoders, one for disparity and one for segmentation; we will refer to this module as the Bifid U-Net. At this stage, the network is able to provide disparity and segmentation outputs in a parallel fashion; however, a further and deeper exploitation of the segmentation cues is possible, and it can be achieved with a specific refinement module. Influenced both by the AnyNet residual structure and by previous successful works [37, 38], we decided to merge disparity cost volumes and semantic embeddings into hybrid volumes. Differently from other works, though, we want to keep this part of the architecture fully residual and pyramidal as well. We will refer to this second module as the Synergy disparity refinement module. In the following sections, we will thoroughly analyze both of these modules as well as other significant changes regarding the objective functions. In Figure 3.1, we propose a synthetic concept scheme of the designed architecture.

Figure 3.1: Full architecture overview. From an input stereo pair, the network produces a semantic segmentation output (mIOU 62.3% at 15 FPS) and disparity outputs at two refinement levels (D1 4.0% at 8.1 FPS and D1 3.2% at 6.3 FPS).


3.3 Siamese Bifid U-Net module

As we have already explained in Chapter 2, stereo disparity features are usually obtained by Siamese feature extractors. In the case of AnyNet, the feature extractor is a U-Net that outputs features at three different resolutions: 1/16, 1/8, and 1/4. A significant difference of this approach compared to a classic U-Net is that the processed concatenated volume is used as an output at each stage of the decoder, and not only at the highest resolution (similarly to [49]).

3.3.1 Network Structure

To add the segmentation, we need to design an additional independent decoder.

A naive approach would be to attach the decoder directly to the lowest-resolution embedding, symmetrically to the disparity decoder. This approach would, however, encounter several limitations:

• We would load too many tasks onto one single embedding (i.e., the encoder embedding at resolution 1/16, see Figure 3.2): it is not guaranteed that the encoded features could reach a sufficient level of generality to successfully fulfill both disparity and segmentation requirements. This problem is even more accentuated when the whole architecture has to rely on only a few convolutional layers with a limited number of filters.

• As proved by ContextNet [45] and Fast-SCNN [46], semantic segmentation requires and benefits from deeper structures that allow the network to incorporate more context.

These two reasons suggested going one level deeper (resolution 1/32) in the encoder, creating a sort of "bridge" between the two decoders. Moreover, to alleviate the demand for feature generalization at the disparity bottleneck, we add an additional convolutional block before the first pyramidal output. A further implication of this choice is that the deepest level of the encoder will receive gradients only from the segmentation objective function, allowing further specialization, while the first three encoder blocks will receive gradients from both disparity and segmentation, granting the extraction of generalized features. As anticipated, for the segmentation decoder we want to keep a residual structure similar to the one adopted in the disparity network of AnyNet. This is done both to keep the computational burden as low as possible and to request similar (residual) embeddings from the shared encoder. Each pyramidal decoder stage then follows this logic:

Figure 3.2: Bifid U-Net feature extractor (shared encoder at resolutions 1/4, 1/8, 1/16, and 1/32 feeding the disparity decoder and the semantic decoder; legend: 2D conv block, 2D conv layer, addition, data flows).

1. Feature concatenation (encoder skip connection + upsampled residual) and volume processing.

2. Compression from embeddings to segmentation class volume.

3. Residual element-wise addition with the previous up-sampled class volume (when available, i.e., at stages 2 and 3).

Symmetrically to the AnyNet disparity network, the full segmentation is only computed at the lowest resolution and further refined by the following stages through direct residuals. This mechanism, added to the intrinsic residual structure of the U-Net, allows the segmentation decoder to obtain a coarse (but complete) segmentation at very low resolution (1/16), while the following stages only focus on smaller image details, such as boundaries or other ill-posed areas.
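The three steps map naturally onto a small PyTorch module. The sketch below is an illustrative assumption of one such stage; the class count, block widths, and names are ours, not the exact implementation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class SegDecoderStage(nn.Module):
    """One pyramidal stage: fuse features, predict classes, add the residual."""
    def __init__(self, c_skip, c_up, n_classes=19):
        super().__init__()
        c_cat = c_skip + c_up
        self.fuse = nn.Sequential(nn.BatchNorm2d(c_cat), nn.ReLU(inplace=True),
                                  nn.Conv2d(c_cat, c_skip, 3, padding=1))
        # Bottleneck compressing embeddings into the segmentation class volume.
        self.to_classes = nn.Conv2d(c_skip, n_classes, kernel_size=1)

    def forward(self, skip, up_feat, prev_seg=None):
        # 1. Concatenate the encoder skip connection with the (already
        #    up-sampled) residual features, then process the volume.
        feat = self.fuse(torch.cat([skip, up_feat], dim=1))
        # 2. Compress embeddings to per-class scores.
        seg = self.to_classes(feat)
        # 3. Residual addition with the previous up-sampled prediction
        #    (only available at stages 2 and 3).
        if prev_seg is not None:
            seg = seg + F.interpolate(prev_seg, scale_factor=2,
                                      mode="bilinear", align_corners=True)
        return feat, seg
```

Stage 1 omits `prev_seg` and produces the coarse but complete 1/16 segmentation; stages 2 and 3 only add residual corrections on top of it.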

This allows us to reach a satisfactory segmentation accuracy with an extremely low computational burden: the whole network adds an overhead of just 0.025 seconds to the inference time on a Jetson TX2, and the full Siamese network (consisting of two graph calls for disparity and one call for segmentation), if working standalone, runs at 15 FPS on the TX2. The major drawback of this architecture is its still limited depth in terms of number of convolutional layers, filters, and down-sampling steps. As we will see in Chapter 4, failure cases are mostly related to details, boundaries, ambiguous objects, or scenarios greatly different from the training set. The major perks are instead the very low computational burden and the high versatility: we will show that the same architecture can easily be adapted, with a simple change in the number of parameters, to fulfill different requirements of speed and accuracy. Moreover, since this module provides three segmentation maps at different resolutions (1/16, 1/8, and 1/4), we still respect the anytime setting. As a final remark, we also want to specify that, even if during the training phase the segmentation is performed twice (since the network is Siamese), at test time only the left one is employed.

3.3.2 Operational Block

Each block of the architecture is fully convolutional, and the operational block is composed of the sequence batch normalization, rectified linear unit (ReLU) activation, and 2D convolution.

The number of layers and filters can be set according to the required performance. The number of channels, as in the original AnyNet implementation, doubles every time the image resolution is halved. In our proposed network we use two layers at each stage, with an additional block of two layers after the disparity bottleneck, one additional layer for the bottleneck transformation of the embeddings into segmentation classes, and one additional bottleneck layer after the disparity features. The bottleneck applied after the disparity feature extraction decouples the number of channels in the feature extractor from the disparity network. It is indeed important to limit the number of filters in the disparity network as much as possible, since the cost volume formation and the 3D convolutions used in the volume regularization are extremely computationally expensive. We will present experiments with and without disparity embedding compression, showing that for a higher number of initial channels the disparity bottleneck effectively reduces the computational burden. A sketch of the operational block and of the bottleneck follows.
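For reference, a minimal sketch of the operational block and of the 1×1 bottleneck just described; the channel numbers are illustrative assumptions.

```python
import torch.nn as nn

def op_block(c_in, c_out, n_layers=2):
    """Stack of BN -> ReLU -> 3x3 conv layers (two per stage in our setup)."""
    layers, c = [], c_in
    for _ in range(n_layers):
        layers += [nn.BatchNorm2d(c), nn.ReLU(inplace=True),
                   nn.Conv2d(c, c_out, 3, padding=1)]
        c = c_out
    return nn.Sequential(*layers)

# 1x1 bottleneck decoupling the extractor width from the disparity network:
# e.g. compress 32-channel features to 8 before the (expensive) cost-volume
# formation and 3D regularization convolutions.
disparity_bottleneck = nn.Conv2d(32, 8, kernel_size=1)
```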

3.4 Synergy disparity refinement module

Once the Bifid U-Net is in place, the overall network can provide both semantic segmentation and stereo disparity together. We also claim, and will empirically prove, that the extracted features are now more general and therefore more robust, implying a higher accuracy. However, once the encoded embeddings have been further processed by the disparity and the segmentation decoders, we have two task-specific embeddings. In particular, in the disparity network these features take the form of regularized cost volumes (originally obtained through the shifted differences between the right and left U-Net disparity features), while in the semantic network the features are obtained directly from the semantic decoder. As said, these features have now been further refined to fulfill the specific task they are assigned to (in Section 3.5 we will further describe the employed objective functions). This means they carry deeper and more meaningful representations of one of the two tasks. Recent work [37, 38] proved that semantic features are a better descriptor for some image details and can therefore help the disparity regression task. Edges and corners, for example, might be easily inferred by a segmentation network, especially when they divide two different classes. The same details can instead be harder to address correctly with disparity cost volumes, leading to blurry edges, hole "infilling", or other inconsistencies.

One simple yet effective way to leverage both sets of features to improve the disparity regression is to concatenate them into one hybrid volume.

3.4.1 Network Structure

Although this straightforward process has proved successful so far, it is also the only technique experimented with for this task, so we decided to push the method one step further. In light of the fully residual strategy adopted both in the disparity network and in the segmentation decoder, we aim to reproduce it also in the final refinement module. As always, this procedure has several advantages: it is computationally less expensive, it allows the network to focus on non-linearities, and it respects the "anytime" setting originally presented by AnyNet and preserved in the Bifid U-Net. To achieve this, instead of one simple concatenation, we perform a cascade of residual concatenations between semantic and disparity volumes.

3.4.2 Operational Block

Step by step, the procedure is the following, and it is repeated at each resolution stage (low, medium and high):

1. Semantic feature compression: The semantic features are taken from the semantic decoder before their transformation into the class volume. Since we will create unified hybrid volumes, we might need a bottleneck to balance the channel counts of the two feature sets; moreover, reducing the number of filters also reduces the computational time.

2. Concatenation: We then concatenate the two feature volumes; at the medium- and high-resolution stages, we also concatenate the previously refined disparity volume (properly up-sampled).

3. Fusion refinement: To obtain an improved disparity cost volume, we process the hybrid volume into a new one with the same dimensions as the input disparity embedding. This operation is handled with a convolution block. Moreover, to focus the network on a refinement task on top of the input disparity cost volume, we provide the input disparity volume as a skip connection and add it back after the fusion process.

Finally, we obtain the refined disparity cost volume, which we simply feed forward (up-sampled) to the higher-resolution level for the next hybrid concatenation. As a final step, a simple argmin operation over the volume yields the full disparity map.

Regarding the implementation details, the compression and semantic preprocessing module is composed of a simple convolutional block with an initial bottleneck that reduces the number of channels of the semantic features. For a complete and detailed network architecture scheme, refer to Figure 3.3. A minimal sketch of one refinement stage follows.
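Putting the three steps together, below is a minimal PyTorch sketch of one refinement stage. Treating the disparity hypotheses as channels and fusing with a single 2D convolution block is a simplification, and all channel counts and names are illustrative assumptions rather than the exact implementation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class SynergyRefinementStage(nn.Module):
    """Residual fusion of a disparity cost volume with semantic embeddings."""
    def __init__(self, c_sem, c_sem_small, c_cost, c_prev=0):
        super().__init__()
        # 1. Semantic feature compression (channel-balancing 1x1 bottleneck).
        self.compress = nn.Conv2d(c_sem, c_sem_small, kernel_size=1)
        # 3. Fusion block mapping the hybrid volume back to cost-volume size.
        c_hybrid = c_sem_small + c_cost + c_prev
        self.fuse = nn.Sequential(nn.BatchNorm2d(c_hybrid),
                                  nn.ReLU(inplace=True),
                                  nn.Conv2d(c_hybrid, c_cost, 3, padding=1))

    def forward(self, cost_volume, sem_feat, prev_refined=None):
        sem = self.compress(sem_feat)
        # 2. Concatenate into one hybrid volume; at the medium/high stages
        #    also append the up-sampled previously refined volume.
        parts = [cost_volume, sem]
        if prev_refined is not None:
            parts.append(F.interpolate(prev_refined, scale_factor=2,
                                       mode="bilinear", align_corners=True))
        hybrid = torch.cat(parts, dim=1)
        # Skip connection: the block only learns a refinement of the input.
        return cost_volume + self.fuse(hybrid)

# Read-out at any stage: pick, per pixel, the hypothesis with minimum cost.
# disparity = refined_volume.argmin(dim=1, keepdim=True).float()
```

The skip connection in the forward pass is what keeps the module residual: the fusion block only needs to model the correction suggested by the semantic cues, not the whole cost volume.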
