
Master of Science Thesis in Electrical Engineering

Department of Biomedical Engineering, Linköping University, 2019

Active Stereo Reconstruction using Deep Learning

Helena Kihlström
LIU-IMT-TFK-A–19/569–SE

Supervisor: Tuan Pham

IMT, Linköping University

Erik Ringaby

SICK Linköping

Examiner: Anders Eklund

IMT, Linköping University

Division of Medical Informatics
Department of Biomedical Engineering
Linköping University
SE-581 83 Linköping, Sweden

Copyright © 2019 Helena Kihlström

Abstract

Depth estimation using stereo images is an important task in many computer vision applications. A stereo camera contains two image sensors that observe the scene from slightly different viewpoints, making it possible to find the depth of the scene. An active stereo camera also uses a laser projector that projects a pattern into the scene. The advantage of the laser pattern is the additional texture that gives better depth estimations in dark and textureless areas.

Recently, deep learning methods have provided new solutions producing state-of-the-art performance in stereo reconstruction. The aim of this project was to investigate the behavior of a deep learning model for active stereo reconstruction, when using data from different cameras. The model is self-supervised, which solves the problem of having enough ground truth data for training the model. It instead uses the known relationship between the left and right images to let the model learn the best estimation.

The model was separately trained on datasets from three different active stereo cameras. The three trained models were then compared using evaluation images from all three cameras. The results showed that the model did not always perform better on images from the camera that was used for collecting the training data. However, when comparing the results of different models using the same test images, the model that was trained on images from the camera used for testing gave better results in most cases.


Acknowledgments

I want to thank SICK Linköping for giving me the opportunity to perform this thesis project. I would like to direct a thanks to everyone at SICK who has contributed in one way or another. A special thanks to my supervisor Erik Ringaby for his valuable feedback throughout the project. I would also like to thank my examiner and supervisor at Linköping University, Anders Eklund and Tuan Pham, for their time and help with the thesis. Finally, I want to thank my friends and family for always supporting me.

Linköping, June 2019 Helena Kihlström


Contents

Notation

1 Introduction
  1.1 Background
    1.1.1 Classical methods
    1.1.2 Related work
  1.2 Problem formulation
  1.3 Limitations
  1.4 Thesis outline

2 Theoretical background
  2.1 Stereo correspondence
    2.1.1 Epipolar geometry
    2.1.2 Image rectification
  2.2 Stereo reconstruction methods
    2.2.1 Disparity map algorithms
    2.2.2 Common problems in active stereo reconstruction
  2.3 Deep learning
    2.3.1 Artificial neural networks
    2.3.2 Convolutional neural networks
    2.3.3 Residual neural networks
    2.3.4 Training a neural network
    2.3.5 Learning a proper data fitting

3 Method
  3.1 Model architecture
    3.1.1 Matching cost computation
    3.1.2 Cost aggregation
    3.1.3 Disparity computation
    3.1.4 Disparity refinement
    3.1.5 Invalidation network
  3.2 Training
    3.2.1 Cameras
    3.2.2 Data
    3.2.3 Loss computation
  3.3 Evaluation
    3.3.1 Qualitative evaluation
    3.3.2 Quantitative evaluation

4 Results
  4.1 Results from qualitative evaluation
  4.2 Results from quantitative evaluation

5 Discussion
  5.1 Results
    5.1.1 Qualitative results
    5.1.2 Quantitative results
    5.1.3 Summary
  5.2 Method

6 Conclusions
  6.1 Answers to research questions
  6.2 Future work


Notation

Abbreviations

Abbreviation  Meaning

ANN           Artificial neural network
CLR           Cyclic learning rate
CNN           Convolutional neural network
LCN           Local contrast normalization
ReLU          Rectified linear unit
ResNet        Residual neural network
STN           Spatial transformer network

1 Introduction

Depth estimation using stereo images is an important task in many computer vision applications. The recent advent of deep learning has provided new learning-based solutions producing state-of-the-art performance in stereo reconstruction. This thesis provides an investigation of such a solution for an active stereo setting.

The purpose of the project was to evaluate an end-to-end deep learning method for active stereo reconstruction, according to the questions given in section 1.2.

1.1 Background

The goal of stereo reconstruction is to find the depth of a scene given images from two image sensors observing the scene from slightly different viewpoints. An active stereo camera consists of two image sensors and one laser projector projecting a pattern into the scene. The advantage of adding the active illumination is that it results in better depth estimations in dark and textureless areas, where passive stereo methods - methods using only two image sensors - normally struggle [13]. Figure 1.1 shows an example scene captured by an active stereo camera.

1.1.1 Classical methods

The classical methods for depth estimation from stereo image pairs are divided into local methods and global methods. Local methods are based on finding patch-wise correspondences, where one takes a patch around a desired pixel in one image and finds the most suitable matching patch in the other image given the epipolar constraints. Global methods are based on optimization algorithms minimizing some energy function for all disparity values [5]. Global methods are computationally cumbersome, which leads to a low framerate and/or a compromised depth accuracy [16].

(a) Left IR image. (b) Right IR image.

Figure 1.1: An example showing the left and right output from an active stereo camera. Note the dot pattern from the laser projector, flooding the scene with texture. By finding the projections of the same point in both images, one can compute the depth to that point in the scene.

1.1.2 Related work

Some methods have been presented for stereo reconstruction, where a part of the process is learning-based. There are, for example, approaches that train a neural network to learn a feature space where a more efficient patch-wise matching can be performed [2, 3]. There are also some deep learning methods for passive stereo reconstruction implemented in an end-to-end manner [9, 11].

One problem that occurs when transferring the problem to an active stereo case is that there are no large scale data sets available for training. However, it has recently been shown that it is possible to implement a self-supervised end-to-end solution for active stereo systems, which solves the problem of lacking ground truth data (ActiveStereoNet [20]). Instead of comparing the result to a ground truth reconstruction, this solution uses the known relationship between the left and right images to formulate the objective function [20].

The method mentioned above is claimed to enable reconstruction results with a subpixel precision that is one order of magnitude better than other active stereo matching methods. This is done at a framerate around 60 Hz when running on images with resolution 1280 × 720 on a high-end GPU [20].

1.2 Problem formulation

The project aimed to investigate the behavior of an end-to-end deep learning method for active stereo reconstruction when using different cameras. The following questions will be answered.

1. Does the model perform better when being evaluated on images from the camera that was used for collecting the training dataset, compared to models trained using images from other cameras?

2. Does the model perform better when being evaluated on images from the camera that was used for collecting the training dataset, compared to when being evaluated on images from other cameras?

3. How does the model perform compared to the stereo reconstruction methods implemented on the cameras used to collect training data?

1.3 Limitations

One limitation for this project was the time resource; the project was to be realized within approximately 800 hours. The number of cameras was limited to three active stereo cameras that were available for the project; two cameras from the Intel RealSense series [10] and one camera from SICK.

1.4 Thesis outline

The remaining chapters of the thesis are structured as follows. Chapter 2 provides the theoretical background needed to understand the thesis. Chapter 3 describes the method that was used in order to answer the research questions. Chapter 4 presents the results of the experiments. In Chapter 5, the results and the method are discussed. Finally, Chapter 6 presents answers to the research questions as well as general conclusions of the project.

2 Theoretical background

This chapter provides the theoretical background needed to understand the thesis. First, stereo correspondence and stereo reconstruction are described. Thereafter, an introduction to neural networks and deep learning is given.

2.1 Stereo correspondence

A stereo camera consists of two image sensors placed a distance b apart from each other, where b is called the baseline for the camera. The task is to find the correspondence of each pixel between the images captured by the two sensors.

The setting where only the camera sensors are used will be referred to as passive stereo. Passive stereo usually fails in textureless and dark regions, where correspondences are hard to find. A remedy to this is active stereo, where a laser projector projects a pattern into the scene providing additional texture [13].

The following sections give an introduction to the stereo correspondence problem.

2.1.1 Epipolar geometry

Two image points are said to be corresponding when they are projections of the same 3D point. To understand stereo correspondence, an introduction to epipolar geometry is in order. Epipolar geometry is the geometry of two images captured of the same scene but from two distinct viewpoints [15].

When a point p0 in the scene is captured by an image sensor, it is projected onto the point x0 in the image plane according to x0 = C0 p0, where C0 is the camera matrix defining the position and rotation of the sensor relative to the world coordinate system defining the point p0. This can be visualized as the point where the projection line, the line from the point p0 to the camera center c0, crosses the virtual image plane.

Figure 2.1: A visualization of epipolar geometry, describing correspondences between the image points. The points p0 and p∞ are projected onto different image points in the right image, while projected onto the same image point in the left image.

Figure 2.1 shows an example of a point captured by two sensors. It is clear that any point pL(s) = c0 + s(p0 − c0) will be projected onto the same image point x0 in the left image, whereas each pL(s) will be projected onto a line in the right image. This line is called an epipolar line, and it is bounded by the projection e1 of the left sensor's camera center c0 and the projection of p∞ – the point along pL(s) that is infinitely far away. The point e1 is the epipolar point, or epipole, of the right image.

Consequently, any point pR(s) = c1 + s(p0 − c1) will be projected somewhere onto the corresponding epipolar line of the left image. That is, the line bounded by the epipole e0 and the projection in the left image of the point pR(s) that lies infinitely far away.

It is important to point out that a point projected onto x1 in the right image does not have to be the same point p0 that was projected onto x0 in the left image. It might e.g. be the case that p0 is occluded by a point p1 of some closer object. In this case, unless p1 is occluded from the left view, this point will be projected on the epipolar line of the left image [18]. See figure 2.2.

Figure 2.2: An example demonstrating occlusion. The point p0 is occluded by the closer point p1 in one of the views.

2.1.2 Image rectification

We now know from section 2.1.1 that for a pixel in one image, the corresponding pixel can be found along the epipolar line in the other image. However, a search along each individual epipolar line would be a bit cumbersome. Therefore, one wants to use rectified images.

A camera pair is said to be weakly rectified if the sensors' optical axes are perpendicular to the baseline separating the sensors, see figure 2.3. With this setup, the epipolar lines of the images will be parallel in each image. This is because the epipoles of both sensors will lie at infinity [15].

Figure 2.3: A rectified setup. The optical axes of the image sensors are perpendicular to the baseline separating them.

An even more useful setup is fully rectified cameras, where also the image coordinate axes are aligned so that the epipolar lines are horizontal. This means that each pair of corresponding points in the images shares the same vertical coordinate, i.e. lies on the same row, in both images.

Since a fully rectified stereo rig is hard to achieve (and maintain over time), one usually performs a synthetic rectification of the images. This is done by applying a linear transformation H to each image, where each new image pixel is obtained by

x'L = HL xL  and  x'R = HR xR.

HL and HR are called rectifying homographies. The reader interested in how to find rectifying homographies is referred to [15].

2.2 Stereo reconstruction methods

When using rectified images, the task in stereo reconstruction is to find the horizontal disparity d for each point captured in both images. The disparity is simply the displacement of the corresponding point from one image to the other. The resulting image with a disparity value at each pixel is called a disparity map. The depth z to the point projected onto the pixel (u, v) is obtained by

z(u, v) = f b / d(u, v),

where f is the focal length of the sensors and b is the baseline separating the sensors. The depth error ε depends on the depth according to

ε = δ z² / (b f),

where δ is the subpixel disparity precision [18].
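As an illustration (not part of the thesis implementation), the depth relation above can be applied per pixel to turn a whole disparity map into a depth map. The following NumPy sketch assumes a focal length given in pixels and a baseline in metres; the function name and the handling of zero disparity are assumed conventions.

```python
import numpy as np

def disparity_to_depth(disparity, focal_length_px, baseline_m, eps=1e-6):
    """Convert a disparity map (pixels) to a depth map (metres).

    Assumes rectified images, a focal length given in pixels and a baseline
    given in metres. Pixels with (near-)zero disparity are mapped to
    infinity, since no depth can be triangulated there.
    """
    disparity = np.asarray(disparity, dtype=np.float64)
    depth = np.full_like(disparity, np.inf)
    valid = disparity > eps
    # z = f * b / d for every valid pixel
    depth[valid] = focal_length_px * baseline_m / disparity[valid]
    return depth
```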

Some common approaches to disparity estimation, as well as some common problems, are explained in the following sections.

2.2.1 Disparity map algorithms

Generally, a disparity map algorithm follows four steps:

1. Matching cost computation
2. Cost aggregation
3. Disparity computation
4. Disparity refinement

The algorithm assumes a rectified image pair which is passed through the steps above, and the final output is a smooth disparity map. Disparity map algorithms can generally be divided into local methods and global methods. Depending on the type of method, the four steps above are of varying importance. Global methods usually skip step 2 and go straight to step 3 [16].

Local methods

Local methods are window based and consider only local information within a support window around each pixel. Classically, the matching cost computation for each pixel is based on the pixel intensities within the support window in the reference image in relation to the pixel intensities within the support window around a candidate pixel in the other image [5]. Some modern approaches instead use feature representations of the patches for a more efficient matching, where the representation space is learned by a neural network [2, 3].

The information within the patches can be compared in different ways. Common approaches are to calculate the sum of absolute differences or the sum of squared differences between each pixel pair within the patches. Other approaches use e.g. the relationship between the center pixel and each of the other pixels within the patch as a representation of the patch to compare with the representation of a candidate patch [5].

The cost at each pixel position (u, v) for each candidate disparity d forms a cost volume V(u, v, d). The costs are then aggregated by filtering of V(u, v, d), either in 2D (supporting fronto-parallel surfaces) or 3D (supporting slanted surfaces).

Finally, the disparity map is obtained by, for each pixel, choosing the disparity corresponding to the lowest cost in a winner-take-all manner [16].
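For intuition, a naive local method can be sketched in a few lines. The example below (all names, window sizes and the use of SciPy are assumptions, not the thesis code) builds a sum-of-absolute-differences cost volume by shifting the right image and selects the disparity in a winner-take-all manner.

```python
import numpy as np
from scipy.ndimage import uniform_filter

def sad_disparity(left, right, max_disp=64, patch=5):
    """Naive local block matching with a winner-take-all selection.

    For every candidate disparity d, the right image is shifted d pixels to
    the right and the sum of absolute differences over a small support
    window forms the cost volume V(u, v, d).
    """
    left = left.astype(np.float32)
    right = right.astype(np.float32)
    h, w = left.shape
    cost_volume = np.zeros((h, w, max_disp), dtype=np.float32)

    for d in range(max_disp):
        shifted = np.zeros_like(right)
        if d == 0:
            shifted[:] = right
        else:
            shifted[:, d:] = right[:, : w - d]
        # Aggregate the per-pixel absolute difference over the support window.
        cost_volume[:, :, d] = uniform_filter(np.abs(left - shifted), size=patch)

    # Winner-take-all: pick the disparity with the lowest aggregated cost.
    return np.argmin(cost_volume, axis=2)
```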


Global methods

Global methods aim to minimize a global energy function over all disparity values for each pixel. The cost computation can be performed by e.g. computing the absolute difference or the squared difference between the pixel intensity values within a disparity range. At this step, most global methods formulate the energy function to minimize, including a smoothness term that assumes smoothness in the scene.

Step 3, the disparity map computation, is now performed by optimization of the global energy. The chosen disparity map is the one that minimizes the energy [18].

Disparity refinement

When the disparity map is computed, many methods also perform a refinement of the disparity at a sub-pixel level. This can be done in several ways, e.g. by fitting a curve to the matching costs. Other ways to improve the final disparity can be to apply a median filter to get rid of spurious values, or to detect occluded regions by comparing left-to-right and right-to-left disparity maps [16].

Shortcomings in classical methods

In the local methods, all pixels within a support window are assumed to have similar disparity values. This implicitly assumes smoothness, which might cause errors at edges and thin structures. The support window approach also assumes that the intensity values within two matching patches are similar independent of viewpoint, which might not be the case. Furthermore, the winner-take-all selection of disparity values only enforces uniqueness of matches for the reference image. This means that one pixel in the other image might be assigned matches from several pixels in the reference image, while another pixel might not be matched at all.

In the global methods, smoothness is explicitly assumed, which can cause the aforementioned errors. The global optimization is also very computationally expensive, which makes global reconstruction methods impractical for real time use [16].

2.2.2 Common problems in active stereo reconstruction

There are some common problems that might arise in active stereo reconstruction. Some of them are common for both passive and active stereo, such as occlusion. In stereo vision, there are usually regions that are visible to only one of the sensors while they are occluded from the other sensor's view. If one tries to match the occluded pixels, it will cause artifacts in the reconstructed image such as e.g. flying pixels, edge fattening and/or oversmoothing.

The laser pattern in active stereo also causes some problems. In passive stereo, one can usually gain a lot of performance by performing feature matching at a low resolution [11]. However, since the illumination pattern in active stereo is very dense, the reconstruction method must process high resolution images to match the high frequency of the pattern.

The method also has to avoid erroneously matched pixels due to local minima stemming from alternative alignments of the projected pattern. Furthermore, because of the distance dependent intensity of the laser, the projected pattern will be more intense on nearby surfaces than on distant surfaces. This requires a method that compensates for luminance differences [20].

2.3 Deep learning

The field of deep learning concerns the class of machine learning methods that are based on complex data processing architectures that learn data representations. Usually, these architectures are artificial neural networks involving a cascade of processing units learning representations on different abstraction levels [4].

The following sections explain different types of artificial neural networks, as well as important concepts for designing and training them.

2.3.1 Artificial neural networks

The concept of artificial neural networks (ANNs) is based on neuroscience. The biological neuron can receive electrical nerve impulses from other neurons and, depending on the received impulses, emit an impulse to other neurons. This has given inspiration to the mathematical model of the artificial neuron, receiving some input values x and outputting a value y as a function of x; y = f (x) [14]. See figure 2.4.

Figure 2.4: An artificial neuron. The output y is a function of the input values x.

The function f is a weighted sum of the input values and some bias, followed by a nonlinear activation function u. The activation function enables the network to learn nonlinear patterns. An example of a common activation function is

u(z) = max(0, z),

the rectified linear unit (ReLU). An adaptation of ReLU is the Leaky ReLU that allows a small positive gradient when the unit is not active, according to

u(z) = max(az, z),

where the slope a is a constant [4].


An ANN is a set of connected layers, each containing a set of artificial neurons. The values y at the output layer are the results of the chain of functions applied at each layer, y = fN(...f2(f1(x))), where x are the values fed into the network

at the input layer. The layers between the input layer and the output layer are called hidden layers. A deep neural network is a network that, unlike a shallow neural network, uses multiple hidden layers [4]. A simple example of an ANN with three layers is given in figure 2.5.

Figure 2.5: An example of a network with one hidden layer. A function fi,j is applied at each node (i, j).

Figure 2.6: An example of a one dimensional convolutional layer. The output is a convolution of the input x and the kernel f.

2.3.2 Convolutional neural networks

A convolutional neural network (CNN) is a special type of ANN that is most commonly used for processing grid-like data, such as images. It is called convolutional because it uses convolution instead of a general matrix multiplication in at least one of its layers. This means that the nodes of a convolutional layer share the same weights, but spatially shifted. See figure 2.6 for an example.

CNNs usually have sparse weights, meaning that the convolution kernel is smaller than the input data. While an input image can have thousands or millions of pixels, a convolution kernel with only tens or hundreds of elements can be sufficient to detect important local features in the image. This requires much fewer operations as well as fewer parameters to store and update in each iteration.

The size of the convolution output depends on the size of the filter kernel and/or whether or not we choose to zero pad the input. If no zero padding is chosen, the kernel will only visit positions where it is entirely contained within the input image. Thus, unless the kernel is of size 1 × 1, the output will be smaller than the input. By padding the input with zeros according to the size of the filter kernel, it is possible to obtain an output that is of the same size as the input. An example of such a convolution is given in figure 2.7.

Figure 2.7: An illustration of a convolution kernel of size 3 × 3 (the dotted region) sliding over a zero padded input image (the blue matrix), yielding an output of the same size as the input (the green matrix).

Usually, one wants each layer to extract many kinds of features. In the context of neural networks, what is denoted as a convolution layer is usually a set of convolutions applied in parallel. The number of output channels may therefore differ from the number of input channels [4].

Variations of the convolution operation

One way to decrease the computational cost is to downsample the input by only applying the filter kernel at a subset of the possible positions. By defining the stride s for the convolution, we choose to sample only every s-th pixel in each direction. It is also possible to use a separate stride for each direction. An example of a convolution with stride 2 is visualized in figure 2.8.

Figure 2.8: An illustration of a convolution with stride 2, where the kernel (the dotted region) is only applied at every second pixel in the input image (the blue matrix). The output image (visualized as the green matrix) is smaller than the input image because of the stride.

Another way to reduce the computational cost is to use dilated convolution, in which one inserts zeros into the filter kernel. By doing this, the receptive field can be increased without increasing the kernel size. The dilation rate d controls how many zeros to insert between the kernel elements, where the number of zeros inserted is d − 1 [4]. An example is shown in figure 2.9.

Figure 2.9: An example of a convolution with dilation 2, where every second pixel within the receptive field is sampled in each direction. The darker shaded pixels within the dotted region represent the pixels that are sampled from the input image (the blue matrix). The output (the green matrix) is of the same size as the input image because of the zero padding in this example.
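To make the stride and dilation parameters concrete, the sketch below implements a plain 2D convolution with both options in NumPy. It illustrates the definitions above and is not the convolution routine used in the network (which relies on TensorFlow's built-in operations); names and conventions are assumed.

```python
import numpy as np

def conv2d(image, kernel, stride=1, dilation=1):
    """2D convolution (no padding) with stride and dilation, for illustration.

    With dilation d, the effective kernel footprint grows from k to
    d*(k - 1) + 1 without adding parameters. With stride s, the kernel is
    only applied at every s-th position.
    """
    kh, kw = kernel.shape
    eff_h = dilation * (kh - 1) + 1           # effective receptive field height
    eff_w = dilation * (kw - 1) + 1
    out_h = (image.shape[0] - eff_h) // stride + 1
    out_w = (image.shape[1] - eff_w) // stride + 1
    out = np.zeros((out_h, out_w), dtype=np.float64)

    for i in range(out_h):
        for j in range(out_w):
            # Sample the input with the dilated kernel footprint.
            patch = image[i * stride : i * stride + eff_h : dilation,
                          j * stride : j * stride + eff_w : dilation]
            out[i, j] = np.sum(patch * kernel)
    return out
```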

2.3.3 Residual neural networks

It is clear that the number of layers – the depth of the network – is important, and that a deeper network can enable an increased accuracy because there are more possibilities for extracting features at different levels. However, it is not always the case that just stacking more layers leads to an improved accuracy. It turns out that just increasing the number of layers will eventually cause the accuracy to start degrading [6].

A remedy to this is residual learning, where one simply applies two parallel mappings onto the input signal x; one mapping F through the layers and one identity mapping that lets the signal skip some layers. Figure 2.10 shows an example of a simple residual network (ResNet).

Figure 2.10: A residual network. The input x is added to the output F(x) of the two layers.

The idea behind this construction is that for a desired underlying mapping H(x), it should be easier for the network to learn the residual F(x) = H(x) − x rather than the mapping H(x) itself. The output of the residual block is H(x) = F(x) + x, which gives the desired output. This construction applied in deep architectures has been shown to give better performance than plain (non-residual) networks of the same depth.

An example that might give a better intuition for residual learning is the case where the ideal mapping would be an identity mapping. It is easier to learn a zero residual, F(x) = 0, than to try to fit a plain network H(x) to an identity mapping [6].
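A minimal sketch of a residual block makes the skip connection explicit. A fully connected toy version is used here (not the convolutional residual blocks used later in the thesis), and the weight names are assumptions.

```python
import numpy as np

def relu(x):
    return np.maximum(0.0, x)

def residual_block(x, w1, b1, w2, b2):
    """A minimal fully connected residual block, for illustration only.

    Two affine layers with a ReLU in between compute the residual F(x);
    the skip connection adds the input back, so the block outputs
    H(x) = F(x) + x. Weight shapes must keep the dimension of x unchanged.
    """
    f = relu(x @ w1 + b1)      # first layer of the residual branch
    f = f @ w2 + b2            # second layer (no activation before the add)
    return relu(f + x)         # skip connection, then activation
```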

2.3.4 Training a neural network

The goal is to let the network learn the weights and biases at each layer that minimize some predefined loss function L that is dependent on the output y. Some parameters of a network are fixed, and hence not changed by the training process. These parameters are called hyperparameters. Examples of such parameters could be the learning rate, the number of hidden layers, the parameters of the activation functions etc. [4].

Optimization

The optimization is performed iteratively, using small batches of signals at each iteration. When the input signals have propagated through the network, the influence of each network parameter on the resulting loss is determined by backpropagation [4].

By calculating the partial derivatives with respect to each parameter, and thereby obtaining the gradient ∇L of the loss function, each parameter can be updated by taking a step in the direction of the steepest descent of the loss function according to

θ_{k+1} = θ_k − α η ∇L_θ,

where θ is the parameter to be updated, η is the learning rate and α is a parameter that might be dependent on the batch data and/or the gradient at each particular iteration [19].

Adam [12] is an optimization method that uses moving averages to estimate the first and second moments of the gradient of the loss function:

g_k = ∇_θ L
m_k = β1 m_{k−1} + (1 − β1) g_k
v_k = β2 v_{k−1} + (1 − β2) g_k²

Here, m_k and v_k are the estimations of the first and second moments of the gradient, respectively, at iteration k. The parameters m and v are initialized to 0, and β1, β2 ∈ [0, 1). To avoid moment estimates that are biased towards zero because of the initialization values, the estimates are bias-corrected according to

m̂_k = m_k / (1 − β1^k)  and  v̂_k = v_k / (1 − β2^k),

which yields the final update rule

θ_{k+1} = θ_k − α m̂_k / (√v̂_k + ε),

where ε is a small constant added to avoid division by zero [12].
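A plain NumPy version of one Adam update, following the equations above, might look as follows. The thesis itself uses TensorFlow's built-in Adam optimizer; this sketch (with assumed default hyperparameter values) is only meant to make the update rule concrete.

```python
import numpy as np

def adam_step(theta, grad, m, v, k, lr=1e-3, beta1=0.9, beta2=0.999, eps=1e-8):
    """One Adam parameter update, following the equations above.

    `k` is the 1-based iteration index; `m` and `v` are the running first
    and second moment estimates (initialized to zero).
    """
    m = beta1 * m + (1.0 - beta1) * grad          # first moment estimate
    v = beta2 * v + (1.0 - beta2) * grad ** 2     # second moment estimate
    m_hat = m / (1.0 - beta1 ** k)                # bias correction
    v_hat = v / (1.0 - beta2 ** k)
    theta = theta - lr * m_hat / (np.sqrt(v_hat) + eps)
    return theta, m, v
```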


Learning rate

Choosing an appropriate learning rate is essential when training a neural network. If the learning rate is too small, the model will converge very slowly. On the other hand, if the learning rate is too large, the model might never converge because it continues to miss the optimum by taking too large steps through the solution space.

It appears reasonable to start with a large learning rate and let it decrease during the training process. By doing this, the loss will decrease quickly in the beginning and still allow for a gentle convergence towards the optimum. However, if the optimizer reaches a bad local minimum or a saddle point when the learning rate has been significantly decreased, it might have a hard time finding its way out of it. A remedy to this is to use a cyclical learning rate, CLR. The idea of a CLR is to let the learning rate vary cyclically between a lower bound and an upper bound throughout the training process. It has been shown that although increasing the learning rate might have a short term negative effect on the model's performance, the overall effect of a CLR is still beneficial [17].

The period of a cycle is usually in the same order of magnitude as the number of iterations in an epoch – the number of iterations needed to iterate through the entire dataset once. Suitable upper and lower bounds for the CLR can be found by letting the learning rate increase linearly from a very small value during a number of iterations. The loss will usually decrease in the beginning, but as the learning rate increases, the loss will eventually start to explode. By plotting the loss with respect to the learning rate, one can choose the bounds of the interval where the loss decreases the most. To get the best result, one should stop training at the end of a cycle [17].
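A simple triangular schedule illustrates the idea. The exact shape of the schedule used in the thesis is only given as a plot (figure 3.6), so the triangular form below is an assumption, as are the function and parameter names.

```python
def cyclical_lr(iteration, lr_min, lr_max, cycle_length):
    """Triangular cyclical learning rate, as a simple illustration.

    The learning rate ramps linearly from lr_min up to lr_max during the
    first half of each cycle and back down during the second half.
    """
    position = (iteration % cycle_length) / cycle_length   # in [0, 1)
    if position < 0.5:
        fraction = position * 2.0          # rising half of the cycle
    else:
        fraction = (1.0 - position) * 2.0  # falling half of the cycle
    return lr_min + (lr_max - lr_min) * fraction
```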

Batch normalization

During training, a batch of samples from the dataset is fed through the network at each iteration. This means that each batch provides an estimation of the gradient over the entire dataset. However, the distribution of data usually varies between batches, which causes variations in the gradient from batch to batch. This will lead to shifts of the values in the hidden units, which is referred to as covariate shift. This behavior is undesired as it leads to slow convergence. The effect is even stronger as the network depth increases.

This problem is solved by using batch normalization at the end of the layer, which is a normalization of the activation output. This approach fixes the mean and variance of each layer input, and it also reduces the gradients' dependencies on the scale and initialization of the parameter values. This reduces the risk of divergence, which allows for a larger learning rate leading to faster convergence of the model.

The normalization is initially performed by subtraction of the batch mean and division by the batch standard deviation. However, this may change the representational abilities of the layer. To avoid this, each normalization layer also learns a scale γ and a shift β of the normalized value. For the input batch x = {x_i}_{i=1}^m, the mean and variance are

µ_x = (1/m) Σ_{i=1}^m x_i,    σ_x² = (1/m) Σ_{i=1}^m (x_i − µ_x)²

and the final output y of the normalization layer is defined as

y = γ x̂ + β,    x̂_i = (x_i − µ_x) / √(σ_x² + ε),

where ε is a small value added to the denominator to avoid division by 0 [7].
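The training-time forward pass of a batch normalization layer can be sketched as follows; inference-time running averages are omitted and the names are illustrative, not taken from the thesis implementation.

```python
import numpy as np

def batch_norm_forward(x, gamma, beta, eps=1e-3):
    """Batch normalization forward pass over a batch, for illustration.

    `x` has shape (batch, features); gamma and beta are the learned scale
    and shift per feature.
    """
    mu = x.mean(axis=0)                     # per-feature batch mean
    var = x.var(axis=0)                     # per-feature batch variance
    x_hat = (x - mu) / np.sqrt(var + eps)   # normalize
    return gamma * x_hat + beta             # learned scale and shift
```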

Levels of supervision

The ways of training a model can be divided into two main approaches: supervised learning and unsupervised learning. The difference between the two is the use of prior knowledge about the dataset.

In supervised learning, the ground truth is available to the algorithm. That is, the values that the model should output are known and used in the optimization. By knowing the desired output, one can construct a loss function based on a comparison to the model output. In unsupervised learning, the ground truth is unknown. The goal in this case is to let the model learn a representation of the data structure that minimizes the loss function. The best way to define a loss function depends on each individual problem.

Self-supervised learning is a form of unsupervised learning where the problem is formulated in a supervised manner. Although not dependent on explicit ground truth labels, a self-supervised method needs some knowledge about the data structure that is used for supervision during training [4].

2.3.5 Learning a proper data fitting

When training a neural network, it is important to learn a proper data fitting. We want the model to fit well to the training data, but we also want it to generalize well when exposed to new data. Figure 2.11 demonstrates three simple examples of data fitting.

In figure 2.11a, the model is underfitted to the data points and is likely to fit poorly also to new data points. In figure 2.11c, the model is overfitted to the training data, resulting in small errors on the training data but low generalizability to new data even if the data points follow the same distribution. Figure 2.11b shows a model that is properly fitted to the data and is likely to fit well to unseen data with similar distribution [4].

(a) Underfitting. (b) Proper fitting. (c) Overfitting.

Figure 2.11: Examples of models (blue line) fitted to the data points (red dots). The desired model is well fitted to the data and generalizes well to new samples.

Data augmentation

The performance and generalizability of the model depends on the size of the dataset. By adding more data to the dataset, more knowledge is fed to the model which enables an improved model accuracy. One way to obtain more data without collecting new instances is to perform data augmentation by adding transformations of the existing dataset. Examples of ways to transform image data are translation of the image, modification of the image brightness, mirroring of the image or addition of random noise. By randomly performing these transformations on some of the data, the model becomes more robust to variations of these kinds [4]. Figure 2.12 shows examples after adjusting the contrast and adjusting the brightness.

There are other sorts of data augmentation that are common for image processing, such as rotation or other deformations, which are not applicable in the stereo case. This is because a stereo reconstruction algorithm expects rectified image pairs, with their epipolar lines parallel to the pixel rows of the images [16].

(a) Original input image. (b) Input image with adjusted contrast. (c) Input image with adjusted brightness.

Figure 2.12: Examples of image operations that can be performed to augment the training dataset.


Regularization

Another method to improve the generalizability of the model is to apply regularization. There are many regularization techniques, all aiming to minimize the generalization error without compromising the training accuracy. One simple technique that is commonly used is L2 parameter regularization, which is performed by adding a penalty term to the loss that is to be minimized. The penalty term Ω is defined as

Ω = λ Σ_{i=1}^n θ_i²,

where {θ_i}_{i=1}^n are the layer parameters and λ is some suitable regularization constant.

The regularization term pushes the weights towards the origin. This helps simplify the model, which makes it less prone to overfitting [4].
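As a concrete example, the penalty term can be computed from a list of weight arrays. The value λ = 0.001 matches the one used later in the thesis; the function name and structure are assumed for illustration.

```python
import numpy as np

def l2_penalty(weights, lam=0.001):
    """L2 regularization term added to the training loss.

    `weights` is a list of parameter arrays; lam corresponds to the
    regularization constant λ.
    """
    return lam * sum(np.sum(w ** 2) for w in weights)
```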

3 Method

This chapter describes how the project was performed. The experiment aimed to answer the questions in section 1.2. To determine how the model behaved using different cameras, the model was trained using training images from one camera. Then, the performance was evaluated using evaluation images from all three cameras. This was repeated for all three cameras, which resulted in three trained models, each trained using training data from one of the cameras. The reconstructions were also compared to the internal reconstructions from the cameras.

The model uses an architecture that is explained in section 3.1. This is the network that performs the disparity estimation. The model is optimized using the known relationship between the input images – if the right image is transformed using the disparity estimation, it should (in most regions) end up looking like the left image. Section 3.2 describes how the training was performed, as well as details about the loss computation. The performance of the model was evaluated qualitatively and quantitatively as described in section 3.3.

3.1 Model architecture

The used model is an adaptation of the model described by Zhang et al. [20]. An overview of the reconstruction model is shown in figure 3.1.

In the same manner as many other reconstruction algorithms, described in section 2.2.1, the model workflow can be divided into 4 steps: matching cost computation, cost aggregation, disparity computation and disparity refinement. These steps are described below in sections 3.1.1, 3.1.2, 3.1.3 and 3.1.4. Section 3.2.3 defines the loss function that was used when training the network. Finally, section 3.1.5 describes an invalidation network used for detecting invalid pixels in the disparity map.

Figure 3.1: An overview of the disparity network, which takes the left and right images as input and outputs the estimated disparity. The flat and cubic layer representations correspond to filtering in 2D and 3D, respectively. The c operation means concatenation along the channel axis.

3.1.1 Matching cost computation

The matching cost is based on feature differences between the left image and the right image. The images are separately fed into identical networks, provided by Zhang et al. [20].

Each layer uses convolution kernels of size 3 × 3 and gives a 32 channel output. The stride and dilation rate are 1 unless otherwise stated. The network first applies an initial convolution followed by three residual blocks. Thereafter, the images are downsampled using three blocks of convolution with stride 2, batch normalization and leaky ReLU with slope 0.2. A final convolution is applied to yield the final 32 channel feature map. Given input images of size H × W, a feature map is of size H/8 × W/8 × 32. The network is visualized in figure 3.2.

Figure 3.2: The network that extracts features from the input image. The parameters at each layer are C: number of output channels, K: kernel size, S: stride, D: dilation rate, and a: the slope for x < 0 in leaky ReLU.

Once the feature maps of each image are obtained, we build a volume of matching costs. The value of V(u, v, d) is the Euclidean distance between the features at position (u, v) of the left feature map and the features at position (u − d, v) of the right feature map. This is calculated for each disparity d within a predefined disparity range, d ∈ [0, dmax], at the low resolution where the matching is performed. This was the approach used by Khamis et al. [11]. The cost volume was formed by shifting the right feature map d pixels to the right, and then computing the Euclidean distance between the left feature map and the shifted right feature map.

The maximum allowed disparity at the low resolution, dmax, was set to 18 for all datasets. This was done to make the models as similar as possible. The maximum disparity is inversely proportional to the minimum allowed distance from the camera to the object. Table 3.1 gives the corresponding minimum allowed distance for each of the cameras.

Camera           Minimum distance (cm)
RealSense D435   22
RealSense D415   36
Visionary-S      42

Table 3.1: The minimum allowed distance from an object to each camera. The minimum distance corresponds to the maximum disparity that can be detected by the network.
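A sketch of the cost volume construction described above, using NumPy arrays in place of TensorFlow tensors; the function name and the handling of out-of-image columns are assumptions, not the thesis code.

```python
import numpy as np

def feature_cost_volume(feat_left, feat_right, max_disp=18):
    """Build a cost volume from two low-resolution feature maps.

    `feat_left` and `feat_right` have shape (H, W, C). For each candidate
    disparity d, the right feature map is shifted d pixels to the right and
    the per-pixel Euclidean distance to the left feature map is stored in
    V(u, v, d). Columns with no valid correspondence get a large cost.
    """
    h, w, _ = feat_left.shape
    volume = np.zeros((h, w, max_disp + 1), dtype=np.float32)

    for d in range(max_disp + 1):
        shifted = np.zeros_like(feat_right)
        if d == 0:
            shifted[:] = feat_right
        else:
            shifted[:, d:, :] = feat_right[:, : w - d, :]
        diff = feat_left - shifted
        volume[:, :, d] = np.sqrt(np.sum(diff ** 2, axis=2))
        volume[:, :d, d] = 1e6   # columns with no valid correspondence
    return volume
```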

3.1.2 Cost aggregation

Following Khamis et al. [11], the costs are now aggregated by filtering of the cost volume along three dimensions: the two spatial dimensions and the disparity dimension. Four 3D convolutions with kernel size 3 × 3 × 3 are applied, each followed by batch normalization and leaky ReLU activation with slope 0.2. These layers use 32 output channels. A final convolution is performed that does not use batch normalization or activation, and that uses 1 output channel. We denote the filtered cost volume V̂. The network is visualized in figure 3.3.

Only the kernel sizes and number of layers were given by Khamis et al. [11]. The other hyperparameters were chosen because they turned out to be well suited for the problem.

Figure 3.3: The cost volume filtering. The parameters at each layer are C: number of output channels, K: kernel size, S: stride, D: dilation rate, and a: the slope for x < 0 in leaky ReLU.

3.1.3 Disparity computation

The chosen disparity d̂ at position (u, v) is now calculated, following Khamis et al. [11], by applying soft argmin along the disparity dimension of V̂:

d̂(u, v) = Σ_{d=1}^{18} d · exp(−V̂(u, v, d)) / Σ_{d'=1}^{18} exp(−V̂(u, v, d')),

which is chosen because of its differentiability. The output disparity at this stage is of size H/8 × W/8.
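The soft argmin can be written directly from the formula; the sketch below operates on a NumPy cost volume whose last axis is the disparity dimension and is meant purely as an illustration (the numerical stabilization is an added implementation detail, not from the thesis).

```python
import numpy as np

def soft_argmin(cost_volume):
    """Soft argmin over the disparity dimension of a cost volume.

    `cost_volume` has shape (H, W, D), where index d along the last axis
    corresponds to disparity d. A softmax over the negated costs gives a
    probability per disparity, and the expected disparity is returned,
    which keeps the operation differentiable.
    """
    # Softmax over -cost, stabilized by subtracting the per-pixel minimum cost.
    neg = -(cost_volume - cost_volume.min(axis=2, keepdims=True))
    weights = np.exp(neg)
    weights /= weights.sum(axis=2, keepdims=True)

    disparities = np.arange(cost_volume.shape[2], dtype=np.float32)
    return np.sum(weights * disparities, axis=2)   # expected disparity per pixel
```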

3.1.4 Disparity refinement

The low resolution disparity is upsampled to full resolution using bilinear interpolation. It is then fed into a refinement network, defined by Zhang et al. [20], which refines the disparity with help from the left original input image. Both the upsampled disparity and the input image are separately fed into networks with one convolution, batch normalization and leaky ReLU activation with slope 0.2, followed by 3 residual blocks with dilation rate 1, 2 and 4. All of the layers use kernel size 3 × 3 and 16 output channels.

Thereafter, the outputs from the sub-networks are concatenated along the channel axis and fed into 3 residual blocks with dilation rate 8, 1 and 1. These layers use kernel size 3 × 3 and 32 output channels. The residual blocks are followed by a final 3 × 3 convolution that outputs the 1 channel disparity residual. All layers use stride 1 and dilation rate 1 if nothing else is specified. Adding the residual to the upsampled disparity estimation yields the final full resolution disparity map. See figure 3.4.

Figure 3.4: The refinement network. The parameters at each layer are C: number of output channels, K: kernel size, S: stride, D: dilation rate, and a: the slope for x < 0 in leaky ReLU. The c operation means concatenation along the channel axis.

3.1.5 Invalidation network

In parallel with training the disparity network, an invalidation network is trained as proposed by Zhang et al. [20]. The aim of this network is to learn to predict a mask that invalidates regions in the disparity map estimated from the left view that are inconsistent with the disparity map estimated from the right view. This type of inconsistency typically arises in regions that are occluded from one of the views. The invalidation mask is typically found by estimating the disparity from both the left and the right view, as described in section 3.2.3. By training a network that estimates this invalidation mask, only the computation of the disparity from one viewpoint is necessary at runtime. The network is shown in figure 3.5.

Figure 3.5: The invalidation network. The parameters at each layer are C: number of output channels, K: kernel size, S: stride, D: dilation rate, and a: the slope for x < 0 in leaky ReLU. The c operation means concatenation along the channel axis.

The outputs from the feature extraction network are concatenated and fed into a series of five residual blocks, using kernel size 3 × 3 and 64 output channels, followed by a 3 × 3 convolution that outputs a 1 channel low resolution prediction of the invalidation mask. Furthermore, this invalidation mask is upsampled to full resolution and concatenated with the left image and the full resolution disparity. These concatenated images are fed into a 3 × 3 convolution with batch normalization and leaky ReLU activation with slope 0.2. This is followed by four residual blocks, using kernel size 3 × 3 and 32 output channels. A final 3 × 3 convolution outputs the 1 channel invalidation mask residual. Adding the residual to the upsampled invalidation estimation yields the final full resolution invalidation mask. Every layer uses stride 1 and dilation rate 1. All parameters are the same as defined by Zhang et al. [20].

3.2 Training

The model was implemented and trained in Python (version 3.6.7) using the TensorFlow library (version 1.12.0) [1]. Pure TensorFlow (that is, no higher-level API such as Keras) was used because of its flexibility and the possibility to control what happens in each step.

The training was performed on a Linux computer equipped with an Intel Xeon E5-2643 v4 processor with frequency 3.40 GHz and a GeForce GTX 1080 Ti graphics card with 11 GB memory. Using the loss aggregation described in section 3.2.3, which turned out to be the bottleneck of the optimization, a model required either ∼30 or ∼42 hours of training. Using only uniform smoothing, either ∼7 or ∼11 hours of training was required.

The model was trained during either 30,000 or 50,000 iterations, depending on what was needed for each model to reach convergence. This corresponds to 3 or 5 epochs using a dataset of 10,000 image pairs. See table 3.2 for the number of iterations used for each dataset. L2 regularization with λ = 0.001 was applied to the filter weights to avoid overfitting to the training data. The model was trained with the Adam optimizer, using β1 = 0.9, β2 = 0.999 and a cyclical learning rate as shown in figure 3.6. These methods were chosen because they have been shown to perform well on many other optimization tasks and because they performed well also in this case.

The regularization parameter λ was set to a small value because the model did not show any strong signs of being prone to overfitting, and a large λ could enforce an underfitted model. The optimization parameters, β1 and β2, were set to the default values chosen by TensorFlow and remained unchanged because they were considered to give good performance. The minimum and maximum values of the CLR for each training dataset are given in table 3.2. The learning rates were chosen for each dataset according to the method described in section 2.3.4.

Figure 3.6: Plot of the learning rate that was used in the training. The learning rate varies cyclically between the minimum and maximum learning rate. The number N is the total number of iterations.


Camera           Min        Max        N
RealSense D435   1 · 10^-5  1 · 10^-4  30,000
RealSense D415   3 · 10^-6  5 · 10^-4  30,000
Visionary-S      5 · 10^-6  5 · 10^-5  50,000

Table 3.2: The minimum and maximum values of the CLR. The number N is the total number of iterations.

3.2.1 Cameras

The cameras used for collecting data for the experiment were Intel RealSense D415, Intel RealSense D435 and SICK Visionary-S. Figure 3.7 shows example images from each of the three cameras. The images are captured at equal distances from a flat wall. The cameras differ in image resolution, illumination pattern, illumination intensity, field of view, baseline etc. Some of these differences are visible in the images below. The size of the images from the RealSense cameras was 1280 × 720 pixels, and the Visionary-S images were of size 640 × 512.

The RealSense cameras output depth images at ∼30 Hz. The internal reconstruction algorithms for the RealSense cameras use local binary descriptors combined with a semi-global matching scheme. An overview is given by Keselman et al. in [10]. The output depth map is computed relative to the left image sensor. The D415 camera is specified to work best within a range between 0.3 m and 10 m. The D435 camera is specified to work best within a range between 0.2 m and 10 m.

The Visionary-S produces depth images at ∼30 Hz, relative to the center of the baseline separating the left and the right image sensors. The algorithm for Visionary-S is not available, although it is known to not be deep learning based. The camera is specified to work best within a range between 0.5 m and 2.5 m.

(a) RealSense D435 (b) RealSense D415 (c) Visionary-S

Figure 3.7: Left IR images of a wall captured at equal distances with the three cameras that were used in the project.

3.2.2 Data

With each camera, 10,000 image pairs were collected for training, as well as additional validation and test images. The training images were all collected in an office environment. The images were captured so that, as well as possible, the distances to the closest objects were not smaller than the minimum distances given in table 3.1. All images were scaled to the interval [−1, 1] prior to being fed into the network.

During training time, the images were cropped to the size 1024 × 256 for the RealSense cameras and 512 × 512 for the Visionary-S. The cropping region was picked randomly for each sample. This was done to enable training on full resolution without exhausting the processing resources.

Data augmentation was applied by randomly adding noise to the images, randomly flipping the images horizontally and/or vertically, and/or randomly changing the brightness and/or contrast of both images. The operations were performed using dedicated functions provided by TensorFlow. In the cases of flipping the images, both the left and right images were flipped in the same way. In the case of horizontal flipping specifically, the left and right images also had to switch position to keep the geometrical relationship.

The augmentation operations were performed on each sample with a 5% probability, which in total meant that there was a ∼77% probability for each image pair to remain unchanged during an epoch. Note that, due to the random cropping performed for each sample, the exact same image pair was unlikely to appear more than once during the entire training process.

3.2.3 Loss computation

The loss was formulated as proposed by Zhang et al. [20]. Once the final disparity from the left view was estimated, an estimate of the left image was calculated by sampling from the right image according to

ÎL(u, v) = IR(u − dL(u, v), v),

where dL is the disparity estimated from the left viewpoint. The model used a fully differentiable bilinear sampler based on the spatial transformer network, STN [8]. The warped image ÎL and the true left image IL were used in the loss computation.
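The sampling step can be illustrated with a simple horizontal linear interpolation, since rectified images only shift along the rows. This is a NumPy stand-in for the differentiable STN-based sampler that the thesis actually uses; names and the clamping behaviour are assumptions.

```python
import numpy as np

def warp_right_to_left(right, disparity):
    """Reconstruct the left image by sampling the right image at u - d(u, v).

    Linear interpolation along the rows approximates the bilinear sampler;
    rectified images only shift horizontally, so no vertical interpolation
    is needed. Sample positions are clamped to the image border.
    """
    h, w = right.shape
    cols = np.arange(w, dtype=np.float32)[None, :] - disparity  # sample positions
    cols = np.clip(cols, 0.0, w - 1.0)

    lo = np.floor(cols).astype(int)
    hi = np.minimum(lo + 1, w - 1)
    frac = cols - lo

    rows = np.arange(h)[:, None]
    # Linear interpolation between the two nearest columns of the right image.
    return (1.0 - frac) * right[rows, lo] + frac * right[rows, hi]
```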

To avoid a loss that was biased towards bright areas, the images were normalized with local contrast normalization (LCN) according to

I_LCN = (I − µ) / (σ + ε).

Here, µ and σ are the mean value and standard deviation in a 9 × 9 patch around each pixel, and ε = 0.001 is added to the denominator to avoid division by zero.
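A possible LCN implementation using box filters is sketched below; SciPy is used here only for illustration and is not claimed to be part of the thesis implementation.

```python
import numpy as np
from scipy.ndimage import uniform_filter

def local_contrast_normalization(image, patch=9, eps=1e-3):
    """Local contrast normalization over a square patch around each pixel.

    The local mean and standard deviation are computed with box filters of
    the given patch size (9 x 9 above) and used to normalize every pixel.
    """
    image = image.astype(np.float64)
    mu = uniform_filter(image, size=patch)                 # local mean
    sq_mu = uniform_filter(image ** 2, size=patch)         # local mean of squares
    sigma = np.sqrt(np.maximum(sq_mu - mu ** 2, 0.0))      # local std deviation
    return (image - mu) / (sigma + eps)
```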

The loss weights were computed according to

C(u, v) = || σL(u, v) · (I_LCN,L(u, v) − Î_LCN,L(u, v)) ||_1,

where the scaling by σL(u, v) was made to remove the effect of an amplified residual in low texture regions due to a small σ.

To avoid having a loss function with many bad local minima, a smoother loss function was desired. Zhang et al. [20] proposed to aggregate the weights over a 2k × 2k support window, yielding the final loss weights

Ĉ(u, v) = [ Σ_{x=u−k}^{u+k−1} Σ_{y=v−k}^{v+k−1} w(x, y) C(x, y) ] / [ Σ_{x=u−k}^{u+k−1} Σ_{y=v−k}^{v+k−1} w(x, y) ],

where

w(x, y) = exp(−|I(u, v) − I(x, y)| / σw),

with σw = 2 and k = 16. This, however, was not realizable using the processing resources available. An approach using the above aggregation with k = 8 was tried, but it did not lead to any major improvements compared to only using a uniform smoothing. The uniform smoothing used for the final model was defined as

Ĉ(u, v) = [ Σ_{x=u−k}^{u+k−1} Σ_{y=v−k}^{v+k−1} C(x, y) ] / (2k)²,

with k = 16. The final loss was defined as

L = (1/N) Σ_{u,v} Ĉ(u, v),

where N is the number of pixels in the image. Unlike Zhang et al. [20], who computed this loss for both the low resolution disparity and for the final refined one, the loss for this model was only computed for the full resolution disparity because it gave better results.
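Putting the pieces together, the uniform-smoothing variant of the loss can be sketched as follows: per-pixel weighted residuals, box averaging over the 2k × 2k window, then the mean. Function and argument names are assumptions, and the box filter again comes from SciPy purely for illustration.

```python
import numpy as np
from scipy.ndimage import uniform_filter

def reconstruction_loss(left_lcn, left_recon_lcn, sigma_left, window=32):
    """Photometric loss with uniformly smoothed per-pixel weights.

    The per-pixel weighted residual C is averaged over a window of size 2k
    (32 when k = 16), and the final loss is the mean of the smoothed weights.
    """
    c = np.abs(sigma_left * (left_lcn - left_recon_lcn))   # per-pixel cost C
    c_smooth = uniform_filter(c, size=window)              # box average over 2k x 2k
    return float(np.mean(c_smooth))
```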

Invalidation mask during training

To avoid fitting the model to erroneous values, due to e.g. matching of occluded regions, these pixels' influences on the loss needed to be suppressed. By computing both the disparity dL from the left viewpoint and the disparity dR from the right viewpoint, a mask was defined for each pixel as

m(u, v) = |dL(u, v) − d̂R(u, v)| < θ.

Here, θ is the maximum allowed pixel error and d̂R is the disparity map from the right viewpoint, resampled to fit the disparity map from the left viewpoint.

This mask was applied to the loss, allowing only the pixels with m(u, v) = 1 to affect the loss. However, a trivial way to minimize the loss would now be to learn a model that makes sure that all pixels are invalidated. To avoid this, an additional classification task was formulated where the true label for each pixel was set to 1. The additional loss to minimize was the cross-entropy loss for this classification task, which was added to the total loss. That is,

L_mask = −(1/N) Σ_{u,v} y(u, v) · log(ŷ(u, v)),

where y(u, v) is the true label at pixel (u, v) and ŷ(u, v) is the predicted label. Here, N is the number of pixels in the image. The invalidation mask was computed both for the low resolution disparity and for the full resolution disparity. These masks were used as ground truth values when training the invalidation network described in section 3.1.5.
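The left-right consistency check behind the mask can be written directly from its definition; the resampling of the right-view disparity into the left view is assumed to have been done already, and the function name is illustrative.

```python
import numpy as np

def consistency_mask(disp_left, disp_right_resampled, threshold):
    """Left-right consistency check used to build the invalidation mask.

    Pixels whose left and right disparities disagree by more than
    `threshold` pixels are marked invalid.
    """
    valid = np.abs(disp_left - disp_right_resampled) < threshold
    return valid.astype(np.float32)   # 1 = valid, 0 = invalidated
```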

The formulation of the invalidation mask was initially the same as defined by Zhang et al. [20]. Below is a description of adaptations that were made for this particular problem.

A small θ for the full resolution invalidation mask was shown to invalidate many regions with only one pixel or a very small number of pixels, which appeared to be an effect of the high-frequency laser pattern. Since these pixels did not correspond to occluded regions, and occluded regions normally are larger than just a few pixels, these regions were removed by applying a combination of morphological operations. By applying dilation (expanding the regions with ones) followed by erosion (shrinking the regions of ones), most of the unwanted invalidation regions disappeared. See figure 3.8 for an example.

(a) Original mask. (b) Mask after dilation. (c) Mask after dilation plus erosion.

Figure 3.8: Images of the invalidation mask before and after applying the morphological operations. Note how most of the dot artifacts vanish when these operations are applied.
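The dilation-followed-by-erosion cleanup corresponds to a morphological closing of the validity mask. A sketch using SciPy's binary morphology is given below; the structuring-element size is an assumption, not a value taken from the thesis.

```python
import numpy as np
from scipy.ndimage import binary_dilation, binary_erosion

def clean_invalidation_mask(mask, size=3):
    """Remove tiny invalidated regions (isolated zeros) from a validity mask.

    Dilation first expands the regions of ones, swallowing invalid regions
    of only a few pixels; erosion then shrinks the ones back so that larger
    invalid regions (e.g. occlusions) keep roughly their original extent.
    """
    structure = np.ones((size, size), dtype=bool)
    closed = binary_dilation(mask.astype(bool), structure=structure)
    closed = binary_erosion(closed, structure=structure)
    return closed.astype(np.float32)
```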

To allow the network to quickly find a good solution to start with, the invalidation mask was not added until after the first 10,000 iterations. Thereafter, the threshold θ was set to 0.25 for the low resolution invalidation mask and set to 3 in the full resolution. This was considered to be the best option since a smaller θ allowed too many erroneous values in non-occluded regions that could not be removed by the morphological operations for all datasets.

Invalidation network loss

The learning of the invalidation network was formulated as a classification task using the invalidation masks described in section 3.2.3 – the mask computed at the low resolution and the mask computed at the full resolution – as the true labels. Since most of the pixels – around 90% – should be valid, the output from the invalidation network needed to be reweighted to avoid the trivial solution of validating all pixels. Following Zhang et al. [20], the valid pixels were assigned the value y+ = 1 and the invalid pixels were assigned the value y− = −10.

The loss was defined as

L_inv = (1/N) Σ_{u,v} |y(u, v) − ŷ(u, v)|,

where y(u, v) is the true label and ŷ(u, v) is the reweighted network output label in pixel (u, v). N is the number of pixels in the image. This loss was calculated both for the low resolution invalidation mask as well as for the full resolution invalidation mask. The invalidation network was disabled during the first 10,000 iterations of training. This was because the invalidation mask was not used until after 10,000 iterations.

3.3 Evaluation

Once a model had been trained, a disparity estimate was computed in ∼0.39 seconds for the RealSense cameras (resolution 1280 × 720 pixels) and in ∼0.14 seconds for the Visionary-S (resolution 640 × 512 pixels). Computing both the disparity and the invalidation mask estimate took ∼0.51 seconds for the RealSense cameras and ∼0.18 seconds for the Visionary-S. The performance of each model was evaluated qualitatively and quantitatively, as described below.

3.3.1 Qualitative evaluation

For the qualitative evaluations, images of some representative scenes were collected with all three cameras. These scenes include thin structures, sharp edges, flat surfaces, non-flat surfaces and other interesting regions that are useful for demonstrating the model's performance. Figures 3.9, 3.10 and 3.11 show three example scenes.

(a) RealSense D435 (b) RealSense D415 (c) Visionary-S (d) RealSense D435 (LCN) (e) RealSense D415 (LCN) (f) Visionary-S (LCN)

Figure 3.9: Left IR images of the first example scene. All images are captured at equal distances from the objects. The lower row shows the local contrast normalized images.

(a) RealSense D435 (b) RealSense D415 (c) Visionary-S (d) RealSense D435 (LCN) (e) RealSense D415 (LCN) (f) Visionary-S (LCN)

Figure 3.10: Left IR images of the second example scene. All images are captured at equal distances from the objects. The lower row shows the local contrast normalized images.

(a) RealSense D435 (b) RealSense D415 (c) Visionary-S (d) RealSense D435 (LCN) (e) RealSense D415 (LCN) (f) Visionary-S (LCN)

Figure 3.11: Left IR images of the third example scene. All images are captured at equal distances from the objects. The lower row shows the local contrast normalized images.

The same reconstruction model was applied to all images, and the final reconstructions were compared visually. The reconstructions were also compared with the internally reconstructed depth outputs from each camera. The results are presented in section 4.1.


3.3.2 Quantitative evaluation

Due to the difficulty of obtaining ground truth data for depth images, the quantitative evaluation was restricted to a simple scene – a flat wall. Images were captured at distances ranging between 50 cm and 350 cm – one image pair every 10 cm – with the cameras fronto-parallel to the wall (facing the wall at a ∼0° angle) as well as facing the wall at a ∼45° angle. The images in figure 3.7 are from the fronto-parallel dataset. Example images from the dataset at a ∼45° angle are shown in figure 3.12.

The same model was applied to all images. In the cases where regions outside of the wall were visible in the images, as in figure 3.12a, the disparity map was cropped to contain only values corresponding to the wall.

(a) RealSense D435 (b) RealSense D415 (c) Visionary-S (d) RealSense D435 (LCN) (e) RealSense D415 (LCN) (f) Visionary-S (LCN)

Figure 3.12: Left IR images of a wall at a ∼45° angle. All images are captured at equal distance from the wall. The lower row shows the local contrast normalized images.

The obtained disparities were used to compute the 3D point cloud corresponding to each disparity map by triangulation. A ground truth plane was estimated for the point cloud from each reconstruction using robust plane fitting. The error for each 3D point was defined as the Euclidean distance between the point and the ground truth plane. The mean and standard deviation of the error were calculated for each image. The results are shown in section 4.2.
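A minimal sketch of this evaluation pipeline is shown below. The triangulation assumes a rectified camera with focal length f (in pixels), baseline (in meters) and principal point (cx, cy); the RANSAC loop is one possible choice of robust plane fitting, and its parameters are assumptions rather than the values used in the thesis.

import numpy as np

def disparity_to_points(disp, f, baseline, cx, cy):
    # Triangulate every pixel with a positive disparity into camera coordinates.
    v, u = np.nonzero(disp > 0)
    z = f * baseline / disp[v, u]
    x = (u - cx) * z / f
    y = (v - cy) * z / f
    return np.stack([x, y, z], axis=1)

def fit_plane_ransac(points, n_iters=200, threshold=0.01, seed=0):
    # Robustly fit a plane n.p + d = 0 by repeatedly sampling three points.
    rng = np.random.default_rng(seed)
    best_inliers, best_plane = 0, None
    for _ in range(n_iters):
        p0, p1, p2 = points[rng.choice(len(points), 3, replace=False)]
        n = np.cross(p1 - p0, p2 - p0)
        norm = np.linalg.norm(n)
        if norm < 1e-9:
            continue
        n = n / norm
        d = -n.dot(p0)
        inliers = np.count_nonzero(np.abs(points @ n + d) < threshold)
        if inliers > best_inliers:
            best_inliers, best_plane = inliers, (n, d)
    return best_plane

def plane_error_stats(points, plane):
    # Euclidean point-to-plane distances, summarized by mean and standard deviation.
    n, d = plane
    dist = np.abs(points @ n + d)
    return dist.mean(), dist.std()

For each image, the statistics could then be obtained as mean_err, std_err = plane_error_stats(pts, fit_plane_ransac(pts)), where pts is the triangulated point cloud.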


4 Results

This chapter presents the results of the evaluations described in section 3.3.

4.1 Results from qualitative evaluation

The results of the first example scene are shown in figures 4.2, 4.3 and 4.4. The outputs from the cameras are shown in figure 4.5. The invalid regions are black in all figures. These regions are determined by the mask learned by the invalidation network. All depth images from the scene are scaled equally, according to the scale given in figure 4.1.

Figure 4.1: The depth scale used for the first example scene, given in meters.

(a) D435 (b) D415 (c) Visionary-S

Figure 4.2: Results of the model trained using the D435 training data.

(a) D435 (b) D415 (c) Visionary-S

Figure 4.3: Results of the model trained using the D415 training data.

(a) D435 (b) D415 (c) Visionary-S

Figure 4.4: Results of the model trained using the Visionary-S training data.

(a) D435 (b) D415 (c) Visionary-S

Figure 4.5: Camera outputs.

The results of the second example scene are shown in figures 4.7, 4.8 and 4.9. The outputs from the cameras are shown in figure 4.10. All depths from the scene are scaled equally, according to the scale given in figure 4.6.

Figure 4.6: The depth scale used for the second example scene, given in meters.

(a) D435 (b) D415 (c) Visionary-S

Figure 4.7: Results of the model trained using the D435 training data.

(a) D435 (b) D415 (c) Visionary-S

Figure 4.8: Results of the model trained using the D415 training data.

(a) D435 (b) D415 (c) Visionary-S

Figure 4.9: Results of the model trained using the Visionary-S training data.

(a) D435 (b) D415 (c) Visionary-S

Figure 4.10: Camera outputs.

The results of the third example scene are shown in figures 4.12, 4.13 and 4.14. The outputs from the cameras are shown in figure 4.15. All depths from the scene are scaled equally, according to the scale given in figure 4.11.

Figure 4.11: The depth scale used for the third example scene, given in meters.

(a) D435 (b) D415 (c) Visionary-S

Figure 4.12: Results of the model trained using the D435 training data.

(a) D435 (b) D415 (c) Visionary-S

Figure 4.13: Results of the model trained using the D415 training data.

(a) D435 (b) D415 (c) Visionary-S

Figure 4.14: Results of the model trained using the Visionary-S training data.

(a) D435 (b) D415 (c) Visionary-S

Figure 4.15: Camera outputs.

4.2 Results from quantitative evaluation

This section presents the results from the quantitative evaluation, using the images of the flat wall. Figure 4.16 shows examples of the point clouds that were computed from the reconstructions when using the model trained on the D435 training dataset. The input images are captured at a 1 m distance and a ∼0° angle. These point clouds were used to fit the planes that were used for computing the errors at distance 1 m.

(a) D435 (b) D415 (c) Visionary-S

Figure 4.16: Example point clouds of the flat wall. These points are triangulated from the images taken at distance 1 m and angle 0, using the model trained on images from camera D435.

Figures 4.17, 4.18 and 4.19 show plots of the calculated means and standard deviations for each of the trained models, using the robustly estimated planes and the depth estimations. At larger distances, the standard deviation of the error for the model trained on the D415 dataset became very unstable at a ∼45° angle. The vertical axes are therefore cropped in these cases to make it possible to see the other results properly.

Figure 4.17: Plots of the quantitative results from the model trained on data from D435. The panels show the mean error (m) and the standard deviation of the error (m) against distance (m), at a ∼0° and a ∼45° angle, with curves for the D415, D435 and Visionary-S evaluation data.

Figure 4.18: Plots of the quantitative results from the model trained on data from D415, with the same panel layout as figure 4.17. At distances larger than 2 m, the standard deviation of the error when using the D435 evaluation data varies on the interval [1, 25] at a ∼45° angle.

Figure 4.19: Plots of the quantitative results from the model trained on data from Visionary-S, with the same panel layout as figure 4.17.

Figure 4.20 shows the plots of the calculated means and standard deviations from the ground truth plane, using the outputs from the cameras.

Figure 4.20: Plots of the quantitative results from the cameras' outputs. The panels show the mean error (m) and the standard deviation of the error (m) against distance (m), at a ∼0° and a ∼45° angle.

To be able to compare the performances of the three trained models for each camera, figures 4.21, 4.22 and 4.23 present the results of the models and the camera outputs sorted by the camera that was used for the evaluation.

References
