
DEGREE PROJECT IN INFORMATION AND COMMUNICATION TECHNOLOGY,
SECOND CYCLE, 60 CREDITS
STOCKHOLM, SWEDEN 2021

Mobile-based 3D modeling: An in-depth evaluation for the application to maintenance and supervision

KTH Thesis Report

MARTIN DE PELLEGRINI

KTH ROYAL INSTITUTE OF TECHNOLOGY

SCHOOL OF ELECTRICAL ENGINEERING AND COMPUTER SCIENCE


Authors

Martin De Pellegrini - martindp@kth.se
Information and Communication Technology
KTH Royal Institute of Technology

Place for Project

Trento, Italy
Arcoda s.r.l.

Examiner

Markus Flierl - mflierl@kth.se
Stockholm, Sweden
KTH Royal Institute of Technology

Supervisors

Hanwei Wu - hanwei@kth.se
Stockholm, Sweden
KTH Royal Institute of Technology

Nicola Conci - nicola.conci@unitn.it
Trento, Italy
University of Trento

Lorenzo Orlandi - lorenzo.orlandi@arcoda.it
Trento, Italy
Arcoda s.r.l.


Abstract

Indoor environment modeling has become a relevant topic in several application fields, including Augmented, Virtual and Mixed Reality. Furthermore, with the Digital Transformation, many industries have moved toward this technology, trying to generate detailed models of an environment that allow viewers to navigate through it, or mapping surfaces to insert virtual elements into a real scene. This thesis project has therefore been conducted with the purpose of reviewing well-established deterministic methods for 3D scene reconstruction, researching the state of the art, such as machine learning-based approaches, and investigating a possible implementation on mobile devices. Initially, we focused on well-established methods such as Structure from Motion (SfM), which uses photogrammetry to estimate camera poses and depth using only RGB images. Subsequently, the research centered on the most innovative methods that make use of machine learning to predict depth maps and camera poses from a video stream. Most of the methods reviewed are completely unsupervised and are based on a combination of two sub-networks: the disparity network (DispNet) for depth estimation and the pose network (PoseNet) for camera pose estimation. Although the results in outdoor applications show high-quality depth maps and reliable odometry, there are still some limitations to the deployment of this technology in indoor environments. Overall, the results are promising.

Keywords

Computer Vision, 3D Reconstruction, Deep Learning, Indoor, Digital Twin, Point Cloud.


Abstract

Indoor environment modeling has become a relevant topic in several application areas, including Augmented, Virtual and Mixed Reality. Furthermore, with the digital transformation, many industries have moved toward this technology, trying to generate detailed models of an environment that allow viewers to navigate through it or map surfaces in order to insert virtual elements into a real scene. This thesis project has therefore been carried out with the aim of reviewing well-established deterministic methods for 3D scene reconstruction and investigating the state of the art, such as machine learning-based methods, and a possible implementation on mobile devices. Initially, we focused on well-established methods such as Structure from Motion (SfM), which uses photogrammetry to estimate camera poses and depth with only RGB images. Finally, the research focused on the most innovative methods that use machine learning to predict depth maps and camera poses from a video stream. Most of the reviewed methods are completely unsupervised and are based on a combination of two sub-networks, the disparity network (DispNet) for depth estimation and the pose network (PoseNet) for camera pose estimation. Although the results in outdoor use show high-quality depth maps and reliable odometry, there are still some limitations to the use of this technology in indoor environments; nevertheless, the results are promising.

Keywords

Computer Vision, 3D Reconstruction, Deep Learning, Indoor, Digital Twin, Point Cloud.


Acknowledgements

I would like to thank Arcoda s.r.l. for allowing me to carry out the Master Thesis project and for helping me along the way. In particular, I would like to thank Lorenzo Orlandi for his support during both the research stage and the writing of the final report. It has been hard to do this work during the ongoing global pandemic, but I am very happy that the internship at Arcoda has given me the possibility to grow both technically and personally.

I would also like to thank my examiner from KTH, Markus Flierl, and my supervisor from the University of Trento, Nicola Conci, who supported me during my work.

I would like to finish by thanking my family and my dearest friends Nicola, Luca, Federico, Davide and Skender. They made the stressful coronavirus time much easier with the moral boost they gave me. I am grateful for all the friends I found and all the places I visited thanks to the EIT Digital Master.


Acronyms

SfM        Structure from Motion
RGB        Red, Green, Blue
ToF        Time-of-Flight
CNN        Convolutional Neural Network
SC-SfM     Scale-Consistent Structure from Motion
RANSAC     RANdom SAmple Consensus
CPU        Central Processing Unit
GPU        Graphics Processing Unit
DispNet    Disparity estimation Network
PoseNet    Pose estimation Network
GIS        Geographic Information System
LiDAR      Light Detection And Ranging
SIFT       Scale Invariant Feature Transform
PnP        Perspective-n-Point
BA         Bundle Adjustment
VR         Virtual Reality
AR         Augmented Reality
XR         Extended Reality
DOF        Degree Of Freedom
SSIM       Structural Similarity Index
GC         Geometry Consistency
DispResNet Disparity estimation Network with Residual Layers
RGB-D      Red, Green, Blue, Depth
ICP        Iterative Closest Point
FPS        Frames Per Second
IDE        Integrated Development Environment
GUI        Graphical User Interface
RAM        Random Access Memory
API        Application Programming Interface
GT         Ground Truth
MVS        Multi-view Stereo matching


Contents

1 Introduction
  1.1 Problem
  1.2 Purpose
  1.3 Research Question
  1.4 Approach
  1.5 Thesis Outline

2 Theoretical Background
  2.1 Image Formation
    2.1.1 Camera Intrinsics
    2.1.2 Camera Matrix
  2.2 Homography
  2.3 Epipolar Geometry
  2.4 3D Reconstruction Algorithms
    2.4.1 Linear Perspective
    2.4.2 Atmosphere Scattering
    2.4.3 Binocular Disparity
    2.4.4 Motion
    2.4.5 Structure from Motion
  2.5 Deep Learning-based Methods
    2.5.1 DispNet

3 Methodologies and Methods
  3.1 Image-based Reconstruction
  3.2 Novel Approach - Deep Learning
  3.3 SC-SfM Learner
    3.3.1 Loss Functions
    3.3.2 Network Architecture
    3.3.3 Dataset Rectification
  3.4 Dataset
    3.4.1 Custom Dataset
  3.5 Point Cloud
    3.5.1 Point Cloud Alignment
  3.6 Hardware & Software

4 Experiments and Results
  4.1 3D Reconstruction Visual SfM
  4.2 Network Evaluation
    4.2.1 Depth map prediction
    4.2.2 PoseNet
    4.2.3 Custom dataset training
  4.3 Mobile implementation
  4.4 Point Cloud Generation

5 Conclusions
  5.1 Future Work
  5.2 Final Words

References


List of Figures

1.0.1 Digital twin for Utilities
1.2.1 Augmented reality for maintenance
2.1.1 Pinhole camera model
2.1.2 Simplified model describing the intrinsic camera parameters
2.2.1 Planar homography induced by points all lying on a common plane
2.3.1 Illustration of the geometry involved in epipolar geometry
2.4.1 Gradient plane assignment with vanishing points and vanishing lines
2.4.2 Atmosphere scattering phenomenon on a landscape
2.4.3 Binocular disparity
2.4.4 Motion parallax illustration
2.4.5 Structure from Motion used in architecture 3D reconstruction
2.4.6 Structure from Motion pipeline scheme
2.4.7 Basic principle of the Structure from Motion approach
2.5.1 Depth CNN and Pose CNN for structure learning in a street scenario
2.5.2 Depth Network and Pose Network architecture (encoder-decoder layout)
3.3.1 Overall SC-SfM Learner network architecture
4.1.1 Full dense 3D reconstruction using Visual SfM
4.2.1 Depth map prediction with office07 (a) and freiburg_360 (b) sequences
4.2.2 Qualitative illustration of the results achieved on an unseen video sequence
4.2.3 Representation of the camera trajectory in sequence freiburg_360
4.2.4 Results achieved from the network trained on the custom dataset
4.2.5 Network failure due to its specialization in inferring depth for near objects
4.3.1 Result of camera registration
4.4.1 Full 3D reconstruction using freiburg_360 depth and pose ground truth
4.4.2 Point cloud generated from office07 predicted depth maps


List of Tables

3.4.1 Details of the three datasets used in the testing phase
4.2.1 Average depth prediction accuracy on sequences freiburg_360 and office07
4.3.1 Samsung Galaxy Note 10+ intrinsic parameters of the RGB and ToF sensors accessible through the camera API in Android Studio
4.3.2 Samsung Galaxy Note 10+ extrinsic parameters of the RGB and ToF sensors accessible through the camera API in Android Studio


Chapter 1 Introduction

From the beginning, Computer Vision scientists have been driven by the aim of replicating the human vision system, a system that is incredibly powerful when it comes to understanding and interacting with the 3D world. People are naturally capable of processing, extremely quickly, every useful piece of information in a scene in order to recognize and locate obstacles and infer the 3D structure of the environment, allowing navigation through it. The ability to perceive 3D space became one of the fundamental problems in Computer Vision; however, with the technological advancement on both the hardware and algorithmic sides, tasks like generating 3D models from 2D images have become simpler and more accurate. In recent years, the attention towards Computer Vision has grown quickly, as highlighted by the increase in the number of publications related to this topic. Several methods have been developed to address this problem; many of them are based on traditional image processing, while others follow the trend of using Machine Learning to train models and achieve comparable results.

Nowadays, this technology is employed in a considerable number of industries, such as robotics, landslide mapping, gaming, mixed reality, archaeology, medicine and many others. For instance, robotics makes use of 3D reconstruction for autonomous navigation, taking images from a camera mounted on a robot in order to generate a 3D model of the environment and perform real-time processing to avoid obstacles. In the gaming industry, this technology finds application in the generation of virtual maps of real scenarios, while landslide scanning uses images acquired by satellites and drones to create inventory maps that are accurate enough to distinguish even small objects. In medicine, the advantage of using 3D reconstruction is the smaller amount of radiation that patients are exposed to compared to traditional body scanning techniques like MRI and CT scanning.

Figure 1.0.1: Digital twin for Utilities.

The objective of this thesis is to research possible solutions for 3D scene reconstruction on mobile devices, with the main focus on exploiting RGB images and video streams to retrieve the three-dimensional structure of indoor environments and generate a virtual replica to be employed in Augmented Reality (AR) and Virtual Reality (VR) applications. The research has been conducted at the premises of ARCODA [3], a company located in Trento (Italy) that provides innovative GIS solutions for the private and public sector. The company is currently working on projects with one of the main utility companies in Italy for the monitoring of electricity cabins, combining 3D Vision and Augmented Reality; this thesis is part of that project, with the purpose of creating a digital twin of the mentioned cabins.

1.1 Problem

3D reconstruction in Computer Vision refers to the process of extracting the appearance of real objects or scenarios. This process has always been difficult to perform, especially when it makes use of images. Indeed, images are simply a visual representation of the radiance reflected by the environment in the field of view of the camera. In image acquisition, 3D points in the real world are projected onto a 2D surface to form an image, but in this process the depth information is lost. In order to reconstruct the scene depicted in a photo, one has to reverse the process of image formation; however, without the depth, back-projecting a point from the image to the real world is an ill-posed problem. Despite that, there are several methods to reconstruct objects or environments, and they can be divided into two main groups: active and passive methods. As suggested by the name, active methods actively interact, mechanically or radiometrically, with the object or scene to be reconstructed. Nowadays, these methods commonly make use of radiometric techniques that use moving light sources, Time-of-Flight (ToF) lasers and microwaves to emit radiance towards the targets and measure the reflected part. On the other hand, passive methods are based on sensing the luminance reflected or emitted by the objects and inferring their 3D structure through traditional image processing. Since the purpose of the thesis is to research the mobile applicability of 3D reconstruction techniques, we will focus on passive methods, because the active ones require additional expensive hardware, although sensors like ToF or LiDAR can be found embedded in high-end smartphones.

1.2 Purpose

Recent years have seen a significant increase in the number of studies related to 3D reconstruction applied to indoor environments. This growth in popularity has brought many industrial fields to approach this new technology. In particular, in the utility sector, 3D modelling and extended reality (XR) are combined in order to simplify the maintenance of electric lines and cabins, and to speed up interventions in the field while ensuring a high level of operational safety. For instance, generating a virtual model of electric cabins could have an enormous impact when on-the-field workers need support from experts, who can easily visualize the digital replica through a smartphone or by wearing a headset for a first-person experience. The purpose of the thesis is then to evaluate the current state of the art in 3D reconstruction of indoor environments and discuss possible implementations to be applied in the context of the electric cabins.


Figure 1.2.1: Augmented reality for maintenance.

1.3 Research Question

The main objective of the thesis is to investigate solutions suitable for generating a virtual model of challenging environments, like electric cabins. In particular, the solution has to allow in-place workers, such as technicians and maintainers, to generate the model with limited equipment, typically a smartphone provided by the company. The selected solution should be able to generate disparity maps and infer ego-motion. These two aspects are needed in order to generate an accurate 3D model of the scene that can later be used for virtual inspections and AR purposes.

1.4 Approach

The research started from an analysis of the well-established methods used for 3D scene reconstruction; in this initial stage the framework VisualSfM [42] was tested to provide us with a benchmark. The next step was a literature review of deep learning-based approaches to 3D reconstruction: since the operations involved in the VisualSfM pipeline are computationally heavy and time consuming, it was chosen to research state-of-the-art deep learning implementations which overcome the weaknesses of the more traditional methods.

The framework selection was then based on the requirements imposed by the equipment limitations; in particular, the research focused on the possibility of training the selected network in an unsupervised manner to predict the desired depth maps. Moreover, the network should perform well on indoor data. Indoor scenarios are known to be challenging environments due to weakly textured regions such as walls and floors, limited space that causes a lack of view coverage, complex object interactions and occlusions. In addition, the solution should be able to predict accurate camera poses, to be applied in the final step to obtain a single 3D model from the several depth maps generated.

Once the framework had been selected and tested, a more specific dataset was acquired for training and fine-tuning purposes. Even though the network is required to work without supervision, the created dataset has to contain at least depth ground truth for training validation. To do so, and given that the utility company provides its employees with an Android smartphone with an embedded ToF sensor (Samsung Galaxy Note 10+), an application has been developed to acquire both RGB images and depth ground truth.

The final step consists of the generation of a sparse point cloud of the indoor environment, which involves the use of the predicted depth maps and camera poses.

1.5 Thesis Outline

Chapter 2 presents a theoretical background that covers the literature review and provides the base knowledge to help the reader better understand the content of the thesis. Chapter 3 is about the methods used in the testing phase of both the traditional and novel approaches; it also presents the datasets used and the methodologies behind the data acquisition process. In Chapter 4, the experimental results are illustrated and discussed. In conclusion, Chapter 5 provides a final discussion and future work.


Chapter 2

Theoretical Background

3D reconstruction is a well-known problem in Computer Vision; it refers to the process of generating a three-dimensional model of objects or scenes from images.

To perform such an operation, a good understanding of the image formation process is needed, since images are the projection of the 3D world onto the 2D image plane; 3D reconstruction is thus the inverse procedure. However, when an image is captured, the depth information is lost. The principle on which all reconstruction algorithms are based is retrieving the missing depth information in order to re-create a three-dimensional coordinate system onto which to project the image points. These algorithms approach the depth restoration issue in a passive manner. The approach is known as passive because, unlike active methods, there is no interaction with the object or the environment to be rebuilt: the images used in the reconstruction process are simply a measure of the luminance emitted or reflected by the target. For this reason, these methods are also known as image-based reconstruction. Active methods, on the other hand, actively interfere with the scene by illuminating it with specific signals and measuring the reflection.

The reconstruction approach requires solving a certain number of problems. First, it requires knowledge of the camera calibration parameters. Focal length, optical center (or principal point) and skew are known as intrinsic parameters and provide information on how real-world points are projected onto the image plane. When they are unknown, a pinhole camera model can be used as an approximation. Then, extrinsic camera parameters, such as the camera's rotation and translation, are used to link the 3D point coordinates to the camera position. When the reconstruction task involves multiple images of the same target (object or scene), the next step is to find and associate points that appear in different pictures. This process is known as Feature Matching. Once the calibration parameters are known and correspondences have been found, it is then possible to project the 2D image points into a virtual 3D world space. Depending on the number of key points found in the matching step, the resulting reconstructed model is sparse or dense. Typically, a dense reconstruction is preferred as the final result for the application purpose; nevertheless, if the number of projected points is sufficiently high, a sparse representation is enough for visualization purposes and structure understanding.

2.1 Image Formation

Image formation [34] is the process by which a 3D scene in the real world is projected onto the image plane. When an object is illuminated, it absorbs and reflects the light radiation depending on the physical properties of the material of which it is composed. The reflected light passes through the camera lens and is captured by the light sensor.

The mapping of the real scene onto the image plane can be modeled either with the orthographic model, which is more suitable for telescopic lenses, or with the perspective model. The latter is more commonly used since it models the behaviour of real cameras more accurately. In perspective projection, points are projected onto the image plane by dividing their coordinates (x, y, z) by the third component, obtaining the image point location x in inhomogeneous coordinates (2.1) or using homogeneous coordinates (2.2).

x = \begin{bmatrix} x/z \\ y/z \\ 1 \end{bmatrix}    (2.1)

\tilde{x} = \begin{bmatrix} 1 & 0 & 0 & 0 \\ 0 & 1 & 0 & 0 \\ 0 & 0 & 1 & 0 \end{bmatrix} \tilde{p}    (2.2)

where \tilde{p} is the homogeneous coordinate of the 3D point in the real world, expressed as (x, y, z, w). In this process, both for inhomogeneous and homogeneous perspective projections, the last component is dropped; thus, after the projection, the depth information is lost and it is not possible to recover it. Figure 2.1.1 shows an illustration of the geometry involved.

Figure 2.1.1: Pinhole camera model. The figure shows the geometry involved in the point projection onto the image plane. Note: for simplicity, the image plane is placed in front of the camera center even if it should be behind.

2.1.1 Camera Intrinsics

Typically, in projecting a real 3D point onto the 2D image space, two intermediate steps are performed: a world-to-camera projection and then a camera-to-pixel-sensor-array mapping. The first step consists of the projection using the ideal pinhole camera model, but its result is not yet a pixel coordinate: the resulting coordinates need to be transformed according to the pixel spacing of the sensor array and its position relative to the origin, which is the camera center. The mapping between the 3D camera-centered point p_c and the 2D pixel coordinates, which are discrete and positive, is done by the 3x3 camera intrinsic matrix K, also known as the calibration matrix. K is often represented in the upper-triangular form (2.3) and contains the intrinsic camera parameters, which are: the focal lengths f_x and f_y for the sensor dimensions x and y, the optical center (c_x, c_y), and the skew, which is non-zero when the sensor is not mounted perpendicular to the optical axis.

K = \begin{bmatrix} f_x & \mathrm{skew} & c_x \\ 0 & f_y & c_y \\ 0 & 0 & 1 \end{bmatrix}    (2.3)

Sometimes, the focal lengths are expressed making explicit the sensor aspect ratio a and using a common focal length f, obtaining f_x = f and f_y = af. Furthermore, a simplified intrinsic matrix, with skew = 0 and a = 1, is often used for many practical applications.


Figure 2.1.2: Simplified model describing the intrinsic camera parameters: focal length f and optical center (c_x, c_y). H and W are respectively the image height and width. The image axis orientation (x_s, y_s) follows the conventional orientation adopted by the majority of Computer Vision libraries.

Knowing the intrinsic matrix, it is possible to express the complete projection between the 3D camera-centered point p_c and the homogeneous version of the pixel address \tilde{x}_s as:

\tilde{x}_s = K p_c    (2.4)

Figure 2.1.2 shows a simplified imaging model involved in the application of the intrinsic parameters for 3D camera-centered point projection.
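As a concrete illustration of equations (2.3) and (2.4), the following minimal NumPy sketch (not part of the thesis pipeline; all numeric values are made up for the example) projects a camera-centered 3D point onto pixel coordinates and shows how the depth is lost in the perspective divide.

```python
import numpy as np

# Hypothetical intrinsics for a 640x480 sensor (illustrative values only).
fx, fy = 500.0, 500.0          # focal lengths in pixels
cx, cy = 320.0, 240.0          # optical center
K = np.array([[fx, 0.0, cx],
              [0.0, fy, cy],
              [0.0, 0.0, 1.0]])

# A 3D point expressed in the camera coordinate system (meters).
p_c = np.array([0.2, -0.1, 2.0])

# Equation (2.4): homogeneous pixel coordinates x_s = K p_c.
x_h = K @ p_c

# Perspective divide: dropping the scale factor discards the depth z.
u, v = x_h[0] / x_h[2], x_h[1] / x_h[2]
print(f"pixel = ({u:.1f}, {v:.1f}); depth z = {p_c[2]} is no longer recoverable from (u, v)")
```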

2.1.2 Camera Matrix

While the intrinsic parameters are used to map the 3D coordinates from the camera coordinate system to the image space, the extrinsic parameters are used to express the camera's position and orientation in the world space. These parameters are expressed with a 3x4 extrinsic matrix, where the first three columns compose the rotation matrix R and the last column is the 3D translation vector t.

\left( R \,|\, t \right)    (2.5)

With the intrinsic and extrinsic parameters it is then possible to write the complete perspective projection equation that maps a 3D world point p_w onto the 2D pixel coordinates:

\tilde{x}_s = K \left( R \,|\, t \right) p_w = P p_w    (2.6)

where P = K(R|t) is known as the camera matrix. Equation (2.6) is also used in the camera calibration process where, knowing \tilde{x}_s and p_w, intrinsic and extrinsic parameters are estimated simultaneously. In many cases, it is preferable to use the camera matrix in its 4x4 invertible form, so that a direct mapping from 3D coordinates p_w = (x_w, y_w, z_w, 1) to screen coordinates (plus disparity) x_s = (x_s, y_s, 1, d) is possible:

x_s \sim \tilde{P} p_w    (2.7)

\tilde{P} = \begin{bmatrix} K & 0 \\ 0^T & 1 \end{bmatrix} \begin{bmatrix} R & t \\ 0^T & 1 \end{bmatrix} = \tilde{K} E    (2.8)

where E is a 3D rigid-body (Euclidean) transformation and \tilde{K} is the invertible, full-rank 4x4 form of the calibration matrix. Note that after the multiplication by \tilde{P} in (2.7), the resulting vector is divided by the third component, obtaining the normalized form x_s = (x_s, y_s, 1, d). Here it is possible to observe that the disparity d is actually the inverse of the depth, d = 1/z.
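The sketch below (illustrative only; the pose and intrinsics are invented for the example) builds the 4x4 matrices of equation (2.8) with NumPy and verifies that the normalized screen coordinate carries the inverse depth d = 1/z.

```python
import numpy as np

def full_rank_camera_matrix(K, R, t):
    """Build P~ = K~ E from equation (2.8), both factors as 4x4 matrices."""
    K4 = np.eye(4)
    K4[:3, :3] = K                      # K~: calibration lifted to 4x4
    E = np.eye(4)
    E[:3, :3], E[:3, 3] = R, t          # E: rigid-body (Euclidean) transform
    return K4 @ E

K = np.array([[500.0, 0.0, 320.0],
              [0.0, 500.0, 240.0],
              [0.0, 0.0, 1.0]])
R = np.eye(3)                           # camera aligned with the world axes
t = np.array([0.0, 0.0, 0.5])           # world-to-camera translation (meters)

P_full = full_rank_camera_matrix(K, R, t)

p_w = np.array([0.2, -0.1, 2.0, 1.0])   # homogeneous world point
x = P_full @ p_w                        # un-normalized screen coordinates
x_s = x / x[2]                          # normalize by the third component
z_cam = (R @ p_w[:3] + t)[2]
print("pixel:", x_s[:2], " disparity d:", x_s[3], " 1/z:", 1.0 / z_cam)
```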

2.2 Homography

The term homography refers to the relation between two images of the same planar surface (Figure 2.2.1). This relation allows the mapping from one camera to another as a two-step projection, camera 1 to world coordinates and world to camera 2, using the full-rank 4x4 camera matrix \tilde{P} = \tilde{K} E; the co-planarity assumption is fundamental. As mentioned in Section 2.1, the depth information, and thus the disparity value d_0, is not accessible, which makes the back-projection (2D-to-3D) impractical. However, with the points of the two images lying on the same plane, it is possible to map points to infinity (z = \infty, d = 0). This allows the depth information to be ignored and a 3x3 homography matrix \tilde{H} to be used to map image points to another camera position:

\tilde{x}_1 \sim \tilde{H}_{10} \tilde{x}_0    (2.9)

where \sim means equal up to scale, \tilde{x}_1 and \tilde{x}_0 are the points of camera 1 and camera 0 respectively, and \tilde{H}_{10} is the homography matrix that transforms coordinates from camera 0 to camera 1.

Figure 2.2.1: Planar homography induced by points all lying on a common plane with general plane equation \hat{n}_0 \cdot p + c_0 = 0.
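As an illustration of equation (2.9), the short NumPy sketch below (with an arbitrary, made-up homography) maps a point from camera 0 to camera 1 and normalizes the result, since the relation only holds up to scale.

```python
import numpy as np

# A made-up 3x3 homography H_10 (e.g., as estimated from point correspondences).
H_10 = np.array([[1.02, 0.01, 15.0],
                 [-0.01, 0.99, -8.0],
                 [1e-5, 2e-5, 1.0]])

x0 = np.array([320.0, 240.0, 1.0])   # homogeneous point in image 0
x1 = H_10 @ x0                       # equation (2.9), defined up to scale
x1 /= x1[2]                          # normalize so the third component is 1
print("corresponding point in image 1:", x1[:2])
```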

2.3 Epipolar Geometry

Epipolar geometry is the geometry involved in stereo vision; it describes the relations between two images of the same scene taken from different positions. Figure 2.3.1 shows two pinhole cameras centered in c_0 and c_1 observing a common point p. The projections of the point on the two images are denoted x_0 and x_1 respectively. Each image point is defined by the intersection between the image plane and the ray projected from the optical center c through the observed point. The line connecting the two optical centers intersects the image planes in the two epipoles e_0 and e_1. On each image plane, the line through the image point x and the epipole e is the epipolar line. These lines can also be seen as the intersection of the epipolar plane, defined by the two camera centers and the observed point, with the two image planes.


Figure 2.3.1: Illustration of the geometry involved in epipolar geometry.
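The epipolar constraint sketched in Figure 2.3.1 can be checked numerically: given a fundamental matrix F relating the two views, a point x_0 in the first image constrains its correspondence x_1 to lie on the epipolar line F x_0. The NumPy snippet below is a generic illustration with made-up calibration and pose, not thesis data.

```python
import numpy as np

def skew(v):
    """Cross-product matrix [v]_x."""
    return np.array([[0, -v[2], v[1]],
                     [v[2], 0, -v[0]],
                     [-v[1], v[0], 0]])

K = np.array([[500.0, 0, 320.0], [0, 500.0, 240.0], [0, 0, 1.0]])
R = np.eye(3)                                  # second camera: pure translation
t = np.array([0.1, 0.0, 0.0])                  # 10 cm baseline along x

# Fundamental matrix from known calibration and relative pose:
# F = K^-T [t]_x R K^-1, so that x1^T F x0 = 0 for corresponding points.
F = np.linalg.inv(K).T @ skew(t) @ R @ np.linalg.inv(K)

X = np.array([0.3, -0.2, 2.0])                 # a 3D point in camera-0 coordinates
x0 = K @ X;            x0 /= x0[2]             # projection in image 0
x1 = K @ (R @ X + t);  x1 /= x1[2]             # projection in image 1

print("epipolar residual x1^T F x0 =", float(x1 @ F @ x0))   # ~0
print("epipolar line in image 1 (a, b, c):", F @ x0)
```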

2.4 3D Reconstruction Algorithms

In past years, several algorithms have been developed to retrieve the lost depth cue in a passive manner. Depending on the number of input images, it is possible to rely on different techniques to obtain the depth information [41][24].

In particular, the approaches can be categorized into monocular (single-view) depth estimation and multi-view depth estimation. The first category, as the name suggests, involves a single still image; methods in this group therefore exploit the geometry and characteristics of the environment to estimate the depth cue. Linear Perspective and Atmosphere Scattering are examples of monocular 2D-to-3D conversion algorithms. Multi-view algorithms, instead, make use of multiple images as input. These images can be acquired either by several fixed cameras placed at different positions and viewing angles, or by a single camera capturing pictures of the environment over time. Algorithms such as Binocular Disparity, Motion Parallax, Image Blur, Silhouette and Structure from Motion belong to this category.

2.4.1 Linear Perspective

This approach exploits the geometry of the scene depicted in the image. It is based on the perspective projection principle, for which parallel lines in the real environment tend to converge to a vanishing point with distance. Battiato, Curti et al. [4] proposed a depth cue estimation method that, based on a rigid scene geometry assumption, performs edge detection in order to find the vanishing point and the corresponding vanishing lines. Then, for each pair of vanishing lines, it assigns a gradient plane which corresponds to a specific depth level. The closer the pixels are to the vanishing point, the higher the intensity of the gradient plane, and thus the larger the depth level. A representative illustration of the proposed approach is shown in Figure 2.4.1.

Figure 2.4.1: Gradient plane assignment; green points are the vanishing points while red lines are the vanishing lines.

2.4.2 Atmosphere Scattering

The Atmosphere Scattering algorithm is particularly suitable for perceiving the depth cue in outdoor environments where a portion of the sky is present in the image. It is based on the physical scattering model studied by Lord Rayleigh and presented in his publications in 1871 [31][32]. The scattering model describes how the light interacts with small particles in the air while propagating through the atmosphere. The result is a diffuse radiation called atmosphere scattering, which causes distant objects to appear less distinct and covered in a sort of bluish haze compared to nearby objects (Figure 2.4.2). A first analysis for the conversion of this phenomenon into representative depth information was proposed by Cozman and Krotkov [9] in 1997.

Figure 2.4.2: Atmosphere scattering phenomenon on a landscape.

2.4.3 Binocular Disparity

Binocular disparity is the mechanism most similar to the human visual system, and is thus considered the main approach for depth perception (Figure 2.4.3). The principle of this algorithm consists in estimating the depth information starting from two images of the same scene captured from two slightly different viewpoints. A matching algorithm is applied to find correspondences between the two input pictures and then the depth of the matched features is computed by means of the triangulation method [19]. However, this algorithm requires solving the so-called stereo matching problem, which refers to how a point in the right image can be found in the left image.

Figure 2.4.3: Binocular disparity.

Inherent ambiguities of the image pair, for instance occlusions, generally make the stereo matching problem hard to solve. To overcome this issue, several constraints are applied. Typically, epipolar geometry and camera calibration are used as major constraints ("hard" constraints) in order to allow the rectification of the image pair.

Image rectification consists in transforming images taken from different perspectives and projecting them onto a common image plane. To achieve better depth accuracy, other constraints may be applied, for example photometric, ordering, uniqueness and smoothness constraints; these are called "soft" constraints. Generally, different versions of the binocular disparity algorithm impose different sets of constraints.
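For rectified image pairs, depth follows from disparity as Z = f * B / d, with f the focal length in pixels and B the baseline. The sketch below is a generic example of this relation using OpenCV's block-matching stereo algorithm; it is not the method used in the thesis, and the file names and camera parameters are placeholders.

```python
import cv2
import numpy as np

# Load a rectified grayscale stereo pair (placeholder file names).
left = cv2.imread("left.png", cv2.IMREAD_GRAYSCALE)
right = cv2.imread("right.png", cv2.IMREAD_GRAYSCALE)

# Classic block matching; disparities are returned as 16x fixed-point values.
stereo = cv2.StereoBM_create(numDisparities=64, blockSize=15)
disparity = stereo.compute(left, right).astype(np.float32) / 16.0

# Depth from disparity: Z = f * B / d (placeholder focal length and baseline).
f_px, baseline_m = 500.0, 0.1
valid = disparity > 0
depth = np.zeros_like(disparity)
depth[valid] = f_px * baseline_m / disparity[valid]
print("median depth of valid pixels [m]:", np.median(depth[valid]))
```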


Figure 2.4.4: Motion parallax illustration: the farther the objects, the slower they appear to move.

2.4.4 Motion

Another important mechanism for depth perception is based on the relative motion between the camera and the observed environment. Moving objects seen by the camera appear to move at different speeds depending on their distance from the observation point, and the same happens for a static scene and a moving camera (Figure 2.4.4). In both cases it is possible to find the motion field, the field of 2D velocity vectors induced by the relative motion between camera and scene, under the basic assumptions of a rigid scene and linear movements. Algorithms that exploit motion to retrieve the depth information can be either optical flow based or feature based. The first approach considers the apparent motion of the image brightness, which is constrained to be constant, as an approximation of the motion field. The feature-based method, on the other hand, consists in tracking image features along a sequence of pictures to generate a sparse depth map. Typically, the Kalman filter [12], a recursive algorithm to estimate the position of the moving features over the sequence, is used as tracker.
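As a small illustration of the optical-flow-based variant, the snippet below computes dense flow between two consecutive frames with OpenCV's Farnebäck algorithm; it is a generic example with placeholder file names, not an estimator used in the thesis.

```python
import cv2
import numpy as np

prev = cv2.imread("frame_000.png", cv2.IMREAD_GRAYSCALE)
curr = cv2.imread("frame_001.png", cv2.IMREAD_GRAYSCALE)

# Dense optical flow (Farneback): one 2D displacement vector per pixel,
# an approximation of the motion field described above.
flow = cv2.calcOpticalFlowFarneback(prev, curr, None,
                                    pyr_scale=0.5, levels=3, winsize=15,
                                    iterations=3, poly_n=5, poly_sigma=1.2, flags=0)

magnitude = np.linalg.norm(flow, axis=2)
# Under camera translation, nearby points move faster (motion parallax),
# so the flow magnitude is a rough, scale-free proxy for inverse depth.
print("mean flow magnitude [px]:", magnitude.mean())
```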

2.4.5 Structure from Motion

The Structure from Motion algorithm takes advantage of the previously presented techniques: it aims to simultaneously estimate the 3D structure of scenes and objects and the camera positions from a set of image correspondences (S. Ullman, 1979 [36]).


Figure 2.4.5: Structure from Motion used in architecture 3D reconstruction.

Figure 2.4.6: Structure from Motion pipeline scheme.

Structure from Motion has become popular because no specialized equipment is required: standard consumer cameras work well, making SfM a suitable solution for various contexts (Figure 2.4.5). The typical algorithm pipeline involves two main phases. The first is known as the matching phase, where Feature Extraction, Feature Matching and Geometric Verification are applied. Once relevant and verified correspondences have been found, the reconstruction phase, which involves Image Registration, Triangulation and Bundle Adjustment, takes place. In this stage camera poses and point depths are estimated in an iterative manner. Furthermore, it is important to mention that a good initialization for the reconstruction phase is needed.

A bad initialization can produce numerous errors that will accumulate over the iterative cycle. Figure 2.4.6 shows the complete SfM pipeline [7]. The implementation of such an algorithm results in a 3D reconstruction in the form of a sparse point cloud. Sparse point clouds, when they are composed of a large number of points, are enough for visualization purposes. In addition, if a mesh of the rebuilt scenario is needed, a dense point cloud can be generated by means of Multi-view Stereo matching (MVS).


Figure 2.4.7: The figure shows the basic principle of the Structure from Motion approach

A description of the phases involved in the SfM pipeline is set out below.

Feature Extraction & Matching

Feature Extraction is the first operation performed in the standard SfM procedure. Each input image is processed in order to locate points of interest (key points). Typically, Scale Invariant Feature Transform (SIFT) [20] features are used in this phase. SIFT descriptors have the advantage of being particularly robust against changes in scale, large variations of viewpoint and challenging conditions such as different illumination and partial occlusions. For this reason, SIFT features are widely implemented in the majority of the available SfM algorithms.

Next, the extracted features are used to determine whether the images have common parts of the scene and are therefore at least partially overlapping. Matches are found by comparing key points in different images: if a pair of images contains features with the same descriptor, those points can be considered to be the same point in the scene, and therefore the images overlap. This phase results in a set of overlapping images (at least in pairs) and correspondences between features.
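A minimal version of the feature extraction and matching step can be written with OpenCV as follows; this is a generic sketch with placeholder image names and Lowe's ratio test, not the exact configuration used inside VisualSfM.

```python
import cv2

img1 = cv2.imread("view1.jpg", cv2.IMREAD_GRAYSCALE)
img2 = cv2.imread("view2.jpg", cv2.IMREAD_GRAYSCALE)

# Detect SIFT key points and compute their descriptors in both images.
sift = cv2.SIFT_create()
kp1, des1 = sift.detectAndCompute(img1, None)
kp2, des2 = sift.detectAndCompute(img2, None)

# Brute-force matching of descriptors, keeping the two best candidates per feature.
matcher = cv2.BFMatcher()
knn_matches = matcher.knnMatch(des1, des2, k=2)

# Lowe's ratio test: keep a match only if it is clearly better than the runner-up.
good = [m for m, n in knn_matches if m.distance < 0.75 * n.distance]
print(f"{len(good)} putative correspondences between the two images")
```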

Geometric Verification

Since the matching phase does not guarantee that the matches correspond exactly to a point in the real world, a Geometric Verification is needed. The process consists in mapping, through a geometric transformation, the matched points from one image to the other; if a sufficient number of points is successfully transformed, then the two images are considered geometrically verified, which means that the points are coherent with the scene's geometry. Homography and epipolar geometry are usually employed in this phase to describe, respectively, the transformation between the two images and the movement of the camera. It is important to mention that epipolar geometry requires knowledge of the intrinsic calibration parameters for the essential matrix E; if such parameters are unknown, it is possible to use the fundamental matrix F. Furthermore, the matching stage could generate outliers which may affect the results of the geometric verification. Therefore, RANdom SAmple Consensus (RANSAC) [10] is used as an optimisation strategy in addition to the geometric verification [17][25]. At the end of this phase, the so-called Scene Graph is obtained, which describes the set of images that are considered verified.
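Continuing the previous sketch, geometric verification with RANSAC can be expressed through OpenCV's robust estimators; the snippet below uses illustrative thresholds and assumes kp1, kp2 and good from the matching sketch, estimating a fundamental matrix and counting the inlier correspondences.

```python
import cv2
import numpy as np

# Coordinates of the putative matches (kp1, kp2, good come from the matching step).
pts1 = np.float32([kp1[m.queryIdx].pt for m in good])
pts2 = np.float32([kp2[m.trainIdx].pt for m in good])

# Robustly estimate the fundamental matrix; RANSAC rejects outlier matches.
F, inlier_mask = cv2.findFundamentalMat(pts1, pts2, cv2.FM_RANSAC,
                                        ransacReprojThreshold=1.0, confidence=0.999)

n_inliers = int(inlier_mask.sum()) if inlier_mask is not None else 0
print(f"{n_inliers} geometrically verified correspondences out of {len(good)}")
# An image pair with enough inliers becomes an edge of the scene graph.
```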

Image Registration

As mentioned in the SfM introduction (Section 2.4.5), a proper initialization is needed to avoid errors in the reconstruction phase. Therefore, before starting the image registration process, the reconstruction is initialized from a dense region of the Scene Graph to provide a solid foundation. Among all the images in the chosen region, the pair of images with the most geometrically verified points is selected, and from these images the first camera pose is computed.

Once the initialization is completed, the image registration process starts. At every iteration, a new image, known as the registered image, is added to the reconstruction. The newly registered image does not bring any information about its camera rotation and translation, which must therefore be calculated. In this step, the camera pose is computed using correspondences between key points on the registered image and the 3D points that have already been reconstructed. The camera pose, defined as a position and orientation in a 3D coordinate system, can then be estimated by solving the Perspective-n-Point (PnP) problem.
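A hedged sketch of the registration step with OpenCV: given 3D points already in the model and their 2D observations in the new image (placeholder random arrays here), the camera pose is recovered with RANSAC-based PnP.

```python
import cv2
import numpy as np

# Placeholder data: N reconstructed 3D points and their 2D detections in the new image.
object_points = np.random.rand(20, 3).astype(np.float32)          # known 3D structure
image_points = np.random.rand(20, 2).astype(np.float32) * 640.0   # matched 2D key points

K = np.array([[500.0, 0, 320.0], [0, 500.0, 240.0], [0, 0, 1.0]], dtype=np.float32)
dist_coeffs = np.zeros(5, dtype=np.float32)                        # assume no lens distortion

# Solve the Perspective-n-Point problem robustly (RANSAC rejects bad 2D-3D matches).
ok, rvec, tvec, inliers = cv2.solvePnPRansac(object_points, image_points, K, dist_coeffs)
if ok:
    R, _ = cv2.Rodrigues(rvec)    # rotation vector -> 3x3 rotation matrix
    print("estimated camera rotation:\n", R, "\ntranslation:", tvec.ravel())
```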

Triangulation

Once no more points can be added to the reconstruction through image registration, a denser visualization can be achieved with triangulation. This step tries to define the 3D coordinates of those points that have not been reconstructed in the previous step, starting from a pair of registered images with common points and known camera poses. The solution of the triangulation problem relies on the epipolar constraint, which states that from one camera pose it is possible to identify the other one and vice versa; in this case the two camera poses are the epipoles. Ideally, knowing the epipoles and thus the epipolar lines, it is possible to define the epipolar plane on which the point to be estimated lies. However, due to inaccuracies introduced by the image registration phase, the point may not lie on the intersection of the epipolar lines; this error is known as the reprojection error.
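The triangulation step itself maps directly onto cv2.triangulatePoints; the example below, using the same made-up calibration and pose as in the earlier sketches, recovers a 3D point from its two projections.

```python
import cv2
import numpy as np

K = np.array([[500.0, 0, 320.0], [0, 500.0, 240.0], [0, 0, 1.0]])
R = np.eye(3)
t = np.array([[0.1], [0.0], [0.0]])            # 10 cm baseline between the two views

# 3x4 projection matrices of the two registered cameras.
P0 = K @ np.hstack([np.eye(3), np.zeros((3, 1))])
P1 = K @ np.hstack([R, t])

# Generate a consistent correspondence from a known 3D point (for illustration).
X_true = np.array([0.3, -0.2, 2.0, 1.0])
x0 = P0 @ X_true; x0 = (x0 / x0[2])[:2].reshape(2, 1)
x1 = P1 @ X_true; x1 = (x1 / x1[2])[:2].reshape(2, 1)

# Triangulate: returns homogeneous 4x1 coordinates, divide by the last component.
X_h = cv2.triangulatePoints(P0, P1, x0, x1)
X = (X_h[:3] / X_h[3]).ravel()
print("triangulated point:", X, " (ground truth:", X_true[:3], ")")
```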

Bundle Adjustment

Usually, the errors generated by inaccuracies in the image registration and triangulation steps are not completely resolved and are propagated through the SfM pipeline. Therefore, it is necessary to minimize the accumulation of the propagated errors. Bundle Adjustment (BA) [35] is the optimization process applied in this final phase. It aims to refine the final point cloud by adjusting the errors in both the camera pose estimation and the triangulation step, thus preventing the propagation of the inaccuracies mentioned before. This is the heaviest phase in terms of computational load and time, since the optimization is performed globally on all the images, and only when the reconstructed point cloud has grown by at least a certain percentage since the last global BA.
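Conceptually, bundle adjustment is a nonlinear least-squares problem over camera poses and 3D points, minimizing the total reprojection error. The sketch below is a deliberately tiny, self-contained illustration of that idea with SciPy (one camera pose refined against fixed, synthetic 3D points); real BA implementations jointly optimize all poses and points with sparse solvers.

```python
import numpy as np
from scipy.optimize import least_squares

K = np.array([[500.0, 0, 320.0], [0, 500.0, 240.0], [0, 0, 1.0]])
points_3d = np.random.rand(30, 3) + np.array([0, 0, 3.0])     # fixed structure (toy data)

def project(points, rvec, tvec):
    """Pinhole projection with a Rodrigues rotation vector."""
    theta = np.linalg.norm(rvec)
    if theta < 1e-12:
        Rm = np.eye(3)
    else:
        k = rvec / theta
        Kx = np.array([[0, -k[2], k[1]], [k[2], 0, -k[0]], [-k[1], k[0], 0]])
        Rm = np.eye(3) + np.sin(theta) * Kx + (1 - np.cos(theta)) * Kx @ Kx
    cam = points @ Rm.T + tvec
    proj = cam @ K.T
    return proj[:, :2] / proj[:, 2:3]

# Synthetic observations from a "true" pose, then a perturbed initial guess.
true_pose = np.array([0.02, -0.01, 0.03, 0.05, -0.02, 0.1])
observations = project(points_3d, true_pose[:3], true_pose[3:])
initial_pose = true_pose + 0.05 * np.random.randn(6)

def residuals(pose):
    # Reprojection error: predicted minus observed pixel coordinates.
    return (project(points_3d, pose[:3], pose[3:]) - observations).ravel()

result = least_squares(residuals, initial_pose)   # nonlinear least-squares refinement
print("refined pose:", result.x, "\nfinal cost:", result.cost)
```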

2.5 Deep Learning-based Methods

Traditional 3D reconstruction algorithms require heavy computations and, despite the proven effectiveness of these methods, they rely on high-quality images as input. This aspect may introduce weaknesses when it comes to processing complex geometries, occlusions and low-texture areas. Such issues have been tackled using deep learning. In particular, some stages of the traditional 3D reconstruction pipeline have been rethought on a deep learning basis. Many solutions have been developed employing Convolutional Neural Networks (CNNs) that estimate the depth information for purposes other than 3D reconstruction. Some of them are used for stereo view synthesis, such as DeepStereo [11], which learns how to generate new views from a single image in order to recreate a synthetic stereoscopic system where the underlying geometry is represented by quantized depth planes. Similarly, Deep3D [43] implements CNNs to convert 2D video into 3D sequences such as anaglyphs for 3D movies or side-by-side views for Virtual Reality (VR) applications.


Figure 2.5.1: Depth CNN and Pose CNN for structure learning in street scenario.

In this case the scene geometry is represented by probabilistic disparity maps. Like Deep3D, other methods follow the recent research on learning three-dimensional structure from a single view. Some of them introduce supervision signals, such as Garg et al. [13], who propose supervision provided by a calibrated stereo pair for single-view depth estimation. On the other hand, new works are moving toward depth estimation in an unsupervised or self-supervised manner from video sequences. These methods work well in the task of inferring scene geometry and ego-motion (similarly to SfM), but in addition they show great potential for other tasks such as segmentation, object motion mask prediction, tracking and other levels of semantics (e.g., [2][18][39][16][23][27]).

Among the unsupervised/self-supervised methods for learning depth and camera pose from videos, important research has been conducted by Vijayanarasimhan et al. [38] with SfM-Net, a framework that jointly trains the network for depth, camera pose and scene motion estimation from video. The framework allows different levels of supervision, from completely supervised with depth ground truth (GT), camera motion GT and optical flow, to self-supervised learning. Similarly, Zhou et al. [44] proposed a completely unsupervised approach, named SfM Learner, which jointly trains both single-view depth estimation and camera pose estimation in the same network. Based on the latter method, Bian et al. proposed an improvement of SfM Learner, the Scale-Consistent SfM Learner (SC-SfM Learner) [5], by adding a scale consistency constraint that tackles the scale ambiguities that occur due to moving objects. Furthermore, since most of the proposed implementations are tested on outdoor environments such as streets (Figure 2.5.1), Bian et al. also proposed a weak rectification algorithm that allows the network to be trained with hand-recorded video sequences.


Figure 2.5.2: Depth Network and Pose Network Architecture (encoder­decoder layout).

The datasets used for SC-SfM Learner and SfM Learner are KITTI [14] and Cityscapes [8], which are composed of video sequences recorded by a camera mounted on top of a vehicle; they are therefore characterized by a dominance of translation in the camera movement. The network employed in these works, DispNet [21] (Section 2.5.1), exploits this dominance of translation to predict the disparity. However, the main characteristic of hand-recorded videos is rotation, which does not contribute to the depth estimation process, as discussed in [6]. Therefore, the proposed weak rectification aims to map video frames onto a common plane to reduce the rotation in favor of the translation, as in KITTI-like datasets.

2.5.1 DispNet

The pillar of the reviewed methods is DispNet [21], which is used for single-view depth prediction. It is composed of an initial contracting stage, made of convolutional layers, followed by an expanding stage that performs upsampling, deconvolutions, convolutions and the loss computation. Features from the contracting part are sent to the corresponding layers in the expanding portion. The network thus follows a traditional encoder-decoder architecture with skip connections and multi-scale side predictions. In addition, SfM Learner uses the initial (contracting) part of the network for the camera pose prediction. Figure 2.5.2 illustrates the network architecture.
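The following PyTorch sketch is a heavily reduced, illustrative encoder-decoder with one skip connection and a sigmoid disparity output; it only mirrors the general DispNet layout described above and is not the architecture used in [21] or in the SC-SfM Learner code.

```python
import torch
import torch.nn as nn

class TinyDispNet(nn.Module):
    """Minimal encoder-decoder with one skip connection (illustrative only)."""
    def __init__(self):
        super().__init__()
        # Contracting part: strided convolutions halve the resolution twice.
        self.enc1 = nn.Sequential(nn.Conv2d(3, 32, 7, stride=2, padding=3), nn.ReLU())
        self.enc2 = nn.Sequential(nn.Conv2d(32, 64, 5, stride=2, padding=2), nn.ReLU())
        # Expanding part: transposed convolutions restore the resolution.
        self.dec2 = nn.Sequential(nn.ConvTranspose2d(64, 32, 4, stride=2, padding=1), nn.ReLU())
        self.dec1 = nn.ConvTranspose2d(32 + 32, 1, 4, stride=2, padding=1)

    def forward(self, x):
        f1 = self.enc1(x)                       # 1/2 resolution, 32 channels
        f2 = self.enc2(f1)                      # 1/4 resolution, 64 channels
        u2 = self.dec2(f2)                      # back to 1/2 resolution
        u1 = self.dec1(torch.cat([u2, f1], 1))  # skip connection from the encoder
        return torch.sigmoid(u1)                # disparity in (0, 1), full resolution

disp = TinyDispNet()(torch.randn(1, 3, 128, 416))
print(disp.shape)   # torch.Size([1, 1, 128, 416])
```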


Chapter 3

Methodologies and Methods

This chapter describes the research methodologies and methods used in the degree project. Following the recent digitalization, the motivation behind this thesis project (Section 1.2) is to generate a digital replica of electric cabins that can be exploited for augmented and virtual reality applications. As introduced in Section 1.4, the first step has been testing the results achieved by a traditional image-based 3D reconstruction algorithm. For this purpose, Visual SfM [42] has been selected. This initial stage was done with the purpose of understanding the basic principles of the SfM algorithm and, therefore, of getting an idea of the quality of the results achieved with this traditional image-based 3D reconstruction method.

After gaining familiarity with Visual SfM, the study moved toward recent state-of-the-art deep learning-based approaches. Specifically, the work has been conducted on the basis of the studies by Bian et al. [5][6]. These two works were chosen since they are particularly suitable for the problem this thesis addresses. Indeed, as reported in Sections 1.3 and 1.4, the aim is to use an approach that allows the network to be trained without supervision and that performs well in indoor scenarios. In addition, a data acquisition campaign was launched in order to create a custom dataset to train the network with electric cabin specific data.

Lastly, a script for point cloud generation was developed. The script was designed to generate a single point cloud for each predicted depth map, or to create a single 3D model combining depth maps and camera poses.
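A minimal sketch of what such a script does, assuming a depth map, the camera intrinsics and a camera-to-world pose are available (all names and values here are illustrative, not the actual script developed for the thesis): every pixel is back-projected with the inverse intrinsics and transformed by the pose.

```python
import numpy as np

def depth_to_point_cloud(depth, K, cam_to_world=np.eye(4)):
    """Back-project a (H, W) depth map into an (N, 3) point cloud in world coordinates."""
    H, W = depth.shape
    u, v = np.meshgrid(np.arange(W), np.arange(H))
    pixels = np.stack([u, v, np.ones_like(u)], axis=-1).reshape(-1, 3).astype(np.float64)

    # Rays in the camera frame: K^-1 * (u, v, 1), scaled by the measured depth.
    rays = pixels @ np.linalg.inv(K).T
    points_cam = rays * depth.reshape(-1, 1)

    # Move the points into the world frame with the camera pose (4x4 matrix).
    points_h = np.hstack([points_cam, np.ones((points_cam.shape[0], 1))])
    points_world = (points_h @ cam_to_world.T)[:, :3]

    return points_world[depth.reshape(-1) > 0]   # drop pixels with no valid depth

# Toy usage with a flat synthetic depth map and placeholder intrinsics.
K = np.array([[500.0, 0, 320.0], [0, 500.0, 240.0], [0, 0, 1.0]])
cloud = depth_to_point_cloud(np.full((480, 640), 2.0), K)
print(cloud.shape)   # (307200, 3)
```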


3.1 Image-based Reconstruction

Visual SfM is a GUI application for 3D reconstruction developed by ChangChang Wu [42] and based on the Structure from Motion algorithm (Section 2.4.5). It was selected among other SfM implementations because it exploits multi-core parallelism for feature detection, feature matching and Bundle Adjustment (BA). This allows faster processing, which matters because the application requires several high-quality images for a good reconstruction. Tests with this application were performed as a baseline. Visual SfM is generally used to generate a 3D representation of an object by capturing images of the target from different perspectives, leading to a concentric acquisition. For a good reconstruction, a large number of high-quality images is needed, and going from objects to environments the number of images has to be even larger. In fact, factors such as eccentric acquisition, motion parallax, perspective geometry, occlusions and light conditions need to be modeled, and they may have a considerable impact on the 3D reconstruction. The results we obtained are accurate, but the process is demanding in terms of computational load and is therefore not applicable, for example, to mobile devices.

3.2 Novel Approach - Deep Learning

For the purpose of the research, the focus was on the solution by Zhou et al. [44], which, unlike others, proposes an unsupervised learning framework for depth map and camera pose estimation, under a basic assumption of scene rigidity. The framework is composed of a network that predicts the depth map coupled with a second network that predicts the relative position between subsequent frames; image warping is then applied and the photometric error is used as the supervision signal. In the scenario this thesis wants to explore, this is extremely important because it allows in-place workers to acquire training data without explicit annotation, which would require more specialized equipment.

It has been demonstrated in [5] that, due to object motion and a sub-optimal loss function, the framework proposed in [44] suffers from scale ambiguity. This problem is propagated through the network, generating scale-inconsistent depth maps, which turn into errors in depth and pose prediction as the number of frames increases. To solve this problem, Bian et al. [5] propose an additional supervision signal to ensure geometric consistency. Despite the improved results achieved in [5], more recent research [6], conducted by the same authors, has proven the existence of limitations in DispNet. In particular, it was demonstrated that the estimation of the depth map is strictly related to a dominance of translation in the video sequences. Previous implementations have been tested on datasets like KITTI [14], where the camera configuration and the forward motion did not highlight the issue. In fact, when videos are acquired in challenging environments, such as indoors, the motion of the acquisition system is characterized by a predominance of rotations instead of pure translation in (x, y, z), as in the street case. For this reason, Bian et al. present a weak rectification algorithm to solve the translation problem, together with a rectified version of the NYU dataset [26].

3.3 SC-SfM Learner

SC-SfM Learner is based on the fully unsupervised framework proposed by Zhou et al. that allows the estimation of depth maps and camera poses from monocular videos. It has been proposed to overcome the issues mentioned above (Section 3.2). The improvement consists in a new loss function, the geometry consistency loss, which enforces geometric consistency on the predicted results, thus ensuring a globally scale-consistent camera trajectory over long video sequences. As for SfM Learner [44], the framework first predicts two depth maps (D_a, D_b) from two consecutive frames (I_a, I_b); then the pose network predicts the 6-DOF relative camera pose P_ab between these two samples. Once depths and pose are obtained, the next frame I_b is warped onto the first one to synthesize a reference image I_a'. A photometric loss between the original frame I_a and the synthesized one I_a' is then applied as supervision. In addition, the photometric loss function is combined with two more supervision signals, the smoothness loss and the geometry consistency loss proposed by the authors.

3.3.1 Loss Functions

Photometric loss

The photometric loss is a widely used loss function in depth estimation; it measures the appearance difference between two images. However, this loss does not consider any structural information, since it computes only the pixel-wise intensity difference. The Structural Similarity Index (SSIM) is therefore added to the photometric loss function in order to also consider luminance, contrast and structure between the two images; the basic idea is to emphasize the structure in the image by reducing the weights on luminance and contrast. As mentioned before, the predicted depth is used together with the relative camera pose to synthesize the reference image I_a' by warping I_b. The final objective function considering the photometric loss and the similarity metric is formulated as follows:

L_p = \frac{1}{|V|} \sum_{p \in V} \left( \lambda_i \, \lVert I_a(p) - I_a'(p) \rVert_1 + \lambda_s \, \frac{1 - \mathrm{SSIM}_{aa'}(p)}{2} \right)    (3.1)

where V is the set of valid points successfully projected from I_a to I_b in the warping operation, SSIM_{aa'} is the Structural Similarity index between I_a and I_a' proposed in [40], and \lambda_i and \lambda_s are the weights used in the framework, set to 0.15 and 0.85 respectively.

Smoothness loss

This loss function is introduced to regularize low-texture and homogeneous regions in the scene. It is formulated as:

L_s = \sum_{p} \left( e^{-\nabla I_a(p)} \cdot \nabla D_a(p) \right)^2    (3.2)

where \nabla is the first derivative along the two spatial directions, I_a is the input image and D_a the depth map predicted from I_a. It ensures that smoothness is guided by the edges of the image [5][28].

Geometry consistency loss

As mentioned before, the Geometry Consistency (GC) loss was introduced as a supervision signal to ensure scale consistency in the predictions. The optimization consists in minimizing the difference between the predicted depth maps D_a and D_b, related by the relative camera pose P_ab: the two depth maps are required to match the same 3D scene structure. The characteristic of this optimization is that the geometry consistency is propagated along the entire sequence, since it works on pairs of subsequent frames: e.g., the depth map from frame I_0 must match the one from frame I_1, and likewise the depth map from I_i must agree with the one from I_{i+1}. The objective function is then formulated as:

L_{GC} = \frac{1}{|V|} \sum_{p \in V} D_{diff}(p)    (3.3)

where D_{diff}(p) is the depth inconsistency map defined, for each p \in V, as:

D_{diff}(p) = \frac{\left| D_a^b(p) - D_b'(p) \right|}{D_a^b(p) + D_b'(p)}    (3.4)

where D_a^b is the depth map of I_b computed by warping D_a using P_ab, and D_b' is the depth map interpolated from the estimated depth map D_b.

Final objective function

The overall objective function combines all the losses described above with an additional level of supervision defined by the self-discovered mask:

L = \alpha L_p^M + \beta L_s + \gamma L_{GC}    (3.5)

where L_p^M is the photometric loss weighted by the proposed self-discovered mask. During training, the parameters \alpha, \beta and \gamma have been set to 1.0, 1.0 and 0.5 respectively. The mask M is responsible for handling the dynamic objects and occlusions that occur in the video sequence. The inconsistencies produced by these critical situations can be found in the depth inconsistency map D_{diff}, and are identified as values different from the ideal value of zero. The proposed mask is then defined in [0, 1] as:

M = 1 - D_{diff}    (3.6)

in order to reduce the contribution of inconsistent pixels and assign high weights to the consistent ones. The mask is therefore used for weighting the previously defined photometric loss:

L_p^M = \frac{1}{|V|} \sum_{p \in V} M(p) \cdot L_p(p)    (3.7)
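The following PyTorch sketch shows how losses of this shape can be assembled, assuming the synthesized image I_a', the warped depth D_a^b and the interpolated depth D_b' have already been produced by the view-synthesis step (which is omitted here); it is an illustrative re-expression of equations (3.1)-(3.7), not the authors' implementation.

```python
import torch
import torch.nn.functional as F

def ssim_map(x, y, window=3, c1=0.01 ** 2, c2=0.03 ** 2):
    """Simplified per-pixel SSIM using average pooling as the local window."""
    mu_x, mu_y = F.avg_pool2d(x, window, 1, 1), F.avg_pool2d(y, window, 1, 1)
    sigma_x = F.avg_pool2d(x * x, window, 1, 1) - mu_x ** 2
    sigma_y = F.avg_pool2d(y * y, window, 1, 1) - mu_y ** 2
    sigma_xy = F.avg_pool2d(x * y, window, 1, 1) - mu_x * mu_y
    num = (2 * mu_x * mu_y + c1) * (2 * sigma_xy + c2)
    den = (mu_x ** 2 + mu_y ** 2 + c1) * (sigma_x + sigma_y + c2)
    return (num / den).clamp(0, 1)

def total_loss(I_a, I_a_syn, D_a, D_a_warped, D_b_interp,
               lam_i=0.15, lam_s=0.85, alpha=1.0, beta=1.0, gamma=0.5):
    # Eq. (3.4): depth inconsistency map, and Eq. (3.6): self-discovered mask.
    D_diff = (D_a_warped - D_b_interp).abs() / (D_a_warped + D_b_interp)
    M = 1.0 - D_diff

    # Eq. (3.1): photometric term (L1 + SSIM), weighted by the mask as in Eq. (3.7).
    photo = lam_i * (I_a - I_a_syn).abs().mean(1, keepdim=True) \
          + lam_s * (1.0 - ssim_map(I_a, I_a_syn).mean(1, keepdim=True)) / 2.0
    L_p_M = (M * photo).mean()

    # Eq. (3.2): edge-aware smoothness (horizontal direction only, for brevity).
    grad_D = (D_a[..., :, 1:] - D_a[..., :, :-1]).abs()
    grad_I = (I_a[..., :, 1:] - I_a[..., :, :-1]).abs().mean(1, keepdim=True)
    L_s = (torch.exp(-grad_I) * grad_D).pow(2).mean()

    # Eq. (3.3): geometry consistency loss.
    L_GC = D_diff.mean()

    # Eq. (3.5): final objective.
    return alpha * L_p_M + beta * L_s + gamma * L_GC
```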


Figure 3.3.1: Overall SC­SfM Learner network architecture.

3.3.2 Network Architecture

The framework is composed of a network that predicts the depth map coupled with a second network used to predict the relative camera position between subsequent frames (Figure 2.5.2). For the depth estimation, DispNet and DispResNet have been used; the latter is an improved version of DispNet where the convolutional blocks are replaced by residual blocks to improve the disparity estimates, since it is easier to learn the residual than the disparity itself, as described in [28]. For the ego-motion network, on the other hand, PoseNet without the final expanding part, which is responsible for the explainability mask prediction, was used (Figure 2.5.2 (b)). The overall architecture is illustrated in Figure 3.3.1.

3.3.3 Dataset Rectification

Despite the improved results achieved in [5], more recent research conducted by the same authors [6] has shown the existence of limitations in DispNet. In particular, it has been demonstrated that the quality of the estimated depth map is strictly related to a dominance of translation in the video sequences. Previous implementations were tested on datasets such as KITTI [14], where the camera configuration and the forward motion did not expose the issue. When acquiring videos in challenging environments, such as indoors, the motion of the acquisition system is instead characterized by a predominance of rotations rather than the pure translation in $(x, y, z)$ typical of the street case. For this reason, Bian et al. present a weak rectification algorithm to address the translation problem, together with a rectified version of the NYU dataset. The rectified dataset available on the GitHub repository [37] is then used to train the network on indoor scenarios.


3.4 Dataset

This section presents a brief description of the datasets used during training and testing.

• RGB-D TUM Dataset: The dataset used in the first stage of the project was the sequence freiburg1_360, which contains a 360-degree turn in a typical office environment. The choice of this dataset is based on the research objective, since the majority of the electric cabins spread over the territory have approximately the same size. The dataset is provided with depth ground truth acquired by a Microsoft Kinect sensor, while the camera pose ground truth is acquired with an external motion capture system. The sequence is composed of 756 frames of size (640, 480). A more detailed description is available on the dataset website [29] and in the conference paper [33].

• RGB-D Microsoft 7 Scenes Dataset: This dataset consists of sequences of tracked RGB-D frames. It is provided with depth ground truth from Microsoft Kinect, while the camera pose ground truth has been computed by means of the Iterative Closest Point (ICP) algorithm and frame-to-model alignment with respect to a dense reconstruction represented by a truncated signed distance volume. It should be noted that the depth ground truths are encoded with 16 bits, leading to 65536 depth levels. Most of the tests have been performed on the sequence office_seq07, which contains 1000 frames of size (640, 480). The frames are not calibrated, but default intrinsic parameters are provided on the dataset website [1] [15].

• NYUv2 Depth Dataset: The dataset [26] is composed of RGB-D video sequences from several indoor environments acquired with the Microsoft Kinect sensor. This dataset is used by Bian et al. [6] to train the SC-SfM Learner network after being processed with the weak rectification proposed by the authors. The dataset is publicly available on the GitHub repository of the indoor version of SC-SfM Learner.

3.4.1 Custom Dataset

As introduced in Section 3.3.3, hand-recorded videos are not suitable for network training, since they are characterized by a strong dominance of rotational camera motion that does not contribute to the disparity learning. Since the rectification algorithm is not currently available, the idea was to acquire the video sequences of the electric cabins in a way that encourages translational motion over rotational motion.

Name          #Images   Ref.
freiburg_360  756       [29][33]
Office Seq07  1000      [1][15]
NYUv2         654*      [26]

Table 3.4.1: Details of the three datasets used in the testing phase. All images are (640x480). *The samples used for the NYUv2 dataset are a portion of the full dataset, provided for testing in the publicly available implementation of [37].

The acquisition has been conducted by the technicians of the utility company with which Arcoda is collaborating. As mentioned in Section 1.4, these technicians are equipped with the Samsung Galaxy Note 10+, which can acquire depth images thanks to the embedded ToF sensor. Based on that, an application has been developed to simultaneously capture RGB images and the corresponding depth information.

However, even with the intrinsic parameters of the ToF camera available, it was not possible to register the depth stream with the RGB one. This issue prevented the use of depth information in the validation stage during network training.

Apart from that, at the time of writing, 2 GB of electric cabin video sequences have been acquired, reduced to 1.5 GB after a cleaning process. This process was needed since many redundant objects were present in each sequence and some frames were capturing the outside of the cabin. In conclusion, the dataset consists of 32 video sequences of different cabins recorded at a frame rate of 30 FPS, at (640x480), and with an average duration of 40 seconds.

3.5 Point Cloud

The term point cloud refers to a set of points in a 3D coordinate system. Starting from the depth maps produced by the framework used in this project, a point cloud is generated by projecting each valid pixel (pixel with a non-zero value) with the perspective projection technique explained in Section 2.1. It is possible to perform a projection from 2D to 3D since each pixel of the depth map is associated with a depth value encoded in 256 gray levels. Therefore, the 3D coordinates of each point of the point cloud are computed as follows:

$$\begin{bmatrix} x \\ y \\ z \end{bmatrix} =
\begin{bmatrix} (u - c_x)\,d / f_x \\ (v - c_y)\,d / f_y \\ d \end{bmatrix} \tag{3.8}$$

where $(u, v)$ are the pixel coordinates in the image, $(c_x, c_y)$ are the coordinates of the optical center, $(f_x, f_y)$ are the focal lengths along the two image axes, and $d$ is the depth value of the pixel located at $(u, v)$. In order to make the point cloud easier to interpret, each point is assigned the RGB values of the corresponding pixel in the color image.
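A minimal NumPy sketch of this back-projection (Eq. 3.8) is shown below; the function name and the way the intrinsics are passed are assumptions for illustration, not the actual point cloud script used in the project.

```python
import numpy as np

def depth_to_point_cloud(depth, rgb, fx, fy, cx, cy):
    """Back-project a depth map (H, W) into a colored point cloud (Eq. 3.8).
    rgb: (H, W, 3) color image aligned with the depth map."""
    h, w = depth.shape
    u, v = np.meshgrid(np.arange(w), np.arange(h))   # pixel coordinates
    valid = depth > 0                                # keep only non-zero pixels
    d = depth[valid]
    x = (u[valid] - cx) * d / fx
    y = (v[valid] - cy) * d / fy
    z = d
    points = np.stack([x, y, z], axis=1)             # (N, 3) 3D coordinates
    colors = rgb[valid]                              # (N, 3) RGB per point
    return points, colors
```

The resulting points and colors can then be exported, for example as a PLY file, for inspection in MeshLab.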

3.5.1 Point Cloud Alignment

The point cloud script is designed to either generate a point cloud for each depth map or to create a single point cloud of the environment by combining depth maps and camera poses. The spatial transformation applied is defined as follows:

$$\begin{bmatrix} x' \\ y' \\ z' \\ 1 \end{bmatrix} =
\begin{bmatrix} R & t \\ 0 & 1 \end{bmatrix}^{-1}
\begin{bmatrix} x \\ y \\ z \\ 1 \end{bmatrix} \tag{3.9}$$

where $\left[\begin{smallmatrix} R & t \\ 0 & 1 \end{smallmatrix}\right]$ describes the camera pose as a rotation and translation matrix in its invertible homogeneous form (Section 2.1.2). In addition, due to some inconsistencies in the camera pose ground truth of the 7 Scenes dataset, it was necessary to perform a manual alignment using the ICP function of the open-source software MeshLab [22].
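The sketch below applies the transformation of Eq. (3.9) to the points produced by the previous back-projection step; the pose is assumed to be given as a 3x3 rotation R and a translation vector t, and the function name is a placeholder.

```python
import numpy as np

def transform_points(points, R, t):
    """Apply Eq. (3.9): map points into a common reference frame using the
    inverse of the camera pose [R t; 0 1].
    points: (N, 3); R: (3, 3) rotation; t: (3,) translation."""
    T = np.eye(4)
    T[:3, :3] = R
    T[:3, 3] = t
    T_inv = np.linalg.inv(T)                                   # invert the 4x4 pose
    homo = np.hstack([points, np.ones((points.shape[0], 1))])  # (N, 4) homogeneous
    return (T_inv @ homo.T).T[:, :3]                           # back to (N, 3)
```

Transforming each per-frame cloud this way and concatenating the results yields the single combined point cloud of the environment.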

3.6 Hardware & Software

The Python programming language with the PyCharm IDE was used for network training and point cloud generation. The training has been performed on Windows 10 with an Nvidia GeForce GTX 1080 Ti GPU, an Intel Core i7-7700K CPU at 4.2 GHz and 16 GB of RAM. For the Android data acquisition application, Kotlin and Java were used as programming languages, and Android Studio as IDE. The application has been developed for the Samsung Galaxy Note 10+, which is equipped with a ToF sensor. The point cloud post-processing and ICP alignment were performed with the open-source software MeshLab [22].

Framework

SC-SfM Learner is publicly available on GitHub [30][37] and is implemented in PyTorch.
