Self-supervised learning of camera egomotion using epipolar geometry

FEREIDOON ZANGENEH KAMALI

KTH ROYAL INSTITUTE OF TECHNOLOGY


Master in Systems, Control and Robotics
Date: September 21, 2020
Supervisor: Patric Jensfelt
Examiner: Joakim Gustafsson
School of Electrical Engineering and Computer Science
Host company: Univrses AB


Abstract


Sammanfattning (Abstract in Swedish)


Acknowledgement


Contents

1 Introduction
  1.1 Research problem
  1.2 Objectives and contribution
  1.3 Overview

2 Background
  2.1 Interest point detection
  2.2 Multiple view geometry
    2.2.1 View synthesis
    2.2.2 Epipolar geometry

3 Related work
  3.1 Visual odometry: traditional methods
  3.2 Visual odometry: hybrid methods
  3.3 Learning-based egomotion estimation
    3.3.1 Monocular depth prediction
    3.3.2 The earliest egomotion estimation works
    3.3.3 Optical flow prediction for egomotion estimation
    3.3.4 Other modules for performance enhancement

4 Method
  4.1 Concept
    4.1.1 State of the art
    4.1.2 Our approach
  4.2 Pose estimation network
    4.2.1 Architecture design
    4.2.2 Network input and output
  4.3 Self-supervised objective function
    4.3.1 Basic formulation
    4.3.2 Sparse computations
    4.3.3 Local structure dissimilarity
    4.3.4 Advanced formulation
  4.4 Auxiliary regularisation
    4.4.1 Energy function analysis
    4.4.2 Prior information incorporation

5 Experimental setup
  5.1 Dataset
  5.2 Synthetic perturbation experiments
  5.3 Deep network experiments
    5.3.1 End-to-end experiments
    5.3.2 Black-box experiments
  5.4 Evaluation metrics
    5.4.1 Trajectory construction
    5.4.2 Rotation error
    5.4.3 Translation error
    5.4.4 Absolute trajectory error (ATE)
    5.4.5 Relative pose error (RPE)
    5.4.6 Modified absolute trajectory error (mATE)

6 Results and discussion
  6.1 Pose perturbations
    6.1.1 Convexity and convergence
    6.1.2 Objective function tuning
  6.2 Ablation study
    6.2.1 Network settings
    6.2.2 Objective function
  6.3 Benchmarking
    6.3.1 As an end-to-end solution
    6.3.2 As a black-box module
  6.4 Qualitative study of predictions

7 Summary and Conclusion
  7.1 Summary
  7.2 Conclusions
  7.3 Future work

A Our work on a global scale
  A.1 Ethics
  A.2 Society
  A.3 Sustainability

B Conventions
  B.1 Frame definition
  B.2 Translation representation
  B.3 Rotation representation

C Parameter settings
  C.1 Initial settings
  C.2 Final settings

D Mini-sequence averaging
  D.1 Algorithm
  D.2 Averaging transformations
    D.2.1 Averaging translations
    D.2.2 Averaging rotations

E Objective function tuning
  E.1 Minimum search on linear combination


1 Introduction

One of the primary features that a mobile autonomous agent relies on for the fulfilment of its mission is its ability to measure how its position changes over time. This ability is paramount, since it forms the foundation for any calculated attempt at navigation in the environment.

Knowledge of position changes may be inferred through different approaches and with the help of various sensors. Given that a map of the environment is available to the agent, it can observe the environment to establish correspondences with known landmarks and thereby estimate its own position. This process is known as localisation. If the environment is unknown prior to operation, the localisation of the agent is carried out in parallel with a mapping process, which is referred to as Simultaneous Localisation and Mapping (SLAM) [1]. A family of methods that can be used for positioning agents are the odometry methods, which estimate the position of the agent relative to itself at a previous time instance. The positioning can be performed based on various forms of sensory data, such as data from laser scanners, inertial sensors, or cameras [2, 3]. The decreasing cost of cameras, in conjunction with increasingly affordable on-board processing power, has given rise to the popularity of localisation and odometry solutions using cameras, in particular the monocular variant.

In this work, we focus on the odometry of autonomous agents purely using monocular camera images. This is a type of visual odometry that has traditionally been carried out via classical computer vision, with geometric and photometric computations combined with optimisation or estimation theory to infer the position of the agent. Traditional visual odometry methods, however, suffer from the inherent problem of requiring large engineering and tuning efforts for the conditions they are expected to operate in, which can result in underperformance in extra-nominal conditions [4].

With the advent of neural networks and deep learning techniques for image analysis, many attempts have been made to apply deep learning to solve, or at least partially facilitate, the visual odometry problem. Learning-based solutions using deep network architectures are known to be capable of offering scalable levels of robustness to variations in operating conditions, without requiring additional design or fine-tuning steps.

In this work we focus on the self-supervised training of a deep neural network that estimates the relative motion of the camera between a sequence of input image frames. The depth map of the input images is a rich source of information that can facilitate the estimation of camera motion. The state of the art works in this field therefore rely on the depth information of input images in the training process of the camera motion estimation network. However, as monocular images do not contain such information, the current solutions depend on a second network in the training process, which explicitly estimates the depth map of the input images. Our work, on the other hand, proposes a novel self-supervised training strategy that enables a deep network to learn to estimate the relative pose of the camera from its image sequences, independently of a depth estimation network.

The motion of a camera is known by the term camera egomotion, making egomotion estimation a synonym for visual odometry. While egomotion estimation appears to be the nomenclature more commonly used in learning-based methods, there is no strong consensus regarding the usage context of the terms visual odometry and egomotion estimation, and they may be used interchangeably. However, throughout this report we refer to traditional non-learning-based methods by the term visual odometry, and use the term egomotion estimation to refer to the learning-based methods, for an easier distinction between the two contexts.

1.1 Research problem

In this work we investigate the research question of how we can devise a self-supervised training strategy that enables a deep neural network to learn camera egomotion estimation, purely based on its monocular colour images and without reliance on an explicit depth estimation network. This is to achieve a training setup in which no more network parameters are required to be learned in the training phase than those needed to make inferences. We also aim to evaluate how such a solution compares to the state of the art solutions that depend on explicit depth estimation in terms of the accuracy of predictions.

1.2 Objectives and contribution

In summary, our main objective is to propose a novel strategy that allows a single deep neural network to learn camera egomotion from a sequence of consecutive images, in the absence of ground truth egomotion data, and in isolation from other sources of explicit estimations such as depth. With this objective, our contributions are:

• A deep network architecture, capable of processing a stacked sequence of images as input, to estimate the relative motion of the camera between them.

• A novel objective function that, given a pair of input images, evaluates the quality of the camera's predicted relative motion. This objective function enables the candidate network to learn the camera egomotion through gradient-based optimisation.

1.3 Overview

This report takes the following structure:

• Chapter 2 offers a short summary of some of the important traditional computer vision concepts, which form the foundation for the techniques later introduced in the report.

• Chapter 3 gives an overview of the related literature in our field of focus.

• Chapter 4 motivates the direction of work we take based on the state of the art methodology and lays out the details of our propositions.

• Chapter 5 discusses the important details of the setup used for the evaluation of our design.

• Chapter 6 presents and discusses the results of our experiments.

• Chapter 7 concludes our work by summarising its important points and findings, followed by a short discussion on the possible directions of work that could take our propositions further.


2 Background

In this chapter we review some of the important concepts necessary for a good grasp of the design and ideas that will be discussed in the subsequent chapters. One of these concepts is the detection of interest points from images, which is discussed in Section 2.1. Another important concept, discussed in Section 2.2, is the multiple view geometry between the images.

2.1 Interest point detection

One of the essential tasks in visual odometry is finding correspondences between pixels that represent the same physical structures in images taken from different viewpoints by a moving camera. A common approach to finding such correspondences is through an analysis of the pixel intensity values of the images. This is done via the detection of visually salient points in the images, which are easier to reliably track across sequences of images. These points are generally referred to as interest points.

The basic idea behind intensity-based interest point extraction is the detection of pixel neighbourhoods with large variations of intensity values. If this variation, i.e. the intensity gradient in the spatial domain, occurs only along one orientation in the local neighbourhood of a pixel, the pixel is commonly known to be part of an edge in the image. If the image intensity gradient happens to be significant in two orthogonal orientations, the pixel is referred to as a corner pixel. This difference between edge and corner pixels implies that if we move along the orientation of an edge, the local intensity properties stay relatively similar, whereas a movement along any orientation at a corner pixel results in considerable changes in the local image intensity. This property makes corners superior candidates to edges for finding correspondences between images.


There are various intensity-based methods of interest point detection, a good review of which is offered in the work by Schmid et al. [5]. A common approach for interest point extraction is via an analysis of the auto-correlation of the image, as first used by Moravec [6]. This technique was later employed by Lucas and Kanade [7] for the detection of points to track, in the context of associating the certainty of interest points with optical flow. In the work by Harris and Stephens [8], an expression of the eigenvalues of the auto-correlation matrix is used for the detection of interest points, which remains one of the prominent methods used for this purpose to this day.

As a first step to find the corner pixels of an image $I$ via the method of Harris and Stephens [8], the structure tensor (or second-moment matrix) of every image pixel is computed as

$$M(u, v) = \sum_{x, y} w(x, y) \begin{bmatrix} I_x^2 & I_x I_y \\ I_x I_y & I_y^2 \end{bmatrix} \tag{2.1}$$

for a pixel coordinate $(u, v)$ with a window function $w(x, y)$ around it, where $I_x$ and $I_y$ are the partial derivatives of the image along each direction. The window function may be a rectangular or Gaussian window which assigns non-zero weights to the pixel coordinates $(x, y)$ in the neighbourhood of $(u, v)$. The image partial derivatives $I_x$ and $I_y$ can be obtained via any method that approximates the image gradients. For instance, the Sobel operator, which is composed of Gaussian smoothing and a finite difference kernel, can be used to approximate

$$I_x = \begin{bmatrix} +1 & 0 & -1 \\ +2 & 0 & -2 \\ +1 & 0 & -1 \end{bmatrix} * I \quad \text{and} \quad I_y = \begin{bmatrix} +1 & +2 & +1 \\ 0 & 0 & 0 \\ -1 & -2 & -1 \end{bmatrix} * I, \tag{2.2}$$

where $*$ denotes 2D convolution.

For every pixel of the image a Harris measure $H$ is computed, which is a crafted expression based on the eigenvalues $\lambda_1, \lambda_2 \geq 0$ of its structure tensor, with an alternative formulation in terms of the determinant and trace of the structure tensor:

$$H = \lambda_1 \lambda_2 - k(\lambda_1 + \lambda_2)^2 = \det(M) - k\,\operatorname{trace}(M)^2. \tag{2.3}$$


The pixel is part of an edge if one eigenvalue is much larger than the other (e.g. $\lambda_1 \gg \lambda_2$), and the pixel represents a corner if both eigenvalues are large.

This intuition about the eigenvalues is reflected in the computation of the Harris measure, such that the larger the measure for a pixel, the more corner-like that pixel is. Therefore, corner pixels are detected by forming a mask of the Harris measure for every pixel and finding its local maxima via non-maximum suppression, such that the Harris measure of the found local maxima is larger than a predefined threshold, $H \geq H_0 > 0$. On the other hand, the local minima of such a mask, with $H \leq H_0' < 0$, correspond to the edge pixels of the image, which by definition have only one large eigenvalue.

Shi and Tomasi [9] take a similar approach to Harris and Stephens [8] in computing a score function based on the image pixels' structure tensor. However, they propose a score function of the form

$$R = \min(\lambda_1, \lambda_2). \tag{2.4}$$

Finding the local maxima of a mask formed with this score function yields salient interest points of the image, with no dichotomy between corner and edge pixels.
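
For illustration, a minimal NumPy/SciPy sketch of the two corner responses is given below, computing the Harris measure (2.3) and the Shi-Tomasi score (2.4) for a grayscale image; the Gaussian window and the constant $k = 0.04$ are common but assumed choices, and the function name is illustrative.

```python
import numpy as np
from scipy.ndimage import gaussian_filter, sobel

def corner_scores(image, k=0.04, sigma=1.5):
    """Harris (2.3) and Shi-Tomasi (2.4) responses for a float grayscale image."""
    # Image gradients I_x, I_y, approximated with the Sobel operator as in (2.2)
    Ix = sobel(image, axis=1)
    Iy = sobel(image, axis=0)
    # Entries of the structure tensor M(u, v), windowed by a Gaussian w(x, y) as in (2.1)
    Ixx = gaussian_filter(Ix * Ix, sigma)
    Iyy = gaussian_filter(Iy * Iy, sigma)
    Ixy = gaussian_filter(Ix * Iy, sigma)
    det = Ixx * Iyy - Ixy ** 2          # lambda_1 * lambda_2
    trace = Ixx + Iyy                   # lambda_1 + lambda_2
    harris = det - k * trace ** 2
    # Smallest eigenvalue of the 2x2 structure tensor, i.e. the Shi-Tomasi score
    shi_tomasi = 0.5 * (trace - np.sqrt(np.maximum(trace ** 2 - 4.0 * det, 0.0)))
    return harris, shi_tomasi
```

Interest points would then be picked as local maxima of these response maps above a threshold, e.g. via non-maximum suppression as described above.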

Another concept that commonly goes hand-in-hand with interest point detectors is that of feature descriptors. In order to match an interest point in one image with the interest points of another, it is common to rely on feature descriptors, which are descriptions of the detected interest points in an abstract vector space. The ultimate goal of feature descriptors is to yield unique descriptor vectors for each interest point of the image, such that they remain unchanged when the points are observed from a different viewpoint. For this purpose, the descriptors need to be invariant to image variations across the different viewpoints, such as scale, brightness, and rotation [10, 11].

2.2 Multiple view geometry

This section reviews the geometric relations between images of a scene taken from multiple viewpoints: how the views can be synthesised from each other in Section 2.2.1, and the epipolar geometry ruling between the images in Section 2.2.2.

2.2.1 View synthesis

At the risk of deviating from the common terminology, we divide the view synthesis methods, based on their principles, into two groups: image warping and pixel projection. The former group offers a direct mapping of coordinates between the images, whereas the latter involves an intermediary projection of pixels to 3D space.

Image warping

Image warping in a multi-view camera setup refers to the synthesis of a certain viewpoint's image of a scene from another image of the scene observed from a different viewpoint, via a warping function. The warping function constitutes a certain resampling of the source image's pixels, such that the resampled image mimics an image taken from the target viewpoint of the camera. Various forms of transforms may be used for this purpose, depending on the degrees of freedom of the warping between the images, such as the preservation of orientation, lengths, angles, and parallelism in the images.

A homography, for instance, is a transform with 8 degrees of freedom (preserving none of the aforementioned properties) that maps the pixel coordinates between two viewpoints while preserving the straight lines observed in them. By pre-multiplying the homogeneous pixel coordinates of one image with the 3 × 3 homography matrix, we obtain the corresponding homogeneous pixel coordinates on the other image.

A factor that limits the use of homography matrices for view synthesis is that every homography matrix assumes a planar surface containing the corresponding points in the images. Therefore, to synthesise a full view of an image, the image needs to be segmented based on its planar surfaces, and a separate homography matrix has to be predicted for each plane.
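
For illustration, a minimal sketch of applying a given 3 × 3 homography to a set of pixel coordinates, following the pre-multiplication described above (the array layout is an assumption made for the example):

```python
import numpy as np

def apply_homography(H, uv):
    """Map pixel coordinates between two views with a 3x3 homography H.
    uv is a 2 x N array of pixel coordinates; returns the warped 2 x N coordinates."""
    uv1 = np.vstack([uv, np.ones((1, uv.shape[1]))])  # homogeneous coordinates
    mapped = H @ uv1
    return mapped[:2] / mapped[2]                     # normalise by the third row
```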

Pixel projection

The pixel projection approach uses the camera's projection function $\pi(\cdot)$ to project a point $p = [x, y, z]^T \in \mathbb{R}^3$, expressed in the camera frame's coordinates, to a pixel coordinate pair $\{u, v\} \in \mathbb{R}^2$ on the image plane. This projection function is equivalent to pre-multiplying the point by the camera's intrinsic parameter matrix $K$,

$$\begin{bmatrix} k \cdot u \\ k \cdot v \\ k \end{bmatrix} = \pi(p) = K \begin{bmatrix} x \\ y \\ z \end{bmatrix}, \tag{2.5}$$

which yields the homogeneous coordinates on the image plane scaled by a factor $k$, from which the normalised pixel coordinates $\{u, v\}$ may be extracted. We can also formulate the inverse of this operation, namely the inverse projection of image pixel coordinates back to 3D points,

$$\begin{bmatrix} x \\ y \\ z \end{bmatrix} = \pi^{-1}(\{u, v\}, d) = K^{-1} \begin{bmatrix} d \cdot u \\ d \cdot v \\ d \end{bmatrix}. \tag{2.6}$$

For this operation, in addition to the coordinates $\{u, v\}$ of a pixel, the depth $d$ of the reflected point is also required. This depth $d$ refers to the same quantity as the scale factor $k$, which is marginalised in the forward projection operation shown in (2.5). In other words, the mapping of a point $p \in \mathbb{R}^3$ to coordinates $\{u, v\} \in \mathbb{R}^2$ is many-to-one. This calls for the marginalisation of some quantity in the forward projection, i.e. via normalisation, and the incorporation of an external source of information in the inverse operation.

Given the relative pose between two viewpoints, the forward and inverse projection operations can be used to synthesise the image seen by one viewpoint from another, for their shared field of view. This is done by first inversely projecting the pixel coordinates of the source image to 3D space, applying the transformation for the change of frames on the projected 3D scene points, and finally projecting the transformed scene points onto the image plane of the target viewpoint. For this purpose, having the depth map of one of the images is sufficient. Take as an example two images called source and target, where $T_{t \to s}$ denotes the relative pose between them, and $D_t(\cdot)$ denotes the target image's depth map. Following this method, we can find the coordinates of the pixel $p_s$ on the source image, which corresponds to an arbitrary pixel $p_t$ on the target image, as

$$p_s \sim \pi\!\left( T_{t \to s}\, \pi^{-1}\!\left(p_t, D_t(p_t)\right) \right) = K\, T_{t \to s}\, D_t(p_t)\, K^{-1} p_t, \tag{2.7}$$

where the pixel coordinates are expressed homogeneously and $\sim$ denotes equality up to a scale factor.
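
A minimal sketch of the projection chain (2.5)-(2.7) is given below, assuming pixel coordinates stored as a 2 × N array, per-pixel depths as a length-N vector, and the relative pose $T_{t \to s}$ as a 4 × 4 homogeneous transform; the function names are illustrative.

```python
import numpy as np

def project(K, points):
    """Forward projection pi(p) of 3D camera-frame points (3 x N) to pixels, cf. (2.5)."""
    uvk = K @ points
    return uvk[:2] / uvk[2]                  # normalise away the scale factor k

def back_project(K, uv, depth):
    """Inverse projection pi^{-1}({u, v}, d) of pixels to 3D points, cf. (2.6)."""
    uv1 = np.vstack([uv, np.ones((1, uv.shape[1]))])
    return np.linalg.inv(K) @ (uv1 * depth)

def warp_pixels(K, T_t2s, uv_t, depth_t):
    """Map target pixels to their source-image coordinates as in (2.7)."""
    p_t = back_project(K, uv_t, depth_t)                  # 3D points in the target frame
    p_t_h = np.vstack([p_t, np.ones((1, p_t.shape[1]))])  # homogeneous 3D points
    p_s = (T_t2s @ p_t_h)[:3]                             # points in the source frame
    return project(K, p_s)
```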


2.2.2 Epipolar geometry

Epipolar geometry refers to the geometric relations between two images of different viewpoints, expressed purely as a function of the relative pose between them. The term epipole refers to the image of one viewpoint's optical centre on the image plane of the other. The epipolar geometry between the viewpoints allows for the computation of a closed-form constraint that determines, for an arbitrary pixel on one image, a line on the other image that contains its projection. This line, known as the epipolar line or epiline, contains the candidate projections for all possible depth values of the original pixel, i.e. the term $D_t(\cdot)$ in (2.7).

All of the epipolar lines on an image pass through the epipole, i.e. the projection of the optical centre of the image whose pixels they correspond to. Figure 2.1 illustrates, for a point $p$ on image $I_1$, the corresponding epipolar line on image $I_2$, which contains the true corresponding point $q$.

[Figure 2.1 depicts two camera centres $C_1$ and $C_2$, related by $(R, t)$, observing a scene point $x$; $p$ and $q$ are its projections on the image planes $I_1$ and $I_2$, and $e$ and $e'$ are the epipoles.]

Figure 2.1: Epipolar geometry between two images of a scene, taken by cameras $C_1$ and $C_2$ from different viewpoints.

To describe the epipolar constraint between the two images, we assume two hypothetical identical cameras with the same intrinsic parameters, taking images of a scene from different viewpoints. The translation vector $t$ and the rotation matrix $R$ describe the pose of the first camera (taking image $I_1$) relative to the second camera (taking image $I_2$). From this relative pose, the essential matrix can be computed as

$$E = [t]_\times R = \begin{bmatrix} 0 & -t_z & t_y \\ t_z & 0 & -t_x \\ -t_y & t_x & 0 \end{bmatrix} R, \tag{2.8}$$

where $[\cdot]_\times$ denotes the skew-symmetric matrix composition that is equivalent to a cross product. The essential matrix allows for the under-determined projection of points to lines in the Cartesian camera frame. In order to perform such projections directly on the image pixel coordinates, the fundamental matrix of the two cameras is computed as

$$F = K^{-T} E K^{-1}. \tag{2.9}$$

The fundamental matrix establishes the condition

$$q^T F p = 0 \tag{2.10}$$

for the homogeneous image coordinates $p$ and $q$, respectively on the first and second images, to be projections of the same physical point in 3D space. This is known as the epipolar constraint.

We can obtain the closed-form equation $a_p u_q + b_p v_q + c_p = 0$ of the epipolar line on the second image corresponding to a point $p$ on the first image, through the epipolar constraint,

$$\begin{bmatrix} a_p & b_p & c_p \end{bmatrix}^T = F p. \tag{2.11}$$
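
A minimal sketch of the computations in (2.8)-(2.11) is given below: it builds the essential and fundamental matrices from a relative pose and returns the epipolar line coefficients for a homogeneous pixel coordinate; the function names are illustrative.

```python
import numpy as np

def skew(t):
    """Skew-symmetric matrix [t]_x, so that [t]_x v equals the cross product t x v."""
    return np.array([[0.0, -t[2], t[1]],
                     [t[2], 0.0, -t[0]],
                     [-t[1], t[0], 0.0]])

def fundamental_matrix(K, R, t):
    """E = [t]_x R as in (2.8) and F = K^{-T} E K^{-1} as in (2.9)."""
    E = skew(t) @ R
    K_inv = np.linalg.inv(K)
    return K_inv.T @ E @ K_inv

def epipolar_line(F, p):
    """Coefficients (a_p, b_p, c_p) of the line a_p*u + b_p*v + c_p = 0, cf. (2.11).
    p is a homogeneous pixel coordinate on the first image."""
    return F @ p
```

The epipolar constraint (2.10) can then be checked for a candidate correspondence $q$ by testing whether `q @ F @ p` is close to zero.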


3 Related work

In this chapter we look at the recent and prominent works in egomotion estimation. Section 3.1 gives an overview of the traditional methods employed for solving the visual odometry problem. In Section 3.2 we look at the attempts made at incorporating deep learning as a module in the traditional visual odometry framework. In Section 3.3 we study the methodology of recent works focusing on the estimation of camera egomotion using end-to-end deep networks. The depth of the analysis offered for each concept or work is proportional to its importance and relevance to our propositions.

3.1 Visual odometry: traditional methods

Traditional visual odometry methods rely on classical computer vision, multi-view geometry and optimisation techniques. The theory of operation of the classical solutions may be grouped mainly based on two criteria.

The first dichotomy of visual odometry methods divides them into direct and indirect methods. Direct methods operate by inferring the camera's motion directly from the photometric information in the image pixels. These methods include solutions based on estimations of optical flow, which is the measure of the instantaneous motion of individual pixels on the image pixel grid [12, 13, 14]. The indirect methods, also known as feature-based methods, operate based on the extraction of salient features of images and their projection into 3D space. An estimate of the motion of the camera is then inferred by solving an optimisation problem to best explain the projection of the extracted features [15, 16, 17]. In other words, the distinction between direct and indirect methods is that direct methods minimise a photometric error between warped images, whereas indirect methods minimise a geometric error of projected features.

From a different perspective, visual odometry methods can be grouped according to the fraction of the image pixels used in the computations. Methods that use all pixels are referred to as dense methods, and methods that carry out the computations on only a subset of pixels are referred to as sparse methods. The middle ground between the two extremes of this spectrum is known as semi-dense, which refers to using dense image pixels only in regions of the image where useful information lies [18].

Regardless of their principle of operation, all visual odometry methods using monocular cameras suffer from an inherent shortcoming: they can estimate the translation of the camera in 3D space only up to an unknown scale. This shortcoming can be remedied by incorporating additional information cues for the estimation of the scale, such as prior knowledge of the camera height, vehicle speed or object sizes, or the readings of other sensors such as an IMU or GPS [19].

3.2 Visual odometry: hybrid methods

Several recent works have investigated the use of deep neural networks for solving the sub-tasks of visual odometry. One of the tasks in feature-based visual odometry is the detection and matching of features or interest points. While hand-crafted feature detectors and descriptors such as SIFT [10] and ORB [11] are traditionally used for this purpose and have proven powerful in a wide variety of settings, they are argued to be limited in their performance as generalised solutions for all operating conditions. Therefore, there exists a fairly considerable motivation for the invention of novel robust feature detection and description methods, for instance with the help of deep networks.

In recent years, there have been a large number of studies focusing on learning feature descriptors as unique identifiers used to discriminate local patches of images. While many of these works, such as [20, 21, 22], still rely on traditional interest point detectors for the detection of the features, several others also carry out the detection of interest points via learning-based methods. The latter approach tackles an especially difficult problem, since interest point detection and annotation is not an intuitive task for human supervision, leading to a scarcity of ground truth data.

One strategy is to learn a ranking of the points in the image based on how invariant they are to a range of image transformations, and to select a respective top quantile among them [24]. A more elaborate strategy is to explicitly estimate the orientation of detected interest points through a dedicated module, only to nullify it using a Spatial Transformer Network [25], in order to obtain orientation-invariant feature descriptors [26]. In a more recent work [27], this process of detection and the estimation of feature orientation, as well as scale, is achieved by a single module.

A prominent work in this field, known as SuperPoint [28], proposes a solution where the feature detector and descriptor tasks share most of the computations, for the benefit of faster performance. To work around the absence of ground truth data, SuperPoint uses online-generated synthetic images of geometrical shapes whose interest points, such as corners and edges, can be computed trivially. After pre-training on the synthetic interest points, the model is generalised to real images via the generation of pseudo ground truth interest points for the real images, through a technique that applies a set of random homographies and aggregates the interest points detected by the base model on the set of transformed real images. Inspired by this solution, another important work, called UnsuperPoint [29], proposes a similar architecture and setting, except that instead of training on pseudo ground truth interest points, the network uses regression for the detection of suitable points in a self-supervised manner. A number of other works such as [30] embed the task of interest point learning inside a host visual odometry pipeline, and take advantage of the multi-view geometry between image sequences to learn higher quality interest points.

3.3 Learning-based egomotion estimation

Section 3.3.1 first reviews self-supervised monocular depth prediction, which lays the foundation for these methods. In Section 3.3.2 we then investigate how this concept is extended to self-supervised learning of camera egomotion by the earliest works in this field. Section 3.3.3 expands this discussion to how the addition of optical flow estimations can help with the task. Section 3.3.4 gives an overview of a number of other modules proposed in learning-based egomotion estimation to improve the quality of predictions.

3.3.1 Monocular depth prediction

Self-supervised learning of depth maps for monocular images sets the foundation for the later works on learning camera egomotion. In one of the earliest works in this field, Garg et al. [31] propose a training strategy for a convolutional neural network to predict the depth map of single-view images. The proposed training strategy assumes a stereo camera dataset, but does not require any ground truth labels. The training of the network is based on using the predicted depth map of one of the images in the stereo pair, in conjunction with the known baseline (i.e. relative pose) between the cameras, to synthesise the second image in the stereo pair. The pixel intensity difference between the original image and the synthesised image provides a photometric error that is backpropagated for the training of the depth estimation network. This image warping operation is illustrated in Figure 3.1.

Godard et al. [32] build on top of this concept, introducing a number of changes that significantly improve the system performance. One of these changes is that the pixel disparity map, i.e. inverse depth, is predicted instead of depth, for the benefit of easier numerical handling of points that are considerably far from the camera. Moreover, a left-right consistency scheme is introduced, which synthesises the left and right images in the pair from each other and ensures consistency between their syntheses. This results in more accurate disparity predictions at inference time.

3.3.2 The earliest egomotion estimation works

SfMLearner by Zhou et al. [33]

In the first work focusing on self-supervised learning of camera egomotion, Zhou et al. [33] extend the self-supervision concept employed in learning monocular depth in a constrained stereo setup to the generalised multi-view case. This means that instead of relying on pairs of stereo camera images with a fixed baseline translation, the training is carried out on a feed of monocular image sequences while the camera follows an unknown 3D motion. While this is a harder problem to solve than the stereo setup, it can be approached via the same concept of view synthesis between the two camera viewpoints. In terms of the view synthesis expression (2.7), in addition to the depth map $D(\cdot)$, the relative pose $T_{t \to s}$ that is known in a stereo setup is also unknown in this generalised multi-view case.

This work proposes using two networks to learn the two unknowns required for view synthesis. One of these networks predicts the depth map $D(\cdot)$ of a single input image, in a similar spirit to the earlier monocular depth approaches. The other network predicts, for the input pair(s) of a target image and one (or more) source image(s), the relative pose of the camera $T_{t \to s}$.

With predictions of the depth map $D_t(\cdot)$ of the target image $I_t$, and the relative pose $T_{t \to s}$ between the target and a source image $I_s$, the pixel intensity values of the source image can be used to synthesise the target image, as shown in the view synthesis expression by pixel projection in (2.7). This technique results in a reconstruction of the target image, which in principle only matches the true target image $I_t$ if the predicted depth map $D_t(\cdot)$ and relative pose $T_{t \to s}$ are of high fidelity. This reconstruction of the target image can also be thought of as the source image warped to the target coordinate frame, which is denoted by $\hat{I}_s$. Based on this, a view synthesis training loss is formulated which minimises the intensity difference between $I_t$ and $\hat{I}_s$, over all source images $s$ and pixels $p$,

$$\mathcal{L}_{vs} = \sum_s \sum_p \left| I_t(p) - \hat{I}_s(p) \right|. \tag{3.1}$$

This interaction between the predictions of the depth and pose networks for image reconstruction is shown in Figure 3.2, for a target image $I_t$ at time $t$ and two source images $I_{t-1}$ and $I_{t+1}$, picked symmetrically before and after the target image in time. The architecture of the networks used is illustrated in Figure 3.3. For the prediction of single-view depth, the DispNet architecture [34] is adopted with multi-scale side predictions.

Figure 3.2: Principle of operation of the method proposed by Zhou et al. [33].

While the photometric view synthesis error in principle allows for self-supervised training of the depth and camera pose networks, it is not equipped with any mechanism to accommodate outliers. A number of outlier pixels are naturally expected to be present in the view synthesis process, since in a setup involving a moving camera there are pixels in one image which do not correspond to any pixel in the other image, either due to non-overlapping fields of view or the occlusion of the scene. It should also be noted that this formulation assumes a static scene that does not evolve during the movement of the camera. Such an assumption is rather strong in an urban environment scenario, where cars or pedestrians may be present that follow their own motion profiles.

To accommodate such outliers, an explainability mask $\hat{E}_s$ is additionally predicted for each source image, which weights the view synthesis loss as

$$\mathcal{L}_{vs} = \sum_s \sum_p \hat{E}_s(p) \left| I_t(p) - \hat{I}_s(p) \right|. \tag{3.2}$$

In this formulation, if a pixel $p$ in a target-source pair is predicted to be unexplainable, the predicted explainability coefficient $\hat{E}_s(p)$ tends to zero, which nullifies the effect of the photometric error of that pixel on the total view synthesis loss.
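
A minimal PyTorch sketch of the loss terms (3.1) and (3.2) is given below, computing the per-pixel L1 photometric error between a target image and a warped source image, optionally weighted by an explainability mask; the function name and tensor layout are assumptions made for illustration.

```python
import torch

def view_synthesis_loss(I_t, I_s_warped, explainability=None):
    """L1 photometric error as in (3.1); if an explainability mask E_s is given,
    it weights the per-pixel error as in (3.2)."""
    error = (I_t - I_s_warped).abs()
    if explainability is not None:
        error = explainability * error
    return error.sum()
```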

The training of the network results from the minimisation of this loss term. However, a trivial but undesirable minimiser of the loss is a zero explainability term $\hat{E}_s(\cdot) = 0$, which renders every pixel projection unexplainable and hence yields a null training loss without the network truly learning the correct projection parameters. To avoid this, a regularisation term $\mathcal{L}_{reg}(\hat{E}_s)$ is added to the total training loss, which encourages non-zero explainability predictions. This is done by minimising the cross-entropy of the output of $\hat{E}_s(\cdot)$ with a constant label of 1 for every pixel.

The paper also proposes using a depth smoothness loss $\mathcal{L}_{smooth}$, which minimises the $\ell_1$ norm of the second-order gradients of the predicted depth map. This is to facilitate learning of the true depth and pose estimates from the intensity-derived gradients. The final loss is a linear combination of the aforementioned components, summed over the different image scales $l$ of the predictions,

$$\mathcal{L}_{final} = \sum_l \left( \mathcal{L}^l_{vs} + \lambda_s \mathcal{L}^l_{smooth} + \lambda_e \sum_s \mathcal{L}_{reg}(\hat{E}^l_s) \right). \tag{3.3}$$

SfM-Net by Vijayanarasimhan et al. [35]

Concurrently with Zhou et al. [33], Vijayanarasimhan et al. [35] independently devised a similar solution, which takes a different approach to formulating the photometric error. This work proposes projecting the source image pixels to a 3D point cloud using the predicted depth map, transforming the obtained point cloud according to the predicted camera motion, and thereby computing the scene flow between the source and target images. This scene flow is used to compute the photometric error between the source and target pixels, which is used for the training of the network.

Figure 3.3: Network architectures used by Zhou et al. [33] for learning of single-view depth and camera egomotion.

Instead of predicting an explainability mask that discounts the pixels that are deemed unexplainable in a presumed static scene, and thereby implicitly neglecting the observed dynamic objects, Vijayanarasimhan et al. [35] predict multiple object masks, explicitly identifying the pixels that reflect dynamic objects with different motion profiles. In addition to the relative pose of the camera, the network also predicts a separate relative pose for each of the predicted object motion masks. Upon conversion to the 3D point cloud, the pixels are first transformed according to the predicted motion of the mask they belong to, followed by a transformation according to the predicted motion of the camera. A forward-backward consistency scheme is used for the prediction of the depth maps, such that the predicted scene flow and its inverse are applied to the depth maps of both the source and target images, and the consistency of the reconstructions is enforced. The principle of operation of the method and the network architectures used in this work are shown in Figure 3.4. In this setup, the same architecture, but with independent weight parameters, is used for both the depth and pose estimation tasks, meaning that there are skip connections between the encoder and decoder of the motion estimation network as well, in contrast with the pose estimation network of Zhou et al. [33].

Figure 3.4: Principle of operation of the method proposed by Vijayanarasimhan et al. [35].

3.3.3 Optical flow prediction for egomotion estimation

A number of studies propose some form of direct prediction of optical flow as part of the learning process of camera egomotion with the aim of better handling of dynamic objects.

GeoNet by Yin and Shi [36]

This work follows the solution proposed by Zhou et al. [33] in using depth and camera pose estimation networks for the reconstruction of the rigid (i.e. static) structures of the scene. However, it proposes a novel model that complements the optical flow computed from the camera motion with residual flow vectors that explain the motion of non-rigid (i.e. dynamic) structures within the scene. The aggregation of the predicted non-rigid residual flow and the rigid flow due to the camera motion itself allows for more accurate warping between the source and target images.

The full optical flow predictions in the forward and backward directions between an image pair are expected to be consistent; there may, however, be pixels whose forward and backward flows contradict each other, in which case they are treated as outliers.

This principle of operation is illustrated in Figure 3.5. DepthNet and ResFlowNet adopt the network architecture of [32], which is itself inspired by [34], similar to what is used for the DepthNet of Zhou et al. [33]. PoseNet also follows the same structure as proposed by [33].

Figure 3.5: Principle of operation of the method proposed by Yin and Shi [36].

In the computation of the photometric error between the target and warped source images, Yin and Shi [36] propose using the structural similarity between the images, in addition to the direct difference between their pixel intensity values, for more resilience towards outlier pixels. The Structural Similarity (SSIM) index is a quantitative measure which compares two signals by deriving statistical properties of their sample sets [37]. In the case of comparing two images, the image pixel intensities are considered the sample sets of the signals. The similarity of two images is measured by separate statistical comparisons of their luminance, contrast and structure. This similarity measure, which was also used in the earlier work of Godard et al. [32], is combined with the direct photometric error to form the total rigid warp error as

$$\mathcal{L}_{rw} = \alpha\, \frac{1 - \text{SSIM}(I_t, \tilde{I}^{rig}_s)}{2} + (1 - \alpha) \left| I_t - \tilde{I}^{rig}_s \right| \tag{3.4}$$

with the combination weight $\alpha = 0.85$.
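
A minimal PyTorch sketch of the combined error (3.4) is shown below, assuming a user-supplied ssim_map function that returns a per-pixel SSIM map in [0, 1]; that helper is not implemented here and its name is illustrative.

```python
import torch

def rigid_warp_loss(I_t, I_s_rig, ssim_map, alpha=0.85):
    """Combined SSIM and L1 photometric error as in (3.4)."""
    ssim_term = (1.0 - ssim_map(I_t, I_s_rig)) / 2.0   # SSIM dissimilarity in [0, 1]
    l1_term = (I_t - I_s_rig).abs()                    # direct photometric error
    return (alpha * ssim_term + (1.0 - alpha) * l1_term).mean()
```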

This work also achieves edge-aware smoothness of the depth and flow maps by minimising the $\ell_1$ norm of the first-order gradient of the depth/flow map, weighted towards low-gradient areas of the image in terms of pixel intensity. The depth smoothness loss, for instance, is formulated as

$$\mathcal{L}_{ds} = \sum_p \left| \nabla D(p) \right| \cdot e^{-\left| \nabla I(p) \right|}, \tag{3.5}$$

where $\nabla$ denotes the first-order spatial gradient.

Competitive Collaboration by Ranjan et al. [38]

This work takes a different approach in segmentation of the image pixels based on their dynamicity. The work proposes a connection of four networks, namely a depth map prediction network, a camera motion estimation network, a full optical flow estimation network, and a motion segmentation network.

Similar to the previous works, the training of the camera motion and depth estimation networks is driven by a photometric loss based on view synthesis, and a segmentation mask is used to only consider the pixels belonging to the static scene. However, a different approach is taken in incorporation of the motion segmentation in training. The combination of the camera motion and depth networks leads to the prediction of the motion of the camera in a static scene, which is used to compute the image optical flow map that is purely due to the camera motion itself. Concurrent to this process, a separate optical flow network is also trained, which given the input target and source images, predicts the full image optical flow including the pixels reflecting dynamic objects. This offers two optical flow predictions, which differ in inclusion of the reflected motion of other dynamic objects. Having two optical flows allows for computation of two measures of per-pixel photometric error based on view synthesis.

To train the motion segmentation mask, in addition to the cross-entropy regularisation of the motion segmentation mask similar to Zhou et al. [33], the two optical flow maps and their resultant photometric errors are also exploited. A training loss term is defined that minimises the cross-entropy between the motion segmentation mask and an indicator function that is 1 for a pixel if the magnitude of its full flow vector is sufficiently similar to the static scene flow, or if the photometric error computed with the static scene assumption is lower than the photometric error computed through the full flow prediction. The union of these two conditions is a good indicator of a pixel being from the static environment.


Figure 3.6: Principle of operation of the method proposed by Ranjan et al. [38].

With such an arrangement of the modules, the depth network, the camera motion network, and the optical flow network depend on the mask produced by the motion segmentation network for their training. On the other hand, the training of the motion segmentation network depends on the motion, depth and flow estimations themselves. This relation is what Ranjan et al. [38] refer to as competitive collaboration. In order to resolve this recursive dependency in the training of the networks, an iterative two-phase training strategy is devised. In the first phase, the motion segmentation mask is used to arbitrate the dynamicity of pixels, which is then used in the evaluation and learning of the pair of depth and motion networks, as well as the optical flow network. In the second phase, the pair of depth and motion estimation networks and the optical flow prediction network collaborate to train the motion segmentation network.

GLNet by Chen et al. [42]

The approach of explicitly predicting optical flow and using it as a technique for resilience against dynamic objects is also followed in the work of Chen et al. [42]. This work proposes a three-network setup, predicting depth, camera motion, as well as optical flow between the target and source images.

Among the loss terms of this work is a 3D point cloud alignment loss, which penalises the deviation between the computed 3D point clouds of the target image and the source image, using the estimated depth and relative pose.

Another novelty of the loss function formulation in this work is the inclusion of the epipolar constraint between the images. This is formulated such that the image pixel coordinates before and after the application of the predicted optical flow are tested against the epipolar constraint formed using the predicted camera motion. By minimising the resultant error of this constraint check, the optical flow prediction network and the camera motion network are encouraged to learn consistent predictions.

This work also proposes an optimisation scheme which can be used on top of the predictions of the network. The optimisation may be carried out after the training, and attempts to fine-tune the predictions of the network to the inference samples. The optimisation takes the trained predictions as a prior solution and either optimises the network parameters based on the loss function in the proximity of the prior, or treats the prediction outputs as free parameters and directly optimises them so that they better minimise the loss function.

The principle of operation of this work is presented in Figure 3.7. DepthNet conforms to a fully convolutional encoder-decoder architecture, with a ResNet18 [39] encoder and a decoder based on DispNet [34] with deconvolutional layers, where skip connections and multi-scale predictions are used. CameraNet follows the model proposed by Zhou et al. [33], and FlowNet uses the same ResNet-based architecture as in the work by Yin and Shi [36].


3.3.4 Other modules for performance enhancement

In this section we review a few other modules that, with various techniques, aim at improving the training process of the egomotion estimation network.

3D ICP loss by Mahjourian et al. [43]

This work proposes a novel loss term based on the error between the point clouds of the target and source images. To compute this loss term, the predicted depth maps for both the target and source images are used to obtain their corresponding point clouds $Q_t$ and $Q_s$. The predicted relative pose between the cameras is then used to transform one of the point clouds into the other's coordinate frame, $\hat{Q}_s = T_{t \to s} Q_t$. The higher the quality of the depth and relative pose predictions, the better the alignment of the point clouds $Q_s$ and $\hat{Q}_s$. This alignment error is what constitutes the novel loss term.

To compute the alignment loss, Iterative Closest Point (ICP) is used as a rigid registration method [44], which minimises the point-to-point distances between the point clouds. Since the association computation done in ICP is non-differentiable, its gradients are approximated to allow backpropagation of errors for learning the egomotion and depth estimates. ICP computes the transformation $T'$ that, when applied to $\hat{Q}_s$, best aligns it with $Q_s$. Besides the transformation itself, the algorithm also returns the residual error $r$, which is the remaining error between the point clouds after the alignment.

The magnitude of the computed transformation $T'$ as well as the residual error $r$ are used to form the 3D loss as

$$\mathcal{L}_{3D} = \| T'_t - I \|_1 + \| r_t \|_1. \tag{3.6}$$

Although the sources of error cannot be completely decoupled, the magnitude of $T'$ mostly reflects the error in the camera relative pose predictions, whereas the magnitude of $r$ reflects the error in the depth map predictions.

Scale consistency by Bian et al. [45]

This study proposes a geometric consistency loss, which helps achieve scale-consistent depth and camera motion predictions. With a similar concept to the point cloud alignment loss in the works of Mahjourian et al. [43] and Chen et al. [42], this work proposes evaluating the quality of the predicted 3D scene structure via the consistency of the depth maps of the two views when warped onto each other.

This geometric consistency loss is formulated for two images $I_a$ and $I_b$ as

$$\mathcal{L}_{GC} = \frac{1}{|V|} \sum_{p \in V} \frac{\left| D^a_b(p) - D'_b(p) \right|}{D^a_b(p) + D'_b(p)}, \tag{3.7}$$

where $D^a_b(\cdot)$ denotes the depth map that results from warping the original depth map $D_a$, predicted for $I_a$, using the predicted relative pose, and $D'_b(\cdot)$ denotes the depth map predicted for $I_b$, interpolated to match the pixel grid of its counterpart. This formulation normalises the alignment error to the range between 0 (for perfect alignment) and 1 (for anti-alignment), which helps with the numerical stability of the training.
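
A minimal PyTorch sketch of the normalised alignment error (3.7) is given below, where D_warped and D_interp stand for $D^a_b$ and $D'_b$, and valid is a boolean mask of the pixels in $V$; the names are illustrative.

```python
import torch

def geometric_consistency_loss(D_warped, D_interp, valid):
    """Normalised depth inconsistency as in (3.7), averaged over valid pixels V."""
    diff = (D_warped - D_interp).abs() / (D_warped + D_interp)
    return diff[valid].mean()
```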

This geometric consistency loss is also used for the self-discovery of a mask that segments non-static pixels. In the image pairs, if a pixel reflects a dynamic object, its corresponding 3D point does not align well between the warped depth maps of the images, which results in a large geometric consistency error. On the other hand, if a pixel reflects a static 3D point, then given perfect depth and relative pose estimations, the warped depth maps of the two images match perfectly, resulting in a small geometric consistency error. Therefore, the segmentation mask can simply be defined as the binary complement of the geometric consistency error.

Compositional re-estimation by Nabavi et al. [46]

This work proposes estimating the motion of the camera in a recursive process, using the pose estimation network to find smaller transformations that collectively warp the source image to the target image. In other words, instead of estimating the relative pose between two frames in a single attempt of inference, the relative pose is seen as being composed of a sequence of camera poses in smaller motion steps. Compositional re-estimation refers to the iterative estimation of the relative pose of the camera for each of these smaller motion steps. The putative advantage of this is that several smaller transformations are predicted rather than a single large one, so the assumptions made by the view synthesis formulation are more easily satisfied. In this formulation, instead of computing $T_{t \to s}$ in a single attempt, $T^r_{t \to s}$ is found after $r$ iterations, via the network computing $\Delta T^i_{t \to s} \in SE(3)$ in the increment step

$$T^i_{t \to s} = \Delta T^i_{t \to s} \oplus T^{i-1}_{t \to s}, \tag{3.8}$$

where $\oplus$ denotes a matrix multiplication operator. The incremental transformations predicted in this scheme tend to be of much smaller magnitudes than a complete relative pose between image pairs.

Figure 3.8: Principle of operation of the method proposed by Nabavi et al. [46]. Green blocks denote the pose estimation network used in a recursive process, and yellow blocks demarcate the depth estimation network.
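
A minimal sketch of the recursion in (3.8) is given below, assuming a pose_net that predicts a small SE(3) increment as a 4 × 4 matrix and a warp_fn that re-warps the source image with the accumulated pose; both interfaces are assumptions made for illustration.

```python
import torch

def compositional_pose(pose_net, I_t, I_s, warp_fn, num_steps=3):
    """Accumulate small pose increments Delta T, cf. (3.8)."""
    T = torch.eye(4)
    I_warped = I_s
    for _ in range(num_steps):
        delta_T = pose_net(I_t, I_warped)   # small SE(3) increment (4x4 matrix)
        T = delta_T @ T                     # compose with the pose found so far
        I_warped = warp_fn(I_s, T)          # re-warp the source with the updated pose
    return T
```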

D3VO by Yang et al. [47]

In one of the most recent works in this field, known as D3VO, Yang et al. [47] extend the concept of end-to-end deep learning of egomotion estimation and incorporate it into a direct visual odometry method, using the network predictions in both the front-end tracking and the back-end optimisation of the estimations. This work builds directly on top of the earlier works in this field and follows the same approach of learning the depth maps of images and the relative pose between them in a joint training.

One of the novelties introduced in this work is the incorporation of a brightness transformation technique, which aims to tackle the image intensity changes due to variations of the camera viewpoint. An affine transformation is proposed to adjust the image intensity of the target image to that of the warped source image. Such an affine transformation, which can be written as

$$I^{a,b} = aI + b, \tag{3.9}$$

requires two parameters $a$ and $b$, which can also be learned in a self-supervised manner as two additional outputs of the same network that predicts the camera pose. The predictions of the affine transform parameters are regularised to limit drastic brightness transformations.

Another challenge addressed in this work is the violation of the photometric consistency assumption by certain surfaces [48] and moving objects, which cannot be trivially formulated with analytic models. However, Yang et al. [47] argue that such complexities may be seen and treated as observation noise, for which the concept of heteroscedastic aleatoric uncertainty in deep learning [49] can be leveraged. Following this concept, the authors introduce a mechanism for learning an uncertainty mask based on the input image, where the network learns to predict a higher variance (i.e. uncertainty) for pixels where the brightness consistency assumption is more likely to be violated. Unlike the arrangement of Zhou et al. [33], where the explainability mask is predicted based on both the target and source images, this uncertainty map is predicted with solely the target image as input, as an additional channel of the depth prediction network. This uncertainty measure, in addition to providing a method for discounting unexplainable pixels in the computation of the view synthesis error during training, is also used in the back-end nonlinear optimisation of the pose estimates. An overview of the principle of operation of this work is presented in Figure 3.9.


4 Method

In this chapter, we discuss the applied methods and expand on the motivations that drove our decision strategies. We first paint the big picture of our work by laying out the overall design and briefly discussing the motivation behind the novelties of our approach in relation to the state of the art methods in Section 4.1. We then present an overview of the architecture design of a candidate deep network that can be used for end-to-end learning of camera egomotion in Section 4.2. In Section 4.3, we discuss the design and craft of our proposed self-supervised objective function for this purpose. We then discuss some of the challenges in the self-supervised learning process by taking the perspective of energy functions in Section 4.4, and propose a method of auxiliary regularisation of predictions to facilitate the learning process.

4.1 Concept

We first recap the training strategy of state of the art solutions and point out their limitations in Section 4.1.1. We then lay out our proposed approach and contrast it with the available solutions in Section 4.1.2.

4.1.1 State of the art

As discussed in Section 3.3, the recent works in self-supervised learning of camera egomotion follow the general recipe of view synthesis between input image pairs taken at different viewpoints, and minimisation of the photometric error of the synthesised views. As the view synthesis process is a function of the relative pose of the camera between the viewpoints, minimisation of the view synthesis error as a training loss corresponds to the pose estimation network learning to predict the camera egomotion from its image sequences. However, apart from the relative pose of the camera, the view synthesis process also requires the depth information of the camera images.

As monocular images do not contain the depth information of the scene, the state of the art works rely on a second network to explicitly estimate the depth map required for view synthesis. This results in a joint training of a pose estimation network and a depth estimation network through a shared training loss, as illustrated in Figure 4.1. This recipe has proven to produce high levels of accuracy in the inference of camera egomotion, even compared to traditional methods [38, 42, 47].

Each of the two networks can be used independently to make inferences, respectively, about the camera egomotion and the single-view depth of images. However, the joint training requirement implies that even if the end-to-end learning of camera pose is the sole objective in an application, the parameters of the depth estimation network still need to be learned. In other words, in such a scenario, a large set of network parameters needs to be trained that will not be used in making inferences. Moreover, the prediction accuracy of the pose estimation network is dependent on the estimations made by the depth network, which can be a potential performance bottleneck.

Figure 4.1: Self-supervised joint learning of depth and camera egomotion.

4.1.2 Our approach

Our approach removes the reliance of the training process on explicit single-view depth estimations. This results in a training pipeline that only involves the pose estimation network, as illustrated in Figure 4.2, meaning that the training only involves those sets of parameters that will be used in making inferences.

Figure 4.2: Our approach for self-supervised learning of camera egomotion.

4.2 Pose estimation network

The candidate network architecture that we use in our experiments is heavily inspired by the pose prediction network proposed by Zhou et al. [33]. We, however, introduce a few modifications with the aim of modularity and parameterisation of the architecture design, so that the effect of design choices can be measured and analysed in isolation from each other.

4.2.1 Architecture design

Our proposed architecture consists of convolutional blocks, each comprising a number of sequential convolutional layers. The first layer of each block halves the height and width of its input tensor via a convolution stride of two and doubles its channel size. The remaining layers of each block preserve the dimensions of their input tensor. We treat the number of layers in each block and the number of starting tensor channels at the input block of the network as hyperparameters in our experiments.

The backbone is followed by two separate prediction tails, one for rotation and one for translation, each composed of convolutional layers and fully-connected layers, to allow the network to better learn the nuances of translations and rotations in their own spaces. This is different from the original design by Zhou et al. [33], where average pooling is used to convert the final tensors to pose component values. An overview of our proposed candidate network architecture can be seen in Figure 4.3.

Another modification made to the baseline architecture is the addition of batch normalisation, and the change of ReLU activations to LeakyReLU for better preservation of gradients in training [50]. This means that every convolutional layer consists of the convolution operation, batch normalisation and LeakyReLU activation, in that order. The effectiveness of the addition of batch normalisation is examined in an ablation study (see Section 6.2.1).

[Figure 4.3 depicts the backbone (Input Block and Blocks 1-5) followed by two tails, each with two convolutional layers (Rotation/Translation Conv 1-2) and two fully-connected layers (Rotation/Translation Linear 1-2).]

Figure 4.3: Candidate network architecture with an arbitrary hyperparameter setting of 2 convolutional layers in each backbone block, 4 starting channels per image, and taking in 5 input images.

4.2.2 Network input and output

Similar to the setup proposed by Zhou et al. [33], the input to our pose estimation network consists of a single target image, and an arbitrary (but fixed) even number of source images, sampled symmetrically in time, before and after the target image. Each image is encoded in 3-channel RGB colour space, and the input tensor to the network is composed of the channel-wise concatenation of the target and source images.
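As a minimal illustration of this input composition (a sketch assuming two source images and the 128×416 resolution shown in Figure 4.3; variable names and the ordering of frames within the channel stack are our own illustrative choices):

```python
import torch

# target frame at time t and two source frames at t-1 and t+1,
# each an RGB image tensor of shape (batch, 3, height, width)
target = torch.rand(1, 3, 128, 416)
source_prev = torch.rand(1, 3, 128, 416)
source_next = torch.rand(1, 3, 128, 416)

# channel-wise concatenation of target and source images -> shape (1, 9, 128, 416)
network_input = torch.cat([source_prev, target, source_next], dim=1)
```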


The output of the network is the relative pose between the viewpoints of the target and source images, where the translation can only be estimated up to an unknown scale of the true translation (see Section 3.1).

We propose using axis-angle representation of the rotation group instead of Euler angles, as unlike the Euler angle representation it does not suffer from the singularity known as gimbal lock [51]. Although the occurrence of gimbal lock is unlikely due to the small magnitudes of rotation between the image frames, we propose using a representation that safely prevents it.

We also propose using spherical coordinates for the representation of the estimated translation. This is because, as a drawback of removing depth predictions from the estimation of camera pose, we lose the signal to learn any notion of the scale of translations. In other words, epipolar geometry only allows us to estimate 5 degrees of freedom of the camera pose. Spherical coordinates decouple the scale of translations, which cannot be predicted by our method, and let us predict only the translation vector orientation through the two angles of azimuth and elevation. See Appendix B for more details regarding the frame conventions and the different translation and rotation representations.
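To illustrate how the predicted pose components can be converted back to a rotation matrix and a unit translation direction, here is a small NumPy sketch; the spherical-coordinate axis convention used below is an assumption for illustration and does not necessarily match the convention of Appendix B.

```python
import numpy as np


def axis_angle_to_matrix(r: np.ndarray) -> np.ndarray:
    """Rodrigues' formula: axis-angle vector r (rotation angle = ||r||) -> 3x3 rotation matrix."""
    theta = np.linalg.norm(r)
    if theta < 1e-8:
        return np.eye(3)
    k = r / theta                              # unit rotation axis
    K = np.array([[0.0, -k[2], k[1]],
                  [k[2], 0.0, -k[0]],
                  [-k[1], k[0], 0.0]])         # skew-symmetric cross-product matrix
    return np.eye(3) + np.sin(theta) * K + (1.0 - np.cos(theta)) * (K @ K)


def spherical_to_unit_translation(azimuth: float, elevation: float) -> np.ndarray:
    """Unit translation direction from azimuth/elevation (radians); axis convention assumed."""
    return np.array([np.cos(elevation) * np.cos(azimuth),
                     np.cos(elevation) * np.sin(azimuth),
                     np.sin(elevation)])
```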

4.3 Self-supervised objective function

In this section we introduce our novel objective function that can be used for self-supervised learning of camera egomotion from input image sequences. We first lay out a basic formulation of the objective function in Section 4.3.1, which is purely based on epipolar geometry and pixel intensity analysis. The basic formulation is agnostic to the size of the pixel sets used for computation of the photometric error. However, in Section 4.3.2 we argue why sparse computation can be considered an asset in our formulation, despite all learning-based egomotion estimation works relying on dense computations. In Section 4.3.3 we discuss how the inclusion of a more advanced analysis of the sparse pixel intensity values can benefit the error computation. We then propose two more advanced formulations of the objective function with sparse computations and present the nuances of the method in Section 4.3.4.

4.3.1 Basic formulation


Consider a target image and a source image of the same scene taken from two different viewpoints, and let P and Q respectively denote sets of pixels in them. Our objective function is built on the photometric error between every pixel p ∈ P and its correspondence within Q. For simplicity of exposition, we conveniently assume that the warped pixels of Q conform to the pixel grid of P. In the detailed formulation, however, we use bilinear interpolation to obtain the warped pixel values. As was shown in the view synthesis expression in (2.7), availability of the depth map and the relative pose between the viewpoints of the images allows for a closed-form mapping between the pixel sets P and Q of the two images. However, as argued in Section 4.1, we abstain from including depth information in our formulation, which means that we cannot use the view synthesis mapping.
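Since sampled points generally fall between pixel centres, their values are obtained by bilinear interpolation; a minimal sketch using PyTorch's grid_sample is given below (the function name is our own, and coordinates are normalised to [-1, 1] following grid_sample's convention).

```python
import torch
import torch.nn.functional as F


def sample_bilinear(image: torch.Tensor, points: torch.Tensor) -> torch.Tensor:
    """Bilinearly sample image values at sub-pixel locations.

    image:  tensor of shape (channels, height, width)
    points: float tensor of shape (num_points, 2) holding (x, y) pixel coordinates
    returns a tensor of shape (num_points, channels)
    """
    _, h, w = image.shape
    # normalise pixel coordinates to [-1, 1] as expected by grid_sample
    grid = points.clone()
    grid[:, 0] = 2.0 * grid[:, 0] / (w - 1) - 1.0
    grid[:, 1] = 2.0 * grid[:, 1] / (h - 1) - 1.0
    grid = grid.view(1, 1, -1, 2)                       # (N=1, H_out=1, W_out=P, 2)
    sampled = F.grid_sample(image.unsqueeze(0), grid,
                            mode='bilinear', align_corners=True)
    return sampled.squeeze(0).squeeze(1).transpose(0, 1)  # (num_points, channels)
```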

Geometric information for solving pixel correspondence

To work around the lack of depth information, we resort to the epipolar geometry between the images. As was shown in (2.10) and (2.11), having the relative pose between the viewpoints of the images is sufficient for finding the equation of the line of candidate projections of one pixel on the other image. This implies that epipolar geometry allows us to limit the search space of the correspondence for a point p ∈ P to a subset l_p ⊂ Q that is considerably smaller than the original set Q. An illustration of such a reduction of the search space by finding an epipolar line can be seen in Figure 2.1.
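As a concrete sketch of this construction, the function below computes the epipolar line coefficients in the source image for a pixel of the target image, using the standard fundamental-matrix relation for calibrated cameras; it assumes intrinsics K and a relative pose (R, t) mapping points from the target to the source camera frame, and the function names are our own.

```python
import numpy as np


def skew(t: np.ndarray) -> np.ndarray:
    """Skew-symmetric (cross-product) matrix of a 3-vector."""
    return np.array([[0.0, -t[2], t[1]],
                     [t[2], 0.0, -t[0]],
                     [-t[1], t[0], 0.0]])


def epipolar_line(p: np.ndarray, R: np.ndarray, t: np.ndarray, K: np.ndarray) -> np.ndarray:
    """Coefficients (a, b, c) of the epipolar line a*u + b*v + c = 0 in the
    source image, on which the correspondence of target pixel p must lie."""
    E = skew(t) @ R                                   # essential matrix from the relative pose
    F = np.linalg.inv(K).T @ E @ np.linalg.inv(K)     # fundamental matrix for pixel coordinates
    p_h = np.array([p[0], p[1], 1.0])                 # homogeneous pixel coordinates of p
    return F @ p_h
```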

Photometric information for solving pixel correspondence

While epipolar geometry reduces the size of the search space, the problem of finding the correct correspondence between the image pixels is still under-constrained. Since without further assumptions there are no geometric grounds to prefer one point on the epipolar line over another, we propose using the photometric information available in the image pixels to solve the correspondence problem. Our proposed strategy is to search along the corresponding epipolar line l_p ⊂ Q for every p ∈ P, and find the q ∈ l_p that has the smallest photometric error with p. If we denote the images containing P and Q respectively by I_P and I_Q, we can formulate this correspondence as

$$q^* = \operatorname*{argmin}_{q \in l_p} \left| I_P(p) - I_Q(q) \right|. \tag{4.1}$$

Smooth photometric error from pixel correspondence


Having found the correspondences, we can write the photometric error loss between the two images I_P and I_Q as

$$\mathcal{L}_{\text{photometric}} = \frac{1}{|P|} \sum_{p \in P} \left| I_P(p) - I_Q(q^*) \right|, \tag{4.2}$$

where |P| denotes the size of the set P. This loss is a function of the relative pose between the viewpoints of the images through the found correspondence q*. As the correspondence is a minimiser of the photometric error along the epipolar line, we could express the photometric error directly as

$$\mathcal{L}_{\text{photometric}} = \frac{1}{|P|} \sum_{p \in P} \min_{q \in l_p} \left| I_P(p) - I_Q(q) \right|. \tag{4.3}$$

In order for this loss function to suit the training of a neural network, it must be differentiable with respect to the relative pose that is used to find the epipolar line. However, the minimum operation in its original definition is not differentiable. For differentiability of the loss function, we propose using the softmin operation, which is a smooth differentiable approximation of the minimum function. Denoting the photometric error between the two pixels p and q by

$$d_{p,q} = \left| I_P(p) - I_Q(q) \right|, \tag{4.4}$$

we write the total smooth photometric error loss as

$$\mathcal{L}_{\text{photometric}} = \frac{1}{|P|} \sum_{p \in P} \operatorname*{softmin}_{q \in l_p} d_{p,q} = \frac{1}{|P|} \sum_{p \in P} \frac{\sum_{q \in l_p} d_{p,q}\, e^{-d_{p,q}}}{\sum_{q \in l_p} e^{-d_{p,q}}}. \tag{4.5}$$
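A minimal PyTorch sketch of the smooth loss in (4.5) is given below, assuming the per-pixel photometric errors of the candidate points sampled along each epipolar line have already been collected into a single tensor; the candidate sampling itself is omitted and the tensor shapes are illustrative.

```python
import torch


def softmin_photometric_loss(d: torch.Tensor) -> torch.Tensor:
    """Smooth photometric loss of (4.5).

    d: tensor of shape (num_points, num_candidates) with the photometric
       error d_{p,q} between each pixel p in P and every candidate q
       sampled along its epipolar line l_p.
    """
    weights = torch.softmax(-d, dim=1)      # e^{-d_{p,q}} / sum_q e^{-d_{p,q}}
    per_point = (weights * d).sum(dim=1)    # softmin over each epipolar line
    return per_point.mean()                 # average over the pixel set P
```

Because the weights form a softmax of the negated errors, gradients flow through all candidate points, with the most photometrically similar candidates dominating the sum.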

Photometric error colour space

We propose computing the photometric error between image pixels in HSV (hue, saturation, value) colour space. The motivation behind this proposition is that unlike the commonly used RGB colour space, HSV decouples the brightness information (luma) of the image pixels from their colour information (chroma). With such decoupling and imposing importance weights on the different components of the colour space when computing the photometric error, we can achieve resilience towards unmodelled brightness changes across different frames. This is to achieve the same effect as the brightness adaptation module proposed by Yang et al. [47], however without the introduction of any additional parameters to learn. Below is the per-pixel photometric error in HSV space,

$$d_{p,q} = \alpha \min\!\big( |H_P(p) - H_Q(q)|,\; 1 - |H_P(p) - H_Q(q)| \big) + \beta \left| S_P(p) - S_Q(q) \right| + \gamma \left| V_P(p) - V_Q(q) \right|, \tag{4.6}$$

where the min accounts for the periodicity of the hue component from 0 to 1, and α, β, γ ≥ 0 are the importance weights corresponding to H, S and V respectively, which may for instance be chosen as 4, 2 and 1 in the same order to give higher importance to the chroma of the image pixels.
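A short sketch of this weighted HSV error is given below, assuming all three components are normalised to [0, 1] and using the example weights mentioned above; the function name is our own.

```python
import torch


def hsv_photometric_error(hsv_p: torch.Tensor, hsv_q: torch.Tensor,
                          alpha: float = 4.0, beta: float = 2.0,
                          gamma: float = 1.0) -> torch.Tensor:
    """Weighted per-pixel error between HSV pixels given as (..., 3) tensors
    with all three components normalised to [0, 1]."""
    dh = torch.abs(hsv_p[..., 0] - hsv_q[..., 0])
    dh = torch.minimum(dh, 1.0 - dh)        # hue is periodic on [0, 1]
    ds = torch.abs(hsv_p[..., 1] - hsv_q[..., 1])
    dv = torch.abs(hsv_p[..., 2] - hsv_q[..., 2])
    return alpha * dh + beta * ds + gamma * dv
```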

Figure 4.4 illustrates the concept of photometric error minimum search on epipolar lines for correspondence detection between two images.

[Figure 4.4 panels: four points selected on the target image with their detected correspondences on the source image, and four points selected on the source image with their detected correspondences on the target image.]

Figure 4.4: Correspondence detection via minimum search on epipolar lines.

4.3.2 Sparse computations

The basic formulation of the objective function laid out in Section 4.3.1 is generic to any pixel sets P and Q of two images taken from two viewpoints. While in this formulation no explicit assumption was made about the composition of the sets, it is implicitly assumed that for a p ∈ P there is one correspondence q* ∈ l_p ⊂ Q which, among the other samples within l_p, uniquely yields the smallest photometric error with p. This implies that if for any reason, such as p coming from a textureless region, there are multiple candidates in l_p that are photometrically similar to p, a wrong correspondence could be chosen. Such a wrong correspondence undermines the argument that the photometric error is a good representative of the candidate relative camera pose. Relying on sparse pixel sets that exclude textureless regions helps prevent such wrong correspondences.


In order to achieve this, however, it is important that we choose a sparse sampling strategy that selects pixels reflecting the same physical points in images taken from different viewpoints.

Sparsification strategy

We propose composing the pixel sets P and Q from the regions of the images with salient photometric information. This can be interpreted as choosing high-gradient points in the images, such as corners and edges. For this purpose we use an interest point detector that is very similar to the interest point detection methods of Harris and Stephens [8] and Shi and Tomasi [9] (see Section 2.1). The only difference is that in the final step of selecting interest points based on a score function, we define the score function as

$$R = \lambda_1 + \lambda_2. \tag{4.7}$$

We follow a similar thresholding and non-max suppression method to the aforementioned works to obtain the sparse interest points. This score function gives us the advantage that when a pre-set number of interest points is to be extracted, which implies prioritisation, the mixed selection of corner and edge pixels is performed implicitly according to the magnitude of their eigenvalues. Compared to the formulation of Shi and Tomasi [9], our proposition gives a higher priority to strong edge pixels over weak corners. As can be seen in Figure 4.5, the resultant pixel sets of our interest point detection module are composed of corner and edge pixels at high-gradient regions of the image. As a result, the pixel sets of the target and source images tend to reflect the same physical points of the scene.
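A sketch of this interest point extraction on top of OpenCV's structure-tensor routine is shown below; the block size, Sobel aperture and the dilation-based non-maximum suppression are illustrative choices rather than the exact settings used in our experiments.

```python
import cv2
import numpy as np


def top_interest_points(gray: np.ndarray, num_points: int = 500,
                        block_size: int = 3, ksize: int = 3) -> np.ndarray:
    """Return (row, col) coordinates of the top-scoring pixels under
    R = lambda_1 + lambda_2, after simple non-max suppression."""
    # eigenvalues/eigenvectors of the structure tensor for every pixel:
    # channels are (lambda_1, lambda_2, x_1, y_1, x_2, y_2)
    eig = cv2.cornerEigenValsAndVecs(gray.astype(np.float32), block_size, ksize)
    score = eig[..., 0] + eig[..., 1]          # R = lambda_1 + lambda_2

    # non-max suppression: keep only local maxima of the score map
    local_max = cv2.dilate(score, np.ones((3, 3), np.float32))
    score = np.where(score >= local_max, score, 0.0)

    # pick the num_points highest-scoring pixels
    flat_idx = np.argsort(score, axis=None)[::-1][:num_points]
    return np.column_stack(np.unravel_index(flat_idx, score.shape))
```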

In our design process, we also investigated enforcing a more uniform spread of interest points across the image by dividing it into a cell grid and extracting interest points for each cell independently. However, not enforcing this already offered acceptable performance and requires no additional engineering or tuning effort. We therefore opt for not enforcing a uniform spread of interest points.

Negative implications



Figure 4.5: Extraction of top 500 interest points consisting of salient corner and edge pixels using our score function, on target and source images.

Outliers among the sparse correspondences could be due to any unmodelled phenomena, such as reflections off non-Lambertian surfaces, dynamic objects, or even the absence of a correspondence due to occlusions.

Another source of error resulting from the sparsification of computations is the violation of the assumptions made in the sparsification process. The proposed method of composing sparse pixel sets from high-gradient points in the images assumes that such high-gradient points in different images reflect the same physical points in 3D space. Although this tends to be a valid assumption for the majority of the selected points, there are outlier cases where a chosen pixel in one image has no correspondence in the sparse set of the other image.

4.3.3 Local structure dissimilarity


Beyond raw pixel intensities, we propose also using the local structure information of pixels to form a dissimilarity measure between them.

We define this local structure dissimilarity using the eigenvalues λ_1 and λ_2, and eigenvectors e_1 and e_2, of the precomputed structure tensor for every pixel. We propose computing a single structure orientation normal vector from these properties for every pixel, as this allows for a direct comparison between two pixels via the angle between their orientation vectors. It is worth noting that an eigenvector e = [x, y]^T and its numerically negated self −e = [−x, −y]^T are equivalent. In order to avoid ambiguity, we restrict the eigenvectors to have x ≥ 0 by negating those with x < 0, and we consider the form with y > 0 for eigenvectors with x = 0. The orientation normal vector of a pixel is computed as

$$\hat{v} = \frac{\lambda_1 e_1 + \lambda_2 e_2}{\lVert \lambda_1 e_1 + \lambda_2 e_2 \rVert} \quad \text{where} \quad e_i = \begin{bmatrix} x_i \\ y_i \end{bmatrix} \;\; \text{s.t.} \;\; x_i > 0 \vee (x_i = 0 \wedge y_i > 0). \tag{4.8}$$

Having two normal vectors v̂_p for p ∈ P and v̂_q for q ∈ Q, the cosine of the angle between them can be computed as

$$\hat{v}_p \cdot \hat{v}_q = \lVert \hat{v}_p \rVert\, \lVert \hat{v}_q \rVert \cos\theta = \cos\theta. \tag{4.9}$$

We form our dissimilarity measure as

$$d_{\text{structure}} = 1 - \hat{v}_p \cdot \hat{v}_q. \tag{4.10}$$
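For completeness, the following sketch evaluates (4.8)–(4.10) for a pair of pixels, given their precomputed structure tensor eigenvalues and eigenvectors (for instance taken from the same cv2.cornerEigenValsAndVecs output used above); the function names are our own.

```python
import numpy as np


def orientation_normal(l1: float, l2: float,
                       e1: np.ndarray, e2: np.ndarray) -> np.ndarray:
    """Structure orientation normal vector of (4.8) for one pixel."""
    def disambiguate(e):
        # e and -e are equivalent: enforce x > 0, or y > 0 when x == 0
        if e[0] < 0 or (e[0] == 0 and e[1] < 0):
            return -e
        return e
    v = l1 * disambiguate(e1) + l2 * disambiguate(e2)
    norm = np.linalg.norm(v)
    return v / norm if norm > 0 else v


def structure_dissimilarity(v_p: np.ndarray, v_q: np.ndarray) -> float:
    """d_structure = 1 - cos(angle between orientation vectors), as in (4.10)."""
    return 1.0 - float(np.dot(v_p, v_q))
```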
