
UPTEC F 17021

Degree project, 30 credits, June 2017

Infrared image-based modeling and rendering

Oskar Wretstam


Abstract

Infrared image-based modeling and rendering

Oskar Wretstam

Image-based modeling using visual images has undergone major development during the earlier part of the 21st century. In this thesis a system for automated uncalibrated scene reconstruction using infrared images is implemented and tested. An automated reconstruction system could serve to simplify thermal inspection or as a demonstration tool. Thermal images will in general have lower resolution, less contrast and less high-frequency content compared to visual images. These characteristics of infrared images further complicate feature extraction and matching, key steps in the reconstruction process. In order to remedy this complication, preprocessing methods are suggested and tested as well. Infrared modeling also imposes additional demands on the reconstruction, as it is important to maintain the thermal accuracy of the images in the resulting model.

Three main results are obtained from this thesis. Firstly, it is possible to obtain camera calibration and pose as well as a sparse point cloud reconstruction from an infrared image sequence using the suggested implementation. Secondly, the correlation of thermal measurements from the images used to reconstruct three-dimensional coordinates is presented and analyzed. Lastly, from the preprocessing evaluation it is concluded that the tested methods are not suitable: they increase the computational cost while the improvements in the model are not proportional.

Examiner: Tomas Nyberg

Subject reviewer: Robin Strand

Supervisor: Martin Solli


Populärvetenskaplig sammanfattning (Popular science summary)

Image-based modeling with visual images has undergone major development during the earlier part of the 21st century. Given a sequence of ordinary two-dimensional images of a scene from different perspectives, the goal is to reconstruct a three-dimensional model. In this thesis a system for automated uncalibrated scene reconstruction from infrared images is implemented and tested. Uncalibrated reconstruction refers to the fact that camera parameters, such as focal length and focus, are unknown and only images are used as input to the system.

A large application area for thermal cameras is inspection. Temperature differences in an image can indicate, for example, poor insulation or high friction. If an automated system can create a three-dimensional model of a scene, it can help simplify inspection and provide a better overview. Thermal images will generally have lower resolution, less contrast and less high-frequency content compared to visual images. These properties of infrared images complicate the extraction and matching of points in the images, which are important steps in the reconstruction. To remedy this complication the images are preprocessed before the reconstruction; a selection of preprocessing methods has been tested. Reconstruction with thermal images also places additional demands on the reconstruction, since it is important to preserve the thermal accuracy of the images in the model.

Three main results are obtained from this thesis. Firstly, it is possible to compute camera calibration and pose as well as a sparse reconstruction from an infrared image sequence with the implementation proposed in this thesis. Secondly, the correlation of temperature measurements in the images used for the reconstruction is presented and analyzed. Finally, the tested preprocessing does not show an improvement of the reconstruction that is proportional to the increased computational complexity.


Acknowledgements

This thesis work was proposed by FLIR Systems AB in Täby, Sweden. I extend my gratitude to my supervisor at FLIR, Martin Solli, as well as to my subject reviewer, Robin Strand, at Uppsala Universitet. They have both provided input and guidance for my project, including comments on early versions of this report.

On a personal level I would like to thank Alva Hultén for her invaluable support.


Contents

1 Introduction
  1.1 Project description
    1.1.1 Problem formulation
    1.1.2 Purpose
  1.2 Background
  1.3 Related work
  1.4 Data acquisition

2 Theory
  2.1 Pinhole camera model
  2.2 Reference systems
  2.3 Epipolar geometry
  2.4 Reconstruction geometry
    2.4.1 Projective geometry
    2.4.2 Affine geometry
    2.4.3 Metric geometry
    2.4.4 Euclidean geometry
    2.4.5 Auto-calibration equations
  2.5 Random sample consensus

3 Implementation
  3.1 Preprocessing
  3.2 Image interest points
    3.2.1 Detection
    3.2.2 Description
    3.2.3 Matching
  3.3 Calibration process
    3.3.1 Begin calibration
    3.3.2 Extend calibration
    3.3.3 Upgrade calibration
  3.4 Triangulation
  3.5 Reconstruction quality

4 Experiments

5 Result
  5.1 Herz-Jesu
  5.2 FLIR-Corner
    5.2.1 Visual
    5.2.2 Infrared
  5.3 Aerial

6 Discussion
  6.1 Reconstruction results
  6.2 Preprocessing
  6.3 Matching
  6.4 Thermal accuracy
  6.5 Evaluation

7 Conclusion
  7.1 Conclusions
  7.2 Future work

8 References

1 Introduction

1.1 Project description

1.1.1 Problem formulation

The goal of this thesis is to investigate how to bring image based modeling into the world of infrared imaging. The work is carried out on behalf of FLIR Systems in Täby, Sweden. Difficulties for both visual and infrared images include feature extraction, feature matching and computing geometries. An additional problem for the infrared case is to maintain the thermographic accuracy of the original images when an arbitrary view is rendered. Moreover, the characteristics of infrared images differ from those of visual images and prevent straightforward usage of systems developed on such images. Given a set of images capturing a scene from different perspectives, the system should automatically generate a three-dimensional model of the scene. Specifically, for this thesis a prototype is implemented for infrared image based modeling. The implementation is evaluated in comparison with visual image based modeling and with respect to thermal accuracy.

1.1.2 Purpose

The overall purpose of this thesis is to work towards an automated scene reconstruction system using infrared images. The first aim is to examine whether it is possible to obtain camera calibration and pose as well as a sparse point cloud reconstruction from a sequence of infrared images. The second aim is to present and analyze correlations of thermal measurements from the images which are used to reconstruct three-dimensional coordinates. The third aim is to present and evaluate preprocessing methods in an attempt to improve the infrared reconstruction process.

1.2 Background

Methods and algorithms for three-dimensional reconstruction using visual images have been heavily developed during the earlier years of the 21st century. The field can be referred to as scene reconstruction, composed of structure from motion (SfM) and multi-view stereo (MVS). The term multi-view stereo is an extension of the term stereo view, a field where, typically, a fixed pair of cameras is used for depth estimation in a scene.

There are two ways to approach the development of a scene reconstruction application: it is possible to use images captured by a camera with unknown calibration, or the process can be aided by a known calibration. Depending on the camera, the process is referred to as either uncalibrated or calibrated reconstruction. Uncalibrated reconstruction is mainly considered in this thesis.

Scene reconstruction is a large field in computer vision and plenty of open source software implementations of the developed algorithms are available. Bundler: Structure from Motion for Unordered Image Collections [1] is one example of a software package which has successfully implemented a reconstruction process for visual images. A second example of a promising software package for visual images is Insight3d [2], which has been used as a framework in this development process for infrared images. Large companies such as Google and Apple both have their own MVS projects: for Apple Maps "Flyover" and Google Maps "Earth View", cities around the world, Paris and London for example, have been reconstructed from aerial imagery, forming texture-mapped three-dimensional models.

Applications of infrared thermography include, for example, inspection of buildings to detect poorly insulated sections and inspection of industrial equipment to plan for predictive maintenance. A three-dimensional infrared model consisting of a large set of two-dimensional images would simplify the inspection by providing a complete overview of the scene. During the reconstruction process, depth estimation in the scene is calculated. The depth estimation can be used for navigation, in robotics for example. The ability to estimate depth from infrared images could serve to enable navigation in poorly lit environments.

1.3 Related work

Related articles on scene reconstruction are mainly focused on images in the visual spectrum. However, in [3], a previous thesis carried out at FLIR, the subject of infrared reconstruction is introduced. A variety of methods are summarized and steps in the reconstruction process are tested in isolation for infrared images. Results from that thesis include a sparse reconstruction using infrared images captured with a calibrated camera, where matches have been manually constructed. Uncalibrated reconstruction is attempted on a visual image sequence but the resulting scene is incorrect. The main issue in that thesis is the lack of ability to match detected features in the infrared images. The mathematical concepts in scene reconstruction are analyzed to a great extent in [4], which is suitable for getting familiar with the relevant geometry.

The classical problem of reconstructing three-dimensional geometry from photographs has occupied researchers for decades [5]. Development began in a laboratory setting with controlled conditions for parameters such as lighting. Only recently have algorithms matured and commercial cameras improved enough to allow for real life reconstruction.

The reconstruction can be divided into two separate steps. The first step is SfM, used to estimate camera calibration, pose and tracks, forming a sparse reconstruction and connectivity for the images. A track in this context refers to a reconstructed three-dimensional coordinate and the corresponding list of two-dimensional image points that are projections of the same point. Early advances in SfM are published in [6], where a hierarchical approach is used to find images that are suitable candidates for the process. A real time application with a similar approach, used in robotics, is visual simultaneous localization and mapping (vSLAM), where an unknown map is updated while simultaneously tracking an agent. The second step of reconstruction is MVS, where a denser reconstruction can be achieved from known cameras in the sequence. In terms of MVS, early advances include [7], where a geometric approach is used for surface estimation by solving a set of PDEs, and [8], working with a varying baseline approach to obtain precision at different scales. Eventually MVS came to the reconstruction of small outdoor scenes [9] as well as outdoor scenes on a larger scale [10]. In an article called Building Rome in a Day [11], Agarwal et al. set out to perform internet scale reconstruction in less than 24 hours using images from a simple Google search as input to their system. They fell short of their goal; however, it was believed to be possible and advances have been made since. In Building Rome in a Cloudless Day [12] the goal was met. Recent advances are in the form of commercial use by large companies, as mentioned.

Here some of the key steps in the development process of scene reconstruction are listed. The use of random sample consensus for robust model estimation, explained further in Section 2.5, improved results from the epipolar geometry, Section 2.3. The development of early high quality feature detectors [13] and descriptors [14] was also crucial. Efficient indexing [15] as well as connectivity graphs for extracted tracks [16] made it possible to build scalable systems for scene reconstruction using a large number of images. Beyond the development of algorithms and software, hardware improvements have played a huge role in the feasibility of scene reconstruction. Today even low grade commercial cameras are able to capture high resolution photos of acceptable quality to be used in scene reconstruction. Improvements in computer hardware, in the form of better processors as well as parallelization, have also allowed for quicker solutions with a brute force approach.

1.4 Data acquisition

Images used as input for the reconstruction are from three different sources. The visual test case is from the Computer Vision Laboratory (CVLAB) [17], while infrared images have been captured using a FLIR T640 thermal camera. The T640 has a visual camera installed along with the thermal sensor; the two simultaneously capture images and align them, which will be used for evaluation as well. Beyond this, a sequence has been captured using an infrared camera, a Zenmuse XT, mounted on a UAV. For the aerial sequence only infrared images are obtained and no comparison to the visual spectrum can be made.

Figure 1: Herz-Jesu image sequence from CVLAB used for evaluation. Presented here are four sample images from the full sequence consisting of eight images in total; (a) and (d) are the first and the last image in the full sequence respectively. Images in this sequence have a resolution of 3072x2048.

Figure 2: FLIR-Corner image sequence, captured using the T640 thermal camera. Top row (a)-(d) are images constructed from radiometric data and bottom row (e)-(h) are their counterparts captured with the visual camera. The full sequence contains five images, where the last one is similar and omitted for space conservation purposes. The visual images have a resolution of 2592x1944 while the infrared counterparts are 640x480.

Figure 3: Aerial image sequence. Presented here are four sample images from the full sequence consisting of seventeen images in total; (a) and (d) are the first and the last image in the full sequence respectively. Images in this sequence have a resolution of 640x512.

The data acquisition phase is crucial in scene reconstruction. The authors of [18] have compiled a section on best practices when capturing images for three-dimensional reconstruction. Factors to keep in mind include:

• Resolution of the images and quality of the lenses.

• Perspective change between two subsequent images.

• Image overlap.

• Number of images.

A higher resolution will improve results as long as the quality of the lens is sufficient. A greater change in perspective will lead to more accurate depth estimation, while it complicates the matching step, explained in Section 3.2.3. The matching step will also be affected by image overlap; additional overlap will of course allow for simpler matching. Modern SfM and MVS algorithms will efficiently exploit a large supply of images to improve the results.


2 Theory

This section will serve to introduce geometry and notation which is of relevance for scene reconstruction. Readers with experience in scene reconstruction might want to skip this section since the steps are revisited in Section 3 in practice.

2.1 Pinhole camera model

In a mathematical context the camera capturing the scene is modeled as the perspective projection of a pinhole camera [19]. A pinhole camera is a closed box where light is let in through a tiny hole and projected onto the back wall. The pinhole camera model describes the camera aperture as a single point and lenses are not considered; perspective projection is visualized in Figure 4. The back wall, subject to the projection, is referred to as the image plane. The point through which the light passes is the focal point, or the center of projection. The distance between the focal point and both the image and the virtual image plane is the focal length. A virtual image plane is often used in calculations to avoid negations when considering the orientation of the two-dimensional image. Beyond the notation in Figure 4 there are two additional properties that are of importance to the calculations. Firstly, the optical axis is the line through the center of projection which is also orthogonal to the image plane. Secondly, the principal point is the intersection of the optical axis and the virtual image plane. A desired property of this model is that collinear points in the three-dimensional object remain collinear in the two-dimensional projection.

Figure 4: Pinhole camera model with labeled properties. Figure from [20].

2.2 Reference systems

Typically four coordinate systems are used to formulate the model in algebraic notation. A world coordinate system is used to describe the position and orientation of the camera coordinate systems as well as scene coordinates. The camera coordinate system has its origin in the center of projection and is usually oriented so that the optical axis is parallel to the Z-axis. The image coordinate system originates from the principal point and contains the image coordinates in metric units. Lastly, the pixel coordinate system has its origin in one of the image corners, usually the top left corner by convention, with coordinates in pixel units. Keep in mind that a set consisting of a camera, image and pixel system will be used for every view of the scene.

Figure 5: Reference systems used for calculations where camera C is capturing an image of a three dimensional point or vertex X. Intrinsic parameters are internal to the camera and form a relation between image coordinates and camera system coordinates. Extrinsic parameters relate camera system coordinates to the world coordinate system. Figure from [21].

Using the listed coordinate systems, shown in Figure 5, the position of a two-dimensional point in the image coordinate system depending on a three-dimensional point in the world coordinate system is

$$x = f\frac{X}{Z}, \qquad y = f\frac{Y}{Z} \qquad (1)$$

where f is the focal length, {x, y} are image coordinates and {X, Y, Z} are world coordinates. The position in the pixel coordinate system is then calculated by shifting the relative position of the origins and scaling by row- and column-wise pixel densities in pixels/millimeter, equation (2). Let k_u and k_v denote the column- and row-wise pixel densities respectively; the expression is

$$u = k_u f\frac{X}{Z} + k_u x_0, \qquad v = k_v f\frac{Y}{Z} + k_v y_0. \qquad (2)$$

Then, by introducing homogeneous coordinates and pixel units for the parameters, the projection can be formulated as a matrix multiplication. The projection matrix P projects a given homogeneous camera coordinate (X_c, Y_c, Z_c, 1) to a homogeneous pixel coordinate (x, y, 1):

$$\begin{pmatrix} x \\ y \\ 1 \end{pmatrix} \sim \begin{pmatrix} f_u & s & u_0 & 0 \\ 0 & f_v & v_0 & 0 \\ 0 & 0 & 1 & 0 \end{pmatrix} \begin{pmatrix} X_c \\ Y_c \\ Z_c \\ 1 \end{pmatrix} \qquad (3)$$

where f_u and f_v are the focal length scaled by the pixel densities, s is the skew of the pixels and {u_0, v_0} is the principal point in the pixel coordinate system. By assuming equal pixel densities in the x- and y-directions and zero skew of the pixels, the three by three camera calibration matrix is

$$K = \begin{pmatrix} f & 0 & u_0 \\ 0 & f & v_0 \\ 0 & 0 & 1 \end{pmatrix}. \qquad (4)$$

The camera calibration matrix is also referred to as the intrinsics matrix; it contains parameters internal to the camera. However, the intrinsic parameters alone do not hold enough information for the reconstruction process. The pose, rotation and translation, of the camera coordinate system relative to the world coordinate system is included to form the complete camera projection matrix. A coordinate in the world system is projected to the pixel coordinate system by the relation (x, y, 1)^T ∼ P (X, Y, Z, 1)^T, where P is the three by four matrix

$$P = K[R \mid T] = \begin{pmatrix} f & 0 & u_0 \\ 0 & f & v_0 \\ 0 & 0 & 1 \end{pmatrix}\begin{pmatrix} r_{11} & r_{12} & r_{13} & t_x \\ r_{21} & r_{22} & r_{23} & t_y \\ r_{31} & r_{32} & r_{33} & t_z \end{pmatrix}. \qquad (5)$$

Here R is a rotation matrix describing the orientation of the basis of the camera coordinate system and T is the translation of the origin of the same system.
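As a concrete illustration of equations (1)-(5), the short sketch below builds a calibration matrix and a projection matrix with numpy and projects a single world point to pixel coordinates. All numeric values (focal length, principal point, pose and the test point) are hypothetical and chosen only for the example.

import numpy as np

f, u0, v0 = 800.0, 320.0, 240.0          # hypothetical focal length and principal point in pixels
K = np.array([[f, 0, u0],
              [0, f, v0],
              [0, 0, 1.0]])              # calibration matrix, equation (4)

R = np.eye(3)                            # camera aligned with the world frame
T = np.zeros((3, 1))                     # no translation
P = K @ np.hstack((R, T))                # 3x4 projection matrix, equation (5)

X = np.array([0.5, -0.2, 4.0, 1.0])      # homogeneous world point
x = P @ X                                # homogeneous pixel coordinate
u, v = x[0] / x[2], x[1] / x[2]          # dehomogenize
print(u, v)                              # 420.0 200.0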

2.3 Epipolar geometry

As only images are passed to the system for scene reconstruction, a way of relating the cameras and estimating their calibration is necessary; for this purpose epipolar geometry is used. The core of epipolar geometry is the fundamental matrix F. Let x and x' denote the two-dimensional image projections of a vertex X in three dimensions onto two different image planes. x and x' are so-called corresponding points and x'^T F x = 0. This section is dedicated to a brief derivation of the fundamental matrix; a more extensive formulation can be found in [4].

Two corresponding image points and the respective three-dimensional point form a plane, the epipolar plane π. The epipolar geometry is commonly explained by Figure 6. If x_l is known, the epipolar geometry can be used to constrain x_r: x_l and the corresponding point x_r will both lie on the epipolar plane. Since π can be constructed from the baseline and the ray defining x_l, the correspondence search is limited to the intersection of the right image and π, the right epipolar line l_2, as shown in Figure 6. Using these intuitive geometric properties it is possible to derive the fundamental matrix. Any point x_r in the second image corresponding to the point x_l must lie on the epipolar line l_2. The projection of the line through X and C_l into the right image defines a map

$$x_l \mapsto l_2 \qquad (6)$$

from a point in the left image to the corresponding epipolar line. The map is in fact a projective mapping, which is represented by the fundamental matrix:

$$l_2 = F x_l. \qquad (7)$$

Since x_r ∈ l_2, (x_r)^T l_2 = 0 and therefore

$$x_r^T F x_l = 0. \qquad (8)$$


Figure 6: Epipolar geometry where the camera centers C_{r,l} project the vertex X to x_{r,l}. The baseline is drawn between the two camera centers and its intersections with the respective image planes, e_{1,2}, are the epipoles, the images of the neighboring camera. Figure from [22].

The fundamental matrix is a three by three matrix of rank two which can be determined up to scale. In order to determine F, at least 7 point correspondences are needed to set up a non-linear system of equations. However, the eight-point algorithm [23], proposing a linear system of equations requiring 8 point correspondences, is frequently cited.
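A minimal sketch of how the fundamental matrix can be estimated in practice with OpenCV and how the epipolar constraint x'^T F x ≈ 0 can be checked. The point arrays are synthetic placeholders; in the pipeline they would come from the matching step in Section 3.2.3.

import numpy as np
import cv2

rng = np.random.default_rng(0)
pts_l = rng.uniform(0, 640, size=(30, 2)).astype(np.float32)   # placeholder matches in the left image
pts_r = pts_l + np.float32([5.0, 0.0])                          # synthetic shifted correspondences

F, mask = cv2.findFundamentalMat(pts_l, pts_r, cv2.FM_RANSAC)
if F is not None:
    # The epipolar constraint should be close to zero for inlier correspondences.
    x = np.append(pts_l[0], 1.0)
    xp = np.append(pts_r[0], 1.0)
    print(xp @ F @ x)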

2.4 Reconstruction geometry

It is in fact impossible to determine the positions of points in the scene uniquely while only using images as input to the system. This ambiguity can be considered trivial, but it is relevant for introducing this section. In practice, the absolute position of the scene, its orientation and its scale will all be undetermined without more information concerning these properties for at least one of the camera view-points. There are additional ambiguities occurring in the reconstruction process when the calibration of the used cameras is unknown. Using invariant geometric properties, it is possible to upgrade the reconstruction in order to remedy such additional ambiguities. In this section the achievable reconstruction is formulated step by step, where ambiguities are reduced and the accuracy of the scene is increased. The geometries describing the upgrading process are projective, affine, metric and euclidean; these will be introduced first. The relevant geometries are summarized by Figure 7.


Figure 7: Geometries included in the reconstruction process with their respective properties; the third column shows the distortion of a cube when subject to a transformation belonging to the specific group. Figure from [4].

2.4.1 Projective geometry

The initial reconstruction is achieved in a projective geometry. In the context of this upgrading process projective geometry has the least number of invariants. Lines that should be parallel are not parallel when subject to projective transformations. Size comparison and measurements are not possible.

2.4.2 Affine geometry

Affine geometry is a specialization of projective geometry where the parallel postulate holds: lines that should be parallel will still be parallel when subject to an affine transformation, while orthogonality is not preserved. The set of affine transformations is a subset of the projective transformations. In a local region an affine transformation can approximate a perspective change, something that will be discussed further in Section 3.2.1.

The upgrade from projective geometry to an affine one is based on the plane at infinity, π_∞. The plane at infinity is invariant under affine transformations, and by transforming π_∞ to its canonical position [0, 0, 0, 1]^T (in homogeneous coordinates) in the projective geometry it is assured that parallel planes and lines will only intersect at infinity. Mathematically this statement is deficient; however, if parallel planes and lines only intersect at π_∞ the geometry is affine.


2.4.3 Metric geometry

In metric geometry size comparisons and distances between elements can be measured. The scale of the scene will still be undetermined. Metric transformations form a subsequent subset of the affine transformations, and here orthogonality is preserved as well. The invariant of metric geometry used for the upgrading process is the absolute conic Ω_∞ located at π_∞. By identifying the homography that transforms the absolute conic in affine geometry to the absolute conic in metric geometry, the upgrade is possible. With knowledge about the metric structure of the scene it is possible to project the absolute conic back into the images and thereby find the calibration of the cameras. This relation forms an equivalence between the calibrations of the cameras and the geometric structure of the scene.

2.4.4 Euclidean geometry

In euclidean geometry it is possible to measure sizes of and lengths in the reconstructed scene. The invariant property of euclidean geometry is the volume, which will be unknown with the usage of images alone. In this report a reconstruction up to a metric geometry with normalized distance units is considered. It is possible to achieve the euclidean counterpart by a similarity transform, scaling the axes to correspond to measurements in the scene.

2.4.5 Auto-calibration equations

Using point correspondences from image pairs in a sequence along with the epipolar geometry it is possible to compute a partial calibration and a projective reconstruction.

The steps are explained further in Section 3.3. The general approach would be

• Obtain a projective reconstruction {P_i, X_j}.

• Determine a rectifying homography H from auto-calibration constraints, and transform to a metric reconstruction {P_i H, H^{-1} X_j}.

Here P_i denotes the camera projection matrix estimate for camera i and X_j is the j:th reconstructed vertex in projective geometry. Suppose that the m camera projection matrices P^M_i are known in the true metric situation. A vertex in metric geometry is related to the image coordinates by x_i = P^M_i X^M and the camera matrices can be decomposed as P^M_i = K_i[R_i | t_i] for i = 1, ..., m. The relation between the camera matrices in a projective reconstruction and their metric counterparts is

$$P^M_i = P_i H, \qquad i = 1, \ldots, m \qquad (9)$$

where H is an unknown four by four homography to be determined in order to upgrade the geometry. The world frame is set to coincide with the first camera, R_1 = I, t_1 = 0, and the remaining rotation matrices and translation vectors R_i, t_i are relative to the first camera, so that P_1 = K_1[I | 0]. With

$$H = \begin{pmatrix} A & t \\ v^T & k \end{pmatrix} \qquad (10)$$

and equation (9), the relation is [K_1 | 0] = [I | 0]H, so A = K_1 and t = 0. The homography can be written on the form

$$H = \begin{pmatrix} K_1 & 0 \\ v^T & 1 \end{pmatrix} \qquad (11)$$

by setting the scalar k = 1, thereby fixing the scale of the scene, which is considered arbitrary. Notice how this formulation will only find the scene up to a similarity transform, meaning that absolute rotation, translation and scale are not concerned. The homography will set π_∞ in the metric reconstruction to its canonical position by

$$\pi_{\infty,P} = H^{-T}\pi_{\infty,M} = H^{-T}\begin{pmatrix} 0 \\ 0 \\ 0 \\ 1 \end{pmatrix} = \begin{pmatrix} -(K_1)^{-T}v \\ 1 \end{pmatrix}. \qquad (12)$$

The indexes M and P are used to denote the geometric context for the plane at infinity. To simplify the expressions the notation π_{∞,P} = (p^T, 1)^T, p = -(K_1)^{-T}v is introduced, so that

$$H = \begin{pmatrix} K_1 & 0 \\ -p^T K_1 & 1 \end{pmatrix}. \qquad (13)$$

Denote the cameras in the projective reconstruction as P_i = [A_i | a_i] and use equations (9) and (13) to form

$$K_i R_i = (A_i - a_i p^T)K_1 \quad \text{for } i = 2, \ldots, m. \qquad (14)$$

Rearranging the equation and using RR^T = I results in the expression

$$K_i K_i^T = (A_i - a_i p^T)K_1 K_1^T (A_i - a_i p^T)^T. \qquad (15)$$

Here K_i K_i^T = ω*_i is the dual image of the absolute conic and ω_i = (K_i K_i^T)^{-1} is the image of the absolute conic. Using the introduced notation for the absolute conic and the inverse of equation (15), the basic auto-calibration equations are

$$\omega^{*i} = (A_i - a_i p^T)\,\omega^{*1}\,(A_i - a_i p^T)^T \qquad (16)$$

$$\omega^{i} = (A_i - a_i p^T)^{-T}\,\omega^{1}\,(A_i - a_i p^T)^{-1}. \qquad (17)$$

The equations relate the unknown parameters of the dual absolute conic ω*_i, as well as the unknowns for the plane at infinity in p, to the known entries A_i, a_i of the projective cameras. The auto-calibration equations can then be solved using a range of different constraints on the intrinsic camera calibration matrix K.

Table 1: The number of views required to solve the auto-calibration equations for varying constraints on the internal calibration of the cameras. The asterisk denotes that multiple solutions may exist.

Condition                                                          Fixed  Known  Views
Constant internal parameters                                         5      0      3
Aspect ratio and skew known, focal length and principal point vary   0      2      4*
Aspect ratio and skew constant, focal length and principal point vary 2     0      5*
Skew zero, all other parameters vary                                 0      1      8*


In Table 1, varying constraints on the intrinsic calibration matrix are listed with their respective requirements on the input image sequence. More extensive material on reconstruction geometry can be found in [4].

2.5 Random sample consensus

Under the assumptions that images are perfect and that it is possible to find a sufficient number of exact corresponding image points, the reconstructed scene could be found using the relations presented so far. In reality, images are noisy and contain artifacts. In addition, corresponding features will in many cases not match the geometry down to pixel coordinate precision. A method for robust model estimation in the presence of outliers is required; the paradigm is to use Random Sample Consensus, or RANSAC [24]. Figure 8 illustrates the superior performance of RANSAC for a simple linear curve fit to data which contain outliers. The outliers, the gray markers labeled as points in the figure, are weighted into the least squares solution, resulting in a worse fit for the correctly measured data.

Figure 8: Comparing the RANSAC approach to a least squares solution for a linear curve fitting problem. Figure from [25].

Pseudo code for the RANSAC algorithm is presented in Algorithm 1. A minimal number of points is sampled from the data in order to construct a temporary model.

If the number of iterations is chosen with consideration to the proportion of outliers to inliers in the data, the probability of finding a correct model is high. The probability P of finding a model that does not contain any outliers can be related to the number of iterations N by

$$N = \frac{\log(1 - P)}{\log(1 - (1 - \epsilon)^q)} \qquad (18)$$

where q is the number of elements in the minimal subset and ε is the proportion of outliers in the data set. Determining ε can turn out to be difficult or even impossible, while some heuristic choice for N performs well.
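As an illustrative worked example (the values are chosen here only for demonstration): with a minimal subset of q = 8 correspondences, an outlier proportion ε = 0.5 and a desired success probability P = 0.99, equation (18) gives N = log(0.01)/log(1 − 0.5^8) ≈ 1177 iterations, showing how quickly the required number of iterations grows with the size of the minimal subset.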


Algorithm 1: RANSAC algorithm

procedure RANSAC(data, nrIterations)                 ▷ typically a large number of iterations
    nInliers = 0
    for i = 1 to nrIterations do
        sample = SelectRandomSubset(data)            ▷ sample a minimal number of points
        tmpModel = ComputeModel(sample)              ▷ compute model from sample
        tmpInliers = ComputeInliers(data, tmpModel)  ▷ check consistency
        if tmpInliers > nInliers then                ▷ a better model is found
            model = tmpModel
            nInliers = tmpInliers
        end if
    end for
    return (model, nInliers)
end procedure
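The following is a minimal runnable sketch of Algorithm 1 applied to the line-fitting example in Figure 8. The synthetic data, the inlier tolerance and the number of iterations are illustrative and unrelated to the thesis implementation.

import numpy as np

def ransac_line(points, n_iterations=1000, inlier_tol=0.05):
    # Fit y = k*x + m robustly; points is an (N, 2) array.
    best_model, best_inliers = None, 0
    for _ in range(n_iterations):
        i, j = np.random.choice(len(points), 2, replace=False)    # minimal sample
        (x1, y1), (x2, y2) = points[i], points[j]
        if x1 == x2:
            continue
        k = (y2 - y1) / (x2 - x1)                                  # temporary model
        m = y1 - k * x1
        residuals = np.abs(points[:, 1] - (k * points[:, 0] + m))  # consistency check
        n_inliers = int((residuals < inlier_tol).sum())
        if n_inliers > best_inliers:                               # a better model is found
            best_model, best_inliers = (k, m), n_inliers
    return best_model, best_inliers

rng = np.random.default_rng(0)
x = rng.uniform(0, 1, 100)
y = 2.0 * x + 0.5 + rng.normal(0, 0.01, 100)
y[:20] += rng.uniform(0.5, 1.0, 20)                                # contaminate with outliers
print(ransac_line(np.column_stack((x, y))))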

3 Implementation

In this section the implementation is proposed; methods and software packages are suggested for the pipeline. In particular, a process is explained in detail that will reconstruct a sparse point cloud as well as camera positions and poses given only an image sequence as input. A number of open source software packages are used in the implementation. Insight3d provides a GUI and storage conventions for the data. Insight3d is in turn built using SDL [26] for event handling, OpenCV [27] for image processing, OpenGL [28] for rendering, LAPACK [29] and BLAS [30] for algebraic computation and lastly pThreads [31] for parallelization. Additional software packages that were used have a stronger correlation to pipeline-specific details and are mentioned in their respective subsections. The project is built in Visual Studio 2015 on a machine running 64-bit Windows 10.

3.1 Preprocessing

The preprocessing will depend on whether the image sequence consists of visual or infrared images. In general, infrared images will have less contrast and less high frequency content. The lack of contrast becomes a problem when trying to identify corresponding image points; attempts are made to remedy this issue with additional preprocessing. For a visual image sequence, preprocessing is simply a conversion from RGB color to gray scale. In order to retrieve radiometric data from the thermal images an additional software package is used, namely the Spartacus SDK. Spartacus is developed at FLIR Systems to enable image processing using their thermal image formats. For infrared images a number of preprocessing steps are tested. The steps include a normalization of the image histogram

$$f(x, y) = \frac{g(x, y) - g_{min}}{g_{max} - g_{min}} \cdot (2^{bpp} - 1) \qquad (19)$$

where g(x, y) is the original image, f(x, y) is the corrected image and bpp is the available bits per pixel in the image. Histogram equalization and edge enhancement are also tested in order to increase contrast in the image; the increased contrast will enable detection of suitable candidates for a correspondence search. Histogram equalization is common in image analysis and edge enhancement is achieved by unsharp masking,

$$I_{edge} = I_{org} + (I_{org} - I_{\sigma}), \qquad (20)$$

in the experiments. I_σ denotes the blurred image obtained from convolution with a Gaussian kernel of variance σ. The kernel size is defined to be (2⌈2σ⌉ + 1)² where ⌈·⌉ denotes the ceiling function. Preprocessing is tested with the ultimate goal of obtaining infrared images that are similar to a gray scale visual image. The properties of a visual image are desired since parts of the implementation are based on algorithms developed on such images.
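A sketch of the tested preprocessing steps using OpenCV: histogram normalization as in equation (19), histogram equalization, and unsharp masking as in equation (20). The file name, the assumption of an 8-bit gray scale input and the value of σ are placeholders; the actual pipeline reads radiometric data through the Spartacus SDK.

import cv2
import numpy as np

ir = cv2.imread("frame_ir.png", cv2.IMREAD_GRAYSCALE).astype(np.float32)

# Equation (19): stretch the histogram to the full range, here bpp = 8.
normalized = (ir - ir.min()) / (ir.max() - ir.min()) * (2 ** 8 - 1)
normalized = normalized.astype(np.uint8)

# Histogram equalization for additional contrast.
equalized = cv2.equalizeHist(normalized)

# Equation (20): unsharp masking, kernel size (2*ceil(2*sigma) + 1)^2.
sigma = 2.0
ksize = int(2 * np.ceil(2 * sigma) + 1)
blurred = cv2.GaussianBlur(normalized.astype(np.float32), (ksize, ksize), sigma)
edge_enhanced = np.clip(normalized + (normalized - blurred), 0, 255).astype(np.uint8)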

3.2 Image interest points

The first step in the reconstruction pipeline is to find corresponding points in the image sequence. Correspondences are found by first detecting features that are likely to be detected from other perspectives as well, image interest points. A region around each interest point is extracted and described to prepare for a correspondence search. Correspondences are formed by a nearest neighbor search in the descriptor space, aided by the use of epipolar geometry. By conclusions drawn in [32], a second preceding thesis carried out at FLIR, it is known that Hessian-affine detection and Local Intensity Order Pattern description are suitable method choices for interest point detection and description on infrared images. For comparison purposes the same method pair is used on the visual image sequences.

3.2.1 Detection

The Hessian-affine detector is formulated in [33]. Hessian-affine detection is initialized using points extracted by a multi-scale Harris detector [34], where the second moment matrix, or structure tensor, is evaluated at local regions in the image. Harris detection for a local neighborhood of an image point x is

$$\mu(x, \sigma_I, \sigma_D) = \sigma_D^2\, g(\sigma_I) * \begin{pmatrix} L_x^2(x, \sigma_D) & L_x L_y(x, \sigma_D) \\ L_x L_y(x, \sigma_D) & L_y^2(x, \sigma_D) \end{pmatrix} \qquad (21)$$

$$\det(\mu) - \alpha\,\mathrm{trace}^2(\mu) > \text{threshold}. \qquad (22)$$

In equation (21), σ_I and σ_D are the integration and derivation scales respectively, g is the Gaussian kernel and L is the image already smoothed by a Gaussian. Multi-scale detection refers to a gradual shift in the Gaussian scale-space, where σ_I is the single degree of freedom. For large values of σ_I the image will be significantly blurred, which enables detection of features on a large scale. Consequently, small values of σ_I are for detection on a smaller scale. The eigenvalues of the structure tensor determine the characteristics of the region; large eigenvalues indicate the presence of either a corner or an edge. Since eigenvalue estimation is computationally heavy, the Harris measure in equation (22) was introduced: for values larger than some threshold an interest point is detected.
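A sketch of the Harris measure in equations (21)-(22) computed over a whole image with OpenCV and numpy. The scales σ_D and σ_I, the constant α and the threshold are illustrative; the thesis uses the VLFeat Hessian-affine detector rather than this direct implementation.

import cv2
import numpy as np

def harris_response(img, sigma_d=1.0, sigma_i=2.0, alpha=0.06):
    img = img.astype(np.float32)
    # Derivatives of the image smoothed at the derivation scale sigma_d.
    L = cv2.GaussianBlur(img, (0, 0), sigma_d)
    Lx = cv2.Sobel(L, cv2.CV_32F, 1, 0, ksize=3)
    Ly = cv2.Sobel(L, cv2.CV_32F, 0, 1, ksize=3)
    # Structure tensor entries integrated at scale sigma_i, equation (21).
    Sxx = cv2.GaussianBlur(Lx * Lx, (0, 0), sigma_i) * sigma_d ** 2
    Syy = cv2.GaussianBlur(Ly * Ly, (0, 0), sigma_i) * sigma_d ** 2
    Sxy = cv2.GaussianBlur(Lx * Ly, (0, 0), sigma_i) * sigma_d ** 2
    # Harris measure, equation (22): det(mu) - alpha * trace(mu)^2.
    det = Sxx * Syy - Sxy * Sxy
    trace = Sxx + Syy
    return det - alpha * trace ** 2

img = cv2.imread("frame_ir.png", cv2.IMREAD_GRAYSCALE)
response = harris_response(img)
interest_points = np.argwhere(response > 0.01 * response.max())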

The region around a detected interest point is affine covariant for the Hessian-affine detector. This region is represented by an oriented ellipse with six degrees of freedom; it is then normalized to a circle with fixed rotation before description. The covariance is explained further by Figure 9, where in this particular case the warp is an affine transformation of the image. Since an affine transformation can approximate perspective changes locally on a smooth surface, the affine covariance is desired in order to detect similar features from different views in the image sequence. The VLFeat C API implementation [35] of the Hessian-affine detector is used for this thesis.

Figure 9: Covariance of the region extracted around an interest point in the image. The image is subject to some warp or distortion and the warp of the extracted region is covariant to a high degree. Figure from [35].

3.2.2 Description

The Local Intensity Order Pattern descriptor (LIOP) [36] is based on local order pattern techniques, where the image patch is further divided into regions based on image intensity. The patch is divided into subregions by the work flow in Figure 10. A LIOP is computed for each subregion, or ordinal bin, and the descriptor is then formed by concatenating the resulting histograms. The region division method associated with the LIOP descriptor is invariant to monotonic luminance variance and rotation, which is an advantage over a number of other descriptors. Histograms of the ordinal bins are constructed by sampling image points in the region and finding the permutation which sorts the intensities in non-descending order. A lookup table, exemplified in Table 2, is used to find the contribution of the sample to the ordinal bin histogram.

Table 2: Lookup table for permutations of a vector with length N = 3.

Permutation   Index
1, 2, 3       1
1, 3, 2       2
2, 1, 3       3
2, 3, 1       4
3, 1, 2       5
3, 2, 1       6

Figure 10: (a) The detected Hessian-affine region. (b) After normalization in the detector. (c) Regions R_t = {x : τ_t ≤ I(x) ≤ τ_{t+1}} such that card(R_1) = card(R_2) = .... (d-f) Region masks. Figure from [36].

Parameters for the LIOP descriptor are suggested by the authors: a number of sample points N = 4 and a number of ordinal bins B = 6 show good performance. With these parameters the descriptor vector has N! · B = 144 dimensions. LIOP is implemented in the VLFeat C API and is used for compatibility with the detector.
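To make the lookup in Table 2 concrete, the small sketch below maps a set of N = 3 sampled intensities to the rank permutation that sorts them and then to its index. This is only an illustration of the table; the thesis uses the VLFeat LIOP implementation for the actual descriptor.

from itertools import permutations

N = 3
# Lexicographic enumeration matches the indices in Table 2.
lookup = {perm: idx + 1 for idx, perm in enumerate(permutations(range(1, N + 1)))}

def permutation_index(samples):
    # Rank the samples: position i holds the rank of sample i (1 = smallest).
    order = sorted(range(len(samples)), key=lambda i: samples[i])
    ranks = [0] * len(samples)
    for rank, i in enumerate(order, start=1):
        ranks[i] = rank
    return lookup[tuple(ranks)]

print(permutation_index([10, 30, 20]))   # ranks (1, 3, 2) -> index 2 in Table 2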

3.2.3 Matching

In order to construct image point correspondences, a nearest neighbor search is performed in the descriptor space. Using the initial correspondences the fundamental matrix is calculated; epipolar geometry is then used to perform a second, guided correspondence search aided by geometric constraints.

A matching pattern is used for the sequence where a choice can be made to match (a) all images in the sequence, (b) neighboring images or (c) using a pattern defined by the user.

The two predefined options for a sequence of six images {a, b, . . . , f } are described by Figure 11. Matching option (b) is suitable for a linear image sequence where performance can be improved based on the ordered input.

The naive implementation of a nearest neighbor search would scale as O(n²) for each image, where n is the number of detected features. An attempt to improve this is made by the use of a binary search tree. A k-dimensional tree (k-d tree) [37] is built for all the interest points in each image. Interest points are stored so that every partition of the tree splits the descriptor space along the dimension of highest variance in the current node. A k-d tree partition of a two-dimensional space is shown in Figure 12. Performing a nearest neighbor search on a k-d tree will optimally scale as O(n · log(n)) depending on the data. The descriptor space at hand has, as mentioned, 144 dimensions, and the rule of thumb for performance improvements using k-d trees is n ≫ 2^k where n is the number of elements and k the dimensionality [39]. Due to the high dimensionality of this descriptor space there are more suitable methods with respect to performance, which are considered in Section 6.3.

Figure 11: (a) Matching all images in the sequence. (b) Matching neighboring images in the sequence, where at most four neighbors are used.

Figure 12: (a) Partition of the space. (b) Binary search tree where the stored coordinates and the split dimension are listed for every node. Figure from [38].

Not all of the detected features will find a correct match in a different perspective. By simply selecting the nearest neighbor in the descriptor space as the corresponding feature, outliers would receive a match as well. A method for discrimination is to threshold the nearest neighbor distance; however, this method would be dependent on the descriptor in use. In order to achieve descriptor independent discrimination, [40] suggests that a threshold on the ratio of the nearest neighbor distance to the second nearest neighbor distance is used:

$$\frac{distance_{NN}}{distance_{sNN}} < threshold = 0.7 \qquad (23)$$

The reasoning behind this threshold is that a correct match should have a particularly low descriptor distance while, for a discriminative descriptor, both descriptors of the match should be different from all other features. On the other hand, outlying features should have similar distances to more than a single candidate feature. The consequence of this argument is that for an incorrect match the ratio in equation (23) will be close to one, since no single feature is a distinct nearest neighbor.
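A sketch of descriptor matching with a k-d tree and the ratio test in equation (23), using SciPy's cKDTree. The descriptor arrays are random placeholders standing in for the 144-dimensional LIOP descriptors of two images; with such high-dimensional data the tree offers little speedup, as discussed in Section 6.3.

import numpy as np
from scipy.spatial import cKDTree

def match_descriptors(desc_a, desc_b, ratio=0.7):
    tree = cKDTree(desc_b)
    # Distances to the nearest and second nearest neighbor for every descriptor.
    dist, idx = tree.query(desc_a, k=2)
    matches = []
    for i, (d, j) in enumerate(zip(dist, idx)):
        if d[0] / d[1] < ratio:            # equation (23): discriminative match
            matches.append((i, j[0]))
    return matches

desc_a = np.random.rand(500, 144)          # placeholder descriptors for image a
desc_b = np.random.rand(480, 144)          # placeholder descriptors for image b
print(len(match_descriptors(desc_a, desc_b)))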

After the first purely descriptor-based matching, the number of correspondences is improved by geometric constraints in the epipolar geometry, guided matching. The fundamental matrix is estimated using OpenCV, where a RANSAC implementation requiring 18 correspondences is available. Using Figure 6 as a reference, it is true that for every feature in the left image its correspondence can be found on the epipolar line in the right image.

$$l_2 = F x \qquad (24)$$

sets the geometric constraint. Features are stored in buckets by their spatial location in the image to improve performance. For every feature in the left image the epipolar line is found and buckets intersecting the line are selected for the additional match search. Figure 13 shows the relevant part of the epipolar geometry in this step. The distance to the epipolar line as well as the descriptor distance are then used to find the appropriate match. If the threshold in equation (23) is also satisfied, it is considered a newfound correspondence.

Figure 13: Geometric constraints used to aid the matching process. Every feature in the corresponding image will define an epipolar line. Buckets are used for performance enhancement to efficiently discriminate incorrect matches.

To finalize the matching phase it is necessary to ensure that all correspondences are of a one-to-one nature. For consistency, matches found when matching image a to image b should also be found when matching image b to image a. By forming a union of all the candidates, dissimilar correspondences are removed.

3.3 Calibration process

As stated in Section 2.4, the calibration is achieved in a projective geometry to begin with; then, by finding a solution to the auto-calibration equations, it can be upgraded to satisfy metric properties.


3.3.1 Begin calibration

A calibration is initiated by selecting two random views with a sufficient amount of feature correspondences to determine the epipolar geometry. The epipolar geometry for the views is set up by estimation of the fundamental matrix using the OpenCV implementation. The position in the world coordinate system is fixed in this initiation by assuming that the first camera is canonical,

$$P_1 = [I \mid 0] = \begin{pmatrix} 1 & 0 & 0 & 0 \\ 0 & 1 & 0 & 0 \\ 0 & 0 & 1 & 0 \end{pmatrix}. \qquad (25)$$

The epipole of the second camera is the left null vector of the fundamental matrix, e_2^T F = 0. Using the epipole it is then possible to find a projective estimate of the second camera,

$$P_2 = [e_2 \times F \mid e_2] = [[e_2]_\times F \mid e_2], \qquad [e_2]_\times = \begin{pmatrix} 0 & -e_3 & e_2 \\ e_3 & 0 & -e_1 \\ -e_2 & e_1 & 0 \end{pmatrix}. \qquad (26)$$

The second camera is estimated by equation (26); the vector to matrix cross product is implemented as a matrix-matrix multiplication [[e_2]_× F | e_2] by construction. With two camera projection matrices defined it is possible to start extending the calibration to include more views.
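A compact numpy sketch of the initialization in equations (25)-(26): the first camera is set to the canonical form and the second is built from the epipole, taken as the left null vector of the fundamental matrix F. F is assumed to be an already estimated rank-two matrix.

import numpy as np

def initial_cameras(F):
    P1 = np.hstack((np.eye(3), np.zeros((3, 1))))        # canonical first camera, equation (25)
    # Left null vector of F: e2^T F = 0, taken from the SVD of F^T.
    _, _, vt = np.linalg.svd(F.T)
    e2 = vt[-1]
    e2_cross = np.array([[0, -e2[2], e2[1]],
                         [e2[2], 0, -e2[0]],
                         [-e2[1], e2[0], 0]])             # [e2]_x
    P2 = np.hstack((e2_cross @ F, e2.reshape(3, 1)))      # equation (26)
    return P1, P2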

3.3.2 Extend calibration

Extension of the calibration is performed in iterations where one of the following actions is performed in every iteration until it is not possible to add more subsequent views.

1. Add more views by solving the resection problem.

2. Refine the current calibration by sparse bundle adjustment.

The resection problem is

$$\lambda_i x_i = P X_i, \qquad i = 1, \ldots, N. \qquad (27)$$

Here λ_i and P are the unknowns, x_i is an image coordinate and X_i the imaged vertex in projective geometry from the previously calculated camera projection matrices. For more information on how vertexes are estimated, refer to the section on triangulation. P contains 12 elements, but since the scale is arbitrary it adds 11 degrees of freedom to the problem. Each projection of X_i to x_i results in three additional equations and one additional unknown λ_i.

$$3N \geq 11 + N \;\Rightarrow\; N \geq 6 \qquad (28)$$

At least six point projections are needed in order to find a solution up to the arbitrary scale factor. In practice the equation is solved using a direct linear transform. By introducing

$$P = \begin{pmatrix} p_1^T \\ p_2^T \\ p_3^T \end{pmatrix} \qquad (29)$$

the resection problem, equation (27), can be written on the form

$$\begin{cases} X_i^T p_1 - \lambda_i x_i = 0 \\ X_i^T p_2 - \lambda_i y_i = 0 \\ X_i^T p_3 - \lambda_i = 0 \end{cases} \qquad (30)$$

which in turn can be written in matrix form as

$$M v = \begin{pmatrix} X_i^T & 0 & 0 & -x_i \\ 0 & X_i^T & 0 & -y_i \\ 0 & 0 & X_i^T & -1 \end{pmatrix}\begin{pmatrix} p_1 \\ p_2 \\ p_3 \\ \lambda_i \end{pmatrix} = \begin{pmatrix} 0 \\ 0 \\ 0 \end{pmatrix}. \qquad (31)$$

Notice that the zeros in the matrix are actually blocks of zeros, since the vertexes X_i are homogeneous three-dimensional coordinates and thereby four by one vectors; the resulting matrix is three by thirteen. By concatenating at least six matrices from unique projections, the resulting system will have a valid solution. Noise in the measurements, along with the possibility that the system is overdetermined if additional points can be projected, prohibits the existence of an exact solution. A homogeneous least squares solution with a scale constraint ||v||_2 = 1 is sought:

$$\min_{||v||_2 = 1} ||M v||_2 \qquad (32)$$

This optimization estimates an additional camera projection matrix for the current calibration.
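A sketch of the resection step. This variant of the direct linear transform eliminates the per-point scale factors λ_i (rather than keeping them as unknowns as in equation (31)) and solves for the twelve entries of P by singular value decomposition; X holds homogeneous vertexes and x the corresponding pixel coordinates, both placeholders.

import numpy as np

def resect_camera(X, x):
    # X: (N, 4) homogeneous vertexes, x: (N, 2) image coordinates, N >= 6.
    rows = []
    for Xi, (u, v) in zip(X, x):
        zero = np.zeros(4)
        rows.append(np.hstack((Xi, zero, -u * Xi)))   # u * (p3 . Xi) = p1 . Xi
        rows.append(np.hstack((zero, Xi, -v * Xi)))   # v * (p3 . Xi) = p2 . Xi
    M = np.array(rows)
    # Homogeneous least squares: singular vector of the smallest singular value.
    _, _, vt = np.linalg.svd(M)
    return vt[-1].reshape(3, 4)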

Bundle adjustment refines the current calibration by the solution of a non-linear optimization problem. Estimates of the camera matrices for the currently included views are improved by this action. The optimization problem is

$$\min_{a_j,\, b_i} \sum_{i=1}^{n} \sum_{j=1}^{m} v_{ij} \cdot d(Q(a_j, b_i), x_{ij})^2 \qquad (33)$$

where n is the number of reconstructed vertexes, m the number of views, v_ij a visibility Dirac function, d(·, ·) a Euclidean distance function, Q(a_j, b_i) the predicted projection of point i on image j from the camera matrix and x_ij the actual image coordinate. The vectors a_j and b_i hold parameters for camera j and point i respectively. The terms d(Q(a_j, b_i), x_ij) that add up to the cost function are the so-called reprojection errors, which optimally should be small. The optimization problem is large, with m · dim(a_j) + n · dim(b_i) parameters, and it grows fast for additional views and vertexes. A solution is found by the Levenberg-Marquardt algorithm [41, 42].

For a function f : R^m → R^n, given an initial estimate p_0 ∈ R^m and a measurement vector x ∈ R^n, Levenberg-Marquardt finds the p^+ which minimizes

$$\epsilon^T \epsilon, \qquad \epsilon = x - f(p). \qquad (34)$$

An optimal minimum is found by the Gauss-Newton method, which calculates incremental update steps δ_p by solving the normal equations

$$J^T J\, \delta_p = J^T \epsilon \qquad (35)$$

where J is the Jacobian at the current point p and J^T J is the corresponding Hessian of ||ε||².

Sparse bundle adjustment [43] exploits the sparsity of the Jacobian matrices to improve the computational efficiency; documentation and an open source implementation are available online [44].
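A condensed sketch of the reprojection-error minimization in equation (33) using the Levenberg-Marquardt solver in SciPy. The parameterization is deliberately simplified (one full 3x4 projection matrix per view and one 3-vector per point, with no intrinsic decomposition) and all variable names are placeholders, so this only illustrates the structure of the cost function, not the sparse implementation [43, 44] used in the pipeline.

import numpy as np
from scipy.optimize import least_squares

def residuals(params, n_views, n_points, observations):
    # Unpack the stacked camera matrices and point coordinates.
    P = params[:n_views * 12].reshape(n_views, 3, 4)
    X = params[n_views * 12:].reshape(n_points, 3)
    Xh = np.hstack((X, np.ones((n_points, 1))))
    res = []
    for j, i, u, v in observations:            # view j sees point i at pixel (u, v)
        proj = P[int(j)] @ Xh[int(i)]
        res.extend([proj[0] / proj[2] - u, proj[1] / proj[2] - v])
    return np.array(res)

# observations: rows of (view index, point index, u, v); initial_params stacks the
# current camera matrices and vertex estimates from the projective calibration.
# result = least_squares(residuals, initial_params, method="lm",
#                        args=(n_views, n_points, observations))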

3.3.3 Upgrade calibration

To complete the calibration, firstly, a final refinement is computed using sparse bundle adjustment. At this stage a set of camera projection matrices, optimally one for every view, is estimated in the projective geometry. Secondly, the auto-calibration equations are to be solved for an upgrade of the geometric context from projective to metric. An estimate of the absolute dual conic, in its four by four dual quadric form ω*_∞, is found by solving

$$P_i\, \omega^*_\infty\, P_i^T \sim \omega^*_i = K_i K_i^T \quad \forall i. \qquad (36)$$

The unknown ω*_∞ is parameterized using a four by four symmetric matrix of ten unique elements,

$$\omega^*_\infty = \begin{pmatrix} q_1 & q_2 & q_3 & q_4 \\ q_2 & q_5 & q_6 & q_7 \\ q_3 & q_6 & q_8 & q_9 \\ q_4 & q_7 & q_9 & q_{10} \end{pmatrix}. \qquad (37)$$

For a general camera matrix,

$$K_i = \begin{pmatrix} \sqrt{k_1 - k_3^2 - \frac{(k_2 - k_3 k_5)^2}{k_4 - k_5^2}} & \frac{k_2 - k_3 k_5}{\sqrt{k_4 - k_5^2}} & k_3 \\ 0 & \sqrt{k_4 - k_5^2} & k_5 \\ 0 & 0 & 1 \end{pmatrix} \;\rightarrow\; \omega^*_i = K_i K_i^T = \begin{pmatrix} k_1 & k_2 & k_3 \\ k_2 & k_4 & k_5 \\ k_3 & k_5 & 1 \end{pmatrix} \qquad (38)$$

on which constraints are imposed to estimate the dual absolute conic. Let p_i denote the i:th row of a projection matrix and ω* a dual image of the absolute conic; then

$$\omega^*_{12} = 0 \;\rightarrow\; p_1\, \omega^*_\infty\, p_2^T = 0 \qquad (39)$$

$$\omega^*_{13} = 0 \;\rightarrow\; p_1\, \omega^*_\infty\, p_3^T = 0 \qquad (40)$$

$$\omega^*_{23} = 0 \;\rightarrow\; p_2\, \omega^*_\infty\, p_3^T = 0 \qquad (41)$$

$$\omega^*_{11} = \omega^*_{22} \;\rightarrow\; p_1\, \omega^*_\infty\, p_1^T - p_2\, \omega^*_\infty\, p_2^T = 0. \qquad (42)$$

Using this, a linear equation system A q = 0 in the ten parameters q = (q_1, ..., q_10)^T can be stated for every view. A is a four by ten matrix containing the obtained combinations of elements in P; with at least three views a solution for ω*_∞ can be found by singular value decomposition of A. The rectifying homography that will upgrade the geometry can then be determined and applied to the camera projection matrices and vertexes.

In order to find a robust rectifying homography, a RANSAC approach is used in the upgrade step. For every RANSAC iteration three sample views are selected from the sequence to determine a temporary homography. Inliers are counted by rectifying the full sequence and decomposing the camera projection matrices; recall that P = K[R|T]. The property of interest here is the principal point estimate in K, which can be compared to (W/2, H/2), the image center, where W and H are the width and height of the image in the coordinate reference system respectively.

The intrinsic parameters K and rotation R can be obtained from P by QR decomposition. For any square matrix of full rank it is possible to calculate a decomposition into the product of an orthogonal and an upper triangular matrix. Using Givens rotations [45], the method is to compute Q such that

$$Q^T M = \begin{pmatrix} \gamma & -\sigma \\ \sigma & \gamma \end{pmatrix}^T \begin{pmatrix} m_{11} & m_{12} \\ m_{21} & m_{22} \end{pmatrix} = \begin{pmatrix} r_{11} & r_{12} \\ 0 & r_{22} \end{pmatrix} = R. \qquad (43)$$

For computational examples and further analysis, refer to [46]. A threshold on the Euclidean distance between the estimated principal point and the true image coordinate is used to discriminate between inliers and outliers to the homography.
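A sketch of the inlier test used in the RANSAC upgrade step: each rectified camera matrix is decomposed with OpenCV's RQ-based routine and the estimated principal point is compared to the image center. P, W and H are assumed inputs; the distance threshold would be chosen empirically.

import numpy as np
import cv2

def principal_point_error(P, W, H):
    # Decompose P into K, R and the homogeneous camera center.
    K, R, c_h, *_ = cv2.decomposeProjectionMatrix(P)
    K = K / K[2, 2]                               # normalize so that K[2, 2] = 1
    u0, v0 = K[0, 2], K[1, 2]                     # estimated principal point
    return np.hypot(u0 - W / 2.0, v0 - H / 2.0)   # distance to the image center

# A view is counted as an inlier when principal_point_error(P, W, H) is below the threshold.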

3.4 Triangulation

Since the fundamental matrix is computed by means of a RANSAC estimation, a corresponding point pair will not exactly satisfy x'^T F x = 0. An equivalent consequence of the estimation is that the projecting rays from the camera centers through the corresponding points will not intersect when triangulating for depth calculation. The equation system

$$x = P X, \qquad x' = P' X \qquad (44)$$

will not have an exact solution. An exact triangulation is shown in Figure 14.

Figure 14: The ideal case for triangulation with intersection of the projecting rays. For a triangulation in practice the rays will not intersect; additional computing is necessary. Figure from [47].

In order to remedy the issue of non-intersecting rays, a triangulation method in analogy with the direct linear transform is proposed. The ideal triangulation equation (44) can be used to form a linear equation system AX = 0; eliminating the homogeneous scale factor yields
