
Master of Science Thesis in Electrical Engineering

Department of Electrical Engineering, Linköping University, 2016

Automatic Volume Estimation using Structure-from-Motion fused with a Cellphone’s Inertial Sensors


Master of Science Thesis in Electrical Engineering

Automatic Volume Estimation using Structure-from-Motion fused with a Cellphone’s Inertial Sensors

Marcus Fallqvist

LiTH-ISY-EX--17/5107--SE

Supervisor: Tommaso Piccini, tommaso.piccini@liu.se

isy, Linköpings universitet

Hannes Ovrén, hannes.ovren@liu.se

isy, Linköpings universitet

Christer Andersson, christer.andersson@escenda.com

Escenda Engineering AB

Examiner: Klas Nordberg, klas.nordberg@liu.se

isy, Linköpings universitet

Department of Electrical Engineering Linköping University

SE-581 83 Linköping, Sweden


Sammanfattning

The report shows how the volume of large-scale objects, namely gravel and stone piles, can be determined in an outdoor environment using a mobile phone's camera and internal sensors such as the gyroscope and accelerometer. The project was commissioned by Escenda Engineering with the motivation of replacing more complex and resource-demanding systems with a simple handheld instrument. The implementation uses, among other things, the common computer vision methods Kanade-Lucas-Tomasi point tracking, Structure-from-Motion and 3D carving, together with simple sensor fusion. The report shows that volume estimation is possible, but that the accuracy is limited by the sensor quality and a bias.


Abstract

The thesis work evaluates a method to estimate the volume of stone and gravel piles using only a cellphone to collect video and sensor data from its gyroscopes and accelerometers. The project is commissioned by Escenda Engineering with the motivation to replace more complex and resource-demanding systems with a cheaper and easier-to-use handheld device. The implementation features popular computer vision methods such as KLT tracking, Structure-from-Motion and Space Carving, together with some sensor fusion. The results imply that it is possible to estimate volumes up to a certain accuracy, which is limited by the sensor quality and a bias.


Acknowledgments

I would like to thank my examiner Klas Nordberg and my two supervisors at ISY, Tommaso Piccini and Hannes Ovrén. You all helped me get past several problems and always gave me constructive feedback. I would also like to acknowledge Christer Andersson at Escenda.

Linköping, April 2017 Marcus Fallqvist


Contents

Notation

1 Introduction
  1.1 Motivation
  1.2 Purpose
  1.3 Problem statements
  1.4 Delimitations

2 Theory
  2.1 Related Work
    2.1.1 Recognition and volume estimation of food intake using a mobile device
    2.1.2 Estimating volume and mass of citrus fruits by image processing technique
    2.1.3 Shape from focus
  2.2 Sensor Data and Sensor Fusion
    2.2.1 Gyroscope
    2.2.2 Accelerometer
    2.2.3 Bias Estimation
  2.3 Feature Point Tracker
  2.4 Pinhole Camera Model
    2.4.1 Intrinsic Calibration
    2.4.2 Epipolar Geometry
  2.5 Minimisation Methods
    2.5.1 RANSAC
    2.5.2 Improving RANSAC
  2.6 Structure-from-Motion
    2.6.1 Two-view reconstruction
    2.6.2 Perspective-n-point - adding a view
    2.6.3 Bundle Adjustment
  2.7 Volumetric Representation - Space Carving
    2.7.1 Object Silhouette and Projection
  2.8 Fusion of Camera and IMU Sensors & SfM Improvements
    2.8.1 Estimating IMU to Camera Transformation
    2.8.2 Global Positioning System Fused With SfM
    2.8.3 Scaling Factor Between CCS and WCS
    2.8.4 SfM Using a Hierarchical Cluster Tree

3 Method
  3.1 Sensors and Data Acquisition
  3.2 Tracking in Video Sequence
  3.3 OpenMVG SfM module
  3.4 Volumetric Representation - Space Carving
    3.4.1 Object Silhouette and Projection
    3.4.2 Removing Voxels Below the Groundplane
  3.5 Determining the Real Volume

4 Evaluation and Results
  4.1 Data Sets
  4.2 Sensor Fusion - Scale Estimation
    4.2.1 IMU Readings and Integration
    4.2.2 Result - Scale Factor
  4.3 Computer Vision - Volume Estimation
    4.3.1 Tracking module
    4.3.2 Structure-from-Motion
    4.3.3 Space Carving
    4.3.4 Result - Volume Estimation

5 Discussion
  5.1 Result
    5.1.1 Scale Estimation
    5.1.2 Space Carving
    5.1.3 Volume Estimation
  5.2 Method
  5.3 Future Work
    5.3.1 Rolling shutter calibration
  5.4 Wider Implications of the Project Work

6 Conclusions
  6.1 Volume Estimation
  6.2 Method Evaluation
  6.3 System Performance

Notation

Symbols

Notation        Meaning
$\mathbb{R}^N$  The set of real numbers in $N$ dimensions
$M$             A matrix
$\vec{y}$       A vector in 3D
$\bar{y}$       C-normalised homogeneous 2D point

Abbreviations

Abbreviation    Meaning
CS              Coordinate system
SfM             Structure-from-Motion
IMU             Inertial measurement unit

1 Introduction

1.1 Motivation

Volume estimation of large-scale objects such as stone and gravel piles at e.g. building sites and quarries is today done with complex and time-consuming methods, not to mention the man-hours and the high cost of the equipment used, such as laser camera systems and drones. In this thesis another approach is suggested, where the idea is to have an easy-to-access and relatively cheap instrument to determine the volume of such objects. The replacement equipment is a single modern handheld device, a tablet or a smartphone. The user only needs to gather data of the object of interest, without any prior knowledge of the technique and algorithms running in the background, and is still presented with the volume estimate in the end.

1.2 Purpose

The objective of this thesis is to develop a system which can accomplish the volume estimation automatically using a modern handheld device, and then to evaluate at what precision the volume estimation is made. The volume estimation is made using 3D reconstruction and inertial measurement unit (IMU) data, explained in section 2.2. These results are then compared with ground truth (GT) data. To calculate the volume of stone and gravel piles, 3D representations must be generated. In this thesis, Structure-from-Motion (SfM) representations of the scenes are generated, explained in section 2.6. The unknown scaling factor in these is determined with sensor fusion. This is done simultaneously with the video acquisition, i.e. the IMU recordings are logged whilst shooting the video of the object in the scene. The general workflow of the ideal system can be seen in figure 1.1.

1.3 Problem statements

The thesis answers the following questions:

• How can the volume of stone and gravel piles be estimated in an outdoor environment using a mobile phone and a computer?

• How can the method be evaluated?

• Is it possible to make the system usable in practice?

The answers to the formulated questions are presented in chapter 5, based on the results in chapter 4. The system is implemented in Windows with the methods described in chapter 3 and uses data recorded with a smartphone (Samsung S6). The results are compared to the true volume measured on site. The client Escenda Engineering AB considers the estimates viable for practical use if the error is < 10% of the real volume.

1.4 Delimitations

The system only needs to handle large-scale objects present in static outdoor scenes. The tracking method used requires that the object of interest has a lot of texture, which the stone and gravel piles have. Instead of streaming data to a server or workstation, the data is passed manually, and all computations are done offline and currently not communicated back to the smartphone, due to the time frame of the thesis work. Only one smartphone is used to gather data.

2 Theory

The goal of estimating the volume of stone and gravel piles using smartphone data, such as the video stream, can be achieved with different computer vision methods. In this chapter some of the most common 3D reconstruction and volume estimation methods are presented. In these methods, a 3D point cloud representation is generated from the video stream data.

With a 3D point cloud the volume can be calculated, but only in the units of the local camera coordinate system. To find the relation to the real world coordinate system (WCS), which has units in metres, the data needs to be scaled correctly. The idea explored in this thesis work is to use some of the smartphone's inertial measurement unit (IMU) sensors as additional input. In particular the accelerometer and gyroscope data are used, in the following referred to as the IMU data. The IMU data is collected in the sensor body frame, yet another CS, from now on referred to as the ICS. To correctly have this data represent the real-world movement, it is transformed to the real world CS (WCS), first with an initial rotation estimated from the first samples; this transformation is denoted $R^{I,W}_{t=0}$. This initial transformation to the WCS is estimated using the impact of gravity on the accelerometer y-axis, which points up, i.e. along the negative direction of gravity, as shown in figure 2.1, where the WCS is defined from the first image taken. As the IMU data and video are recorded, and while the sensor unit moves around the pile, a time-dependent rotation back to the original coordinate system is estimated. The other two reference frames, the ICS and the camera coordinate system (CCS), are defined as shown in figure 2.2.

2.1 Related Work

Similar studies have shown that it is possible to estimate volumes for predetermined types of objects, in particular smaller objects.


Figure 2.1: How the WCS is defined in the thesis.


2.1.1 Recognition and volume estimation of food intake using a mobile device

It has been shown that volume estimation of food plates can be made [19]. Here the user needs to identify the type of food, and a reference pattern is used in order to compute the unknown scale factor present in the 3D reconstruction. A dense point cloud together with Delaunay triangulation enables the volume to be calculated for each voxel.

2.1.2 Estimating volume and mass of citrus fruits by image processing technique

A study has been made in an attempt to facilitate the packaging process of citrus fruits. The goal was to remove the need for manual weighing of all gathered fruits before packaging. The method uses image processing to estimate the volume of each fruit. This is done by estimating the cross-sectional areas and the volume of each elliptical slice segment. The segments are generated from two cameras, mounted perpendicular to each other, capturing images simultaneously [16]. Using two cameras with exact mounting is not within the scope of this thesis and therefore this method cannot be used.

2.1.3 Shape from focus

Another approach is to utilise "shape from focus" in order to determine the distance to the target object. This method is based on a microscope with a camera taking still images. Such a setup has a shallow depth of field and relies on the internal camera parameters changing when the focus is shifted around on the object and to the surrounding scene [14]. This approach is deemed not stable enough, since the cellphone camera is designed to work in the opposite way of a microscope: it has a wide-angle lens, small sensors and no moving parts.

2.2 Sensor Data and Sensor Fusion

The obtained IMU data goes through some processing in order to estimate the rotations of the device; these steps are described below.

2.2.1 Gyroscope

The gyroscope measures the rotation $\vec{\omega}(t)$ of the device along three axes, in rad/s, as a function of time. The rotations measured are the yaw, pitch and roll shown in figure 2.2, expressed in the ICS. This response has a bias $\vec{b}_{gyro}$, which is a constant error for a particular recording. The bias also changes slowly over the lifespan of a sensor, and its variation increases with sensor usage and particularly with time [17]. With this in consideration, the response that compensates for the bias along each axis is:

$$\vec{\omega}_{adj,x}(t) = \vec{\omega}_x(t) - \vec{b}_{gyro,x}, \quad \vec{\omega}_{adj,y}(t) = \vec{\omega}_y(t) - \vec{b}_{gyro,y}, \quad \vec{\omega}_{adj,z}(t) = \vec{\omega}_z(t) - \vec{b}_{gyro,z}, \tag{2.1}$$

where $\vec{b}_{gyro}$ is the bias.

To find the time-dependent rotation from the ICS to the WCS, the response $\vec{\omega}_{adj}(t)$ formed from the components above is used. For a particular time interval between a sample gathered at $t$ and the previous sample at $t-1$, quaternions are formed for each sample by integrating $\vec{\omega}_{adj}(t)$. This is done iteratively for each sample; the quaternions are formed as described in [23], p. 69:

$$\vec{q}(t) = \vec{q}(t-1) + \frac{\Delta t}{2}\begin{pmatrix} -q_1(t-1) & -q_2(t-1) & -q_3(t-1) \\ q_0(t-1) & -q_3(t-1) & q_2(t-1) \\ q_3(t-1) & q_0(t-1) & -q_1(t-1) \\ -q_2(t-1) & q_1(t-1) & q_0(t-1) \end{pmatrix} \vec{\omega}_{adj}(t), \tag{2.2}$$

where $\Delta t$ is the time between the samples and the initial quaternion is $\vec{q}(0) = \begin{pmatrix}1 & 0 & 0 & 0\end{pmatrix}^T$. The quaternions are then normalised such that $\|\vec{q}(t)\| = 1$ to properly represent a normalised rotation axis and rotation angle. The quaternions are then converted to rotation matrices in order to be applied to vectors in $\mathbb{R}^3$. Such a rotation matrix $R_I(t)$ rotates the ICS at time instance $t$ back to the initial ICS at $t = 0$ and is constructed as [11]:

$$R_I(t) = \begin{pmatrix} q_0^2+q_1^2-q_2^2-q_3^2 & 2q_1q_2-2q_0q_3 & 2q_1q_3+2q_0q_2 \\ 2q_1q_2+2q_0q_3 & q_0^2-q_1^2+q_2^2-q_3^2 & 2q_2q_3-2q_0q_1 \\ 2q_1q_3-2q_0q_2 & 2q_2q_3+2q_0q_1 & q_0^2-q_1^2-q_2^2+q_3^2 \end{pmatrix}, \tag{2.3}$$

where $\vec{q}(t) = \begin{pmatrix}q_0 & q_1 & q_2 & q_3\end{pmatrix}$.
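To make the integration step concrete, the following sketch implements equations (2.1) to (2.3) in plain C++: the gyroscope bias is removed, the quaternion is updated sample by sample, normalised, and finally converted to a rotation matrix. It is only an illustration under stated assumptions; the function names, the fixed sample period and the use of std::array are not taken from the thesis implementation.

```cpp
// Minimal sketch of equations (2.1)-(2.3): gyro bias compensation, quaternion
// integration and conversion to a rotation matrix. Illustrative names only.
#include <array>
#include <cmath>
#include <vector>

using Vec3 = std::array<double, 3>;
using Quat = std::array<double, 4>;              // [q0 q1 q2 q3], scalar first
using Mat3 = std::array<std::array<double, 3>, 3>;

// Integrate bias-compensated gyro samples into an orientation quaternion, eq. (2.2).
Quat integrateGyro(const std::vector<Vec3>& omega, const Vec3& biasGyro, double dt)
{
    Quat q{1.0, 0.0, 0.0, 0.0};                  // q(0) = [1 0 0 0]^T
    for (const Vec3& raw : omega) {
        const Vec3 w{raw[0] - biasGyro[0],       // eq. (2.1): remove the gyro bias
                     raw[1] - biasGyro[1],
                     raw[2] - biasGyro[2]};
        const double h = 0.5 * dt;
        const Quat dq{                           // (dt/2) * S(q) * w, see eq. (2.2)
            h * (-q[1]*w[0] - q[2]*w[1] - q[3]*w[2]),
            h * ( q[0]*w[0] - q[3]*w[1] + q[2]*w[2]),
            h * ( q[3]*w[0] + q[0]*w[1] - q[1]*w[2]),
            h * (-q[2]*w[0] + q[1]*w[1] + q[0]*w[2])};
        for (int i = 0; i < 4; ++i) q[i] += dq[i];
        const double n = std::sqrt(q[0]*q[0] + q[1]*q[1] + q[2]*q[2] + q[3]*q[3]);
        for (double& qi : q) qi /= n;            // keep ||q|| = 1
    }
    return q;
}

// Convert the quaternion to the rotation matrix R_I(t) of eq. (2.3).
Mat3 quatToRotation(const Quat& q)
{
    const double q0 = q[0], q1 = q[1], q2 = q[2], q3 = q[3];
    Mat3 R{{{q0*q0+q1*q1-q2*q2-q3*q3, 2*(q1*q2-q0*q3),         2*(q1*q3+q0*q2)},
            {2*(q1*q2+q0*q3),         q0*q0-q1*q1+q2*q2-q3*q3, 2*(q2*q3-q0*q1)},
            {2*(q1*q3-q0*q2),         2*(q2*q3+q0*q1),         q0*q0-q1*q1-q2*q2+q3*q3}}};
    return R;
}
```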

2.2.2 Accelerometer

Similarly, the accelerometer data is affected by a bias which is introduced at the start of each recording:

$$\vec{a}_{adj}(t) = \vec{a}(t) - \vec{b}_{acc}, \tag{2.4}$$

where $\vec{a}(t)$ is the actual reading from the sensor, given in m/s$^2$, and $\vec{b}_{acc}$ is the accelerometer bias, of unknown magnitude and direction.

To express these accelerometer readings in the WCS, the dependency between $R^{I,W}_{t=0}$, $R_I(t)$ and the gravitation is as follows:

$$\vec{a}_{world}(t) = R^{I,W}_{t=0} R_I(t)\,\big(\vec{a}(t) - \vec{b}_{acc}\big) - \vec{g}, \tag{2.5}$$

where $\vec{b}_{acc}$ is the three-element bias vector and $\vec{g}$ is the gravitation. These transformed readings are then integrated twice and used to find the displacement and ultimately the unknown CCS scale, which is explained in more detail in section 3.5. The accelerometer readings are also used to find the initial IMU-to-world rotation $R^{I,W}_{t=0}$. This is done by using the first samples of accelerometer readings, which are logged while the user is standing still with the camera towards the object of interest. Here the gravity $\vec{g}$ will induce readings on all axes of the accelerometer, and the mean of these readings, $\vec{a}_{mean}$, is used to find a rotation that satisfies:

$$\vec{g} = R^{I,W}_{t=0}\,\vec{a}_{mean}, \tag{2.6}$$

where $\vec{g}$ is defined in the WCS as $[0, 9.82, 0]$.

There is more than one rotation that satisfies equation (2.6). This rotation ambiguity is not considered to be a problem, since the volume estimation does not depend on the choice of rotation; any rotation that satisfies equation (2.6) is used. Note that the ambiguity is only harmless when applying $R^{I,W}_{t=0}$ to $\vec{a}_{mean}$; applying different valid rotations to any other vector will generate different results.
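A small sketch of how one rotation satisfying equation (2.6) can be constructed is given below, using Rodrigues' formula to rotate the mean stationary accelerometer reading onto the direction of $\vec{g}$. It is illustrative only; the helper names are hypothetical and, as noted above, any additional rotation about $\vec{g}$ would satisfy (2.6) equally well.

```cpp
// Sketch: find an initial IMU-to-world rotation that aligns the direction of the
// mean stationary accelerometer reading with g = [0, 9.82, 0] (eq. (2.6)).
#include <array>
#include <cmath>

using Vec3 = std::array<double, 3>;
using Mat3 = std::array<std::array<double, 3>, 3>;

static Vec3 cross(const Vec3& a, const Vec3& b)
{
    return {a[1]*b[2]-a[2]*b[1], a[2]*b[0]-a[0]*b[2], a[0]*b[1]-a[1]*b[0]};
}
static double dot(const Vec3& a, const Vec3& b) { return a[0]*b[0]+a[1]*b[1]+a[2]*b[2]; }
static double norm(const Vec3& a) { return std::sqrt(dot(a, a)); }

// Rodrigues' formula: R = I + sin(theta)*K + (1 - cos(theta))*K^2, where K = [k]_x.
Mat3 initialImuToWorld(const Vec3& aMean)
{
    const Vec3 g{0.0, 9.82, 0.0};
    Vec3 k = cross(aMean, g);                        // rotation axis
    const double kn = norm(k);
    const double c = dot(aMean, g) / (norm(aMean) * norm(g));   // cos(theta)
    Mat3 R{{{1,0,0},{0,1,0},{0,0,1}}};
    if (kn < 1e-9) return R;                         // already aligned (anti-parallel case not handled)
    for (double& ki : k) ki /= kn;
    const double s = std::sqrt(std::max(0.0, 1.0 - c*c));       // sin(theta)
    const Mat3 K{{{0, -k[2], k[1]}, {k[2], 0, -k[0]}, {-k[1], k[0], 0}}};
    for (int i = 0; i < 3; ++i)
        for (int j = 0; j < 3; ++j) {
            double KK = 0.0;                         // element (i,j) of K^2
            for (int m = 0; m < 3; ++m) KK += K[i][m] * K[m][j];
            R[i][j] = (i == j ? 1.0 : 0.0) + s * K[i][j] + (1.0 - c) * KK;
        }
    return R;                                        // R aligns the direction of aMean with g
}
```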

2.2.3 Bias Estimation

There exist different methods to compensate for the sensor biases. The most common approach is to estimate the bias of the sensors over some seconds of time. This can only be done when the sensor is completely stationary.

Another method, estimating gyroscope and accelerometer bias based on GPS, has proven successful when used in a moving car. In the scenes used in this thesis the distance to the object of interest is on the scale of 10 m. Since these scenes are also outdoors this method might be applicable, given a GPS module of higher quality than the one present in the smartphone, or an external sensor [15].

2.3 Feature Point Tracker

The goal of the tracker is to produce a set of image feature points $y_k$ for a subset of images $k = 1, \dots, l$. The set of points is then used as input to a 3D reconstruction algorithm. The target family of objects, stone and gravel piles, has a high repeatability: for any given object at any time the texture, and consequently the video images, will be similar. With this in mind a local tracker operating on small pixel areas, with a high frame rate of 60 instead of the normal 30 frames per second (fps), is sufficient. Such a high fps means that the displacement between consecutive images is smaller. The tracker uses a track-re-track process for each pair of images by applying tracking in both directions. If a point is not correctly tracked both from the first image to the second and vice versa, it is removed in order to ensure robust tracking, as explained in more detail in section 3.2. The tracking method used is based on the corner detection method goodFeaturesToTrack [21]. The tracker also stores a unique ID for each point track which is used in a later step.

2.4 Pinhole Camera Model

The most common camera model in computer vision is the pinhole camera, which is also used in this thesis. It models the camera aperture as a point, with no lenses focusing the light. Unlike the smartphone camera, this simplified model has a CCS that does not use a discrete mapping of 3D to 2D points, i.e. a world point is not mapped to a discrete, numerated pixel. The pinhole camera model does not model any lens distortions either.

As a whole, the pinhole camera model is described as:

$$C \sim K(R \,|\, \vec{t}), \tag{2.7}$$

where $C$ is the camera matrix, $K$ is an upper triangular $3\times 3$ matrix describing the intrinsic parameters, $R$ a $3\times 3$ rotation matrix and $\vec{t}$ a $3\times 1$ translation vector. The $\sim$ means that the equality holds only up to an unknown scale.

The projection of 3D points to 2D image points in homogeneous coordinates is then:

$$\hat{y}_k \sim C_{norm}\begin{pmatrix}\vec{x}_k\\ 1\end{pmatrix} = (R \,|\, \vec{t})\begin{pmatrix}\vec{x}_k\\ 1\end{pmatrix} = R\vec{x}_k + \vec{t}, \quad k = 1, \dots, n, \tag{2.8}$$

where $\hat{y}_k = \begin{pmatrix}\bar{y}_k\\ 1\end{pmatrix}$ and $C_{norm}$ is the C-normalised camera, defined as:

$$C_{norm} = (R \,|\, \vec{t}), \tag{2.9}$$

i.e. the camera pose with the dependency on the internal parameters $K$ removed. Here $(R \,|\, \vec{t})$ represents a rigid transformation in the C-normalised CCS from the previous camera pose to the current one [11]. Essentially this describes the camera's position and orientation in the local C-normalised CCS where the 3D reconstruction will be made. This means that the reconstruction is defined in a Euclidean space where the camera poses can be related to each other.

Finally, the projection operator $P(C_k, \vec{x}_j)$ is defined in equation (2.10):

$$P(C, \vec{x}_j) = \frac{1}{w_j}\begin{pmatrix}u_j\\ v_j\end{pmatrix}, \quad \text{where} \quad \begin{pmatrix}u_j\\ v_j\\ w_j\end{pmatrix} = C\vec{x}_j. \tag{2.10}$$
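As a small illustration of the projection operator in equation (2.10), the sketch below applies a 3x4 camera matrix to a homogeneous 3D point and divides by the third coordinate. The types and names are assumptions, not part of the thesis code.

```cpp
// Sketch of the projection operator P(C, x) in eq. (2.10).
#include <array>

using Vec2  = std::array<double, 2>;
using Vec4  = std::array<double, 4>;                 // homogeneous 3D point [x y z 1]
using Mat34 = std::array<std::array<double, 4>, 3>;  // 3x4 camera matrix C

Vec2 project(const Mat34& C, const Vec4& x)
{
    double uvw[3] = {0.0, 0.0, 0.0};
    for (int i = 0; i < 3; ++i)
        for (int j = 0; j < 4; ++j)
            uvw[i] += C[i][j] * x[j];                // (u v w)^T = C x
    return {uvw[0] / uvw[2], uvw[1] / uvw[2]};       // divide by w, eq. (2.10)
}
```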

2.4.1 Intrinsic Calibration

Each smartphone camera has unique properties which must be modelled and estimated in order to produce robust results. The intrinsic parameter matrix K can be estimated offline by the method explained in [11]. An implementation of this estimation can be found in the Matlab Camera Calibration Toolbox [1].


It uses a set of images of a chessboard of known size, captured with a fixed focal length of the cellphone, i.e. fixed focus on the camera in order to form constraints on K. In the beginning of each scene recording the camera focus is set and fixed to simulate the same conditions as when the camera was calibrated. This is done using Google’s camera API in the smartphone by disabling the autofocus and focusing roughly on the chessboard.

2.4.2 Epipolar Geometry

The geometry relating two camera views depicting the same static scene can be described using epipolar geometry. The goal is to relate how a 3D point is projected into two camera views. Consider the simplest formulation of the problem, where image points $y_1$ in view 1, with camera matrix $C_1$, and $y_2$ depicted in $C_2$ are known. These points are corresponding points if they both are projections of the same 3D point $x$.

The projection line $L_{y_1}$ then contains all possible projections of $x$ into $C_1$. The epipolar point $e_{12}$ is defined as the projection of the second camera centre into $C_1$. The epipolar line $l$ is then the line between this epipole and the projection of $L_1$, described in more detail in chapter 9 of [11]. Two image points, $y_1$ and $y_2$, need to fulfil the epipolar constraint to be corresponding points:

$$0 = y_1 \cdot l_1 = y_1^T l_1 = y_1^T F y_2, \tag{2.11}$$

where $F$ is the fundamental matrix which describes the epipolar constraint. In the uncalibrated case, i.e. when the two camera matrices are not known, $F$ is estimated from $y_0$ and $y_1$, a set of corresponding image points that have been tracked from the first to the second image [22]. A simple error cost function that minimises the geometric error over the points and epipolar lines can be defined as [11]:

$$\epsilon = \sum_{j=1}^{n} d_{pl}(y_{0j}, l_{0j})^2 + d_{pl}(y_{1j}, l_{1j})^2, \tag{2.12}$$

where $n$ denotes the number of corresponding points in the views, $d_{pl}$ the point-to-epipolar-line distance, $y_{0j}$ and $y_{1j}$ are points in the first and second view, and $l_{0j}$ and $l_{1j}$ are the epipolar lines:

$$l_0 \sim F y_1, \quad l_1 \sim F^T y_0. \tag{2.13}$$

The minimisation is done with a non-linear optimisation methodology and consequently minimises the point-to-epipolar-line distance [9]. The algorithm, its implementation and solution details can be found in OpenMVG's documentation [18].

One problem remains when $F$ is found: there exists an infinite set of cameras consistent with $F$. To determine the relative camera positions the problem must be solved for calibrated cameras, and this is where the essential matrix $E$ is introduced, further explained in section 2.6.1.


2.5 Minimisation Methods

A common tool used in computer vision for robust estimation in minimisation problems is the so-called Random Sample Consensus (RANSAC) method [5].

2.5.1 RANSAC

RANSAC is effectively used in estimation problems where outliers occur, which is the case for example when corresponding points need to be found. A RANSAC scheme is run for a finite amount of time in order to find a subset which, with the highest possible probability, contains only inliers, and then a model is estimated from this subset. This subset is denoted the consensus set C, with a corresponding estimated model M. The RANSAC scheme is an iterative process and it is run for a predetermined number of iterations or until C is sufficiently large. A general RANSAC scheme is roughly described below.

Starting from the full data set, a smaller subset T is randomly chosen in each iteration. From T, M is estimated; even if this subset contained only inliers, M would fit most, but not all, of the remaining inliers in the data set, since there is also an inaccuracy in the inliers.

For each RANSAC iteration, $C_{est}$ is formed, consisting of all points in the data set that fit the estimated model M. To determine whether a point fits the model, a cost error function $\epsilon$ is defined. For example, if the estimated model is a line, $\epsilon$ is the geometric distance between a point y in the full data set and the line currently estimated from T. If $\epsilon$ for this y is below some threshold t, the point is considered an inlier and consequently added to $C_{est}$ for this iteration of T [9].

Initially C is empty, but it is replaced whenever the current consensus set $C_{est}$ is larger than C. The corresponding estimated M is then used until a larger $C_{est}$ is found or until the scheme is finished. As an optional but common final step, a last estimation can be run on $C_{est}$ to improve M even further, excluding any potential outliers.
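The following sketch shows the general RANSAC scheme described above on the line example: a minimal sample of two points is drawn, a line model is estimated, the consensus set is formed from all points within the threshold t, and the largest consensus set wins. It is a minimal illustration in plain C++; the iteration count and the random sampling strategy are placeholders.

```cpp
// Minimal RANSAC sketch for line fitting: draw a minimal sample, estimate a model,
// count inliers within threshold t, and keep the model with the largest consensus set.
#include <cmath>
#include <cstdlib>
#include <vector>

struct Point { double x, y; };
struct Line  { double a, b, c; };            // ax + by + c = 0, with a^2 + b^2 = 1

static Line lineFromTwoPoints(const Point& p, const Point& q)
{
    double a = q.y - p.y, b = p.x - q.x;
    const double n = std::sqrt(a*a + b*b);
    a /= n; b /= n;
    return {a, b, -(a*p.x + b*p.y)};
}
static double pointLineDistance(const Line& l, const Point& p)
{
    return std::fabs(l.a*p.x + l.b*p.y + l.c);
}

Line ransacLine(const std::vector<Point>& pts, int iterations, double t)
{
    Line best{1.0, 0.0, 0.0};
    size_t bestInliers = 0;
    if (pts.size() < 2) return best;
    for (int it = 0; it < iterations; ++it) {
        const Point& p = pts[std::rand() % pts.size()];   // minimal sample T (2 points)
        const Point& q = pts[std::rand() % pts.size()];
        if (p.x == q.x && p.y == q.y) continue;
        const Line model = lineFromTwoPoints(p, q);
        size_t inliers = 0;                               // consensus set C_est
        for (const Point& r : pts)
            if (pointLineDistance(model, r) < t) ++inliers;
        if (inliers > bestInliers) {                      // keep the largest consensus set
            bestInliers = inliers;
            best = model;
        }
    }
    return best;   // optionally refit on the final inlier set as a last step
}
```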

2.5.2 Improving RANSAC

The OpenMVG system uses a different approach in its minimisation, called a contrario RANSAC (AC-RANSAC). The main idea of this approach is to find the model which best fits the data, adapting automatically to any noise level. AC-RANSAC looks for a C with a controlled Number of False Alarms (NFA), i.e. a bound on models that could have been generated by chance, further explained in the article [13]. NFA is a function of M and the hypothesised number of inliers in the data set. Using NFA, a model M is considered valid if:

$$NFA(M) = \min_{k = N_{sample}+1, \dots, n} NFA(M, k) \leq \epsilon. \tag{2.14}$$

The only parameter for AC-RANSAC is consequently $\epsilon$, which is usually set to 1. The AC model estimation finds $\arg\min_M NFA(M)$ among all models M; the NFA is minimised instead of maximising the inlier count, and this is the task of AC-RANSAC. A model is eventually returned if it provides at least $2 \cdot N_{sample}$ inliers.

AC-RANSAC replaces ordinary RANSAC here because of its improvements: it removes the need to set a globally fixed threshold for each parameter, which must be done when running the RANSAC algorithm. Instead the thresholds are adapted to the input data, automatically by the nature of AC-RANSAC. For example, it is used to estimate E with an adaptive inlier threshold t. How this specific AC-RANSAC scheme is used is explained below in section 2.6 and in the method chapter 3.

2.6 Structure-from-Motion

In 3D reconstruction by Structure-from-Motion (SfM), a representation of a scene is determined up to an unknown scaling factor. The final results from SfM are camera poses, denoted $C$, and 3D points, denoted $\vec{x}$, which represent the structure found in the scene in the video. The coordinates of $C$ and $\vec{x}$ are given in the C-normalised local CCS, i.e. estimated in the normalised camera coordinate system with the dependency on $K$ removed. In the local CCS the units are unknown, and to estimate the real volume, $\vec{x}$ must be related to the WCS so that the coordinates represent real units, e.g. metres.

The camera pose matrices relate to each other through the three matrices explained in section 2.4, equation (2.7). The 3D points $\vec{x}$ are generated from $y_k$ using triangulation. The other algorithms used in SfM and the basic steps of a general SfM pipeline are briefly described below.

2.6.1 Two-view reconstruction

The first step of the 3D reconstruction is to find the camera poses relating the initial pair of images; with that information the first 3D points can be triangulated.

A first solution can be found by solving the epipolar geometry problem relating two images of the same scene. Using $F$, the relative position between the two cameras can be found in calibrated camera coordinates. The essential matrix $E$ is introduced and is computed as [11]:

$$E = K^T F K. \tag{2.15}$$

It is essentially the fundamental matrix defined in C-normalised coordinates. With it, the infinite solution space is narrowed down to a set of four camera poses. Using a corresponding point pair, a 3D point $\vec{x}$ is triangulated in a CCS centred in one of the cameras, $C_1$. If the last coordinate of $\vec{x}$ is above 0, the 3D point is in front of $C_1$. To determine whether this is the case for both cameras, $\vec{x}$ must be transformed to the CCS of $C_2$. This is done by the relation $\vec{x}' = R\vec{x} + \vec{t}$. For one of the four candidate poses the last coordinate of $\vec{x}'$ will also be above 0, which means that the 3D point is in front of both cameras, and consequently that solution is used. A common method for two-view triangulation is the mid-point method, described in section 14.1.1 in [11]. The exact methodologies of a contrario $E$ estimation, triangulation and pose estimation are described in the OpenMVG documentation [12]. With the camera-centred CS in Euclidean space the translation vector has a determined direction but not length. Here $\vec{t}$ is set to have unit length and consequently scales the reconstructed scene. With this assumption, the relation between the WCS and the CCS can be found by approximating the distance between consecutive camera poses with IMU data. This unknown scaling factor from CCS to WCS is from here on denoted $s_{C2W}$, and how it is generated is described in section 2.8.3.

2.6.2 Perspective-n-point - adding a view

After the first two poses are found using the methodology explained in section 2.6.1, each newly added view needs to be handled. For the next view, some $n$ known C-normalised image points $\bar{y}_j$, $j = 1, \dots, n$, can be tracked to the previous image.

Since the corresponding 3D points are known for these new image points, the camera pose can be estimated for the new view. For each newly added view, $C$ can be found by solving the PnP problem.

Since the data may contain noise, the geometric minimisation of the PnP estimation problem is formulated for one camera pose as:

$$\epsilon_{PnP,Geo} = \sum_{j=1}^{n} d_{pp}(\hat{y}_j, \hat{y}'_j)^2, \quad \text{where} \quad \hat{y}'_j = P(C, \vec{x}_j), \tag{2.16}$$

where $P(C, \vec{x}_j)$ is explained above in equation (2.10). This is then minimised over $R \in SO(3)$ and $\vec{t} \in \mathbb{R}^3$. RANSAC, i.e. the algorithm that handles data with outliers described in section 2.5, is used to find a robust P3P solution. The PnP solution is then minimised over the inlier set, i.e. only the points which are reprojected within some threshold [11].

2.6.3 Bundle Adjustment

Bundle adjustment (BA) is a process used when a new pose has been added. On these occasions new 3D points are added using a RANSAC triangulation scheme for the given poses. The overall solution of these 3D points and camera poses is not ideal, i.e. the re-projection error of all 3D points into all camera poses is not zero. Using BA, all parameters are refined and bad points are flushed out. The goal is to minimise the reprojection error for all $\vec{x}_j$. This is done by having the BA algorithm change the parameters $(R_k, t_k)$ and $\vec{x}_j$. By varying the parameters of the poses and the positions of the 3D points $\vec{x}_j$, BA tries to minimise the reprojection error:

$$\epsilon_{BA,l} = \sum_{k=1}^{l}\sum_{j=1}^{n} W_{kj}\, d_{pp}\big(y_{kj}, P(C_k, \vec{x}_j)\big)^2, \tag{2.17}$$

where $l$ is the current number of camera poses, $n$ is the current number of 3D points, $C_k$ is the camera matrix for pose $k$ and $d_{pp}$ is the point-to-point distance between $y_{kj}$ and the re-projected point $C_k\vec{x}_j$. Essentially the overall projection error, i.e. the distance between the observed 2D coordinate and the re-projection of the known 3D point, is minimised over all views in which the point is seen. $W_{kj}$ is a visibility function which tells whether a 3D point $\vec{x}_j$ is visible in pose $k$. This optimisation is done using a non-linear optimisation methodology, e.g. non-linear least squares.
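To make the cost in equation (2.17) concrete, the sketch below evaluates the total squared reprojection error over all poses and visible points. Only the cost evaluation is shown; the actual bundle adjustment would hand this residual to a non-linear least-squares solver, which is outside the scope of the sketch, and the data layout is an assumption.

```cpp
// Sketch of the bundle-adjustment cost in eq. (2.17): sum of squared distances
// between observed points y_kj and reprojections P(C_k, x_j) where W_kj = 1.
#include <array>
#include <vector>

using Vec2  = std::array<double, 2>;
using Vec4  = std::array<double, 4>;                 // homogeneous 3D point
using Mat34 = std::array<std::array<double, 4>, 3>;  // camera matrix C_k

static Vec2 project(const Mat34& C, const Vec4& x)
{
    double uvw[3] = {0, 0, 0};
    for (int i = 0; i < 3; ++i)
        for (int j = 0; j < 4; ++j) uvw[i] += C[i][j] * x[j];
    return {uvw[0] / uvw[2], uvw[1] / uvw[2]};
}

double reprojectionError(const std::vector<Mat34>& cameras,             // l poses
                         const std::vector<Vec4>& points,               // n 3D points
                         const std::vector<std::vector<Vec2>>& obs,     // y_kj
                         const std::vector<std::vector<bool>>& visible) // W_kj
{
    double err = 0.0;
    for (size_t k = 0; k < cameras.size(); ++k)
        for (size_t j = 0; j < points.size(); ++j) {
            if (!visible[k][j]) continue;
            const Vec2 y = obs[k][j];
            const Vec2 p = project(cameras[k], points[j]);
            const double dx = y[0] - p[0], dy = y[1] - p[1];
            err += dx*dx + dy*dy;                    // d_pp(y_kj, P(C_k, x_j))^2
        }
    return err;
}
```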

2.7 Volumetric Representation - Space Carving

Some different volume estimation methods have been presented in section 2.1. The one used in the thesis is called Space Carving. By using the set of camera matrices produced for all views and refined by BA, the volume can be calculated in the CCS up to an unknown scale.

The Space Carving method produces a set of voxels whose volume is easily calculated, since the block consists of voxels of predetermined size. This voxel block is a representation of the reconstructed scene, the volume of which is determined by the methods described in [6].

The first step is to find suitable limits in each dimension for the voxel block in order to initiate it and also to determine its resolution. The boundary dimensions are initially limited by a relation to the actual positions of the camera centres. In the current implementation the initial boundary is set to 75% of each dimension spanned by the minimum and maximum positions of the camera centres.

The boundaries are then also limited by the camera frustums. The camera frustum is the space depicted by a camera, and all of the camera frustums are used to limit and offset the voxel block height-wise. The viewing direction of the camera with index $k$ is found as:

$$\bar{n}_k = \vec{Y} / \bar{Y}, \tag{2.18}$$

where $\vec{Y} = R_{k,r3}$ and $R_{k,r3}$ is the third row of the rotation matrix of camera $k$, $\bar{Y}$ is the norm of $\vec{Y}$, and $\bar{n}_k$ is consequently the viewing direction of camera $k$.

With these boundaries known, finally a voxel block is initiated, where the number of voxels is set to a predetermined number.

2.7.1 Object Silhouette and Projection

The next step is to project all voxels in the block into each image to determine whether they belong to the object or not. The images used here are binary masks, representing the object and its silhouette. A challenge here is to find a suitable segmentation method for the object of interest. In this thesis work a simple colour segmentation is made, the implementation of which is described in chapter 3.


This binary mask is used to determine whether a voxel belongs to the object or not. This is done by projecting each voxel in the scene into the image. A single projection outside the binary mask means that it is removed from the voxel grid. The remaining voxels are then counted and used as the estimate of the volume, but still in CCS.
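A compact sketch of this carving loop is given below: each voxel centre is projected into every view and removed as soon as one projection misses the silhouette mask. The data layout, the handling of projections that fall outside the image and the names are assumptions made for the illustration.

```cpp
// Space-carving sketch: project every voxel centre into each binary silhouette mask
// and discard the voxel if any single projection lands outside the silhouette.
#include <array>
#include <vector>

using Vec4  = std::array<double, 4>;                 // homogeneous voxel centre
using Mat34 = std::array<std::array<double, 4>, 3>;  // camera matrix per view

struct Mask {                                        // binary silhouette image
    int width, height;
    std::vector<unsigned char> data;                 // row-major, non-zero = object
    bool isObject(int u, int v) const {
        if (u < 0 || v < 0 || u >= width || v >= height) return false;
        return data[v * width + u] != 0;
    }
};

std::vector<Vec4> carve(const std::vector<Vec4>& voxels,
                        const std::vector<Mat34>& cameras,
                        const std::vector<Mask>& masks)
{
    std::vector<Vec4> kept;
    for (const Vec4& x : voxels) {
        bool inside = true;
        for (size_t k = 0; k < cameras.size() && inside; ++k) {
            double uvw[3] = {0, 0, 0};
            for (int i = 0; i < 3; ++i)
                for (int j = 0; j < 4; ++j) uvw[i] += cameras[k][i][j] * x[j];
            const int u = static_cast<int>(uvw[0] / uvw[2]);   // projected pixel
            const int v = static_cast<int>(uvw[1] / uvw[2]);
            inside = masks[k].isObject(u, v);        // one miss removes the voxel
        }
        if (inside) kept.push_back(x);
    }
    return kept;   // CCS volume = kept.size() * voxel volume, see section 2.8.3
}
```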

2.8 Fusion of Camera and IMU Sensors & SfM Improvements

To determine $s_{C2W}$, the transformation from the local C-normalised CCS to the real-world CS described in section 2.6.1, the sensor data from the accelerometer (section 2.2.2) and the gyroscope (section 2.2.1) are used. To improve the precision of the data, the physical distance between the sensors can for example be estimated.

2.8.1 Estimating IMU to Camera Transformation

Sensor fusion can be used to improve translation and rotation estimates by finding the rigid transformation between the ICS and the CCS. This is a tedious and time-consuming process and normally requires additional equipment, but it would in the end improve the performance of the scale estimate.

In an attempt to improve robot navigation systems it has been shown that the sensor bias can be accurately obtained, as well as the metric scale and the transformation between IMU and camera. This is all done autonomously, using the visual information combined with sensor data from the gyroscope and accelerometer alone [10]. This has not been done in the thesis; instead the local CCS is approximated to be the same as the ICS. This is not an issue since the sensors are located in a rigid frame within a couple of centimetres of each other.

2.8.2 Global Positioning System Fused With SfM

With a stable, high-performing GPS, the SfM cost function can be modified so that it punishes camera centres drifting too far away from their corresponding geotags. This would be similar to an algorithm that performs robust scene reconstruction where noisy GPS data is used for camera centre initialisation [3]. The GPS module of the smartphone used does not have such precision and this method was therefore not used.

2.8.3 Scaling Factor Between CCS and WCS

The unknown camera-to-world scale factor $s_{C2W}$ is most commonly deduced by introducing a reference object of known length in the scene. Here, this approach is only used to validate the performance of the system. Instead, $s_{C2W}$ is estimated by finding the relation between the real-world translations $t^W_{T,T+1}$ and the corresponding camera translations $t^C_{T,T+1}$ from time instance $T$ to $T+1$, for $T < 100$.


To find a given translation $t^W_{T,T+1}$ in the WCS between the keyframes at time instances $T$ and $T+1$, the positions $p(T+1)$ and $p(T)$ for these two time samples must be found. In order to correctly integrate the acceleration to position, the initial velocity is assumed to be zero. With this assumption in mind, $p(T)$ is found by double integration of the acceleration:

$$p(T) = \int_0^T\!\!\int_0^T \vec{a}_{world}(t)\, dt\, dt, \tag{2.19}$$

where $\vec{a}_{world}(t)$ is defined in equation (2.5). The translation vectors are then found as:

$$t^W_{T,T+1} = p(T+1) - p(T). \tag{2.20}$$

The corresponding vector $t^C_{T,T+1}$ in the CCS is simply found as the vector between the two camera centres representing $T$ and $T+1$. In this implementation the scale $s_{C2W}$ is calculated as the mean of the norm ratios:

$$s_{C2W} = \frac{1}{100}\sum_{k=1}^{100} \frac{\|t^W_{k-1,k}\|}{\|t^C_{k-1,k}\|}. \tag{2.21}$$

With $s_{C2W}$ known, the volume in the 3D scene can be converted into the real volume estimate. The result of the Space Carving explained above in section 2.7 is a voxel cloud representing the object in the CCS. By counting the number of voxels and using the known resolution of said voxels, the real-world volume can finally be calculated as:

$$V = N_{tot}\, V_{voxel}\, s_{C2W}^3, \tag{2.22}$$

where $N_{tot}$ is the total number of voxels occupied after the Space Carving, $V_{voxel}$ is the volume per voxel given in the CCS, and $s_{C2W}$ is calculated by the relation in equation (2.21).
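The sketch below combines equations (2.21) and (2.22): the scale is the mean ratio between IMU-derived and SfM-derived keyframe translations over the first 100 pose pairs, and the metric volume is the voxel count times the voxel volume times the cubed scale. The inputs are assumed to be precomputed, and the function names are illustrative.

```cpp
// Sketch of eq. (2.21) and eq. (2.22): scale from translation ratios, then the
// real-world volume from the carved voxel count.
#include <array>
#include <cmath>
#include <cstddef>
#include <vector>

using Vec3 = std::array<double, 3>;

static double dist(const Vec3& a, const Vec3& b)
{
    const double d0 = a[0]-b[0], d1 = a[1]-b[1], d2 = a[2]-b[2];
    return std::sqrt(d0*d0 + d1*d1 + d2*d2);
}

// s_C2W from eq. (2.21), using at most the first nPairs keyframe pairs.
double scaleCcsToWcs(const std::vector<Vec3>& worldPos,    // p(T) in metres (IMU)
                     const std::vector<Vec3>& cameraPos,   // camera centres in CCS (SfM)
                     std::size_t nPairs = 100)
{
    double sum = 0.0;
    std::size_t used = 0;
    for (std::size_t k = 1; k < worldPos.size() && k < cameraPos.size() && used < nPairs; ++k) {
        const double tW = dist(worldPos[k], worldPos[k-1]);    // ||t^W_{k-1,k}||
        const double tC = dist(cameraPos[k], cameraPos[k-1]);  // ||t^C_{k-1,k}||
        if (tC > 0.0) { sum += tW / tC; ++used; }
    }
    return used ? sum / used : 0.0;
}

// V = N_tot * V_voxel * s^3 from eq. (2.22).
double realWorldVolume(std::size_t nVoxels, double voxelVolumeCcs, double sC2W)
{
    return static_cast<double>(nVoxels) * voxelVolumeCcs * sC2W * sC2W * sC2W;
}
```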

Scaling factor determined by reference object

In each scene a 1 m ruler, marked with distinct red tape at each end, is placed. The ruler's end points are reconstructed in the SfM process. By identifying these points, they can be used to determine a reference scale and evaluate the performance of the scale calculation. This reference scale is used as the ground truth and is from now on denoted $s^{GT}_{C2W}$. It is found by using the corresponding 3D points and the known length of the object:

$$s^{GT}_{C2W} = \frac{l}{\|X^W_{first} - X^W_{end}\|}. \tag{2.23}$$

Here, $l$ is the real-world length of the ruler specified in metres, $X^W_{first}$ is the first 3D point of the ruler, and $X^W_{end}$ the end 3D point of the ruler.


2.8.4 SfM Using a Hierarchical Cluster Tree

One problem of incremental SfM, explained in section 2.6, is scalability. Instead of incrementally adding one pose at a time, global SfM uses a hierarchical cluster tree with the leaves starting at each image in the video sequence. From here parallel processes can be started, with the purpose of local reconstruction among neighbouring poses. This also improves the error distribution over the entire data set, and the method is less sensitive to the initialisation and to drift. This method is particularly useful when the system should run in real time. The idea is to use a subset of the input images and feature points for each reconstructed node. These subsets subsume the entire solution and thereby reduce the computational complexity of the problem significantly compared to the traditional incremental pipeline [4]. There was not enough time to use this method in the thesis work and it was also not considered necessary, given the delimitations in section 1.4.

3 Method

To solve the problem introduced in chapter 1, namely automatic volume estimation, a system has been developed. Using the theory explained previously, this chapter describes the implementation and the solutions to the various sub-problems derived from the main goal of volume estimation.

The system consists of different subsystems which, for example, generate images and then communicate through text files. The software components are located on two separate hardware platforms. The first is the cellphone, which gathers a video stream and sensor data. The recordings are started when the user is standing still with the camera facing the object of interest. While recording the object, the user walks around it in a circle. The second hardware platform is a PC running separate programs: a tracker generating images and feature points, a Linux Virtual Machine running the 3D reconstruction, and finally a Matlab program running Space Carving and volume estimation. A detailed overview of the system's modules can be seen in figure 3.1.

Using the data obtained from the scene, the corresponding feature point tracks are generated, as described in section 3.2. These are then used in the Open Source Multiple View Geometry Library (OpenMVG) incremental SfM pipeline [18], which is described in section 3.3. In this step 3D points are generated by means of triangulation for the chosen views. By solving the PnP problem, each view is added incrementally. The corresponding camera poses are then refined by performing BA, where both the 3D points and the camera parameters of all given views are adjusted to minimise a global error. Finally the output of OpenMVG, i.e. the camera poses and 3D points, is used in Space Carving in order to calculate the volume of the object depicted in the scene, as described in section 3.4. The unknown scale of the scene is then found by means of sensor fusion in section 3.5.


3.1 Sensors and Data Acquisition

To record the video and the IMU data simultaneously, two applications are used. First the IMU data log is started, then the video recording, which is captured at 1080p and 60 fps. The user then fixes the focus on the object in order to mitigate any changes of the focal length induced by the autofocus of Google Camera. The result is a more robust internal camera matrix estimation.

The IMU-data logging is started from an initial position and is continuously logged while the user is walking around the object of interest. It is then stopped when the user has reached the start position (approximately). The IMU data sets are collected with a cellphone (Samsung S6) at a sample rate of 100 Hz.

The applications used are the standard Google Camera [8] and the Android version of Sensor Fusion [7], developed by Fredrik Gustafsson and Gustaf Hendeby at Linköping University.

Consequently there is an offset in capture time between the video frames and the sensor data, which also has to be synchronised due to the different sample rates. This is easily corrected by using the start time of the video recording $t_{camera}$, the start time of the IMU readings $t_{IMU}$, and the sample rates $f_{IMU}$ and $f_{camera}$. Simply put, there is an offset from a given frame timestamp to the corresponding IMU data. This data also needs to be transformed from the ICS to the WCS, using the methodology explained in section 2.2.
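A minimal sketch of this frame-to-IMU alignment is shown below, assuming a shared clock for the two start times, a uniform frame rate and no dropped frames (the limitations of these assumptions are discussed next). The function name and parameters are illustrative.

```cpp
// Sketch: map a video frame index to the nearest IMU sample index using the
// recording start times and the two sample rates.
#include <cmath>

long imuIndexForFrame(long frameIndex,
                      double tCameraStart,   // video start time [s], shared clock
                      double tImuStart,      // IMU log start time [s], shared clock
                      double fCamera,        // e.g. 60 Hz
                      double fImu)           // e.g. 100 Hz
{
    const double tFrame = tCameraStart + frameIndex / fCamera;  // frame timestamp
    return std::lround((tFrame - tImuStart) * fImu);            // nearest IMU sample
}
```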

Note that inaccuracies in $f_{IMU}$ and $f_{camera}$ are not accounted for, nor are any lost frames or rolling shutter artifacts. Such artifacts are induced by the different pixel readout times when the camera is in motion; for example, a long straight object registered over the entire camera chip would appear more distorted and skewed the faster the camera moves. Implementing e.g. rolling shutter correction might improve the tracking, but it was not within the scope of this thesis. The interpretation of the timestamps is also not considered, i.e. whether a timestamp refers to the start, middle or end of an exposure and how this affects the induced artifacts. Bias estimation of the IMU data is also not implemented, but ideally it should be estimated for each recording.

3.2 Tracking in Video Sequence

The tracker operates on downscaled 1080p images from a 60 fps video sequence and produces corresponding feature points. The implemented tracker uses the corner detector GoodFeaturesToTrack from OpenCV 2.4.12 [2] to generate 1100 corner points. A higher number means more robust tracking but also results in larger memory usage and a slower total run time. The found features are tracked frame-to-frame using calcOpticalFlowPyrLK, enabling tracking on multiple scale levels. The matched points are then re-tracked to ensure correct matching. The remaining points are also evaluated by a geometric constraint, by estimating F with a threshold of 1 pixel.

The designated goal is to produce image coordinates $y_k$ for a subset of images, the keyframes. Running on all frames would not be possible with the PC setup and would not produce a significantly better result, mainly because the tracked features are found with high confidence between each keyframe, and consequently estimating camera poses for every frame would yield the same 3D model.

A keyframe is generated when one of two conditions is fulfilled. The first is when the median displacement between the currently tracked points and the feature points $y_{k-1}$ of the last keyframe has reached a certain threshold, by default 5 px. The second is when the current number of points tracked from the last keyframe has dropped below a certain threshold, by default 30% of the number of initially generated points.
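The sketch below illustrates the track-re-track step with OpenCV, in the spirit of the tracker described here: corners are detected with goodFeaturesToTrack, tracked forward and backward with calcOpticalFlowPyrLK, and kept only if they return close to their starting position. The forward-backward threshold and the structure of the function are assumptions, not the thesis defaults.

```cpp
// Track-re-track sketch using OpenCV: forward and backward KLT tracking with a
// consistency check, as a simplified stand-in for the tracker described above.
#include <opencv2/opencv.hpp>
#include <vector>

std::vector<cv::Point2f> trackRetrack(const cv::Mat& prevGray, const cv::Mat& currGray)
{
    std::vector<cv::Point2f> prevPts, currPts, backPts;
    cv::goodFeaturesToTrack(prevGray, prevPts, 1100, 0.01, 5.0);   // corner detection
    if (prevPts.empty()) return {};

    std::vector<unsigned char> fwdStatus, bwdStatus;
    std::vector<float> err;
    cv::calcOpticalFlowPyrLK(prevGray, currGray, prevPts, currPts, fwdStatus, err);
    cv::calcOpticalFlowPyrLK(currGray, prevGray, currPts, backPts, bwdStatus, err);

    std::vector<cv::Point2f> kept;
    for (size_t i = 0; i < prevPts.size(); ++i) {
        if (!fwdStatus[i] || !bwdStatus[i]) continue;
        const float dx = prevPts[i].x - backPts[i].x;              // forward-backward error
        const float dy = prevPts[i].y - backPts[i].y;
        if (dx*dx + dy*dy < 1.0f) kept.push_back(currPts[i]);      // re-track consistency
    }
    return kept;   // a keyframe is emitted when the median displacement or the
                   // survivor count crosses the thresholds described above
}
```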

3.3 OpenMVG SfM module

Rather than implementing the methods described in sections 2.6.1 to 2.6.3, ready-made SfM modules can be used. In this implementation the system uses the OpenMVG SfM pipeline, which differs somewhat from the theory described in section 2.6: the RANSAC usage in two-view reconstruction and PnP is replaced with AC-RANSAC, explained in section 2.5.2. The inputs to the SfM module are the tracked points produced by the tracker together with the corresponding images, described above in section 3.2.

With known correspondences and image pairs, the system first solves the epipolar two-view problem of finding E to extract pose estimates. The AC-RANSAC scheme runs for a maximum of 4096 iterations in order to find a suitable model, i.e. the model with the minimal NFA. The number of iterations increases the chance of finding an M consisting of only inliers, but at a certain point the increased chance of finding a better model is negligible compared to the increased computation time required. OpenMVG uses the a contrario principle on the correspondence problem but also on the pose estimation. This again applies to AC-RANSAC, where the threshold for an inlier is computed for each view. This threshold is furthermore used as a confidence measure and in outlier rejection of newly triangulated tracks: any triangulated point which yields a larger reprojection error than this threshold is discarded [13].

The estimated camera poses and triangulated points are then exported from the C++ environment to text files. These are then used further down the system pipeline, in a Matlab environment. More exactly, the SfM data is used to generate and calculate the volume, as described in the next section, 3.4.

3.4 Volumetric Representation - Space Carving

The Space Carving according to the theory described in section 2.7 is implemented in Matlab. The first step is to find suitable limits in each direction for the voxel block in order to initiate it. These boundaries are found from a relation to the camera centres and their camera frustums, or viewing directions. With these boundaries known, a 3 million voxel block is initiated, where the number of voxels determines the resolution and consequently the computational load.


3.4.1 Object Silhouette and Projection

The silhouette is then found as a binary mask for every view. Since the target family of objects is in the grey colour spectrum, this is done by a simple colour segmentation. A pixel is deemed to represent the object and is set to 1 in this mask if it has RGB values between 30 and 120 in each channel, where the maximum value is 255.

At this stage the mask often includes pixels outside of the object. In order to remove such small remaining connected components, those below 15% of the image area are removed. The resulting binary mask is then dilated to fill any holes that might be present before it is finally used in the projection. A sketch of this segmentation step is given below.
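The OpenCV-based sketch below follows the steps described above: an intensity window of 30 to 120 in each channel forms the initial mask, components smaller than 15% of the image area are dropped, and the result is dilated. The kernel size and the contour-based filtering are assumptions of the sketch, not the thesis implementation.

```cpp
// Silhouette mask sketch: colour window segmentation, small-component removal
// and a final dilation, roughly as described in this section.
#include <opencv2/opencv.hpp>
#include <vector>

cv::Mat silhouetteMask(const cv::Mat& bgrImage)
{
    cv::Mat mask;
    cv::inRange(bgrImage, cv::Scalar(30, 30, 30), cv::Scalar(120, 120, 120), mask);

    // Keep only connected components covering at least 15% of the image area.
    std::vector<std::vector<cv::Point>> contours, keep;
    cv::Mat work = mask.clone();                       // findContours may modify its input
    cv::findContours(work, contours, cv::RETR_EXTERNAL, cv::CHAIN_APPROX_SIMPLE);
    const double minArea = 0.15 * bgrImage.rows * bgrImage.cols;
    for (const auto& c : contours)
        if (cv::contourArea(c) >= minArea) keep.push_back(c);

    cv::Mat cleaned = cv::Mat::zeros(mask.size(), CV_8UC1);
    cv::drawContours(cleaned, keep, -1, cv::Scalar(255), -1);     // filled regions

    // Dilate to close remaining holes before the voxel projection step.
    cv::dilate(cleaned, cleaned,
               cv::getStructuringElement(cv::MORPH_ELLIPSE, cv::Size(9, 9)));
    return cleaned;
}
```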

3.4.2 Removing Voxels Below the Groundplane

When the object has been formed, the ground plane is removed. In this implementation this is done by finding the ground plane manually from the existing 3D points, which are produced in the OpenMVG incremental SfM pipeline. First, two vectors which best correspond to the ground plane are chosen manually in the 3D point cloud. Their cross product is then the normal of the ground plane, and the equation of the plane can easily be found. The normalised Hessian normal form is then used to determine whether a point in the voxel block is above or below the ground plane; any point below the ground plane is removed. A sketch of this step follows below.
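The sketch below illustrates the ground-plane test in plain C++: the plane normal is the cross product of the two chosen vectors, and each voxel is kept only if its signed distance to the plane is non-negative. The choice of a point on the plane and the convention that the normal points towards the pile are assumptions of the sketch.

```cpp
// Ground-plane removal sketch: build the plane from two in-plane vectors and a
// point on the plane, then keep only voxels with non-negative signed distance.
#include <array>
#include <cmath>
#include <vector>

using Vec3 = std::array<double, 3>;

static Vec3 cross(const Vec3& a, const Vec3& b)
{
    return {a[1]*b[2]-a[2]*b[1], a[2]*b[0]-a[0]*b[2], a[0]*b[1]-a[1]*b[0]};
}
static double dot(const Vec3& a, const Vec3& b) { return a[0]*b[0]+a[1]*b[1]+a[2]*b[2]; }

std::vector<Vec3> removeBelowGround(const std::vector<Vec3>& voxels,
                                    const Vec3& v1, const Vec3& v2,   // in-plane vectors
                                    const Vec3& planePoint)           // a point on the plane
{
    Vec3 n = cross(v1, v2);                                  // plane normal
    const double len = std::sqrt(dot(n, n));
    for (double& ni : n) ni /= len;                          // normalised (Hessian) normal
    if (n[1] < 0.0) for (double& ni : n) ni = -ni;           // orient the normal upwards (assumed +y)
    const double d = -dot(n, planePoint);                    // plane: n . x + d = 0

    std::vector<Vec3> kept;
    for (const Vec3& x : voxels)
        if (dot(n, x) + d >= 0.0)                            // signed distance >= 0: above the plane
            kept.push_back(x);
    return kept;
}
```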

3.5 Determining the Real Volume

After the 3D points have been generated and filtered, a representative 3D model of the object is left. The estimated volume of this model, given in the CCS, must then be related to the WCS. The volume in the CCS is found after the projections for all views have been made and the ground plane has been removed. The volume of the object is simply the number of occupied voxels that are left, multiplied by the volume per voxel. With the span of the initial voxel block known on each axis, and the number of initial voxels known, the volume of each voxel is:

$$V_{vol} = \frac{1}{N_{init}}\, \|x_{axis}\|\, \|y_{axis}\|\, \|z_{axis}\|, \tag{3.1}$$

where $V_{vol}$ is the volume of each voxel given in the CCS, $N_{init}$ the initial number of voxels, and $\|x_{axis}\|$ the length of the voxel block along the x-axis (correspondingly for y and z).

The next step is to move from the CCS to the WCS. This is done by using the scale factor described in section 2.8.3. Matlab was used with the data generated by OpenMVG and the IMU data. In an attempt to suppress the impact of bias and noise in $\vec{a}_{world}(t)$, the scaling factor $s_{C2W}$ is only calculated over the first 100 poses.

The final estimated volume is consequently an estimate in real-world units. The estimated volumes of the GT data sets, together with the results from each module, are presented in the next chapter.

4 Evaluation and Results

Using the method described in the previous chapter, the system results are generated. In this chapter the results are presented and evaluated.

With the goal of having a simple experiment setup, the user only has to collect data and then run it through the system. The system inputs are the sensor data and a video sequence, and the final output to the user is the volume of the object. The results presented in this chapter mainly consist of two sub-results: the IMU processing, i.e. the scale estimation, and the computer vision module with the final volume estimation.

The chapter starts with the IMU and video data collection and then shows the subsystem results. The program is written in C++ and Matlab and executed on a PC running Windows 10 and a Linux Virtual Machine in order to run OpenMVG.

Each module has been tested separately and the data and corresponding results are presented below.

4.1 Data Sets

The full system is only evaluated on two stone and gravel piles with known ground truth volume. These sets of GT data were acquired with the company that has the main responsibility for transport to and from quarries around Linköping. Two different types of stone and gravel were recorded and each pile was recorded twice. The piles were then lifted onto a truck and weighed with an accuracy of ±50 kg, shown in table 4.1 below. The density of the stone piles was given by GDL transport and has an inaccuracy below 5%. The real volume is then calculated directly from the weight and density, and is presented as the GT volume in the same table.

Images from the recordings can be seen below. In figure 4.1 the first pile is shown, recorded while moving clockwise. Each image is roughly 1 second, or 60 frames, apart.

Pile-Scene   Weight          Density           GT Volume       Frames   Keyframes
1-1          38900 ± 50 kg   1700 ± 85 kg/m³   22.9 ± 1.1 m³   2545     1182
1-2          38900 ± 50 kg   1700 ± 85 kg/m³   22.9 ± 1.1 m³   2366     1133
2-1          20100 ± 50 kg   1700 ± 85 kg/m³   11.8 ± 0.6 m³   1879     817
2-2          20100 ± 50 kg   1700 ± 85 kg/m³   11.8 ± 0.6 m³   2037     923

Table 4.1: The first column denotes the pile and scene, the second the weight in kg, the third the density in kg/m³, the fourth the GT volume (weight divided by density), the fifth the number of frames in the entire video stream of that scene and the sixth the number of keyframes.

In figure 4.2 pile 2 is shown, also recorded clockwise. Both scenes have also been recorded counterclockwise. Keyframes denote the number of frames used for SfM, whilst all frames are used in the tracking module; how these keyframes are selected is described in section 3.2.

4.2 Sensor Fusion - Scale Estimation

The first major result is the estimated scale, which is compared to the reference scale from the ruler described in section 2.8.3. Below are the parts which are used to compute the scale.

4.2.1 IMU Readings and Integration

The theory described in section 2.2 and the implemented method regarding the IMU have been tested with simple recordings, i.e. by moving the cellphone in a rectangular shape and by walking back and forth. These results are not shown; instead the results on the GT scenes can be seen below. In figure 4.3 the acceleration data is shown for scene 1-2.

To make this data useful, the methods described in section 3.1 are used to transform it to the WCS and then remove the gravity. The corresponding result can be seen in the upper right corner of figure 4.3. To retrieve positions, the accelerometer data is double integrated into a position estimate of the cellphone using the methods described in section 3.5.

The intermediate integration, i.e. the velocity, is shown in the lower left corner, and the final estimate of the 2D position in the x-z plane, given in metres, is shown in the lower right corner of figure 4.3. Similarly, the same responses from scene 2-1 can be seen in figure 4.4 and for scene 2-2 in figure 4.5. In the ideal case these positions should form a circular, or at least elliptic, shape, since the user ends the recording at the initial position.

4.2.2 Result - Scale Factor

The reference ruler of 1 m is used to determine $s^{GT}_{C2W}$ for each scene. Since the user is at a different distance from it, and the 3D model is different, $s^{GT}_{C2W}$ differs between scenes.


Figure 4.1: Scene 1 collage with images from the corresponding video sequence.

Scene   Keyframes   $s^{GT}_{C2W}$   $s_{C2W}$
1-1     719         2.35 m           1.17 m
1-2     1133        1.35 m           1.10 m
2-1     817         1.41 m           3.98 m
2-2     923         2.25 m           1.27 m

Table 4.2: Results of the scale estimation for each scene. The first column denotes the scene, the second the number of keyframes used in the SfM 3D model generation, the third the GT scale and the fourth the estimated scale.

The result of this scale for the different scenes is presented in table 4.2. Using equation (2.21), the scale estimated from the IMU data is presented as $s_{C2W}$; the quality of these estimates is directly dependent on the IMU data.

Scale estimation when using all poses

The $s_{C2W}$ described above is estimated using only the first 100 poses. To show the influence of using all poses, the corresponding results for each scene have been generated and are shown below. Estimating the scale with this parameter change leads to a large impact of the bias and noise present in the IMU, see table 4.3.


Figure 4.2: Scene 2 collage with images from the corresponding video sequence.

Scene   $s_{C2W}$
1-1     60.26 m
1-2     2.67 m
2-1     48.27 m
2-2     17.52 m

Table 4.3: Results with all poses used for the metric scale calculation. The first column denotes the scene and the second the estimated scale.

4.3 Computer Vision - Volume Estimation

Here, the results from the second major subsystem are shown. The corresponding modules which produce the 3D model and volume estimation are also evaluated.

4.3.1 Tracking module

The tracking module has been tested on several piles besides the GT-data. Such tests have been performed by recording an object of interest and running the video sequence through the tracker and then evaluating the length of the point tracks. The results of this module are putative corresponding point tracks and a set of keyframes.

A typical tracking sequence is shown in figure 4.6, generated from scene 1. The current position of each tracked point (red dots) and its tracked path (blue tracks) are shown.


Figure 4.3: Accelerometer data from scene 1-2; x-axis in blue, y-axis in brown and z-axis in yellow. Upper left: raw acceleration data. Upper right: WCS acceleration. Lower left: velocity. Lower right: position in 2D; $s_{C2W}$ is estimated from the green dot to the black dot.

These tracks describe how the points have been displaced over time. The lengths of these vectors are used to evaluate the median displacement criterion described in section 3.2. If the criterion is fulfilled, new points are generated and added to the tracker.

4.3.2 Structure-from-Motion

The keyframes and corresponding point tracks, explained above in section 4.3.1, are used in OpenMVG's SfM pipeline. The outputs of this pipeline are a 3D point cloud and estimates of the camera poses. These camera poses are visualised for the GT data sets later in this section. The module results, 3D points and camera poses, have also been evaluated and verified using the results explained above, both on test data and GT data.

The estimated camera poses are then used in the Space Carving module to generate a solid volume. The typical SfM results generated can be seen below, with one image per scene. In figure 4.7 the result from data set 1-2 is shown from the side. In these images the 3D points are shown in white and the poses in green. In figure 4.8 the same scene is shown but in a top-view perspective. Similarly the result for pile 2-2 is shown in figure 4.9 from the side and in figure 4.10 in a top-view perspective.


Figure 4.4: Accelerometer data from scene 2-1, x-axis in blue, y-axis in brown and z-axis in yellow. Upper left: raw acceleration data. Upper right: WCS acceleration. Lower left: velocity. Lower right: position in 2D, s_C2W is estimated from the green dot to the black dot.

4.3.3 Space Carving

The output from the SfM module, in the form of camera poses, 3D points and images, is then used to form a solid object. This is done using the Space Carving method explained in section 3.4.

Initial Voxel Block

First the scene is initialised with a voxel block inside the camera frustums, see figure 4.11. In these images only a tenth of all the cameras are visualised in blue, with their viewing directions shown as well. This module has only been used on the GT-data, generating a total of four results. Using the silhouette for each camera pose, such as in figure 4.12, the voxel block is then carved image by image.
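As an illustration of the carving step, the sketch below projects each voxel centre into every view with the estimated pose and removes voxels that fall outside the object silhouette in any image. The projection model x ~ K(RX + t), the binary silhouette masks and the conservative treatment of voxels projecting outside the image are assumptions for illustration; the actual criteria are those of section 3.4.

```python
import numpy as np

def carve(voxel_centers, occupied, cameras, silhouettes):
    """Hedged sketch of silhouette-based Space Carving.
    voxel_centers: (N, 3) voxel centres in the CCS of the SfM reconstruction.
    occupied:      (N,) bool mask, True for voxels still in the block.
    cameras:       list of (K, R, t) with the pinhole model x ~ K (R X + t).
    silhouettes:   list of binary masks, True on the object."""
    pts_h = np.hstack([voxel_centers, np.ones((len(voxel_centers), 1))])
    for (K, R, t), sil in zip(cameras, silhouettes):
        P = K @ np.hstack([R, t.reshape(3, 1)])        # 3x4 projection matrix
        proj = pts_h @ P.T
        z = proj[:, 2]
        in_front = z > 0
        safe_z = np.where(in_front, z, 1.0)            # avoid division by zero behind the camera
        u = np.round(proj[:, 0] / safe_z).astype(int)
        v = np.round(proj[:, 1] / safe_z).astype(int)
        h, w = sil.shape
        inside = in_front & (u >= 0) & (u < w) & (v >= 0) & (v < h)
        on_object = np.zeros(len(pts_h), dtype=bool)
        on_object[inside] = sil[v[inside], u[inside]]
        # Remove voxels that project into the image but miss the silhouette;
        # voxels projecting outside the image are kept (conservative choice).
        occupied &= ~(inside & ~on_object)
    return occupied
```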

Ground Plane Segmented From Voxel Block

Below is the result when the ground plane is found, see figure 4.13, and removed from the voxel block, see figure 4.14.
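The ground plane detection follows section 3.4 and is not repeated here. As a minimal sketch of one common approach, the code below fits a plane to the SfM point cloud with RANSAC; treating the plane fit as RANSAC-based, as well as the iteration count and inlier threshold, are assumptions for illustration only. Voxels on the ground side of the fitted plane would then be removed from the block.

```python
import numpy as np

def fit_ground_plane(points, iters=500, inlier_thresh=0.02, seed=0):
    """Hedged sketch: RANSAC fit of a plane n.x + d = 0 to the 3D point cloud;
    the plane with the most inliers is assumed to be the ground."""
    rng = np.random.default_rng(seed)
    best_plane, best_inliers = None, 0
    for _ in range(iters):
        p0, p1, p2 = points[rng.choice(len(points), 3, replace=False)]
        n = np.cross(p1 - p0, p2 - p0)
        if np.linalg.norm(n) < 1e-9:
            continue                                   # degenerate (collinear) sample
        n = n / np.linalg.norm(n)
        d = -n @ p0
        inliers = np.count_nonzero(np.abs(points @ n + d) < inlier_thresh)
        if inliers > best_inliers:
            best_plane, best_inliers = (n, d), inliers
    return best_plane
```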

Final Voxel Block

The object volume is represented by, and calculated from, the voxels remaining after the Space Carving. Two typical results are shown in figures 4.15 and 4.16.


Figure 4.5: Accelerometer data from scene 2-2, x-axis in blue, y-axis in brown and z-axis in yellow. Upper left: raw acceleration data. Upper right: WCS acceleration. Lower left: velocity. Lower right: position in 2D, s_C2W is estimated from the green dot to the black dot.

4.3.4 Result - Volume Estimation

Finally, the volume estimation, based on the modules described above, can be generated. The relation between the estimated CCS volume and the real-world volume is the s_C2W presented above in section 4.2.2. In table 4.4 the results from the system on the GT-data are shown. The camera-to-world scale is labelled as s_C2W. Furthermore, the volumes in CCS after applying the Space Carving methodology explained in section 3.4 are listed for each scene. Those values are calculated from the space carved scenes, e.g. from scene 1-2 visualised in figure 4.15. The pose data, shown in e.g. figure 4.9, where green dots are the estimated camera positions, are used for the Space Carving projections. The orientation of the cameras can more easily be seen in e.g. figure 4.15. Finally this result is combined with the scaling factor and used to determine the WCS volume for each scene in cubic meters, labelled as V.
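Concretely, the tabulated WCS volumes are consistent with the relation

V = N_tot · V_voxel · s_C2W^3,

e.g. for scene 1-1: 1071515 · 6.75 · 10^-6 · 1.60 ≈ 11.6 m³, which matches V = 11.58 m³ in table 4.4 up to rounding.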

GT Volume Estimation

To evaluate the effect of s_C2W on the volume estimation separately, a GT scale estimation has been made. The GT scale, s_C2W^GT, is then used instead of s_C2W to calculate the volume; the results are shown in table 4.5.


Figure 4.6: Scene 2-2 tracked points' motion vectors. The green area is zoomed and shown in the right-hand image. In this small area the red dots have been tracked in a similar fashion, the results of which are shown by the blue motion vectors, indicating each tracked point's movement between tracked frames.

Figure 4.7: Scene 1-2 point cloud, side view. Green points are the estimated camera centers, white points are 3D points and each red curve represents a CCS coordinate axis.


Figure 4.8: Scene 1-2 point cloud, top view. Green points are the estimated camera centers, white points are 3D points and each red curve represents a CCS coordinate axis.

Figure 4.9: Scene 2-2 point cloud, side view. Green points are the estimated camera centers, white points are 3D points and each red curve represents a CCS coordinate axis.


Figure 4.10: Scene 2-2 point cloud, top view. Green points are the estimated camera centers, white points are 3D points and each red curve represents a CCS coordinate axis.

Figure 4.11: Scene 1-2 Space Carving with the initial voxel block in green and cameras in blue.


Figure 4.13: Scene 1-2 with 3D points and the estimated ground plane. The ground plane is the straight line below the green pile, and the blue tracks around the pile are the camera centers.


Figure 4.15: Scene 1-2 space carved with all views. Green is the Space Carving result and blue the cameras.

Figure 4.16: Scene 2-1 space carved with all views. Green is the Space Carving result and blue the cameras.


Scene   s_C2W^3    Kf     N_tot      V_voxel         GT Volume   V

1-1     1.60 m³    719    1071515    6.75 · 10^-6    22.88 m³    11.58 m³
1-2     1.33 m³    1133   713564     3.05 · 10^-5    22.88 m³    29.11 m³
2-1     63.04 m³   817    149411     1.38 · 10^-5    11.82 m³    130.05 m³
2-2     2.05 m³    923    1118972    6.74 · 10^-6    11.82 m³    15.40 m³

Table 4.4: The numeric results for each scene. First column is the scene, s_C2W^3 is the estimated camera-to-world scale cubed, Kf is the number of keyframes used, N_tot the number of voxels left after Space Carving, V_voxel the CCS volume of each voxel, GT Volume (weight divided by the density) and V the system volume estimation.

Scene   (s_C2W^GT)^3   GT Volume   V_GT

1-1     12.98 m³       22.88 m³    93.83 m³
1-2     2.46 m³        22.88 m³    53.82 m³
2-1     2.80 m³        11.82 m³    5.79 m³
2-2     11.39 m³       11.82 m³    85.66 m³

Table 4.5: The numeric results for each scene using s_C2W^GT. First column is the scene, (s_C2W^GT)^3 is the estimated GT camera-to-world scale cubed, GT Volume (weight divided by the density) and V_GT the system volume estimation using the GT scale.


5 Discussion

In this chapter the results and the implementation of the thesis work are discussed.

5.1 Result

The quality of the final result, the estimated volume, varies vastly between the scenes. This can be explained by two flaws which both affect the outcome directly: the estimation of s_C2W and the Space Carving.

This conclusion is based on the fact that the results from the tracker are sufficient for the goal of acquiring well-estimated poses. Consider e.g. figure 4.6, where the blue motion vectors indicate each tracked point's movement between frames. Examining a small area shows that the points have been tracked in a similar fashion. Furthermore, the 3D models, which contain camera poses and point clouds, have been reconstructed very well. They are similar to the real structure, see e.g. figures 4.7 to 4.10, and the camera poses correspond to the user movement around the objects of interest.

To draw any exact conclusions about the overall system performance, a larger amount of GT-data should be added. Preferably the GT-data should be located without similar objects in the background, but no such areas with isolated GT volumes were found. In an attempt to generate such GT-data, gravel piles in proximity to construction sites were filmed, but the exact volume of these was not known to the personnel on site. The surrounding area also had similar texture, e.g. trees and construction material, so this approach was not pursued. Currently the four scenes show a large variation in the final results, which makes it difficult to make any statement about the system performance.
