
Institutionen för systemteknik

Department of Electrical Engineering

Master's Thesis

Structure from Forward Motion

Master's thesis in computer vision carried out at Tekniska högskolan i Linköping

by

Fredrik Svensson

LiTH-ISY-EX--10/4364--SE

Linköping 2010

Department of Electrical Engineering, Linköpings tekniska högskola, Linköpings universitet


Structure from Forward Motion

Master's thesis in computer vision carried out at Tekniska högskolan i Linköping

by

Fredrik Svensson

LiTH-ISY-EX--10/4364--SE

Supervisor: Ognjan Hedberg, Autoliv Electronics AB

Examiner: Klas Nordberg, ISY, Linköpings universitet


Avdelning, Institution (Division, Department): Division of Computer Vision, Department of Electrical Engineering, Linköpings universitet, SE-581 83 Linköping, Sweden

Datum (Date): 2010-10-04

Språk (Language): English

Rapporttyp (Report category): Examensarbete (Master's thesis)

URL för elektronisk version: http://urn.kb.se/resolve?urn=urn:nbn:se:liu:diva-60136

ISRN: LiTH-ISY-EX--10/4364--SE

Titel (Title): 3D-struktur från framåtrörelse (Structure from Forward Motion)

Författare (Author): Fredrik Svensson


Abstract

This master thesis investigates the difficulties of constructing a depth map using one low resolution grayscale camera mounted in the front of a car. The goal is to produce a depth map in real-time to assist other algorithms in the safety system of a car. This has been shown to be difficult using the evaluated combination of camera position and choice of algorithms.

The main problem is to estimate an accurate optical flow. Another problem is to handle moving objects. The conclusion is that the implementations, mainly triangulation of corresponding points tracked using a Lucas Kanade tracker, provide information of too poor quality to be useful for the safety system of a car.



Acknowledgments

I would like to thank my supervisor Ognjan Hedberg for his support during my work and Fredrik Tjärnström for coming up with the problem formulation ideas. My thanks also go to Johan Hedborg for his valuable advice and Thomas Schön for showing interest in my work. Last but not least I would like to thank my examiner Klas Nordberg for his feedback on the report.


Contents

1 Introduction
  1.1 Background
  1.2 Objective
  1.3 Problem Constraints and Limitations
  1.4 Report Structure
2 Theory
  2.1 Good Features to Track
    2.1.1 The Aperture Problem
    2.1.2 The Harris Operator
    2.1.3 The Census Operator
  2.2 Optical Flow
    2.2.1 Lucas Kanade Tracking
    2.2.2 The Horn-Schunck Method
    2.2.3 Dynamic Programming Methods
    2.2.4 Maximally Stable Extremal Regions Tracking
  2.3 Epipolar Geometry
    2.3.1 The Essential Matrix
    2.3.2 The Normalized Eight-Point Algorithm
    2.3.3 The Random Sample Consensus Algorithm
    2.3.4 Triangulation
  2.4 Error Analysis
3 Approaches
  3.1 Point Correspondence and Triangulation between Two Frames
    3.1.1 Corresponding Points
    3.1.2 Estimating the Essential Matrix
    3.1.3 Estimating the Rotation and Translation
    3.1.4 Triangulation
  3.2 Bundle Adjustment
4 Implementations and Results
  4.1 Dense Optical Flow with KLT
    4.1.1 Using Parameters for Fast Computation
    4.1.2 Using Parameters for Accurate Computation
    4.1.3 Comparing the Two Parameter Sets
  4.2 Dense Optical Flow with Dynamic Programming
  4.3 Egomotion Estimation
  4.4 Sparse Optical Flow with KLT
    4.4.1 Initial Correspondences using the Census Operator
  4.5 Depth Estimation
  4.6 MSER Tracking
5 Summary
  5.1 Conclusions
  5.2 Future Work
Bibliography

Chapter 1

Introduction

This chapter presents the problem that is investigated in this thesis. The background and purpose of the work are described below, along with the limitations of the work. At the end of the chapter the report structure is presented.

1.1 Background

Autoliv Inc. is the world’s largest supplier of car airbags and seatbelts [6]. These and other safety systems use electronics developed by Autoliv Electronics AB [2]. Among other products, Autoliv Electronics AB in Linköping develops camera based products, for example Night Vision, which in night time conditions detects and tracks pedestrians, see Figure 1.1. Other products use two ordinary grayscale cameras as a stereo camera to get information about the 3D structure of the surrounding environment, i.e. the distance to pedestrians, cars, traffic signs etc. Instead of using a stereo camera, two consecutive image frames from a single camera can in principle be analyzed to retrieve approximately the same information.

Figure 1.1. The night vision system with pedestrian detection in action [2].


To determine the distance to detected objects, today’s vision based systems use object recognition or stereo cameras. A system that uses object recognition in combination with some assumed object size can only deal with the object types it is designed for. On the other hand, a stereo camera does not depend on that kind of training but is more expensive to produce. It would therefore be valuable if the distance to any kind of object could be determined by only using a single camera together with the limited computational resources available in a car, i.e. about one digital signal processor (DSP) and one field-programmable gate array (FPGA).

There exist other techniques, such as radar, that can measure the distance to objects on the road. On the other hand, cameras are increasingly often fitted in today's cars, and it is therefore interesting to be able to produce the same information using fewer sensors.

1.2 Objective

The objective of this master thesis is to evaluate if it is possible to create an algorithm that produces 3D information, in the form of a depth map, from an image sequence provided by a single camera mounted in the front of a car. This problem is often called “structure from motion”. The proposed algorithms should meet the demands of running in the real-time system with limited computational resources available in a car and produce measurements of such a high quality that they are useful for the safety systems in a car.

The test image sequences provided by Autoliv are recorded using a stereo camera but the algorithm is fed with images only from one of the cameras. As Autoliv provides a disparity map from the stereo camera, it is convenient, for example when it comes to evaluation, to let the resulting algorithm produce a depth map.

The algorithm is developed and evaluated using MATLAB.

1.3 Problem Constraints and Limitations

There is a lot of research going on in the area of structure from motion, which is an indication that there is no single obvious choice of solution. Therefore the work will focus on evaluating a couple of distinct approaches. The limited time also restricts the search for optimal parameters for each algorithm.

Constraints on this problem are that depth information should be produced online from images taken by a camera mounted in the front of a car. The objects that are of interest are objects less than 40 meters from the car, and the camera has a resolution of 740x430 pixels where the field of view is approximately 40 degrees. Objects near the point of infinity move less than a pixel between frames. The small sideways motion makes the region where this phenomenon occurs quite large. In contrast, objects close to the car, such as a bridge being passed under, may move more than 10 % of the image width between two frames. This means that the algorithm has to be able to deal with both small and large motion.


Another important observation is that there are moving objects in the scene. The depth estimates of these non-stationary objects will be wrong if they are assumed to be stationary. The focus of this thesis work is not to get the algorithm to work in an environment crowded with moving objects. Different weather and light conditions also create difficulties for the system, which are not dealt with in this thesis either.

The computational resource consists of automotive classified digital signal processors (DSPs) and field-programmable gate arrays (FPGAs). The FPGA allows accelerating and parallelizing simple binary operations, while the number of floating point operations must be minimized as they are relatively slow and use a large amount of logic resources on the FPGA.

1.4 Report Structure

The rest of this thesis is organized as follows: Chapter 2 describes some theory needed to understand the evaluated methods. The general approaches to solve the problem are described in Chapter 3. Chapter 4 presents the implementations and the results obtained with them. Finally, Chapter 5 summarizes the thesis work and proposes future work to be done in the field.


Chapter 2

Theory

This chapter introduces some parts of the theory needed to understand the algorithms used to solve the structure from motion problem. Readers that are familiar with the described theory may continue to read Chapter 4. A discussion about other methods that may be appropriate for further work can be found in Section 5.2.

2.1 Good Features to Track

The correspondence problem refers to the problem of finding corresponding points in two or more images of the same 3D scene, taken from different points of view. One way to solve this problem is to find interesting points in the images, describe them with some descriptor and match the descriptors of each image with each other. Such a descriptor may be just an image patch, i.e. the pixel values in the neighborhood of the interesting point.

2.1.1 The Aperture Problem

There may be ambiguity when matching descriptors due to the aperture problem. This occurs when an N-dimensional signal can be represented with M < N variables and therefore is of intrinsic dimension M, often referred to as i0D, i1D, i2D and so on. The ambiguity can then be found in N − M dimensions. The goal is to track motions in all directions and therefore only suitable points, called feature points, should be considered when finding point correspondences to avoid the aperture problem. Good feature points are typically those containing a locally i2D structure, e.g. corners.

An example is the function f (x, y) = sin(x + y) which is a 2D signal but just i1D because in the orthogonal basis s = x + y, t = x − y we get f(s, t) = sin(s). For a graphical example, see Figure 2.1.


Figure 2.1. The aperture problem shown for an i1D signal of dimension 2. The parallel lines are viewed through a window which makes it impossible to determine the vertical motion component.

2.1.2 The Harris Operator

The Harris operator [8] is probably the most famous corner detector used in image processing. The first step is to estimate the image gradients I_x and I_y. This can efficiently be done using, for example, Sobel filters [17]. Then, for each pixel, the structure tensor T is calculated as

T(x) = \sum_u w(u) \nabla I(x+u) \nabla I(x+u)^T = \sum_u w(u) \begin{pmatrix} I_x I_x & I_x I_y \\ I_y I_x & I_y I_y \end{pmatrix},   (2.1)

\nabla I(x) = \begin{pmatrix} I_x \\ I_y \end{pmatrix} = \begin{pmatrix} \partial I / \partial x \\ \partial I / \partial y \end{pmatrix}   (2.2)

where w is a weighting function, usually a circular Gaussian. At last the Harris operator is calculated as

H(x) = \det T(x) - k \, \mathrm{tr}^2 T(x).   (2.3)

The constant k has in most literature been assigned a value in the range 0.04-0.15, which has empirically been found to work well [23]. Feature points are then indicated by positive values that are local maxima, see Figure 2.2.
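As an illustration of (2.1)-(2.3), the following is a minimal Python/NumPy sketch of the Harris response and feature point extraction. It is not the implementation used in this thesis; the function names, the Gaussian weighting width sigma and the non-maximum suppression size are assumptions made for the example, and k = 0.04 is taken from the range quoted above.

    import numpy as np
    from scipy.ndimage import gaussian_filter, sobel, maximum_filter

    def harris_response(img, k=0.04, sigma=1.5):
        # Image gradients I_x, I_y, cf. eq. (2.2), estimated with Sobel filters
        Ix = sobel(img.astype(float), axis=1, mode='reflect')
        Iy = sobel(img.astype(float), axis=0, mode='reflect')
        # Elements of the structure tensor T in eq. (2.1), weighted by a Gaussian w
        Ixx = gaussian_filter(Ix * Ix, sigma)
        Ixy = gaussian_filter(Ix * Iy, sigma)
        Iyy = gaussian_filter(Iy * Iy, sigma)
        # Harris response H = det T - k tr^2 T, eq. (2.3)
        return Ixx * Iyy - Ixy * Ixy - k * (Ixx + Iyy) ** 2

    def harris_points(img, k=0.04, sigma=1.5, size=5):
        H = harris_response(img, k, sigma)
        # Feature points are positive local maxima of the response
        mask = (H == maximum_filter(H, size)) & (H > 0)
        return np.argwhere(mask)          # (row, col) coordinates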

2.1.3 The Census Operator

The census operator [24] calculates a descriptor of the relationship between the central pixel value and the surrounding pixel values of an image patch. This can be used to find point correspondences. Let the central pixel value be P. The resulting signature vector is then a binary string of ξ(P, P') where P' are the surrounding pixel values. Not only the closest 8 pixels may be used; 16 pixels in a star formation, or 24 neighbor pixels, may also be used as P'. An extension to the census operator adds a third relation describing “similar” pixels [19] to make it more robust. The signature vector then changes to a ternary string, see Figure 2.3. Note that the census operator is invariant to illumination changes.


Figure 2.2. Feature points extracted with the Harris operator.

\xi(P, P') = \begin{cases} 0, & \text{if } P' < P \\ 1, & \text{if } P' \geq P \end{cases}

Example intensity values (center P = 64):
124  74  32
124  64  18
157 116  84
give the census values 1 1 0 / 1 x 0 / 1 1 1 and the signature vector 1 1 0 1 0 1 1 1.

(a) The original census operator.

\xi(P, P') = \begin{cases} 0, & \text{if } P' < P - \epsilon \\ 1, & \text{if } |P' - P| \leq \epsilon \\ 2, & \text{if } P' > P + \epsilon \end{cases}

The same patch gives the census values 2 1 0 / 2 x 0 / 2 2 2 and the ternary signature vector 2 1 0 2 0 2 2 2.

(b) The census operator with an extension for similar values.


Each signature vector corresponds to a flat region, line, edge, corner or other structure. Therefore only certain signature vectors should be considered when creating hypotheses of corresponding points. The corresponding point hypotheses are created by matching all pixels with a certain signature vector in one image to the pixels with the same signature vector in the other image. This creates a huge number of hypotheses, which can be filtered over time by only allowing combinations that do not change in direction or speed over three consecutive frames [19].

The census operator itself is well suited for implementation in hardware, and the lack of complex arithmetic calculations in the correspondence generation gives the complete algorithm potential for a hardware implementation [19].
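A small sketch of the extended (ternary) census signature over the 8-neighbourhood may clarify the idea; it is not the hardware-oriented implementation referred to above, the similarity band eps is an assumed parameter, and the 16-pixel star layout would only change the offset list.

    import numpy as np

    # Offsets of the 8 closest neighbours; a 16-pixel star would use another list.
    OFFSETS = [(-1, -1), (-1, 0), (-1, 1), (0, -1), (0, 1), (1, -1), (1, 0), (1, 1)]

    def ternary_census(img, eps=4):
        """Per-pixel ternary signature: 0 below, 1 similar to, 2 above the centre P."""
        img = img.astype(np.int16)
        sig = np.zeros(img.shape, dtype=np.int32)
        for dy, dx in OFFSETS:
            # Neighbour value P' at the given offset (borders wrap, good enough here)
            nb = np.roll(np.roll(img, -dy, axis=0), -dx, axis=1)
            code = np.where(nb < img - eps, 0, np.where(nb > img + eps, 2, 1))
            sig = sig * 3 + code          # append one ternary digit per neighbour
        return sig

Correspondence hypotheses between two frames can then be generated by matching pixels that share the same signature, keeping only signatures that indicate i2D structure.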

2.2 Optical Flow

A motion field is the motion of the projection of the 3D points in the scene. Using only image data, the motion field cannot be reconstructed. Instead, an optical flow, the apparent motion, can be estimated [20]. A sparse optical flow can be determined by finding corresponding points.

The equation that most optical flow algorithms rely on is the brightness consistency equation,

I(x, y, t) = I(x + dx, y + dy, t + dt),   (2.4)

and by assuming that the movement is small, a first order Taylor expansion gives

I(x, y, t) = I(x + dx, y + dy, t + dt) \approx I(x, y, t) + \frac{\partial I}{\partial x} dx + \frac{\partial I}{\partial y} dy + \frac{\partial I}{\partial t} dt \iff   (2.5)

0 = \frac{\partial I}{\partial x} dx + \frac{\partial I}{\partial y} dy + \frac{\partial I}{\partial t} dt \iff   (2.6)

0 = \frac{\partial I}{\partial x} \frac{dx}{dt} + \frac{\partial I}{\partial y} \frac{dy}{dt} + \frac{\partial I}{\partial t} \frac{dt}{dt} \iff   (2.7)

I_x V_x + I_y V_y = -I_t.   (2.8)

Note that there is only one equation but two unknowns, V_x and V_y, which can be seen as the aperture problem described in Section 2.1.1. The equation also tells us that in a flat region V_x and V_y cannot be determined because I_x = I_y = 0. Both these problems are solved by introducing additional constraints in the different methods. Still, many methods cannot handle changes in illumination.

2.2.1 Lucas Kanade Tracking

The basic idea for the Lucas Kanade method (KLT) of calculating optical flow relies on the Newton-Raphson method [12]. Assume that the 1D signal I has moved the distance −d to the signal J(x) = I(x + d) in the next frame as in Figure 2.4. A symmetric first order Taylor expansion gives

J(x - d/2) = I(x + d/2) \iff J(x) - \frac{d}{2}\frac{dJ}{dx} = I(x) + \frac{d}{2}\frac{dI}{dx} \iff d = \frac{J(x) - I(x)}{\frac{1}{2}\frac{dJ}{dx} + \frac{1}{2}\frac{dI}{dx}}.   (2.9)



Figure 2.4. Two signals to be matched.

This equation is solved iteratively with d_0 = 0 and

d_{k+1} = d_k + \frac{J_k(x) - I(x)}{\frac{1}{2}\frac{dJ_k}{dx} + \frac{1}{2}\frac{dI}{dx}}, \qquad J_k(x) = J(x - d_k).   (2.10)

In the 2D case the goal is to find d such that J(x) = I(x + d), where x = [x, y] and d = [d_x, d_y]. A symmetric dissimilarity measure between two windows can then be defined as

\epsilon = \iint_W \left( J\left(x + \frac{d}{2}\right) - I\left(x - \frac{d}{2}\right) \right)^2 w(x)\, dx.   (2.11)

The distance d can be found by solving the equation Zd = e [4] where

Z = \iint_W g(x) g^T(x) w(x)\, dx   (2x2 matrix),
e = \iint_W \left( I(x) - J(x) \right) g(x) w(x)\, dx   (2x1 vector) and
g(x) = \left[ \frac{\partial}{\partial x}\left(\frac{I+J}{2}\right) \quad \frac{\partial}{\partial y}\left(\frac{I+J}{2}\right) \right]^T   (2x1 vector).   (2.12)

Still, the KLT tracker needs a smooth signal which means that it cannot handle significant motions in general. If a large portion of the image moves approximately equally, a scale pyramid can be used to find this motion. Note that large motions of small objects cannot be found in this way as shown in Figure 2.5.
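To make the update rule concrete, the following Python/NumPy sketch iterates Zd = e for a single window, following (2.10)-(2.12). It is illustrative only: the window size, the number of iterations and the use of bilinear interpolation are assumptions, and it requires a textured window so that Z is invertible.

    import numpy as np
    from scipy.ndimage import map_coordinates

    def klt_track_point(I, J, x, y, win=7, iters=10):
        """Estimate the displacement d such that J(x) ~ I(x + d) around (x, y)."""
        r = win // 2
        ys, xs = np.mgrid[y - r:y + r + 1, x - r:x + r + 1].astype(float)
        patch_I = map_coordinates(I.astype(float), [ys, xs], order=1)
        d = np.zeros(2)                                    # (dx, dy), starts at d_0 = 0
        for _ in range(iters):
            # J_k(x) = J(x - d_k), resampled with bilinear interpolation
            patch_J = map_coordinates(J.astype(float), [ys - d[1], xs - d[0]], order=1)
            g_img = 0.5 * (patch_I + patch_J)              # (I + J)/2 as in g(x)
            gx = np.gradient(g_img, axis=1)
            gy = np.gradient(g_img, axis=0)
            Z = np.array([[np.sum(gx * gx), np.sum(gx * gy)],
                          [np.sum(gx * gy), np.sum(gy * gy)]])
            diff = patch_I - patch_J
            e = np.array([np.sum(diff * gx), np.sum(diff * gy)])
            d += np.linalg.solve(Z, e)                     # one Newton-Raphson step
        return d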

2.2.2 The Horn-Schunck Method

To improve the optical flow calculation, one common constraint is to let the optical flow vary at edges but not in flat regions, to get a globally consistent result. This is done in Horn and Schunck's algorithm by minimizing

\epsilon = \iint \left( (I_x V_x + I_y V_y + I_t)^2 + \alpha^2 \left( |\nabla V_x|^2 + |\nabla V_y|^2 \right) \right) dx\, dy.   (2.13)

(a) The first frame.

(b) The second frame.

(c) The corresponding points found using the KLT tracker in a scale pyramid. Note that the points were not tracked correctly in the marked area to the right.

Figure 2.5. One example of the KLT tracker using a scale pyramid. Small objects moving a large distance are not tracked correctly.

(23)

2.2 Optical Flow 11

The error can be iteratively minimized to a local minimum which may not be the global optimum. The smoothness term with the factor α will punish rapid flow changes. The method creates a dense optical flow, but the smoothness term will produce a linear interpolation of the motion across flat regions according to the motion at the edges of the region. This interpolation may not be correct, e.g. for a region of non-constant depth.
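A compact sketch of the classic iterative Horn-Schunck update, minimizing (2.13), is given below. It is not the thesis implementation; the smoothness weight alpha, the number of iterations and the averaging kernel are assumptions.

    import numpy as np
    from scipy.ndimage import convolve

    def horn_schunck(I1, I2, alpha=10.0, iters=100):
        """Dense flow (Vx, Vy) that approximately minimizes the functional (2.13)."""
        I1, I2 = I1.astype(float), I2.astype(float)
        Ix = np.gradient(I1, axis=1)
        Iy = np.gradient(I1, axis=0)
        It = I2 - I1
        u = np.zeros_like(I1)                     # Vx
        v = np.zeros_like(I1)                     # Vy
        # Kernel giving the local average flow used in the classic update rule
        avg = np.array([[0., 0.25, 0.], [0.25, 0., 0.25], [0., 0.25, 0.]])
        for _ in range(iters):
            u_bar, v_bar = convolve(u, avg), convolve(v, avg)
            t = (Ix * u_bar + Iy * v_bar + It) / (alpha ** 2 + Ix ** 2 + Iy ** 2)
            u = u_bar - Ix * t                    # pull the flow towards the data term
            v = v_bar - Iy * t                    # while staying close to the average
        return u, v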

2.2.3 Dynamic Programming Methods

An iterative search for the minimal error in the Horn-Schunck method, see Section 2.2.2, may not converge to the global minimum, especially not for rapid motion changes in the image. A better solution may be found by searching through all possible integer motions for all pixels, which is computationally too heavy. Considering the same problem for a 1D signal with the error function ε(x, v), i.e. the error for the sample x with the motion v, a full search can be implemented efficiently [7]. The idea is to accumulate the error using

S(x, v) = \epsilon(x, v) + \min_u \left( S(x - 1, u) + \lambda(v, u) \right)   (2.14)

where λ(u, v) is a cost function for rapid motion changes between two neighbor pixels. This idea is applied in separate horizontal and vertical scans to find a local minimum close to the global minimum.

The difference compared to a full search of all possible combinations of motions at all pixels is that this method performs separate horizontal and vertical searches. This means that a large horizontal and vertical motion of a tiny object may be detected, but large diagonal motions may be missed. The negative part of comparing all possible motions is that a block match must be done for all possible motions at each individual pixel. Therefore it is suggested to only consider motions that match the result of a sparse corresponding point method [18]. The idea is then to compute a limited set of motion hypotheses which are the only valid choices for the dynamic programming scans.
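The accumulation in (2.14) can be sketched for one scanline as below. The cost lambda(v, u) is here assumed to be a simple absolute difference between candidate motions, which is an assumption made for the example rather than the choice made in [7].

    import numpy as np

    def dp_scan(eps, lam=1.0):
        """Minimize the sum of eps(x, v) + lam*|v - u| along one scanline, cf. eq. (2.14).

        eps: (N, V) matching errors for N samples and V candidate motions.
        Returns the index of the chosen motion for every sample."""
        n, nv = eps.shape
        motions = np.arange(nv)
        S = np.empty_like(eps)
        back = np.zeros((n, nv), dtype=int)
        S[0] = eps[0]
        for x in range(1, n):
            # cost[u, v] = S(x-1, u) + lambda(v, u)
            cost = S[x - 1][:, None] + lam * np.abs(motions[None, :] - motions[:, None])
            back[x] = np.argmin(cost, axis=0)
            S[x] = eps[x] + cost[back[x], motions]
        v = np.empty(n, dtype=int)                # back-track the minimizing path
        v[-1] = int(np.argmin(S[-1]))
        for x in range(n - 1, 0, -1):
            v[x - 1] = back[x, v[x]]
        return v

A full 2D method would run such scans horizontally and vertically and, as suggested above, restrict the candidate motions to a limited set of hypotheses.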

2.2.4 Maximally Stable Extremal Regions Tracking

An extremal region is a connected component of pixels which are all brighter or darker than all the pixels on the region's boundary. Such a region Q_t is found by thresholding the image at an intensity value t. A maximally stable extremal region (MSER) is an extremal region at the intensity threshold where the rate of growth has its minimum [13, 14], i.e. the Q_{t_0} where

t_0 = \arg\max_t q(t) = \arg\max_t \frac{\mathrm{area}(Q_{t\pm1})}{\mathrm{area}(Q_t)}.

The found MSERs may vary a lot between consecutive frames, see Figure 2.6. Therefore, when tracking MSERs it may be better to search for all extremal regions, not just the maximally stable ones [5]. The distance measure between corresponding MSERs can be chosen as the Euclidean distance between feature vectors consisting of the mean gray value, region size, center of mass, and stability value q. If the rotation and scale can be neglected, which is close to the application of this thesis, the width and height of the bounding box and the second order moments of the regions can also be used in the feature vector.

The MSER computation can be implemented in linear time in the number of pixels [15]. The algorithm involves only integer operations, which has been confirmed by inspecting the source code of the VLfeat library [21]. This makes it suitable for real-time implementations in hardware.
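As an illustration of the region matching described above, the sketch below detects MSERs with OpenCV and matches them on a small feature vector (mean gray value, size and center of mass). It is not the VLfeat-based implementation used in Chapter 4; the OpenCV Python interface, the lack of per-feature scaling and the distance threshold are all assumptions.

    import numpy as np
    import cv2

    def region_features(gray, regions):
        """One feature vector per region: mean gray value, size, center of mass."""
        feats = []
        for pts in regions:                       # pts: (N, 2) array of (x, y) pixels
            vals = gray[pts[:, 1], pts[:, 0]]
            cx, cy = pts.mean(axis=0)
            feats.append([vals.mean(), len(pts), cx, cy])
        return np.array(feats)

    def match_mser_regions(gray1, gray2, max_dist=20.0):
        mser = cv2.MSER_create()
        r1, _ = mser.detectRegions(gray1)
        r2, _ = mser.detectRegions(gray2)
        f1, f2 = region_features(gray1, r1), region_features(gray2, r2)
        # Nearest neighbour in (unscaled) feature space, Euclidean distance
        d = np.linalg.norm(f1[:, None, :] - f2[None, :, :], axis=2)
        return [(i, int(j)) for i, j in enumerate(d.argmin(axis=1))
                if d[i, j] < max_dist]

In practice the individual features would need to be scaled against each other before the distances are comparable, which is why the threshold here should only be read as a placeholder.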

2.3 Epipolar Geometry

The geometry of two cameras is known as epipolar geometry. The notation is shown in Figure 2.7, where P is the point in 3D space and p_i is the homogeneous normalized image coordinate of the projection of P_i onto the image plane of the camera with camera center O_i. This means that p_i is described in the camera centered orthonormal coordinate system where the image plane is orthogonal to the third axis and located at distance 1 from O_i. The epipolar point e_i is the projection of the opposite camera center onto the image plane. The relation between the coordinates of P in the coordinate systems with origin O_i can be described with a translation vector, T = O_2 - O_1, and a rotation matrix, R, with P_2 = R(P_1 - T).

2.3.1 The Essential Matrix

Given that a 3D world coordinate projected in one image lies on the epipolar line l_1, the projection of the same coordinate in the other image can be found along the corresponding epipolar line l_2. The essential matrix, E, describes this relation in the homogeneous normalized image coordinates p_i, defined in Section 2.3, as

p_2^T E p_1 = 0.   (2.15)

One observation is that E = RS where

S = \begin{pmatrix} 0 & -T_z & T_y \\ T_z & 0 & -T_x \\ -T_y & T_x & 0 \end{pmatrix} [20].   (2.16)

Further, R and T can be determined from E through a singular value decomposition E = U D V^T [10]. The rotation is then given by R = U W V^T or R = U W^T V^T where

W = \begin{pmatrix} 0 & -1 & 0 \\ 1 & 0 & 0 \\ 0 & 0 & 1 \end{pmatrix}   (2.17)

and T = u_3 or T = -u_3 with u_3 as the last column of U. Note that T has an unknown scale as it lies in the null space of E, i.e. E T = 0. The correct scale can be retrieved from the vehicle velocity information available on the CAN-bus in the car.


(a) Maximal stable extremal regions. (b) All extremal regions.

Figure 2.6. Four consecutive frames where ellipses represent the extremal regions. In (a) only the maximal stable extremal regions are shown and in (b) all extremal regions are shown. The box shows an interesting region where one of the maximal stable extremal regions cannot be tracked.


Figure 2.7. The epipolar geometry.

In total, there are four hypotheses of motion. The correct configuration can be determined by triangulating the points and checking for which configuration all points lie in front of both cameras. Only one of the hypotheses has all points in front of the camera.
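The decomposition into the four motion hypotheses can be sketched as follows; it is a generic SVD-based recipe following (2.17), not the thesis code, and choosing among the four hypotheses (the check that the points lie in front of both cameras) is left to a triangulation of one or a few points.

    import numpy as np

    W = np.array([[0., -1., 0.],
                  [1.,  0., 0.],
                  [0.,  0., 1.]])

    def motion_hypotheses(E):
        """The four (R, T) hypotheses from an essential matrix; |T| = 1."""
        U, _, Vt = np.linalg.svd(E)
        if np.linalg.det(U @ Vt) < 0:   # make sure the rotations become proper (det = +1)
            Vt = -Vt
        R1, R2 = U @ W @ Vt, U @ W.T @ Vt
        t = U[:, 2]                     # translation direction, unknown scale and sign
        return [(R1, t), (R1, -t), (R2, t), (R2, -t)]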

The Rotation Matrix

One way of specifying the angles of a rotation is to use the yaw, pitch and roll angles, see Figure 2.8. Then the rotation matrix

R = R_{yaw} R_{pitch} R_{roll}   (2.18)

where

R_{yaw}(\alpha) = \begin{pmatrix} \cos\alpha & 0 & \sin\alpha \\ 0 & 1 & 0 \\ -\sin\alpha & 0 & \cos\alpha \end{pmatrix}, \quad
R_{pitch}(\beta) = \begin{pmatrix} 1 & 0 & 0 \\ 0 & \cos\beta & -\sin\beta \\ 0 & \sin\beta & \cos\beta \end{pmatrix}, \quad
R_{roll}(\gamma) = \begin{pmatrix} \cos\gamma & -\sin\gamma & 0 \\ \sin\gamma & \cos\gamma & 0 \\ 0 & 0 & 1 \end{pmatrix}.   (2.19)

By inserting (2.19) into (2.18) the rotation matrix can be written as

R = \begin{pmatrix}
\sin\alpha\sin\beta\sin\gamma + \cos\alpha\cos\gamma & \sin\alpha\sin\beta\cos\gamma - \cos\alpha\sin\gamma & \sin\alpha\cos\beta \\
\cos\beta\sin\gamma & \cos\beta\cos\gamma & -\sin\beta \\
\cos\alpha\sin\beta\sin\gamma - \sin\alpha\cos\gamma & \cos\alpha\sin\beta\cos\gamma + \sin\alpha\sin\gamma & \cos\alpha\cos\beta
\end{pmatrix}.   (2.20)

(a) Yaw. (b) Pitch. (c) Roll.

Figure 2.8. One way of representing a 3D rotation is to use the yaw, pitch and roll angles in the specified order.

When inspecting (2.20) it turns out that the angles can be recovered through

\tan\alpha = \frac{r_{13}}{r_{33}}, \qquad -\sin\beta = r_{23}, \qquad \tan\gamma = \frac{r_{21}}{r_{22}}.   (2.21)
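Extracting the angles according to (2.21) is a one-liner per angle; the sketch below uses arctan2 to keep the correct quadrants, which is an implementation choice rather than something stated in the text.

    import numpy as np

    def yaw_pitch_roll(R):
        """Recover (alpha, beta, gamma) from R = R_yaw R_pitch R_roll, cf. eq. (2.21)."""
        alpha = np.arctan2(R[0, 2], R[2, 2])     # tan(alpha) = r13 / r33
        beta  = np.arcsin(-R[1, 2])              # -sin(beta) = r23
        gamma = np.arctan2(R[1, 0], R[1, 1])     # tan(gamma) = r21 / r22
        return alpha, beta, gamma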

2.3.2 The Normalized Eight-Point Algorithm

The essential matrix can be computed from five or more point correspondences between the frames. The five degrees of freedom relate to the three rotation angles and the two degrees of freedom of a normalized translation vector. A simpler equation system can be derived if eight point correspondences are used instead. The eight-point algorithm uses equation (2.15) rewritten as

p_{1x} p_{2x} e_{11} + p_{1y} p_{2x} e_{12} + p_{2x} e_{13} + p_{1x} p_{2y} e_{21} + p_{1y} p_{2y} e_{22} + p_{2y} e_{23} + p_{1x} e_{31} + p_{1y} e_{32} + e_{33} = 0.   (2.22)

The equation system can be constructed and solved using eight non-degenerate point pairs. This is because E has nine elements but an unknown scale [20].

The calculation has numerical problems because the range of the pixel coordinates is from zero to a thousand while the third homogeneous coordinate is usually one. The solution to this problem is to translate the coordinate system to the center of mass and scale the coordinates to get an average distance of √2 to the origin [9].
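A minimal NumPy sketch of the normalized eight-point algorithm is given below. It follows (2.22) and the normalization described above, and additionally enforces the singular values (1, 1, 0) of an essential matrix, which is standard practice [10] even though it is not spelled out in the text; the function names are assumptions.

    import numpy as np

    def normalize(pts):
        """Translate to the centroid and scale to mean distance sqrt(2) from the origin."""
        c = pts.mean(axis=0)
        s = np.sqrt(2) / np.linalg.norm(pts - c, axis=1).mean()
        T = np.array([[s, 0., -s * c[0]],
                      [0., s, -s * c[1]],
                      [0., 0., 1.]])
        ph = np.column_stack([pts, np.ones(len(pts))])
        return (T @ ph.T).T, T

    def eight_point(p1, p2):
        """Estimate E from >= 8 correspondences given as (N, 2) arrays."""
        q1, T1 = normalize(p1)
        q2, T2 = normalize(p2)
        # One row per correspondence of the linear system in eq. (2.22)
        A = np.column_stack([q1[:, 0] * q2[:, 0], q1[:, 1] * q2[:, 0], q2[:, 0],
                             q1[:, 0] * q2[:, 1], q1[:, 1] * q2[:, 1], q2[:, 1],
                             q1[:, 0], q1[:, 1], np.ones(len(q1))])
        E = np.linalg.svd(A)[2][-1].reshape(3, 3)       # null-space solution, row major
        U, _, Vt = np.linalg.svd(E)
        E = U @ np.diag([1., 1., 0.]) @ Vt              # project onto essential matrices
        return T2.T @ E @ T1                            # undo the normalization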

2.3.3 The Random Sample Consensus Algorithm

The set of corresponding point pairs contains errors. Some points have been tracked completely wrong and some have errors in their subpixel position estimation. Therefore, it is essential not to use erroneous points when running the eight-point algorithm described in Section 2.3.2. One solution to this problem is the Random Sample Consensus algorithm (RANSAC), see Algorithm 1 [10]. Note that RANSAC is a non-deterministic algorithm but will produce better results the more iterations are allowed.

Algorithm 1 The Random Sample Consensus Algorithm (RANSAC)

Require: A set of putative point correspondences.
best_E ← null
best_inliers ← 0
loop
    Select a random sample of 8 correspondences.
    current_E ← the essential matrix computed as in Section 2.3.2.
    num_inliers ← the number of correspondences consistent with current_E.
    if num_inliers > best_inliers then
        best_E ← current_E
        best_inliers ← num_inliers
    end if
end loop

Equation (2.15) can be used to check whether a corresponding pixel pair is potentially correct or not. The tracked pixel coordinates always have noise, which means that the constraints in (2.15) usually do not hold. The solution is to determine the distance from the tracked pixel to the epipolar line and sum the distances in the two images for a corresponding pair. If the distance is too large, the point pair is regarded as erroneous and is called an outlier.

The relation between a point and a line that intersects it can be written as

\bar{l}^T p = 0   (2.23)

where \bar{l} is the dual homogeneous coordinates of the line. By identifying the factors of (2.15) and (2.23) it follows that the epipolar line \bar{l}_1 = E^T p_2. Let \hat{l} denote the dual homogeneous coordinates of the line l normalized in the first and second element. Then, according to [10], it turns out that the distance is

d(l_1, p_1)^2 = \left( \hat{l}_1^T p_1 \right)^2 = \frac{\left( p_2^T E p_1 \right)^2}{\left( E^T p_2 \right)_1^2 + \left( E^T p_2 \right)_2^2}.   (2.24)
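Putting Algorithm 1 and the distance (2.24) together gives the sketch below, reusing the eight_point function from the sketch in Section 2.3.2. The number of iterations and the inlier threshold (in squared normalized image coordinates) are assumptions.

    import numpy as np

    def epipolar_dist_sq(E, p1, p2):
        """Squared distances to the epipolar lines, eq. (2.24), summed over both images."""
        h1 = np.column_stack([p1, np.ones(len(p1))])
        h2 = np.column_stack([p2, np.ones(len(p2))])
        l1 = h2 @ E                           # rows are (E^T p2)^T, lines in image 1
        l2 = h1 @ E.T                         # rows are (E p1)^T, lines in image 2
        num = np.einsum('ij,ij->i', h2, h1 @ E.T) ** 2    # (p2^T E p1)^2 per pair
        return num / (l1[:, 0] ** 2 + l1[:, 1] ** 2) + \
               num / (l2[:, 0] ** 2 + l2[:, 1] ** 2)

    def ransac_essential(p1, p2, iters=2000, thresh=1e-5, seed=0):
        rng = np.random.default_rng(seed)
        best_E, best_inliers = None, np.zeros(len(p1), dtype=bool)
        for _ in range(iters):
            idx = rng.choice(len(p1), 8, replace=False)
            E = eight_point(p1[idx], p2[idx])             # minimal sample, Section 2.3.2
            inliers = epipolar_dist_sq(E, p1, p2) < thresh
            if inliers.sum() > best_inliers.sum():
                best_E, best_inliers = E, inliers
        return best_E, best_inliers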

2.3.4 Triangulation

If the translation vector, rotation matrix and camera matrix are known for a corresponding point pair, the 3D point P can be triangulated. There is always an estimation error in the pixel positions, which means that the rays from the optical centers through the pixels will not intersect.

One way of estimating the 3D point is the mid-point method [20]. The method chooses the point on the middle of the shortest line segment between the two rays as shown in Figure 2.9.

If the translation T and rotation matrix R are known, the end points of the segment, in the camera centered coordinate system of the first camera, can be written as a p_1 and T + b R^T p_2. The direction of the line segment is p_1 × R^T p_2, which is orthogonal to both rays. By solving the equation system

a p_1 + c \left( p_1 \times R^T p_2 \right) = T + b R^T p_2   (2.25)

for a, b and c, the 3D point P can be computed as

P_1 = a p_1 + \frac{c}{2} \left( p_1 \times R^T p_2 \right).   (2.26)

Figure 2.9. Triangulation with the midpoint method.

Remember that the larger the angle between the triangulated rays is, the better the depth estimate becomes [10]. This means that the triangulation of objects in front of the cameras is very uncertain in the forward motion case, in contrast to a normal stereo camera, see Figure 2.10.
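The mid-point construction (2.25)-(2.26) translates almost directly into code; the sketch below handles a single correspondence given in normalized homogeneous coordinates and assumes the two rays are not parallel.

    import numpy as np

    def triangulate_midpoint(p1, p2, R, T):
        """Mid-point triangulation of one correspondence, cf. eqs. (2.25)-(2.26).

        p1, p2: normalized homogeneous image coordinates (3-vectors).
        R, T:   rotation and translation such that P2 = R (P1 - T)."""
        r2 = R.T @ p2                         # direction of the second ray in camera 1
        w = np.cross(p1, r2)                  # direction of the connecting segment
        # Solve a*p1 + c*w = T + b*r2 for (a, b, c), i.e. eq. (2.25)
        a, b, c = np.linalg.solve(np.column_stack([p1, -r2, w]), T)
        return a * p1 + 0.5 * c * w           # the point in the middle of the segment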

2.4 Error Analysis

The depth estimation has some errors due to the error propagation in each calculation step. The images from the camera have some noise which introduces errors in the optical flow estimation.

Let two images be taken from camera positions where one is at the distance t in front of the other with no rotation change, see Figure 2.11. This means that they share the optical axis. Let an object be located at a distance x from the optical axis and at the distances z_1 = z + t and z_2 = z from the cameras, which means that it is projected to the pixel distance \bar{x}_i = \frac{f x}{z_i} from the image center. Define

f = \frac{r/2}{\tan(\alpha/2)}

where α is the field of view of the camera and r is the resolution in pixels. Assuming that the motion of the object between the images is d = \bar{x}_2 - \bar{x}_1, it follows that

\bar{x}_i z_i = f x \implies   (2.27)
\bar{x}_1 z_1 = \bar{x}_2 z_2 \implies   (2.28)
z + t = z_1 = \frac{z_1 d}{d} = \frac{z_1 \bar{x}_2 - z_1 \bar{x}_1}{d} = \frac{z_1 \bar{x}_2 - z_2 \bar{x}_2}{d} = \frac{t \bar{x}_2}{d} = \frac{t x f}{z d}.   (2.29)


Figure 2.10. Triangulation with error in the estimate of the corresponding points. In contrast to a stereo camera, the angle between the triangulated rays is very small in the forward motion case which increases the error in the depth estimate.


Figure 2.11. A typical situation when tracking a stationary object during forward motion.


The camera used in this master thesis has a field of view of approximately 40 degrees and a resolution of r = 700 pixels. Let for example the vehicle speed be 45 km/h, which gives t = 0.5 m at the frame rate 30 Hz, and watch an object at z + t = 30 m located 1 m from the optical axis, which means that the object will pass by close to the side of the car. Then d ≈ 0.53 pixels. If the estimate of d is in the range d ± 0.2 pixels, the estimated distance will lie, in this example, in the range 25 m to 38 m. As a consequence, either the optical flow has to be very precise or the estimates have to be filtered to reduce noise.
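The numbers in the example above can be reproduced with a few lines; the small helper below simply evaluates and inverts (2.29), and the specific values are the ones quoted in the text.

    import numpy as np

    alpha = np.deg2rad(40.0)                 # field of view
    r, t, x = 700.0, 0.5, 1.0                # resolution [px], motion per frame [m], offset [m]
    f = (r / 2) / np.tan(alpha / 2)          # focal length in pixels

    def flow(z):
        """Image motion d in pixels for an object moving from depth z + t to z, eq. (2.29)."""
        return t * x * f / (z * (z + t))

    def distance_from_flow(d):
        """Invert eq. (2.29): solve z^2 + t*z - t*x*f/d = 0 and return z + t."""
        z = (-t + np.sqrt(t * t + 4 * t * x * f / d)) / 2
        return z + t

    d = flow(29.5)                           # object at z + t = 30 m, roughly half a pixel
    print([round(distance_from_flow(d + e), 1) for e in (-0.2, 0.0, 0.2)])
    # roughly [37.7, 30.0, 25.7] m, i.e. the 25 m to 38 m interval quoted above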

All these calculations depend on the assumption that all objects are stationary. If an object is moving, the estimated motion can be thought of as the motion of a corresponding stationary object together with a large error in the motion estimation. This will of course lead to errors in the depth estimation. One solution may be to find the relative motion between the camera and the moving object, but this is difficult, especially for small objects.


Chapter 3

Approaches

A common approach to reconstruct the 3D environment from 2D images is to triangulate points that have been tracked between image frames. Some other algorithms are based on optimization of a certain error measure. Two different approaches are described below. Both make use of point correspondences and must therefore also deal with erroneous correspondences, so-called outliers.

3.1 Point Correspondence and Triangulation between Two Frames

The straightforward solution can be outlined with the following computation steps [10, 20]; a minimal end-to-end sketch is given after the list:

• Find corresponding points between two frames.
• Estimate the essential matrix.
• Compute the ego-motion, i.e. the rotation and translation.
• Triangulate the points in space that project to the corresponding points.
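As a concrete reference for the four steps, the following Python sketch chains them together using OpenCV building blocks (the thesis itself used MATLAB and OpenCV's C interface, so this is illustrative only). The parameter values and the camera matrix K are assumptions, and the translation scale would in practice be fixed from the vehicle speed.

    import numpy as np
    import cv2

    def depth_from_two_frames(img1, img2, K):
        """Sketch of the two-frame pipeline in Section 3.1 (grayscale uint8 input)."""
        # 1. Corresponding points: corners tracked with a pyramidal KLT tracker
        pts1 = cv2.goodFeaturesToTrack(img1, maxCorners=2000,
                                       qualityLevel=0.01, minDistance=5)
        pts2, status, _ = cv2.calcOpticalFlowPyrLK(img1, img2, pts1, None)
        ok = status.ravel() == 1
        p1, p2 = pts1[ok].reshape(-1, 2), pts2[ok].reshape(-1, 2)
        # 2. Essential matrix with outlier rejection (RANSAC)
        E, inl = cv2.findEssentialMat(p1, p2, K, method=cv2.RANSAC,
                                      prob=0.999, threshold=1.0)
        inl = inl.ravel().astype(bool)
        # 3. Ego-motion: R and t, with t only determined up to scale
        _, R, t, _ = cv2.recoverPose(E, p1[inl], p2[inl], K)
        # 4. Triangulation of the inlier correspondences
        P1 = K @ np.hstack([np.eye(3), np.zeros((3, 1))])
        P2 = K @ np.hstack([R, t])
        X = cv2.triangulatePoints(P1, P2, p1[inl].T, p2[inl].T)
        return (X[:3] / X[3]).T, R, t        # 3D points (unknown global scale), pose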

3.1.1 Corresponding Points

It would be ideal if a dense optical flow is estimated. Then every pixel in one frame has its corresponding position in the next frame which in turn will produce a dense depth map. The problem is that non-textured regions are difficult to track between frames and therefore the accuracy is lowered.

Instead of using a dense optical flow, a sparse set of point correspondences can be used. In this thesis the Harris detector, see Section 2.1.2, is used to find good points to track. These points are then tracked accurately, as they were chosen to be good to track, to the second frame. The problem with this technique is that the motion of the rest of the area is unknown and therefore no depth estimate will be available. Without a dense depth map it may be difficult for other algorithms to detect generic obstacles in front of the car.

Figure 3.1. Pixel motion in the close-to-forward motion case. In the middle region, pixels move only a short distance.

A method in between the sparse and dense tracking is the MSER tracking where image regions are tracked individually. There are still problems with geometrical changes, such as scale change, of the regions that can introduce errors.

3.1.2 Estimating the Essential Matrix

Most of the tracked image points in the center of the image will just move a short distance due to the forward motion, see Figure 3.1. Therefore the corresponding point estimates need subpixel accuracy to get a good estimation of the essential matrix.

All corresponding points may not be correctly matched and independently moving objects may also exist in the scene. Both situations will produce points that will spoil the estimate of the essential matrix. This can be handled by, for example, using a RANSAC loop, see Section 2.3.3.

3.1.3 Estimating the Rotation and Translation

As described in Section 2.3.1 the translation can be determined only up to scale. Therefore the vehicle speed may be incorporated in this step.

3.1.4 Triangulation

The triangulation is straightforward using the formulas introduced in Section 2.3.4. Problems occur when the point is far away or at a position in the image that does not move much. Then the distance measure can differ a lot compared to the ground truth due to errors in the estimates of corresponding points. A solution may be to track image points over several frames or to filter the distance estimates over time.


3.2 Bundle Adjustment

Instead of triangulating points from two frames, the bundle adjustment method optimizes the world coordinates of each image point over all frames [10]. The input to this algorithm is corresponding points that are tracked through multiple frames. This gives a large equation system, but not all points are tracked through all frames. Therefore, by grouping the equations in a clever way, it is possible to run the algorithm in a second when optimizing over tens of frames with a thousand corresponding points on a normal desktop computer [11].

This method has not been evaluated in this master thesis. The focus has been on triangulating point correspondences found between two frames.


Chapter 4

Implementations and Results

This chapter describes the implementations that have been done to evaluate different algorithms. The evaluation has been done on the approach of using point correspondences to triangulate between two frames, see Section 3.1. The first step in that approach is to find a dense map of good point correspondences, which has proven to be difficult. Therefore much effort has been put into that particular task.

The implementations have run on image sequences with a resolution of 740x430 pixels where the field of view is approximately 40 degrees. The chosen image sequences vary from having no moving objects up to several moving cars.

4.1 Dense Optical Flow with KLT

The equations in Section 2.2.1 state an iterative process where weighted integrals, which are approximated with sums, should be calculated. This includes floating point operations when moving the images with subpixel precision in the beginning of each iteration. The number of iterations and the size of the integration areas are important parameters to determine the accuracy and computational cost. A comparison between two groups of parameters has been done in the following sections.

4.1.1 Using Parameters for Fast Computation

An implementation of the KLT tracker has been done in MATLAB. The parameters have been chosen to make a real-time implementation in hardware easily possible, i.e. by using sliding windows for summing together with a minimized number of multiplications. The implementation uses

• a scale pyramid in five levels,
• averaging boxes as low pass filter kernels,
• Sobel filters as gradient filter kernels and
• rectangular summing windows for the integrals.

Some parts of the image, such as flat regions, are difficult to follow. Therefore, it is important to know if the tracking was successful or not. The first frame I is compared to an image constructed by warping the second frame J according to the motion into J'(x) = J(x − d(x)), where d is the optical flow; J' should then be equal to I. As a confidence measure, the sum of absolute differences (SAD) in a small area around each pixel has been used. If the SAD is low in the whole area, the estimated motion is potentially correct. In some cases, e.g. in a flat region, the SAD will still be low for motions d(x) + δ that differ from the estimate. Hence, high certainty is indicated when

\sum_{r \in R} \left| J(x + r - d(x + r)) - I(x + r) \right|   (4.1)

is low, for a small region R, and

\sum_{r \in R} \left| J(x + r - d(x + r)) - I(x + r + \delta) \right|   (4.2)

is high for all small δ ≠ 0 [7].

The result when running the algorithm on two frames is shown in Figure 4.1. It is possible to distinguish objects close to the camera in the optical flow, but the motion is not regular. This problem can also be observed as clutter in the middle of the image. Due to the small movements in those parts of the image in the forward motion case, it is very difficult to separate correct estimates, which have some error, from incorrect ones.

4.1.2 Using Parameters for Accurate Computation

To compare the simple parameters with more accurate parameters, the same sequence was run using the KLT tracker in the OpenCV library [16]. This leads to the results in Figure 4.2, which contain the same problems as described in Section 4.1.1. The main difference is that OpenCV claims it is able to detect the motion for a lot more pixels, including flat regions. It is also a lot better at finding the horizontal motion, but there are still large errors in the vertical motion.

4.1.3 Comparing the Two Parameter Sets

The two methods have also been compared to the Middlebury ground truth data [3]. An evaluation using the Rubber Whale data set shows that 56 % of the pixels have errors less than 0.1 pixels when using the OpenCV KLT tracker compared to 20 % for the simple implementation, see Figure 4.3. For the errors less than 0.3 pixels, the numbers are 78 % and 67 % respectively. This shows that an optical flow estimated using simple box filters may not be accurate enough if the errors are required to be small.


(a) The two frames.

(b) Results from a simple implementation of a KLT tracker. The optical flow is shown in horizontal and vertical direction respectively. Blue means motion to the right/down, green means no motion, red means motion to the left/up and black indicates too low confidence.


(a) The two frames.

(b) Results from the OpenCV KLT tracker implementation. The optical flow is shown in horizontal and vertical direction respectively. Blue means motion to the right/down, green means no motion, red means motion to the left/up and black indicates too low confidence.


(a) The two frames.

(b) Optical flow error using the OpenCV KLT tracker implementation to the left and the own simple KLT tracker to the right. Blue means low error and red means large error. Pixels with too low confidence are black.

[Two histograms titled "Optical Flow Error": pixel error on the horizontal axis, percent of pixels on the vertical axis.]

(c) A histogram over the number of pixels with a certain error, the OpenCV implementation to the left and the own implementation to the right. Pixels with too low confidence are not counted.

Figure 4.3. Two implementations of the KLT tracker evaluated using an image sequence with ground truth.


4.2 Dense Optical Flow with Dynamic Programming

The Horn-Schunck method, described in Section 2.2.2, performed quite badly in the Middlebury evaluation [3]. Therefore effort has been put into an implementation of the dynamic programming method for dense optical flow that is described in Section 2.2.3.

The result of running the method on two frames is shown in Figure 4.4. It is clear that the optical flow has floated horizontally and vertically through flat regions, such as the road and the sky. This effect emerges from the separate horizontal and vertical scans, the criterion of minimizing changes in the optical flow and the fact that any flow in flat regions will give the same error.

The computation time and memory use complexity is O(h × w × R) where h and w are the height and width of the image and R is the size of the search region for motion [7]. To catch large motions R needs to be quite big. For example ±50 pixels horizontally and ±10 pixels vertically gives R = 2121, which means that thousands of operations per pixel have to be done. The algorithm is easy to parallelize because the horizontal scan is done independently over each line. It still needs a lot of memory to store the correlation results for all motions at all pixels.

4.3 Egomotion Estimation

The eight-point algorithm has been implemented to estimate the egomotion from a set of corresponding points. To remove erroneous correspondences, called outliers, the eight-point algorithm has been placed inside a RANSAC loop. To find a good estimate of the egomotion, which is needed for the triangulation, the RANSAC loop has to run a lot of iterations.

The egomotion estimation is quite unstable, as can be seen in Figure 4.5(a). The rotation matrix has been decomposed into the yaw, pitch and roll angles. Apparently the roll and translation are unstable. Even if only point correspondences that have been tracked for at least five frames are used, the estimated parameters are still unstable, as shown in Figure 4.5(b).

The unstable values are causing some frames to get totally wrong egomotion estimations. To avoid this, the best estimates from the RANSAC loop can be filtered using a filter that includes outlier rejection, e.g. using the median. This kind of filtering has not been implemented in this master thesis.

4.4 Sparse Optical Flow with KLT

The OpenCV library contains an implementation of the KLT tracker for tracking sparse points [16]. By only using the best features according to the Harris response, most of the tracked points should have good precision.

Using two consecutive frames, the motion of the tracked points is too small to robustly be able to distinguish between points on moving and stationary objects when estimating the egomotion.


(a) The two frames.

(b) An example of a dense optical flow using a dynamic programming method. The optical flow is shown in horizontal and vertical direction respectively. Blue means motion to the right/down, green means no motion, red means motion to the left/up and black indicates too low confidence.


(a) The egomotion has been estimated between two consecutive frames.

(b) The egomotion has been estimated between points that have been tracked through five consecutive frames.

Figure 4.5. One example of running the eight point algorithm in 2000 iterations of a RANSAC loop. The estimated translation and rotation parameters are shown for the best 100 samples, i.e. the samples with most inliers. Even if the points are tracked more frames, the estimate robustness does not increase. Note that the translation has been normalized as the scale cannot be determined according to Section 2.3.2.


Figure 4.6. It is difficult to separate the red outliers from the blue inliers when estimating the egomotion. Both inliers and outliers can be found on both stationary and moving objects. The rate of inliers is 60 % in the background, 30 % at the cars, 50 % at the bicycle and 0 % at the pedestrians that are crossing the road.


Figure 4.7. The sixteen neighbors used in the implemented Census operator.

In Figure 4.6 it is clear that some inliers on the stationary environment have been counted as outliers, and that the inliers on the moving targets will provide false depth estimates. The figure also shows that objects moving along the epipolar lines have a quite high inlier rate while objects moving across them have a high outlier rate.

4.4.1 Initial Correspondences using the Census Operator

As shown in Figure 2.5, the KLT tracker cannot handle motions that are too large compared to the size of the objects, as those objects disappear when low-pass filtering in the scale pyramid. Therefore, the KLT tracker can be given an initial guess of the optical flow from some other algorithm.

An implementation of the ternary census operator over 16 neighbor pixels, see Figure 4.7, produced a good number of corresponding points to use for triangulation, see Figure 4.8. To be able to estimate the depth, the corresponding point estimates have to be refined to subpixel accuracy. The refinement can be done using the KLT tracker but without the need of any scale pyramid. Using the census operator may be an option to speed up the KLT tracker.


(a) The three frames.

(b) Corresponding point hypotheses found using the Census transform.

(c) Hypotheses remaining after temporal filtering.

Figure 4.8. Finding corresponding points using the Census transform makes it possible to find large motions of small objects. Compare with the same image sequence and the KLT tracker in Figure 2.5.


4.5 Depth Estimation

The point correspondences, tracked through five frames, and the egomotion estimate have been used to triangulate the distance to the tracked points. The evaluation has been done on one of the two image sequences provided by a stereo camera. Consequently, the depth estimates from the stereo camera can be used as reference. The result is shown in Figure 4.9. As the depth estimates vary a lot, the variance of the triangulation result from the latest three frames is used to validate the stability when creating the depth map in Figure 4.9(b).

In Figure 4.9(c) the OpenCV implementation of the KLT tracker has been used to find corresponding points. The traffic sign to the left can be seen as a sharp yellow line and the cars on the right are also clearly visible. The problems arise with the man in the middle, who has depth estimates ranging from 5 to 30 meters. This can be compared to Figure 4.10(a) where the depth estimates, calculated using the own KLT implementation, have an even larger variance, both concerning the man in the middle and the objects around him. When getting closer to the man, the depth estimates also get concentrated, see Figure 4.10(b). In this case it is too late to provide a warning to the driver. By examining the results from several similar sequences it turns out that the points group at a distance of 7 meters when the object is 0.5 meters beside the line of motion, 12 meters when 1 meter beside and 15 to 20 meters when the object is 3 meters beside.

When running a similar sequence but with a man that moves, the RANSAC loop efficiently removes the tracks on the man as outliers. There are too few corresponding points on the man to estimate his motion robustly using a RANSAC loop and the eight-point algorithm. Thus it will not be possible to triangulate any distance at all according to Section 2.4.


(a) Corresponding points that have been tracked in five frames. Blue are counted as inliers on stationary objects, red are outliers. The two gray lines show the two rows used as reference in (c).

(b) The depth estimates. Blue means close to the camera and red means far away. Only estimates with low variance over three consec-utive frames are considered as valid.

(c) The depth estimates seen from above. The axes are in meters, which differs from (a) and (b). The gray points correspond to the depth estimates from the stereo image pair at the level of the two gray lines in (a) respectively. Tiny gray dots can be found in the middle of the image, above and below the larger gray dots. These tiny dots correspond to the depth that would have been estimated if the optical flow differed ±0.3 pixels from the correct value. The colored points are the triangulated estimates, where blue points correspond to low objects near the road, red means objects above the car and green is just in between.

Figure 4.9. The result of triangulating a sparse optical flow. The traffic sign to the left can be seen as a sharp yellow line and the cars on the right are also clearly visible. The problems arise with the man in the middle, who has depth estimates ranging from 5 to 30 meters.


(a) The triangulation result of the same sequence as in Figure 4.9(c) but using the own implementation of the KLT tracker instead of the implementation provided by the OpenCV library. It is clear that the estimates vary a lot more.

(b) The triangulation result when getting closer to the man. The optical flow is estimated using the OpenCV KLT tracker. The points are grouped close to each other at the position of the man.

Figure 4.10. The result of triangulating a sparse optical flow. The depth estimates are seen from above. The gray points correspond to the depth estimates from the stereo image pair at the level of the two gray lines in (a) respectively. Tiny gray dots can be found in the middle of the image, above and below the larger gray dots. These tiny dots correspond to the depth that would have been estimated if the optical flow differed ±0.3 pixels from the correct value. The colored points are the triangulated estimates, where blue points correspond to low objects near the road, red means objects above the car and green is just in between.


4.6 MSER Tracking

The benefit of tracking regions is that when the depth estimate of one point in the region has been found, the rest of the region may be approximated with the same depth.

The regions are found using the implementation of MSER in the VLfeat C library [21]. All regions in one frame are compared to the regions in the other frame. The sum of squared differences of the first and second order moments was used as a similarity measure. The regions are tracked over several frames before triangulation of the center of mass.

The result from this method is shown in Figure 4.12. It is clear that the depth estimates of the objects are not stable, which is a result of a too small baseline compared to the accuracy of the tracked center of gravity points when triangulating. It may still be possible to see that some traffic signs are at the correct distance but moving objects and objects in the middle of the scene have erroneous distance estimates.

Moreover, the MSER regions found when driving on a highway may be too few to be able to estimate the egomotion using the center of gravity alone, see Figure 4.11. This means that the MSER tracking has to be combined with some other method to get corresponding points in a large variety of driving environments.

Figure 4.11. Just a few maximal stable extremal regions are found when driving on a highway.


(a) The last frame in the sequence.

(b) The tracked regions.

(c) Tracked regions that conform to the essential matrix. The essential matrix was estimated using a RANSAC loop.

(d) The depth estimate of the tracked regions. Dark blue is just in front of the car and dark red means a distance of 60 m.

Figure 4.12. Triangulation of points that have been tracked using MSER. The traffic signs on both sides of the road are at the same distance but objects in the middle of the scene have erroneous distance estimates. Also the Vespa to the left has been estimated closer than it really is.


Chapter 5

Summary

This chapter describes the conclusions of this master thesis work. The conclusions are followed by a discussion of areas where future work can be done.

5.1 Conclusions

In Chapter 4 it has been shown that creating a depth map in real-time using one camera in front of the car is difficult. The measurements are too unstable to be useful for the safety system of the car which can be seen in Figure 4.10. The difficulties arise from the camera motion towards the epipoles which gives a problematic geometry as described in Section 2.4. Objects that are moving in the scene are even harder to handle correctly, especially those moving along the epipolar lines which cannot be distinguished from the stationary background just using the optical flow, see Section 4.6.

One great problem is the optical flow estimation, which contains several error sources. In the case of dense optical flow estimation the aperture problem has a great impact. The road and the sky are flat regions which give no good estimate at all. If any optical flow were estimated in these areas, it would be erroneous and often estimated to zero due to fixed pattern noise in the images. Traffic lights and poles, tree trunks and road lanes are examples of objects that are i1D, therefore hard to follow, and they may also be too tiny to follow robustly between frames.

5.2 Future Work

The optical flow may be estimated using several frames to improve the accuracy. On the other hand, too many frames cannot be used because the data must be fresh in a causal system. One possibility may be to use 3D tensors, i.e. structure tensors that also span over time, in the KLT tracker to incorporate time, but the computational effort may then increase drastically.


The second problem with the optical flow estimation is all the flat regions. Instead of trying to estimate the optical flow in these regions, the accurately triangulated points outside these regions may be used to interpolate the depth in the flat areas.

Another great problem is moving objects. One solution may be to model the road as a plane and investigate if the triangulation results in an object that lies on the road, is below the road or is above the road [22].

The fact that the surrounding environment is almost known, since cars usually drive forward on roads, could possibly be used in optical flow algorithms. The conservation of momentum of physical objects enforces that the motion remains about the same in consecutive frames. This information could be incorporated in the optical flow algorithms. The difficulty that may arise is that objects close to the camera will introduce large steps in the optical flow, and if the edges are not handled correctly, the previously estimated optical flow may be smeared out into the new frames. An example of this is when turning on a road and then continuing forward, which will first introduce lateral motion that will continue to exist in the following frames. Another idea to produce a dense optical flow is to use a sparse flow as reference points to guide a dense optical flow estimator [18].

Even if only a sparse optical flow has been used, there have still been problems estimating the distance to the objects in the middle of the scene. Instead of searching for the ultimate optical flow algorithm, effort can be put into filtering the depth estimates from the triangulation. The simple filter of removing estimates with large variance made Figure 4.9(b) quite regular. The more advanced Kalman filter produces both an estimate and a variance measure for each filtered point. The drawback is the computational power needed when tracking thousands of points. This technique could also be used to keep track of occluded objects.
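A minimal per-point sketch of such a filter is shown below, assuming the depth changes slowly between frames; the process and measurement noise values are assumptions and would need tuning against the actual triangulation noise.

class DepthKalman:
    # Scalar Kalman filter for the depth of one tracked point.
    def __init__(self, z0, p0=25.0, q=0.5, r=4.0):
        self.z = z0   # depth estimate [m]
        self.p = p0   # estimate variance [m^2]
        self.q = q    # process noise variance per frame [m^2]
        self.r = r    # triangulation (measurement) noise variance [m^2]

    def update(self, z_meas):
        # Predict: the depth is assumed to change slowly between frames.
        self.p += self.q
        # Correct with the new triangulated depth.
        k = self.p / (self.p + self.r)
        self.z += k * (z_meas - self.z)
        self.p *= 1.0 - k
        return self.z, self.p  # filtered depth and its variance

Points whose variance stays large after a few updates can then be discarded in the same way as with the simple variance filter above.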

In this master thesis the straightforward eight-point algorithm has been used to estimate the essential matrix. However, the essential matrix has only five degrees of freedom and can therefore be determined using just five corresponding point pairs [10]. This lets the RANSAC loop converge faster. To get even faster convergence, other methods using a modified RANSAC, such as PROSAC [1], can be used.
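For illustration, newer OpenCV releases expose exactly this combination, a five-point solver inside a RANSAC loop (cv2.findEssentialMat was not available in the OpenCV version used in this work). The point arrays, camera matrix and thresholds below are assumptions.

import cv2
import numpy as np

def estimate_egomotion(pts_prev, pts_curr, K):
    # pts_prev, pts_curr: (N, 2) float arrays of corresponding image points,
    # K: 3x3 camera calibration matrix.
    E, inlier_mask = cv2.findEssentialMat(pts_prev, pts_curr, K,
                                          method=cv2.RANSAC,
                                          prob=0.999, threshold=1.0)
    # Decompose E into rotation R and unit translation t using the RANSAC inliers.
    _, R, t, pose_mask = cv2.recoverPose(E, pts_prev, pts_curr, K, mask=inlier_mask)
    return R, t, pose_mask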

When estimating the essential matrix of the egomotion, the RANSAC loop suffers from problems of getting some of the corresponding points on moving objects counted as stationary background. One great dilemma is when driving in a traffic queue where the preceding car occupies a large portion of the image. Then the RANSAC loop will find the motion between our car and the preceding car instead of between our car and the stationary background.

Another approach to the problem is to use a global bundle adjustment method. Moving objects and erroneous point correspondences must still be handled, maybe by introducing further degrees of freedom in the equations. The drawback is, as usual, that the computational demands could be too tough for a real-time application where the model must be updated several times per second.

Finally, it should not be forgotten that a depth map can be estimated accurately if other sensor configurations are used. One example is to use a stereo camera. Another solution is to use a 3D laser scanner.


Bibliography

[1] Ondrej Chum and Jiri Matas. Matching with PROSAC - Progressive Sample Consensus. In Computer Vision and Pattern Recognition, IEEE Computer Society Conference on, volume 1, 2005.

[2] Autoliv Inc. Autoliv Inc. - What we do - Electronics, April 2010. URL: http://www.autoliv.com/wps/wcm/connect/autoliv/Home/What+We+Do/Electronics.

[3] Simon Baker, Stefan Roth, Daniel Scharstein, Michael J. Black, J.P. Lewis, and Richard Szeliski. A Database and Evaluation Methodology for Optical Flow. Computer Vision, IEEE International Conference on, 0:1–8, 2007. URL: http://vision.middlebury.edu/flow/.

[4] Stan Birchfield. Derivation of Kanade-Lucas-Tomasi Tracking Equation, 1997.

[5] Michael Donoser and Horst Bischof. Efficient Maximally Stable Extremal Region (MSER) Tracking. Computer Vision and Pattern Recognition, IEEE Computer Society Conference on, 1:553–560, 2006.

[6] FundingUniverse. Autoliv Inc., April 2010. URL: http://www.fundinguniverse.com/company-histories/Autoliv-Inc-Company-History.html.

[7] Minglun Gong and Yee-Hong Yang. Estimate Large Motions Using the Reliability-Based Motion Estimation Algorithm. International Journal of Computer Vision, 68(3):319–330, 2006.

[8] Chris Harris and Mike Stephens. A combined corner and edge detector. In Proceedings of the 4th Alvey Vision Conference, pages 147–151, 1988.

[9] Richard I. Hartley. In Defense of the Eight-Point Algorithm. IEEE Transactions on Pattern Analysis and Machine Intelligence, 19(6):580–593, 1997.

[10] Richard I. Hartley and Andrew Zisserman. Multiple View Geometry in Computer Vision. Cambridge University Press, ISBN: 0521540518, second edition, 2004.

[11] Manolis I. A. Lourakis and Antonis A. Argyros. SBA: A Software Package for Generic Sparse Bundle Adjustment. ACM Trans. Math. Software, 36(1):1–30, 2009.


[12] Bruce D. Lucas and Takeo Kanade. An iterative image registration technique with an application to stereo vision. In IJCAI’81: Proceedings of the 7th international joint conference on Artificial intelligence, pages 674–679, San Francisco, CA, USA, 1981. Morgan Kaufmann Publishers Inc.

[13] Jiri Matas, Ondrej Chum, Martin Urban, and Tomás Pajdla. Distinguished regions for wide-baseline stereo. Technical report, Center for Machine Perception, K333 FEE Czech Technical University, Prague, Czech Republic, Nov 2001. Research Report CTU-CMP-2001-33.

[14] Jiri Matas, Ondrej Chum, Martin Urban, and Tomás Pajdla. Robust Wide Baseline Stereo from Maximally Stable Extremal Regions. In Paul L. Rosin and A. David Marshall, editors, BMVC. British Machine Vision Association, 2002.

[15] David Nistér and Henrik Stewénius. Linear Time Maximally Stable Extremal Regions. In ECCV ’08: Proceedings of the 10th European Conference on Computer Vision, pages 183–196, Berlin, Heidelberg, 2008. Springer-Verlag.

[16] OpenCV. OpenCV (Open Source Computer Vision), May 2010. URL: http://opencv.willowgarage.com/.

[17] John C. Russ. The Image Processing Handbook, Fifth Edition (Image Processing Handbook). CRC Press, Inc., Boca Raton, FL, USA, 2006.

[18] Timothy M. A. Smith, David W. Redmill, C. Nishan Canagarajah, and David R. Bull. Dense optical flow from multiple sparse candidate flows using two pass dynamic programming. In 5th International Conference on Visual Information Engineering, 2008. VIE 2008., pages 203–208, 29 July - 1 August 2008.

[19] Fridtjof Stein. Efficient Computation of Optical Flow Using the Census Transform. In DAGM-Symposium, pages 79–86, 2004.

[20] Emanuele Trucco and Alessandro Verri. Introductory Techniques for 3-D Computer Vision. Prentice Hall PTR, Upper Saddle River, NJ, USA, 1998.

[21] Andrea Vedaldi and Brian Fulkerson. VLFeat: An Open and Portable Library of Computer Vision Algorithms. http://www.vlfeat.org/, 2008.

[22] Andreas Wedel, Annemarie Meissner, Clemens Rabe, Uwe Franke, and Daniel Cremers. Detection and Segmentation of Independently Moving Objects from Dense Scene Flow. In EMMCVPR ’09: Proceedings of the 7th International Conference on Energy Minimization Methods in Computer Vision and Pattern Recognition, pages 14–27, Berlin, Heidelberg, 2009. Springer-Verlag.

[23] Wikipedia. Corner detection — Wikipedia, the free encyclopedia, May 2010. URL: http://en.wikipedia.org/w/index.php?title=Corner_detection&oldid=360258982.


[24] Ramin Zabih and John Woodfill. Non-parametric Local Transforms for Computing Visual Correspondence. In ECCV ’94: Proceedings of the Third European Conference-Volume II on Computer Vision, pages 151–158, London, UK, 1994. Springer-Verlag.
