

Linköping Studies in Science and Technology

Thesis No. 1418

Pose Estimation and Structure Analysis

of Image Sequences

Johan Hedborg

Department of Electrical Engineering

Linköpings universitet, SE-581 83 Linköping, Sweden

Linköping, October 2009


Thesis No. 1418

Author
Johan Hedborg
Department of Electrical Engineering
Linköping University
SE-581 83 Linköping, Sweden

Copyright © 2009 Johan Hedborg

Hedborg, Johan
Pose Estimation and Structure Analysis of Image Sequences
ISBN 978-91-7393-516-6
ISSN 0280-7971

Typeset with LaTeX



Abstract

Autonomous navigation for ground vehicles has many challenges. Autonomous systems must be able to self-localise, avoid obstacles and determine navigable surfaces. This thesis studies several aspects of autonomous navigation, with a particular emphasis on vision, motivated by vision being a primary means of navigation in many higher biological organisms.

The key problem of self-localisation or pose estimation can be solved through analysis of the changes in appearance of rigid objects observed from different view points. We therefore describe a system for structure and motion estimation for real-time navigation and obstacle avoidance. With the explicit assumption of a calibrated camera, we have studied several schemes for increasing accuracy and speed of the estimation.

The basis of most structure and motion pose estimation algorithms is a good point tracker. However, point tracking is computationally expensive and can occupy a large portion of the CPU resources. In this thesis we show how a point tracker can be implemented efficiently on the graphics processor, which results in faster tracking of points and leaves the CPU available to carry out additional processing tasks.

In addition we propose a novel view interpolation approach that can be used effectively for pose estimation given previously seen views. In this way, a vehicle will be able to estimate its location by interpolating previously seen data.

Navigation and obstacle avoidance may be carried out efficiently using structure and motion, but only within a limited range from the camera. In order to increase this effective range, additional information needs to be incorporated, more specifically the location of objects in the image. For this, we propose a real-time object recognition method based on P-channel matching, which may be used to improve navigation accuracy at distances where structure estimation is unreliable.


Acknowledgments

I would like to thank...

First and foremost my supervisor, professor Michael Felsberg, for all the invaluable discussions, the great support throughout this work, and for having confidence in me. My second supervisor, Per-Erik Forssén, for great inspiration and valuable insights.

My fellow colleagues for nice company and rewarding discussions: Erik Jonsson, Johan Sunnegårdh, Fredrik Larsson, Vasileios Zografos, Björn Johansson, Johan Wiklund, Klas Nordberg, Gösta Granlund.

This work has been supported by the European Community's Seventh Framework Programme (FP7/2007-2013) under grant agreement no 215078, DIPLECS. It has also been supported by EC Grant IST-2003-004176 COSPAL.

My family, especially my mother, for the motivation through life.


Contents

1 Introduction
  1.1 Motivation
  1.2 This thesis
  1.3 Platform description
    1.3.1 RC car
    1.3.2 Sensors
  1.4 Paper overview
    1.4.1 Paper A: Real-time view-based pose recognition and interpolation for tracking
    1.4.2 Paper B: Real-Time Visual Recognition of Objects and Scenes Using P-Channel Matching
    1.4.3 Paper C: Fast and Accurate Structure and Motion
    1.4.4 Paper D: Real time camera ego-motion compensation and lens undistortion on GPU
    1.4.5 Paper E: KLT tracking implementation on the GPU
    1.4.6 Paper F: Synthetic Ground Truth for Feature Trackers

2 View based pose estimation
  2.1 Local histograms
  2.2 P-Channels
  2.3 Usage

3 Geometry based pose estimation
  3.1 Model assumptions
  3.2 Tracker
  3.3 Pose estimation
    3.3.1 Five point solution
  3.4 Robustness
    3.4.1 RANSAC
    3.4.2 Improvements of RANSAC
    3.4.3 Tracking outlier detection

4 The GPU
  4.1 GPU hardware overview
  4.2 APIs
  4.3 Usage in computer vision

Papers
  A Real-time view-based pose recognition and interpolation for tracking initialization
  B Real-Time Visual Recognition of Objects and Scenes Using P-Channel Matching
  C Fast and Accurate Structure and Motion Estimation
  D Real time camera ego-motion compensation and lens undistortion on GPU
  E KLT Tracking Implementation on the GPU
  F Synthetic Ground Truth for Feature Trackers


Chapter 1

Introduction

1.1 Motivation

To walk from A to B is an elementary human capability, and we do it without much effort. This is true for most creatures, and even the smallest ones can effortlessly navigate in varying terrains full of obstacles. This has been a crucial survival skill and is a result of millions of years of evolution. However, closer examination of this ability reveals very complex underlying mechanisms.

Trying to get a machine or a computer to do the same task has proven to be quite difficult. There have been some successful attempts in recent years, where cars have navigated in deserts and in city environments (the 2005 DARPA Grand Challenge [5] and the 2007 DARPA Urban Challenge [1]). The sensors that are used in these cases are quite different from the ones humans and other mammals use. While humans mainly use vision and balance to navigate, these systems use information from laser range sensors and very accurate GPS systems. These sensors are large, expensive, and consume a lot of energy.

Using vision for machine navigation has been extensively studied in recent years. However, due to the complexity of the processing involved, it is often less successful than methods based on other sensors. An example of this is the 2007 DARPA Urban Challenge, where all vision sensors were turned off in the final run.

1.2 This thesis

The aim of this thesis is to propose methods for solving the positioning and navigation problem using vision sensors. We are going to look deeper into two different approaches, each of them useful in different contexts or as complementary techniques for navigation.

The view based approaches to navigation apply the following problem formulation: given a set of previously seen images of a scene with known positions and orientations, the task is to find the position and orientation of a new view of the same scene. View based approaches are also used for object recognition, but instead of labeling the images or views with the pose, they are labeled with an object category, and the task is to find the category of an object in a new image or view.

Figure 1.1: Two robot cars from the 2007 DARPA Urban challenge.

The second approach we look into is the model based approach. In the case where the observer has no prior knowledge of what she is looking at, it can be useful to try to estimate the geometric shape of the observed object or scene. This shape estimation is based on the apparent variation of an object when viewed from different positions, called parallax. Many mammals use this principle for depth perception: they have two eyes that are positioned apart to give stereo vision. The separation of the eyes or cameras can also be done in the time domain; for example, some owls move their head from left to right and back again to enforce a stronger parallax, or equivalently, to widen the baseline of their stereo vision.

We will also touch upon the subject of object detection, which can be, and has been, successfully used in navigation and map building systems [3].

1.3 Platform description

1.3.1 RC car

Methods proposed in this thesis are to be implemented on a lightweight, cheap robotic platform. The recent development of the relatively large market for radio controlled (RC) model devices has produced a variety of interesting alternatives. These products are not uncommon in the research field of autonomous navigation, and researchers have modified and used model helicopters, airplanes and cars. Our research aims at ground vehicle navigation, and a four-wheel driven RC car called TRAXXAS E-MAXX suits this purpose nicely. Due to the conditions and speeds these cars are driven at, they have a very robust construction and can be controlled with high accuracy.

Figure 1.2: Upper: the original RC car (picture copyright of Traxxas); lower: after modification.

Our RC car has been modified for carrying additional hardware. The suspension system has been replaced, and the car bodywork has been replaced with an aluminum platform for carrying the hardware. The gear ratio has been changed, reducing the top speed from 45 km/h to less than 20 km/h. This also gives extra control and torque at lower speeds. The car has been modified to carry up to 7 kg. The modified car platform is shown in figure 1.2.

The interfaces for controlling the car are standardized and can easily be replaced with control modules interfaced to a computer. The car can be equipped with a fairly large laptop containing a powerful graphics card for demanding calculations.


Figure 1.3: Example of rolling shutter. Image from Wikipedia

1.3.2 Sensors

To be able to navigate and avoid obstacles, the car is equipped with a set of relatively cheap, low-powered sensors: a camera and an inertial measurement unit (IMU), and for longer drift-free navigation it can also be equipped with a small GPS unit.

The robot's most important sensor for navigation will be the vision sensor (camera). From the camera views, the system estimates structural properties of surfaces and distinguishes between drivable and non-drivable terrain. These estimates have a limited range and the camera has a limited frame rate, giving the platform an upper limit on the speed at which it can travel. If route planning is desired at higher speed, one has to use information other than geometry. This is where the view based approaches are more appropriate. The vision sensor combined with a view based method has been successfully used to recognize previously visited areas [9].

Advances in consumer cameras are swift, and cameras are getting higher resolution and lower noise. One interesting thing that is happening to consumer cameras is that they are switching from CCD sensors to CMOS sensors. The CMOS sensor exposes the image row by row, meaning that each row has a different start time for its exposure. This is called a rolling shutter and gives rise to a kind of "jello" effect, see figure 1.3. The model based approaches are all based on the assumption that an image is taken at a single time instant, making a normal consumer camera unsuitable. There is however a wide variety of industrial CCD cameras with a global shutter.

An inertial measurement unit (IMU) is a set of sensors working together to estimate its orientation and velocity. The sensors include accelerometers, micro-mechanical gyros and electronic compasses. The gravity direction is continuously measured by the accelerometers; the cardinal directions, however, are estimated from motion gradients and are thus subject to drift. To compensate for this, some IMUs are equipped with an electronic compass that acts as an on-line calibration tool.

A normal GPS gives a very coarse position estimate compared to the other sensors mentioned, but is on the other hand not subject to drift in the measurements.

These sensors complement each other in a nice way, especially the camera and the IMU, much like mammals combine their eyes with a balance organ in the inner ear (the vestibular system). Rapid camera motion causes motion blur and large displacements per frame, making position estimation difficult. When this happens, the IMU delivers very useful sensor data. The IMU can be used to estimate the position by integrating the acceleration values twice, but because of the accumulation of errors the drift is quite severe. Here the camera gives much better estimates.
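As a small illustration of why this drift accumulates (our own sketch, not from the thesis; a constant accelerometer bias stands in for real sensor errors), double integration turns even a tiny bias into a quadratically growing position error:

```python
import numpy as np

dt = 0.01                      # 100 Hz IMU samples
t = np.arange(0.0, 10.0, dt)   # ten seconds of standstill
bias = 0.05                    # constant accelerometer bias in m/s^2
acc = np.zeros_like(t) + bias  # true acceleration is zero, measured value is biased

vel = np.cumsum(acc) * dt      # first integration: velocity
pos = np.cumsum(vel) * dt      # second integration: position
print(pos[-1])                 # roughly 0.5 * bias * 10^2 = 2.5 m of drift after 10 s
```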

Even though the camera drifts far less than the IMU, it still suffers from some drift. If necessary, a GPS can be used to compensate for this.

Unlike RADAR, LIDAR and other active sensors, the discussed sensors do not send out signals for sensing; instead, they use what is already available. This has advantages: they consume less power on average, are smaller in size, and will not interfere with other sensors.

1.4 Paper overview

A short summary of each paper is given below, together with a motivation of its relevance and the contribution of the author.

1.4.1 Paper A: Real-time view-based pose recognition and interpolation for tracking

M. Felsberg and J. Hedborg. Real-time view-based pose recognition and interpolation for tracking initialization. Journal of Real-Time Image Processing, 2(2–3):103–116, 2007.

Summary: In this paper we propose a new approach to real-time view-based pose recognition and interpolation. Pose recognition is particularly useful for identifying camera views in databases, video sequences, video streams, and live recordings. All of these applications require a fast pose recognition process, in many cases video real-time. It should further be possible to extend the database with new material, i.e., to update the recognition system on-line.

The method that we propose is based on P-channels, a special kind of information representation which combines advantages of histograms and local linear models. Our approach is motivated by its similarity to information representation in biological systems, but its main advantage is its robustness against common distortions such as clutter and occlusion. The recognition algorithm consists of three steps:

1. low-level image features for color and local orientation are extracted in each point of the image

2. these features are encoded into P-channels by combining similar features within local image regions

3. the query P-channels are compared to a set of prototype P-channels in a database using a least-squares approach.

The algorithm is applied in two scene registration experiments with fish-eye camera data, one for pose interpolation from synthetic images and one for finding the nearest view in a set of real images. The method compares favorably to SIFT-based methods, in particular concerning interpolation.

The method can be used for initializing pose-tracking systems, either when starting the tracking or when the tracking has failed and the system needs to re-initialize. Due to its real-time performance, the method can also be embedded directly into the tracking system, allowing a sensor fusion unit to choose dynamically between frame-by-frame tracking and pose recognition.

Relevance and contribution: This paper presents a view interpolation approach that can be used effectively for pose estimation given previously seen views. In this way, a vehicle is able to estimate its location by interpolating previously seen data. The author contributed the realization of the real-time capabilities and part of the theoretical background. This work has been supported by EC Grants IST-2003-004176 COSPAL and IST-2002-002013 MATRIS.

1.4.2 Paper B: Real-Time Visual Recognition of Objects and Scenes Using P-Channel Matching

M. Felsberg and J. Hedborg. Real-time visual recognition of objects and scenes using P-channel matching. In Proc. 15th Scandinavian Conference on Image Analysis, volume 4522 of LNCS, pages 908–917, 2007.

Summary: In this paper we propose a new approach to real-time view-based object recognition and scene registration. Object recognition is an important sub-task in many applications, e.g. robotics, retrieval, and surveillance. Scene registration is particularly useful for identifying camera views in databases or video sequences. All of these applications require a fast recognition process and the possibility to extend the database with new material, i.e., to update the recognition system on-line.


The method that we propose is based on P-channels, a special kind of information representation which combines advantages of histograms and local linear models. Our approach is motivated by its similarity to information representation in biological systems, but its main advantage is its robustness against common distortions such as clutter and occlusion. The recognition algorithm extracts a number of basic, intensity invariant image features, encodes them into P-channels, and compares the query P-channels to a set of prototype P-channels in a database. The algorithm is applied in a cross-validation experiment on the COIL database, resulting in nearly ideal ROC curves. Furthermore, results from scene registration with a fish-eye camera are presented.

Relevance and contribution: Navigation and obstacle avoidance may be carried out efficiently from structure and motion, but only at a limited range from the camera. In order to increase this effective range, additional information needs to be incorporated, more specifically the location of objects in the image. For this, we propose a real-time object recognition method based on P-channel matching, which may be used to improve navigation accuracy at distances where structure estimation is unreliable. The author contributed to the real-time capability of the system, and to some theoretical parts. This work has been supported by EC Grants IST-2003-004176 COSPAL and IST-2002-002013 MATRIS.

1.4.3 Paper C: Fast and Accurate Structure and Motion

Johan Hedborg, Per-Erik Forssén, and Michael Felsberg. Fast and accurate structure and motion estimation. In ISVC'09, December 2009.

Summary: This paper describes a system for structure-and-motion estimation for real-time navigation and obstacle avoidance. We demonstrate a technique to increase the efficiency of the 5-point solution to the relative pose problem. This is achieved by a novel sampling scheme, where we add a distance constraint on the sampled points inside the RANSAC loop, before calculating the 5-point solution. Our setup uses the KLT tracker to establish point correspondences across time in live video. We also demonstrate how an early outlier rejection in the tracker improves performance in scenes with plenty of occlusions. This outlier rejection scheme is well suited to implementation on graphics hardware. We evaluate the proposed algorithms using real camera sequences with fine-tuned bundle adjusted data as ground truth. To strengthen our results we also evaluate using sequences generated by a state-of-the-art rendering software. On average we are able to reduce the number of RANSAC iterations by half and thereby double the speed.

Relevance and contribution: The key problem of self-localisation or pose estimation may be solved through analysis of the changes in appearance of rigid objects observed from different view points. We therefore describe a system for structure and motion estimation for real-time navigation and obstacle avoidance. With the explicit assumption of a calibrated camera, we have studied several schemes for increasing accuracy and speed of the estimation. The author contributed the majority of this work. This work has been supported by the European Community's Seventh Framework Programme (FP7/2007-2013) under grant agreement no 215078, DIPLECS.

1.4.4 Paper D: Real time camera ego-motion compensation and lens undistortion on GPU

Johan Hedborg and Björn Johansson. Real time camera ego-motion compensation and lens undistortion on GPU. Technical report, Linköping, Sweden, April 2006.

Summary: This paper describes a GPU implementation for simultaneous camera ego-motion compensation and lens undistortion. The main idea is to transform the image under an ego-motion constraint so that tracked points in the image, which are assumed to come from the ego-motion, map as close as possible to their average position in time. The lens undistortion is computed simultaneously. We compare the performance with and without compensation using two measures: mean time difference and mean statistical background subtraction.

Relevance and contribution: Navigation systems based on vision are limited by the computational ability of the system, so having certain algorithms running on separate hardware is important. Here we look at lens distortion and tracking on such hardware. The author contributed to theory and implementation. This work has been supported by the Swedish Road Administration (SRA) within the IVSS project.

1.4.5 Paper E: KLT tracking implementation on the GPU

Johan Hedborg, Johan Skoglund, and Michael Felsberg. KLT tracking implementation on the GPU. In Proceedings SSBA 2007, Linköping, Sweden, March 2007.

Summary: The GPU is the main processing unit on a graphics card. A modern GPU typically provides more than ten times the computational power of an ordinary PC processor. This is a result of the high demands for speed and image quality in computer games.

This paper investigates the possibility of exploiting this computational power for tracking points in image sequences. Point tracking is used in many computer vision tasks, such as tracking moving objects, structure from motion, face tracking, etc. The algorithm was successfully implemented on the GPU and a large speed-up was achieved.


Relevance and contribution: For some navigation applications it is important to track points through image sequences. This can be computationally heavy and occupy a large portion of the computation time for the main processor. In this paper we show how one can use additional hardware to do this faster and to free the main processor for other tasks. The author contributed the majority of this work.

This work has been supported by EC Grant IST-2003-004176 COSPAL.

1.4.6 Paper F: Synthetic Ground Truth for Feature Trackers

Johan Hedborg and Per-Erik Forssén. Synthetic ground truth for feature trackers. In Proceedings SSBA 2008, pages 59–62, Lund, 2008.

Summary: Good data sets for evaluation of computer vision algorithms are important for the continued progress of the field. There exist good evaluation sets for many applications, but there are others for which good evaluation sets are harder to come by. One such example is feature tracking, where there is an obvious difficulty in the collection of data. Good evaluation data is important both for comparing different algorithms and for detecting weaknesses in a specific method. All image data is a result of light interacting with its environment. These interactions are so well modelled in rendering software that sometimes not even the sharpest human eye can tell the difference between reality and simulation. In this paper we thus propose to use a high quality rendering system to create evaluation data for sparse point correspondence trackers.

Relevance and contribution: It is hard to evaluate a navigation algorithm in the absence of exact ground truth. In the paper we argue that a modern program for generating 3D graphics can offer this. The author contributed the majority of this work.


Chapter 2

View based pose estimation

Here we will define pose estimation as the problem of determining the relative position between an object and an observer, the observer in this context being a camera or a platform with a camera mounted on it. The object can be a smaller object visible in the camera, or it can be the entire environment or scene, in which case the problem is more about positioning the camera than the object. To refer specifically to the second case, it is common to use the term ego-motion.

Ego-motion estimation can be done in various ways; in this chapter we will focus on view based pose estimation. Another term for this is appearance based methods. View based methods are characterized by having built up some high dimensional space of previously seen views, and when a new view is acquired, the task is to localize the new view in this space. It is important to have a good representation of the view or object in order to have a nicely partitioned space when trying to position the new observation. The representation should not be too discriminative, because then it is hard to find matching data, nor should it be too general, because then everything will blend together, making it hard to distinguish between matches and mismatches.

View based pose estimation is usually combined with machine learning methods, most often supervised learning. In supervised learning, a set of training data and a corresponding set of desired outputs is used to train the system. The output can be in the form of a view or an object category label. The task of the trained system is to estimate or predict the output for any new valid sample that is fed into the system. This means it has to generalize from its learning set to a previously unseen evaluation set in a way that produces sensible output in all cases.


2.1 Local histograms

Classification and matching can be based on a statistical representation or distributions of image properties, such as color, edges (local orientation), corners or frequencies. Statistical representations can be used to generalize and find similarities between images in a more robust and efficient way than looking at the raw pixels. An additional advantage is that they can often be handled and stored more efficiently than the raw image, while still maintaining relevant information about the original image.

A general and efficient method for non-parametric density representation is to divide the data into intervals. This representation of a distribution is known as a histogram in statistics and computer vision. The intervals are usually evenly spread and are called bins.

Histograms give us information about what can be found in the image but tell us nothing about where in the image it is located. A vague spatial knowledge is valuable for certain types of applications; pose estimation is one example. Spatial information can be added by computing several local histograms of parts of the image. The size of the histogram representation grows linearly with the number of local regions, but usually stays within the range of efficient handling.
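As a minimal sketch of this idea (our own illustration, assuming NumPy and a single-channel image of local orientation angles as input; the function and parameter names are not from the thesis), one histogram is computed per cell of a grid laid over the image:

```python
import numpy as np

def local_orientation_histograms(orientation, grid=(4, 4), bins=8):
    """Compute one orientation histogram per cell of a grid laid over the image.

    orientation: 2D array of local orientation angles in [0, pi).
    Returns an array of shape (grid[0], grid[1], bins).
    """
    h, w = orientation.shape
    hists = np.zeros(grid + (bins,))
    cell_h, cell_w = h // grid[0], w // grid[1]
    for i in range(grid[0]):
        for j in range(grid[1]):
            cell = orientation[i * cell_h:(i + 1) * cell_h,
                               j * cell_w:(j + 1) * cell_w]
            hists[i, j], _ = np.histogram(cell, bins=bins, range=(0.0, np.pi))
    return hists
```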

The ability to make a generalized description of a region is an important property of local histograms and has made them quite common in some fields of computer vision. Regions that are viewed from positions far apart are generally subject to geometrical distortions or occlusions, making them hard to match. In these situations the local histogram based SIFT descriptor [17] has been very successful. The matching capabilities of SIFT have been used for pose estimation with large view angles (wide baseline stereo) and object recognition, among other things.

Humans are difficult to recognize, especially when they are viewed at low resolution. Here, the local histogram based HOG (histogram of oriented gradients) descriptor is, at the current time, state of the art.

2.2 P-Channels

In the case of a multidimensional representation, where we use several properties, e.g. gradients, color or intensity, the size of the data grows exponentially with the number of dimensions. Resolution has to be weighed against the increased storage complexity. Instead of increasing the bin count for extra resolution, it is possible to add an offset to each bin. By adding information in this way, we can reduce the growth rate from exponential to linear. This representation is called the P-Channel representation.

The performance penalty for creating a P-Channel representation instead of a histogram representation is not high. Only two simple operations are added to the local histogram scheme: compute the distance to the center of the bin, and accumulate that value into the offset value for the bin.

2.3 Usage

In [6] we successfully use the P-Channels for pose estimation. This can be done if we have a set of views with known poses. Here the representation is used to interpolate between the nearest views.

Here we show the strength of the P-Channels by having a multi-dimensional representation running in real-time for a large set of views.

The P-Channel representation can also be used in object recognition, as shown in [7]. Objects are matched by calculating the distance between their respective P-Channel representations. Several distance measures have been examined; more information on this subject can be found in [15].


Chapter 3

Geometry based pose estimation

We have shown that it is possible to estimate the pose from a set of previously seen images with known camera positions. In this chapter we will look into methods for solving the problem without this prior knowledge. Instead we use constraints that exist for images showing a static scene or rigid object. The methods and distance metrics used in the previous chapter are also based on geometry in a general sense. For clarity: when we use the term geometry-based pose estimation, we mean pose estimation based on Euclidean or projective geometry.

3.1 Model assumptions

To make use of object or scene geometry when finding the relative pose between these and the camera, we need to know how the geometry is transformed when it is projected onto the image plane.

The most commonly used camera model is the pinhole camera model. The camera is modeled as a box with an infinitesimal hole in it; the image is projected onto one side of the box by light traveling through the pinhole onto the opposite side, see figure 3.1.

A practical deviation from this model is the size of the pinhole. In reality, a small hole does not allow enough light to reach the sensor within a reasonable exposure time. Instead, a set of lenses focuses light through a point (the focal point), producing a projection or image in a similar manner to the pinhole camera. There are however some differences between the two approaches. In particular, there can be some radially symmetric distortions in the projected image. These arise from the shape of the lenses and are called radial distortion. The size and position of the sensor can also vary, influencing the projection of the image.


Figure 3.1: Pinhole camera model

Some of these deviations can be estimated and compensated for; this process is called camera calibration.

There is a lot of research aiming at automatic camera calibration. This is very important in the case where the images are from unknown cameras. However, for our navigation task, the camera’s internal structure is fixed. We have the ability to do an offline calibration in a controlled environment and, to the best of our knowledge, this is the preferable choice in these cases.
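As a minimal sketch of the calibrated pinhole model (the standard formulation, written by us with our own symbol names; the thesis does not spell out the projection equations here), a world point is rotated and translated into the camera frame, divided by its depth, and mapped through the intrinsic calibration matrix:

```python
import numpy as np

def project_pinhole(X_world, R, t, K):
    """Project 3D world points into pixel coordinates with a calibrated pinhole camera.

    X_world: (N, 3) array of 3D points.
    R, t:    camera rotation (3x3) and translation (3,), world-to-camera.
    K:       3x3 intrinsic calibration matrix (focal lengths and principal point).
    """
    X_cam = (R @ X_world.T).T + t          # transform into the camera frame
    x = X_cam / X_cam[:, 2:3]              # perspective division by depth
    pixels = (K @ x.T).T                   # apply the intrinsic calibration
    return pixels[:, :2]

# Example: one point two meters in front of an ideal 500-pixel focal length camera.
K = np.array([[500.0, 0.0, 320.0],
              [0.0, 500.0, 240.0],
              [0.0, 0.0, 1.0]])
print(project_pinhole(np.array([[0.1, 0.2, 2.0]]), np.eye(3), np.zeros(3), K))
```

In a lens camera, radial distortion would additionally be applied to the normalized coordinates before multiplying by K; calibration estimates both K and these distortion parameters.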

3.2 Tracker

The term tracker has a wide range of meanings in the computer vision community. In this thesis we will define it as a point tracker. More precisely: given a point in one image, tracking is the task of finding its corresponding point in a second image. By corresponding point we refer to the image point which is a projection of the same 3D point in the observed world. It should however be pointed out that when locating the point, we use textural information from its surrounding region, and certain properties of this region are critical for success.

One of the most commonly used point trackers is the Kanade-Lucas-Tomasi tracker (KLT tracker) [18]. It is an iterative gradient search that tries to minimize the difference over a square window surrounding the point. The gradient search in the KLT tracker can be formulated to search for a variety of parameters. In its simplest form it is a pure translation (2D) tracker, but it can also be configured to handle full affine (6D) transformations of the rectangular window around the point. The pure translation tracker is more stable when the changes in camera motion are small. However, when the camera motion between tracked frames becomes too big, the changes in rotation, scale and projective distortion will have a significant impact on accuracy.

Illumination changes also degrade tracker performance. To avoid degradation of the tracking result, it is common to update the patch by replacing the original one with a patch that has been seen later in the sequence. Exploiting this scheme to the maximum, one would replace the patch every frame with the patch from the previous frame. This will however give rise to accumulation of tracking error (drift). This trade-off between drift and tracking precision depends on things like sensor noise, observed structure and camera trajectory, and has to be taken into consideration. In [11] we replace the patch every frame due to the very dynamic image produced by a camera moving forward; in [12], however, the patch is replaced every 50 frames because the non-translation changes are much smaller.
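As a minimal sketch of one translation-only KLT update (a standard Lucas-Kanade step written by us for illustration, assuming NumPy, grayscale float images and an interior point; this is not the thesis implementation), the window gradients form a 2x2 system whose solution is the displacement:

```python
import numpy as np

def klt_translation_step(I, J, x, y, half=7):
    """One Lucas-Kanade update of a point's displacement from image I to image J.

    I, J: grayscale float images; x, y: integer point position in I;
    half: half-width of the square tracking window.
    Returns the displacement (dx, dy) that minimizes the windowed intensity
    difference to first order; in practice this step is iterated and run over
    a scale pyramid.
    """
    win_I = I[y - half:y + half + 1, x - half:x + half + 1]
    win_J = J[y - half:y + half + 1, x - half:x + half + 1]
    gy, gx = np.gradient(win_I)                # spatial image gradients
    err = win_I - win_J                        # temporal difference over the window
    # Normal equations of the windowed least-squares problem: G * d = b
    G = np.array([[np.sum(gx * gx), np.sum(gx * gy)],
                  [np.sum(gx * gy), np.sum(gy * gy)]])
    b = np.array([np.sum(gx * err), np.sum(gy * err)])
    dx, dy = np.linalg.solve(G, b)
    return dx, dy
```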

3.3 Pose estimation

In the case where we have a calibrated camera there is a simple dependency between the camera pose, the observed geometry (3D points) and the projected points (2D points). Consider two cameras viewing a 3D object, for which we have found a set of corresponding points in the two views. The pose change between the two views consists of a rotation $R$ and a translation $t$. The 3D points on the object are $p^w_i = [x_i, y_i, z_i]^T$ and their projections in the two cameras are $p^1_i = [u^1_i, v^1_i, 1]^T$ and $p^2_i = [u^2_i, v^2_i, 1]^T$. The upper index defines which coordinate system the values are in: $w$ stands for world and 1, 2 are camera numbers. The depths of point $i$ from the two cameras are $d^1_i$ and $d^2_i$. Their relations are

$$p^1_i = \frac{1}{d^1_i} p^w_i, \qquad p^2_i = \frac{1}{d^2_i} R^T (p^w_i - t). \tag{3.1}$$

The only known variables in the equations are $p^1_i$ and $p^2_i$. Fortunately, the number of parameters in $R$ and $t$ is limited and independent of the number of points we choose to observe. The depths $d^1_i$, $d^2_i$ depend on $R$, $t$ and $z_i$. This means that for every new point we add, we get 4 known and 3 unknown variables, and eventually (for most cases) the system will have a unique solution. However, this holds only if all 3D points are stationary with respect to each other.

If the change in camera position is relatively small, the projective distortions will be small, making the system ill conditioned. In these cases it is better to discard geometrical information and to use a simpler model [12].


Figure 3.2: An object viewed from two cameras.

To solve the pose problem, we will use an exact method: the minimal case method. This implies that the system will not be overdetermined and a minimal amount of information is used to solve it. There is a good reason for this, which is further discussed in section 3.4.1.

3.3.1 Five point solution

The minimal number of point projections needed to solve for the relative camera motion between two cameras is five, which we will now show. The pose difference between the two cameras is again expressed as a translation $t$ and a rotation $R$, see figure 3.3. The two camera centers are $c_1$ and $c_2$. Let an arbitrary 3D point in front of the cameras be denoted $p^w_i$, and let the vectors between this point and the two camera centers be $(c^w_1 - p^w_i)$ and $(c^w_2 - p^w_i)$. Because all these vectors lie in the same plane, the cross product between $t$ and $(c^w_2 - p^w_i)$ (the normal of the plane) is perpendicular to $(c^w_1 - p^w_i)$:

$$(c^w_1 - p^w_i) \cdot \big(t \times (c^w_2 - p^w_i)\big) = 0. \tag{3.2}$$

By using the fact that $p^1_i$ is parallel to $(c^w_1 - p^w_i)$ and that $Rp^2_i$ is parallel to $(c^w_2 - p^w_i)$, it is possible to rewrite 3.2 as

$$p^{1T}_i \big(t \times R p^2_i\big) = 0. \tag{3.3}$$

Figure 3.3: An object viewed from two cameras.

Introducing the antisymmetric matrix

$$T = \begin{bmatrix} 0 & -t_z & t_y \\ t_z & 0 & -t_x \\ -t_y & t_x & 0 \end{bmatrix} \tag{3.4}$$

for the cross product yields $t \times a = Ta$. Plugging 3.4 into 3.3 gives

$$p^{1T}_i T R\, p^2_i = 0, \tag{3.5}$$

where $TR$ is called the essential matrix, usually denoted $E$. The rotation matrix $R$ has 3 degrees of freedom (DOF). The translation can only be recovered up to an unknown scale factor; setting the length of the translation vector to one gives 2 DOF for $T$, resulting in 5 DOF for $E$ in total. With one point correspondence $p^1_i$ to $p^2_i$ we get the equation:

$$p^{1T}_i E\, p^2_i = u^1 u^2 e_{11} + u^1 v^2 e_{12} + u^1 e_{13} + v^1 u^2 e_{21} + v^1 v^2 e_{22} + v^1 e_{23} + u^2 e_{31} + v^2 e_{32} + e_{33} = 0. \tag{3.6}$$

Each correspondence thus gives one such constraint on the entries of $E$; five correspondences, matching the 5 DOF of $E$, are therefore the minimal number needed.


The 3D structure is also uniquely determined by the five points: from the essential matrix $E$ we get the rotation and the translation, and we obtain the five 3D coordinates by triangulation, i.e. the intersection of the rays along $p^1_i$ and $Rp^2_i + t$. Because the projected points and the two camera centers all lie in the same plane, the 3D points can be determined exactly.

Equation 3.6 constitutes a non-linear problem and there exist a number of solutions, iterative [14] and exact [16, 19]. The method in [19] is the state of the art in terms of both speed and accuracy and is the one we use in [11].

The geometrical pose problem can also be solved for the non-calibrated case. In this case, the minimal number of points needed is 7 and the solution is more straightforward. However, it has a number of drawbacks: it is less constrained, and it has been shown that for some camera motions it gives worse estimates [19]. Planar structure is a degenerate case in the uncalibrated setting; this degeneracy does not exist for the calibrated case [19].

If the depth is discarded in (3.1), as is done in [12], the equations for the pose parameters become linear. This means that they can be solved very easily, both in the exact case and in the overdetermined case, without having to rely on a constraint on the solution for stability.
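To make the epipolar constraint concrete, here is a small numerical sketch (our own, not from the thesis) that builds $E = TR$ from a known rotation and translation and verifies that synthetic correspondences satisfy equation 3.5:

```python
import numpy as np

def skew(t):
    """Antisymmetric matrix T such that T @ a == np.cross(t, a), cf. eq. 3.4."""
    return np.array([[0.0, -t[2], t[1]],
                     [t[2], 0.0, -t[0]],
                     [-t[1], t[0], 0.0]])

# Ground-truth relative pose: small rotation about the y-axis, sideways translation.
angle = 0.1
R = np.array([[np.cos(angle), 0.0, np.sin(angle)],
              [0.0, 1.0, 0.0],
              [-np.sin(angle), 0.0, np.cos(angle)]])
t = np.array([1.0, 0.0, 0.0])
E = skew(t) @ R                       # essential matrix E = T R

# Synthetic 3D points in front of both cameras, projected to normalized coordinates.
Pw = np.random.rand(5, 3) * [2, 2, 2] + [-1, -1, 4]
p1 = Pw / Pw[:, 2:3]                  # camera 1 coincides with the world frame
Pc2 = (R.T @ (Pw - t).T).T            # eq. 3.1: p2 = (1/d2) R^T (pw - t)
p2 = Pc2 / Pc2[:, 2:3]

# Each correspondence satisfies the epipolar constraint p1^T E p2 = 0 (eq. 3.5).
residuals = np.einsum('ni,ij,nj->n', p1, E, p2)
print(np.allclose(residuals, 0.0))    # True, up to floating point error
```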

3.4 Robustness

3.4.1 RANSAC

To track a set of points through a sequence of images is a tricky task, and it is not uncommon that a significant portion of the points are subject to high levels of noise or totally miss their correct positions. These points, called outliers, can have a severe impact on the solution of the pose problem, and it is important to detect and discard them. There are different techniques to deal with outliers. In [12] a preliminary solution is calculated and this is used to detect the points that deviate most from the solution. This scheme can be run for several iterations. A more commonly used method is RANdom SAmple Consensus (RANSAC) [8]. Outliers are here detected by first constructing a hypothesis based on a small number of samples randomly selected among all data. The smaller the sample, the better, because it lowers the probability of having an outlier in the hypothesis set. The hypothesis is then tested against all other data. This is run for a number of iterations, and the hypothesis that fits most of the data is picked as the best one.
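A minimal generic RANSAC loop (our own sketch; the fit and score functions are hypothetical stand-ins for, e.g., the five-point solver and a point-to-epipolar-line error) looks roughly like this:

```python
import random

def ransac(data, fit, score, sample_size, iterations, threshold):
    """Generic RANSAC: return the hypothesis with the largest consensus set.

    fit(sample) builds a model hypothesis from a minimal sample (e.g. five
    correspondences); score(model, datum) returns an error for one datum.
    """
    best_model, best_inliers = None, []
    for _ in range(iterations):
        sample = random.sample(data, sample_size)   # minimal random sample
        model = fit(sample)
        if model is None:
            continue
        inliers = [d for d in data if score(model, d) < threshold]
        if len(inliers) > len(best_inliers):        # keep the hypothesis most data agrees with
            best_model, best_inliers = model, inliers
    return best_model, best_inliers
```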

3.4.2 Improvements of RANSAC

There are many schemes for making RANSAC more efficient, either by lowering the number of iterations [21] or by refining a new best hypothesis [2]. In [11] we present an efficient and simple addition to RANSAC that lowers the number of RANSAC iterations. It is based on two assumptions: (i) 3D points that are far from each other are in general subject to more parallax and are thus better for estimating the essential matrix; (ii) if two points lie far from each other in image space, they are also likely to be far from each other in 3D space. The idea is simply to make sure that the randomly selected points do not lie too close to each other. In [11] we show that, on average, this simple scheme reduces the iteration count by half and increases the speed by almost a factor of two.
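A sketch of such a distance-constrained sampler (our own illustration; the threshold value and the retry-then-fall-back strategy are placeholders, not the exact scheme in [11]):

```python
import random
import math

def sample_spread_out(points, sample_size=5, min_dist=50.0, max_tries=100):
    """Draw a minimal sample whose image points are pairwise at least min_dist apart.

    points: list of (x, y) image coordinates of the tracked correspondences.
    Falls back to an unconstrained sample if no spread-out set is found.
    """
    for _ in range(max_tries):
        sample = random.sample(points, sample_size)
        spread_out = all(math.dist(a, b) >= min_dist
                         for i, a in enumerate(sample) for b in sample[i + 1:])
        if spread_out:
            return sample
    return random.sample(points, sample_size)
```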

3.4.3 Tracking outlier detection

Tracking may fail if previously visible areas in the scene are occluded. This mainly happens when there are large depth differences near a tracked point, and it can cause large errors in the tracking result. Commonly used interest-point detectors will not distinguish a nice high contrast textured planar region from a couple of branches on a tree having the same contrast to the background. But in terms of tracking precision there is a huge difference: the former gives a much more reliable tracking result.

This kind of outlier can be efficiently handled by an early outlier detection in the tracker stage. After the tracker has tracked a point from the first image to the second, it can be run a second time. This time it is initialized with the image data from the second image, at the point in the second image obtained from the first tracking. In the second run the tracker tries to find its way back, and if it fails, it is likely that something has gone wrong near or on the point. This method has been evaluated in [11]. The paper shows a clear performance gain in the case where we have occlusion boundaries or other disturbances around the tracked points. One of the nice aspects of this technique is its simplicity. If the tracking implementation exists, it is only a matter of running it backwards and checking the distance between the initial points and the track-retracked points to determine whether they are outliers or inliers. This doubles the computational time for tracking; however, the tracking is very well suited for a GPU implementation, as shown in [13].
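A minimal sketch of this forward-backward check (our own code, assuming a hypothetical track(image_a, image_b, point) function that returns the tracked position or None):

```python
def forward_backward_check(track, img1, img2, point, max_error=1.0):
    """Track a point img1 -> img2, then back img2 -> img1, and flag it as an
    outlier if it does not return close to where it started."""
    forward = track(img1, img2, point)
    if forward is None:
        return None, False
    backward = track(img2, img1, forward)
    if backward is None:
        return forward, False
    err = ((backward[0] - point[0]) ** 2 + (backward[1] - point[1]) ** 2) ** 0.5
    return forward, err < max_error          # (tracked position, is_inlier)
```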


Chapter 4

The GPU

Here we will briefly discuss some properties of Graphics Processing Units (GPUs). This hardware has previously been dedicated to computer graphics, but has lately gone through a change in both hardware and software, enabling it to do general computation. In computer graphics every pixel of the screen is handled individually, and the generation of an image can easily be parallelized. This has steered the development of GPUs towards massively parallel architectures, which also makes them suitable for many computer vision algorithms.

This chapter has a different layout compared to other GPGPU texts. Initially we will focus on the differences between a central processing unit (CPU) and a Graphics Processing Unit¹. Understanding the differences between the two hardware architectures is the key to understanding which algorithms are suitable, or how to change an algorithm to fit the architecture. In cases where the algorithms are well suited, the speed increase can be 10-100-fold compared to a modern quad-core CPU.

¹ The name Graphics Processing Unit is misleading here, because in the context of computer vision it has little or nothing to do with graphics; it should be seen more as a massively parallel general purpose processor. We will still call it the GPU because of its history and the fact that it is more or less standard to use the name.

4.1 GPU hardware overview

Modern graphics cards have 240 processor cores, and it could be expected that 240 concurrently running threads would be enough to fully utilize the GPU. Unfortunately this is not the case. Instead we actually need thread counts in the order of thousands to reach the full potential of the GPU. To understand why so many threads are needed, it is essential to know the architectural trade-offs between size and functionality in the design of a GPU. A modern GPU consists of approximately 1.4 billion transistors, twice the transistor count of a modern CPU. Given the number of processors in each case, a CPU core (including caches) occupies approximately 30 times more transistors than a GPU core.

The cache system on a CPU is complex and built from a hierarchy of caches, from large, slow caches to small, fast ones. Data is transferred from memory to the larger caches and then moves upwards in the hierarchy to the smallest, fastest cache, ending up in a CPU core. The CPU caches can occupy up to half of the transistor count. To be able to fit all the computational units onto a GPU, major modifications are made to the GPU caches. Some memory accesses use small special purpose caches, but the majority of memory accesses are completely uncached (when using the GPU for general purpose calculations). The drawback of this is memory latency, meaning that the GPU, after having issued a memory fetch instruction, has to wait a certain time before getting the result, every time it accesses data. In contrast, the CPU can efficiently reuse data if it exists in any of its caches.

The GPU instruction pipeline is in general deeper than the CPU pipeline, and the implication of this is that the instruction latency is longer. A simple example: a GPU can execute one instruction every clock cycle, but it takes 12 cycles before it gets the result, so in the case where an instruction depends on the previous one, it has to wait those 12 cycles.

On a related topic, if the CPU finds a dependency between instructions, it can reorder the instructions and execute a non-dependent instruction while waiting for the result. This mechanism is called out-of-order execution and has been part of CPUs for several generations. However, the GPU core is a simpler in-order execution unit and does not have this ability.

Another way of reducing the transistor count is to share parts between cores. In the GPU, the fetch/dispatch of an instruction is shared among a set of cores, making it a Single Instruction Multiple Data (SIMD) unit. A SIMD unit must execute the same instruction on a set of cores, meaning that the same program must be run for all threads.

These are some of the trade-offs made to be able to shrink the GPU core size, and they all result in latency: access to memory without a cache gives latency, deep pipelines give latency, and in-order execution gives latency. We can hide all this latency by adding threads: at each occasion where some thread has to wait, we simply switch to another thread that is ready to execute. This is exactly what the GPU does to hide all latencies, and it is the explanation of why we need so many more threads than cores when programming a GPU.

4.2 APIs

The GPU can be programmed through two types of Application Programming Interfaces (APIs): either a graphics API or an API for general purpose computing on the GPU. CUDA and OpenCL are examples of APIs that enable the GPU to easily be used for computing more general things than computer graphics [4]. These APIs are C-like programming languages that enable the programmer to write highly parallel programs for the GPU. A modern GPU has hardware parts that are dedicated to general purpose computing only. These parts allow for inter-thread communication, scattered memory writes and other functionality, and are only accessible through APIs like CUDA or OpenCL.

The graphics APIs are for computer generated 3D images, and they often require a layer of code to be written in order to use the GPU for more general tasks. However, for computer vision there is some useful hardware functionality that can only be reached from the graphics APIs. Functionality such as sampling pixels from scale pyramids and in scale space with anisotropic kernels exists in the graphics APIs but not in the CUDA or OpenCL APIs.

4.3 Usage in computer vision

There are many areas of computer vision where the GPU is well suited. In [13] we show how to efficiently implement a KLT point tracker [18]. We demonstrate the importance of having many threads by comparing it with an implementation that uses fewer threads [20]. We also show the speedup that can be achieved when the GPU is used for correction of lens distortion. The requirement of many threads makes the GPU suitable for some tasks and less suitable for others. In particular, in [11] we argue that it makes sense to have an outlier detection that is based on a second run of the tracker on the GPU, because the rest of the system tasks are not well suited for a GPU implementation.


Bibliography

[1] Special issue on the DARPA Urban Challenge autonomous vehicle competition. 8(4):703, December 2007.

[2] Ondřej Chum, Jiří Matas, and Josef Kittler. Locally optimized RANSAC. In G. Goos, J. Hartmanis, and J. van Leeuwen, editors, DAGM 2003: Proceedings of the 25th DAGM Symposium, number 2781 in LNCS, pages 236–243, Berlin, Germany, September 2003. Springer-Verlag.

[3] N. Cornelis, B. Leibe, K. Cornelis, and L. Van Gool. 3D city modeling using cognitive loops. In Proc. Third International Symposium on 3D Data Processing, Visualization, and Transmission, pages 9–16, June 14–16, 2006.

[4] NVIDIA Corporation. Web site: www.nvidia.com/object/cuda_home.html.

[5] Carl D. Crane. The 2005 DARPA Grand Challenge. In Proc. International Symposium on Computational Intelligence in Robotics and Automation CIRA 2007, June 20–23, 2007.

[6] M. Felsberg and J. Hedborg. Real-time view-based pose recognition and interpolation for tracking initialization. Journal of Real-Time Image Processing, 2(2–3):103–116, 2007.

[7] M. Felsberg and J. Hedborg. Real-time visual recognition of objects and scenes using P-channel matching. In Proc. 15th Scandinavian Conference on Image Analysis, volume 4522 of LNCS, pages 908–917, 2007.

[8] M.A. Fischler and R.C. Bolles. Random sample consensus: a paradigm for model fitting, with applications to image analysis and automated cartography. Communications of the ACM, 24(6):381–395, 1981.

[9] Raia Hadsell, Pierre Sermanet, Jan Ben, Ayse Erkan, Jeff Han, Urs Muller, and Yann LeCun. Online learning for offroad robots: Spatial label propagation to learn long-range traversability. In Robotics: Science and Systems III, June 2007.

[10] Johan Hedborg and Per-Erik Forssén. Synthetic ground truth for feature trackers. In Proceedings SSBA 2008, pages 59–62, Lund, 2008.

[11] Johan Hedborg, Per-Erik Forssén, and Michael Felsberg. Fast and accurate structure and motion estimation. In ISVC'09, December 2009.

[12] Johan Hedborg and Björn Johansson. Real time camera ego-motion compensation and lens undistortion on GPU. Technical report, Linköping, Sweden, April 2006.

[13] Johan Hedborg, Johan Skoglund, and Michael Felsberg. KLT tracking implementation on the GPU. In Proceedings SSBA 2007, Linköping, Sweden, March 2007.

[14] Uwe Helmke, Knut Hüper, Pei Yean Lee, and John Moore. Essential matrix estimation using Gauss-Newton iterations on a manifold. Int. J. Comput. Vision, 74(2):117–136, 2007.

[15] Erik Jonsson. Channel-Coded Feature Maps for Computer Vision and Machine Learning. PhD thesis, Linköping University, SE-581 83 Linköping, Sweden, February 2008. Dissertation No. 1160, ISBN 978-91-7393-988-1.

[16] H.D. Li and R.I. Hartley. Five-point motion estimation made easy. pages I: 630–633, 2006.

[17] D. G. Lowe. Object recognition from local scale-invariant features. In Proc. Seventh IEEE International Conference on Computer Vision, volume 2, pages 1150–1157, September 20–27, 1999.

[18] Bruce D. Lucas and Takeo Kanade. An iterative image registration technique with an application to stereo vision. In IJCAI'81, pages 674–679, April 1981.

[19] David Nistér. An efficient solution to the five-point relative pose problem. IEEE TPAMI, 26(6):756–770, June 2004.

[20] Sudipta N. Sinha, Jan-Michael Frahm, Marc Pollefeys, and Yakup Genc. GPU-based video feature tracking and matching. Technical report, In Workshop on Edge Computing Using New Commodity Architectures, 2006.

[21] Tomáš Werner and Tomáš Pajdla. Oriented matching constraints. In British Machine Vision Conference 2001, pages 441–450, 2001.
