Pose Recognition for Tracker Initialization Using 3D Models


Department of Electrical Engineering (Institutionen för systemteknik)

Master's thesis (Examensarbete)

Pose Recognition for Tracker Initialization Using 3D Models

Thesis carried out in Image Processing (Bildbehandling) at Linköping Institute of Technology (Tekniska högskolan i Linköping)

by Martin Berg
LiTH-ISY-EX--07/4076--SE

Linköping 2008

Department of Electrical Engineering, Linköpings universitet

Supervisors: Michael Felsberg, ISY, Linköping University
             Alain Pagani, Fraunhofer IGD

Examiner: Michael Felsberg, ISY, Linköping University

Avdelning, Institution (Division, Department):
Computer Vision Laboratory, Department of Electrical Engineering, Linköpings universitet, SE-581 83 Linköping, Sweden

Datum (Date): 2008-02-06
Språk (Language): English
Rapporttyp (Report category): Examensarbete (Master's thesis)

URL för elektronisk version: http://www.cvl.isy.liu.se/
ISRN: LiTH-ISY-EX--07/4076--SE

Titel (Title): Pose-estimering för initialisering av kameratracker från 3D-modell (Pose Recognition for Tracker Initialization Using 3D Models)
Författare (Author): Martin Berg

Sammanfattning (Abstract):

In this thesis it is examined whether the pose of an object can be determined by a system trained with a synthetic 3D model of said object. A number of variations of methods using the P-channel representation are examined. Reference images are rendered from the 3D model; features such as gradient orientation and color information are extracted and encoded into P-channels. The P-channel representation is then used to estimate an overlapping channel representation, using B1-spline functions, to estimate a density function for the feature set. Experiments were conducted with this representation, as well as the raw P-channel representation, in conjunction with a number of distance measures and estimation methods.

It is shown that, with correct preprocessing and choice of parameters, the pose can be detected with some accuracy and, if not in real time, fast enough to be useful in a tracker initialization scenario. It is also concluded that the success rate of the estimation depends heavily on the nature of the object.

Abstract

In this thesis it is examined whether the pose of an object can be determined by a system trained with a synthetic 3D model of said object. A number of variations of methods using the P-channel representation are examined. Reference images are rendered from the 3D model; features such as gradient orientation and color information are extracted and encoded into P-channels. The P-channel representation is then used to estimate an overlapping channel representation, using B1-spline functions, to estimate a density function for the feature set. Experiments were conducted with this representation, as well as the raw P-channel representation, in conjunction with a number of distance measures and estimation methods.

It is shown that, with correct preprocessing and choice of parameters, the pose can be detected with some accuracy and, if not in real time, fast enough to be useful in a tracker initialization scenario. It is also concluded that the success rate of the estimation depends heavily on the nature of the object.


Acknowledgments

There are a number of people who helped to make the work on this thesis possible, and others who helped to make it pleasurable.

I would like to take this opportunity to thank my supervisor Alain Pagani (Fraunhofer IGD) for valuable discussions about the subject of tracker initialization and science in general. I would also like to thank Michael Felsberg (CVL), for explaining details about P-channels and for guiding my train of thought, and Johan Hedborg (CVL), for tips about implementation details.

Lastly, I'd like to thank my Mitbewohner (flatmates) at Gross-Gerauer Weg 38, all the people in department A4 at Fraunhofer IGD and the members of Capoeira Brasil Darmstadt, for making me feel at home during my stay.


Contents

1 Introduction
  1.1 Background
  1.2 Problem
  1.3 About This Thesis
  1.4 Related Work
  1.5 Document Overview

2 Theory
  2.1 Method
  2.2 Representation
    2.2.1 Channel Representation
    2.2.2 P-Channels
    2.2.3 B1-Spline Channels
    2.2.4 Feature Extraction
  2.3 Matching
    2.3.1 Bin-by-Bin Dissimilarity Measures
    2.3.2 Cross-Bin Dissimilarity Measures
  2.4 Estimation
    2.4.1 Nearest Neighbor
    2.4.2 Linear Least-Squares
    2.4.3 Linear Regression with Radial Basis Function Models
    2.4.4 Relevance Vector Machine
  2.5 Summary

3 Implementation
  3.1 Tools
  3.2 Feature Extraction
    3.2.1 Gradient Orientation and Magnitude
    3.2.2 Color Information
  3.3 Representation
    3.3.1 P-Channel Encoding
    3.3.2 B-Spline Representation
  3.4 Matching
  3.5 Estimation

4 Testing
  4.1 Test Data
    4.1.1 Disturbance
  4.2 The "Background Problem"
  4.3 Feature Extraction
  4.4 Testing Dissimilarity Measures
    4.4.1 EMD and QFD
    4.4.2 Projection by Pseudoinverse
    4.4.3 Remaining Dissimilarity Measures
  4.5 Testing Estimation Methods

5 Results and Evaluation
  5.1 Features and Parameters
    5.1.1 Feature Selection
    5.1.2 The Number of Channels
    5.1.3 The Noise Level
  5.2 Dissimilarity Measures
    5.2.1 Disturbances
    5.2.2 Local Maxima/Minima
  5.3 Estimation
    5.3.1 Nearest Neighbor
    5.3.2 Linear Least-Squares
    5.3.3 Regression with Radial Basis Function
    5.3.4 Relevance Vector Machine
    5.3.5 Target Representation in Regression
  5.4 Real-Time Video Implementation
    5.4.1 Performance
    5.4.2 Execution Time

6 Summary
  6.1 Conclusion
  6.2 Future Work

Bibliography

Chapter 1

Introduction

1.1 Background

The camera tracker (fig. 1.1) has been developed within the scope of the MATRIS project (Markerless Tracking Technology for Augmented Reality Applications, see [4] for details). Its purpose is to keep track of the position, orientation and focal length of a video camera relative to a static setting such as a TV studio. To perform this task the tracker uses information from inertial sensors together with computer vision algorithms, much like humans use the sensors of the inner ear (essentially inertial sensors) together with visual perception to determine position and orientation in the environment.

Figure 1.1. The camera tracker developed in the MATRIS project.

The vision algorithm works by identifying a number of reference points ("landmarks") and tracking their movement in the field of view of the camera. This means that the vision algorithm, like the inertial sensors, provides information only about the movement of the camera in relation to previous frames. When the tracking process starts, the position and orientation of the camera must thus be known beforehand, and the tracker is initialized with this information. Initialization can


also be necessary when the camera for some reason loses track of its reference points.

Currently, initialization requires the camera to be put in a known position and orientation. This is of course far from ideal, which is why a vision algorithm that can determine the position and orientation of the camera with minimal prior knowledge about the environment is desired.

1.2 Problem

The problem is thus to estimate the position and orientation of the camera in relation to a fixed scene. This problem is largely equivalent to estimating the pose and position of an object in a scene, in relation to a fixed spectator or camera. As the latter formulation is easier to test, that is the one we focus on in this thesis. To further simplify the problem, only the orientation (or pose) will be considered, i.e. only three of the original six degrees of freedom will be estimated. In section 6.2, however, some suggestions are made on how to extend the algorithm to deal with six degrees of freedom.

This pose estimation is to be done solely based on a textured CAD model of the object in question. The model is simple: it includes no further surface-shading information (such as glossiness) and no information about lighting conditions. The method can be trained off-line, with no real-time requirement. Once trained, the method should work on a single-frame basis, i.e. no tracking is involved.

1.3 About This Thesis

This thesis is meant to combine research from Fraunhofer IGD and CVL, Linköping University, to see if the P-channel representation [10, 12, 8, 9], developed by Michael Felsberg, can be used for pose recognition and subsequently tracker initialization based on CAD models. The thesis is written as part of a Master of Science education in Applied Physics and Electrical Engineering at Linköping University, Sweden.

1.4 Related Work

There are numerous publications dealing with pose recognition and tracker initialization, applying various methods and theories. In [9] and [10], Felsberg and Hedborg use the P-channel representation in pose recognition tasks; however, there are significant differences from the problem formulation of this thesis. In both papers, the training set is taken from the same set of images as the test set, making the estimation step more a matter of interpolation than of matching. The real-time requirement is stressed, and on-line "training" is employed. Learning from a 3D model introduces a new set of complications, while off-line learning offers a new set of


possibilities. Therefore, this thesis can be viewed as a development in a slightly different direction, based on the mentioned papers.

Another approach with traits similar to the P-channel representation is Histograms of Oriented Gradients, or HOGs, see [1, 6]. This representation is being used in ongoing work on tracker initialization at Fraunhofer IGD, though results are as yet inconclusive.

1.5 Document Overview

This thesis is divided into six chapters. The first chapter is an introduction to the thesis, with a brief description of the background. The second chapter deals with the theory of the methods and algorithms used in the thesis. The third chapter contains a description of how this theory has been implemented to solve the pose estimation problem. The fourth chapter describes how the implementation has been tested, and the fifth chapter discusses the results acquired. The sixth and last chapter contains a conclusion and suggestions on how to continue the work started in this thesis.


Chapter 2

Theory

2.1 Method

The basic steps of the approach to pose recognition used in this thesis are:

• Finding a representation or descriptor which disregards as much as possible of the information contained in the image that is not useful to the pose recognition task, while keeping as much as possible of the information that is.

• On these representations, a dissimilarity measure must be applicable. The measure of how similar one representation is to another should correspond to how similar the poses of the objects in the underlying images are.

• When this is done, an estimation algorithm is needed which, based on training data, can determine the pose of an incoming query.

The algorithm will work in two phases:

• In the off-line training phase, images will be rendered from the 3D model and encoded into the chosen representation. The representation vectors will then be fed, together with their corresponding poses, to the estimation algorithm.

• In the on-line phase, the estimation algorithm will use this data, or portions of it, together with the dissimilarity measure, to determine the pose of an incoming query consisting of a camera image encoded into the representation.

2.2 Representation

This section briefly explains the theory of channel representations in general and P-channels in particular. The channel representation is a complex subject with applications far beyond the scope of this thesis. To learn more about channels in general, see [13]; for P-channels, see [10, 12, 8, 9].


2.2.1 Channel Representation

Assume a scalar feature f ∈ [0, 1], for example the intensity of a pixel in a gray-scale image. The traditional way to represent this would be simply using the scalar value f . A channel representation is constructed by placing kernel centers on specific values of the feature interval, and then, using a kernel function, encoding the feature as a vector valued function of the distances to the channel centers. The output of each kernel function is called a channel response. The kernel function should be symmetric and decreasing with the distance from the center, looking something like fig. 2.1.

Figure 2.1. Kernel function

A well known example of channel representation is the RGB representation of color, which in turn is related to how the eye registers color. The color of light depends on the distribution of the wavelength of the light. The human eye does not register the wavelength itself, but rather its closeness to reference values of red, green and blue. The wavelength distribution is thus encoded into channels with kernel centers at red, green and blue.

In the following example we will use the cos²-kernel. It has the right shape, and also has the advantage of limited support. The kernel function is defined as

    B_j(f) = cos²(ω d(f, f_j))   if ω d(f, f_j) ≤ π/2
    B_j(f) = 0                   otherwise                          (2.1)

where d is a distance function, e.g. the L1-norm d(f, f_j) = |f − f_j|, and f_j, j = 1, 2, …, J are the kernel centers. Sometimes periodic features are of interest, such as the orientation of the image gradient or the hue color value. In this case the distance measure must be modified to take the periodicity into account, e.g. for a feature f with periodicity K:

    d(f, f_j) = min( mod(|f − f_j|, K), K − mod(|f − f_j|, K) )     (2.2)

The channel representation of the scalar feature f is the vector

    B(f) = (B_1(f), B_2(f), B_3(f), …, B_J(f))^T                    (2.3)

For a numeric example, we encode the feature value f = 0.3 with kernel centers f_1 = 0, f_2 = 0.5 and f_3 = 1, as shown in fig. 2.2:

Figure 2.2. Channel encoding of the value f = 0.3.

    B(0.3) = (0.13, 0.45, 0)^T                                      (2.4)

The decoding of the cos² channel representation back into a scalar representation is described in Forssén's PhD thesis [13]. It is evident that when many channels are used to encode a single value with a kernel function of limited support, many of the channels will be zero, making the channel representation sparse. It can also be noted that the representation B = (0, 0, 0, …, 0)^T does not represent the scalar value f = 0 but rather "no information".
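As an illustration of eqs. (2.1)-(2.3), the encoding can be sketched in a few lines of NumPy. The channel centers and the kernel frequency ω below are illustrative choices, not necessarily those behind fig. 2.2 (the thesis does not state its ω):

```python
import numpy as np

def cos2_channel_encode(f, centers, omega):
    """Encode a scalar feature into cos^2 channels, following eq. (2.1)."""
    d = np.abs(f - np.asarray(centers, dtype=float))   # L1 ground distance
    # The kernel has limited support: zero wherever omega*d exceeds pi/2.
    return np.where(omega * d <= np.pi / 2, np.cos(omega * d) ** 2, 0.0)

# Three channels on [0, 1]; omega = pi gives each kernel a support width of 1.
B = cos2_channel_encode(0.25, centers=[0.0, 0.5, 1.0], omega=np.pi)
# Only the two nearest channels respond; the third is zero (sparsity).
```

Note how a feature lying exactly on a channel center activates that channel fully, while distant channels stay at zero, which is the sparsity property mentioned above.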

It is possible to encode several scalar values in one single channel vector, simply by superimposing the channel values from the different features:

    B(f) = Σ_i B(f_i) = Σ_i (B_1(f_i), B_2(f_i), B_3(f_i), …)^T     (2.5)

If the feature values are sufficiently far apart, it is still possible to perform an unambiguous decoding, by assuming that the feature value lies in a specific interval and performing a local decoding for several intervals [13]. When a large and dense sample set, such as the intensity values of an image, is encoded into a single channel vector, the channel representation can no longer be unambiguously decoded. However, in this case the channel representation can be used to estimate the distribution of the sample set, using theory related to kernel density estimation [7]. Using channels to estimate density can provide a “smaller” representation (using fewer coefficients and thus less storage memory) while more accurately estimating the position and size of the modes (peaks) of the distribution, compared to normal histograms [16]. Location and size of the modes in a feature distribution can then be used for matching or recognition.

It's important to note that, though the above examples consider the one-dimensional case, the channel representation is by no means limited to 1D distributions. In a K-D case, the channel centers are simply distributed on a grid in the multidimensional space. The feature consists of a feature vector f rather than a scalar, and the channel vector B will have the dimensionality K + 1. Often the kernel function is separable, making it possible to calculate the multi-dimensional channel representation B^K as the outer product (tensor product) of the 1D channel representations B¹_k in each direction:

    B^K = B¹_1 ⊗ B¹_2 ⊗ … ⊗ B¹_K                                    (2.6)
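For K = 2 the tensor-product construction in (2.6) is a one-liner in NumPy; the channel values below are made up for illustration:

```python
import numpy as np

# Two 1-D channel vectors (illustrative values), one per feature dimension.
b_x = np.array([0.2, 0.8, 0.0])
b_y = np.array([0.0, 0.5, 0.5])

# Eq. (2.6): the 2-D channel representation as the outer (tensor) product.
B2 = np.outer(b_x, b_y)   # shape (3, 3); B2[i, j] = b_x[i] * b_y[j]
```

For K > 2 the same idea extends with repeated `np.multiply.outer`, which is exactly why the cost grows so quickly with K for overlapping kernels.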

2.2.2 P-Channels

The problem with the channel representation as described in 2.2.1 is the overlapping of the channel functions. The outer product calculation in (2.6) required for multi-dimensional channel representations scales with a^K if a is the overlap [10], making the computational complexity unmanageable for larger K. The alternative, non-overlapping channels (essentially ordinary histograms), has the disadvantage of losing the sub-bin information, i.e. the sample's position within the bin. The P-channel representation [8, 12, 9, 10] aims to solve this problem by storing both the sample's presence in the bin and its position within the bin. Based on the ideas of channel representation, P-channels can be seen as using a vector-valued, non-overlapping kernel function:

    B_j(f) = Π(λ(f − f_j)) · (λ(f − f_j), 1)^T                      (2.7)

where

    Π(x) = 1 if |x| ≤ 0.5, and 0 otherwise

and λ is a scale factor. Approaching P-channels from histograms instead, the P-channel representation is simply a histogram augmented with an offset component.

Assume that f is scaled to fit [0, J ], where J is the number of channels, the kernel centers j are at integer positions and [f ] denotes the closest integer value to f . Then the P-channel representation of a set of scalar features fi, i = 1, 2, 3 . . .

is the 2 × J -matrix P = (p1, p2, . . . , pJ), where

pj= X i δ ([fi] − j) fi− j 1  (2.8) and δ (.) is the Kroenecker delta function. Encoding the value f = 0.6 into P-channels we get

P (0.6) =0 −0.2 0

0 1 0



(2.9) See fig. 2.3 and compare to fig. 2.2.
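A minimal sketch of the encoding in (2.8) for scalar features in [0, 1]. Placing the integer centers at 1…J and clipping border samples to the nearest center are assumptions made here; the text leaves the border handling open:

```python
import numpy as np

def p_channel_encode(samples, J):
    """P-channel encoding of scalar features in [0, 1], per eq. (2.8).

    Row 0 accumulates the linear offsets, row 1 the histogram counts.
    Channel centers are assumed at the integers 1..J after scaling, and
    border samples are clipped to the nearest center (an assumption).
    """
    P = np.zeros((2, J))
    for f in samples:
        s = f * J                             # scale [0, 1] -> [0, J]
        j = int(np.clip(round(s), 1, J))      # nearest channel center
        P[0, j - 1] += s - j                  # offset component
        P[1, j - 1] += 1.0                    # histogram component
    return P

# Reproduces eq. (2.9): f = 0.6 with J = 3 lands in the middle channel
# at scaled position 1.8, i.e. offset -0.2 from center 2.
P = p_channel_encode([0.6], J=3)
```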

Figure 2.3. P-channel encoding of the value f = 0.6.

The extension to a multi-dimensional feature set f_i, i = 1, 2, 3, … is

    p_j = Σ_i δ([f_i] − j) · (f_i − j, 1)^T                         (2.10)

where j is a multi-index, i.e. a vector of indices (j_1, j_2, …, j_K) denoting the channel centers in the scaled K-D space, and [f] = ([f_1], [f_2], …, [f_K])^T. The P-matrix now has the dimensions (K + 1) × J_1 × J_2 × … × J_K, where J_k is the number of channels in dimension k.

Because the P-channels use a non-overlapping kernel function, the performance disadvantage of normal channel representations is avoided and the computation is manageable even for higher values of K.

2.2.3 B1-Spline Channels

A family of kernel functions known as B-splines is frequently used in function approximation applications [25]. The functions are constructed by starting with the rectangular function Π(x) and successively convolving with it, achieving successively smoother kernel functions:

    B_0(x) = Π(x)                                                   (2.11)

    B_i(x) = (B_0 ∗ B_{i−1})(x)                                     (2.12)

B-splines can very well be used as kernel functions in any channel representation. However, a significant advantage of the B1-spline function is that an overlapping channel representation using B1-splines can be approximated from a P-channel representation. This yields an overlapping channel representation, useful for density estimation and matching, while avoiding the performance pitfalls described in 2.2.1.

In the 1D case, two neighboring P-channels, P_1 = (o_1, h_1)^T and P_2 = (o_2, h_2)^T, are combined into one B-spline channel:

    B_1 = (h_1 + h_2)/2 + o_1 − o_2                                 (2.13)

Fig. 2.4 shows the 1D P-channel basis functions and their combination into a single B1-spline basis function. The general idea of box-filtering the histogram components while differentiating the linear components of the P-channel is applicable in higher dimensions as well. In a multi-dimensional case, the histogram component is convolved with the (separable) multi-dimensional box filter. Each linear component is convolved with a differentiating filter along the component's dimension, and with box filters in all other dimensions. In the resulting representation, the number of channels in each non-periodic dimension has shrunk by one, and the kernel centers are shifted by 0.5. In each periodic dimension, the first P-channel value is replicated after the last before calculating the B-splines, and the number of channels is thus unchanged.


Figure 2.4. P-channels to B1-splines transform. The basis functions for two P-channels (left) are combined into one B1-spline (right) through (2.13).

The B1-spline channel representation achieved by the described transform is an approximation to the overlapping channel representation calculated with the outer product in (2.6), with equality if the underlying distribution of the data is locally independent. This virtually never holds, but the approximation is good enough for many cases, and the significantly smaller computational complexity makes this representation useful in applications where normal overlapping channels would be infeasible. From now on, all references to B-splines will refer to this B1-spline approximation.
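The 1D transform (2.13) is a short stencil operation over neighboring channels. A sketch for a non-periodic dimension, reusing the P-channel matrix from eq. (2.9):

```python
import numpy as np

def p_to_b1(P):
    """Combine neighboring 1-D P-channels into B1-spline channels, eq. (2.13).

    For a non-periodic dimension the number of channels shrinks by one and
    the kernel centers shift by 0.5.
    """
    offsets, hist = P[0], P[1]
    return (hist[:-1] + hist[1:]) / 2 + offsets[:-1] - offsets[1:]

# A single sample at scaled position 1.8, cf. eq. (2.9):
P = np.array([[0.0, -0.2, 0.0],
              [0.0,  1.0, 0.0]])
B = p_to_b1(P)   # responses at the shifted centers 1.5 and 2.5
```

The result, (0.7, 0.3), is exactly the pair of linear B-spline weights of a sample at 1.8 relative to the shifted centers 1.5 and 2.5, which illustrates how the offset component recovers the sub-bin information a plain histogram would lose.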


2.2.4 Feature Extraction

The selection of the features, so far described with an anonymous f or f, is a decision every bit as crucial as the choice of representation. Like channel representations in general, P-channels and B-spline approximations are capable of representing any type of bounded feature. The number of features is limited only by performance requirements, which means a wide variety of features can be considered.

When matching rendered 3D models to actual camera images, it is useful to think about which features are similar in the pairs of images that the system is designed to find similar. Intensity, for example, is an exceptionally poor choice because of its dependence on lighting conditions (it is unlikely that the lighting conditions in the rendering system will be similar to those in the real-world testing environment). The features considered in this thesis are largely based on earlier experiences with P-channels for matching [9, 10].

Orientation

The orientation of the image is the direction in which the image varies at a given point. The orientation can be estimated coarsely by convolution with gradient filters in the x- and y-directions, or more intricately with the use of quadrature filters [17] or the Riesz transform/monogenic phase [11]. The former method is very quick and easy to implement, while the latter two methods put significantly more burden on both hardware and developer.

The orientation has the advantage of being largely invariant to lighting and other environmental conditions. Lines and edges, causing peaks in the orientation distribution, will be similar between model and actual object, even if the model doesn’t look exactly like the actual object.
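The quick gradient-filter route described above can be sketched with NumPy's finite differences; wrapping the angle to [0, π) reflects that orientation, unlike gradient direction, has no sign (an edge looks the same from both sides):

```python
import numpy as np

def orientation_features(img):
    """Coarse gradient orientation and magnitude via finite differences."""
    gy, gx = np.gradient(img.astype(float))   # simple gradient filters
    magnitude = np.hypot(gx, gy)
    # Orientation is pi-periodic, so wrap arctan2's output to [0, pi).
    orientation = np.mod(np.arctan2(gy, gx), np.pi)
    return orientation, magnitude

# A horizontal intensity ramp: the gradient points along x, orientation 0.
ramp = np.tile(np.arange(8.0), (8, 1))
ori, mag = orientation_features(ramp)
```

Because orientation is π-periodic, it is exactly the kind of feature for which the periodic distance (2.2) is needed when encoding into channels.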

Gradient Magnitude

The magnitude of the gradient, or the largest eigenvalue of the structure tensor in the case of the more advanced orientation estimation methods, tells how much the gray level in the image varies in the direction identified by the orientation estimation. Having large values for lines and edges, it should be quite similar between model and camera image. However, blurriness in the image due to an unfocused camera or other environmental factors, as well as, to some extent, lighting conditions, has some effect on the feature.

Color

While intensity may be a poor choice, and RGB values for similar reasons are not suitable, color information may still be useful. It is relatively cheap to extract hue, saturation and value (HSV) from an RGB image, making it possible to discard the most lighting-dependent value component and focus on hue and saturation.

Saturation, denoting the "fullness" of the color, has proven useful when comparing camera images to other camera images. 3D models, however, tend to be more colorful than their real-world counterparts. Additionally, lighting, camera color response and subtle color errors in the computer model all cause unwanted discrepancies that make saturation less useful when matching computer-generated images to camera images.

The hue, on the other hand, depends little on lighting and environmental conditions. Theoretically, it should be a lot easier to get the right hue than the right saturation when constructing a computer model. The human factor introduced by the modeler makes the hue slightly less promising when matching generated images to camera images than in camera-to-camera matching, but altogether the hue might still be useful.
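The lighting insensitivity of hue is easy to check with the standard-library colorsys module: scaling all RGB components, a crude model of a brightness change, leaves the hue untouched while the value component changes:

```python
import colorsys

# A greenish pixel and the same pixel at half brightness.
h1, s1, v1 = colorsys.rgb_to_hsv(0.2, 0.8, 0.4)
h2, s2, v2 = colorsys.rgb_to_hsv(0.1, 0.4, 0.2)
# h1 == h2 while v1 != v2: hue survives the brightness change.
```

Note that a pure multiplicative scaling also preserves saturation; the saturation problems described above stem from model/camera discrepancies rather than from brightness changes alone.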

Position

The x- and y-coordinates of the pixel carrying the above-mentioned features can be included as features in themselves. This makes the channel representation local, which makes it more robust to e.g. occlusion.

Whereas the position features are regarded as critical, the other features can be mixed and matched to find the combination that yields the best result in the pose recognition process overall.

2.3 Matching

The point of extracting features and encoding them into a representation is to enable the system to decide whether one image is similar to another, in a sense that is useful for the application. In the case of this thesis, the problem is to determine whether one image contains the object in a pose similar to that in another. A dissimilarity measure is thus needed:

    D(p, q) : X × X → R⁺                                            (2.14)

where X is the representation space and R⁺ denotes the set of positive real numbers. The value of D should be small if the images represented by p and q contain an object with similar pose. Ideally, the value of D should increase monotonically with the dissimilarity in target space (pose space). A dissimilarity measure is a distance if it is symmetric and fulfills the triangle inequality. Some estimation applications may require a distance, but most will work with any dissimilarity measure.

The target of an image is the pose of the central object, described by Euler angles using the pitch-roll-yaw convention [26]. The details of this representation are outside the scope of this thesis; what is important is that the pose can be described by three scalar parameters, α, β and γ (see fig. 2.5). The target space T can thus be viewed as a three-dimensional space:


    T = {α, β, γ}                                                   (2.15)
    α ∈ [0, 2π]                                                     (2.16)
    β ∈ [0, π]                                                      (2.17)
    γ ∈ [0, 2π]                                                     (2.18)

In this space the L1- or L2-norm, taking the periodic dimensions into account as in (2.2), can be used as a simplified distance measure.

It should be mentioned that 3-D rotational geometry is a vast and complicated area, and that more elegant approaches for this representation, e.g. using quaternion algebra, could have been considered. However, it was decided that having an accurate, unambiguous representation in the target space was not necessary for the work in this thesis.

Figure 2.5. Target representation to describe the pose of an object
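The simplified L1 distance on the target space (2.15)-(2.18), wrapping the periodic angles α and γ as in eq. (2.2), might look as follows; treating β as non-periodic is an assumption consistent with its [0, π] range:

```python
import numpy as np

def pose_distance(t1, t2):
    """Simplified L1 distance between poses (alpha, beta, gamma).

    alpha and gamma are 2*pi-periodic and wrapped as in eq. (2.2);
    beta in [0, pi] is treated as non-periodic.
    """
    d = np.abs(np.asarray(t1, dtype=float) - np.asarray(t2, dtype=float))
    for i in (0, 2):                       # the periodic dimensions
        m = np.mod(d[i], 2 * np.pi)
        d[i] = min(m, 2 * np.pi - m)
    return float(d.sum())

# Two poses that are close across the alpha wrap-around:
dist = pose_distance([0.1, 1.0, 0.0], [2 * np.pi - 0.1, 1.0, 0.0])
```

Without the wrap-around, the two poses above would appear almost maximally far apart even though they differ by only 0.2 radians.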

2.3.1 Bin-by-Bin Dissimilarity Measures

Given a representation with the characteristics of a discrete distribution or histogram, where data is sorted into bins, there is a choice between comparing the distributions bin by bin and taking neighboring or even all bins into account in the comparison. The difference is illustrated in fig. 2.6.

The B1-spline representation from 2.2.3 is similar to a simple histogram, with regularly placed bins. The P-channel representation, on the other hand, is a signature, i.e. a binned representation with irregularly placed bins, the bin positions refined by the offset component. The choice between bin-by-bin and cross-bin measures is thus of interest for both representations.

Minkowski-Form Distance

A very common and simple distance measure is the Minkowski-form distance

    D_Lr(p, q) ≡ ( Σ_i |p_i − q_i|^r )^(1/r)                        (2.19)


Figure 2.6. The advantage of cross-bin dissimilarity measures: a bin-to-bin dissimilarity measure (left) will only compare bins with the same index, finding the upper and lower histograms very different. A cross-bin dissimilarity measure (right) will take the neighboring bins into account and match the perceptual similarity better.

where r = 1 (Manhattan distance) and r = 2 (Euclidean distance) are common choices of r.

Kullback-Leibler Divergence

The Kullback-Leibler divergence (information divergence, information gain or relative entropy), defined as

    D_KLD(p, q) ≡ Σ_i p_i log(p_i / q_i)                            (2.20)

is a measure for comparing probability distributions [18]. Here, p and q are regarded as samples from two probability distributions P and Q. The KLD measures how much more surprising a sample from P is if it is thought to come from Q, as compared to if the sample is known to come from P, in an expectation sense. It can also be seen as a measure of "how inefficient on average it would be to code one representation using the other as the code-book" [5]. The KLD is theoretically suitable for the B-spline representation from section 2.2.3, as that representation is designed to provide a good approximation of the probability distribution. Because of this it was used by Felsberg and Hedborg when matching B-spline representations [10].

χ²-statistics

A χ²-test can be used as a dissimilarity measure, defined as

    D_χ²(p, q) ≡ Σ_i (p_i − m_i)² / m_i,  where m_i = (p_i + q_i)/2 (2.21)

This function measures "how unlikely it is that one distribution was drawn from the population represented by the other" [21], and is, like the KLD, particularly suitable for the B-spline representation.
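Both bin-by-bin measures, (2.20) and (2.21), are short in NumPy. The small epsilon guarding empty bins in the KLD is an implementation choice, not part of the definition:

```python
import numpy as np

def kl_divergence(p, q, eps=1e-12):
    """Kullback-Leibler divergence, eq. (2.20); eps guards empty bins."""
    p = np.asarray(p, dtype=float) + eps
    q = np.asarray(q, dtype=float) + eps
    p, q = p / p.sum(), q / q.sum()        # treat inputs as distributions
    return float(np.sum(p * np.log(p / q)))

def chi2_dissimilarity(p, q):
    """Chi-square dissimilarity, eq. (2.21), with m_i = (p_i + q_i) / 2."""
    p, q = np.asarray(p, dtype=float), np.asarray(q, dtype=float)
    m = (p + q) / 2
    mask = m > 0                           # skip bins empty in both inputs
    return float(np.sum((p[mask] - m[mask]) ** 2 / m[mask]))
```

Note the asymmetry of the KLD (it is not a distance), whereas the χ² measure is symmetric in p and q by construction.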


2.3.2 Cross-Bin Dissimilarity Measures

It has been argued that bin-by-bin measures do not always match perceptual similarity well [21]; one such case is shown in fig. 2.6. Bin-by-bin measures are also sensitive to the bin size: too coarse binning will not be able to represent details in feature space, whereas too fine binning will place similar features in different bins, so that the features will not be matched. To deal with this problem, there are a number of cross-bin measures that take other bins into account.

Quadratic Form Distance

Suggested by Niblack et al. [15], the quadratic form distance (QFD) is defined as

    D_A(p, q) ≡ √( (p − q)^T A (p − q) )                            (2.22)

where p and q are column vectors and A = [a_ij] is a similarity matrix, a_ij denoting the similarity between bins i and j. The recommendation in [15] is to use either a_ij = 1 − d_ij/d_max or a_ij = exp(−σ (d_ij/d_max)²), where d_ij is a ground distance representing the distance between the bins. Using histograms or the B-spline representation, the ground distance can be the L1- or L2-norm between the bin centers. Using the P-channel representation, it is suitable to use the linear offset component together with the bin center coordinates to refine the bin center locations, and then use the L2-norm to determine the ground distance. In this case p and q will only contain the histogram part of the P-channel representation. The QFD uses all the information stored in the P-channel representation in a sensible way. However, like all cross-bin measures, it is a lot more computationally expensive than the bin-by-bin alternatives.
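A sketch of (2.22) with the Gaussian similarity matrix recommended in [15]; the bin centers and σ below are illustrative choices:

```python
import numpy as np

def qfd(p, q, centers, sigma=1.0):
    """Quadratic form distance, eq. (2.22), with the Gaussian similarity
    a_ij = exp(-sigma * (d_ij / d_max)^2) from Niblack et al. [15]."""
    centers = np.asarray(centers, dtype=float)
    d = np.abs(centers[:, None] - centers[None, :])   # ground distances
    A = np.exp(-sigma * (d / d.max()) ** 2)           # similarity matrix
    diff = np.asarray(p, dtype=float) - np.asarray(q, dtype=float)
    # Clamp against tiny negative values from floating-point round-off.
    return float(np.sqrt(max(diff @ A @ diff, 0.0)))

# Cross-bin behavior: mass moved to a nearby bin scores closer than mass
# moved to a distant bin, unlike a bin-by-bin measure.
centers = [0.0, 1.0, 2.0, 3.0]
near = qfd([1, 0, 0, 0], [0, 1, 0, 0], centers)
far  = qfd([1, 0, 0, 0], [0, 0, 0, 1], centers)
```

With the Gaussian choice the similarity matrix is positive semidefinite, so the expression under the square root is never truly negative.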

Earth Mover’s Distance

The Earth Mover's Distance, or EMD, is an advanced distance measure proposed by Rubner, Tomasi and Guibas [21]. It is designed to work on signatures, i.e. binned representations with arbitrary bin locations. The name comes from the idea that a distribution is like a pile of dirt, and the dissimilarity should be measured as the minimal amount of work required to move the dirt in one distribution to form the other. In the algorithm, this is translated to an optimization problem of finding the flow that minimizes the overall cost of moving the dirt, subject to a number of restrictions. The algorithm, like the QFD, takes a ground distance matrix as input, together with two distributions.

Like the QFD, the EMD uses the information from the P-channel representation in a sensible way. However, the optimization problem makes the measure even more computationally expensive than the quadratic form distance.

Diffusion Distance

Inspired by the field of thermodynamics, the diffusion distance, proposed by Ling [19], treats the difference between two distributions as an isolated temperature field T(x, t). The similarity is then based on how fast the temperature diffuses, i.e. spreads evenly in the field. In practice this means subtracting one distribution from the other and successively low-pass filtering the result using a Gaussian kernel:

d(x) = p_x − q_x    (2.23)
T_0(x) = T(x, 0) = d(x)    (2.24)
T(x, t) = T_0(x) ∗ Φ(x, t)    (2.25)

where

Φ(x, t) = 1/(√(2π) t) · e^(−x² / (2t²))    (2.26)

The distance is then defined as:

D_diff(p, q; t̄) ≡ ∫_0^t̄ k(|T(x, t)|) dt    (2.27)

where t̄ is a positive constant upper bound for the integration and k(·) is a norm that measures how much T(x, t) differs from zero. A simple and tried choice of k(·) is the L1-norm:

k(|T(x, t)|) = ∫_{−∞}^{∞} |T(x, t)| dx    (2.28)

The measure thus becomes a comparison of distributions at different scales, every value of t corresponding to a scale. In the implementation of the diffusion distance, x and t are discrete variables and all integrals are replaced by sums.

The diffusion distance does not use a ground distance and can thus not benefit from the linear offset components in the P-channel representation the way the quadratic form distance or the EMD does. Still, it does include cross-bin information, and is significantly less computationally intensive than the other cross-bin alternatives (though still more expensive than the bin-by-bin measures).
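A discretized NumPy sketch of (2.23)–(2.28) — not the thesis MATLAB code — replacing the integral over t with a fixed number of smoothing steps and using a truncated Gaussian kernel:

```python
import numpy as np

def diffusion_distance(p, q, n_steps=5, sigma=1.0):
    # discretization of eq. (2.27): start from the difference field (2.23),
    # repeatedly smooth it with a Gaussian kernel (2.25)-(2.26), and
    # accumulate the L1 norm (2.28) at every scale
    x = np.arange(-3, 4)
    g = np.exp(-x ** 2 / (2.0 * sigma ** 2))
    g /= g.sum()
    T = p - q                         # T_0(x) = d(x), eq. (2.24)
    total = np.sum(np.abs(T))
    for _ in range(n_steps):
        T = np.convolve(T, g, mode="same")
        total += np.sum(np.abs(T))
    return total

p = np.array([1.0, 0.0, 0.0, 0.0, 0.0])
q_near = np.array([0.0, 1.0, 0.0, 0.0, 0.0])
q_far = np.array([0.0, 0.0, 0.0, 0.0, 1.0])
```

Nearby lobes of opposite sign cancel faster under smoothing, so a shift to a neighboring bin yields a smaller distance than a shift to a distant one — the cross-bin behavior described above.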

Projection By Pseudoinverse

Stretching the boundaries of what can be regarded as a dissimilarity measure, it is possible to perform matching by collecting a set of prototype samples and calculating the pseudoinverse. The method is touched upon by Felsberg and Hedborg [9], section 3.2.

Given the matrix P = (p_1, p_2, …, p_N), where p_i is the representation of a synthetic image i in the form of a column vector, the pseudoinverse of P is defined as

P† ≡ (P^T P)^{−1} P^T    (2.29)



yielding

P†P = I (2.30)

where I is the identity matrix. This means that

s^i = P† p_i    (2.31)

will provide the column vector s^i = (s^i_1, s^i_2, …, s^i_N)^T, where

s^i_j = 1 if j = i, 0 otherwise    (2.32)

The idea is then to perform the projection

s^q = P† q    (2.33)

where q is the representation of a camera image, in the form of a column vector. The index of the largest element in s^q should then be the same as the index of the

most similar representation vector in the prototype set P.

As the pseudoinverse is calculated off-line, the computational complexity of this measure is very low. The method can take advantage of the linear offset component in the P-channel representation, although not as explicitly as the EMD or the QFD. Information is decidedly shared across bins, making this a cross-bin measure. The measure is very different from the previously discussed measures with regard to form and application. The fact that the result depends on all vectors in the prototype set P, not only the one closest to the query q, may prove a disadvantage.
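The projection idea (2.29)–(2.33) can be sketched in NumPy as follows; the prototype matrix and noise level here are toy values of my choosing, not data from the thesis:

```python
import numpy as np

# prototype representations stored column-wise; the pseudoinverse is
# computed once, off-line (eq. 2.29)
rng = np.random.default_rng(0)
P = rng.random((50, 8))                # 8 prototypes, 50-dimensional
P_pinv = np.linalg.pinv(P)

# a query close to prototype 3 projects onto (approximately) e_3, eq. (2.33)
q = P[:, 3] + 0.001 * rng.standard_normal(50)
s_q = P_pinv @ q
best = int(np.argmax(s_q))
```

The on-line cost is a single matrix-vector product, which is what makes the measure so cheap.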

2.4 Estimation

The goal of the estimation step is to find a function or mapping

y : X → T (2.34)

from the representation space X to the target space T. Typically, y makes use of some static information in the form of parameters, weights or reference data, usually calculated or collected during an off-line training phase. Calling this information W, y can be written

t = y (q, W) (2.35)

where q ∈ X. In this chapter some methods for estimation are presented with a brief explanation and remarks about their respective advantages and disadvantages.



2.4.1 Nearest Neighbor

The estimation method that first springs to mind when a good dissimilarity measure has been found is to do a nearest-neighbor search. A large set of synthetic example images with known pose is gathered, encoded into the desired representation and stored in memory. An incoming query image is then encoded and compared to all prototype vectors stored in memory, to find the best match and thus the most probable pose.

The advantages of this method are that it is simple to implement and, if the dissimilarity measure performs well, a nearest-neighbor result should also perform well.

The disadvantages are notably performance and memory consumption. In the case of an exhaustive search, both memory consumption and execution time scale linearly with the number of prototype samples, and to cover the entire target space, a large set of prototype samples is needed.

One solution to the execution time problem is to perform a successively refined search, i.e., in the first loop, compare only to a set of prototype samples corresponding to a regular grid in the target space. In the next run, compare to prototypes corresponding to target vectors close to the nearest neighbor of the first search. This can be done in two or more levels, vastly reducing the execution time. However, the method has a risk of getting stuck in local minima, posing tougher requirements on the dissimilarity measure to be monotonically increasing around the best match.

Another solution is to use a smaller prototype set together with some sort of interpolation strategy to increase the resolution of the estimate. This would decrease memory consumption and, depending on the complexity of the interpolation, likely also shorten the execution time. This strategy also poses high demands on the uniformity of the dissimilarity measure.
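The exhaustive search and the two-level successively refined search can be sketched as follows; the toy prototype set (60 poses encoded as (cos, sin) pairs) and the L1 distance are illustrative choices, not the thesis representation:

```python
import numpy as np

def nearest_neighbor(q, protos, dist):
    # exhaustive search over the whole prototype set
    return min(range(len(protos)), key=lambda i: dist(q, protos[i]))

def refined_search(q, protos, dist, coarse_step=6):
    # two-level successively refined search: coarse grid first, then the
    # neighborhood of the coarse winner (assumes prototypes ordered by pose)
    n = len(protos)
    best = min(range(0, n, coarse_step), key=lambda i: dist(q, protos[i]))
    fine = [(best + k) % n for k in range(-coarse_step + 1, coarse_step)]
    return min(fine, key=lambda i: dist(q, protos[i]))

# toy prototype set: 60 poses 6 degrees apart, "encoded" as (cos, sin)
angles = np.deg2rad(np.arange(0, 360, 6))
protos = np.stack([np.cos(angles), np.sin(angles)], axis=1)
l1 = lambda a, b: np.sum(np.abs(a - b))
q = np.array([np.cos(np.deg2rad(32)), np.sin(np.deg2rad(32))])
```

On this smooth toy distance both searches agree; the refined search only risks local minima when the dissimilarity is not monotonic around the best match, as noted above.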

2.4.2 Linear Least-Squares

A least-squares estimation is based on the assumption that the target vector t can be derived from the channel representation or density estimate d by linear projection, i.e.:

t = Wd (2.36)

where t and d are column vectors in target space and representation space respectively. During the off-line training phase, the matrix W can be derived using the prototype set P = (p_1, p_2, …, p_N) and the corresponding known target values

T = (t_1, t_2, …, t_N). Minimizing the error norm

ε = ‖WP − T‖²    (2.37)

we get the well-known solution

W = T P†    (2.38)

where P† = (P^T P)^{−1} P^T denotes the pseudoinverse of P. The acquired W-matrix can then be used in the on-line estimation step to derive the target values directly from the query vector.

This method is very inexpensive with regard to computational complexity and memory usage. The size of the W-matrix depends only on the dimensionality of the representation and target spaces, not the number of training samples. However, the assumption in (2.36) is a rather bold one, and poses large requirements on the representation to linearly separate dissimilar feature sets. Nevertheless, in [9], Felsberg and Hedborg successfully apply this method on P-channels in a pose estimation application.

While not using any distance measure explicitly, the method is linked to the projection by pseudoinverse in section 2.3.2, since

t_i = T s^i    (2.39)
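A NumPy sketch of (2.36)–(2.38) on synthetic data of my own making (the data is generated from a hypothetical exact linear map, so the least-squares solution recovers the targets exactly):

```python
import numpy as np

rng = np.random.default_rng(1)
N, dim_x, dim_t = 40, 10, 2
P = rng.random((dim_x, N))            # training representations, column-wise
W_true = rng.random((dim_t, dim_x))   # hypothetical "true" linear map
T = W_true @ P                        # known targets for the training set

W = T @ np.linalg.pinv(P)             # eq. (2.38): W = T P+
t_hat = W @ P[:, 0]                   # on-line estimate for one query
```

In the real application the linearity assumption only holds approximately, which is exactly the limitation discussed above.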

2.4.3 Linear Regression with Radial Basis Function Models

As the assumption of linear separability in section 2.4.2 may be too strict, we can extend the linear least squares to a so-called "linear-in-the-parameters" model, using nonlinear radial basis functions (RBF). (2.36) then becomes

t = Σ_i w_i φ_i(d)    (2.40)

where w_i is a weight vector and φ_i is a radial basis function, i.e. a function that depends on the distance between a center somewhere in representation space and the input d. Various functions can be used, but normally something resembling the kernel function in fig. 2.1 is suitable. The centers are typically chosen from the training set so that their corresponding targets form an equally spaced grid in target space. (2.40) can be rewritten as

t = Wφ (2.41)

where W = (w1, w2, . . . ) and φ = (φ1(d) , φ2(d) , . . . ). The optimal weights can

then be acquired with a least-squares approximation using the training set P, cf. (2.38):

W = T Φ†    (2.42)

where T = (t1, t2, . . . ) and Φ = (φ (p1) , φ (p2) , . . . ).

This method has several advantages over the linear least-squares. Firstly, it does not require linear separability in the representation space. Secondly, any distance measure can theoretically be used. The type of basis function can be freely chosen to fit the problem, and the number of basis functions can also be tuned.

This also constitutes the main difficulty with this method. Choosing the number of basis functions, their form and width, is a difficult balance between over-fitting to the training set, thus having worse generalization ability, and not having fine enough discrimination capabilities. Moreover, the method is slightly more expensive than linear least-squares, since the φ-vector must be calculated for each incoming query.

If d is the unencoded feature set, using RBFs has similarities to using linear least squares on a channel representation, with channel functions corresponding to the radial basis functions. Usually, however, the radial basis functions are positioned based on the corresponding position in target space, whereas channel functions are normally regularly placed in the feature space. This makes it possible to use fewer basis functions in the RBF case, which in turn makes it possible to use smoother functions, e.g. Gaussian kernels, without hurting performance. This motivates the use of RBF regression even in conjunction with kernel representations.

To reduce the risk of over-fitting, regularization can be used. Based on the assumption that data is generated from smooth rather than noisy functions, a penalty can be introduced to favor such functions. In a linear model framework, complex functions usually have large weights, so we penalize large weights by rewriting the error function

ε̂ = ‖WΦ − T‖² + λ‖W‖²    (2.43)

Minimizing this error norm yields the penalized least squares (PLS) estimate for W:

W_PLS = T (Φ^T Φ + λI)^{−1} Φ^T    (2.44)

where I is the identity matrix and λ is a hyperparameter balancing how well the resulting function fits the samples in P and how smooth it is. An example of an overfitted function approximation, together with a regularized approximation of the same function, is illustrated in fig. 2.7.
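A minimal NumPy sketch of RBF regression with the penalized least-squares estimate (2.44), on a 1D toy problem of my own making (a noisy sine, Gaussian basis functions on a regular grid; in this sketch Φ is laid out row-wise, one row per sample):

```python
import numpy as np

def rbf_features(X, centers, width):
    # Gaussian radial basis functions; rows are samples, columns are centers
    d2 = np.sum((X[:, None, :] - centers[None, :, :]) ** 2, axis=2)
    return np.exp(-d2 / (2.0 * width ** 2))

rng = np.random.default_rng(2)
x = np.linspace(0.0, 10.0, 40)[:, None]
t = np.sin(x[:, 0]) + 0.05 * rng.standard_normal(40)   # noisy 1D targets

centers = np.linspace(0.0, 10.0, 12)[:, None]          # regular grid of centers
Phi = rbf_features(x, centers, width=1.0)

# penalized least squares, cf. eq. (2.44), with Phi laid out row-wise
lam = 1e-3
w = np.linalg.solve(Phi.T @ Phi + lam * np.eye(12), Phi.T @ t)
t_hat = Phi @ w
```

The width and λ values here are hand-picked for the toy problem; as the text notes, choosing them is the hard part of the method.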

There are ways to methodically calculate the optimal value for λ, using a validation set with known target values. This is unfortunately not very useful in this thesis, since it would mean testing the system on synthetic images, whereas the objective here is to have the system working on camera images, for which we have no known pose. Thus λ has to be manually tuned.

2.4.4 Relevance Vector Machine

The relevance vector machine (RVM) (see e.g. [3] chapter 7.2, [24]) can be seen as an extension of the RBF model in section 2.4.3. A simplified description of how the RVM works could be that instead of choosing the radial basis function centers by picking vectors from the prototype set with their corresponding targets placed equidistantly in target space, the RVM uses Bayesian inference and marginalization to find the most "relevant" prototype vectors, the relevance vectors.

In the following brief explanation of RVM, a scalar target value will be assumed. The generalization of RVM to multivariate targets was done by Thayananthan et al., see [23].

Rather than optimizing the parameters linearly and trying to find an explicit function, RVM works by trying to find the probability distribution of the target


Figure 2.7. 1D penalized least squares example. In this example, a sine wave is recreated from a sample set with added noise. The smoother PLS estimate is visibly closer to the desired function. (Curves shown: desired function, samples with noise, overfitted linear RBF, penalized least squares.)

(pose) given the input data. The idea is based on the assumption that the target data is the noisy result of an underlying "true" model, t = y(d, w) + ε, where ε is Gaussian noise. This leads to the following form

p(t | d, w, β) = N(t | y(d, w), β^{−1})    (2.45)

where β is the inverse noise variance and y(d, w) = Σ_i w_i φ_i(d). Given the prototype set P with the known targets t, we assume that the prototype vectors p_n in P are statistically independent and get

p(t | P, w, β) = Π_n p(t_n | p_n, w, β)    (2.46)

Assuming a zero-mean Gaussian prior distribution for w and introducing the hyperparameter α, where p(w_i | α_i) = N(0, α_i^{−1}), the posterior for w can be computed with Bayes' rule. The values of α and β are determined through type-2 maximum likelihood (see [24]). The result is that w can be marginalized out of (2.45), resulting in

p(t | q, P, α*, β*) = N(t | μ_w^T φ(q), σ²(q))    (2.47)

where μ_w is the mean of w and σ²(·) is a known function that depends on α* and β*.

A key advantage of the RVM is that, in finding the hyperparameter vector α*, many of its components tend to infinity, which correspond to zero-valued weights, making the estimation, though it initially appears to use all prototype vectors in P, very sparse. The prototype vectors with non-zero weights are the relevance vectors. This sparseness is illustrated in fig. 2.8, where a function with scalar input and multi-variate target is approximated using MVRVM.

Figure 2.8. Function approximation with multi-variate relevance vector machines. Note the sparseness of the relevance vectors (marked by green rings).

The disadvantages of the RVM include a slightly longer learning time than the linear-in-the-parameters RBF regression. It also shares the disadvantage of having to choose the form and radius of the basis functions.

2.5 Summary

In this chapter we described the three major steps in a pose recognition algorithm: representation/descriptor, matching and estimation method. We presented a brief background on channel representation, which led to the central representation of this thesis, the P-channels. The transform from non-overlapping P-channels to an approximation of overlapping B1-splines was also explained.

In the matching section, a number of dissimilarity measures were presented and explained. The bin-by-bin measures, comparing only bins or channels with the same index, include the Minkowski-form distance (L1- or L2-norm), the Kullback-Leibler divergence (KLD) and the χ²-distance. The cross-bin measures, taking neighboring or all bins into account, include the quadratic form distance (QFD), the earth mover's distance (EMD), the diffusion distance and projection by pseudoinverse.



In the estimation step we started with different approaches to nearest-neighbor searching. We then described different types of regression, starting with the simple linear least-squares estimation, which was then extended into an approach using radial basis functions. Lastly, a more advanced regression approach, the relevance vector machine (RVM), was presented.


Chapter 3

Implementation

In this chapter, we describe how the algorithms and methods from chapter 2 were implemented. We start by describing the tools used and then work through the system one abstraction layer at a time, via feature extraction, representation, matching and estimation/regression.

3.1 Tools

Two main tools were used to implement, test and evaluate combinations of the methods described in chapter 2:

• MATLAB is a numerical computing environment and high-level scripting language developed by The MathWorks that should be familiar to most students and professionals in science and technology. The environment enables quick implementation and testing of mathematical algorithms, though execution speed is generally lower than that of a well-optimized corresponding implementation in a native programming language such as C++.

• VisionLib is a C++ code framework and a platform for research and development in computer vision developed at Fraunhofer IGD. It provides a modularized architecture, where vision algorithms can be plugged in as actions. Several actions can be stacked upon one another in an action pipe, successively transforming and analyzing an input image. Moreover, many basic tasks, such as video capturing, 3D rendering, disk I/O and image format conversion, are already developed and tested, which enables the developer to focus on the algorithms currently tested.

In the work included in this thesis, MATLAB was used to implement all algorithms and test their viability. The methods that showed promising results were then implemented in VisionLib and tested in a wider scope.



3.2 Feature Extraction

3.2.1 Gradient Orientation and Magnitude

Simple gradient estimation was done by convolution with [−1 0 1] kernels in the x- and y-direction, estimating the x- and y-components of the gradient. The results of the convolutions were then combined to get direction and magnitude. For the orientation we used the double-angle representation (see [14], chapter 1), i.e. θ = arg((f_x + i f_y)²), stripping the orientation vector of its sign. For advanced orientation estimation, an implementation using quadrature filters was created in MATLAB. The details are outside the scope of this thesis; the implementation was based on the paper by Knutsson and Andersson [17]. Also in this case, the double-angle representation was extracted from the acquired structure tensor.
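The simple gradient path can be sketched in NumPy (the thesis version was in MATLAB; the horizontal-ramp test image below is an illustrative choice). Note how the double-angle mapping makes opposite gradient directions yield the same orientation:

```python
import numpy as np

def gradient_orientation(img):
    # [-1 0 1] differences along x and y, then double-angle orientation
    fx = np.zeros_like(img)
    fy = np.zeros_like(img)
    fx[:, 1:-1] = img[:, 2:] - img[:, :-2]
    fy[1:-1, :] = img[2:, :] - img[:-2, :]
    theta = np.angle((fx + 1j * fy) ** 2)   # theta = arg((f_x + i f_y)^2)
    mag = np.abs(fx + 1j * fy)
    return theta, mag

ramp = np.tile(np.linspace(0.0, 1.0, 8), (8, 1))   # horizontal intensity ramp
theta, mag = gradient_orientation(ramp)
theta_flipped, _ = gradient_orientation(1.0 - ramp)  # reversed gradient
```

The reversed ramp produces the same θ as the original, which is exactly the sign-stripping property of the double-angle representation.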

3.2.2 Color Information

The conversion between an RGB image and HSV (hue, saturation, value) exists as a built-in function rgb2hsv() in MATLAB. A base for an implementation in VisionLib was also readily available.

3.3 Representation

3.3.1 P-Channel Encoding

A P-channel encoding function was created in MATLAB, starting from Algorithm 1 in [9] and adding optimizations for MATLAB's matrix-oriented architecture. The prototype of the MATLAB function is as follows:

P = pchannel(features,J,mask,weight)

The features argument is a 3D matrix of features to be encoded. The argument J is a vector containing the number of channels in each dimension. This parameter has not been extensively evaluated in earlier P-channel papers [9, 10], where an ad hoc choice of eight channels for the spatial dimensions and four channels for other features was made. In this thesis, this setting was used as a base, but several other configurations were tested. The weight parameter is a matrix of the same size as the features argument, and enables weighting of the features. This was primarily used to weight the orientation feature with the magnitude of the gradient, similar to histograms of oriented gradients [6], but experiments with weighting the hue with the saturation were also conducted. When weighting more than one feature (i.e. orientation with gradient magnitude and hue with saturation), the additions to the linear offset components are scaled with the respective weights, and the contribution to the histogram component is scaled with the product of both weights for the considered pixel. The mask argument is a matrix of ones and zeros, denoting which image pixels should influence the representation. This property was used to threshold the orientation, disregarding pixels with small gradient magnitude. This is especially useful when using synthetic images with a uniform black background. Without using threshold or weight, the background pixels would have zero orientation, which would match badly with the noisy background of camera images. If these features are thresholded, however, the distance measure can be modified with this in mind.
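To illustrate the core of the encoding, here is a deliberately simplified 1D NumPy sketch of a P-channel — per channel, a normalized count plus the mean offset of the samples from the channel center. It omits the multi-dimensional features, weighting and masking of the thesis implementation:

```python
import numpy as np

def pchannel_1d(samples, n_channels, lo=0.0, hi=1.0):
    # each channel stores a normalized count (histogram component) and the
    # mean offset of its samples from the channel center (linear offset)
    width = (hi - lo) / n_channels
    idx = np.clip(((samples - lo) / width).astype(int), 0, n_channels - 1)
    centers = lo + (idx + 0.5) * width
    hist = np.zeros(n_channels)
    offs = np.zeros(n_channels)
    np.add.at(hist, idx, 1.0)
    np.add.at(offs, idx, samples - centers)
    nz = hist > 0
    offs[nz] /= hist[nz]
    return hist / hist.sum(), offs

hist, offs = pchannel_1d(np.array([0.1, 0.1, 0.9]), n_channels=2)
```

The offset component is what the QFD-style measures later exploit to refine the bin center locations.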

3.3.2 B-Spline Representation

Based on section 3.2 in [10], a transformation from the P-channel representation to the B-spline representation described in section 2.2.3 was implemented in MATLAB. The prototype of this function is as follows:

B = pc2bsc2(P,J,Kp,noiseLevel)

The P and J arguments are known from the P-channel encoding function. The Kp argument is simply the number of periodic features, which require repeating the first P-channel bin after the last, as mentioned in section 2.2.3. It is assumed that the periodic features are placed first in the feature matrix, i.e. the first Kp features are assumed to be periodic.

Estimated Noise Level

The noiseLevel argument to the B-spline transformation function merits a more detailed explanation. From the image, a feature vector is extracted. Some of the values in this feature vector will come from actual features, while some will come from random noise. In a statistical sense, this can be viewed as drawing samples from two distributions, where the first is the distribution of the features, B_f(f), and the other is a uniform noise distribution, Π_f(f). The distribution of the combined set of drawn samples will be the weighted sum of the two distributions

B(f) = (1 − λ) B_f(f) + λ Π_f(f)    (3.1)

normalized so that

∫_F B(f) df = 1    (3.2)

where F is the entire feature space. Using the B-spline representation to estimate the distribution, we get an approximation of B, whereas we are really interested in B_f. To get B_f from our estimate B̂ we can guess λ and write

B̂_f(f) = max(0, B̂(f) − λ / |B̂|)    (3.3)

where |B̂| denotes the total number of channels/bins. The estimate is again normalized, so that

Σ_f B̂_f(f) = 1    (3.4)

With a good guess of λ, this provides a more precise distribution estimate and should be more useful for matching.
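The subtraction and renormalization in (3.3)–(3.4) can be sketched as follows (the example distribution and λ are illustrative values, not thesis data):

```python
import numpy as np

def subtract_noise_floor(B_hat, lam):
    # eq. (3.3): subtract the uniform noise share lam/|B|, clamp at zero,
    # then renormalize as in eq. (3.4)
    Bf = np.maximum(0.0, B_hat - lam / B_hat.size)
    s = Bf.sum()
    return Bf / s if s > 0 else Bf

B_hat = np.array([0.05, 0.05, 0.60, 0.25, 0.05])   # estimated distribution
Bf = subtract_noise_floor(B_hat, lam=0.25)
```

Bins at or below the estimated noise floor are zeroed, and the surviving probability mass is redistributed over the true feature peaks.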



3.4 Matching

All bin-by-bin distance measures mentioned in section 2.3.1 consist of simple arithmetic operations and sums, making implementing them all in MATLAB a trivial task. The same is true for the QFD and the projection by pseudoinverse from section 2.3.2. The EMD, on the other hand, is a quite advanced algorithm. Luckily, there is C code available [20], as well as a MATLAB mex interface [2] based on the same C code. The diffusion distance lands somewhere in between with regard to complexity; a MATLAB implementation was made based on the paper by Ling [19]. The important parameters here are the size (in samples) of the Gauss kernel and the upper limit t̄ from (2.27), both of which balance execution speed against accuracy of the approximations.

3.5 Estimation

With the distance measures properly implemented in MATLAB, implementing a nearest-neighbor search is a trivial task, as is extending it to the successively refined search described in section 2.4.1. In MATLAB, interpolation is readily available through the interp() function, for use in the sparse interpolated nearest-neighbor approach.

The linear least-squares approach relies on simple matrix arithmetic, which is well supported by MATLAB. The parameter to be decided is how many of the samples in the training set are to be used in optimizing the weights, as too many will cause overfitting and too few will not optimize the weights enough.

The radial basis functions approach is similar to the linear least-squares, with the additional element of the distance matrix calculations which lead to the Φ matrix.

Implementing the relevance vector machine from scratch is well outside the scope of this thesis. Luckily, there is an implementation of the multi-variate RVM readily available in both MATLAB code and C++ [22]. Unfortunately, the MATLAB implementation is unbearably slow for higher-dimensional input vectors; instead, a standalone C++ application developed by Alain Pagani (Fraunhofer IGD), based on the aforementioned code, was used. Data was communicated between MATLAB and the RVM application by means of text files.


Chapter 4

Testing

With the methods described in chapter 2 implemented, this chapter describes how the implementations were tested to find which combination of methods has the most merit in a pose recognition application. We begin by describing the testing of the feature extraction, examining visually whether the features correspond well between model and camera images. We then describe the tests used to find which combination of representation parameters (input parameters to the P-channels and B-splines) and choice of dissimilarity measure provides the best capability to find the correct best match. Lastly, a short description is included of how the estimation methods from section 2.4 (linear least-squares, radial basis functions and the relevance vector machine) were tested. The results of the testing methods described in this chapter are discussed in chapter 5.

4.1 Test Data

To test the different steps of the application, two objects were used: a textured cardboard box, and a toy car of the model Mini Cooper. The objects, which will from here on be referred to as the “box” and the “mini” respectively, are shown together with their 3D models in fig. 4.1.

From these objects, a number of image sets containing 60 snapshots covering a 360° rotation around the vertical axis were created. Limiting the tests to one degree of freedom enables smaller test sets, simplifies illustration of results, and provides a "lower bound" test: what does not work with one DOF will most likely not work with three DOF.

4.1.1 Disturbance

To test the robustness of the algorithm, test sets with introduced disturbances were created. Two sorts of disturbances were tested: partial occlusion and translation. Occlusion was achieved by placing a Post-it note on one of the corners of the box (see fig. 4.2). Translation was examined by simply moving the center of rotation of the 3D model in the training/prototype set.



Figure 4.1. Test data: synthetic model (left) and camera image (right) of “mini” (top) and “box” (bottom)



4.2 The "Background Problem"

When learning from model images and testing on camera images, one of the main difficulties is that the backgrounds of the camera images will differ from the model images. We examined three ways to deal with this problem:

• Leave the background black in the model images and make sure the algorithm does not make use of this “disinformation”.

• Add a background to the model images consisting of random photographs, and design the algorithm to average between different backgrounds.

• Add a background of Gaussian or uniformly distributed noise.

After some rudimentary tests, it was judged that the first option limits the choice of algorithm too much, and that the third option with random noise works at least as well as the random background picture idea. As the noise background approach is much simpler, that is the approach that was used.

4.3 Feature Extraction

On the first level of abstraction, it is interesting to see if the features extracted from model and camera images are visually similar for corresponding poses. The easiest, and likely most useful, way to do this is to visualize the extracted features in a suitable way and compare the results by eye. Beyond this, the testing of the merits of the features in the pose estimation application is included in the testing of the dissimilarity measures.

We found that the compression performed by the camera caused artifacts in the image, which had a big impact on the extracted features. This was solved by adding a little bit of noise to both model and camera images, and low-pass filtering the images before feature extraction. The problem and solution are visible in fig. 4.3.

4.4 Testing Dissimilarity Measures

Running 60 synthetic images and 60 camera images with corresponding poses through the feature extraction and encoding, we get a matrix of 60 prototype vectors p_1, p_2, …, p_60 and 60 query vectors q_1, q_2, …, q_60. By choosing a dissimilarity measure, we can create a dissimilarity matrix D with the elements d_ij = D(p_i, q_j), where D(·) is the chosen dissimilarity measure. Visualizing the distance matrix for different choices of features, representation parameters and dissimilarity measures, we get a rough idea of which settings provide the best discrimination capabilities. By using test sets with the disturbances described in section 4.1.1, we can also see how settings and choices affect the robustness of the application.
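The construction of the dissimilarity matrix can be sketched as follows; the toy identity-matrix "representations" and the L1 distance stand in for the actual encoded images:

```python
import numpy as np

def dissimilarity_matrix(protos, queries, dist):
    # d_ij = D(p_i, q_j) for every prototype/query pair
    return np.array([[dist(p, q) for q in queries] for p in protos])

protos = np.eye(4)                      # toy "encoded images"
queries = np.eye(4)
D = dissimilarity_matrix(protos, queries, lambda a, b: np.sum(np.abs(a - b)))
best = D.argmin(axis=0)                 # best-matching prototype per query
```

With a good measure, the minima of each column should trace the diagonal of D — which is exactly what the visualizations in this section look for.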



Figure 4.3. The compression artifacts in the camera image (upper right), e.g. on the roof of the car, cause unwanted dissimilarity in the extracted gradient orientation between the model (left) and camera (right) images. With added noise and low-pass filtering, the problem is solved (bottom pair).



4.4.1 EMD and QFD

It was quickly judged that the Earth Mover's Distance and the Quadratic Form Distance (both section 2.3.2), however theoretically appealing, have to be excluded from the set of eligible distance measures due to their devastating computational complexity. A single EMD calculation takes on the order of 0.1 seconds when using eight spatial channels in each dimension and four channels in the feature dimension. The QFD is somewhat faster, but both share the problem that the execution time grows quickly with finer binning, as the calculation of the distance matrix is quadratic in the total number of channels. As at least hundreds or thousands of distance calculations have to be performed in the estimation step of a full three-DOF application, it is clear that 0.1 seconds is far beyond what can be deemed useful.

4.4.2 Projection by Pseudoinverse

Though the projection by pseudoinverse showed some merit in isolated cases, it lacked robustness when conditions were less favorable. This is illustrated in fig. 4.4: though the ideal case shows good results, the performance declines very quickly with the disturbances from section 4.1.1.

(Panels, left to right: no disturbance, occlusion, translation.)

Figure 4.4. Projection by pseudoinverse for the box model. The images are visualizations of the D-matrix from section 4.4; high similarity is shown by higher intensity, and the best match is marked by a red dot. In the leftmost image, the best match is almost always close to the right one. With partial occlusion, however, the situation gets worse, and with translation, the result is largely useless.

4.4.3 Remaining Dissimilarity Measures

The remaining dissimilarity measures were tested for a wide variety of parameter settings and feature selections on both models, with or without occlusion and translation. For every setting, a diagram of the type seen in fig. 4.5 was created,



to find the best combination of parameters, features and dissimilarity measure. It was found that using a bin-by-bin dissimilarity measure on the P-channel vector, as compared to only the histogram component, yields near identical results. It is also hard to theoretically motivate using a bin-by-bin measure on the P-channel representation, as it will count the linear offset components the same as distance components. Because of this, we applied the bin-by-bin measures on histograms and B-spline representations only.

Execution Time

The computational complexity of the Minkowski-form distances, the KLD and the χ2-distance is approximately the same, all containing only a few simple operations for each bin. The differences in execution time depend mostly on how optimized the algorithm is. The diffusion distance, being a much more advanced distance measure and using cross-bin information, is significantly slower. In table 4.1 the execution times for the MATLAB implementations of the distance measures are listed. These do not necessarily correspond directly to an optimized implementation in C++, but they provide a rough indication of the complexity of the algorithms. For example, the MATLAB implementation of the KLD is, as can be seen in the table, an order of magnitude faster than the Minkowski-form and χ2-distances. However, this is assumed to be due to the way MATLAB handles loops and sums rather than to the actual computational complexity of the distance measures (which ought to be similar for the three).

Measure                     Time (µs)
Minkowski-form Distance     190
KLD                         17.6
χ2-distance                 119
Diffusion Distance          9444

Table 4.1. The execution times of the MATLAB implementations of the used distance measures.
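The bin-by-bin measures in the table can be sketched as follows. This is a minimal Python sketch with the standard definitions; the small epsilon guarding the logarithm and division is our own implementation assumption, not taken from the thesis:

```python
import numpy as np

def minkowski(p, q, r=1):
    """Minkowski-form distance between histograms; r=1 gives L1, r=2 gives L2."""
    return float(np.sum(np.abs(p - q) ** r) ** (1.0 / r))

def kld(p, q, eps=1e-12):
    """Kullback-Leibler divergence D(p || q); note that it is not symmetric.

    eps guards against log(0) and division by zero (an assumption of this
    sketch).
    """
    p = p + eps
    q = q + eps
    return float(np.sum(p * np.log(p / q)))

def chi2(p, q, eps=1e-12):
    """Chi-square distance: sum over bins of (p - q)^2 / (p + q)."""
    return float(np.sum((p - q) ** 2 / (p + q + eps)))
```

All three touch each bin exactly once, i.e. they are linear in the number of bins, which is why their true computational complexity ought to be in the same ballpark despite the differing MATLAB timings.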

4.5 Testing Estimation Methods

All estimation methods from section 2.4 were tested using the same test sets as when testing the dissimilarity measures. The parameters found to be optimal in the distance measure tests were used in the estimation tests as well, since the performance of all estimation methods relies more or less directly on the distance measure used (or, in the case of linear least-squares, on the Projection by Pseudoinverse measure). In the tests, the target value was represented either by the scalar rotation angle, φ, or by the vector (cos(φ) sin(φ))T. The latter target representation has the advantage of being continuous, whereas the former may have problems around φ = 0 and φ = 2π. The estimated rotation angle for each image was then plotted


[Figure 4.5: D-matrix visualizations for the L1-norm, L2-norm, KLD, χ2 distance and diffusion distance, each applied to histograms and B-spline representations.]



against the angle measured in the scene during the creation of the image, and an absolute error could be plotted as the difference between the two.
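The two target representations, and why the vector form avoids the wrap-around problem at φ = 0 and φ = 2π, can be illustrated as follows; the function names are our own, not taken from the implementation:

```python
import numpy as np

def angle_to_vec(phi):
    """The continuous target representation (cos(phi), sin(phi))^T."""
    return np.array([np.cos(phi), np.sin(phi)])

def vec_to_angle(v):
    """Recover the rotation angle in [0, 2*pi) from the vector target."""
    return float(np.arctan2(v[1], v[0]) % (2 * np.pi))

def angular_error(phi_est, phi_true):
    """Absolute error on the circle, i.e. the smaller arc between the angles."""
    d = abs(phi_est - phi_true) % (2 * np.pi)
    return min(d, 2 * np.pi - d)
```

For example, averaging the vector targets of two angles just below and just above zero yields an estimate near zero, whereas naively averaging the scalar angles 0.1 and 2π − 0.1 would give the useless value π.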


Chapter 5

Results and Evaluation

In this chapter the results from the tests described in chapter 4 will be presented and discussed. We will show how the choice of features and encoding parameters of the P-channel and B-spline representations affected the performance of the different dissimilarity measures. We will then discuss which dissimilarity measures succeeded best in finding the correct best match. The results of the four estimation algorithms used will be presented and commented on, in relation to the expectations and speculations in previous chapters. As a last point, a real-time video implementation will be presented, and the resulting performance will be judged.

In this chapter, visualizations of the performance of distance measures, in the form of a D-matrix, are frequent. The D-matrix, as described in section 4.4, is a matrix with the elements dij = D(pi, qj), where pi is the representation of a model image, qj is the query vector from a camera image and D(·, ·) is the chosen dissimilarity measure. In the visualizations, high similarity is denoted by bright pixels, and the best match is marked by a red dot.
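A minimal sketch of how such a D-matrix and the best match per query could be computed; function and variable names are illustrative and not taken from the implementation:

```python
import numpy as np

def d_matrix(model_reps, query_reps, dist):
    """Build D with d_ij = dist(p_i, q_j) for model representations p_i
    and query representations q_j."""
    D = np.empty((len(model_reps), len(query_reps)))
    for i, p in enumerate(model_reps):
        for j, q in enumerate(query_reps):
            D[i, j] = dist(p, q)
    return D

def best_matches(D):
    """For each query (column of D), the index of the most similar model
    image, i.e. the one at minimum distance."""
    return np.argmin(D, axis=0)
```

This nested loop is also where the number of distance evaluations grows into the hundreds or thousands for a full pose estimation, which is why the per-evaluation cost of the chosen measure matters so much.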

5.1 Features and Parameters

5.1.1 Feature Selection

By visualizing the distance matrix D for different choices of features, it was shown that the gradient orientation was the feature that gave the best results overall. The hue also yielded useful results, especially on the box model (see fig. 5.1). However, we found that the hue was more sensitive to disturbances such as translation or occlusion. When trying combinations of hue and orientation, we achieved the best results using only the orientation.
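A simple gradient orientation feature could be extracted as sketched below; central differences stand in here for whatever derivative filters the actual implementation uses, and folding the orientation into [0, π) is our assumption:

```python
import numpy as np

def gradient_orientation(img):
    """Per-pixel gradient orientation of a grayscale image.

    Returns the orientation in [0, pi) (gradient direction modulo pi) and
    the gradient magnitude, which can serve as a confidence weight when
    encoding the feature into channels.
    """
    # np.gradient returns the per-axis gradients: axis 0 (rows) first.
    gy, gx = np.gradient(img.astype(float))
    orientation = np.arctan2(gy, gx) % np.pi
    magnitude = np.hypot(gx, gy)
    return orientation, magnitude
```

On a synthetic image with a single vertical edge, the orientation at the edge pixels comes out as 0 (a horizontal gradient), which is the kind of response a channel encoding of this feature would histogram.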

Advanced Orientation Estimation

Some experiments were conducted with a more sophisticated orientation estimation method, using quadrature filters [17]. However, using a more advanced orientation estimate did not show a consistent positive effect on the performance of
