Structure and Motion Estimation using Sparse Point and Line Correspondences in Multiple Affine Views



Lars Bretzner and Tony Lindeberg

Computational Vision and Active Perception Laboratory (CVAP)

Department of Numerical Analysis and Computing Science

KTH (Royal Institute of Technology)

S-100 44 Stockholm, Sweden.

http://www.nada.kth.se/~tony

Email: {bretzner, tony}@nada.kth.se

Technical report ISRN KTH/NA/P--99/13--SE

Abstract

This paper addresses the problem of computing three-dimensional structure and motion from an unknown rigid configuration of points and lines viewed by an affine projection model. An algebraic structure, analogous to the trilinear tensor for three perspective cameras, is defined for configurations of three centered affine cameras. This centered affine trifocal tensor contains 12 non-zero coefficients and involves linear relations between point correspondences and trilinear relations between line correspondences. It is shown how the affine trifocal tensor relates to the perspective trilinear tensor, and how three-dimensional motion can be computed from this tensor in a straightforward manner. A factorization approach is developed to handle point features and line features simultaneously in image sequences, and degenerate feature configurations are analysed. This theory is applied to a specific problem in human-computer interaction of capturing three-dimensional rotations from gestures of a human hand. This application to quantitative gesture analyses illustrates the usefulness of the affine trifocal tensor in a situation where sufficient information is not available to compute the perspective trilinear tensor, while the geometry requires point correspondences as well as line correspondences over at least three views.

An earlier version of this manuscript was presented in H. Burkhardt and B. Neumann (eds.), Proc. 5th European Conference on Computer Vision (Freiburg, Germany), vol. 1406 of Springer-Verlag Lecture Notes in Computer Science, pp. 141-157, June 1998. The support from the Swedish Research Council for Engineering Sciences, TFR, and the Swedish National Board for Industrial and Technical Development, NUTEK, is gratefully acknowledged.


1 Introduction
2 Geometric problem and extraction of image features
3 The trifocal tensor for three centered affine cameras
  3.1 Perspective camera and three views
  3.2 Affine camera and three views
4 The centered affine camera and its relations to perspective
5 Orientation from the centered affine trifocal tensor
6 Joint factorization of point and line correspondences
  6.1 Structure estimation from point and line correspondences
  6.2 Resolving the ambiguity in the rotation estimates
  6.3 Relative weighting of point and line constraints
    6.3.1 Computing the centered affine trifocal tensor
    6.3.2 Finding scale factors of lines prior to factorization
    6.3.3 Simultaneous factorization of points and lines
7 Degenerate situations
  7.1 Degenerate three-dimensional shapes
  7.2 Degenerate three-dimensional motions
8 Experiments
  8.1 Experiments on synthetic test data
    8.1.1 Error measures
    8.1.2 Influence of feature localization errors
    8.1.3 Influence of number of image features
    8.1.4 Influence of perspective effects
    8.1.5 Influence of temporal sampling density
  8.2 Dependency on object shape
  8.3 Conclusions from the synthetic experiments
  8.4 Experiments on real image data
9 Summary and discussion
A Appendix
  A.1 Algebraic constraints on the affine trifocal tensor
  A.2 Experimental investigation of a minimal case

1 Introduction

The problem of deriving structural information and motion cues from image sequences arises as an important subproblem in several computer vision tasks. In this paper, we are concerned with the computation of three-dimensional structure and motion from point and line correspondences extracted from a rigid three-dimensional object of unknown shape, using the affine camera model.

Early works addressing this problem domain based on point correspondences from perspective and orthographic projection have been presented by Ullman [1], Maybank [2], Huang and Lee [3], Huang and Netravali [4] and others. With the introduction of the affine camera model (Koenderink and van Doorn [5], Mundy and Zisserman [6]), a large number of approaches have been developed, including Shapiro [7], Beardsley et al. [8], McLauchlan et al. [9] and Torr [10], to mention just a few; see also Faugeras [11]. Line correspondences have been studied by Spetsakis and Aloimonos [12] and Weng et al. [13], and factorization methods for points and lines constitute a particularly interesting development (Tomasi and Kanade [14], Poelman and Kanade [15], Quan and Kanade [16], Sturm and Triggs [17]). These directions of research have recently been combined with the ideas behind the fundamental matrix (Longuet-Higgins [18], Faugeras [19], Xu and Zhang [20]) and have led to the trilinear tensor (Shashua [21], Hartley [22], Heyden [23]) as a unified model for point and line correspondences for three cameras, with interesting applications (Beardsley et al. [24]) as well as a deeper understanding of the relations between point features and line features over multiple views (Faugeras and Mourrain [25], Heyden et al. [26]).

The subject of this paper is to build upon the above-mentioned works, and to develop a framework for handling point and line features simultaneously for three or more affine views. Initially, we shall focus on image triplets and show how an affine trifocal tensor can be defined for three centered affine cameras. This tensor has a similar algebraic structure to the trilinear tensor for three perspective cameras. Compared to the trilinear tensor, however, it has the advantage that it contains a smaller number of coefficients, which implies that fewer feature correspondences are required to determine this tensor. Motion estimation from this tensor is more straightforward than for the perspective trilinear tensor. Moreover, the results from affine motion estimation can be expected to be more robust than perspective analysis in situations when the perspective effects are small. To handle image features in more than three images, we shall also develop a factorization approach, which involves simultaneous handling of point and line features in multiple image frames.

This theory will then be applied to the problem of computing changes in three-dimensional orientation from a sparse set of point and line correspondences. Specifically, it will be demonstrated how a man-machine interface for 3-D interaction can be designed based on the theory presented. The idea is to track point and line features corresponding to the finger tips and the orientation of the fingers, and to compute three-dimensional rotations (and translations) assuming rigidity of the hand. These motion estimates can then be used for controlling the motion of other computer-controlled equipment (Lindeberg and Bretzner [27]). Notably, we thereby eliminate the need for external control equipment other than the operator's own hand.

2 Geometric problem and extraction of image features

A main rationale for this work originates from the following question: If we have a sparse set of image features that have been tracked over a relatively long period of time, to what extent can such extended feature trajectories be used for computing the three-dimensional structure and motion of a rigid object? Moreover, we are interested in exploring whether it is possible to make use of image features that have been extracted from natural objects. Most works on three-dimensional structure and motion estimation have been performed under different conditions, by exploiting dense sets of image features, which have been computed from man-made objects.

Figure 1 shows one specific application, which we will focus on. The idea is to capture three-dimensional motions as mediated by the gestures of a human hand, and to use measurements of 3-D rotational information computed in this way for controlling other computerized equipment; see [27] for a more general description and Cipolla et al. [28], Freeman and Weissman [29], Maggioni and Kammerer [30] for related works. In contrast to previous approaches for human-computer interaction that are based on detailed geometric hand models (such as Kuch and Huang [31], Lee and Kunii [32], Heap and Hogg [33], Yasumuro et al. [34]), we shall here explore a model based on qualitative features only. This model involves three to five fingers, and for each finger the position of the finger tip and the orientation of the finger are measured in the image domain. Successful tracking of these image features over time leads to a set of point correspondences and line correspondences. The task is then to compute changes in the 3-D orientation of such a configuration, which is assumed to be rigid.

Given only a small number of image features, neither the trajectories of the point features nor those of the line features per se are sufficient to compute the motion information we are interested in. For example, when a user holds his hand with the fingers spreading out, we have experienced that the positions of the finger tips will often be in approximately the same plane, leading to ill-conditioned motion estimates if computed from point features only. Therefore, the ability to combine point features and line features is of high importance. Moreover, due to the small number of image features, the information is not sufficient to compute the trilinear tensor for perspective projection (see the next section). For this reason, we shall use an affine projection model, and the affine trifocal tensor will be a key tool.

The trajectories of image features used as input are extracted using a framework for feature tracking with automatic scale selection reported in Bretzner and Lindeberg [35, 36]. Blob features corresponding to the finger tips are computed from points $(x, y; t)$ in scale-space (Koenderink [37], Lindeberg [38]) at which the squared normalized Laplacian

$$(\nabla^2_{norm} L)^2 = t^2 (L_{xx} + L_{yy})^2 \tag{1}$$

assumes maxima with respect to scale and space simultaneously (Lindeberg [39]). Such points are referred to as scale-space maxima of the normalized Laplacian. In a similar way, ridge features are detected from scale-space maxima of a normalized measure of ridge strength

$$\mathcal{A}L^2_{\gamma\text{-norm}} = t^{4\gamma} \left( (L_{pp} - L_{qq})^2 \right)^2 = t^{4\gamma} \left( (L_{xx} - L_{yy})^2 + 4 L_{xy}^2 \right)^2, \tag{2}$$

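where $L_{pp}$ and $L_{qq}$ denote the eigenvalues of the Hessian matrix and the normalization parameter $\gamma = 0.875$ (Lindeberg [40]). At each ridge feature, a windowed second moment matrix

$$\mu = \iint_{(\xi, \eta) \in \mathbb{R}^2} \begin{pmatrix} L_x^2 & L_x L_y \\ L_x L_y & L_y^2 \end{pmatrix} g(\xi, \eta; s) \, d\xi \, d\eta \tag{3}$$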

is computed using a Gaussian window function $g(\xi, \eta; s)$ centered at the spatial maximum of $\mathcal{A}L_{\gamma\text{-norm}}$ and with the integration scale $s$ tuned by the detection scale of the scale-space maximum of $\mathcal{A}L_{\gamma\text{-norm}}$. The eigenvector of $\mu$ corresponding to the largest eigenvalue gives the orientation of the finger.
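As a minimal illustration of this last step, the sketch below forms a discrete analogue of the windowed second moment matrix (3) from a handful of gradient samples and extracts the eigenvector belonging to its largest eigenvalue. The function names, the sample layout `(xi, eta, Lx, Ly)` and the closed-form treatment of the 2x2 case are assumptions made for this example; they are not taken from the tracking framework of [35, 36].

```python
import math

def second_moment_matrix(samples, s):
    """Discrete analogue of eq. (3): Gaussian-weighted sum of gradient outer products.

    samples: iterable of (xi, eta, Lx, Ly), offsets relative to the window centre.
    s: integration scale (variance of the Gaussian window).
    Returns the three distinct entries (m11, m12, m22) of the symmetric matrix.
    """
    m11 = m12 = m22 = 0.0
    for xi, eta, lx, ly in samples:
        w = math.exp(-(xi * xi + eta * eta) / (2.0 * s)) / (2.0 * math.pi * s)
        m11 += w * lx * lx
        m12 += w * lx * ly
        m22 += w * ly * ly
    return m11, m12, m22

def dominant_eigenvector(m11, m12, m22):
    """Unit eigenvector of [[m11, m12], [m12, m22]] for its largest eigenvalue."""
    # Principal-axis angle of a symmetric 2x2 matrix.
    theta = 0.5 * math.atan2(2.0 * m12, m11 - m22)
    return math.cos(theta), math.sin(theta)

# Hypothetical data: every gradient sample is parallel to (1, 1).
samples = [(dx, dy, 2.0, 2.0) for dx in (-1, 0, 1) for dy in (-1, 0, 1)]
ex, ey = dominant_eigenvector(*second_moment_matrix(samples, s=2.0))
```

The eigenvector is only defined up to sign, so `(ex, ey)` and `(-ex, -ey)` describe the same orientation.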

Figure 1: Results of multi-scale tracking of point and line features corresponding to the finger tips and the fingers of a human hand. (left) grey-level image showing the first frame in an image sequence, (middle) image features extracted by combining the detection of scale-space maxima of blob and ridge features [39, 40] with a qualitative hand model in the form of a multi-scale feature hierarchy [41], (right) feature trajectories obtained by multi-scale feature tracking [35].

Figure 1 (right) shows an example of image trajectories obtained in this way. An attractive property of this feature tracking scheme is that the scale selection mechanism adapts the scale levels to the local image structure. This gives the ability to track image features over large size variations, which is particularly important for the ridge tracker. Provided that the contrast to the background is sufficient, this scheme gives feature trajectories over large numbers of frames, using a conceptually very simple interframe matching mechanism.
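To make the scale selection idea concrete, the following sketch evaluates the squared normalized Laplacian (1) at the centre of an ideal Gaussian blob, for which the scale-space representation is known in closed form, and picks the scale at which the response is strongest. The blob model, the function names and the sampled scale grid are assumptions for this illustration only; an actual detector would search over space as well as scale in image data.

```python
import math

def squared_normalized_laplacian_at_centre(t, t0):
    """Eq. (1) evaluated at the centre of a unit-mass Gaussian blob of variance t0.

    Smoothing the blob with a Gaussian of variance t gives a Gaussian of
    variance t0 + t, whose Laplacian at the centre is -1 / (pi * (t0 + t)**2).
    """
    laplacian = -1.0 / (math.pi * (t0 + t) ** 2)
    return (t * laplacian) ** 2

def select_scale(t0, scales):
    """Return the sampled scale with the strongest normalized response."""
    return max(scales, key=lambda t: squared_normalized_laplacian_at_centre(t, t0))

scales = [0.25 * k for k in range(1, 401)]   # t = 0.25, 0.5, ..., 100
small = select_scale(4.0, scales)            # blob of variance 4
large = select_scale(25.0, scales)           # blob of variance 25
```

For this model the response is proportional to t^2 / (t0 + t)^4, which peaks at t = t0, so the selected scale follows the size of the blob.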

3 The trifocal tensor for three centered affine cameras

To capture motion information from the projections of an unknown configuration of points and lines in 3-D, it is necessary to have at least three independent views. A canonical model for describing the geometric relationships between point correspondences and line correspondences over three perspective views is provided by the trilinear tensor (Shashua [21, 42], Hartley [22], Heyden et al. [26]). For affine cameras, a compact model of point correspondences over multiple frames can be obtained by factorizing a matrix with image measurements into the product of two matrices of rank 3, one representing motion and the other representing shape (Tomasi and Kanade [14], Ullman and Basri [43]). Frameworks for capturing line correspondences over multiple affine views have been presented by Quan and Kanade [16], and for point features under perspective projection by Sturm and Triggs [17].

Here, we address the simultaneous modelling of point and line correspondences over three views with the affine projection model. It will be shown how an algebraic structure closely related to the trilinear tensor can be defined for three centered affine cameras. This centered affine trifocal tensor involves linear relations between the point features and trilinear relationships between the line features.
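The rank property behind the factorization of point measurements mentioned above can be checked on synthetic data. The sketch below stacks centered affine projections of a few hypothetical 3-D points into a measurement matrix and verifies, with exact rational arithmetic, that its rank does not exceed three. The camera rows and the points are made-up values, and the Gaussian-elimination rank routine merely stands in for the singular value decomposition normally used in factorization methods.

```python
from fractions import Fraction

def rank(rows):
    """Rank of a matrix (list of rows) by Gaussian elimination over the rationals."""
    m = [[Fraction(v) for v in row] for row in rows]
    r = 0
    for col in range(len(m[0])):
        pivot = next((i for i in range(r, len(m)) if m[i][col] != 0), None)
        if pivot is None:
            continue
        m[r], m[pivot] = m[pivot], m[r]
        for i in range(r + 1, len(m)):
            f = m[i][col] / m[r][col]
            m[i] = [a - f * b for a, b in zip(m[i], m[r])]
        r += 1
        if r == len(m):
            break
    return r

# Hypothetical centered 3-D points (their centroid is the origin).
points = [(1, 0, 0), (0, 2, 0), (0, 0, 3), (2, 1, -1), (-3, -3, -2)]

# Hypothetical 2x3 linear parts of three centered affine cameras (six rows in total).
motion = [(1, 0, 0), (0, 1, 0),
          (2, 1, 1), (-1, 3, 0),
          (0, 1, 2), (1, -1, 4)]

# Measurement matrix W = motion * shape: one row per image coordinate, one column per point.
W = [[sum(c * x for c, x in zip(row, p)) for p in points] for row in motion]
w_rank = rank(W)
```

Although `W` has six rows and five columns, it is the product of a 6x3 and a 3x5 matrix, which is what bounds its rank.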

3.1 Perspective camera and three views

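Consider a point $P = (x, y, 1, \lambda)^T$ which is projected by three camera matrices $M = [I, 0]$, $M' = [A, u']$ and $M'' = [B, u'']$ to the image points $p$, $p'$ and $p''$:

$$p = \begin{pmatrix} x \\ y \\ 1 \end{pmatrix} = \begin{pmatrix} 1 & 0 & 0 & 0 \\ 0 & 1 & 0 & 0 \\ 0 & 0 & 1 & 0 \end{pmatrix} \begin{pmatrix} x \\ y \\ 1 \\ \lambda \end{pmatrix}, \tag{4}$$

$$p' = \begin{pmatrix} x' \\ y' \\ 1 \end{pmatrix} \simeq \begin{pmatrix} a^1_1 & a^1_2 & a^1_3 & u'^{1} \\ a^2_1 & a^2_2 & a^2_3 & u'^{2} \\ a^3_1 & a^3_2 & a^3_3 & u'^{3} \end{pmatrix} \begin{pmatrix} x \\ y \\ 1 \\ \lambda \end{pmatrix} = \begin{pmatrix} a^{1T} p + \lambda u'^{1} \\ a^{2T} p + \lambda u'^{2} \\ a^{3T} p + \lambda u'^{3} \end{pmatrix}, \tag{5}$$

$$p'' = \begin{pmatrix} x'' \\ y'' \\ 1 \end{pmatrix} \simeq \begin{pmatrix} b^1_1 & b^1_2 & b^1_3 & u''^{1} \\ b^2_1 & b^2_2 & b^2_3 & u''^{2} \\ b^3_1 & b^3_2 & b^3_3 & u''^{3} \end{pmatrix} \begin{pmatrix} x \\ y \\ 1 \\ \lambda \end{pmatrix} = \begin{pmatrix} b^{1T} p + \lambda u''^{1} \\ b^{2T} p + \lambda u''^{2} \\ b^{3T} p + \lambda u''^{3} \end{pmatrix}. \tag{6}$$

Here $\simeq$ denotes equality up to a non-zero scale factor, and $a^{jT}$ and $b^{kT}$ denote the $j$-th row of $A$ and the $k$-th row of $B$, respectively.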

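Following Faugeras and Mourrain [25] and Shashua [42], introduce the following two matrices

$$r^\mu_j = \begin{pmatrix} -1 & 0 & x' \\ 0 & -1 & y' \end{pmatrix}, \qquad s^\nu_k = \begin{pmatrix} -1 & 0 & x'' \\ 0 & -1 & y'' \end{pmatrix}. \tag{7}$$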

Then, in terms of tensor notation (where $i, j, k \in [1, 3]$, $\mu, \nu \in [1, 2]$ and we follow the Einstein summation convention that a double occurrence of an index implies summation over that index), the relations between the image coordinates and the camera geometry can be written

$$\lambda \, r^\mu_j u'^{j} + r^\mu_j a^j_i p^i = 0, \qquad \lambda \, s^\nu_k u''^{k} + s^\nu_k b^k_i p^i = 0. \tag{8}$$

By introducing the trifocal tensor (Shashua [21], Hartley [22])

$$T^{jk}_i = a^j_i u''^{k} - b^k_i u'^{j}, \tag{9}$$

the relations between the point correspondences lead to the trifocal constraint

$$r^\mu_j s^\nu_k T^{jk}_i p^i = 0. \tag{10}$$
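The step from (8) to (10) can be spelled out by eliminating the unknown fourth coordinate of $P$, written $\lambda$ here. The short derivation below is a sketch of that reasoning and uses only the quantities defined in (7)-(9).

```latex
% Multiply the first relation in (8) by s^\nu_k u''^k and the second by r^\mu_j u'^j:
\begin{align*}
\lambda \, (r^\mu_j u'^{j}) (s^\nu_k u''^{k}) + (r^\mu_j a^j_i p^i) (s^\nu_k u''^{k}) &= 0, \\
\lambda \, (s^\nu_k u''^{k}) (r^\mu_j u'^{j}) + (s^\nu_k b^k_i p^i) (r^\mu_j u'^{j}) &= 0.
\end{align*}
% Subtracting the second line from the first removes \lambda:
\begin{equation*}
r^\mu_j s^\nu_k \left( a^j_i u''^{k} - b^k_i u'^{j} \right) p^i
  = r^\mu_j s^\nu_k T^{jk}_i p^i = 0 .
\end{equation*}
```

Since $\mu$ and $\nu$ each take two values, this gives four relations per point correspondence.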

Written out explicitly, this expression corresponds to the following four (independent) relations between the projections $p$, $p'$ and $p''$ of $P$ (Shashua [42]):

$$\begin{aligned}
x'' T^{13}_i p^i - x'' x' T^{33}_i p^i + x' T^{31}_i p^i - T^{11}_i p^i &= 0, \\
y'' T^{13}_i p^i - y'' x' T^{33}_i p^i + x' T^{32}_i p^i - T^{12}_i p^i &= 0, \\
x'' T^{23}_i p^i - x'' y' T^{33}_i p^i + y' T^{31}_i p^i - T^{21}_i p^i &= 0, \\
y'' T^{23}_i p^i - y'' y' T^{33}_i p^i + y' T^{32}_i p^i - T^{22}_i p^i &= 0.
\end{aligned} \tag{11}$$
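These four relations can be checked numerically. The sketch below builds the tensor of (9) from two hypothetical camera matrices, projects a hypothetical point into the three views and evaluates the four contractions of (10), which agree with the left-hand sides of (11) up to an overall sign. All numerical values and function names are assumptions chosen for illustration.

```python
def trifocal_tensor(A, B, u1, u2):
    """T[j][k][i] = A[j][i] * u2[k] - B[k][i] * u1[j], following eq. (9)."""
    return [[[A[j][i] * u2[k] - B[k][i] * u1[j] for i in range(3)]
             for k in range(3)] for j in range(3)]

def project(R, u, p, lam):
    """Image point (x, y) of P = (p, lam) under the camera [R, u]."""
    h = [sum(R[j][i] * p[i] for i in range(3)) + lam * u[j] for j in range(3)]
    return h[0] / h[2], h[1] / h[2]

def point_residuals(T, p, p1, p2):
    """The four contractions r^mu_j s^nu_k T^{jk}_i p^i of eq. (10)."""
    r = [(-1.0, 0.0, p1[0]), (0.0, -1.0, p1[1])]   # rows of r in eq. (7)
    s = [(-1.0, 0.0, p2[0]), (0.0, -1.0, p2[1])]   # rows of s in eq. (7)
    return [sum(rm[j] * sn[k] * T[j][k][i] * p[i]
                for j in range(3) for k in range(3) for i in range(3))
            for rm in r for sn in s]

# Hypothetical cameras M' = [A, u'] and M'' = [B, u''], and a hypothetical point.
A = [[1.0, 0.1, 0.0], [-0.1, 1.0, 0.2], [0.0, -0.2, 1.0]]
B = [[0.9, 0.0, 0.3], [0.1, 1.0, 0.0], [-0.3, 0.1, 0.9]]
u1 = [0.5, -0.2, 0.1]
u2 = [-0.4, 0.3, 0.2]
p, lam = (0.3, -0.7, 1.0), 0.25      # P = (x, y, 1, lambda)

T = trifocal_tensor(A, B, u1, u2)
residuals = point_residuals(T, p, project(A, u1, p, lam), project(B, u2, p, lam))
max_residual = max(abs(v) for v in residuals)
```

With exact correspondences, `max_residual` vanishes up to floating-point rounding; with measured image points, the same expressions become linear equations in the 27 tensor entries.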


Given three corresponding lines, $l^T p = 0$, $l'^T p' = 0$ and $l''^T p'' = 0$, each image line defines a plane through the center of projection, given by $L^T P = 0$, $L'^T P = 0$ and $L''^T P = 0$, where

$$\begin{aligned}
L^T &= l^T M = (l_1, \; l_2, \; l_3, \; 0), \\
L'^T &= l'^T M' = (l'_j a^j_1, \; l'_j a^j_2, \; l'_j a^j_3, \; l'_j u'^{j}), \\
L''^T &= l''^T M'' = (l''_k b^k_1, \; l''_k b^k_2, \; l''_k b^k_3, \; l''_k u''^{k}).
\end{aligned} \tag{12}$$

Since $l$, $l'$ and $l''$ are assumed to be projections of the same three-dimensional line, the intersection of the planes $L$, $L'$ and $L''$ must degenerate to a line and

$$\operatorname{rank} \begin{pmatrix} l_1 & l'_j a^j_1 & l''_k b^k_1 \\ l_2 & l'_j a^j_2 & l''_k b^k_2 \\ l_3 & l'_j a^j_3 & l''_k b^k_3 \\ 0 & l'_j u'^{j} & l''_k u''^{k} \end{pmatrix} = 2. \tag{13}$$

All $3 \times 3$ minors must be zero, and removing each of the first three rows in turn leads to the following trilinear relationships, out of which two are independent:

$$\begin{aligned}
(l_2 T^{jk}_3 - l_3 T^{jk}_2) \, l'_j l''_k &= 0, \\
(l_1 T^{jk}_3 - l_3 T^{jk}_1) \, l'_j l''_k &= 0, \\
(l_1 T^{jk}_2 - l_2 T^{jk}_1) \, l'_j l''_k &= 0.
\end{aligned} \tag{14}$$
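The relations (14) can be verified in the same spirit. The sketch below projects two hypothetical 3-D points into three views, forms the image line through each pair of projections by a cross product, and evaluates the three left-hand sides of (14). Camera values, points and helper names are assumptions for illustration; the first camera is taken to be $[I, 0]$ as in (4).

```python
def cross(a, b):
    """Cross product; for two homogeneous image points it gives the line through them."""
    return (a[1] * b[2] - a[2] * b[1],
            a[2] * b[0] - a[0] * b[2],
            a[0] * b[1] - a[1] * b[0])

def project_h(R, u, P):
    """Homogeneous image point of P = (X, Y, Z, W) under the camera [R, u]."""
    return tuple(sum(R[j][i] * P[i] for i in range(3)) + u[j] * P[3] for j in range(3))

def trifocal_tensor(A, B, u1, u2):
    """T[j][k][i] = A[j][i] * u2[k] - B[k][i] * u1[j], following eq. (9)."""
    return [[[A[j][i] * u2[k] - B[k][i] * u1[j] for i in range(3)]
             for k in range(3)] for j in range(3)]

def line_residuals(T, l, l1, l2):
    """Left-hand sides of the three relations (14)."""
    # c[i] = T^{jk}_i l'_j l''_k
    c = [sum(T[j][k][i] * l1[j] * l2[k] for j in range(3) for k in range(3))
         for i in range(3)]
    return (l[1] * c[2] - l[2] * c[1],
            l[0] * c[2] - l[2] * c[0],
            l[0] * c[1] - l[1] * c[0])

# Hypothetical cameras and two hypothetical points spanning a 3-D line.
I3 = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]
zero = [0.0, 0.0, 0.0]
A = [[1.0, 0.1, 0.0], [-0.1, 1.0, 0.2], [0.0, -0.2, 1.0]]
B = [[0.9, 0.0, 0.3], [0.1, 1.0, 0.0], [-0.3, 0.1, 0.9]]
u1 = [0.5, -0.2, 0.1]
u2 = [-0.4, 0.3, 0.2]
P1 = (0.3, -0.7, 1.0, 0.25)
P2 = (-0.2, 0.4, 1.0, 0.6)

l = cross(project_h(I3, zero, P1), project_h(I3, zero, P2))
l1 = cross(project_h(A, u1, P1), project_h(A, u1, P2))
l2 = cross(project_h(B, u2, P1), project_h(B, u2, P2))
T = trifocal_tensor(A, B, u1, u2)
max_line_residual = max(abs(v) for v in line_residuals(T, l, l1, l2))
```

The three residuals are the components of a cross product between $l$ and the contracted vector $T^{jk}_i l'_j l''_k$, which is why only two of them are independent.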

These expressions provide a compact characterization of the trilinear line relations first introduced by Spetsakis and Aloimonos [12].

In summary, each point correspondence gives four equations, and each line correspondence two. Hence, $K$ points and $L$ lines are (generically) sufficient to express a linear algorithm for computing the trilinear tensor (up to scale) if $4K + 2L \geq 26$ (Shashua [21], Hartley [22]).
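The counting argument can be tabulated directly. The helper below (a hypothetical name) lists, for each number of lines, the smallest number of points for which the inequality 4K + 2L >= 26 holds. It only restates the inequality and says nothing about degenerate configurations.

```python
def min_points_for_lines(num_lines, unknowns=26):
    """Smallest K with 4*K + 2*num_lines >= unknowns (4 equations per point, 2 per line)."""
    missing = max(0, unknowns - 2 * num_lines)
    return -(-missing // 4)   # ceiling division

# Number of lines -> minimal number of points.
table = {num_lines: min_points_for_lines(num_lines) for num_lines in range(0, 14)}
```

For instance, `table[0]` is 7 (points only) and `table[13]` is 0 (lines only).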

3.2 Affine camera and three views

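Consider next a point $Q = (x, y, \lambda, 1)^T$ which is projected to the image points $q$, $q'$ and $q''$ by three affine camera matrices $M$, $M'$ and $M''$, respectively:

$$q = \begin{pmatrix} x \\ y \\ 1 \end{pmatrix} = M Q = \begin{pmatrix} 1 & 0 & 0 & 0 \\ 0 & 1 & 0 & 0 \\ 0 & 0 & 0 & 1 \end{pmatrix} \begin{pmatrix} x \\ y \\ \lambda \\ 1 \end{pmatrix} \tag{15}$$

$$q' = \begin{pmatrix} x' \\ y' \\ 1 \end{pmatrix} = M' Q = \begin{pmatrix} c^1_1 & c^1_2 & c^1_3 & v'^{1} \\ c^2_1 & c^2_2 & c^2_3 & v'^{2} \\ 0 & 0 & 0 & 1 \end{pmatrix} \begin{pmatrix} x \\ y \\ \lambda \\ 1 \end{pmatrix} \tag{16}$$

$$q'' = \begin{pmatrix} x'' \\ y'' \\ 1 \end{pmatrix} = M'' Q = \begin{pmatrix} d^1_1 & d^1_2 & d^1_3 & v''^{1} \\ d^2_1 & d^2_2 & d^2_3 & v''^{2} \\ 0 & 0 & 0 & 1 \end{pmatrix} \begin{pmatrix} x \\ y \\ \lambda \\ 1 \end{pmatrix} \tag{17}$$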
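Written out componentwise, (16) and (17) show that the affine image coordinates are linear in the coordinates of $Q$, with no division by a depth-dependent factor as in the perspective case. The expansion below only multiplies out the matrices above, with $\lambda$ denoting the third coordinate of $Q$.

```latex
\begin{align*}
x'  &= c^1_1 x + c^1_2 y + c^1_3 \lambda + v'^{1},  &
y'  &= c^2_1 x + c^2_2 y + c^2_3 \lambda + v'^{2},  \\
x'' &= d^1_1 x + d^1_2 y + d^1_3 \lambda + v''^{1}, &
y'' &= d^2_1 x + d^2_2 y + d^2_3 \lambda + v''^{2}.
\end{align*}
```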