Qualitative multi-scale feature hierarchies for object tracking


Lars Bretzner and Tony Lindeberg

Computational Vision and Active Perception Laboratory (CVAP)

Department of Numerical Analysis and Computing Science

KTH (Royal Institute of Technology)

S-100 44 Stockholm, Sweden.

Email: {bretzner, tony}@nada.kth.se

Technical report ISRN KTH/NA/P–9909–SE

Abstract

This paper shows how the performance of feature trackers can be improved by building a hierarchical view-based object representation consisting of qualitative relations between image structures at different scales. The idea is to track all image features individually, and to use the qualitative feature relations for avoiding mismatches, resolving ambiguous matches and for introducing feature hypotheses whenever image features are lost. Compared to more traditional work on view-based object tracking, this methodology has the ability to handle semi-rigid objects and partial occlusions. Compared to trackers based on three-dimensional object models, this approach is much simpler and of a more generic nature. A hands-on example is presented showing how an integrated application system can be constructed from conceptually very simple operations.

The support from the Swedish Research Council for Engineering Sciences, TFR, and the Swedish National Board for Industrial and Technical Development, NUTEK, is gratefully acknowledged. Accepted for publication in Journal of Visual Communication and Image Representation. An earlier version of this manuscript was presented in M. Nielsen, P. Johansen, O. Olsen and J. Weickert (eds), Proc. Second International Conference on Scale-Space Theories in Computer Vision, (Corfu, Greece), September 1999. Springer-Verlag Lecture Notes in Computer Science, vol 1682, pp. 117–128.


To maintain a stable representation of a dynamic world, it is necessary to relate image data from different time moments. When analysing image sequences frame by frame, as is commonly done in computer vision applications, it is therefore useful to include an explicit tracking mechanism in the vision system.

When constructing such a tracking mechanism, there is a large freedom in design, concerning how much a priori information should be included in and used by the tracker. If the goal is to track a single object of known shape, then it may be natural to build a three-dimensional object model, and to relate computed views of this internal model to the image data that occur. An alternative approach is to store a large number of actual views in a database, and subsequently match these to the image sequence.

Depending on what type of object representation we choose, we can expect different trade-offs between the complexity of constructing the object representation and the complexity in matching the object representation to image data. [1] In particular, different design strategies will imply different amounts of additional work when the database is extended with new objects.

The subject of this article is to advocate the use of qualitative multi-scale object models in this context, as opposed to more detailed models. The idea is to represent only dominant image features of the object, and relations between those that are reasonably stable under view variations. In this way, a new object model can be constructed with only minor additional work, and it will be demonstrated that such a weaker approach to object representation is powerful enough to give a significant improvement in the robustness of feature trackers.

A main rationale for the proposed approach is that if we track individual features over long time periods in scenes with changing conditions (e.g., object pose and illumination), the likelihood that features will be mismatched or lost will increase with time. Major aims of the proposed hierarchical representation are to handle such problems, and also to assist in the initialization stage of the feature tracker. When a feature is lost, the relations of the qualitative feature hierarchy model will be used for defining search regions in which the lost feature can be detected. When mismatches occur, relational constraints in the feature hierarchy will be helpful for detecting and rejecting outliers.

The usefulness of such a hierarchical object representation for feature tracking will be demonstrated by experiments on real-world image sequences. Specifically, it will be shown how an integrated non-trivial application to human-computer interaction can be constructed in a straightforward and conceptually very simple way, by combination with a set of elementary scale-space operations.

The presentation is organized as follows: Section 2 presents the general motivations behind the proposed approach, with an overview of related works. In section 3, we first briefly review the multi-scale framework we use for detecting image features, and describe how hierarchical and qualitative feature relations can be defined between these multi-scale image features. Section 4 outlines how such a view-based object [...] experimental results for two sample applications to hand gesture analysis and face tracking, respectively. Finally, section 5 concludes with a summary and discussion concerning other possible applications and generalizations of the proposed ideas.

[1] With the term "complexity", we here refer to both the computational complexity in matching algorithms and the degree of structural complexity that is required when designing the software.

2 Choice of Image Representation for Feature Tracking

The framework we consider is one in which image features are detected at multiple scales. Each feature is associated with a region in space as well as a range of scales, and relations between features at different scales impose hierarchical links across scales. Specifically, we assume that the image features are detected with a mechanism for automatic scale selection (Lindeberg 1998b). In earlier work (Bretzner & Lindeberg 1998a), we have demonstrated how such a scale selection mechanism is essential to obtain a robust behaviour of the feature tracker if the image features undergo large size variations in the image domain.

The rationale for using a hierarchical multi-scale image representation for feature tracking originates from the well-known fact that real-world objects consist of different types of structures at different scales. An internal object representation should reflect this fact. One aspect of this, which we shall make particular use of, is that certain hierarchical relations over scales tend to remain reasonably stable when the viewing conditions are varied. Thus, even if some features are lost during tracking (e.g. due to occlusions, illumination variations, or spurious errors by the feature detector or the feature matching algorithm), it is rather likely that a sufficient number of image features will remain to support the tracking of the other features. Thereby, the feature tracker will have higher robustness [2] with respect to occlusions, viewing variations and spurious errors in the lower-level modules. As we shall see, the qualitative nature of these feature relations will also make it possible to handle semi-rigid objects within the same framework.

In this way, the approach we will propose is closely related to the notion of object representation. Compared to the more traditional problem of object recognition, however, the requirements are different, since the primary goal is to maintain a stable image representation over time, and we do not need to support indexing and recognition functionalities into large databases. For these reasons, a qualitative image representation can be sufficient in many cases, and offer a higher flexibility by being more generic than detailed object models.

Related works. The topic of this paper touches on both the subjects of feature tracking and object representation. The literature on tracking is large and impossible to review here. Hence, we focus on the most closely related works.

Image representations involving linking across scales have been presented by several authors. (Crowley & Parker 1984, Crowley & Sanderson 1987) detected peaks and ridges in a pyramid representation. In retrospect, a main reason why stability problems were encountered is that the pyramids involved a rather coarse sampling in [...] paths in scale-space, and this idea was made operational for medical image segmentation by (Lifshitz & Pizer 1990) and (Vincken et al. 1997). (Lindeberg 1993) constructed a scale-space primal sketch, in which a morphological support region was associated with each extremum point and paths of critical points over scales were computed delimited by bifurcations. (Olsen 1997) applied a similar approach to watershed minima in the gradient magnitude. (Griffin et al. 1992) developed a closely related approach based on maximum gradient paths, however, at a single scale. In the scale-space primal sketch, scale selection was performed by maximizing measures of blob strength over scales, and significance was measured by the volumes that image structures occupy in scale-space, involving the stability over scales as a major component. A generalization of this scale selection idea to more general classes of image structures was presented in (Lindeberg 1994, Lindeberg 1998b, Lindeberg 1998a), by detecting scale-space maxima, i.e. points in scale-space at which normalized differential measures of feature strength assume local maxima with respect to scale. (Pizer et al. 1994) and his co-workers (Gauch & Pizer 1993) have proposed closely related descriptors, focusing on multi-scale ridge representations for medical image analysis. Psychophysical results by (Burbeck & Pizer 1995) support the belief that such hierarchical multi-scale representations are relevant for object representation.

[2] According to the terminology proposed by (Toyama & Hager 1999), the automatic scale selection mechanism is essential for the pre-failure robustness of the feature tracker, while the proposed qualitative multi-scale feature hierarchy improves the post-failure robustness.

With respect to the problem of object recognition, (Shokoufandeh et al. 1998) detect extrema in a wavelet transform in a way closely related to the detection of scale-space maxima, and define a graph structure from these image features. This graph structure is then matched to corresponding descriptors for other objects, based on topological and geometric similarity. Earlier graph-like object representations include the classical model-based approach by (Lowe 1985), used in conjunction with perceptual grouping, as well as the distributed aspect hierarchy proposed by (Dickinson et al. 1992). In relation to the large number of works on model based tracking, there are similar aims between our approach and the following works: (Koller et al. 1993) used car models to support the tracking of vehicles in long sequences with occlusions and illumination variations. (Smith & Brady 1995) defined clusters of coherently moving corner features so as to support the tracking of cars in a qualitative manner. (Black & Jepson 1998b) constructed a view-based object representation using an eigenimage approach to compactly represent and support the tracking of an object seen from a large number of different views. The recently developed condensation algorithm (Isard & Blake 1998, Black & Jepson 1998a) is of particular interest, by explicitly constructing statistical distributions to capture relations between image features. Concerning the specific application to qualitative hand tracking that will be addressed in this paper, more detailed hand models have been presented by (Kuch & Huang 1995, Heap & Hogg 1996, Yasumuro et al. 1999). Related graph-like representations for hand tracking and face tracking have been presented by (Triesch & von der Malsburg 1996, Mauerer & von der Malsburg 1996).

3 Image Features and Qualitative Feature Relations

We are interested in representing objects which can give rise to a rich variety of image features of different types and at different scales. Generically, these image features [...] (iii) two-dimensional (blobs), and we assume that each image feature is associated with a region in space as well as a range of scales.

3.1 Computation of Image Features

When computing a hierarchical view-based object representation, one may at first desire to compute a detailed representation of the multi-scale image structure, as done by the scale-space primal sketch or some of the closely related representations reviewed in section 2. Since we are interested in processing temporal image data, however, and the construction of such a representation from image data requires a rather large amount of computations, we shall here follow a computationally more efficient approach.

We focus on image features expressed in terms of scale-space maxima, i.e. points in scale-space at which differential geometric entities assume local maxima with respect to space and scale (Lindeberg 1998b). Formally, such points are defined by

\[ (\nabla (\mathcal{D}_{norm} L(x; s)) = 0) \;\wedge\; (\partial_s (\mathcal{D}_{norm} L(x; s)) = 0) \tag{1} \]

where $L(\cdot; s)$ denotes the scale-space representation of the image $f$ constructed by convolution with a Gaussian kernel $g(\cdot; s)$ with scale parameter (variance) $s$, and $\mathcal{D}_{norm}$ is a differential invariant normalized by the replacement of all spatial derivatives $\partial_{x_i}$ by $\gamma$-normalized derivatives $\partial_{\xi_i} = s^{\gamma/2} \, \partial_{x_i}$.

Two examples of such differential descriptors, which we shall make particular use of here, include the normalized Laplacian (with $\gamma = 1$) for blob detection

\[ \nabla^2_{norm} L = s \, (L_{xx} + L_{yy}) \tag{2} \]

and the square difference between the eigenvalues $L_{pp}$ and $L_{qq}$ of the Hessian matrix (with $\gamma = 3/4$) for ridge detection

\[ \mathcal{A} L_{norm} = s^{2\gamma} \, |L_{pp} - L_{qq}|^2 = s^{2\gamma} \, ((L_{xx} - L_{yy})^2 + 4 L_{xy}^2) \tag{3} \]

see (Lindeberg 1998a) for a more general description. A computationally very attractive property of this construction is that the scale-space maxima can be computed by architecturally very simple and computationally highly efficient operations involving: (i) scale-space smoothing, (ii) pointwise computation of differential invariants, and (iii) detection of local maxima of scalar entities in scale-space.
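As a minimal sketch of why maximizing the normalized Laplacian (equation (2)) over scales yields a scale that reflects the size of an image structure, consider an idealized blob: a unit-mass Gaussian of variance s0. This setting is an assumption made for illustration and is not an example from the paper. Since Gaussian variances add under convolution, the response at the blob centre has a closed form, so the maximum over scales can be located with a plain scan. The function names and the scale sampling are hypothetical.

```python
import math


def normalized_laplacian_at_centre(s, s0):
    """|s * (L_xx + L_yy)| at the centre of a unit-mass Gaussian blob.

    Assumption: the image is a 2-D Gaussian of variance s0, so smoothing
    with a Gaussian of variance s gives a Gaussian of variance t = s0 + s,
    whose Laplacian at the centre is -1 / (pi * t**2).
    """
    t = s0 + s
    return s / (math.pi * t * t)


def select_scale(s0, scales):
    """Return the sampled scale with the strongest normalized response."""
    return max(scales, key=lambda s: normalized_laplacian_at_centre(s, s0))


# Hypothetical sampling: four scale levels per octave from 0.25 to 256.
scales = [2.0 ** (k / 4.0) for k in range(-8, 33)]

selected_small = select_scale(16.0, scales)   # blob of variance 16
selected_large = select_scale(64.0, scales)   # same blob enlarged by c = 2
```

For this idealized blob the selected scale coincides with s0, and enlarging the blob by a factor c = 2 in the image domain (variance 16 to 64) moves the selected scale by c^2 = 4.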

Furthermore, to simplify the geometric analysis of image features, we shall reduce the spatial representation of image descriptors to ellipses, by evaluating a second moment matrix

\[ \mu = \int_{\eta \in \mathbb{R}^2} \begin{pmatrix} L_x^2 & L_x L_y \\ L_x L_y & L_y^2 \end{pmatrix} g(\eta; s_{int}) \, d\eta \tag{4} \]

at integration scale $s_{int}$ proportional to the detection scale of the scale-space maximum (equation (1)). Thereby, each image feature will be represented by a point $(x; s)$ in scale-space and a covariance matrix $\Sigma$ describing the shape, graphically illustrated by an ellipse. For one-dimensional features, the corresponding ellipses will be highly [...] descriptors of the second moment matrices will be rather circular. Attributes derived from the covariance matrix include its anisotropy derived from the ratio $\lambda_{max} / \lambda_{min}$ between its eigenvalues, and its orientation defined as the orientation of its main eigenvector.
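The two ellipse attributes named above can be obtained in closed form for a symmetric 2x2 matrix. The following is a minimal sketch, assuming the covariance matrix is given by its three distinct entries; the function name and argument names are hypothetical.

```python
import math


def ellipse_attributes(sxx, sxy, syy):
    """Anisotropy and orientation of the symmetric matrix [[sxx, sxy], [sxy, syy]].

    Returns (lambda_max / lambda_min, angle of the main eigenvector in radians).
    The eigenvector is only defined up to sign, so the angle is an axis
    direction in (-pi/2, pi/2].
    """
    mean = 0.5 * (sxx + syy)
    root = math.hypot(0.5 * (sxx - syy), sxy)
    lam_max = mean + root
    lam_min = mean - root
    anisotropy = lam_max / lam_min if lam_min > 0.0 else math.inf
    orientation = 0.5 * math.atan2(2.0 * sxy, sxx - syy)
    return anisotropy, orientation


elongated = ellipse_attributes(9.0, 0.0, 1.0)   # ridge-like, along the x axis
diagonal = ellipse_attributes(5.0, 4.0, 5.0)    # ridge-like, along the diagonal
circular = ellipse_attributes(2.0, 0.0, 2.0)    # blob-like
```

An anisotropy close to 1 indicates a blob-like descriptor, while a large ratio indicates an elongated, ridge-like one.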

Figure 4 shows an example of such image descriptors computed from a grey-level image, after ranking on a significance measure defined as the magnitude of the response of the differential operator at the scale-space maximum. A trivial but nevertheless very useful effect of this ranking is that it substantially reduces the number of image features for further processing, thus improving the computational efficiency. In a more detailed representation of the multi-scale deep structure of a real-world image, it will often be the case that a large number of the image features and their hierarchical relations correspond to image structures that will be regarded as insignificant by later processing stages.
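The ranking step can be sketched as a sort on response magnitude followed by truncation. The `Detection` record and its field names are hypothetical and only illustrate the idea; the default of 20 retained features mirrors the number shown in Figure 4.

```python
from typing import List, NamedTuple


class Detection(NamedTuple):
    """Hypothetical record for one scale-space maximum."""
    kind: str        # "blob" or "ridge"
    x: float
    y: float
    s: float         # detection scale (variance)
    response: float  # value of the normalized differential operator


def strongest(detections: List[Detection], n: int = 20) -> List[Detection]:
    """Keep the n detections with the largest response magnitude."""
    return sorted(detections, key=lambda d: abs(d.response), reverse=True)[:n]


candidates = [
    Detection("blob", 10.0, 12.0, 64.0, -0.80),
    Detection("ridge", 14.0, 30.0, 8.0, 0.35),
    Detection("blob", 40.0, 41.0, 2.0, 0.02),
]
kept = strongest(candidates, n=2)
```

Sorting on the absolute value keeps strong responses of either sign; restricting to one sign (for example bright structures only) would be a separate filter.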

3.2 Qualitative Feature Relations

Between the above mentioned features, various types of relations can be defined in the image plane. Here, we consider the following types of qualitative relations:

Spatial coincidence (inclusion): We say that a region $A$ at position $x_A$ and scale $s_A$ is in spatial coincidence relation to a region $B$ at position $x_B$ and at a (coarser) scale $s_B > s_A$ if

\[ (x_A - x_B)^T \, \Sigma_B^{-1} \, (x_A - x_B) \in [D_1, D_2] \tag{5} \]

where $D_1$ and $D_2$ are distance thresholds and $\Sigma_B$ is a covariance matrix associated with region $B$. By using a Mahalanobis distance measure, we introduce a directional preference which is highly useful for expressing spatial relations between elongated image features. While the special case $D_1 = 0$ corresponds to an inclusion relation, there are also cases where one may want to explicitly represent distant features, using $D_1 > 0$.

Stability of scale relations: For two image features at times $t_k$ and $t_{k'}$, we assume that the ratio between their scale values should be approximately the same. This is motivated by the physical requirement of scale invariance under zooming

\[ \frac{s_A(t_k)}{s_B(t_k)} \approx \frac{s_A(t_{k'})}{s_B(t_{k'})}. \tag{6} \]

To accept small variations due to changes in view direction and spurious variations from the scale selection mechanism of the feature tracker, we measure relative distances in the scale direction and implement the "$\approx$" operation by $q \approx q' \Leftrightarrow |\log (q / q')| < \log T$, where $T > 1$ is a threshold in the scale direction.

Directional relation (bearing): For a feature $A$ related to a one-dimensional feature $B$, the angle $\alpha$ is measured between the main eigenvector of $\Sigma_B$ and the vector $x_A - x_B$ from the center $x_B$ of $B$ to the center $x_A$ of $A$ (see Figure 1).

Figure 1: The direction relation (bearing) between two features $A$ and $B$ is the angle $\alpha$ between the main eigenvector of $\Sigma_B$ (illustrated by the ellipse) and the vector $x_A - x_B$.
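The spatial coincidence test of equation (5) and the bearing angle can be sketched in a few lines for 2-D positions. The function names and default thresholds are hypothetical. Because an eigenvector has no preferred sign, the sketch folds the bearing into the interval [-pi/2, pi/2); this folding is an assumption of the sketch, not something the text specifies.

```python
import math


def mahalanobis_sq(xa, xb, cov_b):
    """Quadratic form (xA - xB)^T inv(Sigma_B) (xA - xB) for a symmetric 2x2 Sigma_B."""
    dx, dy = xa[0] - xb[0], xa[1] - xb[1]
    (a, b), (_, d) = cov_b
    det = a * d - b * b
    return (d * dx * dx - 2.0 * b * dx * dy + a * dy * dy) / det


def spatial_coincidence(xa, sa, xb, sb, cov_b, d1=0.0, d2=1.0):
    """True if A (finer scale) lies in the distance band [d1, d2] around B."""
    return sb > sa and d1 <= mahalanobis_sq(xa, xb, cov_b) <= d2


def bearing(xa, xb, cov_b):
    """Angle between the main axis of Sigma_B and xA - xB, folded into [-pi/2, pi/2)."""
    (a, b), (_, d) = cov_b
    axis = 0.5 * math.atan2(2.0 * b, a - d)
    direction = math.atan2(xa[1] - xb[1], xa[0] - xb[0])
    return (direction - axis + math.pi / 2.0) % math.pi - math.pi / 2.0


# An elongated feature B along the x axis (variance 9 along, 1 across).
ridge_cov = ((9.0, 0.0), (0.0, 1.0))
along = spatial_coincidence((2.0, 0.0), 1.0, (0.0, 0.0), 4.0, ridge_cov)
across = spatial_coincidence((0.0, 2.0), 1.0, (0.0, 0.0), 4.0, ridge_cov)
alpha = bearing((1.0, 1.0), (0.0, 0.0), ridge_cov)
```

The two test points lie at the same Euclidean distance from B, yet only the one along the ridge is accepted, which illustrates the directional preference introduced by the Mahalanobis measure.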

Trivially, these relations are invariant to translations and rotations in the image plane. The scale invariance of these relations follows from corresponding scale invariance properties of image descriptors computed from scale-space maxima: if the size of an image structure is scaled by a factor $c$ in the image domain, then the corresponding scale levels are transformed by a factor $c^2$.

3.3 Qualitative Multi-Scale Feature Hierarchy

Let us now consider a specific example with images of a hand. From our knowledge that a hand consists of five fingers, we construct a model consisting of: (i) the palm, (ii) the five fingers, (iii) a finger tip for each finger (see figure 2).

Each finger is in a spatial coincidence relation to the palm, as well as a directional relation. Moreover, each fingertip is in a spatial relationship to its finger, and satisfies a directional relation to this feature. In a similar manner, each finger is in a scale stability relation with respect to the palm, and each fingertip is in a corresponding scale stability relation relative to its finger.

Such a representation will be referred to as a qualitative multi-scale feature hierarchy. Figure 3 shows the relations this representation is built from, using UML notation (Fowler & Scott 1997). An attractive property of this view-based object representation is that it only focuses on qualitative object features. There is no assumption of rigidity, only that the qualitative shape is preserved.
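One possible way to hold such a hierarchy in code is a small tree of features linked by relations that carry constraints. The class names below follow the labels of Figure 3 (Objfeature, Relation, Constraint), but the fields, the threshold values and the helper functions are hypothetical, and the top-level "top-hand" relation of Figure 3 is omitted for brevity.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class Constraint:
    """Thresholds for one relation (hypothetical fields)."""
    d1: float               # lower distance threshold, equation (5)
    d2: float               # upper distance threshold, equation (5)
    scale_threshold: float  # T in the scale stability test


@dataclass
class Objfeature:
    name: str
    kind: str  # "blob" or "ridge"
    relations: List["Relation"] = field(default_factory=list)


@dataclass
class Relation:
    child: Objfeature
    constraint: Constraint


def build_hand_model(n_fingers: int = 5) -> Objfeature:
    """Palm (blob) -> fingers (ridges) -> finger tips (blobs)."""
    finger_constraint = Constraint(0.0, 2.0, 2.0)  # illustrative values only
    tip_constraint = Constraint(0.0, 2.0, 2.0)     # illustrative values only
    hand = Objfeature("hand", "blob")
    for i in range(1, n_fingers + 1):
        tip = Objfeature(f"tip[{i}]", "blob")
        finger = Objfeature(f"finger[{i}]", "ridge", [Relation(tip, tip_constraint)])
        hand.relations.append(Relation(finger, finger_constraint))
    return hand


def count_features(node: Objfeature) -> int:
    """Number of features in the subtree rooted at node."""
    return 1 + sum(count_features(r.child) for r in node.relations)


hand_model = build_hand_model()
```

Sharing one constraint object among all fingers and one among all tips matches the single fingerconstraint and tipconstraint instances drawn in Figure 3, although nothing prevents per-finger constraints in such a structure.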

The idea behind this construction is of course that the palm and the fingertips should give rise to blob responses (equation (2)) and that the fingers give rise to ridge responses (equation (3)). Figure 4 shows an example of how this model can be initialized and matched to image data with associated image descriptors.

To exclude responses from the background, we have here required that all image features should correspond to bright blobs or bright ridges. Alternatively, one could define spatial inclusion relations with respect to other segmentation cues relative to the background, e.g. chromaticity or depth.

Here, we have constructed the graph with feature relations manually, using qualitative knowledge about the shape of the object and its primitives. In a more general setting, however, one can also consider the learning of stable feature relations in an actual setting, based on a richer set of image features as well as a richer vocabulary of

Figure 2: A qualitative multi-scale feature hierarchy constructed for a hand model. (Axes: x, y, s.)

[Diagram labels: top-hand:Relation, handconstraint:Constraint; hand:Objfeature; hand-finger:Relation, fingerconstraint:Constraint; finger[1]:Objfeature, finger[2]:Objfeature, ...; finger-tip[1]:Relation, finger-tip[2]:Relation, ...; tipconstraint:Constraint; tip[1]:Objfeature, tip[2]:Objfeature, ...]

Figure 3: Instance diagram for the feature hierarchy of a hand (figure 2).

Figure 4: Illustration of the initialization stage of the object tracker. Panels, left to right: 20 strongest blobs and ridges; initialized hand model; all hand features captured. Once the coarse-scale feature is found (here the palm of the hand), the qualitative feature hierarchy guides the top-down search for the remaining features of the representation. (The left image shows the 20 most significant blob responses (in red) and ridge responses (in blue).)