for Object Tracking
Lars Bretzner and Tony Lindeberg
Computational Vision and Active Perception Laboratory (CVAP)
Department of Numerical Analysis and Computing Science
KTH (Royal Institute of Technology)
S-100 44 Stockholm, Sweden.
Email: {bretzner, tony}@nada.kth.se
Technical report ISRN KTH/NA/P/–9909–SE
Abstract
This paper shows how the performance of feature trackers can be improved by
building a hierarchical view-based object representation consisting of
qualitative relations between image structures at different scales. The idea is
to track all image features individually, and to use the qualitative feature
relations for avoiding mismatches, resolving ambiguous matches and for
introducing feature hypotheses whenever image features are lost. Compared to
more traditional work on view-based object tracking, this methodology has the
ability to handle semi-rigid objects and partial occlusions. Compared to
trackers based on three-dimensional object models, this approach is much simpler
and of a more generic nature. A hands-on example is presented showing how an
integrated application system can be constructed from conceptually very simple
operations.
The support from the Swedish Research Council for Engineering Sciences, TFR, and
the Swedish National Board for Industrial and Technical Development, NUTEK, is
gratefully acknowledged. Accepted for publication in Journal of Visual
Communication and Image Representation. An earlier version of this manuscript
was presented in M. Nielsen, P. Johansen, O. Olsen and J. Weickert (eds), Proc.
Second International Conference on Scale-Space Theories in Computer Vision,
(Corfu, Greece), September 1999. Springer-Verlag Lecture Notes in Computer
Science, vol 1682, pp. 117-128.
To maintain a stable representation of a dynamic world, it is necessary to
relate image data from different time moments. When analysing image sequences
frame by frame, as is commonly done in computer vision applications, it is
therefore useful to include an explicit tracking mechanism in the vision system.
When constructing such a tracking mechanism, there is a large freedom in design,
concerning how much a priori information should be included into and be used by
the tracker. If the goal is to track a single object of known shape, then it may
be natural to build a three-dimensional object model, and to relate computed
views of this internal model to the image data that occur. An alternative
approach is to store a large number of actual views in a database, and
subsequently match these to the image sequence.
Depending on what type of object representation we choose, we can expect
different trade-offs between the complexity of constructing the object
representation and the complexity in matching the object representation to image
data.[1] In particular, different design strategies will imply different amounts
of additional work when the database is extended with new objects.
The subject of this article is to advocate the use of qualitative multi-scale
object models in this context, as opposed to more detailed models. The idea is
to represent only dominant image features of the object, and relations between
those that are reasonably stable under view variations. In this way, a new
object model can be constructed with only minor additional work, and it will be
demonstrated that such a weaker approach to object representation is powerful
enough to give a significant improvement in the robustness of feature trackers.
A main rationale for the proposed approach is that if we track individual
features over long time periods in scenes with changing conditions (e.g., object
pose and illumination), the likelihood that features will be mismatched or lost
will increase with time. Major aims of the proposed hierarchical representation
are to handle such problems, and also to assist in the initialization stage of
the feature tracker. When a feature is lost, the relations of the qualitative
feature hierarchy model will be used for defining search regions in which the
lost feature can be detected. When mismatches occur, relational constraints in
the feature hierarchy will be helpful for detecting and rejecting outliers.
The usefulness of such a hierarchical object representation for feature tracking
will be demonstrated by experiments on real-world image sequences. Specifically,
it will be shown how an integrated non-trivial application to human-computer
interaction can be constructed in a straightforward and conceptually very simple
way, by combination with a set of elementary scale-space operations.
The presentation is organized as follows: Section 2 presents the general
motivations behind the proposed approach, with an overview of related works. In
section 3, we first briefly review the multi-scale framework we use for
detecting image features, and describe how hierarchical and qualitative feature
relations can be defined between these multi-scale image features. Section 4
outlines how such a view-based object
[1] With the term "complexity", we here refer to both the computational
complexity in matching algorithms and the degree of structural complexity that
is required when designing the software.
[...] experimental results for two sample applications to hand gesture analysis
and face tracking, respectively. Finally, section 5 concludes with a summary and
discussion concerning other possible applications and generalizations of the
proposed ideas.
2 Choice of Image Representation for Feature Tracking
The framework we consider is one in which image features are detected at
multiple scales. Each feature is associated with a region in space as well as a
range of scales, and relations between features at different scales impose
hierarchical links across scales. Specifically, we assume that the image
features are detected with a mechanism for automatic scale selection (Lindeberg
1998b). In earlier work (Bretzner & Lindeberg 1998a), we have demonstrated how
such a scale selection mechanism is essential to obtain a robust behaviour of
the feature tracker if the image features undergo large size variations in the
image domain.
The rationale for using a hierarchical multi-scale image representation for
feature tracking originates from the well-known fact that real-world objects
consist of different types of structures at different scales. An internal object
representation should reflect this fact. One aspect of this, which we shall make
particular use of, is that certain hierarchical relations over scales tend to
remain reasonably stable when the viewing conditions are varied. Thus, even if
some features are lost during tracking (e.g. due to occlusions, illumination
variations, or spurious errors by the feature detector or the feature matching
algorithm), it is rather likely that a sufficient number of image features will
remain to support the tracking of the other features. Thereby, the feature
tracker will have higher robustness[2] with respect to occlusions, viewing
variations and spurious errors in the lower-level modules. As we shall see, the
qualitative nature of these feature relations will also make it possible to
handle semi-rigid objects within the same framework.
In this way, the approach we will propose is closely related to the notion of
object representation. Compared to the more traditional problem of object
recognition, however, the requirements are different, since the primary goal is
to maintain a stable image representation over time, and we do not need to
support indexing and recognition functionalities into large databases. For these
reasons, a qualitative image representation can be sufficient in many cases, and
offer a higher flexibility by being more generic than detailed object models.
Related works. The topic of this paper touches on both the subjects of feature
tracking and object representation. The literature on tracking is large and
impossible to review here. Hence, we focus on the most closely related works.
Image representations involving linking across scales have been presented by
several authors. (Crowley & Parker 1984, Crowley & Sanderson 1987) detected
peaks and ridges in a pyramid representation. In retrospect, a main reason why
stability problems were encountered is that the pyramids involved a rather
coarse sampling in
[2] According to the terminology proposed by (Toyama & Hager 1999), the
automatic scale selection mechanism is essential for the pre-failure robustness
of the feature tracker, while the proposed qualitative multi-scale feature
hierarchy improves the post-failure robustness.
[...] paths in scale-space, and this idea was made operational for medical image
segmentation by (Lifshitz & Pizer 1990) and (Vincken et al. 1997). (Lindeberg
1993) constructed a scale-space primal sketch, in which a morphological support
region was associated with each extremum point and paths of critical points over
scales were computed delimited by bifurcations. (Olsen 1997) applied a similar
approach to watershed minima in the gradient magnitude. (Griffin et al. 1992)
developed a closely related approach based on maximum gradient paths, however,
at a single scale. In the scale-space primal sketch, scale selection was
performed, by maximizing measures of blob strength over scales, and significance
was measured by the volumes that image structures occupy in scale-space,
involving the stability over scales as a major component. A generalization of
this scale selection idea to more general classes of image structures was
presented in (Lindeberg 1994, Lindeberg 1998b, Lindeberg 1998a), by detecting
scale-space maxima, i.e. points in scale-space at which normalized differential
measures of feature strength assume local maxima with respect to scale. (Pizer
et al. 1994) and his co-workers (Gauch & Pizer 1993) have proposed closely
related descriptors, focusing on multi-scale ridge representations for medical
image analysis. Psychophysical results by (Burbeck & Pizer 1995) support the
belief that such hierarchical multi-scale representations are relevant for
object representation.
With respect to the problem of object recognition, (Shokoufandeh et al. 1998)
detect extrema in a wavelet transform in a way closely related to the detection
of scale-space maxima, and define a graph structure from these image features.
This graph structure is then matched to corresponding descriptors for other
objects, based on topological and geometric similarity. Earlier graph-like
object representations include the classical model-based approach by (Lowe
1985), used in conjunction with perceptual grouping, as well as the distributed
aspect hierarchy proposed by (Dickinson et al. 1992). In relation to the large
number of works on model based tracking, there are similar aims between our
approach and the following works: (Koller et al. 1993) used car models to
support the tracking of vehicles in long sequences with occlusions and
illumination variations. (Smith & Brady 1995) defined clusters of coherently
moving corner features so as to support the tracking of cars in a qualitative
manner. (Black & Jepson 1998b) constructed a view-based object representation
using an eigenimage approach to compactly represent and support the tracking of
an object seen from a large number of different views. The recently developed
condensation algorithm (Isard & Blake 1998, Black & Jepson 1998a) is of
particular interest, by explicitly constructing statistical distributions to
capture relations between image features. Concerning the specific application to
qualitative hand tracking that will be addressed in this paper, more detailed
hand models have been presented by (Kuch & Huang 1995, Heap & Hogg 1996,
Yasumuro et al. 1999). Related graph-like representations for hand tracking and
face tracking have been presented by (Triesch & von der Malsburg 1996, Mauerer &
von der Malsburg 1996).
3 Image Features and Qualitative Feature Relations
We are interested in representing objects which can give rise to a rich variety
of image features of different types and at different scales. Generically, these
image features [...] (iii) two-dimensional (blobs), and we assume that each
image feature is associated with a region in space as well as a range of scales.
3.1 Computation of Image Features
When computing a hierarchical view-based object representation, one may at first
desire to compute a detailed representation of the multi-scale image structure,
as done by the scale-space primal sketch or some of the closely related
representations reviewed in section 2. Since we are interested in processing
temporal image data, however, and the construction of such a representation from
image data requires a rather large amount of computations, we shall here follow
a computationally more efficient approach.
We focus on image features expressed in terms of scale-space maxima, i.e. points
in scale-space at which differential geometric entities assume local maxima with
respect to space and scale (Lindeberg 1998b). Formally, such points are defined
by

$$(\nabla (\mathcal{D}_{norm} L(x; s)) = 0) \wedge (\partial_s (\mathcal{D}_{norm} L(x; s)) = 0) \tag{1}$$

where $L(\cdot; s)$ denotes the scale-space representation of the image $f$
constructed by convolution with a Gaussian kernel $g(\cdot; s)$ with scale
parameter (variance) $s$ and $\mathcal{D}_{norm}$ is a differential invariant
normalized by the replacement of all spatial derivatives $\partial_{x_i}$ by
$\gamma$-normalized derivatives $\partial_{\xi_i} = s^{\gamma/2} \partial_{x_i}$.
Two examples of such differential descriptors, which we shall make particular
use of here, include the normalized Laplacian (with $\gamma = 1$) for blob
detection

$$\nabla^2_{norm} L = s (L_{xx} + L_{yy}) \tag{2}$$

and the squared difference between the eigenvalues $L_{pp}$ and $L_{qq}$ of the
Hessian matrix (with $\gamma = 3/4$) for ridge detection

$$\mathcal{A} L_{norm} = s^{2\gamma} |L_{pp} - L_{qq}|^2 = s^{2\gamma} ((L_{xx} - L_{yy})^2 + 4 L_{xy}^2) \tag{3}$$

see (Lindeberg 1998a) for a more general description. A computationally very
attractive property of this construction is that the scale-space maxima can be
computed by architecturally very simple and computationally highly efficient
operations involving: (i) scale-space smoothing, (ii) pointwise computation of
differential invariants, and (iii) detection of local maxima of scalar entities
in scale-space.
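The three operations (i)-(iii) can be illustrated by a minimal sketch, here for bright blobs and the normalized Laplacian of equation (2) only. Everything in it is an assumption made for illustration rather than the implementation used in this work: the function names, the FFT-based smoothing with periodic borders, the central-difference derivatives, the discrete set of scale levels, and the hypothetical `min_strength` cut-off that suppresses numerical noise.

```python
import numpy as np

def gaussian_smooth(image, s):
    """(i) Scale-space smoothing with a Gaussian of variance s (FFT, periodic borders)."""
    wy = 2.0 * np.pi * np.fft.fftfreq(image.shape[0])[:, None]
    wx = 2.0 * np.pi * np.fft.fftfreq(image.shape[1])[None, :]
    transfer = np.exp(-0.5 * s * (wx ** 2 + wy ** 2))  # Fourier transform of g(.; s)
    return np.real(np.fft.ifft2(np.fft.fft2(image) * transfer))

def normalized_laplacian(L, s):
    """(ii) Pointwise s * (L_xx + L_yy), equation (2), by central differences."""
    Lxx = np.roll(L, -1, axis=1) - 2.0 * L + np.roll(L, 1, axis=1)
    Lyy = np.roll(L, -1, axis=0) - 2.0 * L + np.roll(L, 1, axis=0)
    return s * (Lxx + Lyy)

def bright_blob_maxima(image, scales, min_strength=1e-3):
    """(iii) Strict local maxima over (scale, row, col) of the negated response.

    Bright blobs give a negative Laplacian, hence the sign change.
    Returns (row, col, s, strength) tuples, strongest first.
    """
    stack = np.stack([-normalized_laplacian(gaussian_smooth(image, s), s)
                      for s in scales])
    found = []
    for k in range(1, stack.shape[0] - 1):          # interior scale levels only
        for r in range(1, stack.shape[1] - 1):
            for c in range(1, stack.shape[2] - 1):
                v = stack[k, r, c]
                if v < min_strength:
                    continue
                nb = stack[k - 1:k + 2, r - 1:r + 2, c - 1:c + 2]
                if v == nb.max() and np.count_nonzero(nb == v) == 1:
                    found.append((r, c, scales[k], float(v)))
    return sorted(found, key=lambda m: -m[3])

# Synthetic test image: one bright Gaussian blob of variance 16 centred at (32, 32).
yy, xx = np.mgrid[0:64, 0:64]
image = np.exp(-((xx - 32) ** 2 + (yy - 32) ** 2) / (2.0 * 16.0))
blobs = bright_blob_maxima(image, scales=[4.0, 8.0, 16.0, 32.0, 64.0])
```

For this synthetic blob the strongest maximum is found at the blob centre and at the scale level equal to the blob variance, which is the behaviour expected from scale selection with $\gamma = 1$. A ridge detector along the same lines would replace the Laplacian by the measure of equation (3).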
Furthermore, to simplify the geometric analysis of image features, we shall
reduce the spatial representation of image descriptors to ellipses, by
evaluating a second moment matrix

$$\mu = \int_{\xi \in \mathbb{R}^2} \begin{pmatrix} L_x^2 & L_x L_y \\ L_x L_y & L_y^2 \end{pmatrix} g(\xi; s_{int}) \, d\xi \tag{4}$$

at integration scale $s_{int}$ proportional to the detection scale of the
scale-space maximum (equation (1)). Thereby, each image feature will be
represented by a point $(x; s)$ in scale-space and a covariance matrix $\Sigma$
describing the shape, graphically illustrated by an ellipse. For one-dimensional
features, the corresponding ellipses will be highly
[...] descriptors of the second moment matrices will be rather circular.
Attributes derived from the covariance matrix include its anisotropy derived
from the ratio $\lambda_{max} / \lambda_{min}$ between its eigenvalues, and its
orientation defined as the orientation of its main eigenvector.
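As a small illustration of these two attributes, the following sketch computes the anisotropy and the orientation from a given symmetric 2x2 covariance matrix. The function name is hypothetical, and we assume that the "main eigenvector" is the one belonging to the largest eigenvalue. Since the sign of an eigenvector is arbitrary, the orientation is reported modulo $\pi$.

```python
import numpy as np

def ellipse_attributes(sigma):
    """Return (anisotropy, orientation) of a symmetric positive definite 2x2 matrix.

    anisotropy  = lambda_max / lambda_min
    orientation = direction of the eigenvector of lambda_max, in [0, pi)
    """
    eigvals, eigvecs = np.linalg.eigh(np.asarray(sigma, dtype=float))
    lam_min, lam_max = eigvals          # eigh returns eigenvalues in ascending order
    main = eigvecs[:, 1]                # eigenvector of the largest eigenvalue
    orientation = float(np.arctan2(main[1], main[0]) % np.pi)
    return float(lam_max / lam_min), orientation

# Example: an ellipse with variances 9 and 1, rotated by 30 degrees.
angle = np.pi / 6
rot = np.array([[np.cos(angle), -np.sin(angle)],
                [np.sin(angle),  np.cos(angle)]])
sigma = rot @ np.diag([9.0, 1.0]) @ rot.T
anisotropy, orientation = ellipse_attributes(sigma)
```

In this example the anisotropy is 9 and the orientation is $\pi/6$, up to rounding. A strongly elongated (ridge-like) feature gives a large anisotropy, whereas a blob-like feature gives a value close to one.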
Figure 4 shows an example of such image descriptors computed from a grey-level
image, after ranking on a significance measure defined as the magnitude of the
response of the differential operator at the scale-space maximum. A trivial but
nevertheless very useful effect of this ranking is that it substantially reduces
the number of image features for further processing, thus improving the
computational efficiency. In a more detailed representation of the multi-scale
deep structure of a real-world image, it will often be the case that a large
number of the image features and their hierarchical relations correspond to
image structures that will be regarded as insignificant by later processing
stages.
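The ranking step amounts to sorting the detected features on the magnitude of their response and keeping the strongest ones. A minimal sketch follows, in which the dictionary layout and the key name `response` are hypothetical; the default of 20 features mirrors the number shown in figure 4.

```python
def strongest_features(features, count=20):
    """Keep the `count` features with the largest response magnitude."""
    ranked = sorted(features, key=lambda f: abs(f["response"]), reverse=True)
    return ranked[:count]

# Example with three hypothetical features.
features = [
    {"name": "a", "response": -3.0},
    {"name": "b", "response": 1.0},
    {"name": "c", "response": 2.0},
]
top = strongest_features(features, count=2)
```

Here the two retained features are "a" and "c", in that order.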
3.2 Qualitative Feature Relations
Between the above mentioned features, various types of relations can be defined
in the image plane. Here, we consider the following types of qualitative
relations:
Spatial coincidence (inclusion): We say that a region A at position $x_A$ and
scale $s_A$ is in spatial coincidence relation to a region B at position $x_B$
and at a (coarser) scale $s_B > s_A$ if

$$(x_A - x_B)^T \Sigma_B^{-1} (x_A - x_B) \in [D_1, D_2] \tag{5}$$

where $D_1$ and $D_2$ are distance thresholds and $\Sigma_B$ is a covariance
matrix associated with region B. By using a Mahalanobis distance measure, we
introduce a directional preference which is highly useful for expressing spatial
relations between elongated image features. While the special case $D_1 = 0$
corresponds to an inclusion relation, there are also cases where one may want to
explicitly represent distant features, using $D_1 > 0$.
Stability of scale relations: For two image features at times $t_k$ and
$t_{k'}$, we assume that the ratio between their scale values should be
approximately the same. This is motivated by the physical requirement of scale
invariance under zooming

$$\frac{s_A(t_k)}{s_B(t_k)} \approx \frac{s_A(t_{k'})}{s_B(t_{k'})}. \tag{6}$$

To accept small variations due to changes in view direction and spurious
variations from the scale selection mechanism of the feature tracker, we measure
relative distances in the scale direction and implement the "$\approx$"
operation by $q \approx q' \Leftrightarrow |\log \frac{q}{q'}| < \log T$, where
$T > 1$ is a threshold in the scale direction.
Directional relation (bearing): For a feature A related to a one-dimensional
feature B, the angle $\alpha$ is measured between the main eigenvector of
$\Sigma_B$ and the vector $x_A - x_B$ from the center $x_B$ of B to the center
$x_A$ of A (see Figure 1).
Figure 1: The direction relation (bearing) between two features A and B is the
angle $\alpha$ between the main eigenvector of $\Sigma_B$ (illustrated by the
ellipse) and the vector $x_A - x_B$.
Trivially, these relations are invariant to translations and rotations in the
image plane. The scale invariance of these relations follows from corresponding
scale invariance properties of image descriptors computed from scale-space
maxima: if the size of an image structure is scaled by a factor $c$ in the image
domain, then the corresponding scale levels are transformed by a factor $c^2$.
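The three qualitative relations can be expressed as small predicates. The following is a sketch under stated assumptions, not the implementation used in this work: the function names are hypothetical, the quadratic form of equation (5) is compared directly against $[D_1, D_2]$, the main eigenvector is taken to be that of the largest eigenvalue, and the bearing is reported modulo $\pi$ because the sign of an eigenvector is arbitrary. The example at the end also assumes that a covariance matrix in the image plane scales by $c^2$ under a zoom by the factor $c$.

```python
import math
import numpy as np

def spatial_coincidence(x_a, x_b, sigma_b, d1, d2):
    """Equation (5): (x_A - x_B)^T Sigma_B^{-1} (x_A - x_B) lies in [D1, D2]."""
    d = np.asarray(x_a, dtype=float) - np.asarray(x_b, dtype=float)
    form = float(d @ np.linalg.solve(np.asarray(sigma_b, dtype=float), d))
    return d1 <= form <= d2

def scale_ratio_stable(s_a, s_b, s_a_next, s_b_next, threshold):
    """Equation (6): |log(q / q')| < log T for the scale ratios q and q'."""
    q = s_a / s_b
    q_next = s_a_next / s_b_next
    return abs(math.log(q / q_next)) < math.log(threshold)

def bearing(x_a, x_b, sigma_b):
    """Angle in [0, pi) between the main eigenvector of Sigma_B and x_A - x_B."""
    eigvals, eigvecs = np.linalg.eigh(np.asarray(sigma_b, dtype=float))
    main = eigvecs[:, np.argmax(eigvals)]
    d = np.asarray(x_a, dtype=float) - np.asarray(x_b, dtype=float)
    cross = main[0] * d[1] - main[1] * d[0]
    return math.atan2(cross, float(main @ d)) % math.pi

# Example: feature A at (1, 1) relative to an elongated feature B at the origin.
x_a, x_b = np.array([1.0, 1.0]), np.array([0.0, 0.0])
sigma_b = np.array([[4.0, 0.0], [0.0, 1.0]])
inside = spatial_coincidence(x_a, x_b, sigma_b, 0.0, 2.0)
alpha = bearing(x_a, x_b, sigma_b)

# Zooming by c: positions scale by c, covariances and scale levels by c**2.
c = 2.0
inside_zoomed = spatial_coincidence(c * x_a, c * x_b, c ** 2 * sigma_b, 0.0, 2.0)
alpha_zoomed = bearing(c * x_a, c * x_b, c ** 2 * sigma_b)
```

In this example the quadratic form equals 1.25 and the bearing equals $\pi/4$, both before and after the zoom, and a pair of scale levels multiplied by the same factor $c^2$ leaves the ratio in equation (6) unchanged.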
3.3 Qualitative Multi-Scale Feature Hierarchy
Let us now consider a specific example with images of a hand. From our knowledge
that a hand consists of five fingers, we construct a model consisting of: (i)
the palm, (ii) the five fingers, (iii) a finger tip for each finger (see
figure 2).
Each finger is in a spatial coincidence relation to the palm, as well as a
directional relation. Moreover, each fingertip is in a spatial relationship to
its finger, and satisfies a directional relation to this feature. In a similar
manner, each finger is in a scale stability relation with respect to the palm,
and each fingertip is in a corresponding scale stability relation relative to
its finger.
Such a representation will be referred to as a qualitative multi-scale feature
hierarchy. Figure 3 shows the relations this representation is built from, using
UML notation (Fowler & Scott 1997). An attractive property of this view-based
object representation is that it only focuses on qualitative object features.
There is no assumption of rigidity, only that the qualitative shape is
preserved.
The idea behind this construction is of course that the palm and the fingertips
should give rise to blob responses (equation (2)) and that the fingers give rise
to ridge responses (equation (3)). Figure 4 shows an example of how this model
can be initialized and matched to image data with associated image descriptors.
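The structure of such a hand model can be written down as a small tree. The following sketch is only an illustration: the class names loosely follow the instance names of figure 3, while the fields, the function names and all threshold values are hypothetical, and the directional (bearing) constraint is left out for brevity.

```python
from dataclasses import dataclass, field
from typing import List, Optional

@dataclass
class Constraint:
    """Qualitative constraint relative to the parent (values are hypothetical)."""
    d1: float         # lower distance threshold D_1 of equation (5)
    d2: float         # upper distance threshold D_2 of equation (5)
    scale_tol: float  # threshold T > 1 of the scale ratio test, equation (6)

@dataclass
class ObjFeature:
    name: str
    kind: str                                 # "blob" or "ridge"
    constraint: Optional[Constraint] = None   # None for the root feature
    children: List["ObjFeature"] = field(default_factory=list)

def build_hand_model(n_fingers=5):
    """Palm (blob) -> fingers (ridges) -> one fingertip (blob) per finger."""
    finger_constraint = Constraint(d1=0.0, d2=4.0, scale_tol=2.0)
    tip_constraint = Constraint(d1=0.0, d2=4.0, scale_tol=2.0)
    hand = ObjFeature("hand", "blob")
    for i in range(1, n_fingers + 1):
        tip = ObjFeature(f"tip[{i}]", "blob", tip_constraint)
        finger = ObjFeature(f"finger[{i}]", "ridge", finger_constraint, [tip])
        hand.children.append(finger)
    return hand

def count_features(node):
    """Number of features in the hierarchy rooted at `node`."""
    return 1 + sum(count_features(child) for child in node.children)

hand = build_hand_model()
n_features = count_features(hand)
```

With five fingers the hierarchy contains eleven features: the palm, five fingers and five fingertips. Sharing one constraint object among all fingers, and one among all fingertips, mirrors the single `fingerconstraint` and `tipconstraint` instances that appear in figure 3.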
To exclude responses from the background, we have here required that all image
features should correspond to bright blobs or bright ridges. Alternatively, one
could define spatial inclusion relations with respect to other segmentation cues
relative to the background, e.g. chromaticity or depth.
Here, we have constructed the graph with feature relations manually, using
qualitative knowledge about the shape of the object and its primitives. In a
more general setting, however, one can also consider the learning of stable
feature relations in an actual setting, based on a richer set of image features
as well as a richer vocabulary of
Figure 2: A qualitative multi-scale feature hierarchy constructed for a hand
model. (Axes: x, y, s.)
[Diagram instances: top-hand:Relation, handconstraint:Constraint,
hand:Objfeature, hand-finger:Relation, fingerconstraint:Constraint,
finger[1]:Objfeature, finger[2]:Objfeature, ..., finger-tip[1]:Relation,
finger-tip[2]:Relation, ..., tipconstraint:Constraint, tip[1]:Objfeature,
tip[2]:Objfeature, ...]
Figure 3: Instance diagram for the feature hierarchy of a hand (figure 2).
[Panels, left to right: 20 strongest blobs and ridges; Initialized hand model;
All hand features captured]
Figure 4: Illustration of the initialization stage of the object tracker. Once
the coarse-scale feature is found (here the palm of the hand), the qualitative
feature hierarchy guides the top-down search for the remaining features of the
representation. (The left image shows the 20 most significant blob responses (in
red) and ridge responses (in blue).)