
A Prototype System for Computer Vision Based Human Computer Interaction





Lars Bretzner (1), Ivan Laptev (1), Tony Lindeberg (1), Soren Lenman (2), Yngve Sundblad (2)

(1) Computational Vision and Active Perception Laboratory (CVAP)

(2) Center for User-Oriented IT-Design (CID)

Department of Numerical Analysis and Computing Science

KTH (Royal Institute of Technology)

S-100 44 Stockholm, Sweden.

Technical report ISRN KTH/NA/P--01/09--SE

1 Introduction

With the development of information technology in our society, we can expect that computer systems to a larger extent will be embedded into our environment. These environments will impose needs for new types of human-computer-interaction, with interfaces that are natural and easy to use. In particular, the ability to interact with computerized equipment without need for special external equipment is attractive.

Today, the keyboard, the mouse and the remote control are used as the main interfaces for transferring information and commands to computerized equipment. In some applications involving three-dimensional information, such as visualization, computer games and control of robots, other interfaces based on trackballs, joysticks and datagloves are being used. In our daily life, however, we humans use our vision and hearing as main sources of information about our environment. Therefore, one may ask to what extent it would be possible to develop computerized equipment able to communicate with humans in a similar way, by understanding visual and auditive input.

Perceptual interfaces based on speech have already started to find a number of commercial and technical applications. For example, systems are now available where speech commands can be used for dialing numbers in cellular [...]



The support from the Swedish Research Council for Engineering Sciences, TFR, and the Swedish National Board for Industrial and Technical Development, NUTEK, is gratefully acknowledged. On-line video clip demos can be viewed from http://www.nada.kth.se/cvap/gvmdi, and an on-line version of this manuscript can be fetched from http://www.nada.kth.se/cvap/abstracts/cvap251.html.


[...]ing power of computers has reached a point where real-time processing of visual information is possible with common workstations.

The purpose of this article is to describe ongoing work in developing new perceptual interfaces with emphasis on commands expressed as hand gestures. Examples of applications of hand gesture analysis include:

• Control of consumer electronics

• Interaction with visualization systems

• Control of mechanical systems

• Computer games

Potential advantages of using visual input in this context are that visual information makes it possible to communicate with computerized equipment at a distance, without need for physical contact with the equipment that is to be controlled. Moreover, the user should be able to control the equipment without need for specialized external devices, such as a remote control.

2 Control by hand gestures

Figure 1 shows an illustration of a type of scenario we are interested in. The user is in front of a camera connected to a computer. The camera follows the movements of the hand, and performs actions depending on the state and the motion of the hand. Three basic types of hand gestures can be identified in such a situation:

• A static hand posture implies that the hand is held in a fixed state during a certain period of time, during which the system recognizes the state given a predefined set of states. Examples of interpretations that are possible [...] between different modes for a command involving motion.

• A quantitative hand motion means that the two-dimensional or the three-dimensional motion of the hand is measured, and the estimated motion parameters (translations and rotations) are being used for controlling the motion of other computerized equipment, such as visualization parameters for displaying a three-dimensional object, the volume of a TV or the motion of a robot.

• A qualitative hand motion means that the hand moves according to a predefined motion pattern (a trajectory in space-time) and that the motion pattern is recognized from a predefined set of motion patterns. Examples of interpretations include letters (the Palm Pilot sign language) or control of consumer electronics in a similar manner as for static hand postures.

Figure 1: Example of a simple situation where the user controls actions on a screen using hand gestures. In this application, the position of the cursor is controlled by the [...]

3 A prototype scenario

To be able to test computer-vision-based human-computer-interaction in practice, we developed a prototype test bed system, where the user can control a TV set and a lamp using the following types of hand postures:

• Three open fingers (figure 2(a)) toggle the TV on or off.

• Two open fingers (figure 2(b-c)) change the channel of the TV. With the index finger pointing to one side, the next TV channel is selected, while the previous channel is selected if the index finger points upwards.

• Five open fingers (figure 2(d)) toggle the lamp on or off.
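The mapping from recognized postures to actions listed above can be sketched as a small dispatch function. This is only an illustration under stated assumptions: the class `Environment`, the function `apply_posture` and the direction labels `"side"` and `"up"` are hypothetical names, and the text does not describe how the prototype actually binds postures to commands.

```python
from dataclasses import dataclass

@dataclass
class Environment:
    """Hypothetical stand-in for the controlled equipment (TV set and lamp)."""
    tv_on: bool = False
    lamp_on: bool = False
    channel: int = 1

def apply_posture(env, open_fingers, index_direction=None):
    """Apply the action bound to a recognized static hand posture.

    open_fingers: number of open fingers reported by the recognizer.
    index_direction: "side" or "up" (assumed labels), used for two fingers only.
    """
    if open_fingers == 3:
        env.tv_on = not env.tv_on          # three open fingers: toggle TV
    elif open_fingers == 2 and index_direction == "side":
        env.channel += 1                   # index finger to one side: next channel
    elif open_fingers == 2 and index_direction == "up":
        env.channel -= 1                   # index finger upwards: previous channel
    elif open_fingers == 5:
        env.lamp_on = not env.lamp_on      # five open fingers: toggle lamp
    # Any other posture is ignored in this sketch.
    return env

env = Environment()
apply_posture(env, 5)            # lamp on
apply_posture(env, 3)            # TV on
apply_posture(env, 2, "side")    # next channel
```

Postures outside the four listed cases leave the state unchanged, which is a design choice of this sketch rather than a documented property of the prototype.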

Figure 3 shows a few snapshots from a demonstration, where a user controls equipment in the environment in this way. In figures 3(a)-(b) a user turns on the lamp, in figures 3(c)-(d) he turns on the TV set, and in figures 3(e)-(f) he switches the TV set to a new channel. All steps in this demonstration have [...] system described in the next section.

Figure 2: Hand postures controlling a prototype scenario: (a) a hand with three open fingers toggles the TV on or off, (b) a hand with two open fingers and the index finger pointing to one side selects the next TV channel, (c) a hand with two open fingers and the index finger pointing upwards selects the previous TV channel, (d) a hand with five open fingers toggles the lamp on or off. Panel titles: Toggle TV on/off, Next channel, Previous channel, Toggle lamp on/off.

Figure 3: A few snapshots from a scenario where a user enters a room and turns on the lamp (a)-(b), turns on the TV set (c)-(d) and switches to a new TV channel (e)-(f).

4 A prototype system

To track and recognize hands in multiple states, we have developed a system based on a combination of shape and colour information. At an overview level, the system consists of the following functionalities (see figure 4):

[Figure 4 diagram. Processing stages: Image capturing; Colour segmentation; Feature detection; Tracking and pose recognition; Application control. Data labels on the connections: Colour image; ROI; Blobs and ridges; Pose, position, scale and orientation.]

Figure 4: Overview of the main components of the prototype system for detecting and recognizing hand gestures, and using this information for controlling consumer electronics.

The image information from the camera is grabbed at frame rate, and the colour images are converted from RGB format to a new colour space that separates the intensity and chromaticity components of the colour data. In the colour images, colour feature detection is performed, which results in a set of image features that can be matched to a model. Moreover, a complementary comparison between actual colour and skin colour is performed to identify regions that are more likely to contain hands. Based on the detected image features and the computed skin colour similarity, comparison with a set of object hypotheses is performed using a statistical approach referred to as particle filtering or condensation. The most likely hand posture is estimated, as well as the position, size and orientation of the hand. This recognized gesture information is bound to different actions relative to the environment, and these actions are carried out under the control of the gesture recognition system. In this way, the gesture recognition system provides a medium by which the user can control different types of equipment in his environment. Appendix A gives a more detailed description of the algorithms and computational modules in the system.
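To give a rough idea of how particle filtering (condensation) compares a set of hypotheses against measurements, the following sketch runs a generic resample, predict and measure cycle. It is a deliberately reduced illustration and not the tracker of the prototype: the state is a single scalar rather than a full hand configuration, the random-walk motion model and the Gaussian likelihood are assumptions, and all function names are hypothetical.

```python
import math
import random

def condensation_step(particles, weights, likelihood, noise_std, rng):
    """One resample / predict / measure cycle over scalar state hypotheses."""
    n = len(particles)
    # Resample hypotheses in proportion to their previous weights.
    resampled = rng.choices(particles, weights=weights, k=n)
    # Predict: diffuse each hypothesis (random-walk motion model, assumed here).
    predicted = [p + rng.gauss(0.0, noise_std) for p in resampled]
    # Measure: weight each hypothesis by its likelihood and normalize.
    raw = [likelihood(p) for p in predicted]
    total = sum(raw)
    if total > 0:
        new_weights = [w / total for w in raw]
    else:
        new_weights = [1.0 / n] * n
    return predicted, new_weights

def gaussian_likelihood(p, observed=5.0, sigma=0.5):
    """Assumed measurement model: hypotheses near the observation score high."""
    return math.exp(-0.5 * ((p - observed) / sigma) ** 2)

rng = random.Random(0)
particles = [rng.uniform(0.0, 10.0) for _ in range(500)]
weights = [1.0 / len(particles)] * len(particles)
for _ in range(10):
    particles, weights = condensation_step(
        particles, weights, gaussian_likelihood, 0.2, rng)

# Weighted mean of the hypotheses as the state estimate.
estimate = sum(p * w for p, w in zip(particles, weights))
```

In the system described here the likelihood would instead be derived from the detected image features and the skin colour similarity, and each hypothesis would carry the position, size, orientation and posture of the hand.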

5 Related works

The problem of hand gesture analysis has received increased attention in recent years. Early work on using hand gestures for television control was presented by (Freeman & Weissman 1995) using normalized correlation; see also (Kuch & Huang 1995, Pavlovic et al. 1997, Maggioni & Kammerer 1998, Cipolla & Pentland 1998) for related works. Some approaches consider elaborated 3-D hand models (Regh & Kanade 1995), while others use colour markers to simplify feature detection (Cipolla et al. 1993). Appearance-based models for hand tracking and sign recognition were used by (Cui & Weng 1996), while (Heap & Hogg 1998, MacCormick & Isard 2000) tracked silhouettes of hands. Graph-like and feature-based hand models have been proposed by (Triesch & von der Malsburg 1996) for sign recognition and in (Bretzner & Lindeberg 1998) for tracking and estimating 3-D rotations of a hand.

The use of a hierarchical hand model continues along the works by (Crowley & Sanderson 1987) who extracted peaks from a Laplacian pyramid of an image and linked them into a tree structure with respect to resolution, (Lindeberg 1993) who constructed a scale-space primal sketch with an explicit encoding of blob-like structures in scale space as well as the relations between these, (Triesch & von der Malsburg 1996) who used elastic graphs to represent hands in different postures with local jets of Gabor filters computed at each vertex, (Lindeberg 1998) who performed feature detection with automatic scale selection by detecting local extrema of normalized differential entities with respect to scale, (Shokoufandeh et al. 1999) who detected maxima in a multi-scale wavelet transform, as well as (Bretzner & Lindeberg 1999), who computed multi-scale blob and ridge features and defined explicit qualitative relations between these features. The use of chromaticity as a primary cue for detecting skin coloured regions was first proposed by (Fleck et al. 1996).

Our implementation of particle filtering largely follows the traditional approaches for condensation as presented by (Isard & Blake 1996, Black & Jepson 1998, Sidenbladh et al. 2000, Deutscher et al. 2000) and others. Using the hierarchical multi-scale structure of the hand models, however, we adapted the layered sampling approach (Sullivan et al. 1999) and used a coarse-to-fine search strategy to improve the computational efficiency, here, by a factor of two.

The proposed approach is based on several of these works and is novel in the respect that it combines a hierarchical object model with image features at multiple scales and particle filtering for robust tracking and recognition. For more details about the algorithmic aspects underlying the tracking and recognition components in the current system, see (Laptev & Lindeberg 2000).

6 The CVAP-CID collaboration

The work is carried out as a collaboration project between the Computational Vision and Active Perception Laboratory (CVAP) and the Center for User-Oriented IT-Design (CID) at KTH, where CVAP provides expertise on computer vision, while CID provides expertise on human-computer-interaction.

[...] central importance that user studies are being carried out and that the interaction is tested in prototype systems as early as possible. Computer vision algorithms for gesture recognition will be developed by CVAP, and will be used in prototype systems in scenarios defined in collaboration with CID. User studies for these scenarios will then be developed and performed by CID, to guide further developments.

References

Black, M. & Jepson, A. (1998), A probabilistic framework for matching temporal trajectories: Condensation-based recognition of gestures and expressions, in `Fifth European Conference on Computer Vision', Freiburg, Germany, pp. 909-924.

Bretzner, L. & Lindeberg, T. (1998), Use your hand as a 3-D mouse or relative orientation from extended sequences of sparse point and line correspondences using the affine trifocal tensor, in H. Burkhardt & B. Neumann, eds, `Fifth European Conference on Computer Vision', Vol. 1406 of Lecture Notes in Computer Science, Springer Verlag, Berlin, Freiburg, Germany, pp. 141-157.

Bretzner, L. & Lindeberg, T. (1999), Qualitative multi-scale feature hierarchies for object tracking, in O. F. O. M. Nielsen, P. Johansen & J. Weickert, eds, `Proc. 2nd International Conference on Scale-Space Theories in Computer Vision', Vol. 1682, Springer Verlag, Corfu, Greece, pp. 117-128.

Cipolla, R., Okamoto, Y. & Kuno, Y. (1993), Robust structure from motion using motion parallax, in `Fourth International Conference on Computer Vision', Berlin, Germany, pp. 374-382.

Cipolla, R. & Pentland, A., eds (1998), Computer vision for human-computer interaction, Cambridge University Press, Cambridge, U.K.

Crowley, J. & Sanderson, A. (1987), `Multiple resolution representation and probabilistic matching of 2-d gray-scale shape', IEEE Transactions on Pattern Analysis and Machine Intelligence 9(1), 113-121.

Cui, Y. & Weng, J. (1996), View-based hand segmentation and hand-sequence recognition with complex backgrounds, in `13th International Conference on Pattern Recognition', Vienna, Austria, pp. 617-621.

Deutscher, J., Blake, A. & Reid, I. (2000), Articulated body motion capture by annealed particle filtering, in `CVPR'2000', Hilton Head, SC, pp. II:126-133.

Fleck, M., Forsyth, D. & Bregler, C. (1996), Finding naked people, in `Fourth European Conference on Computer Vision', Cambridge, UK, pp. II:593-602.

Freeman, W. T. & Weissman, C. D. (1995), Television control by hand gestures, in `Proc. Int. Conf. on Face and Gesture Recognition', Zurich, Switzerland.

Heap, T. & Hogg, D. (1998), Wormholes in shape space: Tracking through discontinuous changes in shape, in `Sixth International Conference on Computer Vision', Bombay, India, pp. 344-349.

Isard, M. & Blake, A. (1996), Contour tracking by stochastic propagation of conditional density, in `Fourth European Conference on Computer Vision', Cambridge, UK, pp. I:343-356.

Kuch, J. J. & Huang, T. S. (1995), Vision based hand modelling and tracking for virtual teleconferencing and telecollaboration, in `Proc. 5th International Conference on Computer Vision', Cambridge, MA, pp. 666-671.

Laptev, I. & Lindeberg, T. (2000), Tracking of multi-state hand models using particle filtering and a hierarchy of multi-scale image features, Technical Report ISRN KTH/NA/P--00/12--SE, Dept. of Numerical Analysis and Computing Science, KTH, Stockholm, Sweden.

Lindeberg, T. (1993), `Detecting salient blob-like image structures and their scales with a scale-space primal sketch: A method for focus-of-attention', International Journal of Computer Vision 11(3), 283-318.

Lindeberg, T. (1998), `Feature detection with automatic scale selection', International Journal of Computer Vision 30(2), 77-116.

MacCormick, J. & Isard, M. (2000), Partitioned sampling, articulated objects, and interface-quality hand tracking, in `Sixth European Conference on Computer Vision', Dublin, Ireland, pp. II:3-19.

Maggioni, C. & Kammerer, B. (1998), Gesturecomputer - history, design and applications, in R. Cipolla & A. Pentland, eds, `Computer vision for human-computer interaction', Cambridge University Press, Cambridge, U.K., pp. 23-52.

Pavlovic, V. I., Sharma, R. & Huang, T. S. (1997), `Visual interpretation of hand gestures for human-computer interaction: A review', IEEE Trans. Pattern Analysis and Machine Intell. 19(7), 677-694.

Regh, J. M. & Kanade, T. (1995), Model-based tracking of self-occluding articulated objects, in `Fifth International Conference on Computer Vision', Cambridge, MA, pp. 612-617.

Shokoufandeh, A., Marsic, I. & Dickinson, S. (1999), `View-based object recognition using saliency maps', Image and Vision Computing 17(5/6), 445-460.

Sidenbladh, H., Black, M. & Fleet, D. (2000), Stochastic tracking of 3d human figures using 2d image motion, in `Sixth European Conference on Computer Vision', Dublin, Ireland, pp. II:702-718.

Sullivan, J., Blake, A., Isard, M. & MacCormick, J. (1999), Object localization by bayesian correlation, in `Seventh International Conference on Computer Vision', Corfu, Greece, pp. 1068-1075.

Triesch, J. & von der Malsburg, C. (1996), Robust classification of hand postures against complex background, in `Proc. Int. Conf. on Face and Gesture Recognition', Killington, Vermont, pp. 170-175.

A Computational modules in the prototype system

This appendix gives a more detailed description of the algorithms underlying the different computational modules in the prototype system for hand gesture recognition outlined in section 4. In contrast to the main text, this presentation assumes knowledge about computer vision.

A.1 Shape cues

For each image, a set of blob and ridge features is detected. The idea is that the palm of the hand gives rise to a blob at a coarse scale, each one of the fingers gives rise to a ridge at a finer scale, and each finger tip gives rise to a fine scale blob. Figure 5 shows an example of such image features computed from an image.

A.1.1 Colour feature detection

Technically, this feature detection step is based on the following computational steps. The input colour image is transformed from the RGB colour space to a [...]

$$ I = \frac{R + G + B}{3} \tag{1} $$

$$ u = R - G \tag{2} $$

$$ v = G - B \tag{3} $$
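Equations (1)-(3) amount to a per-pixel linear transformation, which can be sketched as follows. The function name `rgb_to_iuv` is a hypothetical choice, and the sketch assumes the channel values are plain numbers on a common scale.

```python
def rgb_to_iuv(r, g, b):
    """Split an RGB triple into intensity I and chromatic components (u, v)."""
    intensity = (r + g + b) / 3.0   # equation (1)
    u = r - g                       # equation (2)
    v = g - b                       # equation (3)
    return intensity, u, v

reddish = rgb_to_iuv(200, 120, 100)
grey = rgb_to_iuv(50, 50, 50)
```

For a grey pixel both chromatic components vanish, which illustrates how this colour space separates intensity from chromaticity.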

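A scale-space representation is computed of each colour channel $f_i$ by convolution with Gaussian kernels $g(\cdot;\, t)$ of different variance $t$, $C_i(\cdot;\, t) = g(\cdot;\, t) * f_i(\cdot)$, and the following normalized differential expressions are computed and summed up over the channels at each scale:

$$ B_{norm} C = \sum_i t^2 \left( \partial_{xx} C_i + \partial_{yy} C_i \right)^2 \tag{4} $$

$$ R_{norm} C = \sum_i t^{3/2} \left( \left( \partial_{xx} C_i - \partial_{yy} C_i \right)^2 + 4 \left( \partial_{xy} C_i \right)^2 \right) \tag{5} $$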
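To illustrate why the scale at which such a normalized expression peaks is informative, consider the single-channel special case of a unit-mass two-dimensional Gaussian blob of variance $t_0$. Smoothing it to scale $t$ gives a Gaussian of variance $t_0 + t$, whose Laplacian at the centre is $-1/(\pi (t_0 + t)^2)$, so the blob measure at the centre becomes $t^2 / (\pi^2 (t_0 + t)^4)$, which is maximal at $t = t_0$. The sketch below checks this numerically; the function names and the sampled scale levels are assumptions made for the illustration, and the closed form replaces the image convolutions of a real implementation.

```python
import math

def blob_measure_gaussian(t, t0):
    """Normalized blob measure t^2 * (Laplacian)^2 at the centre of a
    unit-mass 2-D Gaussian blob of variance t0, observed at scale t."""
    laplacian = -1.0 / (math.pi * (t0 + t) ** 2)
    return t ** 2 * laplacian ** 2

def select_scale(t0, scales):
    """Return the sampled scale at which the blob measure is largest."""
    return max(scales, key=lambda t: blob_measure_gaussian(t, t0))

# Hypothetical sampling of scale levels: 0.5, 1.0, ..., 32.0.
scales = [0.5 * k for k in range(1, 65)]
selected_small = select_scale(4.0, scales)
selected_large = select_scale(9.0, scales)
```

The selected scale follows the size of the blob, which is the property that lets the palm, the fingers and the finger tips be picked up at different scales.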

Then, scale-space maxima of these normalized differential entities are detected, i.e., points at which $B_{norm}$ and $R_{norm}$ assume normalized maxima with respect to space and scale. At each scale-space maximum $(x;\, t)$ a second-moment matrix

$$ \nu = \sum_i \int_{\eta \in \mathbb{R}^2} \begin{pmatrix} (\partial_x C_i)^2 & (\partial_x C_i)(\partial_y C_i) \\ (\partial_x C_i)(\partial_y C_i) & (\partial_y C_i)^2 \end{pmatrix} g(\eta;\, s_{int}) \, d\eta \tag{6} $$

is computed at integration scale $s_{int}$ proportional to the scale of the detected image features. To allow for the computational efficiency needed to reach real-time performance, all the computations in the feature detection step have been implemented within a pyramid framework. Figure 5 shows such features, illustrated by ellipses centered at $x$ and with covariance matrix $\Sigma = t \, \nu_{norm}$, where $\nu_{norm} = \nu / \lambda_{min}$ and $\lambda_{min}$ is the smallest eigenvalue of $\nu$.
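The normalization of the second-moment matrix by its smallest eigenvalue, followed by scaling with the feature scale, can be sketched for a symmetric 2x2 matrix as below. The function names are hypothetical, and the matrix is passed as nested lists purely for the sake of a self-contained example.

```python
import math

def smallest_eigenvalue(a, b, c):
    """Smallest eigenvalue of the symmetric 2x2 matrix [[a, b], [b, c]]."""
    mean = 0.5 * (a + c)
    radius = math.sqrt((0.5 * (a - c)) ** 2 + b ** 2)
    return mean - radius

def feature_covariance(nu, t):
    """Covariance t * nu / lambda_min of the ellipse drawn for a feature
    detected at scale t with second-moment matrix nu."""
    (a, b), (_, c) = nu
    lam_min = smallest_eigenvalue(a, b, c)
    if lam_min <= 0:
        raise ValueError("second-moment matrix must be positive definite")
    k = t / lam_min
    return [[k * a, k * b], [k * b, k * c]]

elongated = feature_covariance([[4.0, 0.0], [0.0, 1.0]], 2.0)
isotropic = feature_covariance([[3.0, 0.0], [0.0, 3.0]], 5.0)
```

With this normalization an isotropic second-moment matrix yields a circle whose covariance is simply the detection scale times the identity, while anisotropic structures such as ridges yield elongated ellipses.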

Figure 5: The result of computing blob features and ridge features from an image of a hand. (a) circles and ellipses corresponding to the significant blob and ridge features extracted from an image of a hand; (b) selected image features corresponding to the palm, the fingers and the finger tips of a hand; (c) a mixture of Gaussian kernels associated with blob and ridge features illustrating how the selected image features capture the essential structure of a hand.

A.2 Hand model

As mentioned above, an image of a hand can be expected to give rise to blob and ridge features corresponding to the fingers of the hand. These image structures, together with information about their relative orientation, position and scale, can be used for defining a simple but discriminative view-based model of a hand. Thus, we represent a hand by a set of blob and ridge features as illustrated in figure 6, and define different states, depending on the number of open fingers.

To model translations, rotations and scaling transformations of the hand, we define a parameter vector $X = (x, y, s, \alpha, l)$, which describes the global position $(x, y)$, the size $s$, and the orientation $\alpha$ of the hand in the image, together with its discrete state $l = 1 \ldots 5$. The vector $X$ uniquely identifies the hand configuration in the image and estimation of $X$ from image sequences corresponds to simultaneous hand tracking and recognition.

Figure 6: Feature-based hand models in different states (diagram labels: $\alpha$; $x, y, s$; $l = 1$ to $l = 5$). The circles and ellipses correspond to blob and ridge features. When aligning models to images, the features are translated, rotated and scaled according to the parameter vector $X$.
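The alignment of model features to the image can be sketched as a similarity transformation driven by the parameter vector. The class `HandState` and the function `place_feature` are hypothetical names. The sketch further assumes that model features are stored in a hand-centred coordinate frame, that the orientation is given in radians, and that a feature scale is a variance, so that it grows with the square of the size parameter.

```python
import math
from dataclasses import dataclass

@dataclass
class HandState:
    """Parameter vector X = (x, y, s, alpha, l) of a hand hypothesis."""
    x: float      # global position in the image
    y: float
    s: float      # size (treated as a length scale factor in this sketch)
    alpha: float  # orientation in radians (assumed unit)
    l: int        # discrete state, 1..5

def place_feature(state, fx, fy, ft):
    """Map a model feature at (fx, fy) with scale ft into image coordinates."""
    cos_a, sin_a = math.cos(state.alpha), math.sin(state.alpha)
    # Rotate by alpha, scale by s, then translate to (x, y).
    ix = state.x + state.s * (cos_a * fx - sin_a * fy)
    iy = state.y + state.s * (sin_a * fx + cos_a * fy)
    # A variance-type scale grows with the square of the length factor.
    it = state.s ** 2 * ft
    return ix, iy, it

state = HandState(x=10.0, y=20.0, s=2.0, alpha=math.pi / 2, l=3)
placed = place_feature(state, 1.0, 0.0, 1.0)
```

Each hypothesis evaluated by the tracker would place all features of the model for its state in this way before comparing them with the features detected in the image.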

A.3 Skin colour

When tracking human faces and hands in images, the use of skin colour has been demonstrated to be a powerful cue. In this work, we explore similarity to skin colour in two ways:

• For defining candidate regions (masks) for searching for hands.

• For computing a probabilistic measure of any pixel being skin coloured.

Histogram-based computation of skin coloured search regions. To delimit regions in the image for searching for hands, an adaptive histogram analysis of colour information is performed. For every image, a histogram is computed for the chromatic (u, v)-components of the colour space. In this (u, v)-space a coarse search region has been defined, where skin coloured regions are likely to be. Within this region, blob detection is performed, and the blob most likely to correspond to skin colour is selected. The support region of this blob in colour space is backprojected into the image domain, which results in [...] interest computed in this way, which are used as a guide for subsequent processing.

Figure 7: To delimit the regions in space where to perform recognition of hand gestures, an initial computation of regions of interest is carried out, based on adaptive histogram analysis. This illustration shows the behaviour of the histogram based colour analysis for a detail of a hand. In the system, however, the algorithm operates on overview images. (a) original image, (b) histogram over chromatic information, (c) backprojected histogram blob giving a hand mask, (d) results of blob detection in the histogram.
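The histogram analysis and backprojection can be sketched roughly as follows. This is a simplified illustration and not the implementation used in the system: the blob detection in the histogram is replaced by a much cruder stand-in (the most populated bin inside the search region together with its eight neighbours), and the names `chroma_histogram`, `peak_support`, `backproject` and the parameter `bin_size` are hypothetical.

```python
from collections import Counter

def chroma_histogram(pixels, bin_size):
    """Count pixels per quantized (u, v) bin; pixels is a list of (u, v) pairs."""
    return Counter((u // bin_size, v // bin_size) for u, v in pixels)

def peak_support(hist, search_region):
    """Crude stand-in for histogram blob detection (an assumption, not the
    method of the text): take the fullest bin inside the coarse search
    region and return it together with its eight neighbouring bins."""
    candidates = [b for b in hist if b in search_region]
    if not candidates:
        return set()
    bu, bv = max(candidates, key=lambda b: hist[b])
    return {(bu + du, bv + dv) for du in (-1, 0, 1) for dv in (-1, 0, 1)}

def backproject(pixels, support, bin_size):
    """Mark every pixel whose (u, v) bin lies in the selected support region."""
    return [(u // bin_size, v // bin_size) in support for u, v in pixels]

# Hypothetical toy data: three skin-like pixels and two background pixels.
pixels = [(40, 10), (42, 12), (41, 11), (0, 0), (-30, 5)]
region = {(bu, bv) for bu in range(2, 9) for bv in range(0, 4)}
hist = chroma_histogram(pixels, bin_size=8)
mask = backproject(pixels, peak_support(hist, region), bin_size=8)
```

The resulting boolean mask plays the role of the hand mask in panel (c) of figure 7, restricted here to a flat list of pixels instead of a two-dimensional image.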

Probabilistic prior on skin colour. For exploring colour information in this context, we compute a probabilistic colour prior in the following way:

• Hands were segmented manually from the background for approximately 30 images, and two-dimensional histograms over the chromatic information (u, v) were accumulated for skin regions and background.

• These histograms were summed up and normalized to unit mass.

• Given these training data, the probability of any measured image point with colour values (u, v) being skin colour was estimated as

$$ p_{skin}(u, v) = \frac{\max\left(0,\; a H_{skin}(u, v) - H_{bg}(u, v)\right)}{\sum_{u, v} \max\left(0,\; a H_{skin}(u, v) - H_{bg}(u, v)\right)}, \tag{7} $$
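Equation (7) can be sketched as a small function over two normalized histograms. The representation of the histograms as dictionaries keyed by (u, v) bins and the function name `skin_prior` are assumptions made for this illustration, and the value of the constant `a` used below is a placeholder, since this excerpt does not state it.

```python
def skin_prior(h_skin, h_bg, a):
    """Evaluate equation (7) for every (u, v) bin.

    h_skin, h_bg: dicts mapping (u, v) bins to normalized histogram mass.
    a: the weighting constant of equation (7) (value assumed by the caller).
    """
    bins = set(h_skin) | set(h_bg)
    # Numerator of (7): clipped difference between weighted skin and background mass.
    raw = {k: max(0.0, a * h_skin.get(k, 0.0) - h_bg.get(k, 0.0)) for k in bins}
    total = sum(raw.values())
    if total == 0:
        return {k: 0.0 for k in bins}
    # Denominator of (7): normalize so that the prior sums to one.
    return {k: value / total for k, value in raw.items()}

# Hypothetical toy histograms, each normalized to unit mass.
h_skin = {(0, 0): 0.5, (1, 0): 0.5}
h_bg = {(1, 0): 0.25, (2, 0): 0.75}
prior = skin_prior(h_skin, h_bg, a=1.0)
```

Bins that are more typical of the background than of skin receive zero probability because of the clipping at zero.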
