
Performance, Processing and Perception of Communicative Motion for Avatars and Agents

SIMON ALEXANDERSON

Doctoral Thesis

Stockholm, Sweden 2017


ISRN KTH/CSC/A-17/24-SE ISBN 978-91-7729-608-9

SE-100 44 Stockholm SWEDEN

Academic dissertation which, with the permission of KTH Royal Institute of Technology, is submitted for public examination for the degree of Doctor of Technology on Friday 15 December 2017 at 14:00 in F3, KTH Royal Institute of Technology, Lindstedtsvägen 26, Stockholm.

© Simon Alexanderson, December 2017
Printed by: Universitetsservice US AB


Abstract

Artificial agents and avatars are designed with a large variety of face and body configurations. Some of these (such as virtual characters in films) may be highly realistic and human-like, while others (such as social robots) have considerably more limited expressive means. In both cases, human motion serves as the model and inspiration for the non-verbal behavior displayed. This thesis focuses on increasing the expressive capacities of artificial agents and avatars using two main strategies: 1) improving the automatic capture of the body parts most important for human communication, namely the face and the fingers, and 2) increasing communication clarity by proposing novel ways of eliciting clear and readable non-verbal behavior.

The first part of the thesis covers automatic methods for capturing and processing motion data. In paper A, we propose a novel dual sensor method for capturing hands and fingers using optical motion capture in combination with low-cost instrumented gloves. The approach circumvents the main problems with marker-based systems and glove-based systems, and it is demonstrated and evaluated on a key-word signing avatar. In paper B, we propose a robust method for automatic labeling of sparse, non-rigid motion capture marker sets, and we evaluate it on a variety of marker configurations for finger and facial capture. In paper C, we propose an automatic method for annotating hand gestures using Hierarchical Hidden Markov Models (HHMMs).

The second part of the thesis covers studies on creating and evaluating multimodal databases with clear and exaggerated motion. The main idea is that this type of motion is appropriate for agents under certain communicative situations (such as noisy environments) or for agents with reduced expressive degrees of freedom (such as humanoid robots). In paper D, we record motion capture data for a virtual talking head with variable articulation style (normal-to-over articulated). In paper E, we use techniques from mime acting to generate clear non-verbal expressions custom tailored for three agent embodiments (face-and-body, face-only and body-only).


Sammanfattning

Today's society contains a wealth of artificial human-like figures. Some of these (such as digital characters in film) are highly realistic in appearance and behavior. Others (such as social robots) are severely limited in their bodily means of expression. In both cases, human motion serves as the model and inspiration for their non-verbal communication. This thesis addresses two fundamental questions for enabling machines to learn from and imitate human non-verbal communication: 1) How can we automatically record and process motion data for the body parts most important for human communication, i.e. the fingers and the face? 2) How can we create a motion library with clearly readable non-verbal communication? The intended application of the thesis is the synthesis of animated motion for artificial agents and avatars.

Part 1 of the thesis covers automatic methods for capturing and processing human motion. In paper A, we propose a new method for capturing hands and fingers using optical motion capture in combination with simple data gloves. The method is demonstrated and evaluated on a sign language avatar. In paper B, we propose a robust method for automatic identification of motion capture markers attached to fingers and faces. In paper C, we propose an automatic method for annotating gestures from motion capture data.

Part 2 of the thesis concerns the creation and evaluation of databases with clear non-verbal communication. The idea is that this type of motion is suitable for agents in certain communicative situations (such as noisy environments) or for agents with reduced communicative capabilities (such as humanoid robots). In paper D, we evaluate animations for clear audiovisual synthesis. In paper E, we use mime acting to generate clear movements for social robots and stylized avatars.


Acknowledgements

First and foremost, I would like to thank my supervisor Jonas Beskow. His unique intellectual brilliance and flexible guidance style ensured that the work on this thesis was both inspiring and explorative, yet never lost track of direction. His early-stage ideas and last-minute improvements ensured quality throughout, and his scripting wizardry saved countless hours of manual corpora preparation. I owe him special thanks for his openness to my many creative escapades outside of TMH (e.g. developing live animation theater shows or doing previs for film projects).

I thank my co-supervisor David House for support, ideas and discussions on gesture and speech over the years. His profound knowledge and consistent curiosity gave me invaluable insights on the nature of science.

I thank my dear friends and colleagues at TMH for creating an outstanding research environment. To the other professors: Joakim Gustafsson, Olov Engwall, Sten Ternström, Anders Friberg, Jens Edlund, Giampiero Salvi, Gabriel Skantze and Johan Boye. Special thanks to Jocke for advice and references, Sten for trusting me with a huge bag of money for motion capture equipment, Jens for reading and commenting on the thesis and to Giampi for discussions on machine learning. Thanks to professors emeriti Björn Granström, Rolf Carlson, Johan Sundberg and Anders Askenfelt for your leadership when I started at TMH, and to Gunnar Fant (in memoriam) for pioneering the field. Thanks to my room-mates: Catha Oertel for friendship, discussions and travel companionship, Eva Szekely for fun and inspiration and Zofia Malisz for discussions and encouragement. To my other fellow coworkers at TMH, current and past: Anders Elowsson, Andreas Selamtzis, Anna Hjalmarsson, Bajibabu Bollepalli, Bo Schenkman, Carina Lingonbacke, Christos Koniaris, Daniel Neiberg, Dimosthenis Kontogiorgos, Gaël Dubus, Gerhard Eckel, Glaucia Salomaõ, Jana Götze, Joe Mendelson, José Lopes, Kalin Stefanov, Kjell Elenius, Kjetil Hansen, Laura Enflo, Ludwig Elbaus, Martin Johansson, Mats Blomberg, Maurizio Goina, Mattias Bystedt, Mattias Heldner, Niklas Vanhainen, Patrik Jonell, Per Fallgren, Peter Nordqvist, Petur Helgason, Preben Wik, Raveesh Meena, Saeed Dabbaghchian, Samer Al Moubayed, Sofia Strömbergsson and Todd Shore. Special thanks to Dimos and Patrik for collaboration in the motion capture lab and to Kalin for long standing support and talks. Thanks also to the people at Furhat Robotics for bringing entrepreneurship and innovative spirit to the office space.

I owe a special thanks to Samer Al Moubayed for early cooperation and for introducing me to Disney Research, where I had the pleasure to do an internship supervised by Carol O'Sullivan and Gene Lee. Big thanks to both of you for making this a truly fun and transformative experience for me. I thank Carol for her great guidance and leadership, and for the continued support and collaboration through Trinity College Dublin. I also thank her for linking me to Michael Neff, and I thank Michael for his invaluable input and contributions to paper E in this thesis. I hope to get to work more with both of you in the future.

Many thanks to my other collaborators on projects and papers. To Britt Claesson, Sandra Drebring and Morgan Fredriksson for work with the Tivoli project. To Roberto Bresin and Petter Ögren for our small visionary project on robot sonification and stylized motion. To Emma Frid for ideas and sound design. I look forward to continued team-work. Thanks to Iolanda Leite and Aravind Elanjimattathil Vijayan for translating motion capture data to the NAO robot, and to Kelly Karipidou, Hedvig Kjellström and Anders Friberg for cooperation on conducting gestures. A special thanks to Meg Zellers for co-authoring papers, contributions with analysis and annotation, and for being such a positive and bright person.

During my time at TMH I have had the pleasure to work in several projects together with the Stockholm Academy of Dramatic Arts. I especially thank mime lecturer Alejandro Bonnet for long-time collaborations in the arts and technologies. Your expertise in gesture and communication has provided a solid understanding of the importance of mime for robots and virtual characters, and you are a truly outstanding and inspiring equilibrist! Thanks to Mirko Lempert at the film and media department for collaborative projects exploring previs and virtual production. Our mix of technological knowhow and solid understanding of the processes of filmmaking has been genuinely enjoyable and fruitful. Thanks to Maria Hedman Hvitfeldt and Anders Bohman for collaboration, insights and support. To Nils Claesson for collaboration on artistic aspects of motion capture and animation. To Olof Halldin for thoughts and references. I thank Stockholm University of the Arts for providing the funding and settings for these collaborations. My appreciations to Esther Ericsson, Åsa Andersson Broms and Martin Christensen at the Royal Institute of Arts for artistic vision and technical know-how.

Many thanks to Henrik Bäckbro and Filip Alexanderson for our raw creative bursts in the work with Cabaret Electrique, which has been a source of inspiration and energy throughout the thesis work. To Henrik for his in-depth knowledge in mime and puppetry, and for being a notorious maker and tinkerer, and to Filip for his artistic vision, playful nature and for fearlessly pushing motion capture and animation to the stronghold of Swedish theater tradition. Thanks to Dramaten for support and confidence. Thanks to Leif Handberg at KTH R1 for curating the most unique and fascinating performance space in Stockholm (the old nuclear reactor hall 25 m below KTH). To the many artists and technicians contributing to these endeavors. I owe my gratitude to Iwan Peter Scheer and Colm Massey for introducing me to digital puppetry and for infusing ideas and perspectives that have followed my path ever since.

To my family. I thank Maria Tengö for love, feedback and massive support throughout the years. Her open mind, bright intellect and long experience as a researcher have provided further perspectives and advice from home. Her work with multidisciplinary research and traditional knowledge systems has been a valuable complement to my research at KTH. I thank Anna and Katja for the constant fun and pride of being your father, for frank advice on how to design virtual robots (so they don't look creepy) and for participating in experiments. A special thanks to Lisa Tengö for always being there when conference season is high. My profound gratitude to my parents, Kalle and Lottie Alexanderson, for fostering me with a solid confidence, a lust for learning and a cherishing of creativity and culture in all shapes and forms.


Contents

Abstract
Sammanfattning
Contents
Included publications and individual contributions
Additional publications

1 Introduction
1.1 Scope of the thesis

2 Background
2.1 Avatars and agents
2.2 Data-driven animation
2.3 Gesture synthesis
2.4 Sign language avatars

I Automatic processing of motion data

3 Automatic processing of motion data
3.1 Motion capture
3.2 Automatic segmentation of gestures

II Clear communication

4 Clear communication
4.1 Natural clear communication
4.2 Stylized communication

III Summary

5 Contributions

6 Conclusions and future work
6.1 Conclusions
6.2 Future work

Bibliography


Included publications and individual contributions

The studies included in this thesis have been collaborative efforts. Specifications of the authors’ contributions are given below.

Paper A

Alexanderson, S., & Beskow, J. (2015). Towards Fully Automated Motion Capture of Signs–Development and Evaluation of a Key Word Signing Avatar. ACM Transactions on Accessible Computing (TACCESS), 7(2), 7.

The idea of using dual sensors came up in a discussion between SA and JB. The motion capture data collection was led by SA with help and input from JB. SA post-processed the data, implemented the skeleton solver and did the 3D-modelling, rigging and retargeting. The experiments were designed and implemented by SA with input from JB. SA conducted the experiments and analyzed the data with help from Margaret Zellers. Writing was mainly done by SA with contributions from JB.

Paper B

Alexanderson, S., O'Sullivan, C., & Beskow, J. (2017). Real-Time Labeling of Non-Rigid Motion Capture Marker Sets. Computers & Graphics. This is an extended version of our Best Paper awarded publication from Motion in Games 2016 [1].

The challenge of labeling sparse marker sets was first addressed in the Spontal project, where a rudimentary algorithm was developed by SA with input from JB. SA noticed the need for a specific method for labeling finger and facial marker sets, and proposed the spatial and temporal model finally used in the paper. SA planned and executed the experiments with input from JB and CS. Writing was done by SA with input from JB and CS.


Paper C

Alexanderson, S., House, D., & Beskow, J. (2016, November). Automatic annotation of gestural units in spontaneous face-to-face interaction. In Proceedings of the Workshop on Multimodal Analyses enabling Artificial Agents in Human-Machine Interaction (pp. 15-19). ACM.

The idea of using HHMMs for automatic annotation of gesture units originated from SA. SA implemented the algorithm and performed the experiments. Annotation of training data was done by SA, DH and JB. Writing was done by SA with contributions from JB and DH.

Paper D

Alexanderson, S., & Beskow, J. (2014). Animated Lombard speech: motion capture, facial animation and visual intelligibility of speech produced in adverse conditions. Computer Speech & Language, 28(2), 607-618.

The idea of using Lombard and whispered speech to elicit hyper articulated speech originated from JB. The data collection and processing was done in collaboration between SA and JB, where SA was in charge of the motion capture data, and JB the audio and video. The retargeting algorithm was implemented by SA building on top of previous work by JB. The experiments were executed by SA with input from JB. SA did the statistical analysis and the main part of the write-up with contributions by JB.

Paper E

Alexanderson, S., O'Sullivan, C., Neff, M., & Beskow, J. (2017). Mimebot - Investigating the Expressibility of Non-Verbal Communication Across Agent Embodiments. ACM Transactions on Applied Perception (TAP), 14(4), 24.

The idea of using mime to elicit clear and stylized expression for artificial characters and agents came from SA. The corpus design was a collaborative effort between SA, JB, CS and MN. SA wrote the manuscripts together with Henrik Bäckbro and Alejandro Bonnet at the Stockholm Academy of Dramatic Arts. The corpus collection was led by SA with help from JB and Dimosthenis Kontogiorgos and Patrik Jonell at TMH. The post-processing, 3D modelling, retargeting and rendering was done by SA. SA designed and executed the experiments with input from JB, CS and MN. Analysis was done by SA and CS, with input from JB and MN. Writing was led by SA, with significant contributions by CS, MN and JB.


Additional publications

In addition to the papers comprising this thesis, the author has contributed to the following publications.

House, D., Ambrazaitis, G., Alexanderson, S., Ewald, O., Kelterer, A. (2017). Temporal organization of eyebrow beats, head beats and syllables in multimodal signaling of prominence. In International Conference on Multimodal Communication: Developing New Theories and Methods. Osnabrück, Germany.

Karipidou, K., Ahnlund, J., Friberg, A., Alexandersson, S., & Kjellström, H. (2017). Computer Analysis of Sentiment Interpretation in Musical Conducting. In IEEE Conference on Automatic Face and Gesture Recognition.

Zellers, M., House, D., & Alexanderson, S. (2016). Prosody and hand gesture at turn boundaries in Swedish. In 8th Speech Prosody 2016, 31 May 2016 through 3 June 2016 (pp. 831-835). International Speech Communications Association.

Alexanderson, S., O’Sullivan, C., & Beskow, J. (2016, October). Robust online motion capture labeling of finger markers. In Proceedings of the 9th International Conference on Motion in Games (pp. 7-13). ACM.

Beskow, J., Alexanderson, S., Stefanov, K., Claesson, B., Derbring, S., Fredriksson, M., ... & Axelsson, E. (2014). Tivoli: Learning Signs Through Games and Interaction for Children with Communicative Disorders. In 6th Biennial Conference of the International Society for Augmentative and Alternative Communication ISAAC.

Alexanderson, S., House, D., & Beskow, J. (2013, August). Aspects of co-occurring syllables and head nods in spontaneous dialogue. In AVSP (pp. 169-172).

Alexanderson, S., House, D., & Beskow, J. (2013). Extracting and analysing co-speech head gestures from motion-capture data. Proceedings of Fonetik 2013, 1.

Alexanderson, S., House, D., & Beskow, J. (2013). Extracting and analyzing head movements accompanying spontaneous dialogue. In Proceedings of Tilburg Gesture Research Meeting, Tilburg.


Edlund, J., Alexandersson, S., Beskow, J., Gustavsson, L., Heldner, M., Hjalmarsson, A., ... & Marklund, E. (2012). 3rd party observer gaze as a continuous measure of dialogue flow. In LREC-The eighth international conference on Language Resources and Evaluation. LREC.

Alexanderson, S., & Beskow, J. (2012). Can Anybody Read Me?-Motion Capture Recordings for an Adaptable Visual Speech Synthesizer. In The Listening Talker.

Al Moubayed, S., Alexandersson, S., Beskow, J., & Granström, B. (2011). A robotic head using projected animated faces. In Auditory-Visual Speech Processing 2011.

Beskow, J., Alexandersson, S., Al Moubayed, S., Edlund, J., & House, D. (2011). Kinetic Data for Large-Scale Analysis and Modeling of Face-to-Face Conversation. In Auditory-Visual Speech Processing 2011.


Introduction

The human body provides a rich medium for us to express ourselves. Combined with speech, body movements give us a multitude of ways to communicate effectively, with great variation and nuance. Without the presence of speech, body movements alone have the potential of taking on the full communicative powers of language.

From an evolutionary perspective, the perception of movement has played an important role in survival. As humans, we are especially tuned to recognizing and interpreting human motion. Experiments with point light displays have shown that we are able to recognize human motion from very sparse representations [2], and we are not only able to categorize it broadly, but also to distinguish finer details such as gender and emotions [3]. The imaginative nature of our minds also allows us to combine shape and motion in flexible ways, and by applying human-like movements to inanimate objects, we can empathize and build social bonds with even simple embodiments with low human resemblance. In fact, looking at the long history of puppets, such simple embodiments have played an important role as proxies for thoughts, stories and ideas.

In recent years, embodied agents have seen increasing use as machine interfaces, with a steadily growing number of applications [4]. Virtual agents are frequently used in customer service applications, simulation, and computer games. Some applications benefit from the fact that the mere presence of an embodiment has positive effects on user experience [5], [6]. Others draw on non-verbal communication as an additional communicative channel and aim to make the interface more natural, efficient and engaging [4]. Parallel to the development of virtual agents, there has been an increased interest in human-like interfaces in the robotics community. In the future, with the increased flexibility and intelligence of robots, we can expect to see more scenarios where humans and robots interact and collaborate in everyday life [7]. In such scenarios, non-verbal behavior is important both to ensure efficient task performance and for safety reasons. By using gesture, head pose and gaze, the robot can make it easier for the human collaborator to predict and understand its actions and intentions.

Building computational models of human non-verbal behavior is a key to this development. While synthetic non-verbal behavior has traditionally been produced with manually coded rules and animations, more and more research in the area has shifted to data-driven methods learned from observations of human interaction. This has led to an increased focus on recording large-scale multimodal corpora. During the last few years, the development has been further fueled by the success of deep learning methods, which increases the demand for large data collections.

Unfortunately, using data recorded from natural human interaction as the basis for data-driven synthesis of non-verbal behavior comes with some major pitfalls. First, capturing human motion is a non-trivial task that involves both a large amount of manual post-processing and an often crude approximation of the original movement. The most informative body parts for non-verbal communication, i.e. the face and fingers, are especially challenging to capture in detail. Secondly, the target embodiment usually lacks many of the communicative degrees of freedom of the full human body. Consequently, many of the subtle ways of human communication are lost in translation and the resulting movements may end up being incomprehensible or misunderstood.

Another fundamental problem is that people in natural interaction usually do not display the level of clarity that we may require of an embodied agent. In a human-robot collaboration scenario, we arguably want non-verbal communication to be displayed as clearly and readably as possible. Unfortunately, this is not how people behave in natural interaction. As humans, we tend not to over-articulate our communication. Instead, we adapt our way of communicating to the context and surrounding environment in order to be understood with minimal effort [8].

The perspective in this thesis is that, in order to improve on data-driven methods for synthesis of non-verbal behavior, we need to address the following problems and questions. First, we need to develop robust methods to capture the finer details of human motion, and these methods should be automated enough to enable the collection of large-scale corpora without extensive manual labor. Second, we need to study the perceptual effects of transferring human motion to different embodiments and investigate what gets lost in the transfer. Finally, we need to design strategies to compensate for the loss of information caused by embodiment.

Part I of the work presented in the thesis is devoted to the automatic capturing and processing of the finer details of non-verbal communication to enable the collection of large-scale databases for data-driven synthesis. Special focus lies on fingers and faces, as these parts play an important role in communication and are the most challenging to capture by traditional methods. They are also typically missing in existing motion capture corpora.

Part II investigates ways to compensate for limitations of target embodiments by increasing the intelligibility of non-verbal behavior. Two approaches are used to accomplish this. First, we employ the ways people adapt their audiovisual speech production in order to facilitate clear communication under adverse conditions (such as noisy environments). Second, we draw from the traditions of mime acting to generate clear non-verbal communication. Common to both these approaches is the use of exaggeration and simplification as fundamentals for clear communication. According to recent theories of learning, these principles are among the key elements of learning by demonstration, and the ability to use pantomime, i.e. to represent real world objects and actions with imaginary gestures, was a major advantage in human evolution [9]. Another recent theory by Clark [10] stresses the importance of performed depictions as a fundamental way of communication. Examples of depictions are iconic gestures, facial gestures, quotations of all kinds, full-scale demonstrations, and make-believe play. Clark recognizes mimes as true experts in this form of communication.

1.1 Scope of the thesis

The focus of the work lies in the challenges of capturing expressive motion from humans, transferring it to avatars and agents with different embodiments, and evaluating the perceptual effects of the transfer. The challenges of artificial intelligence or of modelling autonomous behaviors are not covered. In the following sections, we refer to avatars as virtual characters (rendered in real-time or offline) directly driven by humans, and to agents as those exhibiting autonomous control. Although the methods developed and the insights gained from the studies are general to many fields and applications for both avatars and agents, the context of the work is data-driven synthesis of non-verbal communication for embodied agents. During the progress of the work, different embodiments and applications have been explored. For example, the work presented in paper A was for a key-word signing avatar, paper D for a virtual talking head, and paper E was intended for social robots. Together this covers a large range of the challenges encountered in multiple scenarios for artificial characters.


Background

2.1 Avatars and agents

With the rapid development of 3D technology and robotics, artificially created human-like characters have become ubiquitous in society. We see virtual characters playing lead roles in movies, we use them as alter egos in computer games and ask them for information online. At the same time, robots displaying human-like behaviors are expected to take an increasing role in society, for example by caring for the elderly [11], [12], helping children in their learning [13], and collaborating with humans in factories and other workplaces [14].

A challenge when designing motion for these types of artificial 'humans' is to decide which 'human-like' features to display and to find a way to generate appropriate behaviors for the intended application. One important design factor is the level of realism the character's embodiment is capable of. While virtual characters in movies can display impressive realism with the aid of large teams of animators and long rendering times, interactive characters requiring real-time rendering are more limited, and situated robots even more so. For example, while a human skeleton has about 200 joints [15], standard skeletons used for virtual agents have about 20 (50 if the fingers are articulated) [16], and the humanoid robot NAO has only 25 degrees of freedom (DoF) [17]. The added complexity of musculature and facial expression makes it a grand challenge to represent the fine details of human expression on these types of entities.

The main methods of creating animation data are by using procedural techniques, by manually authoring poses (key-frame animation) and by motion capture. Procedural techniques automatically create animation in real-time by algorithmically constraining it to some given parameters. Procedural techniques include specifying arm and hand movements by using inverse kinematics, constraining eye, head and neck rotations by setting a gaze target, and generating motion from physics simulations. In 3D animation, procedural techniques are an important part of the character rigging process and are typically used for spine twisting, finger flexion and arm rolls. For virtual agents, procedural techniques have been used to generate animation of non-verbal behavior and sign language from descriptive markup languages such as BML [18], [19] and SiGML [20]. While procedural techniques give high-level control of movements and may automate many tasks, it may be challenging to create suitable parameterizations for more complex behaviors, and key frame animation may be favoured for more fine-grained control.
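As a minimal illustration of the procedural idea, the sketch below (our own Python example, not code from the thesis; the coordinate conventions and function names are assumptions) computes head rotations that aim a character's head at a gaze target:

```python
import numpy as np

def aim_head_at(head_pos, target_pos):
    """Compute yaw and pitch (radians) that orient a head joint toward a gaze target.

    Assumes a right-handed coordinate frame with x to the right, y up and z forward,
    and a neutral head pose looking along +z.
    """
    d = np.asarray(target_pos, dtype=float) - np.asarray(head_pos, dtype=float)
    d /= np.linalg.norm(d)
    yaw = np.arctan2(d[0], d[2])                     # rotation about the vertical axis
    pitch = np.arctan2(d[1], np.hypot(d[0], d[2]))   # rotation about the lateral axis
    return yaw, pitch

# Example: head at eye height looking at a target up and to the right.
print(aim_head_at([0.0, 1.6, 0.0], [0.5, 1.8, 2.0]))
```

A full procedural rig would additionally distribute the rotation over the eyes, head and neck and clamp it to joint limits, but the constraint-driven principle is the same.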

Key frame animation is generated by manually posing characters at different time frames and interpolating the motion between poses. This allows for high control and artistic freedom, and thus makes it possible to produce both realistic and stylized cartoon-like animation. However, the quality of the results is highly dependent on the skill of the animator, and acquiring this skill takes years of training and practice. A major obstacle to using key-frame animation for large-scale corpora is the sheer amount of time it takes to produce. Even a well-trained professional animator cannot produce more than a few minutes of data per day.
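To make the interpolation step concrete, here is a small sketch (our own illustration, not tied to any particular animation package) that linearly interpolates one joint angle between key frames; production tools would use splines and quaternion interpolation for rotations, but the principle is the same:

```python
import numpy as np

def interpolate_keyframes(key_times, key_angles, query_times):
    """Linearly interpolate a single joint angle (degrees) between key-framed poses."""
    return np.interp(query_times, key_times, key_angles)

# Keys: rest pose, raised arm, nearly back to rest.
key_times = [0.0, 0.5, 1.2]      # seconds
key_angles = [0.0, 80.0, 10.0]   # degrees
print(interpolate_keyframes(key_times, key_angles, np.linspace(0.0, 1.2, 7)))
```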

The use of motion capture to record human subjects has many benefits compared to procedural and key-frame animation. It allows for rapid recording of data, and the data is realistic by nature. Motion capture data may be used for analysis and annotation of human behavior and for output realization of animated movements. Using data-driven methods and large databases of recorded subjects, the captured motion may not only be used for re-animation, but also as a basis for synthesis of novel motion.

However, while movements of the larger parts of the body such as running, jumping and kicking are fairly uncomplicated to record and process, motion for human communication requires more detailed capturing of the hands, fingers and face. Unfortunately, these parts of the body have proven to be especially hard to capture with standard methods, and usually require an extensive amount of manual post-processing. In the worst cases, the data may not be usable at all, or the post-processing can take longer than key-framing the movements frame by frame. Another challenge is that the motion captured performances are typically applied to agents or avatars with a different appearance than the original performer. For example, motion of the same actor is commonly applied to a multitude of different characters in crowd simulations [21], [22], and these characters may be more or less human-like and have different levels of stylization. Questions arise whether the original performance should account for this transformation or not. The question is even more crucial when the communicative capabilities of the embodiment are severely limited, as for humanoid robots.

These questions are related to two dominant traditions within western acting, the 'naturalistic' acting methods as taught by Stanislawski [23], Meisner [24] and Strasberg [25], and the 'physical' acting methods as originating from the traditions of puppetry and mime [26], [27]. Naturalistic acting focuses on realism and authenticity, with the actor acting from 'within' as a transmitter of the role character's feelings and emotions. Physical acting relies less on the spoken word and takes the body as the main means of expression. The principles of exaggeration and simplification are used to make the expression clearly readable to the audience [28]. In the perspective of this thesis, we rely on 'naturalistic' acting for embodiments with higher degrees of realism (such as realistic virtual avatars), and 'physical' acting for stylized characters or limited embodiments (such as robots), requiring a more clearly readable behavior. This is in line with the tradition of animation, where exaggeration and simplification are common principles for creating readable and effective animation [29]. Further discussions on how artistic knowledge can be applied to virtual characters are given in [28].

2.2 Data-driven animation

The modeling and generation of non-verbal behaviors for artificial agents have come a long way since the first systems were developed [4]. While the early systems used hand-coded rules and animations, there has been a strong trend towards data-driven approaches informed by examples of human behavior. Data-driven approaches have been used to train the rules of rule-based systems, to train probabilistic models of behaviors and to synthesize output realizations. This shift has also led to an increasing focus on designing scenarios and methods for recording human behavior. Due to the large variation in how people interact in different situations, an obvious approach is to limit the context of the corpora to a small set of behaviors of interest.

A distinction may be drawn between corpora mainly designed for analysis/modelling and corpora collected for realization/synthesis. While the former focus on modelling and recognizing generalized behavioral patterns, the latter are more concerned with the output synthesis for a specific target agent. As an analogy, corpora for speech recognition usually contain a large number of individuals under various conditions, while corpora for speech synthesis are usually modelled after a specific individual recorded in an extremely controlled manner. A discussion of requirements and considerations for designing multimodal corpora is given in [30]. Below, we review some general methods for data-driven animation. We then follow with an overview of gesture synthesis and sign language avatars.

One way to synthesize novel motion from motion captured data is by using motion graphs [31], [32]. Motion graph based methods work by cutting the original data into small fragments and reassembling them in a novel order. Motion graphs can be applied to the whole body motion or the individual limbs separately [33].
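The sketch below is our own toy illustration of the motion-graph idea, not the algorithms of [31], [32]: frames whose poses are similar enough to blend become extra transitions, and novel motion is a walk through the resulting graph.

```python
import numpy as np

def build_motion_graph(poses, threshold):
    """Add a transition i -> j wherever two frames are similar enough to cut between.

    poses: (n_frames, n_dofs) array of joint angles. The natural continuation
    i -> i + 1 is always allowed, so only the extra transitions are stored here.
    """
    dists = np.linalg.norm(poses[:, None, :] - poses[None, :, :], axis=-1)
    n = len(poses)
    return {i: [j for j in range(n) if j != i and dists[i, j] < threshold]
            for i in range(n)}

def random_walk(graph, n_frames, start=0, length=200, seed=0):
    """Synthesize a novel frame sequence by randomly following transitions."""
    rng = np.random.default_rng(seed)
    frame, path = start, [start]
    for _ in range(length):
        options = [f for f in graph[frame] + [frame + 1] if f < n_frames]
        frame = int(rng.choice(options)) if options else start  # restart if stuck
        path.append(frame)
    return path

# Toy example with random "poses"; real input would be motion capture frames.
poses = np.random.default_rng(1).normal(size=(200, 30))
graph = build_motion_graph(poses, threshold=6.5)
print(random_walk(graph, len(poses))[:20])
```

Real implementations compare short windows of frames rather than single poses and blend across the cut, but this captures the cut-and-reassemble structure.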

Other methods synthesize animation using statistical models. Variants of Gaussian Process Latent Variable Models (GPLVM) [34] have been used to decompose motion data to a low-dimensional manifold which can then be used for output synthesis [35], [36]. More recently, there has been a burst of studies using deep learning to synthesize motion data. These studies include trajectory-driven synthesis of locomotion [37] as well as speech-driven head pose estimation [38], lip sync [39], [40], [41] and facial animation [42].

A fundamental requirement for data-driven synthesis is the availability of large databases of representative example motion. For example, the method proposed in [37] is trained on about 14 hours of data (the CMU mocap database plus new recordings), the synthesis of facial animation in [40] is trained on the IEMOCAP database [43], containing 12 hours of emotional conversation, and [38] uses a database containing about 6 hours of emotional dyadic conversation. One can note the lack of large corpora with articulated finger motion, or with full performance capture (face, gaze, fingers and body). Acquiring such detail is still a challenging and costly task requiring expert knowledge. One example of a highly detailed motion capture data collection is the sign language corpus presented in [44]. The motion capture for this corpus was done in a professional motion capture studio (MocapLab) and contains about twenty sentences.

In papers A and B, we propose two methods to address the challenges of recording large-scale motion capture corpora. In paper E we use this knowledge to collect a highly structured corpus with about one hour of performance capture (body, fingers, facial expression and gaze).

2.3 Gesture synthesis

The ability for conversational agents to gesticulate with their hands has a high impact on the naturalness and efficiency of the interaction. In this section we review work on synthesis of gesticulation for conversational agents. A detailed overview of gesture and speech in interaction is given in [45].

Following Kendon [46], hand gestures can be placed along a continuum where their co-occurrence with speech is more and more optional.

• Gesticulation: Gestures co-occurring with speech.

• Speech-framed gesture: Gestures filling a slot in speech or completing a sentence, as in the phrase "He went [making a flying gesture] out the window".

• Emblems: Conventionalized culture-specific gestures that may replace words. Examples are the ’victory’ or ’thumbs up’ signs.

• Pantomime: Gestures produced without speech, conveying a narrative or staging imaginary scenes.

• Sign Language: Has lexical words and full grammars.

Moving along the continuum, gestures become increasingly language-like and may take over more of the communicative functions of speech. On one side of the continuum are co-speech gestures, which are unconsciously produced in conjunction with speech. On the other side are sign languages, which are full languages with their own morphology, phonology and syntax [47], [48]. Kendon's continuum is related to the work in this thesis as it indicates how much information needs to be communicated visually, and thus provides a guideline for the requirements of capturing the motion. While synthesis of gesticulation may tolerate some approximation, animating emblems and sign language requires higher fidelity motion as the information is solely encoded in the visual domain. Figure 2.1 shows how the papers correspond to the continuum.


Figure 2.1: Kendon's continuum and related papers.

Several categorizations of gesture have been proposed. One of the most widely adopted is the one by McNeill [49]. Here, gestures are categorized into the following groups:

• Iconics: Presents images of real objects and actions by depicting them with shape, such as outlining the shape of a vase.

• Metaphorics: Symbolic representations of abstract concepts such as cupping the hands to represent an idea.

• Beats: Simple and fast movements of the hands used to emphasize spoken discourse.

• Deictics: Pointing gestures used for direct spatial reference (such as pointing directly towards a person or object) or abstract reference (such as referring to concepts from a previous utterance).


The basic challenges that need to be addressed for gesticulation synthesis are to select which gesture to generate and then to realize the output animation. These have been referred to as the selection and animation problems [50].

An important consideration for gesture selection is what underlying representation to use as the basis for synthesis. Several studies base synthesis on text input. For example, the BEAT system [51] used a markup language for texts describing functional aspects (theme/rheme, emphasis, contrast, topic-shifts, turn-taking and grounding), and generated behaviors using a rule-based approach. Other work applies statistical methods to generate individualized gestures from annotated corpora [52], [53], [54]. Individualized gesture has been shown to outperform averaged gesture in terms of likeability, competence and human-likeness [54]. Text-based approaches have many attractive properties. They can capture the content and context of the current conversational state and be used in conjunction with text-to-speech synthesis to generate more natural non-verbal behavior. However, using text alone as input limits the output gestures to conveying information that is redundant with the text. An important property of non-verbal behavior is its complementary function to increase information bandwidth [55]. For example, iconic gestures are efficient for indicating shape and form, and deictic gestures are the preferred way to show focus of attention and to disambiguate spatial relations [56]. Studies have shown that gestures conveying complementary information to the speech channel increase the efficiency of communication [57].

Gesture and speech are commonly thought of as originating from a shared communicative intent (cf. McNeill's growth point theory [49]). Efforts in the SAIBA framework (Situation, Agent, Intention, Behavior, Animation) have resulted in an architecture comprised of three stages: the intent planner, the behavior planner and the behavior realizer [18]. Two XML-based representations have been developed to communicate between the modules. The Functional Markup Language (FML) is used to send messages from the intent planner to the behavior planner, and the Behavior Markup Language (BML) is used to send messages from the behavior planner to the behavior realizer. FML acts on a higher level and represents concepts such as person characteristics, communicative actions, content, mental state and social-relational goals [58]. BML specifies utterances and nonverbal behaviors such as facial expressions and gestures [18], [19].

Other studies use prosodic features as the base for output synthesis. In a first study by Levine et al. [59], a method based on Hidden Markov Models (HMM) is used to generate beat-like gestures from speech. In a second study [60], they instead propose using Conditional Random Fields (CRF) to avoid problems with overfitting. A similar approach was adopted by Chiu et al. [61], although using GPLVMs to generate more natural gesture transitions.

Important for gesture synthesis are the temporal characteristics of the hand movements. According to the classification of Kendon [46], gesticulation is temporally divided into units, phrases and phases. A gesture unit is defined as a movement path of the hands outgoing from one rest position and ending in another. Each gesture unit contains one or several phrases, which in turn are comprised of one or several phases. The gesture phrases each have a mandatory stroke phase and optional preparation, hold and retraction phases. The stroke is the main event of the gesture and is temporally aligned with concepts conveyed in speech. The preparation phase acts to position the hand for the stroke, and the hold phases are used for coordination of the stroke and the speech. The retraction phase ends the gesture unit by placing the hand at a rest position.
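A sketch of how this hierarchy might be represented in code (our own illustration of Kendon's terminology, not a data structure used in the thesis):

```python
from dataclasses import dataclass, field
from typing import List

@dataclass
class Phase:
    kind: str          # "preparation", "stroke", "hold" or "retraction"
    start: float       # seconds
    end: float

@dataclass
class Phrase:
    phases: List[Phase] = field(default_factory=list)

    def stroke(self) -> Phase:
        # Every gesture phrase has exactly one mandatory stroke phase.
        return next(p for p in self.phases if p.kind == "stroke")

@dataclass
class GestureUnit:
    # A unit spans from leaving one rest position until reaching the next.
    phrases: List[Phrase] = field(default_factory=list)

unit = GestureUnit(phrases=[
    Phrase([Phase("preparation", 0.0, 0.3), Phase("stroke", 0.3, 0.6), Phase("hold", 0.6, 0.9)]),
    Phrase([Phase("stroke", 0.9, 1.2), Phase("retraction", 1.2, 1.6)]),
])
print(unit.phrases[0].stroke())
```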

To generate natural gesticulation, not only individual gesture phrases need to be produced, but the gestures should also be composed into gesture units. In a study by Kipp et al. [62], a character was perceived as more friendly and trustworthy when gesturing with multiple phrase gesture units than when gesturing with singleton gestures.

Gesture synthesis methods commonly rely on manually annotated corpora. In paper C, we present an automatic method for gesture annotation using motion capture data as input.
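The sketch below is not the HHMM of paper C; it only illustrates the general idea of segmenting motion into rest and gesture states with a hidden Markov model, assuming the hmmlearn library is available and using a simple hand-speed feature:

```python
import numpy as np
from hmmlearn import hmm

def segment_gesture_units(hand_positions, fps=120, n_states=2, seed=0):
    """Label each frame with a hidden state (e.g. rest vs. gesture).

    hand_positions: (n_frames, 3) wrist or hand trajectory from motion capture.
    Returns a per-frame state sequence; which state corresponds to 'rest' must be
    identified afterwards, e.g. as the state with the lowest mean speed.
    """
    speed = np.linalg.norm(np.diff(hand_positions, axis=0), axis=1) * fps
    features = speed.reshape(-1, 1)
    model = hmm.GaussianHMM(n_components=n_states, covariance_type="diag",
                            n_iter=50, random_state=seed)
    model.fit(features)
    return model.predict(features)
```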

2.4 Sign language avatars

During recent years, there has been a substantial research effort to develop assistive technology for the Deaf population. This group of people is especially at a disadvantage when it comes to communicating with society and accessing information, for example in education and social services [63], [64]. Loudspeaker announcements in public spaces rely heavily on voiced forms of spoken language and are thus inaccessible to the non-hearing. Technological solutions may greatly improve access to such information.

Several long-term research projects have been funded to develop sign language technology, such as the ViSiCAST [20], Dicta-Sign [65], eSign [66] and SignCom [67] projects. An early avatar system was developed by Wells et al. [68] to generate signed representations of TV subtitles. The system generated gloss-to-gloss animation (corresponding to Sign Supported English) by concatenating and interpolating between signs. The Tessa system [69] generated British Sign Language phrases by replaying movements recorded with motion capture technology. The phrases were set in a post office context and the system could replace individual signs in the phrases, but not generate novel sentences. Efforts in the ViSiCAST and Dicta-Sign projects led to the development of the Signing Gesture Markup Language (SiGML) [20], which is a descriptive representation based around the HamNoSys [70] phonetic transcription system. To use SiGML to generate new signs, properties such as hand pose, sign location and direction are specified in XML format, and the system generates animations using procedural techniques. A benefit of this approach is that it generates phonetically correct signs and enables fine control over the movements. Also, as the format allows for specification of facial and body landmarks important for different signs, synthesis can be transferred across different avatars. A drawback is that it requires a high level of linguistic knowledge to specify the signs and to add new signs to the vocabulary. Also, the options for generating non-manual (facial) features are limited, and the use of procedural techniques makes it challenging to achieve natural and fluid animation.

Figure 2.2: Renderings of the corpus collected in the Sign3D project [44] using optical motion capture and an eye-tracker. Images courtesy of MocapLab.

Other studies employ data-driven methods for sign language synthesis based on motion capture data [71], [67], [72]. Benefits of data-driven methods are the increased naturalness of the generated movements and the possibility to use the data for both analysis and output animation. For example, Lu & Huenerfauth [71] use motion captured signs to synthesize inflected verbs in American Sign Language. The location and direction of such verbs depend on where the signer has placed subjects and objects in sign-space. Gibet et al. [72] use an annotated corpus of highly detailed motion capture data for synthesis of signs for a museum scenario. An in-depth discussion of the requirements and challenges of data-driven sign language synthesis is given in [72].

What technology to use to capture sign language has been a topic of discussion among research groups. Lu & Huenerfauth [73], for example, highlight the problems with occlusions inherent in all optical motion capture systems, and employ a combination of non-optical sensors (CyberGloves, an eye tracker, and inertial, magnetic and acoustic sensors for the body and head) in their corpus of American Sign Language. The corpora recorded in the SignCom [67] and the Sign3D [44] projects, however, used optical marker-based motion capture for better accuracy and higher frame rates. Renderings from the data show that highly accurate and expressive animations could be achieved (see Figure 2.2). A downside of using optical motion capture is, however, the high costs of capturing and processing the data, which may limit the size of the corpus.

Other researchers have focused on developing tools to facilitate the authoring and editing of 3D animated signs. As current animation software requires high expertise, and thus is inaccessible to lay-people, more intuitive and natural authoring systems may help signers generate sign animation directly. Heloir & Nunnari [74] propose a system to author sign language animations using the open-source Blender software and the low-cost sensors Kinect and Leap Motion. The signs are produced in three consecutive steps. First, general hand, body and facial motions are performed in front of the Kinect and recorded in Blender. Then the resulting key-frames are trimmed, and finally the signs are edited using the Leap Motion device as a natural user interface. Other editing methods include the work of [75], [76], [77].

Outside the Deaf community, there are groups of people who use signs complementary to speech in order to enhance communication. This group includes people with various disabilities such as developmental disorders, language disorders and autism. Within this group, Key Word Signing (KWS) is a well established method of augmentative and alternative communication (AAC). The work in paper A (see Section 3.1) describes the development of an avatar designed to demonstrate key word signs to the players of a computer game for practicing KWS. One of the main challenges was to capture detailed finger movements in a cost-effective way requiring low amounts of manual post-processing. Such methods may not only have impact on research on signing avatars, but also more generally on avatars and agents [78].


Part I

Automatic processing of motion data


Automatic processing of motion data

In this chapter, we start with an in-depth review of optical motion capture technology, followed by a brief discussion of glove-based systems (which provide an alternative method for capturing finger motion). The complementary natures of optical and glove-based systems led to the dual-sensor method proposed in paper A. We then continue with a review of methods for automating the cleanup process for optical motion capture, leading to the method for labelling finger and facial marker sets presented in paper B. Finally, we discuss automatic annotation of gestures, and introduce the approach for gesture unit annotation proposed in paper C.

3.1 Motion capture

One of the main objectives in this thesis is to provide motion capture data for data-driven synthesis of talking and gesturing avatars. To ensure efficient and comprehensible communication, this requires high accuracy and resolution of the face and finger capture. A dominant method is optical marker-based motion capture. According to a recent state-of-the-art report [78], this is the technology providing the most accurate capturing of hands and fingers, and it is the technology primarily used in this thesis. For capturing facial expression, video-based approaches using head-mounted cameras [79], or methods employing RGBD cameras [80], [81], are alternative methods. Recent developments in video-based landmark detection [82], [83] provide novel marker-less techniques for both face and finger tracking. Questions of intrusiveness, required frame rate and size of the capture volume need to be carefully considered before choosing a technology. The work presented here was primarily developed for finger capture, and later extended to facial marker sets.

Optical marker-based motion capture

Optical motion capture has become a dominant method of recording motion within many areas such as film, computer games and sports analysis. The technology provides accurate data at fast sampling rates, and the same system can be used to capture the motion of a wide range of structures, including objects, animals, human bodies, fingers and faces. By using passive (reflective) markers, all processing is done externally, and the captured subject does not need to wear electrical equipment or wires. The systems are comprised of specialized infrared (IR) cameras along with computers and software for image analysis and processing. The cameras detect small markers placed at strategic locations on the captured subjects. These markers can be either active or passive. Active markers emit infrared light detected by the cameras. Passive markers are coated with reflective material, which requires the cameras to emit the light, which is then reflected back and detected. In both cases, the cameras use the detections to triangulate the 3D locations of the markers. One main benefit of active markers is the possibility to discriminate the different marker identities by modulating the emitted light pattern. The main disadvantage is the added wires and electrical equipment that need to be attached to the subject. For this reason, passive markers have been the main choice in many applications. Typically, a system has a large number of cameras placed around and aimed towards the capture volume, which is the region with sufficient camera overlap for marker reconstruction.

After reconstructing the 3D points into a point cloud, the system needs to determine which point is which and label each point with a marker ID. This process is commonly referred to as marker labelling, and is non-trivial due to the fact that all markers look alike to the system, so the inference of marker IDs needs to rely on temporal and structural information only. For example, by placing three markers on a rigid structure forming a non-symmetrical triangle, the edge lengths of the triangle can be used to infer the marker identities [84]. A typical scenario starts with the actor performing a range-of-motion (RoM), i.e. a movement series where all joints are exercised. This provides the system with marker inter-relations for labeling. All subsequent recordings with the same actor start with the actor entering a specified pose (usually T-pose or A-pose) used to initialize tracking and to provide a reference pose for matching the markers to a skeleton model. Some systems come with pre-trained models and do not require the RoM. However, this may come with the drawback of less flexibility to create custom marker sets.
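To make the triangle example concrete, the sketch below (our own illustration, not the method of [84] or of paper B) assigns IDs to three unlabeled points by matching their pairwise distances against the known edge lengths of the marker triangle:

```python
import numpy as np
from itertools import permutations

def label_rigid_triangle(points, ref_points):
    """Assign marker IDs to three unlabeled 3D points on a rigid triangle.

    points     : (3, 3) array of reconstructed, unlabeled marker positions.
    ref_points : (3, 3) array of the same markers in marker-ID order, taken from
                 a reference pose (e.g. the range-of-motion recording).
    Returns a dict {marker_id: index into points}. The triangle must be
    non-symmetrical for the assignment to be unambiguous.
    """
    def edge_lengths(p):
        return np.array([np.linalg.norm(p[i] - p[j]) for i, j in [(0, 1), (1, 2), (0, 2)]])

    ref = edge_lengths(ref_points)
    best = min(permutations(range(3)),
               key=lambda perm: np.abs(edge_lengths(points[list(perm)]) - ref).sum())
    return dict(enumerate(best))

ref = np.array([[0.0, 0.0, 0.0], [0.05, 0.0, 0.0], [0.0, 0.09, 0.0]])  # asymmetric triangle (m)
observed = ref[[2, 0, 1]] + 0.001                                      # shuffled observation
print(label_rigid_triangle(observed, ref))                             # {0: 1, 1: 2, 2: 0}
```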

For human subjects, the marker data is used to estimate the kinematic motion of a bio-mechanical model of the human skeleton. This includes a problem of estimating the skeleton parameters, such as bone lengths and joint centers, and a problem of estimating the pose of the skeleton. Automatic methods to estimate skeleton parameters have been proposed in several studies [85], [86]. The estimated joint angles can thereafter be used for further processing or animation.
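For intuition, a minimal forward-kinematics sketch (our own illustration, not the solver used in the thesis) shows how estimated joint angles and bone lengths determine the pose of a planar two-bone chain; a skeleton solver inverts this relationship by optimizing the angles so that the model matches the observed markers:

```python
import numpy as np

def fk_two_bone(shoulder_angle, elbow_angle, upper_len=0.3, lower_len=0.25):
    """Forward kinematics of a planar arm: joint angles (radians) -> joint positions."""
    shoulder = np.zeros(2)
    elbow = shoulder + upper_len * np.array([np.cos(shoulder_angle),
                                             np.sin(shoulder_angle)])
    wrist = elbow + lower_len * np.array([np.cos(shoulder_angle + elbow_angle),
                                          np.sin(shoulder_angle + elbow_angle)])
    return shoulder, elbow, wrist

# A skeleton solver would search for the angles minimizing the distance between
# these predicted joint positions and the marker-derived positions.
print(fk_two_bone(np.deg2rad(45.0), np.deg2rad(30.0)))
```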

Although optical motion capture can provide highly accurate results, it comes with a number of associated problems, which may result in poor data quality and extensive costs in manual post-processing [87]. As with all vision-based technologies, optical motion capture requires a clear line of sight, and occlusions cause serious challenges. Increasing the number of cameras viewing the scene from more directions may help limit occlusions, but most capturing scenarios will inevitably contain blind spots regardless of the number of cameras (such as markers on the chests of two subjects hugging). Markers placed on fingers are especially problematic and often suffer from self-occlusions when the fingers are bent or the hands are facing towards the body or palm-up [78]. Occlusions do not only cause problems with missing data, they also make the labeling process more difficult as they reduce the available information for inference. Further challenges arise in situations when several markers come in close contact (such as clapping hands) or when multiple people interact [88].

Another challenge for optical systems is to capture highly detailed motion such as that of fingers and faces. Unlike the well-distributed movements of the larger parts of the human body, these movements have great mobility confined to small regions. To capture the details of such motion, a high number of small markers needs to be applied, and the cameras need to be placed closer together to provide enough separation of the marker detections in the camera views. For facial capture, head-mounted cameras can be used to increase the resolution, but capturing finger motion in large volumes is still a challenge attracting substantial research effort. Novel data-driven methods [89], [90], [91] have been shown to be capable of reconstructing high quality finger animation using reduced marker sets. This allows for larger markers and thus larger capture volumes. Unfortunately, the best placements for these markers are located on the outer parts of the hands, which leads to further problems with occlusion and labelling.

Generally, there is an inherent conflict regarding the number of markers to use. On one hand, attaching more markers to the subject is beneficial, as it provides more information for labeling and skeleton solving, and the redundancy of information may be used for automatic cleanup. On the other hand, adding markers increases the amount of gaps and noise caused by low marker separation, and additionally increases the amount of manual cleanup for the extra markers. Finding a good compromise for a given application has been a key effort. Hoyet et al. [92] investigate the perceptual effects of finger animation using marker sets of different sizes. For more complex hand and finger motion, such as counting and signing, the study recommends a smaller marker set of 8 markers (6 for the fingers, 2 for the thumb) rather than a full marker set (20 markers), as the perceptual differences were small.

To summarize, optical motion capture provides highly accurate data, but may require a large amount of manual post-processing when capturing regions with a high level of detail in small areas, such as fingers and faces. With additional challenges arising from occlusions, the amount of manual cleanup may even be prohibitive for applications requiring large amounts of data.

Instrumented Gloves

An alternative method to capture finger motion is provided by glove-based technologies. Glove-based systems started to be developed in the 1980s, and have since been used in a wide range of applications, including virtual reality [93], sign language [69], [73], gesture and robotics tele-operation [94]. Gloves have been developed using different sensors with various levels of accuracy, and may be more or less suited for different applications (see [95] and [96] for in-depth reviews). Here, we will focus on gloves using bend sensors (usually called instrumented gloves), as they are the preferred option in hand animation research.

Figure 3.1: Top Left: The hand has 27 bones: the 8 carpal bones of the wrist, the 5 metacarpal bones in the palm, the 5 proximal phalanges of the fingers and thumb, the 4 intermediate phalanges of the fingers and the 5 distal phalanges of the fingers and thumb. Top Right: The articulations of the hand are mainly confined to the fingers, although some movement of the metacarpal bones allows for cupping of the palm. Bottom Left: A five-sensor 5DT glove from Fifth Dimension Technologies. Bottom Right: A CyberGlove II from CyberGlove Systems.

Current designs, such as the CyberGlove (http://www.cyberglovesystems.com) and the 5DT data gloves (http://www.5dt.com/data-gloves/), are made of stretchable Lycra material and have sewn-in bend sensors covering the articulated joints (see Figure 3.1). The number of sensors varies from simple gloves with five long sensors covering all joints of each finger, to high-end gloves with up to 22 sensors covering each joint individually. The five-sensor gloves give an average estimate of the flexion-extension (bending) of the fingers but cannot register abduction-adduction (spreading). The high-end gloves can produce a more accurate estimate of all joint angles.

Glove-based systems have many appealing properties, including the possibility to use them in real time, in large spaces, and without the problems of occlusion. Unfortunately, the technology also has many drawbacks. One challenge is sensor cross-coupling, where the movement of one joint causes readings in multiple sensors. Typically, the bend sensors registering abduction-adduction are influenced by the flexion of the fingers. To provide accurate results, the gloves need frequent re-calibration and elaborate calibration protocols [97]. Another consideration is the high price point, which may put them out of range for many research projects. Furthermore, as gloves only measure joint angles relative to the wrist, another system needs to be adopted to capture the global positions and orientations of the hands.
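
To illustrate why calibration matters, the sketch below implements the simplest possible protocol: each sensor is mapped linearly to a joint angle between two recorded calibration poses (flat hand and fist). The raw values and angle endpoints are illustrative, and a per-sensor linear mapping cannot correct cross-coupling, which is precisely why the more elaborate protocols of [97] are needed.

```python
import numpy as np

def calibrate_linear(raw_flat, raw_fist, angle_flat_deg=0.0, angle_fist_deg=90.0):
    """Return a function mapping raw bend-sensor values to joint angles,
    interpolated linearly between two calibration poses (flat hand, fist).
    The angle endpoints are illustrative; real protocols are more elaborate."""
    raw_flat = np.asarray(raw_flat, dtype=float)
    raw_fist = np.asarray(raw_fist, dtype=float)
    span = np.where(raw_fist != raw_flat, raw_fist - raw_flat, 1.0)  # avoid /0

    def to_angles(raw):
        t = (np.asarray(raw, dtype=float) - raw_flat) / span  # normalized [0, 1]
        return angle_flat_deg + t * (angle_fist_deg - angle_flat_deg)

    return to_angles

# Five-sensor glove example (one sensor per finger), illustrative raw readings
to_angles = calibrate_linear(raw_flat=[120, 118, 122, 119, 121],
                             raw_fist=[240, 250, 245, 238, 242])
print(to_angles([180, 200, 150, 238, 130]))  # estimated flexion per finger, degrees
```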

In our work with the Tivoli project, we observed that the strengths and weaknesses of optical and glove-based motion capture are to a large extent complementary. While optical systems provide high accuracy and global positioning but have difficulties capturing the outer parts of the fingers, glove-based systems provide gap-free data but exhibit cross-coupling problems at the inner joints. This observation led to the dual-sensor approach presented in paper A.

Dual-sensor finger capture

In the Tivoli project [98], which forms the framework of paper A, we developed a game for learning and training key-word signs. The target group for the game is children with significant differences in communicative abilities, memory, attention span, and motor skills. The game was designed to give the children both a way to learn new signs demonstrated by a highly stylized avatar, and a way to practice them by using signs as input to solve different puzzles.

Two main technical challenges in the project were to develop the avatar presenting the signs and a system for recognizing signs input by the users. The work presented in paper A was dedicated to the former task. The paper presents the motion capture, animation and evaluation of the avatar used in the game. The main contribution lies in a novel dual-sensor approach to capturing detailed finger motion, circumventing the limitations of optical and glove-based motion capture. Due to limitations in the project budget, an important requirement was that the system be efficient and cost-effective. The method proposed in the paper uses two low-end systems: an optical motion capture system with 16 low-resolution cameras, and two low-end five-sensor gloves. The optical system was used to provide data for the full body, the face and the proximal parts of the fingers, while the gloves only provided data for the two distal joints of each finger. By combining the two data sources, full hand pose estimation could be achieved.

The system works as follows:

1. The signer is equipped with a motion capture suit and two gloves.

2. Markers are placed on the body, the face and all joints of the fingers.

3. The signer performs a RoM exercising the body, the face and the fingers.

4. The RoM is used for two things:

   a) To automatically fit skeleton models of the hands and fingers using the optical data. For this we implement the method by Miyata et al. [99].

   b) To train linear regression models using the glove sensors and the proximal joint angles as input and the joint angles at the distal joints as output.

5. During subsequent sessions, the markers on the distal parts of the fingers can be removed, and the corresponding joint angles can be predicted by the regression model (see the sketch below).
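
A minimal sketch of steps 4b and 5 is shown below, using ordinary least-squares regression from scikit-learn; the array shapes, placeholder data and variable names are illustrative and not taken from paper A.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Training data from the RoM take (shapes are illustrative):
#   glove_train:    (N, 10) raw values of the two five-sensor gloves
#   proximal_train: (N, P)  proximal joint angles solved from the optical markers
#   distal_train:   (N, D)  distal joint angles solved from the full 21-marker set
rng = np.random.default_rng(1)
glove_train, proximal_train = rng.random((1000, 10)), rng.random((1000, 15))
distal_train = rng.random((1000, 20))  # placeholder targets for the sketch

X_train = np.hstack([glove_train, proximal_train])
model = LinearRegression().fit(X_train, distal_train)

# In subsequent sessions the distal markers are removed; the distal joint
# angles are predicted from gloves + proximal joint angles alone.
glove_new, proximal_new = rng.random((1, 10)), rng.random((1, 15))
distal_pred = model.predict(np.hstack([glove_new, proximal_new]))
print(distal_pred.shape)
```

In practice, the training targets come from the full 21-marker solve of the RoM take, so no extra annotation work is required beyond the recording itself.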

The method provides many attractive features for applications requiring hand and finger animation:

• Contrary to the calibration procedures used for glove-based systems, the calibration and estimation of skeleton parameters are performed in an automated way using only a short RoM.

• Contrary to purely optical systems, the method requires only a few markers on the inner joints, which are relatively gap-free and easy to label.

• The system uses low-cost sensors and is thus accessible to projects with lower budgets.

A quantitative evaluation of the system was made by comparing joint angles from the dual-sensor method to those estimated using the full set of 21 markers on all joints. In this experiment, we used the motion from the RoM take and calculated the errors using 3-fold cross-validation. The results showed that reasonable accuracy could be achieved for the fingers (average errors ranging from 0.04 to 8.8 degrees), but that the thumb was more of a challenge (average errors ranging from 0.4 to 20.9 degrees).
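
The sketch below indicates how such a cross-validated error estimate can be computed; the exact protocol and error metric used in paper A may differ, and the synthetic data is only there to make the example runnable.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import KFold

def cv_joint_angle_error(X, Y, n_splits=3):
    """Mean absolute error per output joint angle, estimated with k-fold
    cross-validation. X: (N, F) inputs (glove values + proximal angles),
    Y: (N, D) reference distal angles from the full marker set, in degrees."""
    fold_errors = []
    for train_idx, test_idx in KFold(n_splits=n_splits).split(X):
        model = LinearRegression().fit(X[train_idx], Y[train_idx])
        fold_errors.append(np.mean(np.abs(model.predict(X[test_idx]) - Y[test_idx]), axis=0))
    return np.mean(fold_errors, axis=0)

# Synthetic data standing in for the RoM take (illustrative only)
rng = np.random.default_rng(2)
X = rng.random((300, 25))
Y = X @ rng.random((25, 20)) + rng.normal(scale=0.5, size=(300, 20))
print(cv_joint_angle_error(X, Y))
```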

Figure 3.2: Left: 5DT Glove with 21 markers attached. Right: Linear regression model fitted to the training data.

To complement the quantitative assessment, we performed a perceptual evaluation of the avatar using the full 21-marker animations and the dual-sensor animations as conditions. In a first experiment, we assessed the intelligibility and clarity of the signs using a free-text sign identification task and an absolute rating of clarity. In a second experiment, we assessed the perceived relative clarity of the two conditions. A total of 25 participants, self-reported to be fluent in sign language, completed the first experiment, and 21 of these completed the second. Despite the lower accuracy of the dual-sensor method, we did not find any significant differences in the perceptual evaluation in either of the experiments.

Although the experiments support the viability of the proposed method for capturing finger motion, as the expert users could both identify the signs in the stimuli and did not find the dual-sensor versions less clear, the study comes with some limitations affecting its more general applicability to sign language avatars. One limitation is that the signs were performed at a slower rate than is typical in sign language, as they were intended to be displayed to young children with communicative impairments. This presumably had an impact on the sign identification task, and signs produced at a faster rate may have been harder to identify. On the other hand, a slower signing rate may have facilitated the rating of the clarity of the hand shapes and signs, and here we did not find any significant degradation from our method. Another limitation is that our experiment was not specifically designed to assess the intelligibility of the more complex hand shapes used in sign language (such as the letters 'm' and 'n' in American Sign Language). Indeed, the evaluation of sign language avatars is a complex task and has been the topic of several papers [100], [101], [73].

Figure 3.3: Left: Baseline hand pose estimation using the full (21-marker) marker set. Right: Reconstructed hand poses using the dual-sensor method. Note the problems with estimating the thumb joint angles (bottom right).

As discussed in the paper, several improvements could be made both on the hardware side and in the algorithm. First, the gloves used in the paper have five long sensors covering all joints of the fingers, while our method only requires sensors covering the distal joints; redesigning the gloves with separate sensors for the outer joints may therefore improve the results. Second, later tests have shown that the prediction of the joint angles for the thumb improves considerably when an extra marker is placed further out (although this would require more data to label and clean). Finally, further gains may be possible using more advanced machine-learning approaches than the simple linear regression models used in the paper.

A further limitation is that the use of two different types of sensors introduces extra complexity to the setup. This includes the synchronization of the data from the optical system and the gloves, the added software needed to handle the two data streams, and the fact that the subject needs to wear both gloves and markers. Using an optical system as a single source of data would simplify the setup and make the approach practical for more general use with gesturing avatars and agents. To enable detailed, large-scale data collection using optical motion capture alone, we noted that the problem of labeling finger and facial markers needed to be solved. In the following sections, we give an overview of methods for automating the data-cleanup process and introduce the method presented in paper B.

Automatic data cleanup of optical motion capture data

The first step requiring manual intervention in a motion capture pipeline is data cleanup, which refers to labeling, gap-filling and filtering [87]. In a typical scenario, a motion capture specialist manually corrects erroneous labels, fills in gaps and filters the data to remove spikes and noise.

Several methods have been proposed to automate the different steps of the data cleanup process. A fundamental problem to solve is the labeling problem, as correct marker labels are a prerequisite for all subsequent automatic methods. As noted before, the challenge for automatic labeling is mainly due to gaps in the data, which create regions of missing information in the spatio-temporal marker relations. Without gaps, the labeling problem can be solved either in the temporal domain, using multiple-target tracking algorithms [102], or in the spatial domain, using shape constraints [84]. Most previous methods work according to the following procedure. First, the system initializes the labels, either by manual assignment or by automatically fitting the data when the subject enters a predefined pose [103], [104]. Tracking is then conducted by predicting candidate positions close to the labeled markers in the previous frame and optimizing the assignments based on spatial constraints and temporal smoothness. Spatial constraints may be supplied by other markers with rigid connections [105], [84], by an underlying skeleton model [103], [104] or by a bilinear spatio-temporal model [106]. If the system recognizes that markers are missing (due to occlusions), these are automatically reconstructed using the spatio-temporal model as a basis. After reconstruction, the labeling can proceed as before.
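
The sketch below illustrates the core tracking step shared by many of these methods: predict each labeled marker's position with a constant-velocity model and assign the current frame's unlabeled detections by minimizing the total distance with the Hungarian algorithm. Gap handling, reconstruction and spatial constraints are omitted, and this is not the method of any specific cited system.

```python
import numpy as np
from scipy.optimize import linear_sum_assignment

def track_frame(prev_pos, prev_vel, detections, max_dist=30.0):
    """Assign unlabeled detections in the current frame to labeled markers.

    prev_pos, prev_vel: (M, 3) position and velocity of each labeled marker
    detections:         (K, 3) unlabeled 3D points in the current frame
    Returns an array of length M with the detection index per marker,
    or -1 for markers considered occluded in this frame.
    """
    predicted = prev_pos + prev_vel                      # constant-velocity prediction
    cost = np.linalg.norm(predicted[:, None, :] - detections[None, :, :], axis=2)
    rows, cols = linear_sum_assignment(cost)             # optimal one-to-one assignment
    labels_to_det = np.full(len(prev_pos), -1, dtype=int)
    for m, d in zip(rows, cols):
        if cost[m, d] <= max_dist:                       # reject implausible matches
            labels_to_det[m] = d
    return labels_to_det

prev_pos = np.array([[0.0, 0.0, 0.0], [100.0, 0.0, 0.0]])
prev_vel = np.array([[1.0, 0.0, 0.0], [0.0, 1.0, 0.0]])
detections = np.array([[100.5, 1.2, 0.0], [1.1, 0.2, -0.1]])
print(track_frame(prev_pos, prev_vel, detections))       # -> [1 0]
```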

Some methods use a RoM to train their model [88], while others do not require pre-training [103], [104]. A problem with the above methods is that they require a marker set dense enough to allow accurate reconstruction of missing markers, and that inaccurate reconstruction impedes subsequent tracking. Unfortunately, sparse, non-rigid marker sets do not meet this requirement. In our work with the Tivoli project, we particularly noted the problems of labeling finger markers, and set out to develop an algorithm especially tailored to this problem. The efforts resulted in a multiple-hypothesis method presented in [1]. This paper was well received and invited for a journal extension, which is the work presented in paper B. In addition to labeling finger markers, the extended version covers simultaneous labeling of multiple marker sets (finger and face data), data-driven reconstruction of full marker sets from sparse data, and additional results for labeling data in a full performance capture.

After the labels are corrected, the remaining gaps need to be filled. Small gaps may be filled by interpolation or by copying motion from neighboring markers [87]. Larger gaps are commonly filled with manually added key frames. Automatic, data-driven methods for gap-filling have also been proposed. For example, the method developed by Baumann et al. [107] uses a nearest-neighbour search and optimizes the gap-filled trajectories for smoothness; they demonstrate their method using the HDM05 database [108] as the basis for reconstruction. Peng et al. [109] use non-negative matrix factorization and demonstrate their method on the HDM05 and CMU data sets. A problem with all data-driven methods for gap-filling is the assumption that similar motion exists in the prior database.
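
The sketch below shows the simplest strategy mentioned above, filling short gaps in a single marker trajectory by spline interpolation; the data-driven methods of [107] and [109] are considerably more involved.

```python
import numpy as np
from scipy.interpolate import CubicSpline

def fill_gaps(trajectory):
    """Fill NaN gaps in a single marker trajectory by cubic-spline interpolation.

    trajectory: (N, 3) array with NaN rows where the marker was occluded.
    Only suitable for short gaps; long gaps need data-driven or manual methods.
    """
    traj = np.array(trajectory, dtype=float)
    t = np.arange(len(traj))
    valid = ~np.isnan(traj).any(axis=1)
    spline = CubicSpline(t[valid], traj[valid], axis=0)
    traj[~valid] = spline(t[~valid])
    return traj

# A two-frame gap in the middle of a smooth trajectory (illustrative values, mm)
traj = np.array([[0.0, 0.0, 0.0], [1.0, 0.5, 0.0], [np.nan] * 3,
                 [np.nan] * 3, [4.0, 2.0, 0.0], [5.0, 2.5, 0.0]])
print(fill_gaps(traj))
```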

Robust real-time labeling of non-rigid marker sets

The work in paper B addresses the labeling problem for motion capture of non-rigid marker sets such as those applied to fingers and faces. As noted above, the method was first developed for finger capture only, but the generalization to facial marker sets enables the same algorithm to be used for full performance capture.

We based our algorithm on the following observations:

1. To be able to capture fingers in large volumes with minimal manual post-processing, large (ca 5-10 mm) markers and sparse marker sets need to be used.

2. Sparse marker sets for finger capture do not provide enough spatial information to disambiguate the pose of an underlying skeleton model, nor to reconstruct occluded markers.

3. Frequent occlusions make the temporal domain an unreliable source of information.

In addition to these observations, we added the following desiderata:

• The method should work in both online and offline modes.
