
Using a robot head with a 3D face mask as a communication medium for telepresence

MAGNUS GUDMANDSEN

KTH ROYAL INSTITUTE OF TECHNOLOGY


Degree project, second level

Using a robot head with a 3D face mask as a communication medium for telepresence

Användande av ett robothuvud med en 3D-ansiktsmask som kommunikationsmedium för telenärvaro

Magnus Gudmandsen maggud@kth.se

Master's Thesis in Computer Science

KTH, June 14, 2015

Supervisor: Gabriel Skantze
Examiner: Anders Lansner

In collaboration with Furhat Robotics AB
Supervisor: Preben Wik


Abstract

This thesis investigates the viability of a new communication medium for telepresence, namely a robotic head with a 3D face mask. In order to investigate this, a program was developed for an existing social robot, enabling the robot to be used as a device reflecting the facial movements of the operator. A study is performed with the operator located in front of a computer with a web camera, speaking through the robot to two interlocutors located in a room with the robot. This setup is then compared to a regular video call.

The results from this study show that the robot improves the presence of the operator in the room, along with providing enhanced simulated gaze direction and eye contact with the interlocutors. It is concluded that using a robotic head with a 3D face mask is a viable option for a communication medium for telepresence.

Sammanfattning

Denna uppsats utforskar gångbarheten för ett nytt kommunikationsmedium för telenärvaro, nämligen ett robothuvud med en 3D-ansiktsmask. För att undersöka detta har ett program utvecklats för en existerande social robot, som möjliggör att roboten kan återspegla operatörens ansiktsrörelser. En studie utförs med en operatör placerad framför en dator med en webbkamera, där operatören kopplas genom roboten för att prata med två samtalspartner som är i samma rum som roboten. Detta arrangemang jämförs sedan med ett vanligt videosamtal. Resultaten från studien visar att roboten förbättrar operatörens närvaro i rummet, och dessutom ger förbättrad simulerad blickriktning och ögonkontakt med samtalspartnerna. Slutligen fastställs att ett robothuvud med en 3D-ansiktsmask är ett gångbart kommunikationsmedium för telenärvaro.


Contents

1 Introduction
  1.1 Scientific Question

2 Background
  2.1 Established areas of telepresence
  2.2 The Mona Lisa Effect
  2.3 Future areas of telepresence
    2.3.1 Telerobotics
    2.3.2 Holograms
    2.3.3 Physical telepresence
    2.3.4 Social tele-operated robots
  2.4 The Furhat robot
  2.5 Uncanny valley
  2.6 Summary

3 Implementation
  3.1 System overview
  3.2 The Furhat toolbox
    3.2.1 IrisTK
    3.2.2 The animated face
    3.2.3 Synface
  3.3 Clmtrackr library
  3.4 System details
    3.4.1 Client side
    3.4.2 Server side
    3.4.3 Data flow
    3.4.4 Resampling and buffering

4 Evaluation
  4.1 The Desert Survival Task
  4.2 Areas of evaluation

5 Results
  5.1 Interlocutors
  5.2 Operators
  5.3 Limitations of the implementation

6 Discussion
  6.1 Interpreting the results
  6.2 Effects of the uncanny valley
  6.3 Future Work

7 Conclusion

Bibliography

A Facial Parameters

B Questionnaires


Chapter 1

Introduction

Looking at the history of communication, it is clear that it has improved greatly over the years. Early on, communication was simply a matter of speaking to each other. From there, it developed into writing letters, making it possible to communicate over large distances with the help of a courier. However, this method of communicating was hardly fast enough. Eventually, the telephone made it possible to communicate in real time.

With technology advancing, this was not enough. Communication by audio only might be enough for some situations, but hardly for all. After the breakthrough of television, technologies were developed for real-time communication with both audio and video, enabling full-scale video calls as we know them today. Given all these historical improvements, it is hard to believe that the development of communication media would stop here. The question that remains, however, is what the next medium is going to be.

Telephone and television are simply different media of communication for telepresence. With presence, this thesis refers to the definition as seen in the Oxford dictionary: The state or fact of existing, occurring or being present1, meaning it refers to how one's surroundings are perceived; you can be present not only in your physical environment [1], but also in a remote location - telepresence. Another definition of presence is the sense of being in an environment generated by natural means [2]. Steuer defines telepresence as the sense of being in an environment generated by mediated means [2].

An example of telepresence is the previously mentioned video call. Here, the appearance of a physical presence is enhanced through a video and audio stream, allowing you to both hear and see the callee despite the large physical distance. This illusion of another person being present in the room is created by displaying the video on a screen and streaming the audio from the speakers, or, put more simply, by a medium of communication: a video call.

1http://www.oxforddictionaries.com/definition/english/presence - March 15, 2015.


One commonly occurring problem with the video call, however, is the Mona Lisa effect (described in detail in Section 2.2). Because the face of the person is projected on a 2D screen, it is impossible to determine the gaze direction of the caller. According to Al Moubayed et al. [3], using a 3D surface solves this issue.

Rather than looking at the previous and established media of communication, this report will focus on future possibilities to improve telepresence. The ultimate example would be holograms, which bring great possibilities by providing a 3D visualization of a person in a room. This technology is, however, still largely a thing of the future, as explained in Section 2.3.2. This report will focus on a more readily applicable variant, namely using a robotic head with a 3D face as a medium of communication, which should remove the Mona Lisa effect and allow for better interaction.

Before looking into the background and going through the different media of communication for telepresence in detail, a clearer specification of what we will be looking for is needed. This is provided by the scientific question in Section 1.1.

1.1 Scientific Question

Is a robotic head with a 3D face mask a viable option for a future medium of communication for telepresence?

This question will be evaluated by investigating how well such a medium performs in a telepresence scenario, compared to a video call and actual presence, along with observing advantages and disadvantages.


Chapter 2

Background

This chapter will discuss previous work and areas related to the work that will be done in this thesis.

It starts with a brief explanation of existing areas of telepresence, such as telephone, television and video calls (2.1), followed by a clarification of the Mona Lisa effect (2.2), and then advances to future or not yet fully explored areas such as telerobotics (2.3.1), holograms (2.3.2), physical telepresence (2.3.3) and social tele-operated robots (2.3.4). Further on, it describes the Furhat robot (2.4) that will be used in the implementation, together with a presentation of the uncanny valley (2.5).

2.1 Established areas of telepresence

The most common areas of telepresence today can be generalized into four main areas: Radio, Telephone, Television, and Video calls. Since these areas are common knowledge, they will only be briefly presented as an introduction to the future areas.

Radio - Single-modality (audio only) in one direction.

The very simplest form of communication medium for telepresence. By hearing someone talking on a radio, the illusion of that person being present in the room is created. It is, however, not a very strong illusion (due to audio only being sent in one direction), meaning that it is not a very good medium for telepresence.

Telephone - Single-modality (audio only) in two directions.

Using telephones allows audio to be sent over a large physical distance, in both directions, simulating verbal communication. For telepresence, using telephones as a communication medium is a great improvement over the radio, as it actually allows for proper two-way communication.

Television - Dual-modality (audio and video) in one direction.

A huge stepping stone from the radio, the television (TV) allows both video and sound to be sent, providing possibilities to both see and hear in a distant location.


It is, however, just like the radio, only in one direction, greatly limiting its use for telepresence.

Video call - Dual-modality (audio and video) in multiple directions.

With a small modification to a television, adding a camera at both ends instead of only at one end, we arrive at a closely related but all the more relevant field: the video call. This is a medium that, together with audio calls (as used by the telephone), provides the majority of telepresence applications today. Using video calls, we can simulate a real conversation, where you can see facial expressions, such as mouth movements, and body language, such as hand movements. All of this is coupled with audio.

Video calls leave us at dual-modality in multiple directions. However, the display technology is still only two-dimensional, missing the third dimension: depth.

This introduces the problem of the Mona Lisa effect, explained further in Section 2.2.

2.2 The Mona Lisa Effect

Named after the famous painting by Leonardo da Vinci, this is an effect that is said to make it impossible to establish where Mona Lisa's gaze is directed [3]. It does not matter where you are standing in the room; it still appears that Mona Lisa is looking directly at you. The same effect has been observed with images and videos on flat-surfaced monitors, and is a recurring problem in telepresence, especially apparent in video conferences due to the multiple recipients.

In 2012, an article by Al Moubayed et al. [3] showed that the Mona Lisa effect applies not only to gaze, but to anything with a direction, such as a finger pointing towards the viewer. The article presents an extensive investigation of the perceived gaze direction of an animated face projected on a 2D surface compared to a 3D surface (a surface shaped as a human head with a face). The tests showed that the Mona Lisa effect appears on the 2D surface, where the test subjects did not agree on where the gaze of the animated face was directed. On the 3D surface, however, there were no traces of the Mona Lisa effect. The test subjects largely agreed on where the gaze was directed, even though the gaze was only directed at one of the five test subjects at a time. It was thereby concluded that using a 3D projection surface completely eliminates the Mona Lisa effect.

2.3 Future areas of telepresence

This section covers areas that are not yet fully explored. The technical possibilities may exist, but it will probably be a long time before they are used to their full potential. The telepresence areas of the future give us a new dimension: the physical dimension.


2.3.1 Telerobotics

With technology advancing, it is possible to use more advanced robots, and some particularly interesting ones inhabit the field of telerobotic surgery. Using a telerobotic setup for surgery allows expert surgeons to perform surgeries on the other side of the world. It also allows novice surgeons to have a mentor across the world join in on the surgery and assist with comments, and even step in at critical moments if necessary. Two such setups are the telerobots da Vinci and Zeus [4], depicted in Figure 2.1. These rigs allow the surgeon to sit in an ergonomically superior position compared to similar laparoscopic surgery (more commonly referred to as keyhole surgery), and give better control of the camera and robot arms, addressing another of the problems with laparoscopic surgery identified by Ballantyne [4]. The telerobots also give a 3D representation of the area that the surgeon is working on, instead of the 2D representation that laparoscopic surgery provides. Ballantyne argues that this is a way to reduce the disadvantages of today's laparoscopic surgery, and a solid advancement in the area of telerobotic surgery.

Figure 2.1: The telerobots da Vinci (left) and Zeus (right). With kind permission from Springer Science and Business Media [4].

Satava and Jones [5] describe further advantages of telerobotic surgery, such as extreme precision achieved by scaling down hand movements at a 100:1 ratio, permitting precisions of 10 µm. Furthermore, the system can filter out hand or finger tremors.

Together this gives a precision that is nowhere near beatable by the human hand (which has a maximum precision of approximately 200 µm, along with a hand tremor of 8-14 Hz). One important matter, however, is the latency. While telerobotics allows surgery over a large physical distance, it is vital that the delay introduced by the distance remains small.
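To illustrate the kind of processing involved, the sketch below scales hand displacements down at a 100:1 ratio and suppresses tremor-frequency components with a simple low-pass filter. The filter choice, cutoff frequency and sample rate are illustrative assumptions, not a description of how the da Vinci or Zeus systems actually work.

```python
import numpy as np

def scale_and_filter(hand_positions, scale=1 / 100, fs=1000.0, cutoff=6.0):
    """Scale hand displacements 100:1 and suppress tremor above `cutoff` Hz.

    hand_positions: 1-D array of positions (metres) sampled at `fs` Hz.
    A first-order low-pass (exponential moving average) stands in for the
    tremor filtering described in the text; real systems are more elaborate.
    """
    alpha = 1.0 - np.exp(-2.0 * np.pi * cutoff / fs)  # smoothing factor for the cutoff
    filtered = np.empty_like(hand_positions, dtype=float)
    state = float(hand_positions[0])
    for i, x in enumerate(hand_positions):
        state += alpha * (x - state)      # attenuates the 8-14 Hz tremor band
        filtered[i] = state
    return filtered * scale               # 100:1 motion scaling

# Example: a 1 mm hand movement with 10 Hz tremor becomes a ~10 µm instrument movement.
t = np.linspace(0.0, 1.0, 1000)
hand = 0.001 * t + 0.0002 * np.sin(2 * np.pi * 10 * t)
print(scale_and_filter(hand)[-1])         # roughly 1e-5 m, i.e. 10 µm
```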

2.3.2 Holograms

Another area that most people probably associate with the future is holograms. As early as 1977, holograms were depicted in the famous Star Wars movie, and today the field is showing some promise of a future where we interact with holograms. For example, a research paper published by Blanche et al. in 2010 [6] showed proof-of-concept techniques for near-real-time 3D holographic display and telepresence. Figure 2.2 depicts an example of such a holographic display.

Figure 2.2: Hologram of an airplane from three different angles. Reprinted by permission from Macmillan Publishers Ltd: Nature [6], copyright 2010.

They also describe that they were able to create a telepresence area for recording an image in 3D, which could then be visualized by their hologram in 3D. They managed to get an update rate of 0.5 Hz, something that indicates that telepresence by holograms is still far from a viable real-time environment.

2.3.3 Physical telepresence

The topic of physical telepresence generally focuses on touch-based telepresence. One type of implementation uses an area of small cubes that you can push, much like the nail boards where you can put your hand on one side and it is reproduced by the nails on the other side. The difference here is that the two sides are not physically connected, but are instead two boards connected digitally. When one board is changed, the other board changes the same way, or inverted, depending on the implementation.

One good example of such an implementation [7] is depicted in Figure 2.3. This implementation can be related to Section 2.3.4, where we look at social tele-operated robots. It uses a screen and an interaction area to reflect the operator's movements, instead of using a robot head and hands.


Figure 2.3: (Image removed due to copyright restrictions.) The remote controller is hovering his hands over a motion capture area. His movements are then rendered on the remote area, allowing him to move the ball around with his hands.

2.3.4 Social Tele-operated robots

Entering the field of social robotics gives us a whole new definition of physical. We now have access to a physical robot, in some cases even a humanoid android [8], a robot that is meant to give the appearance of a human. Using such a robot for a telepresence setup opens up many interesting scenarios, even though some may find it an uncanny experience (discussed in more depth in Section 2.5). An example of just how human-like such an android can be is depicted in Figure 2.4, where Ishiguro poses with his android replica.

Figure 2.4: A man (left) and his android replica, Geminoid-HI-1 (right). Copyright © IEEE. All rights reserved. Reprinted, with permission, from [8].

Figure 2.5: (Image removed due to copyright restrictions.) The setup of the LiveMask [9] implementation, presented in 2012, includes a 3D mask, a mirror to mirror the projector rays to the back of the mask and a servo motor to rotate and tilt the head.

A combination of social tele-operated robots and telerobotics could be a big step forward for telepresence, allowing not just a visual and audible presence in the room, but also a physical one, where you can actually move objects in the room with your arms, as is currently done in telerobotic surgery (Section 2.3.1).


A simpler type of robot is a robotic head with a 3D mask. One such implementation goes by the name of LiveMask [9], depicted in Figure 2.5. The implementation uses a back-projected 3D mask and is designed to remove the common problem of the Mona Lisa effect, explained in further detail in Section 2.2. The telepresence aspect of the implementation uses face tracking to project the face properly on the mask, and a study of interpreted gaze locations concluded that the face mask, unlike 2D displays, solved the Mona Lisa effect.

2.4 The Furhat robot

The robot that will be used to perform the evaluation is called Furhat [10]. The robot consists of a robotic head with a facial 3D mask and a projector, similar to the LiveMask setup depicted in Figure 2.5. It was developed at the Royal Institute of Technology in Stockholm, and is now part of the company Furhat Robotics.

Figure 2.6: The Furhat robot is built up in several steps. First, there is an animated face. Then there is a 3D facial mask acting as a projection surface. The animated face is projected onto the mask, and the projector is covered by a so-called "furhat", resulting in the rightmost picture.

The Furhat robot, presented in Figure 2.6, looks very similar to the LiveMask setup. It does, however, have differences. The main usage of the Furhat robot is to act as a social robot, which is why it has an animated face (instead of trying to replicate the operator's face in three dimensions as done in the LiveMask setup). For this thesis, it means that we can reuse the animated face and implement a telepresence environment around it. This allows for a much simpler approach that puts no special requirements on the client, as presented in Chapter 3. Hopefully, using an animated face will also reduce the risk of falling into the uncanny valley (explained further in Section 2.5).

When used as a social robot, Furhat has two different setups for tracking the speaker [11, 12]. One of the implementations works by having a Microsoft Kinect act as its eyes, providing information about the area in front of the robot. The Kinect provides facial recognition and also positions the speaker in 3D space. It can also record audio, although during presentations there is so much noise that two external close-range microphones with infrared markers are usually used. Another implementation for face tracking uses the SHORE software, developed by Fraunhofer [13]. The video and audio are then used to connect faces to voices, so the system can distinguish who is talking. With all this information, it is possible to create a multiparty conversation environment, meaning that more than one person can have an intelligent conversation with the robot at the same time.

2.5 Uncanny valley

In 1970, Mori presented an essay on the subject of the uncanny valley [14]. Mori proposed that the uncanny valley can be described with a graph plotting familiarity against human likeness, as depicted in Figure 2.7.

Figure 2.7: The uncanny valley as presented by Mori in 1970. © 2012 IEEE. Reprinted, with permission, from [14].

Mori argued that as the human likeness of things increases, the familiarity also increases, up to a point where it becomes too similar to a living human; an actual corpse marks the lowest point of the valley. In the valley we would also find a prosthetic hand. As the human likeness increases further, the familiarity once again increases and takes us out of the uncanny valley. A healthy person marks the top of the curve.

Mori also pointed out that movement amplifies the effect, giving a greater familiarity with humanoid robots, and a deeper valley with a moving corpse: a zombie.

In 2006, MacDorman and Ishiguro presented a paper in which they tested the theory in an experiment [15]. With the results from 45 Indonesian participants, they replicated the uncanny valley graph, as seen in Figure 2.8, showing that the valley does indeed seem to exist, something that needs to be considered when designing a telepresence setup.


Figure 2.8: Results from an experiment done in 2006 by MacDorman and Ishiguro [15]. The experiment was done by morphing a picture between a robot and a human, and asking participants of the experiment to rate the eeriness of the picture. Reprinted with kind permission from John Benjamins Publishing Company, Amsterdam/Philadelphia [www.benjamins.com].

The uncanny valley will be relevant when working on the implementation, and can also act as a motivation for choosing the Furhat robot with an animated face instead of a LiveMask type robot with a real face. Due to projecting on a 3D-mask, the real face is likely to be somewhat distorted, resulting in a not quite perfect resemblance to the real face, but close enough to potentially cause an eerie feeling. The animated face, however, is easily detected as an avatar, and not an actual face of a person, so any distortion can be accepted as a flaw in the animation.

2.6 Summary

After researching the field of telepresence, several things can be observed.

1. The area of telepresence has come far; so far, in fact, that previously futuristic ideas such as holograms are beginning to appear in the field.

2. Telerobotics is getting well developed, along with social robots, as seen with examples such as the Ishiguro android (Section 2.3.4). This introduces possibilities for using the same technology in telepresence environments.

3. Turning a social robot into a telepresence device is highly plausible. It can be done, as seen in the LiveMask project (Section 2.3.4). The question still remains, however: How well can it be done? Can this replace current media of communication?

a) LiveMask is a great example of a new telepresence environment, removing the Mona Lisa effect apparent on the 2D screens used in regular video calls. Some problems still remain, however; it requires an advanced setup on the client side.

b) A real face projected in real time on a 3D surface comes very close to a real face. In fact, it could be close enough to fall into the uncanny valley. The face would resemble the operator closely enough that any distortion might introduce an eerie feeling.

4. Furhat is a platform similar to LiveMask. It does, however, use an animated face instead of a real face for its 3D projection, making it less demanding on client resources; the client simply needs a laptop. It also makes it more obvious that the projected face is an avatar of the operator's face, and it is thus less likely to be caught in the uncanny valley.

The above points lead to the conclusion that the Furhat robot is a potentially viable communication medium. It has the same capabilities as the LiveMask system, but with the simplicity that comes with controlling an animated face.

Chapter 3

Implementation

In order to test the viability of the Furhat robot as a medium of communication for telepresence, software was implemented that enables the social robot to act as a replacement for the screen on one side of a video call. Note that it will not replace both sides, since that would not only remove the possibility of more than one participant per Furhat robot, but would also make the control of the robot very unnatural. The idea is presented in Section 3.1, with details presented in Section 3.4. Already available software is presented in Sections 3.2 and 3.3.

3.1 System overview

The implementation method follows the flow presented in Figure 3.1.

Figure 3.1: The flow of audio and video data in the implementation. The operator captures video and audio with a web camera and a microphone, along with mouse clicks, and sends facial parameters along with audio to the Furhat robot (top part). In return, the Furhat robot sends video and audio data captured by the Kinect camera connected to the robot (bottom part).


The flow outlines how audio and video are sent through the system. The client and server computers communicate through websockets when sending audio and facial parameters, and via HTTP requests when sending video and commands. Before going further into the details of the implementation (Section 3.4), there are a few dependencies that need to be presented. This is done in Section 3.2 (already available software for the Furhat robot) and Section 3.3 (the face tracking API used on the client side to retrieve the facial parameters). A sketch of the message split is given below.
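The following sketch illustrates how such a split could look from the client's side: facial parameters and audio are sent as messages over a websocket, while commands are sent as HTTP requests. The server address, endpoints and JSON field names are assumptions made for the example; they are not the actual protocol used by the implementation.

```python
import asyncio
import base64
import json

import requests      # commands go over HTTP in this sketch
import websockets    # facial parameters and audio go over a websocket

WS_URL = "ws://furhat-server:1080/stream"       # assumed address, not the real one
HTTP_URL = "http://furhat-server:1080/command"  # assumed command endpoint

async def send_frame(ws, face_params, audio_chunk):
    """Send one update: facial parameters as JSON plus base64-encoded 16 kHz PCM audio."""
    await ws.send(json.dumps({
        "type": "frame",
        "face": face_params,                               # e.g. {"SMILE_OPEN": 0.4}
        "audio": base64.b64encode(audio_chunk).decode("ascii"),
        "sample_rate": 16000,
    }))

def send_gaze_command(x, y):
    """Commands, such as a click on the video to redirect gaze, go as HTTP requests."""
    requests.post(HTTP_URL, json={"type": "gaze", "x": x, "y": y})

async def main():
    async with websockets.connect(WS_URL) as ws:
        # One dummy frame: neutral face and 20 ms of silence (320 16-bit samples).
        await send_frame(ws, {"SMILE_OPEN": 0.0}, b"\x00" * 640)

if __name__ == "__main__":
    asyncio.run(main())
```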

3.2 The Furhat toolbox

The Furhat robot has been developed for several years, providing an API for developers. This API is used within the implementation to reuse earlier implementations of the social robot Furhat, and make them available for the telepresence solution presented in this chapter. The software available for developers to control the Furhat robot is called IrisTK, and is explained further in Section 3.2.1. The IrisTK API incorporates an animated face (Section 3.2.2) that can be used as a stand-alone avatar or as a display on the Furhat robot.

3.2.1 IrisTK

This is the software currently available for controlling the Furhat robot. The two major resources available are control of the animated face (explained further in Section 3.2.2) and a Jetty1 server. Further resources include, but are not limited to, retrieving audio and video from a Microsoft Kinect 2.02 camera, sending this video to the web server hosted by Jetty, changing the texture of the animated face, and controlling the parameters of the animated face.

For this implementation, the Jetty server is used to host a web page that can receive and transmit video and audio, as explained in further detail in Section 3.4.

3.2.2 The animated face

The animated Furhat face is controlled by various parameters. Changing those parameters causes a morph between different animated faces, resulting in one animated face that is then displayed on the robot. A more detailed explanation of all available parameters is presented in Table A.1. For this implementation, we use only a selected set of parameters, displayed in Table 3.1. In addition to these parameters, the Synface software (Section 3.2.3) uses the 16 phonetic parameters to additionally control the mouth movements. The animated face itself is depicted in Figure 3.2. Due to the projection on a 3D mask, the sides of the animated face look stretched when presented in 2D.

1http://eclipse.org/jetty/ - March 30, 2015

2https://www.microsoft.com/en-us/kinectforwindows/ - May 26, 2015

Parameter | Description
SMILE_CLOSED, SMILE_OPEN | Controls the smile, open or closed.
BROW_DOWN_LEFT, BROW_DOWN_RIGHT, BROW_UP_LEFT, BROW_UP_RIGHT | Controls the eyebrow movements. These parameters move the eyebrows up or down, for each eyebrow individually.

Table 3.1: A presentation of the facial parameters used for the Furhat animated face.

Figure 3.2: The animated face of the Furhat robot has various parameters that can be adjusted, such as brow height and smile.
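The morphing described in Section 3.2.2 can be thought of as a weighted blend between a neutral face and a set of target shapes, one per parameter. The sketch below shows the idea with made-up three-vertex "meshes"; the real animated face uses far more detailed geometry and is not necessarily implemented exactly this way.

```python
import numpy as np

# Neutral face and per-parameter target shapes as (n_vertices, 3) vertex arrays.
# Three vertices stand in for real face geometry.
neutral = np.array([[0.0, 0.0, 0.0], [1.0, 0.0, 0.0], [0.5, 1.0, 0.0]])
targets = {
    "SMILE_OPEN":   np.array([[0.0, -0.1, 0.0], [1.0, -0.1, 0.0], [0.5, 1.0, 0.0]]),
    "BROW_UP_LEFT": np.array([[0.0,  0.0, 0.0], [1.0,  0.0, 0.0], [0.5, 1.2, 0.0]]),
}

def morph(parameters):
    """Blend the neutral face towards each target, weighted by its parameter value.

    parameters: dict mapping parameter name -> weight, typically in [0, 1].
    """
    result = neutral.copy()
    for name, weight in parameters.items():
        result += weight * (targets[name] - neutral)   # add the weighted offset
    return result

print(morph({"SMILE_OPEN": 0.7, "BROW_UP_LEFT": 0.3}))
```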

3.2.3 Synface

Synface is software for syncing mouth movements to speech detected in audio [16]. It was originally developed as an aid for hearing-impaired people, allowing them to participate in telephone calls by providing the visual aid of an animated face that moves its mouth according to the phonemes detected in the audio. This allows them to lip-read as well as hear the audio, giving them a way to communicate properly despite the hearing impairment. In this thesis, the software is used to project these phonemes onto the Furhat animated face, simulating the speech detected in the audio stream. This should compensate for the lack of detail in the face tracking done by clmtrackr.js (Section 3.3) and give proper lip synchronization, enhancing the telepresence experience. A sketch of the general idea follows.
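The sketch below illustrates the general idea of phoneme-driven lip synchronization: each detected phoneme segment is mapped to target values for a few mouth parameters and expanded into per-frame parameter sets. The phoneme labels, parameter names and timings are assumptions for the example; the actual Synface recognizer and its 16 phonetic parameters are not reproduced here.

```python
# Hypothetical mapping from detected phonemes to mouth-parameter targets (visemes).
VISEME_TARGETS = {
    "sil": {"JAW_OPEN": 0.0, "LIP_ROUND": 0.0},
    "m":   {"JAW_OPEN": 0.0, "LIP_ROUND": 0.2},
    "a":   {"JAW_OPEN": 0.8, "LIP_ROUND": 0.1},
    "o":   {"JAW_OPEN": 0.4, "LIP_ROUND": 0.9},
}

def lip_sync(phoneme_track, fps=25):
    """Expand (phoneme, duration_in_seconds) segments into per-frame parameter dicts.

    Playing the audio with a delay equal to the time needed to produce these
    frames keeps mouth movement and sound in sync.
    """
    frames = []
    for phoneme, duration in phoneme_track:
        target = VISEME_TARGETS.get(phoneme, VISEME_TARGETS["sil"])
        frames.extend([dict(target)] * max(1, round(duration * fps)))
    return frames

# The syllable "ma" followed by silence, at 25 frames per second:
frames = lip_sync([("m", 0.08), ("a", 0.20), ("sil", 0.12)])
print(len(frames), frames[0])
```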


Figure 3.3: The Synface animated face [17]. (a) The animated face of Synface. (b) The Synface software running on a mobile device.

3.3 Clmtrackr library

Clmtrackr.js is a library for facial tracking in a video. In this implementation, it will be utilized for its ability to retrieve facial coordinates. The retrieved coordinates will be sent to the server and translated to parameters for the animated face. Figure 3.4a shows the visual video overlay of the face tracking, while Figure 3.4b presents all coordinate points retrieved.

Figure 3.4: The clmtrackr library [18]. (a) The overlay from clmtrackr when successfully tracking a face. (b) The facial parameters captured by the face tracking.

Not all of these points are used, only selected ones, for example points 41 and 50 to track the mouth width. In order to normalize the width, we also need to look at another set of points; for this task, the nose width given by points 31 and 39 was used. After this, the values are mapped to the parameters of the Furhat animated face (Section 3.2.2); in this example, they were mapped to the SMILE_OPEN and SMILE_CLOSED parameters. A sketch of this normalization and mapping is given below.
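A minimal sketch of this normalization, assuming the point indices mentioned above and illustrative calibration constants:

```python
import math

def distance(p, q):
    return math.hypot(p[0] - q[0], p[1] - q[1])

def smile_parameter(points, neutral_ratio=1.5, wide_ratio=2.5):
    """Map clmtrackr face points to a smile value in [0, 1].

    points: dict mapping point index -> (x, y) pixel coordinates.
    The mouth width (points 41 and 50) is divided by the nose width (points 31
    and 39) so the value is independent of the distance to the camera. The
    neutral and wide-smile ratios are illustrative calibration constants.
    """
    ratio = distance(points[41], points[50]) / distance(points[31], points[39])
    smile = (ratio - neutral_ratio) / (wide_ratio - neutral_ratio)
    return min(max(smile, 0.0), 1.0)          # clamp to the parameter range

points = {41: (120, 200), 50: (180, 200), 31: (135, 160), 39: (165, 160)}
print({"SMILE_CLOSED": smile_parameter(points)})   # -> {'SMILE_CLOSED': 0.5}
```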


3.4 System details

This section will explain further details of the client side (Section 3.4.1) and the server side (Section 3.4.2), along with a thorough presentation of the flow of the system (Section 3.4.3), followed by a description of how the resampling and buffering was done (Section 3.4.4).

3.4.1 Client side

The client side is where the operator is located. The graphical interface presented to the operator is depicted in Figure 3.5. Here, the operator can control the incoming audio volume, the resolution of the incoming video, whether or not to use face tracking, and which animated face texture to use. The operator can also click on the video in order to direct the gaze of the Furhat robot to that point in the remote location; a sketch of how such a click can be turned into a command is given after Figure 3.5.

Figure 3.5: A presentation of the graphical interface that is presented to the operator.
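A sketch of how a mouse click on the incoming video could be turned into a gaze command. Normalizing the click to the video frame makes the command independent of the chosen resolution; the message format is an assumption, not the actual protocol.

```python
def click_to_gaze_command(click_x, click_y, video_width, video_height):
    """Convert a mouse click on the incoming video into a gaze command.

    The click is normalized to [0, 1] within the video frame so that the server
    can translate it to a point in the room regardless of the chosen resolution.
    """
    return {
        "type": "gaze",
        "x": click_x / video_width,
        "y": click_y / video_height,
    }

# A click near the right edge of a 640x480 video:
print(click_to_gaze_command(600, 240, 640, 480))   # {'type': 'gaze', 'x': 0.9375, 'y': 0.5}
```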

System requirements The hardware requirements for the client system are designed to be as simple as possible. This implementation requires a computer with a screen, along with an audio input device, an audio playback device, and an optional web camera. The camera is optional because the lip synchronization performed by Synface depends only on audio phonemes, and thereby works without face tracking.


3.4.2 Server side

The server side is where the Furhat robot is located. This is also where the interlocutors speaking to the operator will be located, as the operator appears through Furhat.

System requirements The hardware requirements for the server side are stricter than for the client side. A Furhat robot is required, along with a Microsoft Kinect 2.0 camera used as the video input device. Having external microphones yields the best experience for the operator, due to the lack of echo-reducing technologies, but they are optional since the Kinect camera can also be used as an audio input device.

3.4.3 Data flow

This section describes the data flow of the system in further detail. It uses steps 1, 2 and 3 in Figure 3.1 as reference points, and is split into two parts, one for each direction of flow.

3.4.3.1 From the operator to the Furhat robot

This section will explain the data flow from the operator to the Furhat robot in further detail, as shown in Figure 3.6 (which is simply an extraction of the top (green) part from Figure 3.1).

Figure 3.6: An overview of the flow from the operator to the Furhat robot.

Step 1: From the operator to the client This input consists of several different types. First, the operator generates video and audio input by appearing in front of a web camera and speaking into a microphone connected to the client computer. This video and audio needs to be processed before it can be sent on to the server. The video is run through the clmtrackr.js software (Section 3.3) in order to retrieve facial coordinates, and the audio is buffered and resampled to 16 kHz (see Section 3.4.4 for details).


The operator can also send commands to the client by clicking with a mouse in the graphical interface presented in Figure 3.5. These commands consist of turning face tracking on and off, changing the resolution of the incoming video, selecting the audio input device of the server, modifying the incoming audio volume, changing the textures of the animated face from a list of textures available at the server, and clicking on the video to direct the gaze of the Furhat robot.

Step 2: From the client to the server In this step, the facial parameters, generated by the clmtrackr.js library, are sent to the server, along with resampled audio bytes. This step also takes care of sending the commands generated by mouse clicks, explained in Step 1, to the server for execution.

Step 3: From the server to the Furhat robot Once the data has arrived at the server, it has to be processed. The facial coordinates are remapped to facial parameters for the animated face, and the buffered audio is played through the speakers of the Furhat robot after a brief delay, allowing the Synface software to produce facial parameters corresponding to the audio phonemes. The commands are executed to perform their designated tasks. In the case of a command corresponding to a click on the video (meaning a gaze direction command), the coordinates are sent to IrisTK, which translates coordinates as seen by the Microsoft Kinect 2.0 camera into coordinates as seen by the Furhat robot, and adjusts the head tilt and rotation, along with the pupils, accordingly. A simplified sketch of the geometric part of this translation is given below.
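The geometric part of this translation can be sketched as follows: a 3D point in the Kinect's coordinate frame is shifted into the robot's frame and converted to pan and tilt angles. The offset between the Kinect and the robot head is a made-up value, and in the actual system this work is done by IrisTK rather than by code like this.

```python
import math

# Assumed offset (metres) from the Kinect camera to the robot head's rotation
# centre, expressed in the Kinect frame: x to the right, y up, z forward.
KINECT_TO_ROBOT_OFFSET = (0.0, -0.15, 0.05)

def gaze_angles(point_in_kinect_frame):
    """Convert a 3D point seen by the Kinect into pan/tilt angles for the robot head.

    Returns (pan, tilt) in degrees; positive pan looks to the robot's left,
    positive tilt looks up.
    """
    ox, oy, oz = KINECT_TO_ROBOT_OFFSET
    x, y, z = point_in_kinect_frame
    rx, ry, rz = x - ox, y - oy, z - oz              # the point in the robot's frame
    pan = math.degrees(math.atan2(rx, rz))
    tilt = math.degrees(math.atan2(ry, math.hypot(rx, rz)))
    return pan, tilt

# A person about 1.5 m in front of the robot and 0.5 m to its side:
print(gaze_angles((0.5, 0.0, 1.5)))
```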

3.4.3.2 From the Furhat robot to the operator

This section explains the data flow from the Furhat robot to the operator in further detail, as shown in Figure 3.7 (which is simply an extraction of the bottom (blue) part of Figure 3.1). Note that step 1 is located to the right and step 3 is located to the left, as opposed to the steps in Figure 3.6.

Figure 3.7: An overview of the flow from the Furhat robot to the operator.

Step 1: From the Furhat robot to the server The input of this step is ideally recorded with a Microsoft Kinect 2.0 camera as the video input device and multiple microphones as audio input devices, but it is also possible to use the Kinect camera as an audio input device. Either way, the audio is recorded at a 16 kHz sample rate and then stored in a buffer on the server. The recorded video is of the resolution previously specified by the operator.

Step 2: From the server to the client The audio from the audio buffer is sent in chunks of a predefined buffer size to the client. The video is streamed straight to the client without any modification.

Step 3: From the client to the operator Before the audio can be played to the operator, it has to be resampled to fit the sample rate of the client computer. After this is done, it can be played properly, without any high- or low-pitched artifacts. In order to make this possible, however, the audio also has to be buffered on the client. This buffering also helps remove audio clipping caused by connection latency (more about resampling and buffering in Section 3.4.4). The video is displayed to the operator without any need for modification.

3.4.4 Resampling and buffering

All of the resampling is done on the client side. The audio is either resampled to 16 kHz (the sample rate of the server) before being sent to the server, or resampled to an unspecified sample rate (which varies between computers) before being played to the operator. When resampling, we need to transform a number of bytes played at one speed into another number of bytes played at another speed. Due to the fixed buffer sizes of the playback unit, audio buffers are also needed to make resampling possible. During the implementation, it was noticed that the buffers are not only required to make resampling possible, but also to reduce audio clipping. Synface also requires a small delay on the audio in order to have time to process the audio data and synchronize the lip movements properly.

As a result of this buffering, a bit of delay has been introduced into the system, something that could affect the results of the evaluation (Chapter 4).
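A minimal sketch of the two mechanisms, assuming linear-interpolation resampling and a small start-up buffer; the buffer size, sample rates and class names are illustrative, not taken from the implementation.

```python
from collections import deque

import numpy as np

def resample_linear(samples, src_rate, dst_rate):
    """Resample a 1-D array of audio samples by linear interpolation."""
    n_out = int(round(len(samples) * dst_rate / src_rate))
    positions = np.linspace(0, len(samples) - 1, num=n_out)
    return np.interp(positions, np.arange(len(samples)), samples)

class JitterBuffer:
    """Hold back a few chunks before playback starts, to smooth out network jitter.

    The deliberate start-up delay is also what gives Synface time to analyse the
    audio before the corresponding lip movements are needed.
    """
    def __init__(self, min_chunks=3):
        self.chunks = deque()
        self.min_chunks = min_chunks
        self.started = False

    def push(self, chunk):
        self.chunks.append(chunk)

    def pop(self):
        # Wait until enough chunks have arrived, then keep releasing as long as
        # any are available.
        if not self.started and len(self.chunks) < self.min_chunks:
            return None
        self.started = True
        return self.chunks.popleft() if self.chunks else None

# A client microphone at 44.1 kHz resampled to the server's 16 kHz:
mic = np.sin(2 * np.pi * 440 * np.arange(441) / 44100)   # 10 ms of a 440 Hz tone
print(len(mic), "->", len(resample_linear(mic, 44100, 16000)))   # 441 -> 160
```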

Chapter 4

Evaluation

In order to test the implementation presented in Chapter 3, a user evaluation was performed, investigating the viability of using the implementation as a communication medium for telepresence by comparing it to a similar, but already established, medium: a video call with a 2D screen. The evaluation was designed to incorporate important elements of the telepresence solution. This chapter explains how the evaluation was set up (Section 4.1) and what areas it aimed to evaluate (Section 4.2).

4.1 The Desert Survival Task

The evaluation required one operator and two interlocutors, i.e. a total of three test subjects per group. The operator operated the Furhat robot by appearing in front of a web camera on the client side and clicking with the mouse on the video to control the gaze of the robot. The interlocutors sat in front of the Furhat robot (see Figure 4.1) and would, together with the operator, attempt to solve a problem. The problem consisted of ordering several pictures according to their value when surviving in the desert. This is a task that has previously been played with Furhat acting as a social robot, and is referred to as the Desert Survival Task, depicted in Figure 4.2. This setup was then compared with a similar setup where, instead of using the Furhat robot as a visual display for the operator, another medium of communication was used: a video call with a 2D screen. Prior to these two scenarios, the Desert Survival Task was played once with no external medium of communication; instead, the operator sat at the table together with the interlocutors and participated physically. This was done to introduce all users to the task, so that there would be no confusion about its goal when the actual testing of the Furhat and video call setups began.

After the evaluation, the test subjects were asked to fill in a questionnaire in order to provide information about their impressions of the different setups. The questionnaire consisted of two parts:


Figure 4.1: Two interlocutors sit on the two chairs in front of the Furhat robot in order to participate in the Desert Survival Task together with the operator, who participates through the Furhat robot.

Figure 4.2: The Desert Survival Task consists of five cards that depict different objects. The cards should be sorted to reflect their priority when surviving in the desert, with the highest-priority object to the right and the lowest-priority object to the left.

Figure 4.3: An example of a question from part 2 of the questionnaire. This specific question was taken from the operator-specific questionnaire.

Part 1 General information about the test subject, such as gender and age, as well as if the operator/interlocutor was previously known to the test subject.

Part 2 Questions specific to the role of operator or interlocutor. These questions reflect the evaluation areas presented in Section 4.2, and were to be answered on the same scale for both setups, so that a comparison could be made for each question. An example of a question from this part is presented in Figure 4.3.

This task was not only a method of getting a conversation going. The pictures on the table and the two interlocutors made it important to be able to tell the gaze direction of the operator. Being able to accurately tell the gaze direction should also allow for eye contact, or at least the appearance of eye contact for the interlocutor (to truly achieve eye contact, the operator's camera would need to be positioned at the eyes of the robot). It should hopefully also give a better understanding of which object the operator was referring to.

4.2 Areas of evaluation

When doing the evaluation, several areas were looked at, as described in the list below. The full questionnaires are presented in the appendix, in Figures B.1 and B.2.

• The ability to convey which object was referred to.

• Who was being spoken to.

• Detecting when it was your turn to speak.

• Eye contact and gaze direction.

• Conversational flow.

• Physical presence importance.

• Synchronization of lip movements to speech.

• Uncanny feeling from the robot not actually being a human being, but talking like one.

• How much controlling the Furhat robot's gaze with mouse clicks disturbed the conversation.

The questions on the questionnaire aim to cover the above areas. Their importance lies mostly in the conversational flow: if you are able to detect who is being spoken to, what object is being referred to, and when it is your turn to speak, the conversational flow should be very good. Some of the areas are also simply an evaluation of how the users experience their interaction with Furhat compared to an interaction they should already be familiar with, the video call, in order to determine whether a robotic head might be a viable medium of communication.


Chapter 5

Results

As a result of the evaluation, two different questionnaires were answered, depending on the role of the test subject (operator or interlocutor). A summary of the answers is presented in Sections 5.1 and 5.2, together with a brief description of their meaning. The questionnaires were written in Swedish for the test subjects' convenience, but the questions are presented in English in Tables 5.1 and 5.2 for this thesis. The answer scale ranges from 1 to 9, and all questions except the last one in each questionnaire were presented as scales from "Not at all" to "As live", with "As live" meaning as if everyone were sitting in the same room. The last question of each questionnaire applied to the Furhat setup only and was answered on a scale of 1 to 9 from "Not at all" to "Yes, very". The test subjects were asked to fill in a value both for the Furhat setup and the video call setup, allowing for a comparison.

Note that only the median values are reported from the survey, instead of the averages. This is because the answers are on an ordinal scale, meaning that average values might be misleading. For the same reason, it cannot be assumed that the values follow a normal distribution, which is why the Wilcoxon signed-rank test1 was used to analyze the answers statistically, with a significance level (alpha) of 0.05. An illustration of how one such comparison can be computed is given below.
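As an illustration, the paired answers for one question could be compared as in the sketch below; the answer vectors are made up and do not correspond to the data collected in the study.

```python
from scipy.stats import wilcoxon

# Hypothetical paired answers (1-9) for one question, Furhat vs. video call.
furhat = [8, 7, 9, 8, 6, 8, 7, 9, 8, 7, 8, 6, 9, 8, 7]
video  = [6, 6, 7, 7, 5, 7, 6, 7, 6, 6, 6, 5, 8, 6, 6]

stat, p_value = wilcoxon(furhat, video)
print(f"W = {stat}, p = {p_value:.3f}")
if p_value < 0.05:                  # the significance level used in the thesis
    print("The difference between the setups is statistically significant.")
```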

5.1 Interlocutors

Results from the interlocutor-specific questionnaire are presented in Table 5.1.

An important addition to these results is the comments received. Many of the participants felt that they had a better connection to the operator when the Furhat robot appeared in the room, despite only seeing an animated version of the operator's face. One test subject explained this with the comment "Having the robot in the room made it less likely for me to blur out the operator as part of the environment and more likely to react to its movements, compared to the video call".

1http://en.wikipedia.org/wiki/Wilcoxon_signed-rank_test - May 20, 2015


Question | Furhat | Video | p-value | # Samples
I could easily convey to the operator which object I was referring to. | 7 | 7 | - | 15
It was easy to understand which object the operator was referring to. | 7 | 7 | - | 15
It was easy to understand who the operator was speaking to. | 8 | 6 | 0.046 | 15
It was easy to determine when it was my turn to speak. | 7 | 7 | - | 15
It felt like I had eye contact with the operator. | 7 | 5 | 0.026 | 13
It felt like there was a good conversational flow. | 8 | 7 | - | 15
It felt like the operator understood when I was speaking to him/her. | 8 | 7 | - | 15
It was easy to distinguish the operator's facial expression. | 5 | 8 | 0.001 | 15
The lips moved well along with the audio. | 6 | 7 | - | 15
It felt like the operator was present in the room. | 6 | 7 | - | 15
It was uncanny to speak to a humanoid robot. | 1 | - | - | 15

Table 5.1: Results from the interlocutor-specific questionnaire. The first column is the question as specified in the survey. The second and third columns are the median values for the Furhat and video call setups, respectively. The fourth column is the statistical significance (presented only if p < 0.05). The fifth column is the number of samples.

One row in Table 5.1 stands out: the number of samples is only 13 instead of 15 as for the rest of the questions. This is because two answers showed extremely confusing and contradictory data, most likely due to the test subjects misinterpreting the question. In this specific case, the test subjects stated that it was very easy to understand who the operator was speaking to, but that they felt like they had no eye contact with the operator whatsoever. The assumption is made that there is a strong connection between these two questions, and that their values should be interconnected. For all of the other participants, the answers to these two questions were almost the same, but these two participants answered 9 and 6, respectively, to the question "It was easy to understand who the operator was speaking to" and both answered 1 to the question "It felt like I had eye contact with the operator". This is believed to be because there are no actual eyes or face of the operator to be seen, only a robot, which made these test subjects interpret the question as actual eye contact, not perceived eye contact through a medium. It should also be noted that removing all answers from these test subjects still results in the same statistical significance for the other questions.

For the question "It was easy to distinguish the operator’s facial expression", the video call setup showed a significant advantage over the Furhat system. This is presumably due to the limitations of the implementation in terms of mapping facial coordinates to facial parameters, discussed further in Section 5.3.

5.2 Operators

Results from the operator-specific questionnaire are presented in Table 5.2.

Question | Furhat | Video | p-value | # Samples
I could easily convey which object I was referring to. | 8 | 8 | - | 8
It was easy to understand which object my interlocutors were referring to. | 8 | 8 | - | 8
It was easy to understand who my interlocutors were speaking to. | 6.5 | 6.5 | - | 8
It was easy to determine when it was my turn to speak. | 6.5 | 5 | - | 8
It felt like I had eye contact with my interlocutors. | 4.5 | 3.5 | - | 8
It felt like there was a good conversational flow. | 7.5 | 7 | - | 8
It felt like the interlocutors understood when I was speaking to them. | 8 | 6 | 0.056 | 8
It felt like I was present in the room. | 5 | 3 | 0.039 | 8
It was demanding to click where Furhat was supposed to look. | 6 | - | - | 8

Table 5.2: Results from the operator-specific questionnaire. The first column is the question as specified in the survey. The second and third columns are the median values for the Furhat and video call setups, respectively. The fourth column is the statistical significance (presented only if p < 0.05). The fifth column is the number of samples.

One row in Table 5.2 deserves a note: its p-value is only slightly higher than 0.05, which is why it is still presented, albeit not statistically significant.

Many of the operators stated verbally that they did not think there was much difference between the setups, other than the clicking being quite demanding (reflected by the question "It was demanding to click where Furhat was supposed to look"). The lack of a difference between the setups is understandable, since the operator received video from the same camera in both setups. They did, however, state that the interlocutors reacted more to their gaze when they controlled the Furhat robot, causing them to feel more present in the room, as reflected by the Furhat setup being favored in the statistically significant question "It felt like I was present in the room" in Table 5.2.

5.3 Limitations of the implementation

When interpreting and looking closer at the results, it is important to note that the implementation is simply a proof of concept; it is by no means a complete product.

There are many improvements to be made, such as optimizing the camera position and improving the mapping of facial parameters, or perhaps even using an emotion classifier together with predefined faces for emotions instead of trying to establish a good mapping between facial coordinates and facial parameters. The mapping is particularly difficult because factors such as the distance to the camera greatly change the relative positions of the facial coordinates; in theory the mapping should correspond perfectly to the face, but there is no guarantee that it does.

Further potential improvements to the implementation are presented in Section 6.3.


Chapter 6

Discussion

This chapter will discuss the project as a whole, and especially look closer at the results presented in Chapter 5.

6.1 Interpreting the results

Overall, the results showed a surprisingly good outcome for the Furhat setup. For several questions, such as "It was easy to determine when it was my turn to speak" and "The lips moved well along with the audio", we expected the video call to be significantly better than the Furhat setup, but according to the results they were equal. The fact that all questions except one showed equal results or a significant difference in favor of the Furhat robot exceeded expectations. The question that did show a significant difference in favor of the video call was "It was easy to distinguish the operator's facial expression", which was not very surprising given the limitations explained in Section 5.3. It should be noted that a more developed version of the implementation is likely to reduce or even remove this difference. This implementation was limited by a shortage of time and should act only as a proof of concept, as previously stated in Section 5.3.

The main points that showed an improvement in the Furhat setup were the interlocutor questions "It was easy to understand who the operator was speaking to" and "It felt like I had eye contact with the operator". That the Furhat setup got such a good response in terms of eye contact shows that the Mona Lisa effect is indeed handled by the 3D facial mask of the robot. It is important to note that these questions have counterparts among the operator questions that also favor the Furhat setup, in terms of presence in the room and a feeling that the interlocutors better understood when they were spoken to. There was also a slight favor (albeit not statistically significant) in terms of eye contact. Together, these results show a strong advantage in telepresence for the Furhat system, while also showing the importance of gaze.


6.2 Effects of the uncanny valley

There was initially a fear that the limitations of the implementation might cause an uncanny interaction. The test results from the question "It was uncanny to speak to a humanoid robot" in Table 5.1 do, however, indicate that the test subjects did not find the interaction with the Furhat robot uncanny at all (receiving a median value of 1). While this does not necessarily prove that the theory from Section 2.5 holds, it shows that using an animated face on a robot head is not a cause of an uncanny experience, and that it is in fact a good idea to use an avatar. It still remains to be seen whether it is better than a 3D projection of an actual face, as used in the LiveMask study [9], but it certainly is simpler.

6.3 Future Work

After speaking to the test subjects, it is obvious that there are many improvements that can be made. Some of them involve advanced technology, and while using such technology could greatly improve the telepresence experience, it would come at the expense of mobility and simplicity.

Improving the facial parameter mapping It has previously been shown that facial expressions are maintained through animation [19]. This, together with the current limitations of the system (Section 5.3), indicates that the parameter mapping is not perfectly implemented. This means that there is room for improvement, something that could potentially reduce the gap between the Furhat and video call setups reflected by the question "It was easy to distinguish the operator's facial expression" in Table 5.1.

Emotion detection Since the IrisTK API (Section 3.2.1) provides access to facial parameters that scale predefined animated faces for the emotions Anger, Disgust, Fear, Sad and Surprise (shown in Table A.1), it could be interesting to use these instead of mapping the facial parameters as is done in the current implementation (explained in Section 3.3). It could also generate a more realistic avatar, improving the telepresence experience. In fact, the clmtrackr library includes emotion detection that could be mapped directly to the Furhat animated face, removing any errors introduced in the mapping process. A sketch of this idea is given below.
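A sketch of that alternative, assuming a classifier that returns per-emotion confidence scores; the score format and threshold are assumptions, and the emotion names follow the predefined faces listed in Table A.1.

```python
def emotion_to_face(scores, threshold=0.4):
    """Pick the strongest detected emotion, or fall back to a neutral face.

    scores: dict mapping emotion name -> confidence in [0, 1], as a hypothetical
    emotion classifier might produce. Returns the name of the predefined face
    to display and the weight to apply to it.
    """
    emotion, confidence = max(scores.items(), key=lambda item: item[1])
    if confidence < threshold:
        return "NEUTRAL", 0.0
    return emotion.upper(), confidence       # e.g. ("SURPRISE", 0.8)

print(emotion_to_face({"anger": 0.1, "surprise": 0.8, "sad": 0.2}))   # ('SURPRISE', 0.8)
print(emotion_to_face({"anger": 0.2, "surprise": 0.1, "sad": 0.3}))   # ('NEUTRAL', 0.0)
```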

Autonomous behavior As it was shown to be bothersome for the operator to use the mouse to control the gaze (pointed out in Section 5.3), it could be interesting to implement this as an autonomous behavior that reacts to people when they are speaking and looks back at them. This would, however, not allow you to control when you want to direct your gaze at specific persons or objects in order to talk to or about them, but it could be possible to combine the current gaze control implementation with an autonomous implementation, working together. Since Furhat is currently implemented as a social robot that incorporates a number of different autonomous behaviors, it should be possible to adjust previous implementations to a telepresence situation. Another reason for autonomous behavior was stated by one of the test subjects: the gaze of Furhat becomes uncanny if it stays at the same spot for too long, suggesting that it would be a good idea to implement a behavior that causes the robot to look away briefly if the gaze has been directed at the same spot for too long.

Eye tracking As far as gaze control goes, the ultimate solution would be to go one step further and implement eye tracking. This is optimally done using eye tracker hardware (such as Tobii's eye tracker1) to track where the user is looking. It could also be investigated whether software-based eye tracking from a web camera video is possible. Eye tracking would eliminate the need to click where to move the Furhat robot's gaze, instead allowing you to move it with your eyes. It is also a technology that is quite affordable, and could be implemented as an optional feature, as an alternative to clicking, for those who use the system enough, or for a dedicated telepresence computer.

Adding physical elements As proven by the field of telerobotics (Section 2.3.1), and telesurgery in particular, the technology exists to teleoperate robots with extreme accuracy. Using similar technology, it would be possible to add teleoperated robot arms to the Furhat robot, allowing for physical interaction. Simpler variants, such as the solution presented in Section 2.3.3, are also possible and probably more affordable, albeit not as precise.

Holograms As presented in Section 2.3.2, a future possibility could be to use holograms to obtain a perfect 3D image of the operator. This would eliminate the need for a robot completely.

¹ http://www.tobii.com/en/eye-experience/dev/ - May 19, 2015


Conclusion

This thesis has presented different telepresence communication media (Chapter 2) and their strengths and weaknesses. Established media such as radio, telephone, television and video calls were presented, followed by future areas such as telerobotics, holograms, physical telepresence and social robots, before concluding that a robotic head with an animated face projected on a 3D face mask (more specifically, the Furhat robot) had the potential to act as a viable communication medium. Known issues such as the Mona Lisa effect for 2D surfaces and the uncanny valley were also presented. The Mona Lisa effect has previously been shown to be solved by using the Furhat robot [3].

The implementation details (Chapter 3) demonstrated the idea behind the telepresence system, which works like a video call, with the exception that one of the interconnected parties is not visualized on a 2D screen but is instead embodied by the Furhat robot.

An evaluation (Chapter 4) was performed with an operator in a remote location seated in front of a screen, and two interlocutors seated in front of the Furhat robot. This setup was then compared to a regular video call in order to investigate the viability of this new communication medium.

The results (Chapter 5) revealed that the Furhat setup has many advantages over a video call, and only a few disadvantages. When interpreting the results further (Chapter 6), it could be seen that all of those advantages are important aspects of telepresence, and that the disadvantages could potentially be removed by a more complete implementation, suggesting that a robotic head with a 3D face mask is indeed a viable communication medium for telepresence. There are even indications that it has the potential to be a better choice of communication medium than the video call.

There are still quite a few improvements to be made for the implementation to become a full-blown telepresence system (see Section 6.3). Video calls have been refined over many years, and allowing this type of setup to be developed over a longer period of time would likely show some really interesting results. It is important to remember that the Furhat robot was initially developed with the purpose of being a social robot, meaning that this solution has only been developed during a few months, albeit making use of many existing features of IrisTK. Despite this short amount of time, there are already a lot of interesting results, leaving great hope for the future of telepresence through robotic heads.


Bibliography

[1] J. J. Gibson, The Ecological Approach To Visual Perception. Houghton Mifflin, 1979.

[2] J. Steuer, "Defining virtual reality: Dimensions determining telepresence," Journal of Communication, vol. 42, no. 4, pp. 73–93, 1992.

[3] S. A. Moubayed, J. Edlund, and J. Beskow, "Taming Mona Lisa: Communicating gaze faithfully in 2D and 3D facial projections," ACM Transactions on Interactive Intelligent Systems, vol. 1, no. 2, pp. 1–25, Jan. 2012.

[4] G. H. Ballantyne, "Robotic surgery, telerobotic surgery, telepresence, and telementoring," Surgical Endoscopy And Other Interventional Techniques, vol. 16, no. 10, pp. 1389–1402, Oct. 2002.

[5] R. M. Satava and S. B. Jones, "Preparing Surgeons for the 21st Century: Implications of Advanced Technologies," Surgical Clinics of North America, vol. 80, no. 4, pp. 1353–1365, Aug. 2000.

[6] P.-A. Blanche, A. Bablumian, R. Voorakaranam, C. Christenson, W. Lin, T. Gu, D. Flores, P. Wang, W.-Y. Hsieh, M. Kathaperumal, B. Rachwal, O. Siddiqui, J. Thomas, R. A. Norwood, M. Yamamoto, and N. Peyghambarian, "Holographic three-dimensional telepresence using large-area photorefractive polymer," Nature, vol. 468, no. 7320, pp. 80–83, Nov. 2010.

[7] D. Leithinger, S. Follmer, A. Olwal, and H. Ishii, "Physical telepresence: shape capture and display for embodied, computer-mediated remote collaboration." ACM Press, 2014, pp. 461–470.

[8] D. Sakamoto, T. Kanda, T. Ono, H. Ishiguro, and N. Hagita, "Android as a telecommunication medium with a human-like presence," in Human-Robot Interaction (HRI), 2007 2nd ACM/IEEE International Conference on. IEEE, 2007, pp. 193–200.

[9] K. Misawa, Y. Ishiguro, and J. Rekimoto, "LiveMask: A Telepresence Surrogate System with a Face-shaped Screen for Supporting Nonverbal Communication," in Proceedings of the International Working Conference on Advanced Visual Interfaces, ser. AVI '12. New York, NY, USA: ACM, 2012, pp. 394–397.

[10] S. A. Moubayed, J. Beskow, G. Skantze, and B. Granström, "Furhat: A Back-Projected Human-Like Robot Head for Multiparty Human-Machine Interaction," in Cognitive Behavioural Systems, ser. Lecture Notes in Computer Science, A. Esposito, A. M. Esposito, A. Vinciarelli, R. Hoffmann, and V. C. Müller, Eds. Springer Berlin Heidelberg, 2012, no. 7403, pp. 114–130.

[11] G. Skantze and S. Al Moubayed, "IrisTK: A Statechart-based Toolkit for Multi-party Face-to-face Interaction," in Proceedings of the 14th ACM International Conference on Multimodal Interaction, ser. ICMI '12. New York, NY, USA: ACM, 2012, pp. 69–76.

[12] S. Al Moubayed, G. Skantze, J. Beskow, K. Stefanov, and J. Gustafson, "Multimodal Multiparty Social Interaction with the Furhat Head," in Proceedings of the 14th ACM International Conference on Multimodal Interaction, ser. ICMI '12. New York, NY, USA: ACM, 2012, pp. 293–294.

[13] C. Küblbeck and A. Ernst, "Face detection and tracking in video sequences using the modified census transformation," Image and Vision Computing, vol. 24, no. 6, pp. 564–572, Jun. 2006.

[14] M. Mori, K. MacDorman, and N. Kageki, "The Uncanny Valley [From the Field]," IEEE Robotics & Automation Magazine, vol. 19, no. 2, pp. 98–100, Jun. 2012.

[15] K. F. MacDorman and H. Ishiguro, "The uncanny advantage of using androids in cognitive and social science research," Interaction Studies, vol. 7, no. 3, pp. 297–337, Jan. 2006.

[16] J. Beskow, I. Karlsson, J. Kewley, and G. Salvi, "SYNFACE – A Talking Head Telephone for the Hearing-Impaired," in Computers Helping People with Special Needs, ser. Lecture Notes in Computer Science, K. Miesenberger, J. Klaus, W. L. Zagler, and D. Burger, Eds. Springer Berlin Heidelberg, 2004, no. 3118, pp. 1178–1185.

[17] G. Salvi, J. Beskow, S. Al Moubayed, and B. Granström, "SynFace: Speech-driven Facial Animation for Virtual Speech-reading Support," EURASIP J. Audio Speech Music Process., vol. 2009, no. 3, pp. 1–10, Jan. 2009.

[18] "auduno/clmtrackr," Mar. 2015. [Online]. Available: https://github.com/auduno/clmtrackr

[19] J. Linder and M. Gudmandsen, Telepresence using Kinect and an animated robotic face, 2013.

Facial Parameters

Table A.1: Presentation of the facial parameters and a brief usage description.

EXPR_ANGER, EXPR_DISGUST, EXPR_FEAR, EXPR_SAD, EXPR_SURPRISE
    Predefined full facial expressions representing the emotions anger, disgust, fear, sadness and surprise.

SMILE_CLOSED, SMILE_OPEN
    Controls the smile, open or closed.

BLINK_LEFT, BLINK_RIGHT
    Controls the eyelids. Blinks with the left or right eye, respectively.

BROW_DOWN_LEFT, BROW_DOWN_RIGHT, BROW_IN_LEFT, BROW_IN_RIGHT, BROW_UP_LEFT, BROW_UP_RIGHT
    Controls the eyebrow movements. These parameters move each eyebrow individually in, up or down.

EARS_OUT, EPICANTHIC_FOLD
    Controls the angle between the head and ears, and the epicanthic fold.

EYE_SQUINT_LEFT, EYE_SQUINT_RIGHT
    Controls the squinting of the eyes.

LOOK_DOWN, LOOK_LEFT, LOOK_RIGHT, LOOK_UP
    Controls the orientation of the pupils, both eyes together. Moves them down, left, right or up.

PHONE_AAH, PHONE_B_M_P, PHONE_BIGAAH, PHONE_CH_J_SH, PHONE_D_S_T, PHONE_EE, PHONE_EH, PHONE_F_V, PHONE_I, PHONE_K, PHONE_N, PHONE_OH, PHONE_OOH_Q, PHONE_R, PHONE_TH, PHONE_W
    Simulates the mouth movements of the phonetic sounds. Very similar movements (such as b, m and p) have been grouped together into one parameter.

NECK_PAN, NECK_TILT
    Controls the panning and tilting elements of the neck.
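To make the parameter list more concrete, the minimal sketch below combines a few of the parameters above into a single expression (an open smile with raised eyebrows). The values are assumed to lie in [0, 1], and setFaceParameter() is a hypothetical stand-in for the actual call that drives the animated face.

public class ExpressionExample {

    public static void main(String[] args) {
        // An open smile with slightly raised eyebrows.
        setFaceParameter("SMILE_OPEN", 0.8);
        setFaceParameter("BROW_UP_LEFT", 0.6);
        setFaceParameter("BROW_UP_RIGHT", 0.6);
    }

    private static void setFaceParameter(String name, double value) {
        // Hypothetical stand-in for the call that sets a Furhat facial parameter.
        System.out.printf("%s = %.2f%n", name, value);
    }
}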


Questionnaires

This appendix presents the questionnaires used in the evaluation. The interlocutor questionnaire is shown in Figure B.1, and the operator questionnaire in Figure B.2.


General questions
The operator is previously known to me: Yes / No
The interlocutor is previously known to me: Yes / No
Gender: Male / Female
Age: ____

Interlocutor side
Mark F for Furhat and V for video call on the statements below. Each statement is rated on a nine-step scale ranging from "Not at all" to "As live".

I could easily convey to the operator which object I was referring to
It was easy to understand which object the operator was referring to
It was easy to understand who the operator was speaking to
It was easy to determine when it was my turn to speak
It felt like I had eye contact with the operator
It felt like there was a good conversational flow
It felt like the operator understood when I was speaking to him/her
It was easy to distinguish the operator's facial expression
The lips moved well along with the audio
It felt like the operator was present in the room

Note: mark only for Furhat
It was uncanny to speak to a humanoid robot (scale from "Not at all" to "Yes, very")

Other comments: ____

Figure B.1: The questionnaire as presented to the interlocutors.
