
Context aware voice user interface

Nora Demeter

Interaction Design, One-year Master Thesis, 15 credits

Spring semester 2014
Supervisor: Jörn Messeter


Foreword

Speech is the most natural and widely used modality of interaction, and the fastest form of communication between humans. The ways humans and machines communicate have long fascinated us in literature and film, while scientists have been researching the area for over five decades. Controlling computers and mobile phones by voice has been a subject of science fiction since the 1960s, appearing in many popular movies, such as 2001: A Space Odyssey, where HAL 9000 is not only capable of recognizing and interpreting speech but develops intelligence and emotions, and Star Trek, where the captain can easily identify themselves verbally and communicate with the Starship Enterprise’s computer.

In reality, speech recognition is still far from perfect. For systems like those described to function we would need 100% accuracy, since in life-threatening scenarios a misheard word could cost a life. There are still challenges to overcome, but the technology has made a big leap recently and is becoming increasingly available on mobile devices for commercial use. Speech recognition systems now understand natural language and use the context of the application to increase recognition accuracy. Speech recognition and synthesis are also available in a large number of languages. The technology has now reached such a level of maturity that we can consider speech a main medium of interaction on mobile platforms, and more and more people are taking advantage of the value that these hands-free, eyes-free interfaces provide in many situations.


Introduction

Traditional input/output modalities such as keyboards are designed from a machine perspective, and the machine is able to parse any keyboard entry with complete accuracy. However, keyboards are not a natural means of interaction for humans. Keyboards on mobile devices became on-screen keyboards, and with the lack of tactile feedback, typing became a slower task with a higher mistake ratio than before (Rajput and Nanavati, 2012). Hence, new input/output modalities must be explored to provide natural and direct interaction. Speech and audio allow for faster input, requiring only speakers and microphones rather than a large on-screen or physical keyboard.

A voice-user interface (VUI) makes human interaction with mobile devices and computers possible through a voice/speech platform, in order to initiate a command by speech instead of using the graphical user interface. A VUI is the interface to any speech application.

In this thesis I address the topic of non-visual approaches to interaction on mobile devices, as an alternative to their existing visual displays in situations where hands-free usage of the device is preferred. The current technology will be examined through existing work, with special attention to its limitations, to which user groups currently use any sort of speech recognition or voice command functions, and to the scenarios in which these are most used and most desired. Then I will examine through interviews why people trust or distrust voice interactions, how they feel about the possibilities and limitations of the technology at hand, how individual users currently use it, and where they see the technology in the future. After this I will develop an alternative voice interaction concept and validate it through a set of workshops.


Methods

I took a phenomenological approach to my study, which aims to understand and analytically describe the acts of individuals and their intentional consciousness. The object to be studied is the world of everyday life, also called the “lifeworld”. I chose this approach because of the subjectivity of the subject: each individual has a different opinion, a different way of using or not using voice interaction, and different reasoning behind it. This is a qualitative method, based on interviews and observations; I found it especially useful because it offers a close understanding of the phenomenon and gives the subject a high degree of participation in the study.

This approach is built on subjectivism, which stands in contrast to objectivism (Sandström, 2010). Edmund Husserl introduced the term lifeworld in his unfinished book (Husserl, 1954), where he describes it as an intersubjective world of natural, pre-theoretical experience and activity, and as a starting point for phenomenological analysis.

His theory of the lifeworld became a start for the phenomenological sociology of Alfred Schütz. Schütz provides a complete and original analysis of human action and its “intended meaning” (Schütz, 1967) through phenomenology and provides a philosophical basis for Max Weber’s sociology and its focus on subjectively meaningful action (Weber, 1978). These three authors have become a starting point for contemporary phenomenological sociology.

Schütz says that one should start out from the premise of the intentional conscious experience directed towards the other self. In short, the empirical material that is collected by the researcher is the mental content of people’s natural attitude.

Aspers argues that the empirical material of the social scientist is what people in everyday life take for granted.

Aspers described a way of implementing phenomenology as a method of observation in practice in a study of fashion markets in Sweden (2001), and states that phenomenology aims to build an entire scientific approach on subjectivism, and that this should be emphasized throughout many of the steps and methods.


He gives seven steps to conduct empirical phenomenological studies:

1. Define the research question.

2. Conduct a prestudy.

3. Choose a theory and use it as a scheme of reference.

4. Study first-order constructs (and bracket the theories).

5. Construct second-order constructs.

6. Check for unintended effects.

7. Relate the evidence to the scientific literature and the empirical field of study.

First, he explains that the researcher has to decide what problem is at hand. In order to do so, the researcher has to study questions relevant to this theory within the field and conduct a prestudy.

I did that by investigating contemporary works and authors in interaction design, mobile design and natural language interfaces. I looked at current products and what they are missing, new concepts and how they are implemented, and ideas and concerns relating to speech interaction, voice commands and voice-user interfaces, both on mobile devices and on computers used privately or publicly. I have left the area of investigation intentionally quite broad, since voice interaction can be used in different settings and contexts, but many of these share similar advantages and disadvantages, ideas can often be merged or modified to suit one device over the other, and I wanted to know the subjective experience of the participants for each.

I conducted a quite broad set of interviews with 8 people, with ages ranging from 20 to 50 years old and technological competency ranging from the lower middle to the higher end of the scale. All participants had smartphones which they use in their daily lives. The interviews consisted of questions on voice user interfaces in general, through which the subjective interpretations of the users become apparent, giving a good overview of the subject. Thereafter, it is possible to choose a preferable theory and give focus to the study.

The empirical phenomenological approach demands that a scientific explanation is reached only when we understand the actor’s perspective. To achieve this understanding I concluded with a second round of interviews, focused on a more specific category of problems, which helps to see the user’s perspective. This was done with the same 8 participants. However, the level of meaning can vary, because it is complex to reach a clear understanding of the actors’ meaning.


The process moves from the subjective level of the actor’s first-order constructs to the objective level of the second-order constructs. This process may generate new theories. Acts normally have both intended and unintended consequences and the researcher may be able to present a new picture of the actor’s lifeworld.

Findings of unintended effects become interesting to the user as a result of the research, and what users see as uninteresting may be very interesting to the researcher, because they have different horizons of interest.

During this second-order construction I developed an understanding of what is missing from the field of voice control and decided upon a design opening and a design proposal, which I then intended to validate through workshops.

During the workshops we used the methods of role-play and bodystorming to validate some of the design suggestions and get feedback. Four workshops were conducted, with largely the same comments and results.

During the bodystorming workshops, a location was selected that is identical or similar to the original environment where the interaction idea would be used, such as in a car, in a noisier public environment, or in a semi-public, quieter place. The participants took part in a role play in which we looked at the way they would interact during a call in a natural situation, and directly compared it to how they react when using the interactions I have developed.

I chose this approach because bodystorming permits immediate feedback for generated design ideas, and can provide a more accurate understanding of contextual factors (Danholt, 2005; Oulasvirta et al. 2003). Awareness of these contextual attributes was seen as a prerequisite for introducing ubiquitous computing products into everyday activities.

After the workshops I drew conclusions, related the evidence to scientific literature and pointed out ways to further develop the ideas.


Current works, advantages and concerns

Right now the user groups that take full advantage of voice command functions are mainly blind or upper-body impaired people (Azenkot and Lee, 2013; Karat et al. 1999). Several studies focus specifically on this user group and how they take advantage of the new natural language interfaces, speech recognition and voice control. I think blind or disabled people can in this case be viewed as early adopters, since they have a hard time using traditional input/output methods and would be the first to take advantage of other interaction solutions, but even sighted users are able to achieve faster input through voice modalities.

These findings translate especially well to sighted users if they are in a situation where keeping their eyes on their phones would cause a significant disadvantage or even endanger them, such as while driving or biking, or any time the user’s hands or eyes are busy. Distracted driving is a major problem that leads to unnecessary accidents and human casualties everywhere in the world. The ubiquity of mobile phones is one cause of distracted driving (Lindqvist and Hong, 2011).

In cases of simultaneously receiving a call or communicating with others and scheduling an event or searching for a fact, dual-purpose and social speech modalities are a compelling idea. They make it possible to speak on the phone or interact with people and at the same time take care of scheduling, or find a phone number or a fact, without trying to talk and search for data on the phone at the same time and putting the social activities “on hold” (Lyons et al 2004).

One of the main concerns about voice-based user interfaces is privacy. Internet privacy has lately become a burning question that is often the topic of news, and it has continuously been asked whether our private information is really private and whether all information entered into social media can be found by strangers; but voice commands generate a much more obvious privacy question of others simply overhearing the information we provide (Staddon et al. 2012; Stutzman et al. 2011).

During interviews, the difficulty of switching between different languages when speaking and the pronunciation of names and nicknames in Contacts were raised repeatedly. These are both language-related concerns: although speech recognition and voice control are available in many languages, switching between those languages is not an easy task, and neither is connecting names with their pronunciations. Natural language interfaces have lately become able to adapt to the pronunciation of the user, but this is still not enough to adapt to contact names.


Below I would like to address these advantages and concerns with examples of current works and solutions provided, and take a look at existing and functioning popular services.

Input modalities

First I address quick and efficient text input. Studies show that typing is much slower than voice input; the preference for keyboards was based on the belief that they offer a more accurate way of input, but speech recognition and natural language user interfaces have been refined, are able to learn and adapt to the specific user, and achieve greater accuracy than ever before, and as such speed up input. Yet despite the greater speed, the verbal input method is still not widely adopted.

Voice can also be a more expressive and efficient source of information than text, partly because it places less cognitive demand on the speaker and permits more attention to be devoted to the content of the message (Chalfonte et al. 1991). The intonation in speech also provides many implicit hints about the message content.

Services like Google Now, Microsoft Cortana and Siri are the most well-known and most powerful natural language user interfaces. They are able to adapt to the pronunciation of the specific user and retrieve data from searches and other apps installed on the phone, making them more “intelligent” than before, but they are still not perfect, and that discourages people from using them.

The most common user group that takes full advantage of the capabilities of natural language interfaces and voice commands are blind or upper-body impaired people, but in my opinion this user group is an early adopter group, and the findings can be reused in situations where the user is sighted but his or her hands or eyes are busy. I investigated scenarios connected to the speed of input, which led to the realization that much of this time is actually spent on correction rather than on speech input itself. Azenkot and Lee (2013) found that speech input is nearly 5 times as fast as the keyboard for blind users, and in situations that do not allow users to look at the screen, the result is the same for sighted users. Many commercially available products promise 4-5 times faster input for dictating e-mails and SMS’s. While participants of Azenkot’s and Lee’s study were mostly satisfied with speech input, editing recognition errors was frustrating. Participants spent an average of 80.3% of their time editing, and Karat et al (1999) further showed that sighted users with the possibility of visual editing spent about 66% of their time editing the output on a desktop system. This implies that although dictation is faster and more efficient than typing, at least when it comes to shorter messages, the editing functions are more complicated and take more time to use (Cox et al. 2008).


One existing service that offers a compelling solution for faster error correction is Dragon Naturally Speaking. Dragon Naturally Speaking is a desktop application, and it offers several ways of editing dictated text: a correction menu, a spelling window, or playback-aided correction. If you say the command “correct”, it brings up the correction menu, which offers a numbered list of statistically likely variations, and the user can choose by number. If the right version of the word is not in the list, “spell” opens the Spelling Window, which gives the possibility of spelling out the right word. “Play back” initiates the playback feature, which replays the previously recorded dictation. This is useful for longer texts, if the user forgets what the word should be. Much of the correction can be done without using manual input or looking at the choices, but not all.
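
To make this correction flow concrete, the sketch below shows how such a voice-driven correction dispatcher could be structured. It is a hypothetical illustration written for this thesis, not Nuance’s implementation; the command names mirror the “correct”, “spell” and “play back” commands described above, while the function names and behaviour are assumptions.

```python
# Hypothetical sketch of a dictation-correction dispatcher. The three commands
# mirror the Dragon Naturally Speaking modes described above, but the logic is
# illustrative only, not the product's actual implementation.

def correction_menu(alternatives):
    """Offer a numbered list of statistically likely variants; the user picks by number."""
    for i, word in enumerate(alternatives, start=1):
        print(f"{i}. {word}")
    choice = int(input("Choose a number: "))
    return alternatives[choice - 1]

def spelling_window():
    """Fallback when the right word is not in the list: the user spells it out."""
    return input("Spell the word: ").strip()

def play_back(recorded_dictation):
    """Replay the original dictation so the user can recall what the word should be."""
    print(f"(playing back audio: {recorded_dictation!r})")

def handle_correction(command, alternatives, recorded_dictation):
    """Dispatch one spoken correction command to the matching correction mode."""
    if command == "correct":
        return correction_menu(alternatives)
    if command == "spell":
        return spelling_window()
    if command == "play back":
        play_back(recorded_dictation)
        return None
    raise ValueError(f"unknown correction command: {command}")
```

The point of the sketch is that most of the flow can be driven by voice alone; only the fallback spelling step needs character-level input, which matches the observation that not all correction can be done hands- and eyes-free.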

Siri also has a function that is supposed to ease the correction of dictated text. During dictation it underlines words that are statistically likely to be incorrect, and the autocorrect function can be used for correction. In that case, however, the right word has to be chosen manually, which requires both visual and manual attention. Before sending the text, the system reads back the dictated text, which helps filter out mistakes without getting visually engaged.

Driving and home

Distracted driving is a major problem that leads to unnecessary accidents and human casualties everywhere in the world. The ubiquity of mobile phones is one cause of distracted driving, and many participants reported having voice interaction as a backup solution at these times. For example, during a typical daylight moment in the US in 2011, 5% of all drivers were using a hand-held phone while driving and 1.3% were visibly manipulating a mobile phone while driving (Traffic Safety Facts, 2013).

Lindqvist and Hong (2011) described several solutions for mobile phones that do not distract users while driving, embracing the idea of context awareness and implementing burden-shifting, time-shifting, and activity-based sharing. In their study, they describe a system that can automatically send SMS’s based on GPS tracking and travel speed, as the equivalent of the scenario where someone wants to communicate their current location for a pick-up. Activity-based sharing means that, through a social-network type of communication, the caller can be aware of when the user is driving or be automatically transferred to voice mail, and the user only receives the notification of the incoming SMS or voice mail after they have stopped driving. The service accommodates emergency calls. This service is highly graphical-user-interface based and raises the question of how a phone knows whether the user is driving or is a passenger, but the basic ideas about avoiding distraction while the user must primarily attend to events in their environment are useful for the purposes of my thesis. It has also been found that speech and music in the background and peripheral auditory cues can provide an awareness of messages or signify events without requiring one’s full attention or disrupting the foreground activity. Audio easily fades into the background, but users are alerted when it changes.
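
As an illustration of the burden-shifting and time-shifting ideas, the sketch below infers a “driving” context from travel speed, defers message notifications until the vehicle has stopped, and sends an automatic reply in the meantime. It is my own simplification for this thesis; the threshold, contact list and messages are assumptions, not details of Lindqvist and Hong’s system.

```python
# Illustrative sketch of context-aware notification deferral while driving.
# Thresholds, messages and behaviour are assumptions made for the example,
# not values from Lindqvist and Hong (2011).

DRIVING_SPEED_KMH = 20          # above this speed we assume the user is driving
EMERGENCY_CONTACTS = {"112", "911"}

class DrivingAwarePhone:
    def __init__(self):
        self.speed_kmh = 0.0
        self.deferred = []          # notifications held back while driving

    def update_speed(self, speed_kmh):
        """Called with new GPS speed readings; stopping releases held notifications."""
        self.speed_kmh = speed_kmh
        if not self.is_driving():
            self.flush_deferred()

    def is_driving(self):
        return self.speed_kmh >= DRIVING_SPEED_KMH

    def on_incoming(self, sender, message):
        """Defer non-urgent notifications while driving; emergencies always get through."""
        if sender in EMERGENCY_CONTACTS or not self.is_driving():
            self.notify(sender, message)
        else:
            # Burden-shifting: hold the notification and tell the sender why.
            self.deferred.append((sender, message))
            self.auto_reply(sender, "I'm driving right now, I'll get back to you when I stop.")

    def flush_deferred(self):
        for sender, message in self.deferred:
            self.notify(sender, message)
        self.deferred.clear()

    def notify(self, sender, message):
        print(f"Notification from {sender}: {message}")

    def auto_reply(self, sender, text):
        print(f"Auto-reply to {sender}: {text}")
```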

Another scenario where voice-user interfaces are used while users must primarily attend to other events in their environment is home automation. There are several variations of home automation systems: popular DIY solutions based on the Raspberry Pi, commercial systems based on Android, and ready-made solutions such as VoicePod. Brush et al (2011) studied how users interact with a voice-user interface system at home. The goal of the study was to understand more generally how people might use speech input to interact with computers in public spaces. The results demonstrated that the participants were interested in speech interaction at home, in particular for web browsing, calendaring and email tasks, although there are still many technical challenges that need to be overcome. More generally, the study suggested the value of using speech to enable a wide range of interactions.

Context aware speech interactions

Lyons et al (2004) have talked about augmenting conversations using dual-purpose speech. In their study, the speech is supposed to occur within the boundaries of a normal conversation; meanwhile, the emergence of sensors and the role of mobile, pervasive devices as sources of context are enabling more and more intelligent environments. It is possible to read various types of context (environmental, situational, personal and social) by using such sensors. I believe that the increasing availability of rich sources of context and the maturity of context aggregation and processing systems suggest that the time for creating conversational systems that can leverage context is here, and that dual-purpose speech could be used automatically during phone conversations, eliminating the need to look up something that has been mentioned during a call.

The scenario in the study is described as a normal flow of conversation while a meeting is being scheduled. The main focus of the study is how the conversation is used normally, while it is also being used to provide input to a computer.

Palviainen et al (2013) examine how mobiles can be turned into social devices that enhance interaction between people and support face-to-face, co-located interactions. In this concept, the use of natural language and speech as a socially embedded interaction modality is emphasized. The study involves a handful of scenarios; in all of them the mobile phone is context aware and aware of other known users around it. The scenarios range from recognizing whether someone else has the same application installed as we do, to answering a question asked during conversation, to making calendar event reminders and notes. Of course, issues of privacy and interruption emerged, along with concern about what the phone says aloud, and in what context and environment.

Privacy concerns

Privacy concerns about voice commands are often raised, and they relate both to authentication and to others in the user’s environment overhearing dictated or arriving messages. These privacy concerns can cause users to turn away from voice-controlled applications, even if using them would ease the task performed.

Staddon et al (2012) and Stutzman et al (2011) found a significant correlation between privacy concerns and user behavior. Boyd and Hargittai (2010) provide a longitudinal study on privacy settings and find significant behavioral evidence of privacy concerns even among 18-19 year olds. Staddon et al find that engagement is directly related to privacy concerns and that users with fewer privacy-related issues are more likely to engage with the system. Although these studies reflect on social networks, similar privacy concerns occur when using a voice user interface.

Authentication concerns have been addressed by Zhu et al (2009), who found that security and privacy mechanisms are becoming a major obstacle for users of hands-free speech applications. The problem that someone overhearing a user performing speech-based authentication gains an easy-to-exploit door to identity theft has been addressed, and an indirect speech-based authentication system has been proposed, which takes advantage of the computer-to-user channel to pose questions to the user. Users then respond to these questions by combining a system-provided detail with information contained in their secret password. Di Crescenzo et al (2007), on the other hand, describe an algorithm that can be used to uniquely recognize someone’s voice and use it for authentication together with a password. The study states that the human voice has a great but perhaps unrealized potential as a simultaneously very usable and very secure biometric factor for entity authentication.
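
To picture how indirect, overhearing-resistant authentication can work in principle, here is a toy challenge-response sketch. It is my own illustration of the general idea (combining a system-provided detail with a fragment of the secret), not the actual protocol from Zhu et al (2009); the PIN, prompt and arithmetic are assumptions.

```python
# Toy illustration of indirect speech-based authentication: the system poses a
# question over the computer-to-user channel (e.g. an earphone), and the user
# answers by combining the system-provided detail with part of their secret,
# so the spoken answer alone does not reveal the password. This is an
# illustrative assumption, not the scheme proposed by Zhu et al (2009).

import random

SECRET_PIN = "4721"   # hypothetical user secret; never spoken aloud in full

def make_challenge():
    """Pick a random PIN digit and a random offset; ask for their sum."""
    position = random.randrange(len(SECRET_PIN))
    offset = random.randint(1, 9)
    prompt = f"Add {offset} to digit number {position + 1} of your PIN and say the result."
    expected = int(SECRET_PIN[position]) + offset
    return prompt, expected

def verify(spoken_answer, expected):
    """Accept the response only if it matches the expected combined value."""
    return spoken_answer.strip() == str(expected)

prompt, expected = make_challenge()
print(prompt)                                  # delivered privately to the user
# A bystander who overhears only the spoken number learns neither the digit nor the PIN.
print("accepted" if verify(str(expected), expected) else "rejected")
```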

Love and Perry (2004) examine the reaction of bystanders to mobile conversations. They state that despite varied expressed views on embarrassment, discomfort and rudeness, patterns of behavior were remarkably similar. All participants performed visible disengagement; although they were demonstrably not attending, all of them were able to remember the precise content of the conversation they had overheard, and the bystanders used other social mechanisms to diffuse the perceived intrusiveness of the call and to grant “permission” for the intrusion, such as frequent glances of acknowledgment and a final closure when the calls ended. The embarrassment that bystanders feel when overhearing a private conversation can also appear when they overhear voice commands performed by a caller or a private SMS being dictated.

Several hardware solutions address these privacy concerns by lessening the audibility of commands, dictations or conversations.

Yuksel et al (2011) propose a design for a basic mobile phone focused on the essence of mobile communication and connectivity, based on a silent speech interface and auditory feedback. The paper states that this device would utilize low-cost and commercially available hardware components, so it would be affordable and accessible to the majority of users, while discarding disadvantages such as background noise and the privacy and social acceptance issues associated with voice control systems. The proposed device is the size of a headset, taking input through silent speech and providing auditory feedback through earphones. An SSI, or silent speech interface, identifies the phonemes that an individual pronounces without actually using the sound of their vocalization. This allows speech processing for synthesizing the actual speech, or recognition of different commands, even in the absence of an intelligible acoustic signal, such as while whispering inaudibly.

Huo et al (2012) present a new wireless and wearable assistive technology called the dual-mode Tongue Drive System (TDS). The TDS detects users’ tongue motion using a magnetic tracer and an array of magnetic sensors embedded in a headset. Although it is aimed at disabled people and meant to improve computer access, the device could be used for mobile technology. The study showed that speech recognition used at the same time as the TDS made navigation significantly easier than with the TDS alone, with the TDS serving as a pointer device controlled by the user’s tongue. In my case this device could also be used for basic control of mobile devices instead of audible commands.

Heracleous et al (2005) presented the Non-Audible Murmur (NAM) microphone, focused on automatic speech recognition. A NAM microphone is a special acoustic sensor attached behind the talker’s ear, able to capture very quietly uttered speech (non-audible murmur) through body tissue. The same paper also presented a new Silicon NAM microphone, which during testing reached over 93% accuracy. In situations where privacy in human-machine communication is preferable, NAM microphones can be effectively applied for automatic recognition of speech inaudible to listeners near the talker. On the other hand, in real-life noisy environments the accuracy drops due to modified speech production. The lack of auditory feedback in noisy conditions makes the speaker try to increase the intelligibility of their speech, and that leads to several changes in speech characteristics: speech intensity increases, the fundamental frequency and formants shift, vowel durations increase and the spectral tilt changes. This is called the Lombard reflex.

Overall, these studies show that privacy concerns are a strong alienating factor: even though it is by now well known that many websites collect information, the feeling of someone overhearing the same information has a stronger effect. At the same time, the person overhearing the information also experiences discomfort. Several hardware solutions exist that can help handle these issues, but all of them are very specific and are neither widely available nor widely used, so in my thesis I propose interaction design solutions instead of hardware solutions.

Role of mobile devices

Palviainen et al (2013) have already discussed social devices and how mobiles can be built into social situations and conversations, together with the concerns this raises. Strom (2002) also discusses how mobile devices can be used as props to convey a specific impression to people around, and how this restricts or opens up the body language of the user. The article divides devices into cold and warm devices depending on whether they invite other users to take part in the activity, like cameras that invite people to pose and look at the image taken, or rather send the message that the user would not like to be approached. According to this categorization a phone is a warm device because it may invite people in, especially if it contains a camera, but when a phone is used with a headset it becomes a cold device, like music players, and signals that the user is busy. Often people want to take their calls undisturbed, and in that case the phone must be visible to avoid possibly embarrassing situations. This leads to the phenomenon of people walking around with the phone in hand despite talking into a headset that is designed to free up their hands.


Interviews

First set of interviews

After investigating the current works, I identified the big problem areas in existing solutions, and I performed the first set of unstructured interviews to see how people relate to these problems and to validate my findings. During the interviews I asked questions about how they use their mobile phones in general, in what situations they think it would be useful to have voice interaction, and in what situations they actually use voice interactions. Then I asked for details about what they think the shortcomings of current solutions are and what makes them not choose them and not make use of the existing functions.

I started the interviews by describing what my thesis is about and that I am interested in their subjective opinions of and experiences with such interactions.

The identified problem areas during the interviews were:

1. Language-related problems with both traditional and speech input

2. Privacy issues relating to voice user interfaces

3. Distrust of technology connected to speech recognition

4. Difficulty of multitasking while on the phone

Below are detailed descriptions and partial transcripts of interviews that validate these findings.

Language related problems

During the interviews, one of the major problems brought to light was language-related issues. Six participants said it is hard to switch between different languages for voice input, or even text input, on many of the larger mobile operating systems. If someone is bilingual, which is very common in the environment where I conducted my studies, this causes problems that make users more likely to abandon voice interface solutions.

“It is difficult to switch between languages in speech recognition”

At the same time, contact names have also proven problematic across languages: where the pronunciation of a name does not match the expected English pronunciation, this results in calling the wrong contact.


“The old voice commands required you to record the sound of the names, but recently they tried to be smart and add voice recognition to names based on how they think the name should sound like which almost never matched the right pronunciation”

Finally, it came up that voice user interfaces are only available in common languages, not in less common ones.

“I would use it for chat, but it is not available in Serbian, but I mostly chat with people in Serbian. “

Privacy issues

Four participants said that dictating something longer or more private would feel too awkward. It makes them uncomfortable even when people nearby merely overhear the name of the person they are trying to call, and they felt that dictating messages would intrude on their privacy to an extent they would not sacrifice even for a safer way to communicate while driving.

“Would be awkward to dictate an e-mail, and it is more sensitive”

“You need these commands most when you are on the bus, or holding something, and that is exactly when I felt embarrassed to use it”

One participant tried to change the system by using a “code word” for interactions, which has no connection to the task they wish to perform, which shows the need but also the limitations of current systems.

“I used the phrase blå burk to answer the phone”

Distrust of technology

All participants were to some extent unfamiliar with recent developments in existing services and had previously experienced problems with the accuracy of voice recognition; therefore they use it in a very limited way or have been discouraged by failure, and they harbor a general distrust towards natural language interfaces.


“I don’t find voice commands so useful, I think typing is more accurate than saying a command, although it has been quite long since I tried it”

“If the recognition would be more accurate I would maybe use it, I tried it before, it didn’t work so well, and it pushed me away, I never tried it since”

All participants said that they had tried voice recognition and even used it for a while, but it just did not seem “smart” enough to keep using, and now they do not find a way to fit voice commands into their daily lives.

“I was using it before when I had some older phone… I used it mostly while I was biking, and wearing a headset”

“I don’t use voice recognition at the moment, before, I used it to start calls”

“I only used it for starting calls, but I never think about possibilities what it can be used for”

Difficulty of multitasking

All participants mentioned that it would be great to have alternative ways of interacting with the phone in two situations: while on the phone with someone, and while driving. Everyone mentioned one or both.

Four participants owned or frequently used a car, and of those, two use their phones without any hands-free option to answer calls and even SMS or chat messages. They stated they only do this because they do not feel they have a better solution; if there were one, they would use it.

“If I drive a car, and in situations where you cannot use your hand would be useful to have voice interaction”

“I answer SMS’s by typing on the on-screen keyboard while driving, I know it is unsafe, but it is the quickest”


Two participants particularly mentioned the problem of interacting with the phone while being on the phone, but when asked, all participants confirmed the issue. The two participants said they specifically keep an offline paper calendar or notebook so they can talk on the phone while confirming appointments or taking notes.

“I only keep a calendar because it sounds unprofessional if I take too long to answer whether an appointment at a certain time fits me”

Conclusions

I have confirmed the practical existence of some of the problem areas identified by looking at existing works, such as the privacy problem with voice interactions, which makes participants sacrifice the usefulness of the technology and possibly use less safe methods in order to stay more private. I have also confirmed that there is a desire for development, a design opening, in the area of multitasking while driving and while on the phone, and I have chosen this area for further investigation and for developing interaction solutions.

Furthermore, I identified two main obstacles that stop the spread of voice user interfaces: the lack of an easy way to switch between languages (and the lack of available languages), and distrust of the technology caused by previous bad impressions.

Second set of interviews

During the second set of interviews I showed some videos of existing solutions to see what people think of contemporary products. I wanted to see whether the distrust of technology would be easy to lift and whether the participants would be more open to new solutions given the current state of the technology.

Finally, I showed the video from the study of Lyons et al (2004), which talks about augmenting conversations using dual-purpose speech. In the video a device on the head is used to show the person the calendar view and through dual-purpose speech, a calendar entry is created.


Contemporary products

I showed the promo video of Dragon Naturally Speaking and explained that speech recognition is currently better than what the participants experienced when they last used it, and that such products use algorithms to individualize the voice recognition, so that it improves with time and accommodates individual variations.

The reactions were overall positive; all participants said they would use this or similar products on mobile phones if it were really as accurate as in the video, but all of them also expressed disbelief and doubted that it would work so well right out of the box.

“It’s perfect, if it’s really this accurate”

“If it is efficient, I would use it all the time, it is a matter of habit for me.”

It was also expressed that despite the seemingly fluid interaction, they would still not dictate a longer message to their phones, only shorter ones, like an SMS.

“I would use it for short messages while driving, but I wouldn’t dictate a very long message”

Afterwards, I showed a video from VoicePod on home automation, where the natural language interface is quite well developed; in the promo video the interactions are smooth and blend into everyday life.


The reactions were again quite positive, but contained some doubt over whether the natural language interface is really this good and would work this well, or whether it was just an edited video. All participants said that they would only use a system like this for basic interactions.

Dual-purpose speech

I then showed the video about augmenting conversations using dual-purpose speech. The video is from 2004, and it shows nearly normal speech being accommodated while an appointment is simultaneously booked in the calendar of a PDA.

Reactions varied: all participants liked the possibility of quick interaction blended into speech, but were taken aback by the hardware shown. It was clear that none of the participants would want to wear hardware like this, that they did not see having to push a button as viable, and that they especially did not think it would make real-life conversations smoother or easier; but they did see the value of the interaction itself.

“A bit weird to have this thing in the head, but if its more hidden, it can be pretty cool”

“It could be much faster to handle calendar entries and that is pretty cool, but without the headgear, just on the phone.”

Conclusion

I concluded from the second set of interviews that changing participants’ minds and earning, or earning back, the trust of users may be very difficult and is a continuously ongoing task. Even if the product is capable of learning, participants expect seamless, accurate and intuitive interactions out of the box, and not providing this immediately may put them off voice user interfaces for good.

Furthermore, I noted the positive response to dual-purpose speech and began to investigate ways of implementing similar solutions with contemporary technology.


Concept design

During the concept design phase I came up with a couple of different voice interaction solutions for the established problems, along with different interaction ideas that could be viable, and validated them through interviews, or through interviews combined with role-playing or bodystorming exercises.

Concepts and validations

Inaction is an action

The first concept was inspired by a web- and SMS-based safety service called Kitestring. There are several safety apps available for mobile phones, but in every case alerting emergency contacts requires some action that people under high pressure or in trouble may not be able to perform. Kitestring, on the other hand, reacts to inaction. It is based on the idea of setting up trips with estimated travel times and checking in on safe arrival. If the person fails to check in and fails to reply to a reminder SMS, Kitestring alerts the emergency contacts with a custom SMS message.
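
The check-in logic behind such a service is simple to express. The sketch below is a minimal illustration of the “inaction triggers the alert” idea loosely modelled on the description above; all timings, messages and method names are my own assumptions, not Kitestring’s actual behaviour.

```python
# Minimal sketch of an "inaction is an action" safety check-in, loosely modelled
# on the Kitestring idea described above. Timings, messages and structure are
# illustrative assumptions, not the real service's behaviour.

import time

class SafetyTrip:
    def __init__(self, duration_s, grace_s, emergency_contacts, alert_text):
        self.deadline = time.time() + duration_s     # when the user should have checked in
        self.grace_s = grace_s                       # extra time allowed after the reminder
        self.emergency_contacts = emergency_contacts
        self.alert_text = alert_text
        self.checked_in = False
        self.reminded = False
        self.alerted = False

    def check_in(self):
        """Called when the user confirms they arrived safely; cancels any pending alert."""
        self.checked_in = True

    def tick(self):
        """Poll periodically; silence past the deadline and grace period raises the alert."""
        if self.checked_in or self.alerted:
            return
        now = time.time()
        if now >= self.deadline and not self.reminded:
            self.send_sms("user", "Are you safe? Reply to check in.")
            self.reminded = True
        elif self.reminded and now >= self.deadline + self.grace_s:
            for contact in self.emergency_contacts:
                self.send_sms(contact, self.alert_text)
            self.alerted = True

    def send_sms(self, recipient, text):
        print(f"SMS to {recipient}: {text}")
```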

I connected the idea of “inaction is an action” to the privacy questions raised by voice control and the distrust in the accuracy of speech recognition. I proposed an idea where the mobile device is allowed to start communication and perform tasks unless the user actively stops it: a solution where inaction is the confirmation of an action.

This solution could be used with headsets or in cars, where initiating communication with a mobile device may be distracting, or where the feeling of others overhearing the voice commands made participants uneasy. The interviews showed that the majority of drivers want to use their mobile devices while driving and would engage in calls or even SMS messages. Since users want to see arriving SMS messages and would engage in calls at least to the extent of wanting to know who is calling, I drew the conclusion that in the majority of cases they would confirm, if asked, that they want their messages or the caller’s name read aloud.

Based on this conclusion I proposed a solution where the interaction is automatically initiated by the mobile device: it does not require a confirmation, but requires an interruption in order not to read aloud an SMS message or the name of the caller. The interruption can also be an audible conversation, in which case the phone displays context awareness by not breaking into the conversation with speech of its own, but instead using an unobtrusive notification sound. Palviainen et al (2013) showed that this is the preferred interaction in social situations, and that users can be concerned about the device speaking at the wrong time.
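
A rough sketch of the proposed behaviour is shown below: the phone announces an incoming SMS or caller and goes ahead with reading it aloud unless the user interrupts within a short window, and it falls back to an unobtrusive chime if a conversation is already going on. The timing, function names and the way interruption is detected are assumptions made for the illustration.

```python
# Sketch of the proposed "inaction confirms the action" behaviour: announce the
# incoming item and read it aloud unless the user interrupts within a short
# window; if a conversation is detected, play a soft chime instead of speaking.
# The timing and the interruption hook are illustrative assumptions.

INTERRUPT_WINDOW_S = 3.0

def speak(text):
    print(f"[TTS] {text}")

def play_chime():
    print("[soft chime]")

def handle_incoming(description, text_to_read, conversation_detected, wait_for_interrupt):
    """Announce the event; inaction within the window means the content is read aloud."""
    if conversation_detected:
        play_chime()          # context awareness: do not talk over an ongoing conversation
        return
    speak(f"Incoming {description}. Say stop to skip.")
    if not wait_for_interrupt(INTERRUPT_WINDOW_S):
        speak(text_to_read)   # no interruption, so inaction confirms the read-aloud

# Example: no conversation and no interruption, so the SMS is read aloud.
handle_incoming("SMS from Anna", "Running ten minutes late, see you soon.",
                conversation_detected=False,
                wait_for_interrupt=lambda timeout_s: False)
```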

Validation

To validate this idea I conducted interviews and staged a situation in which people are sitting in a car, driving, and simultaneously getting phone calls and SMS’s. Whenever the participant got an SMS or a call, I read aloud the caller’s name or the content of the SMS.

During the interviews and the staged situations, it became clear that this is only a good option if the user is the only one who can hear the phone’s side of the interaction. Privacy concerns flared up strongly, and people felt out of control of the situation, which ended up distracting them more than the mere curiosity about the identity of the caller or the contents of an SMS message.

“An SMS I wouldn’t expect to be read aloud”

“If I hear who is sending it and then I have to say stop, it is almost the same if it would have been read. My discomfort would be obvious.”

The usage scenario where people are alone and have a headset on, such as while biking or taking public transport, is viable, and the feedback on that was positive, since it does not bring up the same privacy issues if only the owner can hear it. But the main purpose was for users to be able to use it while driving, and in that sense it failed.

Context aware on call speech recognition system

The second concept was inspired by the positive response to the dual-purpose system of Lyons et al (2004) and the idea of context awareness in the paper by Palviainen et al (2013). During the interviews, the other problematic area in the set of problems related to multitasking was the issue of being on the phone at the same time as needing to interact with the phone. Since the interaction ideas of the dual-purpose system were viewed positively, but the hardware received negative feedback, I set out to look at how to implement similar interactions on modern hardware that blends into everyday life.


I proposed a dual-purpose, context-aware speech system that could be used while on the phone for entering calendar entries or sending the phone numbers of other contacts. These are the functions for which participants would typically have to put the person they are talking to on hold, take the phone away from their ear and hold it up to look up information. My solution would have the phone “listening in” on calls and offering options depending on the context of the conversation. If the caller talks about a specific person, the phone can perform a quick search in Contacts and the contact details of that person can pop up; for a certain place, the phone can search and offer navigation by car or public transport depending on the default setting; appointments can be offered to be scheduled at the right time. Since the contacts, calendar and often the navigation app all work offline, no Internet-wide search is necessary for this, and the results do not end up in the browser history. The tips pop up silently, so as not to disturb the phone call itself, since it is assumed that if users want to interact with the phone during the conversation, they will look at their phone screens and notice the interaction proposal displayed.
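
To make the concept more tangible, the sketch below shows one way the suggestion logic could work on a fragment of the call transcript: it matches contact names, weekday words and known street names and turns them into silent, dismissible suggestion cards. The data, matching rules and card format are all assumptions for illustration; a real implementation would sit on top of proper speech recognition and language understanding.

```python
# Rough sketch of the proposed on-call, context-aware suggestion logic: the
# transcript of the ongoing call is scanned for contact names, dates and known
# street names, and matching items are offered as silent on-screen suggestions.
# Data sets, matching rules and the suggestion format are illustrative
# assumptions, not a finished design.

import re

CONTACTS = {"anna": "+46 70 123 45 67", "peter": "+46 70 765 43 21"}
KNOWN_STREETS = {"storgatan", "amiralsgatan"}
DATE_PATTERN = re.compile(r"\b(monday|tuesday|wednesday|thursday|friday|saturday|sunday)\b")

def suggestions_for(transcript_fragment):
    """Return silent, dismissible suggestion cards for one fragment of the call."""
    words = transcript_fragment.lower()
    cards = []
    for name, number in CONTACTS.items():
        if name in words:
            cards.append(("contact card", f"{name.title()} {number}", "send to caller?"))
    match = DATE_PATTERN.search(words)
    if match:
        cards.append(("calendar", match.group(1), "show availability / add event?"))
    for street in KNOWN_STREETS:
        if street in words:
            cards.append(("navigation", street.title(), "preview route from home?"))
    return cards

# Example fragment of a call; the caller asks for a number and proposes a day.
for card in suggestions_for("Do you have Anna's number? Are you free on Friday?"):
    print(card)
```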

I conducted interviews and prepared a low-fi, hand-drawn prototype of the interaction to better test the idea through a staged bodystorming validation process and to help users understand how the interaction works.

Validation

I performed a set of workshops in semi-public environments, such as a park and a restaurant, and in private environments, such as the home of one of the participants. During testing and validation, we performed a “phone call”, first the usual way, and then with my interactions.


During the conversation, I:

1. Asked the person to provide me with the phone number and contact details of a mutual friend,

2. Asked if s/he was available on a certain date, and invited her/him to an event,

3. Gave him/her the address of the event.

During the first conversation, when I asked people to do everything as they normally would, all of them had to stop the call in order to send me the contact details of the mutual contact, and then call me back.

The availability check resulted in the call being put on hold or broken off: either I had to stay on hold for minutes while the person checked her/his availability in an offline calendar, or the call was ended so the phone calendar could be used and I was called back. When the invitation was received, the recipient either asked me to send a confirmation SMS or kept me on hold for several minutes while the appointment was placed in the calendar.

When the address was received, all of them asked me to send an additional text message about the address.

This confirmed the findings of the interviews: conversations on the phone are quite inefficient when it comes to confirming or providing data that is itself only available through the mobile phone.

After seeing this, I continued with the second set of conversations. I repeated the same requests, and I described the background of the speech recognition system to put into context the interactions received during the conversation.

To deliver the interaction, a low-fi paper prototype was used.

When I asked the person to provide me with the number of a mutual contact, the phone automatically identified the name of the person from the contact list and offered the contact card. The person in conversation could send the contact card with one push of a button, and a confirmation message showed up, confirming that the number had been sent.


When I asked the person about her/his availability, the phone listened for the cue word “calendar” followed by a date, and showed the date in question. The calendar on the given date was empty, and the user could quickly confirm, without putting me on hold, that they were indeed available.

The date was followed by a time, and since the application listens for this particular combination of information, an event could be added for the specific date and time mentioned, provided the action was confirmed and the entry did not exist before.

When I gave the address of the event, the phone listened in, recognized the street name as a nearby destination contained in the GPS navigation application, and provided a rough navigation route from the user’s home to the place mentioned. Since the address followed an event that had just been added, the address could now be added to the event or to the contact card of the person on the line.


These interactions pop up silently, so the call itself is not disturbed by the information. If there is no reaction, the suggestion is automatically dismissed after a short time.

The reception of this idea was overwhelmingly positive. It was viewed as something that can make conversation on the phone easier and more fluent.

“I would like my mobile to offer guesses about what I am talking about, worst case they are wrong and I cancel them or do nothing about it.”

It was viewed as especially useful for important job-related calls, where you seem unprepared if it takes a long time to answer whether a proposed appointment works for you.

It was also confirmed that the interactions should not have sound notifications; every participant said that these would bother and distract them during phone calls, even if it were just vibrations.

A few concerns also came up during the discussion that followed the testing: for the mobile to be able to find contacts on the phone, the contact has to be set up correctly, with mobile, home and work phone numbers under the correct labels, and the name of the contact has to be pronounced correctly. This is a problem that came up before in a different context as well, and the only current solution to it is to set up a recording of the pronunciation of the name, as was done in previous, outdated phone operating systems.


It was mentioned that none of the interaction confirmations should be verbal; they should all be manual, using a button, to avoid confirmations being made accidentally, nor should the phone have any audible interactions during the call.

The possibility of searching the Internet through a context-aware system brought up privacy concerns: participants would feel almost as if their conversations were recorded, with certain words or phrases being searched for and remaining in their search history. Distrust of the technology was also raised; since many participants generally found voice interactions inaccurate, they were unsure whether this concept would be accurate enough. At the same time, they saw that the solution would significantly ease interaction with the phone during a call if it works, and if not, users are simply back to putting the caller on hold as before. Since they take the phone away from their ears anyway while looking for the information the caller asked for, they either see a popped-up suggestion they can use, or they look up the information as they normally would. The technology can only add to current functionality; if it does not work fully, it still does not hinder the existing functions.

Since the participants all confirmed that they see the use of this kind of context-aware interaction, and generally showed enthusiasm, I felt the concept was successful and validated.


Conclusions

In examining the subject of voice-user interfaces, I have looked at contemporary ideas and existing applications. Looking at these applications and concepts, I ordered them into different topics to address the possibilities and limitations of the technology. This showed that speech recognition as an input method may be fast, but it is not widely used, and one of the concerns about it is the correction of inaccurate output. While driving, at home and in various social situations, voice-user interfaces would be very useful but are still not used. I then moved on to confirm these findings through interviews. The first set of interviews revealed distrust of the technology and that many have negative preconceptions regarding accuracy, but also confirmed that a solution would be viewed positively for occasions such as driving, or speaking with someone on the phone while trying to use other applications, such as the calendar or contacts, at the same time.

This outlined the design opening, and I came up with two concepts for the set of problems. The first tried to overcome the distrust of inaccurate understanding of spoken commands by putting the mobile device in the position of initiating communication. This concept was designed to be used in cars while driving, but it failed during the validation process: users felt out of control and found the idea of the phone reading private messages aloud discomforting.

The second concept tried to find a solution for interacting with the mobile while in a call. This idea addressed the fear of inaccuracy as well, but in a non-disruptive way; that made the participants feel safer, and the reception was positive.


References

ASPERS, P., 2001. Markets in fashion: A phenomenological approach. Stockholm: The City University Press.

AZENKOT, S. and LEE, N.B., 2013. Exploring the use of speech input by blind people on mobile devices, Proceedings of the 15th International ACM SIGACCESS Conference on Computers and Accessibility 2013, ACM, pp. 1-8.

BOYD, D. and HARGITTAI, E., 2010. Facebook privacy settings: Who cares?, First Monday.

BRUSH, A.J., JOHNS, P., INKPEN, K. and MEYERS, B., 2011. Speech@home: an exploratory study, CHI ‘11 Extended Abstracts on Human Factors in Computing Systems 2011, ACM, pp. 617-632.

CHALFONTE, B.L., FISH, R.S. and KRAUT, R.E., 1991. Expressive richness: a comparison of speech and text as media for revision, Proceedings of the SIGCHI Conference on Human Factors in Computing Systems 1991, ACM, pp. 21-26.

COX, A.L., CAIRNS, P.A., WALTON, A. and LEE, S., 2008. Tlk or txt? Using voice input for SMS composition. Personal Ubiquitous Comput., 12(8), pp. 567-588.

CRESCENZO, G.D., COCHINWALA, M. and SHIM, H.S., 2007. Modeling cryptographic properties of voice and voice-based entity authentication, Proceedings of the 2007 ACM workshop on Digital identity management 2007, ACM, pp. 53-61.

DANHOLT, P., 2005. Prototypes as performative, Proceedings of the 4th decennial conference on Critical computing: between sense and sensibility 2005, ACM, pp. 1-8.

DRAGON NATURALLY SPEAKING. Correcting Transcription Errors in Your Dictated Text. Available: http://www.nuance.com/ucmprod/groups/dragon/@web/documents/collateral/nc_008223.pdf [Accessed 2014].

HERACLEOUS, P., NAKAJIMA, Y., SARUWATARI, H. and SHIKANO, K., 2005. A tissue-conductive acoustic sensor applied in speech recognition for privacy, Proceedings of the 2005 joint conference on Smart objects and ambient intelligence: innovative context-aware services: usages and technologies 2005, ACM, pp. 93-97.


HUO, X., PARK, H. and GHOVANLOO, M., 2012. Dual-mode tongue drive system: using speech and tongue motion to improve computer access for people with disabilities, Proceedings of the conference on Wireless Health 2012, ACM, pp. 1-8.

HUSSERL, E., 1954. The Crisis of European Sciences and Transcendental Phenomenology: An Introduction to Phenomenological Philosophy.

KARAT, C., HALVERSON, C., HORN, D. and KARAT, J., 1999. Patterns of entry and correction in large vocabulary continuous speech recognition systems, Proceedings of the SIGCHI conference on Human Factors in Computing Systems 1999, ACM, pp. 568-575.

KITESTRING. Safety with strings attached. Available: https://www.kitestring.io/ [Accessed 2014].

LINDQVIST, J. and HONG, J., 2011. Undistracted driving: a mobile phone that doesn’t distract, Proceedings of the 12th Workshop on Mobile Computing Systems and Applications 2011, ACM, pp. 70-75.

LOVE, S. and PERRY, M., 2004. Dealing with mobile conversations in public places: some implications for the design of socially intrusive technologies, CHI ‘04 Extended Abstracts on Human Factors in Computing Systems 2004, ACM, pp. 1195-1198.

LYONS, K., SKEELS, C., STARNER, T., SNOECK, C.M., WONG, B.A. and ASHBROOK, D., 2004. Augmenting conversations using dual-purpose speech, Proceedings of the 17th annual ACM symposium on User interface software and technology 2004, ACM, pp. 237-246.

MCELHEARN, K., 2013-last update, Beyond Siri: Dictation tricks for the iPhone and iPad. Available: http://www.macworld.com/article/2048196/beyond-siri-dictation-tricks-for-the-iphone-and-ipad.html [Accessed 2014].

OULASVIRTA, A., KURVINEN, E. and KANKAINEN, T., 2003. Understanding contexts by being there: case studies in bodystorming. Personal Ubiquitous Comput., 7(2), pp. 125-134.

PALVIAINEN, J., SUHONEN, K., VÄÄNÄNEN-VAINIO-MATTILA, K., AALTONEN, T. and LEPPÄNEN, T., 2013. Exploring usage scenarios on social devices: balancing between surprise and user control, Proceedings of the 6th International Conference on Designing Pleasurable Products and Interfaces 2013, ACM, pp. 96-105.


RAJPUT, N. and NANAVATI, A.A., 2012. Speech in Mobile and Pervasive Environments. Chennai, India: John Wiley & Sons Ltd.

SANDSTRÖM, E., 2010. Performance Art: A Mode of Communication.

SCHUTZ, A., 1967. The Phenomenology of the Social World. Evanston, Illinois: Northwestern University Press.

STADDON, J., HUFFAKER, D., BROWN, L. and SEDLEY, A., 2012. Are privacy concerns a turn-off?: engagement and privacy in social networks, Proceedings of the Eighth Symposium on Usable Privacy and Security 2012, ACM, pp. 1-13.

STIFELMAN, L.J., ARONS, B., SCHMANDT, C. and HULTEEN, E.A., 1993. VoiceNotes: a speech interface for a hand-held voice notetaker, Proceedings of the INTERACT ‘93 and CHI ‘93 Conference on Human Factors in Computing Systems 1993, ACM, pp. 179-186.

STROM, G., 2002. Mobile Devices as Props in Daily Role Playing. Personal Ubiquitous Comput., 6(4), pp. 307-310.

STUTZMAN, F., CAPRA, R. and THOMPSON, J., 2011. Factors mediating disclosure in social network sites. Comput.Hum.Behav., 27(1), pp. 590-598.

U.S. DEPARTMENT OF TRANSPORTATION, NATIONAL HIGHWAY TRAFFIC SAFETY ADMINISTRATION, 2013. Traffic Safety Facts.

WEBER, M., 1978. Economy and Society: An Outline of Interpretive Sociology. University of California Press.

YUKSEL, K.A., BUYUKBAS, S. and ADALI, S.H., 2011. Designing mobile phones using silent speech input and auditory feedback, Proceedings of the 13th International Conference on Human Computer Interaction with Mobile Devices and Services 2011, ACM, pp. 711-713.

ZHU, S., MA, Y., FENG, J. and SEARS, A., 2009. Don’t listen! I am dictating my password! Proceedings of the 11th international ACM SIGACCESS conference on Computers and accessibility 2009, ACM, pp. 229-230.
