
DEGREE PROJECT IN COMPUTER SCIENCE, FIRST LEVEL
STOCKHOLM, SWEDEN 2015

Usability and enjoyability of natural language interface technology in computer games


Robert Alnesjö
Johan Darnald

Supervisor: Richard Glassey
Examiner: Örjan Ekeberg


Abstract

This report examines the application of natural language interface (NLI) technology in computer games in terms of usability and enjoyability. To test this, an NLI is compared to a formal language interface (FLI), using the system usability scale (SUS) to determine usability. Results show that even though the NLI received a lower mean SUS score than the FLI, it was measured to be more enjoyable. It is concluded that using NLI technology is well motivated, since the primary focus of a computer game is enjoyability.


Sammanfattning

The purpose of this report is to examine the application of natural language interface (NLI) technology in computer games with regard to usability and enjoyability. To test this, an NLI is compared with a formal language interface (FLI), using the system usability scale (SUS) to determine usability. The results show that even though the NLI received a lower mean SUS score than the FLI, it was measured to be more enjoyable. The conclusion is that the use of NLI technology is well motivated, since the primary goal of a computer game is to be entertaining.


Contents

1 Introduction
  1.1 Natural language interfaces
  1.2 NLI technology in computer games
  1.3 Evaluation with human resources
2 Background
  2.1 ELIZA
  2.2 NLI for databases
  2.3 NLI in commercial video games
  2.4 Lexical resources
  2.5 System usability scale
3 Method
  3.1 Game
    3.1.1 User interface
    3.1.2 Sentence similarity
    3.1.3 Lexical resources
  3.2 Evaluation
    3.2.1 Evaluation form
    3.2.2 Using the SUS
    3.2.3 Rotations
    3.2.4 Procedure
    3.2.5 Other collected data
4 Results
  4.1 System usability scale
  4.2 Most enjoyable user interface
  4.3 Time records
5 Discussion and conclusions
  5.1 Usability
  5.2 Enjoyability
  5.3 Self criticism
  5.4 Improvements
  5.5 Closure
Appendix A Test resources
Appendix B Tests


Chapter 1

Introduction

1.1 Natural language interfaces

Natural languages can roughly be described as languages that humans have created and use to communicate with each other. Although the definition can vary, the details are of little importance to this study.

The purpose of a natural language interface (NLI) is to understand the semantics of a sentence from a user (where verbs, phrases and clauses act as user interface controls) and respond in context to the input. When this is fulfilled, dialogue with machines can be conducted in natural language, rather than by learning the rules of a formal computer system to formulate instructions. For convenience, all user interfaces that are not NLIs are referred to as formal language interfaces (FLI).

1.2 NLI technology in computer games

As progress is made in the field of natural language processing (NLP), NLI technology becomes an increasingly practical solution for certain systems, computer games among them.


1.3 Evaluation with human resources

Human evaluation can be seen as the best evaluation method for deciding whether a product is suited for its designated task, if that task involves human interaction at some point. Using humans for evaluation is also one of the harder methods: it is time consuming, often expensive, and it can be hard to recruit a group of testers that is large enough and sufficiently representative in areas like age and profession to produce statistically sound data.


Chapter 2

Background

2.1 ELIZA

An early attempt at implementing a natural language interface is ELIZA, made by J. Weizenbaum in 1966 [11]. It responded to user input using pattern matching techniques, answering in context to some identified keyword. Even though it was not a very sophisticated system, it became immensely popular throughout the world.

The pattern matching approach of ELIZA allowed for relevant replies in the case of a pattern hit. Otherwise, some generic response was given to deflect the question (see figure 2.1). This is close to the behaviour sought after in the NLI used in this study, and it is likely achievable as the conditions of the problems are similar.
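To make the idea concrete, here is a minimal Python sketch of ELIZA-style keyword matching; the keywords, responses and fallbacks are invented for this example and are not taken from ELIZA's actual script:

    import random

    # Invented keyword -> response rules; ELIZA's real script was far richer.
    RULES = {
        "mother": "Tell me more about your family.",
        "sad": "I am sorry to hear that you are sad.",
        "dream": "What does that dream suggest to you?",
    }

    # Generic replies used on a pattern miss, to deflect the question.
    FALLBACKS = ["Please go on.", "Why do you say that?", "I see."]

    def reply(user_input):
        words = user_input.lower().split()
        for keyword, response in RULES.items():
            if keyword in words:  # pattern hit: answer in context to the keyword
                return response
        return random.choice(FALLBACKS)  # pattern miss: generic response

    print(reply("I had a strange dream"))  # -> What does that dream suggest to you?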

2.2 NLI for databases

Databases with integrated NLI technology (NLIDB) are systems that allow information stored in a database to be accessed by a user through the NLI. There are many examples of implementations [2], both experimental and commercial.


Figure 2.1: Screen capture from Wikipedia showing ELIZA running on GNU Emacs 21.

2.3 NLI in commercial video games

Usage of NLIs in computer games is scarce, but a few are worthy of mention, for example The Hitchhiker's Guide to the Galaxy [1], Façade [9] and Bot Colony [7]. All of these games are mostly played through a natural language text based interface utilizing some sort of natural language understanding (NLU) algorithm to parse user input.

Even though observing these games is unlikely to yield a good understanding of how they solved problems related to implementation or evaluation, they serve as a good source of inspiration.


2.4 Lexical resources

To make the NLI a useful tool in the game, a reasonably high accuracy of natural language understanding is required. The problem of fully understanding natural language is AI-complete, meaning that it cannot be solved without human computing or artificial general intelligence. The general approach to work around this is to gather large lexical resources, which have been preprocessed by humans, and structure them for processing by a computer.

WordNet® is a lexical database in which nouns, verbs, adjectives and adverbs are grouped into 117 000 sets of cognitive synonyms called synsets. It is similar to a thesaurus but distinguishes itself in two aspects: WordNet interlinks not only words but also word senses, and it labels the semantic relations among words [10].

While WordNet provides an API in C, there are a number of alternatives in higher level languages, such as the WordNet interface in the Natural Language Toolkit (NLTK) for Python [4]. NLTK also provides a suite of NLP libraries and programs along with interfaces for multiple lexical resources, WordNet being only one of them.
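As a brief illustration of the NLTK WordNet interface (assuming NLTK is installed and the WordNet corpus has been fetched with nltk.download("wordnet")):

    from nltk.corpus import wordnet as wn

    # List a few synsets (sets of cognitive synonyms) for a word.
    for synset in wn.synsets("car")[:3]:
        print(synset.name(), "-", synset.definition())

    # Path similarity between two word senses: 1.0 for identical senses,
    # smaller values the further apart the senses are in the hierarchy.
    dog = wn.synsets("dog")[0]
    cat = wn.synsets("cat")[0]
    print(dog.path_similarity(cat))  # 0.2 for these senses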

2.5 System usability scale

The system usability scale (SUS), created by John Brooke in 1986, is a questionnaire that can be given to participants of a survey after their testing is done [5]. It is used to measure the usability of a product or service. The SUS is a widely used method of evaluation that is quick and easy to use, and one that does not require a big sample size to give a valid result. The SUS is also context free, meaning it does not discriminate between the systems it can be used on.

The testers are presented with ten statements about the system. They are then asked to what extent they agree with each statement, on a scale from Strongly disagree to Strongly agree. To prevent response bias, the statements alternate between positive and negative phrasing. This encourages testers to think about each statement [5].
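The standard scoring procedure from Brooke [5] maps the ten responses to a 0–100 score: each odd-numbered (positively phrased) item contributes its response minus one, each even-numbered (negatively phrased) item contributes five minus its response, and the sum is multiplied by 2.5. A short sketch:

    def sus_score(responses):
        # responses: ten integers from 1 (Strongly disagree) to 5 (Strongly agree)
        assert len(responses) == 10
        total = 0
        for i, r in enumerate(responses, start=1):
            # Odd items are positively phrased, even items negatively phrased.
            total += (r - 1) if i % 2 == 1 else (5 - r)
        return total * 2.5

    print(sus_score([5, 1, 5, 1, 5, 1, 5, 1, 5, 1]))  # best possible answers -> 100.0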


Bangor, Kortum and Miller added a question to the SUS where they asked testers to rate the system with an adjective [3]. Looking at studies with this added question, they found a correlation between particular adjectives and SUS scores, and these adjectives were added to the usability scale. The most negative adjective, Worst imaginable, corresponds to scores around or below 25, while the most positive adjective, Best imaginable, corresponds to scores close to 100.


Chapter 3

Method

3.1 Game

The game used for evaluation consists of an FLI, an NLI, two scenarios and a central controller (see figure 3.1). It is played as a text adventure where the user progresses through an interactive story called a scenario. A scenario is a list of levels, each consisting of a description and a set of choices; each choice in turn consists of a description and a reply. The game ends when the user wins by reaching the goal. The implementation is done in Python, mainly because of the handiness of NLTK (see 3.1.3) but also thanks to the convenient built-in string and list types.
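A minimal sketch of how such a scenario could be represented; the class and field names here are illustrative and not necessarily those of the actual implementation:

    from dataclasses import dataclass

    @dataclass
    class Choice:
        description: str  # matched against (or shown to) the user
        reply: str        # printed when the choice is picked

    @dataclass
    class Level:
        description: str
        choices: list  # list of Choice

    # A scenario is simply a list of levels played in order; the last one is the goal.
    scenario = [
        Level("You stand at a fork in the road.",
              [Choice("go left", "You head down the left path."),
               Choice("go right", "You head down the right path.")]),
    ]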


3.1.1 User interface

User input is handled by an interface used to pick one of the choices on each level. The formal language interface is simple; it presents the user with the different predefined choices and lets them pick one. The natural language interface semantically compares the user's input sentence to the level's choices and selects the most similar choice.

Figure 3.2: Activity flow of an example scenario level.

Interchangeability between the user interfaces is a big concern in development. By making the game properly modular, the amount of work needed is reduced. This also prevents distorted results in testing caused by a user interface having different preconditions. Another effort made to set common preconditions is refraining from manually tagging part-of-speech in the choice descriptions, since such tags would be useless to the formal language interface.
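One way to achieve this interchangeability, sketched here under the assumption of the Level/Choice structure above rather than as the actual design, is to give both interfaces the same entry point so the controller never needs to know which one is in use:

    class FormalLanguageInterface:
        def pick_choice(self, level):
            # Present the predefined choices and read an index from the user.
            for i, choice in enumerate(level.choices, start=1):
                print(i, choice.description)
            return level.choices[int(input("> ")) - 1]

    class NaturalLanguageInterface:
        def __init__(self, similarity):
            self.similarity = similarity  # sentence similarity function (see 3.1.2)

        def pick_choice(self, level):
            # Pick the choice whose description is most similar to the input sentence.
            sentence = input("> ").split()
            return max(level.choices,
                       key=lambda c: self.similarity(sentence, c.description.split()))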


3.1.2 Sentence similarity

An algorithm comparing the semantics of two sentences is a cornerstone of the natural language interface. If the similarity between the user input and the description of a choice can be measured, then picking the choice with the highest similarity gives the best approximation of what the user intended to do.

The sentence similarity algorithm used in the natural language interface calculates the average of the maximal word similarities M between two sentences U and V. For a word u ∈ U, the word similarities W form a list of similarities between u and all the words v ∈ V. M is a list containing the maximal value of W for each word u ∈ U (see figure 3.4). The algorithm relies on a word similarity measure that returns a number, where zero means no similarity and larger values mean that the words share some semantic meaning.

Require: The lists of words U and V.
Ensure: The similarity between U and V as a number ≥ 0.

    function sentence_similarity(U, V)
        M ← list()
        for all u ∈ U do
            W ← list()
            for all v ∈ V do
                w ← word_similarity(u, v)
                W.insert(w)
            end for
            m ← max(W)
            M.insert(m)
        end for
        return average(M)
    end function

Figure 3.4: Pseudocode for the sentence similarity algorithm.
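A runnable Python version of figure 3.4 is sketched below. The report does not specify which word similarity algorithm is used, so WordNet path similarity over the words' synsets serves as a stand-in here:

    from nltk.corpus import wordnet as wn

    def word_similarity(u, v):
        # Stand-in measure: best path similarity over all synset pairs, 0 if none.
        best = 0.0
        for su in wn.synsets(u):
            for sv in wn.synsets(v):
                sim = su.path_similarity(sv)  # may be None across parts of speech
                if sim is not None and sim > best:
                    best = sim
        return best

    def sentence_similarity(U, V):
        # Average over u in U of the maximal similarity between u and any v in V.
        M = [max(word_similarity(u, v) for v in V) for u in U]
        return sum(M) / len(M)

    print(sentence_similarity("open the door".split(), "unlock the gate".split()))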


3.2 Evaluation

To help ensure that the results of the tests are not skewed by factors such as who is conducting the test, a protocol is followed during the evaluation. Some rules are also set up for what the evaluator is allowed to say about the test in general and about the user interfaces. The protocol used during testing can be found in appendix A.

3.2.1 Evaluation form

To gather information about the testers, an evaluation form is prepared. The before-testing part of the form contains questions about their name, age, sex, occupation and previous experience with text based games. In the after-testing part, they are asked which interface they enjoyed using the most and, optionally, whether they have any comments concerning the evaluation.

3.2.2 Using the SUS

The system usability scale is used in this test because it has been a standard for quick and easy testing for almost three decades [8]. It provides a reliable way to test systems context free without requiring a large test group. There are also no other readily available measurement scales for the specific context that this report covers.

Testers fill out a questionnaire immediately after finishing each interface, rather than filling out both questionnaires after having tried both user interfaces. This gives the tester a short break from playing one interface before beginning the test of the other, and it ensures that the interface they are grading is fresh in their mind. This method of using the SUS is the one recommended by J. Brooke [5].

The appendix contains an exact copy of the SUS questionnaire. It is the standard SUS form provided by Digital Equipment Corporation, the original creators of the SUS.


3.2.3 Rotations

Two different scenarios are used so that the second interface tested will not always get a slightly faster time because the testers already have experience with how the stories are presented. This is also why the testers are only allowed to play each interface and each scenario once.

Using two different scenarios and two different user interfaces gives four possible combinations of order and interface–scenario pairings. These four combinations are labeled A, B, C and D, and are rotated through in alphabetical order when conducting the evaluation. This is to prevent systematic errors.
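The schedule can be encoded compactly; this sketch mirrors the rotation table in appendix A, reading each pair as interface–scenario (an assumed reading of that table):

    # Rotation label -> the two interface-scenario pairings, in play order.
    ROTATIONS = {
        "A": [(1, 1), (2, 2)],
        "B": [(1, 2), (2, 1)],
        "C": [(2, 1), (1, 2)],
        "D": [(2, 2), (1, 1)],
    }

    def rotation_for(tester_index):
        # Cycle through the combinations in alphabetical order, as described above.
        return "ABCD"[tester_index % 4]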

3.2.4 Procedure

At the start of the evaluation, testers are asked to fill out the first part of the evaluation form. The combination and order of interface–scenario pairings the tester is to play is chosen from the rotation schedule. A short explanation of how the first interface works is given, and then the first scenario is started for the tester.

After the tester finishes playing the first scenario, they are asked to fill out a SUS questionnaire in which they evaluate their experience with the user interface. This procedure is then repeated with the second interface–scenario combination. When the tester is done filling out the second SUS, they are asked to fill out the last part of the evaluation form.

3.2.5 Other collected data

During testing, the time the player spends on each level is recorded, making it possible to see which user interface is more time consuming. The testers are not told that the time is measured until after both tests are done, so as not to make them feel that there are any time constraints involved.
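A sketch of how the per-level timing could be recorded; the actual implementation is not described beyond this, so the names are illustrative:

    import time

    def play_scenario(interface, scenario):
        # Elapsed time (s) at the end of each level, as reported in tables 4.5-4.8.
        times = []
        start = time.monotonic()
        for level in scenario:
            print(level.description)
            choice = interface.pick_choice(level)
            print(choice.reply)
            times.append(time.monotonic() - start)
        return times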


Chapter 4

Results

Tests were conducted from late March to early April on various acquaintances of the authors. No adjustments to the forms or the protocol had to be made, but after observing a tester with insufficient English comprehension, good English became a requirement for subsequent testers. The test forms can be found in appendix B.

Looking at the test group, the age and occupation of the testers were not very diverse. Eleven out of twelve testers were in the age range of 20 to 27 years and identified their occupation as studying. Of the twelve testers, nine were male and three female. Seven out of twelve said that they had previous experience with text based adventure games.

4.1 System usability scale

The natural language interface has a mean SUS score of 70 (as seen in table 4.2), which is considered between OK and Good based on the results of Bangor et al. [3]. The formal language interface, with a mean SUS score of 80 (as seen in table 4.1), is considered between Good and Excellent.

Table 4.1: Formal language interface SUS results grouped into rotations.

Mean SUS score
Rotation A   Rotation B   Rotation C   Rotation D   Total
    80           89           57           94          80


Table 4.2: Natural language interface SUS results grouped into rotations.

Mean SUS score
Rotation A   Rotation B   Rotation C   Rotation D   Total
    60           75           56           88          70

4.2 Most enjoyable user interface

Enjoyability data (see tables 4.3 and 4.4) were gathered from the test forms, where testers were presented with the question: Which interface was most enjoyable to use?

Table 4.3: Testers who responded that the formal language interface was more enjoyable.

Number of testers
Rotation A   Rotation B   Rotation C   Rotation D   Total
    2            1            1            0           4

Table 4.4: Testers who responded that the natural language interface was more enjoyable.

Number of testers
Rotation A   Rotation B   Rotation C   Rotation D   Total
    1            2            2            3           8

4.3 Time records

Tables 4.5, 4.6, 4.7 and 4.8 present the times recorded during the testing sessions. As seen in figures 4.1 and 4.2, the NLI was on average measured to be slower than the FLI.


Table 4.5: Testing of the formal language interface on scenario 1.

Elapsed time (s) at end of level

Level 1   Level 2   Level 3   Level 4   Level 5   Level 6   Level 7   Level 8
 43.08     64.36    117.89    137.85    148.58    174.98    188.43    205.58
 47.73     63.31     88.90    107.81    122.66    144.27    157.31    172.07
143.36    198.65    296.98    338.10    370.01    418.61    451.56    522.61
 57.74     84.25    115.07    149.49    171.03    200.22    219.46    240.77
 37.86     63.54     94.29    123.35    132.69    150.31    165.74    185.93
 73.57    111.35    163.18    195.85    236.31    295.58    365.14    415.98

Table 4.6: Testing of the natural language interface on scenario 1.

Elapsed time (s) at end of level

Level 1   Level 2   Level 3   Level 4   Level 5   Level 6   Level 7   Level 8
102.78    154.19    202.14    230.36    260.79    305.06    332.84    365.93
101.92    195.54    279.64    326.37    375.16    413.19    434.45    467.57
 65.71    151.52    259.83    286.58    323.77    357.12    394.73    430.03
 85.10    152.90    359.06    406.65    443.81    556.89    587.54    629.38
 60.11    108.42    260.66    302.24    339.19    367.93    403.08    451.03
 65.64    130.36    176.29    232.06    258.38    307.11    327.64    348.52

Table 4.7: Testing of the formal language interface on scenario 2.

Elapsed time (s) at end of level

Level 1   Level 2   Level 3   Level 4   Level 5   Level 6   Level 7   Level 8   Level 9   Level 10
 21.99     62.21     75.91     91.52    125.97    149.94    179.93    208.54    227.86    253.72
 41.76     52.16     53.84     74.81     80.89    104.06    111.47    133.19    144.21    148.65
 34.01     72.72     90.91    113.27    166.08    174.17    183.57    218.52    230.84    236.82
 33.66     53.38     74.03    113.57    127.57    147.62    169.26    194.06    212.89    218.80
 33.92     56.16     71.28     91.03    100.16    118.26    133.38    143.05    153.12    165.25
 37.83     59.12     73.69     91.04    101.42    144.81    190.48    209.90    218.96    260.73


Table 4.8: Testing of the natural language interface on scenario 2.

Elapsed time (s) at end of level

Level 1   Level 2   Level 3   Level 4   Level 5   Level 6   Level 7   Level 8   Level 9   Level 10
 20.05     59.39     81.70    100.11    133.84    153.03    159.96    182.33    191.67     214.81
 38.06     60.08     76.23     97.42    158.29    191.49    198.03    217.65    253.13     276.37
 45.46    116.62    178.03    212.13    231.40    276.17    303.97    347.52    832.58    1169.47
 27.43     63.97     88.67    107.37    117.35    155.64    174.38    187.15    200.72     238.54
 42.64     80.09     93.24    105.98    167.28    184.65    192.04    203.83    236.65     274.18
 57.99     58.83    155.16    220.72    256.21    286.49    320.47    351.33    363.08     395.09


Figure 4.1: Mean elapsed time (s) per level in scenario 1, for the formal language interface and the natural language interface.


Figure 4.2: Mean elapsed time (s) per level in scenario 2, for the formal language interface and the natural language interface.


Chapter 5

Discussion and conclusions

5.1 Usability

The SUS scores of the two user interfaces differed by ten points, which shows that on average the testers found the FLI more usable than the NLI. It is important to note that both interfaces received high enough SUS scores to be considered above OK in terms of usability [3]. On average, the elapsed time at the end of a scenario was higher when using the NLI, which supports the claim that the FLI is more usable than the NLI.

5.2 Enjoyability

Two thirds of the testers reported that they enjoyed using the natural language interface more than the formal one. As the preconditions of the user interfaces are equal, it can be concluded that the NLI is more enjoyable than its FLI counterpart. It is noticeable in the results that while the FLI got a higher SUS score, more people answered that they enjoyed the NLI more. This means that a system that is more usable is not always more enjoyable; there is likely some factor other than usability that regulates how enjoyable a system is. This is important to note, since games are made for enjoyability: a less usable solution can be the better one for enjoyability.


5.3 Self criticism

There is an overrepresentation of students within a narrow age range among the participants of the test. This may cause a systematic bias in some unknown direction, which in turn will produce skewed results. Such an outcome does not necessarily invalidate the results if the groups of interest match.

The results are not necessarily representative of what a larger scale study could have gathered, especially not the enjoyability data, as it was collected in a binary fashion.

5.4 Improvements

Instead of having the binary choice of which interface the testers found most enjoyable, it would have been favourable to have a scale similar to that of the SUS for each interface, with a statement like: I find this interface enjoyable to use.

5.5 Closure

It was concluded that even though the NLI is less usable than the FLI, it can be more enjoyable. In terms of usability, the NLI is not an improvement over the FLI approach, while it does seem to be an improvement in terms of enjoyability. As the primary focus of a computer game is enjoyability, using NLI technology is well motivated.


Bibliography

[1] Douglas Adams and Steve Meretzky. The Hitchhiker's Guide to the Galaxy. Video game, 1984.

[2] Ion Androutsopoulos, Graeme D. Ritchie, and Peter Thanisch. Natural language interfaces to databases: an introduction. Natural Language Engineering, 1(1):29–81, 1995.

[3] Aaron Bangor, Philip Kortum, and James Miller. Determining what individual SUS scores mean: adding an adjective rating scale. Journal of Usability Studies, 4(3):114–123, 2009.

[4] Steven Bird, Ewan Klein, and Edward Loper. Natural Language Processing with Python. O'Reilly Media, Inc., 2009.

[5] John Brooke. SUS: a quick and dirty usability scale. Chapter 21 in [8], 1st edition, 1996.

[6] Christiane Fellbaum. WordNet: An Electronic Lexical Database. MIT Press, Cambridge, MA, 1998.

[7] North Side Inc. Bot Colony. Video game, cancelled (2015).

[8] Patrick W. Jordan, Bruce Thomas, Ian Lyall McClelland, and Bernard Weerdmeester. Usability Evaluation in Industry. ISBN 0 7484 0314 0. Taylor & Francis Ltd., 1st edition, 1996.

[9] Michael Mateas and Andrew Stern. Façade. Video game, 2005.

[10] Princeton University. About WordNet. http://wordnet.princeton.edu, March 2015. Accessed: 1 April 2015.

[11] Joseph Weizenbaum. ELIZA: a computer program for the study of natural language communication between man and machine. Communications of the ACM, 9(1):36–45, 1966.


Appendix A

Test resources


User interface evaluation protocol

1. Supervisor explains the rules of the game
2. Supervisor prepares a test form
3. Tester fills out the beginning of the test form
4. Supervisor explains the interface
5. Tester plays the scenario with the interface
6. Supervisor explains the SUS
7. Tester fills out the SUS
8. Repeat from 4 as needed
9. Tester fills out the end of the test form

Game rules

• Linear level based scenario
• No game over except win

Interface description

• Predefined choices (1)

• Natural language commands are recommended (2)

SUS explanation

• Rate the interface

• Record immediate response
• If uncertain: mark center


User interface evaluation form

Filled in before testing:

Name: . . . .
Age: . . . .
Sex: male / female
Occupation: . . . .
Previous experience with text adventure games: yes / no

Filled in after testing:

Which interface was most enjoyable to use? first / second
Comments (optional): . . . .

Filled in by supervisor:

Name: . . . .

Rotation   First   Second
A          1–1     2–2
B          1–2     2–1
C          2–1     1–2
D          2–2     1–1


System Usability Scale

© Digital Equipment Corporation, 1986.

Each statement is rated on a scale from Strongly disagree to Strongly agree.

1. I think that I would like to use this system frequently
2. I found the system unnecessarily complex
3. I thought the system was easy to use
4. I think that I would need the support of a technical person to be able to use this system
5. I found the various functions in this system were well integrated
6. I thought there was too much inconsistency in this system
7. I would imagine that most people would learn to use this system very quickly
8. I found the system very cumbersome to use
9. I felt very confident using the system
10. I needed to learn a lot of things before I could get going with this system


Appendix B

Tests

