
Linköping University | Department of Computer Science
Master Thesis, 30 hp | Cognitive Science
Spring term 2021 | LIU-IDA/KOGVET-A--21/008--SE

Quantifying User Experiences of Physical Products

A Case Study of Combining NASA-TLX and Product Reaction Cards for Actionable Insights

Emma Jaeger Tronde

Supervisor: Mattias Arvola
Examiner: Arne Jönsson


Copyright

The publishers will keep this document online on the Internet – or its possible replacement – for a period of 25 years starting from the date of publication barring exceptional circumstances. The online availability of the document implies permanent permission for anyone to read, to download, or to print out single copies for his/her own use and to use it unchanged for non-commercial research and educational purpose. Subsequent transfers of copyright cannot revoke this permission. All other uses of the document are conditional upon the consent of the copyright owner. The publisher has taken technical and administrative measures to assure authenticity, security, and accessibility.

According to intellectual property law the author has the right to be mentioned when his/her work is accessed as described above and to be protected against infringement.

For additional information about the Linköping University Electronic Press and its procedures for publication and for assurance of document integrity, please refer to its www home page: https://ep.liu.se/.


Abstract

This case study investigated how to evaluate users' experiences with physical products in a reliable way with a small sample size, and how to provide actionable insights for future decisions about design and practice. Through an improving case study performed in collaboration with ASSA ABLOY, a global leader in access solutions, the study specifically focused on interactions with locks. Three major activities were performed to investigate how to conduct usability tests with a small sample size and to explore possible measures that can help increase reliability. Altogether, 13 participants took part in this case study through two sets of tests and one final workshop. The purpose of these activities was to investigate, suggest and evaluate how best to capture users' experiences with locks using the Single-Ease Question (SEQ), the NASA Task Load Index (NASA-TLX) and Product Reaction Cards (PRC). The results showed that several measures can be applied to increase the reliability of a test design with a small sample size, for example mixing methods and counterbalancing both metrics and tasks, which deepens the understanding of the experiences and decreases the risk of bias. The results also showed that qualitative and quantitative methods provide different insights into users' experiences, detailed and general knowledge respectively, that a combination of the two provides deeper insights than either method alone, and that they help validate each other's findings.


Acknowledgement

I want to thank my supervisor at Linköping University, Mattias Arvola, for his support and advice. I would also like to thank ASSA ABLOY and Fredrik Einberg for this opportunity; I have learned a lot. Finally, I want to thank Iva Radosevic for her commitment, patience and encouragement, which have been more than helpful for my growth during this time.

Linköping, 2021 Emma Jaeger Tronde


Table of Contents

1 Introduction ... 1
1.1 Purpose ... 2
1.2 Research Questions ... 2
1.3 Delimitations ... 2
2 Theoretical Background ... 3

2.1 Measuring User Experience (UX) ... 3

2.2 Usability Testing ... 3

2.3 UX Evaluation Methods ... 4

2.4 Combining Methods ... 7

2.5 Optimum Level of Sample Size ... 8

2.6 How to Provide Actionable Insights for Redesign ... 9

3 The Case Study ... 11

3.1 Project Outline ... 11

4 Baseline ... 13

4.1 Purpose ... 13

4.2 The In-House Method ... 13

4.3 Method ... 13

4.4 Results ... 14

4.5 Analysis ... 16

4.6 Theoretical Evaluation of In-House Method ... 16

5 Improvements ... 20

5.1 Purpose ... 20

5.2 New Evaluation Method ... 20

5.3 Results ... 23

5.4 Analysis ... 31

6 Evaluation of Improvements ... 36

6.1 Purpose ... 36

6.2 Method ... 36

6.3 Results and Analysis ... 37

7 Discussion ... 42


7.2 Methods ... 46

7.3 Final Suggestions and Future Research ... 47

8 Conclusion ... 48

References ... 49

Appendices ... 52

Appendix A: SUS and NASA-TLX from In-House Method ... 52

Appendix B: Experience Measure from In-House Method ... 53

Appendix C: SEQ and NASA-TLX Questionnaires ... 54

Appendix D: Full List of Product Reaction Cards ... 55

Appendix E: Material from PRC Categorization Workshop ... 56

Appendix F: Complete List of PRC Categorization ... 57

Appendix G: Infographic ... 58


1 Introduction

User experience (UX) refers to how the user of a product perceives the interaction, whether it is good or bad. Information that reveals what the interaction is like can be used to improve it, for example by evaluating the product with respect to the goals that the user has for the interaction, which is usually the primary purpose of usability testing. Many different aspects affect users' experiences, for example what they have experienced before and what they hope the experience will be like (ISO 9241-210, 2010). Various methods have been developed to capture users' experiences with products, and there are different ways to combine and apply them. A common way to collect information about users' perceived experiences is to ask them how they feel about the experience (Laurans et al., 2012; Scherer, 2005). This way of capturing users' perceptions is also said to be the most straightforward approach (Laurans et al., 2012), and some even argue that it is the only way of getting access to the feelings users have (Scherer, 2005). There are numerous ways to ask participants to report on their experiences, and they vary depending on the nature of the research, whether it is qualitative or quantitative (Creswell & Clark, 2018). Some common approaches are to use subjective self-reported metrics or objective measures of performance (Tullis & Albert, 2013), for example the System Usability Scale (SUS) (Brooke, 1996) or methods measuring performance in terms of task time and completion rate (Sauro & Lewis, 2009). Among the choices are many common standardized methods that measure users' perceived experiences, and they differ in which aspects of the user's experience they aim to investigate. For example, SUS is often used to evaluate the usability of a product (Brooke, 1996), whereas other methods, like the Repertory Grid Technique, investigate users' experiences of products through comparison (Fallman & Waterworth, 2010), and Product Reaction Cards (Benedek & Miner, 2002) evaluate users' experiences by letting them choose words that describe them.

When methods like those previously mentioned are applied to evaluate users' experiences, they face many different challenges and obstacles, for example the challenge of validation (Hornbæk, 2006). One measure to overcome this challenge is to combine methods, and it is particularly emphasised to combine different kinds of methods, for example by mixing qualitative and quantitative ones, because this deepens the understanding of the studied phenomenon (Creswell & Clark, 2018). Another way of combining methods is to triangulate them to increase the validity of the test design (Runeson & Höst, 2009). Evidently, there are many different versions of these techniques that can be applied when conducting usability tests, which means that there are many decisions to make regarding how to best evaluate users' experiences. All of these decisions can affect how well the testing provides sufficient insights for decisions about practice and design.

Another factor that many argue affects testing, and hence a study's validity, is the sample size, that is, how many participants take part in the study. This issue may well be a watershed between different research fields: some argue that a bigger sample size makes the data more robust (Tullis & Albert, 2013), while others argue that even small sample sizes are sufficient (Nielsen & Landauer, 1993). Meanwhile, a third argument is that the focus should rather be on test design (Lindgaard & Chattratichart, 2007).


Which measures can be used, and which test design should be applied, to ensure reliable usability testing with a small sample size is the problem area that this master thesis stems from. Below follows the purpose of the case that this thesis investigates.

1.1 Purpose

The purpose of this master thesis is twofold. First, it aims to investigate how to evaluate users' experiences with physical products, in this case locks, with a small sample size. Second, it aims to propose an approach for capturing users' experiences using methods that can provide actionable insights.

1.2 Research Questions

1. How can we measure users’ experiences with physical products with a small sample size in a reliable way?

2. How can we translate the outcome of usability tests into actionable insights?

1.3 Delimitations

This study is delimited in that it is a case study of how to evaluate the user experience of physical products, specifically locks; the results may therefore have limited generalizability to products other than physical ones. A common delimitation of case studies in general is that results are rarely generalizable to other contexts, and this applies here as well because the environment of the studied phenomenon is very specific.


2 Theoretical Background

This section provides a thorough theoretical background to what user experience (UX) is and how to measure users' experiences through usability testing. It also provides some examples of methods, of how they can be combined, and of how the discussion about sample size affects testing. Finally, the section ends with some ideas on how collected data can be analysed and used for future decisions about design and practice.

2.1 Measuring User Experience (UX)

User experience (UX) can be explained in many different terms depending on the area of interest. Arvola (2020) explained UX as an area of interest that stretches beyond the user interface and human-computer interaction (HCI). According to ISO 9241-210 (2010), user experience is defined as a person's perceptions and responses resulting from the actual or anticipated use of a product, service, or system. It also includes the user's emotions, beliefs, preferences, perceptions, physical and physiological responses, behaviours, and accomplishments that occur before, during and after use.

There are different reasons for measuring a user's experience, but it is overall a central way of determining whether users achieve their goals while using a product, service, or system. The essence of measuring users' experiences is also to create descriptions of possible problems and to give design recommendations (Sauro & Lewis, 2016). Measuring users' experiences is a complex activity for many reasons, for example because emotions are hard to understand and pin down. As previously mentioned, ISO 9241-210 (2010) notes that a user's current emotional experience is affected by what the user has experienced in the past and what the user expects the experience to be. This description of an experience as a process that evolves over time aligns with Scherer's (2005) definition of an emotion as a process affected by the individual's subjective feeling of an experience. Scherer also pointed out that emotions are complex and that there is not yet a single metric that can measure all aspects of an emotion at once; however, several methods have been developed that can measure one aspect individually, for example nonverbal and behavioural methods that measure observable aspects of an emotion. Yet, Scherer pointed out that there is no objective way of measuring the feeling itself. One opinion that Scherer has in common with other professionals is that you must therefore ask the individual to report on the nature of the experience to get information about the user's emotions (Laurans et al., 2012; Scherer, 2005).

2.2 Usability Testing

The concept of usability is a widely used and researched area within HCI (Hassenzahl, 2004; Hornbæk, 2006) and has been explained in many different terms, for example in terms of efficiency, effectiveness and satisfaction (Barnum & Palmer, 2010; Hassenzahl, 2004; ISO 9241-210, 2010). According to ISO 9241-210 (2010), usability is related to specific goals within efficiency, effectiveness, and satisfaction, and these goals are defined based on the specific user of a system, product, or service. Usability is in many respects related to user experience, for example when it is interpreted from the perspective of the user's personal goals. These goals can in turn provide usability criteria to enhance or improve the user's experience. For example, efficiency is a quality specifically concerned with the physical effort made to achieve accuracy and completeness in a task, whereas effectiveness is concerned with the level of accuracy and completeness of that achievement. The third quality, satisfaction, is focused on freedom from discomfort and positive attitudes towards the use of a product, system, or service (ISO 9241-210, 2010). Aside from efficiency, effectiveness and satisfaction, there are also other quality aspects of usability, for example those that are not specifically task-related or tangible, so-called hedonic qualities. A study conducted by Hassenzahl (2000) argued that qualities like "ease of use", "appeal" and "motivation", whether experienced or merely expected, are important for the perceived usability and hence for the experience.

According to Hornbæk (2006), there are many challenges in measuring usability. To name a couple, one is the challenge of distinguishing and empirically comparing between the subjective and objective measures that are available, and another is the challenge of validating and standardizing subjective metrics. It is, for example, noted that satisfaction is a particularly difficult quality to measure because studies often produce data that are difficult to generalize, as they rarely use standardized questionnaires, which can lead to side effects such as problems of validity and reliability. Another cause of validity and reliability problems is that satisfaction as an area of investigation is rarely defined and is instead studied as one large phenomenon. Hornbæk also highlights the issue of different qualities being difficult to distinguish from each other. One example is measuring perceived human effort, a question that can be seen as relevant for both satisfaction and efficiency. It is therefore stressed that methods should be chosen depending on what type of experience is under investigation, but the classification of different measures is reportedly a difficult task. Another challenge within usability testing is the varying opinions about the nature of the available methods. As previously mentioned, one of the challenges in measuring usability is to distinguish between and empirically compare subjective and objective measures, and there are many opinions on what counts as a subjective or objective measure. A study review conducted by Hornbæk led to the distinction that subjective usability measures are those that measure the user's perceptions of and attitudes towards the interaction, in comparison to objective measures that do not include the user's perception and that can collect information in a validated way that is not possible for the subjective ones. There are reasons to believe that a combination of the two is beneficial, for example because they may lead to different conclusions that can be useful not only for improving users' objective performance but also for generating design advice on how to improve their experience of the interaction. Examples of subjective and objective measures are given in the next section.

2.3 UX Evaluation Methods

This subsection gives some examples of methods, both subjective and objective, within the field of UX.

2.3.1 Qualitative Versus Quantitative Approaches

Qualitative and quantitative measures can be defined and used in many ways. A common view of the distinction between qualitative and quantitative data is that the former provides a more detailed understanding of a problem while the latter provides a more general understanding (Creswell & Clark, 2018). According to Mårdberg and Carlstedt (2019), the quantitative approach seeks concrete facts and explanations, often expressed in numbers that enable comparisons. However, it does not necessarily stand in opposition to the qualitative approach. The authors stressed that the qualitative approach, although it rather uses verbal explanations, can be transformed into quantitative measures. There are many methods available that do not clearly belong to either the qualitative or the quantitative ideology. The next two paragraphs exemplify two such methods.

The Repertory Grid Technique is a method used to capture the user's experience and it originates from personal construct theory. According to this theory, we make sense of our surroundings based on, and in comparison to, what we have seen or experienced before (Fallman & Waterworth, 2010). Countless efforts have been made to create methods and tools for capturing users' experiences that provide external validation and reliability, and the Repertory Grid Technique is one of these attempts. By using both quantitative and qualitative measures, the tool is said to integrate the user's emotional and rational/logical elements of experience, hence reflecting a holistic perspective on experience. The tool has been called both a "hybrid approach" (Fallman & Waterworth, 2010; Tomico et al., 2009) and a "mixed method approach" (Stergiadis & Arvola, 2018), since it does not clearly belong to either the qualitative or the quantitative field.

Product Reaction Cards (PRC) is another example of a method that mixes a quantitative and a qualitative approach. It was first introduced by Benedek and Miner (2002) and created by Microsoft for the purpose of testing their own products. It is a method applied to evaluate users' experiences with products and is said to help users mediate their experiences through a collection of 118 words. The collection was at first much smaller, but the final collection includes 60 percent positive words and 40 percent neutral or negative words. The collection was created through a series of tests with experts in the UX field and is based on critique of the commonly used Likert scales. The critique stems from the idea that scales such as Likert scales are created by the researcher and hence limit the participants' ability to express themselves in their own words. Therefore, PRC was developed as an attempt to create a method where users choose the words themselves, as opposed to standardized questionnaires (Benedek & Miner, 2002). As mentioned earlier, according to ISO 9241-210 (2010) there are three main areas of usability: effectiveness, efficiency, and satisfaction. Barnum and Palmer (2010) argued that satisfaction is the area most influential on the overall perceived usability. They also stressed that experience methods most commonly evaluate the former two, effectiveness and efficiency, rather than satisfaction, and that PRC is a robust method for that purpose: by comparing the number of negative words versus positive ones, users' sense of satisfaction towards a product can be calculated (Barnum & Palmer, 2010). The method has many advantages: it is quick to administer, it is a quantitative approach that provides qualitative opportunities for analysis, and it is said to help users speak the truth and point out problems and negative aspects. Thereby, the method is said to decrease the risk of acquiescence bias, which is when participants provide pleasing responses rather than honest ones (Barnum et al., 2010).
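As a concrete illustration of the counting step described above, the sketch below tallies positive versus negative card choices to approximate a sense of satisfaction. It is a minimal, hypothetical example and not part of the original method: the word lists and selections are placeholders, and a real study would use the full 118-card set with its 60/40 split of positive and neutral/negative words.

```python
# Minimal sketch: summarising Product Reaction Card (PRC) selections.
# The word lists and participant selections are hypothetical examples;
# a real study would use the full 118-card set (60% positive, 40% neutral/negative).

POSITIVE = {"intuitive", "reliable", "fast", "satisfying", "convenient"}
NEGATIVE = {"confusing", "slow", "frustrating", "complex", "fragile"}

# One set of chosen cards per participant.
selections = [
    {"intuitive", "reliable", "slow"},
    {"fast", "convenient", "confusing"},
    {"intuitive", "satisfying"},
]

def prc_summary(selections):
    """Count positive and negative card choices across all participants."""
    pos = sum(len(s & POSITIVE) for s in selections)
    neg = sum(len(s & NEGATIVE) for s in selections)
    total = pos + neg
    share_positive = pos / total if total else None
    return {"positive": pos, "negative": neg, "share_positive": share_positive}

print(prc_summary(selections))  # {'positive': 6, 'negative': 2, 'share_positive': 0.75}
```

In a real analysis, the counts would be reported as frequencies alongside the participants' own explanations of why each word was chosen.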

2.3.2 Performance and Self-Report Metrics

There are several quantitative approaches available for usability testing. Tullis and Albert (2013) divide these into five categories: performance metrics, issue-based metrics, self-reported metrics, behavioural metrics, and physiological metrics. Some performance and self-reported metrics are explained next.

A study conducted by Sauro and Lewis (2009) showed that quantitative measures like task time, completion rate, and error metrics were the most used for usability testing. According to Tullis and Albert's (2013) classification, these are usually referred to as performance metrics. Self-reported metrics, on the other hand, are commonly used to evaluate the most recent perception of an experience with a product. The same study by Sauro and Lewis (2009) also showed that post-task or post-test satisfaction metrics are among the most used methods for usability testing. These and other self-reported metrics do not evaluate the user's behaviour, but rather what users share about their experiences. Aside from providing users' perceptions of their experiences with a product, these metrics can also, at an emotional level, tell us something about what the users feel about the product and the experience. Even though self-reported metrics are mostly used to evaluate the most recent perception of an experience with a product, and hence applied after a task has been performed, they can also be used to evaluate the user's anticipated experience and impression of a product. In such cases, the metrics are applied prior to any task or session and can also be used to find out users' expectations of a product (Tullis & Albert, 2013).

Self-reported metrics can be divided into two main categories: they provide either subjective data or preference data. Even though subjective data are reported subjectively by the user, they can still be treated as objective data from the perspective of the researcher. The ways in which participants are asked about their subjective experiences can take many different forms, with varying use of scales and attributes, which shapes how users self-report on their experience. One of the most common ways to self-report is by rating on a scale, and the most classical scales are the Likert scale and the semantic differential scale (Tullis & Albert, 2013). One of the most widely used self-reported metrics using such a scale is the System Usability Scale (SUS) (Brooke, 1996). The questionnaire consists of ten statements about perceived usability, each rated on a five-point Likert scale ranging from "Strongly disagree" to "Strongly agree". Even though SUS is a reliable method for measuring users' perceived usability of a product, other metrics have proven able to replace SUS with little or no difference. A study by Sauro and Dumas (2009) investigated whether "one-question" questionnaires could replace SUS. The results showed that the Single-Ease Question (SEQ) is at least as good as, or even better than, multiple questions when gathering post-task subjective satisfaction towards a product, system, or service.
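For reference, the standard SUS scoring procedure described by Brooke (1996) can be expressed in a few lines. The sketch below is a generic illustration with a hypothetical response set, not code used in this study.

```python
def sus_score(responses):
    """Compute a SUS score (0-100) from ten 1-5 Likert responses (Brooke, 1996).

    Odd-numbered items are positively worded and contribute (response - 1);
    even-numbered items are negatively worded and contribute (5 - response).
    """
    if len(responses) != 10 or not all(1 <= r <= 5 for r in responses):
        raise ValueError("SUS expects ten responses on a 1-5 scale")
    contributions = [
        (r - 1) if i % 2 == 0 else (5 - r)  # index 0 corresponds to item 1 (odd item)
        for i, r in enumerate(responses)
    ]
    return sum(contributions) * 2.5

# Hypothetical participant with fairly positive ratings.
print(sus_score([5, 2, 4, 1, 5, 2, 4, 1, 5, 2]))  # 87.5
```

The SEQ, by contrast, needs no scoring formula at all: the single 1-7 rating is reported directly, typically as a mean or a frequency distribution.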

Another reportedly reliable and validated method is the NASA Task Load Index (NASA-TLX), which measures users' perceived workload for a specific task (Hart, 1986). It is known to be a sensitive and reliable estimate for that purpose, and it differs from the classically used scales: the metric consists of a questionnaire using a 20-point scale, allowing participants a wider range of choices than, for example, SUS. But just like SUS, the questionnaire uses only "anchor" attributes, ranging from "Very low" to "Very high", except for one attribute with the anchors "Perfect" and "Failure" (Hart & Staveland, 1988). More than 20 years after it was created, the method is used as a benchmark and has still proven reliable for evaluating users' perceptions of the emotional status and workload of a task (Hart, 2006).

One opinion many share when dealing with these types of metrics is that it is beneficial to combine several of them (Lewis, 2018). A combination can produce more meaningful and substantive results (Barnum & Palmer, 2010; Creswell & Clark, 2018), make it possible to examine data from multiple perspectives (Barnum, 2020) and help validate findings (Tullis & Albert, 2013). Ways to combine various measures and approaches are described in the next section.

2.4 Combining Methods

There are many pronounced benefits of combining or adding different activities commonly performed in research studies, for example combining methods from the qualitative and quantitative research fields or combining activities to increase validity. The following two sections dig deeper into the advantages of applying a mixed method approach or applying the principle of triangulation.

2.4.1 Mixed Method Approach

The definition of mixed methods has varied over the years. For some time, researchers did not agree on how this field of research should be characterized, or on what the focus of mixed methods is; whether it is a methodology or simply a method has divided researchers. However, one definition of mixed methods was created in connection with the composition of a journal about mixed methods research (Creswell & Clark, 2018). According to Creswell and Clark, in mixed methods the researcher performs some core activities: collecting and analysing both qualitative and quantitative data and mixing or combining these forms of data in the results. The reasons for applying a mixed method approach are nearly unlimited. It is not a question of valuing one of the approaches, qualitative or quantitative, higher than the other, because every research project has its own needs. In general, research suited for a mixed method approach is typically research in which a single source of information is insufficient.

The two types of data, qualitative and quantitative, each provide a different view and perspective on the studied problem or phenomenon: the qualitative perspective studies the individual, while the quantitative studies a bigger group of individuals, the general. Both perspectives bring their own strengths and weaknesses to the research, hence the weaknesses of one perspective can be diminished by the strengths of the other (Creswell & Clark, 2018). By including several perspectives, the various data can either help strengthen or deepen the understanding, or they might contradict each other (Creswell & Clark, 2018; Sauro, 2016). Regardless, without both perspectives this would not have been made visible, and such information can be valuable to the researcher no matter the outcome.

There are three core mixed methods designs. First, by combining qualitative and quantitative methods, results are converged; this is referred to as the "convergent design". Second, by adding qualitative data to quantitative results, the results are explained; this order is referred to as "the explanatory sequential design". Third, by adding quantitative data to qualitative results, the qualitative results can be explored and/or generalized; this manner is referred to as "the exploratory sequential design". The convergent design will be the focus of this thesis. When applying this type of mixed method design, the emphasis can be equal or placed on one of the two, and the methods are implemented at the same time, unlike the other two core designs, which are sequential. There are many reasons for applying a convergent design, but the most common purpose is that it can provide a more complete understanding of the studied problem. However, it can also be used to see whether the results of the methods correlate with or contradict each other and how they might converge or diverge (Creswell & Clark, 2018).

The intent behind the approach is to bring the strengths and weaknesses of qualitative and quantitative approaches together, to compare statistical results to qualitative findings, and to illustrate qualitative findings quantitatively, among many other reasons, in order to best understand the research problem (Creswell & Clark, 2018). The convergent design is much like a triangulation design from the qualitative research field, but because researchers in the mixed methods community feared being confused with another field, and because the convergent design does not require three different approaches but simply the combination of the two databases, qualitative and quantitative, they named the approach the convergent design (Creswell & Clark, 2018). The following section elaborates on what triangulation is and what it is used for.

2.4.2 Triangulation

It is unlikely that one single method for evaluating users' experiences is sufficient to cover all aspects of usability, and aside from mixed methods there is also a principle called triangulation. Triangulation is the activity of combining several methods with the purpose of creating more meaningful and substantive results and of enhancing our understanding of users' experiences (Runeson & Höst, 2009). Aside from combining methods, there are also other ways of triangulating, for example data triangulation, where researchers use several data sources or collect data on different occasions, and observer triangulation, which means using more than one observer to study a phenomenon (Stake, 1995).

As opposed to mixed methods, which stresses the advantages of combining qualitative and quantitative methods, triangulation is mostly applied within qualitative and empirical research. The reason is that qualitative results are less precise than quantitative ones, and triangulation therefore provides more precise data when the research relies primarily on qualitative data (Runeson & Höst, 2009).

2.5 Optimum Level of Sample Size

The question of the "right" sample size has long been discussed (Lindgaard & Chattratichart, 2007). Within the quantitative research field, it is common knowledge that a higher number of participants increases the confidence in the data (Tullis & Albert, 2013). This is not a misconception, but others argue that even small samples, for example 8-10 participants, can reveal the most critical pains in an experience and be valuable for user evaluations (Nielsen & Landauer, 1993; Tullis & Stetson, 2004). There are various tips on how to improve test designs when constrained to a small sample size, for example to use simple rating scales like Likert scales ranging from "very easy" to "very difficult" (Tullis & Albert, 2013), and SUS is one example that has been proven effective for small sample sizes (Tullis & Stetson, 2004). A study conducted by Tullis and Stetson investigated how different sample sizes affected the results of standardized questionnaires that assess usability. The study showed that more than 12 participants were not necessary for a study using SUS, because 12 and 14 participants yielded the same results. Out of all the questionnaires in that study, SUS was the one that achieved the highest accuracy with the lowest number of participants: 8 participants were sufficient to reach about 75% accuracy. Another study, conducted by Nielsen and Landauer (1993), also showed that an astoundingly small number of test participants can reveal most of the usability problems in a design. The authors created a cost-benefit model for estimating the number of usability problems found, based partly on the number of participants. It basically says that testing should continue as long as the value of finding new usability problems is higher than the cost of adding new tests. The study showed that testing with a single participant already reveals almost a third of all usability problems, while a second, third or further participant teaches you less and less, as you witness several participants experience the same problems again and again. Hence, the authors emphasized the benefits of performing several small iterations of tests rather than one test with many participants.
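The diminishing returns described above follow from the problem-discovery formula in Nielsen and Landauer (1993), found(n) = N(1 - (1 - lambda)^n), where N is the total number of usability problems and lambda is the average probability that one participant reveals a given problem (about 0.31 in their data). The short sketch below illustrates the curve; the total of 30 problems is an illustrative value, not a figure from this thesis.

```python
# Problem-discovery curve from Nielsen & Landauer (1993):
#   found(n) = N * (1 - (1 - lam) ** n)
# N is the total number of usability problems and lam is the average probability
# that one participant reveals a given problem (~0.31 in their data).
# total_problems below is an illustrative assumption.

def problems_found(n_participants, total_problems=30, lam=0.31):
    """Expected number of problems revealed by n participants."""
    return total_problems * (1 - (1 - lam) ** n_participants)

for n in (1, 3, 5, 8, 12):
    share = problems_found(n) / 30
    print(f"{n:2d} participants -> {share:5.1%} of problems")
# One participant already reveals roughly a third; each added participant reveals less.
```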

When working with small sample sizes, there are several aspects to be wary of. One is not to overgeneralize the results, which easily happens when percentages are used instead of frequencies. It is easy to overgeneralize results from self-reported metrics, and the results should hence be treated cautiously (Tullis & Albert, 2013). Analysing the results using frequencies, especially with a small sample size, is said to reflect reality to a greater extent than percentages; reporting "3 of 5 participants" is, for example, more transparent than reporting "60%" (Rubin & Chisnell, 2008). So, what seems to be a discussion about an optimum number of participants might rather be a discussion that should focus on test design (Lindgaard & Chattratichart, 2007).

2.6 How to Provide Actionable Insights for Redesign

After data collection follow the steps of analysing the data and using it for decisions about research practice and design going forward. The following sections provide some insights into how these practices might look, exemplify how data analysis can point towards redesigns, and show how to turn data into valuable insights.

2.6.1 Grasping Patterns

There are many different techniques for analysing data to gain insightful knowledge about users' experiences. One way of analysing collected data is to apply content analysis (CA), a widely used qualitative research technique. It is mostly applicable to studies that collect verbal data. The technique offers three different approaches: conventional, directed, and summative. These approaches differ in how the data is analysed, for example in the origin of the codes, and different approaches are applied depending on the nature of the research. In the conventional approach the codes are derived from the text, a theory guides the directed approach, and the summative approach depends on counting and comparing based on keywords or content (Hsieh & Shannon, 2005). Summative content analysis has a quantitative and a qualitative technique. The quantitative technique means that keywords or contexts are counted (Hsieh & Shannon, 2005); this approach is also said to manifest the content by making visible the material at the surface level, or literally present in the text (Kondracki et al., 2002). Meanwhile, the qualitative technique focuses on discovering underlying and latent contexts of the counts and comparisons, for example why one word appears more than another or how that word is interpreted (Hsieh & Shannon, 2005).
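As a small illustration of the quantitative side of a summative content analysis, the sketch below counts occurrences of a set of keywords in interview transcripts. The keywords and text snippets are hypothetical placeholders; the qualitative step of interpreting why a word occurs would follow the counting.

```python
import re
from collections import Counter

# Minimal sketch of the counting step in a summative content analysis.
# Keywords and transcript snippets are hypothetical placeholders.
keywords = {"easy", "secure", "slow", "control"}

transcripts = [
    "It felt easy and I was in control the whole time.",
    "A bit slow to unlock, but it still felt secure.",
    "Easy, yes, although turning the key was slow.",
]

counts = Counter()
for text in transcripts:
    for word in re.findall(r"[a-z]+", text.lower()):
        if word in keywords:
            counts[word] += 1

print(counts.most_common())
# [('easy', 2), ('slow', 2), ('control', 1), ('secure', 1)]
```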

Another common analysis method for verbal data is thematic analysis (TA), an arguably flexible method for identifying themes and patterns within qualitative research. Even though the method was developed for use within psychology research, it is also applied far beyond that scope. The benefit of TA is that it is a flexible research tool that can provide a rich and detailed interpretation of the data. Just like CA, as previously described, TA also provides a tool for analysing the latent meaning of the data. What is regarded as a theme, and how large or important a theme is, is decided by the researcher's judgement; what frames this judgement is whether the themes capture aspects that are important for the research questions in focus. A further advantage of TA is that it is not a difficult method to apply, and it provides an insightful analysis that aims to answer the research questions (Braun & Clarke, 2006).

Affinity diagramming is another method used to arrange collected data in order to reveal common themes, or issues, among the participants. The analysis follows three specific tasks to complete the diagram: first, notes and extracts are placed on sticky notes in random order; then the notes are arranged hierarchically and grouped into specific issues; and finally the groups are labelled. It might seem like a straightforward method, but it can take days to conduct depending on how many researchers are included, since this is a group activity. The identified issues can then be used to guide future research and design activities.


3 The Case Study

This master thesis is performed within the context of a specific case and follows guidelines for case studies. Case studies can have different purposes, for example exploratory, interpretative, or improving. What they often have in common is that they do not intend to generalize results from the studied phenomenon to the bigger population (Runeson & Höst, 2009). This master thesis is a case study investigating how ASSA ABLOY, hereafter referred to as the "security company", can perform usability testing on their products, in order to evaluate their approach and to suggest improvements. Hence, this case study is classified as having an improving approach.

3.1 Project Outline

This project has been performed in three major phases:

1. Baseline
2. Improvements
3. Evaluation of Improvements

The phases are presented iteratively in this thesis, with method, results, and analysis sections for each phase. The outline is framed like this because each phase builds on the one prior to it; for example, the second phase, "Improvements", was based on the prior phase, "Baseline". The outline is also framed like this because this is a specific case study influenced by Runeson and Höst's (2009) definition of a holistic case study (see Figure 1).

Figure 1. Holistic case study (Runeson & Höst, 2009).

Next follows a description of each frame of a holistic case study from the perspective of this master thesis. The context of this master thesis is how to evaluate users' experiences using a small sample size in a reliable way; this context was introduced in the previous section, the theoretical background. Meanwhile, the case is how the security company can evaluate users' experiences in their concept lab. In the middle lies the unit of analysis, in which the first two phases, "Baseline" and "Improvements", were performed. The third phase of this case study, the "Evaluation of Improvements", was used to analyse the case study, that is, the suggested improvements. This means that the evaluation aimed to analyse the results from the Improvements phase and to suggest how the security company can measure users' experiences to gain sufficient insights to make decisions about design and practice while using a small sample size.


This case study followed guidelines on ethical research practice and general rules pronounced by the Swedish Research Council (Vetenskapsrådet, 2017). Prior to all tests, the participants were briefed about the purpose of the research and the task at hand, whether before testing in Baseline or Improvements or before participating in the evaluative workshop. All participants were also asked to consent to the parts of their participation that needed to be audio recorded.


4 Baseline

The following sections provide information about the first phase of this case study, the part of the unit of analysis in which the security company's usability testing method was performed and evaluated. The sections present the method, the results from applying it, an evaluation of the method and, lastly, the proposed improvements.

4.1 Purpose

The purpose of the baseline study was to test and evaluate the usability testing method that the security company currently works with. I will hereafter refer to this method as the "in-house method".

4.2 The In-House Method

The in-house method consists of three self-reported, post-task questionnaires, the System Usability Scale (SUS), the NASA Task Load Index (NASA-TLX) and a third questionnaire measuring experience (the Experience measure), together with a fourth, qualitative element. The goal of the in-house method is to be able to evaluate how seamless the interaction with the company's access solutions is for users.

4.3 Method

The following sections provide a more thorough presentation of the in-house method.

4.3.1 SUS

The SUS questionnaire used was the standardized English version with two alterations: adding "awkward" next to "cumbersome" in the eighth item and using both numbers and terms for all five answer options. The terms used form a standard five-point Likert scale, ranging from "Strongly disagree", "Disagree", "Neither", "Agree" to "Strongly agree", with accompanying numbers (see Appendix A for the SUS sheet).

4.3.2 NASA-TLX

The NASA-TLX questionnaire includes the six properties (1) mental demand, (2) physical demand, (3) temporal demand, (4) performance, (5) effort and (6) frustration. The participant answers on a 100-point scale ranging from "very low" to "very high", except for (4), which ranges from "perfect" to "failure", in 5-point steps. The questionnaire does not show any numeric values on the scale, and the participant can also choose not to answer (see Appendix A for the NASA-TLX sheet).

4.3.3 Experience Measure

The Experience measure questionnaire uses the same type of scale as NASA-TLX, ranging from "very low" to "very high", but this time asks the participants to rate 10 different experience properties. In addition, they are asked to mark the word that best describes the experience, either "old", "known", "improved", "new" or "innovative", and finally whether they thought the experience was "handsfree" (see Appendix B for the Experience measure sheet).

4.3.4 Participants

Three participants, two females ([P1], [P2]) and one male [P3], with an average age of 47 years (m = 46.6), participated in the baseline study.

4.3.5 Procedure

One product version of a lock, a mechanical key, was tested in this phase with all participants. The participants were instructed that the task was to unlock the door with the mechanical key, enter, and lock the door. The task was presented in a scenario where the subject was on their way home from work, carrying a backpack with the key in it.

After performing the task, each participant was asked two questions about their experience: "What did you like about this experience?" and "What did you dislike about this experience?". The qualitative questions were added to understand which aspects of the interaction the participants liked and disliked. Thereafter, the participants were asked to fill out the three questionnaires in the following order: SUS, NASA-TLX and the Experience measure.

4.4 Results

The following sections provide information about the results from each of the questionnaires.

4.4.1 SUS

The total average SUS score for the mechanical key was 84.2, which corresponds to grade A. This indicates that the mechanical key is, for example, easy to use.

4.4.2 NASA-TLX

The participants reported that the mechanical key imposes a low task load with respect to all properties in NASA-TLX; see Figure 2 for the results. (Note that the scores in Figure 2 are reported so that a higher value indicates less load, cf. the "(No)" labels.) The property with the least task load was mental demand (m = 95), while the lock imposed the most task load with respect to physical demand (m = 83.3).

The reported spread for physical demand (std = 7.6) was higher than for all other items, with temporal demand (std = 5.5) and performance (std = 5) close behind. Effort (std = 2.8) and frustration (std = 2.8) showed much less spread, and mental demand (std = 0) showed no spread at all. This indicates that the participants did not agree as much on how much physical effort the mechanical key demands.


Figure 2. Median score in NASA-TLX for mechanical key.
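The m and std values reported in this chapter are ordinary descriptive statistics over the three participants' ratings. A minimal sketch of that arithmetic is shown below; the rating values are hypothetical, chosen so that they reproduce the reported physical demand figures, and are not necessarily the participants' actual answers.

```python
from statistics import mean, stdev

# Descriptive statistics as reported in the results (mean and spread per property).
# The three ratings are hypothetical placeholders that happen to reproduce
# the reported physical demand values (m = 83.3, std = 7.6).
physical_demand = [75, 85, 90]

print(f"m = {mean(physical_demand):.1f}")     # m = 83.3
print(f"std = {stdev(physical_demand):.1f}")  # std = 7.6
```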

4.4.3 Experience Measure

The participants reported that the experience of using a mechanical key is responsive (m = 95, std = 5) and intuitive (m = 95, std = 5). They also experienced it as easy (m = 88.3, std = 12.58), quick (m = 83.3, std = 20.82), reliable (m = 83.3, std = 24.66) and secure (m = 75, std = 30.41). The properties that the participants rated lower were feel safe (m = 73.3, std = 24.66), convenient (m = 66.6, std = 10.41) and intelligent (m = 53.3, std = 35.66). They experienced the lock least as productive (m = 40, std = 30.1). See Figure 3 below for the results.

Figure 3. Median score in Experience measure for mechanical key.



4.5 Analysis

The results indicate a usable product, with an average SUS score of 84.2. The interaction demanded a very low workload according to the results reported in NASA-TLX, with an average workload of 6.67. The results from NASA-TLX indicate that using the mechanical key demands low, or basically no, mental effort. They also show that the interaction requires more physical effort than mental. However, both of these task load properties, alongside the others, reflect a low task load overall. The average Experience measure score was 90.00. The results from the Experience measure indicate that the mechanical key is responsive during use and intuitive. The latter result is most likely due to how familiar this kind of unlocking function is. The experience property with the lowest average score was "productive" (m = 40.00). Whether this result indicates a misinterpretation of the property or simply that the participants felt unproductive is impossible to say; one would need to ask the participants to explain their answers to fully grasp their interpretations and experiences.

The participants did not seem to share the same type of experience of using this type of lock, since they answered very differently on most experience aspects in the Experience measure questionnaire. The only properties the participants seemed to agree on were "responsive" and "intuitive" (both std = 5), which also happened to be the two properties that best described the experience of using this lock. This too would need further investigation to fully comprehend why this was the case.

The participants were also asked the question "What did you like about this experience?" after the task. The questionnaire results align with their answers: the mechanical key "feel real" [P3] and gives a sense of being in control [P1]. When asked "What did you dislike about this experience?", the participants reported safety issues, namely that they lose control if the key gets lost ([P1], [P2]). One participant [P3] mentioned that the mechanical key was heavy and slow; the same participant scored physical demand higher than the other two participants. However, the results indicate that the physical demand is still very low.

4.6 Theoretical Evaluation of In-House Method

This section discusses the performance of the in-house method with respect to knowledge in the problem area and the research field of usability testing. The in-house method consists of three post-task questionnaires and is used to evaluate the interaction with locks. The goal of the user experience is seamless access, which, according to the metrics used, is defined as an overall easy and usable interaction with a low workload. An interaction that does not demand much physical or mental effort and that provides a reliable and intuitive experience is regarded as a good experience. The in-house method consists of metrics that can evaluate these experiences in different ways, but to some extent they are argued to be insufficient.

4.6.1 SUS

The in-house method does not use the original SUS questionnaire, because two alterations have been made. The alteration of adding "awkward" next to "cumbersome" in the eighth item is advocated for, because the word "cumbersome" might be too difficult to understand (Bangor et al., 2008) and because "awkward" is a more commonly used word (Finstad, 2006). The standard SUS only uses so-called "anchor" terms at the ends and numbers for all five options, ranging from "Strongly disagree" (1) to "Strongly agree" (5) (Brooke, 1996). The labels used in the in-house method form a standard five-point Likert scale, with terms ranging from "Strongly disagree", "Disagree", "Neither", "Agree" to "Strongly agree", with accompanying numbers as in the standard version. The effects of this alteration are unknown, but according to Tullis and Albert (2013), if the labels make sense they will not affect the study considerably, and the labels used in the in-house method are common Likert scale ones. If only terms were present, without the numbers, the scale would be considered ordinal, which would make calculation of averages impossible (Tullis & Albert, 2013). But since the numbers are present, the data can still be interpreted as interval, hence averages are possible, because the numbers on the answer options make the distances between them equal; for example, the distance between 3 and 4 is the same as between 6 and 7 (Mårdberg & Carlstedt, 2019).

There are many ways of analysing the results from the SUS questionnaire and of using them as an indication for future design. For example, Bangor et al. (2008) suggested using a three-level grading of "acceptable", "marginal" and "not acceptable" based on the SUS score; another example is a grading ranging from A to F that provides a more detailed view of SUS scores, where A indicates better usability (Bangor et al., 2008). The in-house method uses the latter analysis method, which has been widely used but criticised because the grading scale does not stem from research.

4.6.2 NASA-TLX

NASA-TLX is a validated metric for investigating the perceived level of workload for a task. The in-house method applies it in the standardized way, and no alterations were made. One matter worth discussing is the type of scale used in this questionnaire. There is a general discussion about the design of scales, and the scale used in NASA-TLX differs from traditional Likert scales, as it is a 20-point scale that does not show any numeric values. Because the scale has an even number of response options, the participant is forced to take a stand on each item. Some researchers argue that the participant should be given a response option with a neutral position (Tullis & Albert, 2013). However, in NASA-TLX the participant has the option not to answer, which might replace the neutral position even though the two do not have the same meaning.

The number of possible choices on a scale is also a debated matter within the field of psychometrics. Although seven or five response options are the most common numbers used in standardized questionnaires like these (Lewis, 2018), some say that more options are better, but not all researchers agree (Tullis & Albert, 2013). The scale used in NASA-TLX differs from what the research field says is the most popular, which is seven, but there is no single scale that is appropriate for all types of research (Cox, 1980).

4.6.3 Experience Measure

The Experience measure is the third questionnaire in the in-house method, and its design is influenced by NASA-TLX. It displays the same type of scales but investigates a set of experience properties such as pleasurable, desirable and valuable. It also investigates another feature, where the users are asked to classify whether the locks are perceived as "old", "known", "improved", "new" or "innovative" and whether they are "handsfree" or not. The same concerns as described for NASA-TLX apply here as well. However, since this questionnaire is not a validated method, and hence not proven reliable, it seems more important to try out other methods than to continue using this questionnaire.

4.6.4 The Qualitative Questions

It is stressed to be beneficial to mix qualitative and quantitative methods when performing research (Creswell & Clark, 2018). Although the two questions, "What did you like about this experience?" and "What did you dislike about this experience?", add a qualitative aspect to the in-house method, they proved insufficient for understanding why the participants rated as they did. For example, it would be interesting to know why the mechanical key was experienced as invoking low productivity, that is, to investigate how the participants reasoned in the questionnaires.

4.6.5 Suggested Improvements

Based on the theoretical evaluation of the in-house method, the proposed improvements are to replace SUS with SEQ, to replace the Experience measure with PRC, and to counterbalance the order of the questionnaires during tests.

There are many benefits of applying standardized questionnaires like SUS and NASA-TLX, because they are said to be quick to administer and to analyse (Lewis, 2018; Tullis & Stetson, 2004). Another benefit is that some have been proven effective even for small sample sizes, and SUS is one of these (Tullis & Stetson, 2004). The reason behind the decision to replace SUS with SEQ is that SEQ has been proven to be just as good as, or even better than, SUS (Sauro & Dumas, 2009), and since it is a one-question questionnaire it might also help decrease the risk of participant fatigue. So, the proposal is to exchange SUS for SEQ in order to evaluate their interchangeability.

The decision to replace the Experience measure with PRC is motivated by several reasons. First, the Experience measure is not a researched method and has not been proven valid and reliable. Second, there are many challenges and obstacles in using solely standardized methods, for example that they are said to "prime" the participant through the response alternatives (Scherer, 2005), that the participants are restricted to the range of questions provided by the researchers (Barnum & Palmer, 2010), and that participants tend to provide pleasing responses, often referred to as acquiescence bias (Barnum et al., 2010). Based on these insights, PRC was added to provide the participants with a method where they can choose how to explain their experience themselves, which is another reported benefit of the PRC method (Benedek & Miner, 2002). The reason why PRC was chosen rather than the Repertory Grid Technique (RGT) is partly that RGT is a more time-consuming method to apply and partly that some researchers argue that it is a difficult task for the participants to perform (Fallman & Waterworth, 2010; Stergiadis & Arvola, 2018; Tomico et al., 2009).

Ultimately, it is said that there are no objective methods for measuring the subjective experience of emotions (Scherer, 2005), and it is stressed as important to include other metrics to evaluate the broader construct of experience (Lewis, 2018). By adding the qualitative dimension of PRC, the proposed method will be based on a mixed method approach (Creswell & Clark, 2018). Even though the in-house method is composed of several beneficial usability methods, it became apparent that the need to know why participants rated as they did motivated the decision to add a qualitative method. This is a typical situation in which to use a mixed method approach: when quantitative results require an explanation as to what they mean (Creswell & Clark, 2018), since quantitative results most commonly represent general descriptions and understandings. A more detailed understanding is then lacking, and qualitative and quantitative research can lend support and give depth to each other (Sauro, 2016).


5 Improvements

The following sections present the methodological choices and results of the suggested improvements described in the previous section. This is the second phase of the case study and represents the second and last part of the unit of analysis.

5.1 Purpose

The purpose of this iteration is to test and evaluate the suggested improvements to the in-house method and to assess whether they provide actionable insights for decisions about design and practice.

5.2 New Evaluation Method

A mixed method approach, specifically a convergent design, was applied by using measures from both a qualitative and a quantitative perspective. SEQ and NASA-TLX constitute the quantitative approach, while PRC constitutes the qualitative approach. The emphasis on the qualitative and quantitative approaches was equal; however, since PRC is a mix between the two, the study had a more quantitative nature. All metrics were collected at the same time for all participants. The purpose of this design was twofold: to see whether the results of the methods correlate or contradict each other, and to investigate what impact the qualitative approach of PRC has on the in-house method.

5.2.1 SEQ

The Single-Ease Question was used instead of SUS. Each participant was asked to rate their overall experience of how difficult or easy they thought the task was, on a 7-point Likert scale ranging from "very difficult" (1) to "very easy" (7) (see Appendix C for the SEQ questionnaire).

5.2.2 NASA-TLX

As mentioned earlier, the NASA-TLX is a metric used to evaluate users’ perceived level of workload of a task and experienced emotions when interacting with a product. All participants received the questionnaire in paper format and used a pen to fill out all items (see Appendix C for NASA-TLX questionnaire). The original questionnaire was used, the same as in the in-house method.

5.2.3 Product Reaction Cards

The PRC procedure was influenced by the one used in the original study by Benedek and Miner (2002). The participants received an A4 paper on which all 118 words were displayed (see Appendix D for reference). Their first task was to choose as many words as they wanted from the collection by circling or ticking them off with a pen. The second task was to highlight five of the chosen words using a highlighter marker. There were no time limits for these tasks. After completing the two steps, each participant was asked to explain why they chose each of the five words. This part of the task was audio recorded. If a participant for some reason did not choose more than five words, the second highlighting task was simply skipped, and the participant was asked to explain the reasoning behind their choice of words regardless of the number chosen.

5.2.4 Participants

A total of 9 individuals participated, two females and seven males. The participants were recruited using convenience sampling, which in practice meant that all individuals asked to participate were colleagues working at the security company, at various departments. Because of this, it is important to acknowledge possible biases associated with these participants, for example acquiescence bias, which is when respondents tend to provide pleasing responses when filling out a questionnaire (Sauro & Lewis, 2011). Based on their situated knowledge, they might also provide answers they assume are important due to their profession.

5.2.5 Test Design

A within-subject design was applied, testing multiple product versions of locks. The order in which the participants performed the unlocking functions, "A", "B" or "C", as well as the order of the questionnaires, "X" or "Y", was counterbalanced as visualised in Table 1 below.

Table 1. The test design for the Improvements phase.

Participant | Lock function | Questionnaire/Test
1 | (A) Keypad, App, Proximity | (X) SEQ, NASA, PRC
2 | (A) Keypad, App, Proximity | (X) SEQ, NASA, PRC
3 | (A) Keypad, App, Proximity | (X) SEQ, NASA, PRC
4 | (B) App, Proximity, Keypad | (X) SEQ, NASA, PRC
5 | (B) App, Proximity, Keypad | (X) SEQ, NASA, PRC
6 | (B) App, Proximity, Keypad | (Y) SEQ, PRC, NASA
7 | (C) Proximity, Keypad, App | (Y) SEQ, PRC, NASA
8 | (C) Proximity, Keypad, App | (Y) SEQ, PRC, NASA
9 | (C) Proximity, Keypad, App | (Y) SEQ, PRC, NASA

The questionnaires were counterbalanced to minimise the risk of participant fatigue, that is, that participants would get tired during later tasks and decrease their engagement. The unlocking functions were counterbalanced to decrease the risk of experience bias, as the participants might rate one lock depending on their previous experience of another lock.
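As an illustration of how such a counterbalanced assignment can be generated, the minimal Python sketch below rotates participants through the three lock-function orders and alternates the two questionnaire orders. It assumes a simple round-robin allocation; the actual allocation used in the study is the one shown in Table 1.

```python
from itertools import cycle

# The three lock-function orders used in the study (rotations of the same set).
LOCK_ORDERS = {
    "A": ["Keypad", "App", "Proximity"],
    "B": ["App", "Proximity", "Keypad"],
    "C": ["Proximity", "Keypad", "App"],
}

# The two questionnaire orders used in the study.
QUESTIONNAIRE_ORDERS = {
    "X": ["SEQ", "NASA-TLX", "PRC"],
    "Y": ["SEQ", "PRC", "NASA-TLX"],
}

def assign_conditions(n_participants: int):
    """Assign each participant a lock-function order and a questionnaire order.

    This is a generic round-robin assignment, not the exact allocation in
    Table 1, which grouped participants in blocks of three.
    """
    lock_cycle = cycle(LOCK_ORDERS)            # A, B, C, A, B, C, ...
    quest_cycle = cycle(QUESTIONNAIRE_ORDERS)  # X, Y, X, Y, ...
    plan = []
    for participant in range(1, n_participants + 1):
        plan.append({
            "participant": participant,
            "lock_order": LOCK_ORDERS[next(lock_cycle)],
            "questionnaire_order": QUESTIONNAIRE_ORDERS[next(quest_cycle)],
        })
    return plan

if __name__ == "__main__":
    for row in assign_conditions(9):
        print(row)
```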

5.2.6 Test Environment

All tests were conducted in a concept lab located at the security company's office in Stockholm. The room was equipped with a door and the lock used during testing. One-way mirrors were placed on two of the four walls, making it possible to observe the participant interacting with the lock from different angles. One observer and one moderator were present during testing: the observer stayed behind the mirrors, while the moderator stayed with the participant and was in control of the test. Because of this setup, all participants took part under equivalent circumstances and in the same environment.

5.2.7 Materials

The materials used in the study are listed below with a short description. They can be viewed in full in the Appendix sections referred to in each description.

• Questionnaires: the SEQ and NASA-TLX in printed format (see Appendix C), as well as the full list of the Product Reaction Cards (see Appendix D).

• A sound recorder, the Voice Memos app ("Röstmemon") on an iPhone 11, was used during the PRC task to record each participant's motivation behind each chosen word.

• The lock used in the tests, which is a multifunctional smart lock. The function referred to as "keypad" (see Figure 4) is a common unlocking function that most users have experience of; it unlocks by entering a six-digit code. The function referred to as "app" (see Figure 5) is an unlocking function performed through an application downloaded to the phone. The function referred to as "proximity" (see Figure 6) is an unlocking function that does not yet exist. This task was therefore performed using the Wizard of Oz technique, which is commonly used with early prototypes in which certain functions are not yet available, and where a "wizard", typically someone moderating the test, manipulates the interaction in a way that is not obvious to the participants (Barnum, 2020). In this case, the Wizard of Oz setup was based on the idea that the lock would react to the presence of the participant's phone and unlock automatically through a Bluetooth connection. The observer of the test acted as the wizard.


Figure 4. Keypad. Figure 5. Application. Figure 6. Proximity.

5.2.8 Procedure

All participants were recruited within the security company and were briefly informed about the purpose of the test and the procedure. Each participant was instructed to unlock the door using a certain unlocking function, within the scenario that they were on their way home from work and were entering their front door. App and proximity required a phone, and the participants were instructed to place the phone wherever they felt comfortable, for example in their pocket or their hand. The PIN code required to unlock using the keypad was written on a sticky note placed on the door above the lock. Each participant was asked to perform the task twice, since this enabled a "trial and error" process where the participants could learn from their mistakes and try again. The moderator did not assist participants who failed to unlock, but in cases where a participant failed to lock the door after entering, the moderator quickly locked the door to enable the next iteration. After the task was performed, the participants filled out the set of questionnaires and partook in the PRC task. This procedure was repeated three times, once for each unlocking function. Each test took about 20-25 minutes in total.

5.3 Results

In this section the results from the second phase of this case study, the Improvements phase, are presented. First, the results from the Single-Ease Question for all unlocking functions are presented, then NASA-TLX, and lastly Product Reaction Cards.

5.3.1 Single-Ease Question

All participants were asked to rate, on a 7-point Likert scale, how difficult or easy they found the unlocking functions overall; a higher rating indicates an easier experience. The perceived difficulty using the keypad (m=5.3) was reportedly higher than using the application (m=5.5), which was experienced as the easiest function, yet the keypad was rated easier than unlocking by proximity (m=5.2). Figure 7 below presents the results in a boxplot. The results from the SEQ questionnaire show little differentiation: all three unlocking functions were rated almost equally easy.

Figure 7. SEQ ratings for all three unlocking functions.

Figure 7 above demonstrates the mean, median and interquartile range (IQR) for all three unlocking functions. The X marks the average score, and the line in each box represents the median for each unlocking function.
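To show how the summary statistics in Figure 7 can be computed, the following minimal Python sketch derives the mean, median and IQR per unlocking function. The per-participant ratings in the sketch are hypothetical placeholders; only the reported means (5.3, 5.5 and 5.2) come from the study.

```python
import statistics

# Hypothetical per-participant SEQ ratings (1 = very difficult, 7 = very easy).
# These values are illustrative placeholders, not the study data.
seq_scores = {
    "Keypad":    [5, 6, 4, 6, 5, 5, 6, 5, 6],
    "App":       [6, 5, 6, 5, 6, 5, 6, 5, 6],
    "Proximity": [5, 6, 4, 5, 6, 5, 5, 6, 5],
}

def summarise(scores):
    """Return mean, median and interquartile range for one unlocking function."""
    q1, _, q3 = statistics.quantiles(scores, n=4)  # quartile cut points
    return {
        "mean": round(statistics.mean(scores), 1),
        "median": statistics.median(scores),
        "IQR": q3 - q1,
    }

for function, scores in seq_scores.items():
    print(function, summarise(scores))
```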

5.3.2 NASA-TLX

Figure 8 below presents the average score for all six properties and for all three unlocking functions starting with keypad, then app and proximity.

Figure 8. NASA-TLX for all three unlocking functions.

Figure 8 above demonstrates the median of each task load property for all unlocking functions through the line in each box, while the X in each box visualises the average score. The boxplot also visualises the spread and potential outliers. A lower score in the questionnaire indicates a smaller workload, but for the sake of visualising all charts alike, the scales and scores for NASA-TLX have been flipped to provide a more straightforward view of the results; this is also why "(no)" is added in front of each property. A higher score in Figure 8 therefore indicates a lower demand or effort; for example, proximity scored a high value for "(no) mental demand", which means the unlocking function demands very little mental effort.
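A minimal sketch of this reverse scoring is shown below, assuming NASA-TLX ratings on a 0-100 scale and an unweighted (raw TLX) average across the six properties; the rating values themselves are illustrative placeholders, not the study data.

```python
# Hypothetical raw NASA-TLX ratings (0-100, higher = more workload) for one
# participant and one unlocking function. These values are placeholders.
raw_tlx = {
    "Mental demand": 30,
    "Physical demand": 10,
    "Temporal demand": 25,
    "Overall performance": 20,   # on the raw scale, lower = closer to "perfect"
    "Effort": 25,
    "Frustration": 15,
}

# Flip every property so that higher = less workload, as in Figure 8.
flipped = {f"(no) {prop}": 100 - score for prop, score in raw_tlx.items()}

# Unweighted (raw TLX) average on the flipped scale, where a higher value
# indicates a lower overall workload.
average_flipped = sum(flipped.values()) / len(flipped)

print(flipped)
print(f"Average (flipped) workload: {average_flipped:.1f}")
```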

The average workload, which is the usability aspect that NASA-TLX investigates, provides information about the perceived level of workload the participants experienced with each unlocking function. Since a higher flipped score indicates less strain, the keypad was experienced as invoking the highest level of workload (m=68) and proximity (m=91) the lowest, while the app (m=83.3) ended up between the two. This ordering is also evident when looking at the results for each of the six properties respectively. Proximity imposes less workload than the other two unlocking functions with respect to all properties except "overall performance", where proximity (m=81) scored the same as the app (m=81). Meanwhile, the keypad scored lowest on all properties except "temporal demand" (m=67), indicating that it imposes the highest level of workload for all properties except "temporal demand", where the app was perceived as slightly more demanding (m=64).


The highest-scored property for the keypad was "overall performance" (m=72), indicating that most participants perceived their performance of the task as closer to perfect than to failure. However, both app and proximity scored higher on "overall performance" than the keypad. These results align with those previously presented, which point towards the keypad as the least preferred unlocking function with respect to most NASA-TLX properties.

Unlocking using the app had the same average score for "effort" (m=78) as unlocking by proximity (m=78), yet the results for the app are slightly more scattered. This indicates that the participants had varying experiences with the app. The same pattern is apparent in other properties for the app function, for example "temporal demand" and "frustration". It seems that the participants did not have a common view of the app as an access solution.

5.3.3 Product Reaction Cards

The Product Reaction Cards collection was used as a method that allows the participants to provide an explanation of their interaction with each of the three unlocking functions. They were asked to explain why they chose each of the five words that, in their opinion, best described their experience. This part of the PRC was collected using an audio recorder and analysed using content analysis (CA) to manifest users' reasonings. A deductive approach was applied, basing the analysis on the words in the PRC collection. The manifests are excerpts from the transcriptions and are presented as quotes. The results from the PRC, from both the quantitative and the qualitative analyses, are presented in an interwoven manner below.

The unlocking function with the highest proportion of positive words was proximity, for which 94 percent of the chosen words were positive, whereas both the app (88 percent) and the keypad (81 percent) had lower proportions of positive words. The same trend is apparent in the proportion of negative words, where the keypad had the highest recorded share with 19 percent, followed by the app (12 percent) and proximity (6 percent). Hence these results point primarily to satisfaction with proximity, secondly the app and lastly unlocking using the keypad.
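These percentages can be obtained with a simple tally of each participant's selected words against a positive/negative labelling of the card set. The sketch below is a minimal Python illustration; the word lists and valence labels are hypothetical placeholders, since the full 118-word set is given in Appendix D.

```python
# Hypothetical valence labelling of a few Product Reaction Cards.
# The real study used the full 118-word set (see Appendix D).
VALENCE = {
    "Convenient": "positive", "Fast": "positive", "Reliable": "positive",
    "Simplistic": "positive", "Confusing": "negative", "Slow": "negative",
    "Frustrating": "negative", "Unpredictable": "negative",
}

# Hypothetical selections per unlocking function (all participants pooled).
selections = {
    "Keypad":    ["Convenient", "Slow", "Reliable", "Frustrating"],
    "App":       ["Fast", "Convenient", "Confusing", "Reliable"],
    "Proximity": ["Fast", "Convenient", "Simplistic", "Reliable"],
}

def valence_share(words):
    """Return the percentage of positive and negative words in a selection."""
    positive = sum(1 for w in words if VALENCE.get(w) == "positive")
    negative = sum(1 for w in words if VALENCE.get(w) == "negative")
    total = positive + negative
    return {
        "positive %": round(100 * positive / total),
        "negative %": round(100 * negative / total),
    }

for function, words in selections.items():
    print(function, valence_share(words))
```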

Aside from the number of positive versus negative words, the results from this task were also analysed using five categories: Ease-of-Use, Usefulness, Efficiency, Appeal and Engagement. This categorization was influenced by Merčun (2014), and although the complete collection of 118 words was not used in Merčun's study, the categorization provided a way to analyse the data with respect to different kinds of UX goals and not solely the direction of satisfaction. It also allowed all chosen words to be included in the analysis. Since not all 118 words were included in Merčun's (2014) categorization, a workshop with three other UX designers and researchers at the security company was held with the purpose of placing each of the 118 words into the five categories. This was done remotely using Miro, an online collaborative whiteboard platform where several participants can create, collaborate and communicate in real time. Four templates, one for each participant, including all 118 words and the categories, were set up prior to the workshop. The task for each participant was to individually place each word in the category they saw fit. This took about 20 minutes, and afterwards we discussed our interpretations of the five categories for about 40 minutes (see Appendix E for material from Miro). This session resulted in four interpretations of
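One way to consolidate such individual categorizations into a single word-to-category mapping is a simple majority vote across the workshop participants. The sketch below illustrates this in Python; the words, categories and votes are hypothetical placeholders, not the actual workshop outcome.

```python
from collections import Counter

CATEGORIES = ["Ease-of-Use", "Usefulness", "Efficiency", "Appeal", "Engagement"]

# Hypothetical categorizations from four workshop participants:
# one dict per participant, mapping word -> chosen category.
votes = [
    {"Convenient": "Ease-of-Use", "Fast": "Efficiency", "Fun": "Engagement"},
    {"Convenient": "Ease-of-Use", "Fast": "Efficiency", "Fun": "Appeal"},
    {"Convenient": "Usefulness",  "Fast": "Efficiency", "Fun": "Engagement"},
    {"Convenient": "Ease-of-Use", "Fast": "Efficiency", "Fun": "Engagement"},
]

def majority_category(word):
    """Return the most frequently chosen category for a word and its vote count.

    Ties would need to be resolved through discussion, as in the workshop.
    """
    counts = Counter(v[word] for v in votes if word in v)
    category, count = counts.most_common(1)[0]
    return category, count

for word in ["Convenient", "Fast", "Fun"]:
    category, count = majority_category(word)
    print(f"{word}: {category} ({count}/{len(votes)} votes)")
```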
