
The influence of explanations in recommender systems on user engagement

Felix Rossel

July 2020

Main Field: Informatics, Human Computer Interaction

Supervisor: Domina Robert Kiunsi


Dem Ingeniör ist nichts zu schwör. ("Nothing is too hard for the engineer.") Für Margit und Fritz


Acknowledgements

I would like to say thank you to everyone who helped me during my studies. Special thanks go to my parents and grandparents; without your support I would not have been able to finish this. I would also like to thank Domina, Bruce, K'Pieter, P'Pum and K'Ayuth for all your support. Vielen Dank! Tack så mycket! Kop Khun Mak Krap!


Abstract

Recommender systems are without a doubt a staple of the modern internet. Services like Amazon, Netflix, YouTube and Spotify rely on them. What makes them so engaging that millions of users spend billions of hours on them every day?

User engagement is widely accepted as a core concept of user experience, but we still do not know what role the user interface plays in it. This thesis investigates the effect of explanations in recommender systems on the user's engagement with a case study on BMW Financial Services Thailand's recommender system.

An experiment on Amazon Mechanical Turk with the User Engagement Scale and A/B testing with Google Analytics showed a significant influence of explanations on the user's engagement.


Contents

1 Introduction
   1.1 Background
   1.2 Problem Statement
   1.3 Purpose and Research Question
   1.4 Scope and Delimitation
   1.5 Disposition

2 Theoretical Framework
   2.1 Recommender Systems
      2.1.1 BMW's Recommender System
   2.2 Explanations in Recommender Systems
      2.2.1 Goals of Explanations
      2.2.2 Types of Explanations
   2.3 User Engagement
      2.3.1 Measuring User Engagement
      2.3.2 User Engagement Scale Short Form
      2.3.3 A/B Testing
   2.4 Hypothesis Testing

3 Method and Implementation
   3.1 Link between research questions and methods
   3.2 Work Process
   3.3 Approach
   3.4 Design
      3.4.1 User Engagement Scale Short Form Analysis
      3.4.2 Pilot Studies
      3.4.3 A/B Testing with Google Analytics
   3.5 Data Collection
      3.5.1 A/B Testing
      3.5.2 UES-SF
   3.6 Data Analysis
      3.6.1 A/B Testing
      3.6.2 UES-SF
   3.7 Validity and Reliability

4 Empirical Data
   4.1 Google Analytics
   4.2 Amazon Mechanical Turk Experiment
      4.2.1 UES-SF Variables

5 Analysis
   5.1 Amazon Mechanical Turk Experiment
      5.1.1 UES-SF
      5.1.2 Explanation Goals
      5.1.3 Preferred Application
      5.1.4 Attitude towards BMW
   5.2 Google Analytics
      5.2.1 Average Time on Page
      5.2.2 Bounce Rate

6 Discussion and Conclusion
   6.1 Findings
   6.2 Implications
   6.3 Conclusions
   6.4 Further Research

References

Appendices
   A User Engagement Scale Short Form (UES-SF)
   B Pilot 1: Between Subjects Design
   C Pilot 2: Within Subjects Design
   D Full Study
   E Collected UES-SF Data


1 Introduction

1.1 Background

The explosive growth of the internet as a medium for e-commerce has created a new problem over the last decade. Among the vast catalogue of items available online, users do not know which product is best for them. When buying products, people often rely on recommendations from their peers to make a choice. These insights led to the development and use of systems that recommend items to the user in a similar way a friend would. Today we call them recommender systems [1].

Recommender systems can be categorised by the way they create recommendations. The most common variations are content-based, collaborative filtering and knowledge-based systems, as well as hybrid systems which apply a mix of different techniques [1]. A detailed explanation of the different recommender types can be found in the section Theoretical Framework.

Each of these systems generates its recommendations in a specific way, but the user rarely gets to understand this process. In fact, many recommender systems function like a black box, giving little to no insight into their processes [2].

One way to handle this problem is to explain the recommendations the system gives the user. One of the most common aims of explanations in recommender systems is to increase the transparency of a system. Other aims as formulated by Nava Tintarev and Judit Masthoff include Trust, Persuasiveness, Satisfaction and Efficiency [3].

In May 2020 BMW Leasing Thailand released a constraint-based recommender system that helps users find the right car and financial offer for them. The application asks users about their preferred car characteristics (e.g. performance, sustainability) as well as questions related to their lifestyle. Based on this, the system finds suitable cars, matches the user to BMW's finance products and explains why each car was recommended.

This thesis investigates the influence of explanations on user engagement with a case study on BMW Leasing Thailand’s recommender system.

1.2 Problem Statement

What makes a recommender system like BMW's successful? That is a complex question which is hard to answer, but one thing we do know: the user's experience plays a big role in how successful an application becomes. As recently as the beginning of June 2020, Google started to incorporate user experience into its search engine ranking algorithm [4].

For a long time the evaluation of recommender systems had mainly focused on accuracy metrics, i.e. how effective a system is at creating recommendations. But making a system accurate does not necessarily make it enjoyable to use. McNee [5] and Ziegler [6] have shown that user satisfaction does not always correlate with high recommender accuracy.


We know that the user experience of web applications is a complex construct which is influenced by multiple aspects [7]. The accuracy of the recommendation algorithm is only one of those factors. To find out more about the user experience of a web application, one needs to look at it from different angles and with different methods.

In 2006 several researchers [8] realised that the user's perspective on recommender systems had not received the amount of attention it deserves. With that, a new phase of research began and scientists started to investigate the UX of recommenders. A comprehensive summary of the user-centered research on recommenders up to 2012 was released by P. Pu et al. [9].

Researchers and industry leaders claimed that the user interface of a recommender system might have a far larger influence on the user's experience than its accuracy. Knijnenburg et al. developed a framework [10] for analysing recommender systems which also incorporates the effect of the system's user interface components. One of the user interface functions that affect the user experience is explanations.

As mentioned before, explanations have been developed to give the user more information about the recommendation. Nava Tintarev and Judit Masthoff laid the foundation for evaluating what makes a good explanation with their paper "A Survey of Explanations in Recommender Systems" [3], in which they present possible evaluation methods for explanations and give an overview of the different goals explanations can have. Goals as formulated by them include the increase of Trust, Persuasiveness, Satisfaction and Efficiency [3]. Their study encourages further research on explanations, as we still need to learn more about the effect of explanations on the user.

With the shift of focus away from accuracy metrics towards Human-Computer Interaction methods, user engagement has grown to be recognised as an important measure for evaluating the user experience of web applications. "User engagement can be described as the user's cognitive involvement with a technology which usually relates to the quality of the engagement." [11]. In simple words, user engagement keeps users motivated to use an application.

A popular tool to measure user engagement in web applications is the UES-SF questionnaire developed by O'Brien et al. [12]. The questionnaire tracks four variables that combined form an impression of the user's engagement: perceived usability (PU), reward factor (RW), aesthetic appeal (AE) and focused attention (FA). Research by Moshfeghi [13] with the UES on a news search site showed that certain UI elements increase user engagement in several respects.

The research of Moshfeghi and Tintarev et al. lets one assume that explanations in recommender systems have a significant effect on user engagement. This study investigates this assumption with a case study on BMW Financial Services Thailand's recommender system.


1.3 Purpose and Research Question

As industry and the research community place more value on good user experience, it also becomes more important that we understand how to build applications with good UX.

User engagement has a great influence on the user's experience but, due to its complexity, has not been fully explored yet. With the UES-SF we now have a robust tool to measure user engagement and to explore how to create it. This study aims to provide insights into how explanations shape the user engagement of recommender systems. With this study I hope to provide useful insights for further research on recommender systems and for the development of engaging recommender systems.

The research question therefore is: Do explanations in recommender systems influence the user's engagement? The hypothesis is that explanations in recommender systems have a positive effect on the user's engagement.

1.4 Scope and Delimitation

It is important to note that there can be different reasons why explanations get implemented in recommender systems. When applying this study to other recommender systems, one should consider that the two systems need to have similar explanation goals. The explanations of the investigated recommender showed the biggest effect on its perceived usability; the explanation goals efficiency and effectiveness seem to drive this.

The number of participants in the final study was 99, limited by the budget for this study. For a more accurate result more participants should be recruited. The demographics of the sampled participants could also be a limiting factor of the thesis: most of the participants were 26 years or older, which might skew the results towards the perception of these age groups.

Another limiting factor could have been the language of the application. Users coming from the ad campaigns were mostly of Thai nationality, but the website defaulted to English as the standard language. The Google Analytics data shows that almost no users switched the interface language to Thai. This could have led to a biased result, where users not comfortable with English bounced and only users comfortable with English stayed.

The interpretation of the A/B test with Google Analytics, using the knowledge from the MTurk experiment as a basis, is only one possible way to read the results. To support this interpretation, additional metrics such as click events would be needed, which were not recorded.

1.5 Disposition

The remainder of the report continues with the Theoretical Framework, which gives background on the topics of recommender systems and engagement. It also provides an introduction to the statistical tools used to analyse the collected data. The section Method and Implementation gives a detailed insight into the experiments and how they were conducted. Empirical Data gives an overview of the collected results of the experiments. The next section, Analysis, then digs into the data with the statistical methods described prior and analyses the findings in detail. In the final section, Discussion and Conclusion, the findings of the analysis are summarised and ideas for future research are presented.

2 Theoretical Framework

2.1 Recommender Systems

Most recommender systems can be classified as either knowledge-based, collaborative filtering or content-based. They differ in how they create recommendations [1].

Collaborative filtering recommender systems use the ratings of users with an interaction history similar to the current user's to create recommendations. An example of this can be found on amazon.com (Figure 1), where users visiting a product page get recommendations about what customers with similar interests have bought.

Figure 1: Amazon's collaborative filtering recommender system

Content-based recommender systems also utilize the browsing history of the user to create recommendations. But instead of looking at the history of similar users, these systems look at the history of items rated by the user and recommend items similar to the ones the user rated positively in the past. The knowledge base in use focuses on item attributes.

Knowledge-based recommender systems are different from the two previously mentioned systems in the sense that they do not rely on the user's browsing history but instead utilize domain experts' knowledge to create recommendations. This way knowledge-based systems do not suffer from the so-called cold-start or ramp-up problem, where the system cannot give appropriate recommendations for a newly signed-up user due to the lack of sufficient history.

Knowledge-based recommender systems are used in cases where there is no available data on previous interactions with items. This is a common problem with items that are either seldom bought, like luxury goods or real estate, or items from a complex product space that requires expert knowledge to navigate, like financial services and automobiles. In these cases, recommender systems can guide the user to the right choice, similar to a sales consultant.

Knowledge-based systems explicitly solicit user requirements and therefore do not suffer from the cold-start problem, which occurs when a system does not have enough data to make adequate recommendations; collaborative and content-based systems are prone to this [14].

This characteristic makes knowledge-based systems suitable for items that are rarely bought. Such items are often highly customized, which makes it hard to collect enough data to sufficiently power a collaborative or content-based system. Typical domains for knowledge-based systems are automobiles, real estate and financial services.

There are two well-known types of knowledge-based recommender systems: utility-based and constraint-based. Utility-based recommender systems calculate the utility of each recommended product for the user and present the products in a ranked list (n-Top items).

Constraint-based recommender systems use explicitly defined rule sets (constraints) to match the user's preferences with the available products. These recommender systems need a knowledge base that maps the domain knowledge of experts into the system.

A typical knowledge base consists of product constraints (PC), customer constraints (CC), filter constraints (FC) as well as incompatibility constraints (IC) [2].

Product constraints define characteristics of items in the recommender system. Customer constraints define the characteristics chosen by the customer. Filter constraints represent rules defined by marketing and sales.

2.1.1 BMW’s Recommender System

The recommender system used in this study is a simple constraint-based recommender system. It lets users define the key car characteristics of importance to them. The system then filters out the cars not matching the requirements and retrieves the n-Top items for the user. The recommender system can be visited at this URL: www.yourbmwleasingthailand.com/bmw/car-preference

Figure 2 illustrates the user flow of the recommender.

Figure 3 shows the start screen of the recommender, where users are instructed to select their preferred car characteristics.

If the RS cannot recommend any product for the defined customer constraints, repair actions are suggested. In our case the recommender system tells the user to either change the monthly instalment limit or the defined requirements. The following tables illustrate the constraints and variables of the recommender system in use.

The system's incompatibility constraints (IC) are defined by the available cars and their attributes. E.g. users will not be able to get results if they choose Compact Size and Roominess, since those two characteristics are opposites of each other.


Figure 2: User flow of BMW Financial Services Thailand's recommender system

Table 1: Customer constraints of BMW's recommender

Customer Constraints (CC)        Example Value
Monthly Instalment Limit         50,000 Baht
Preferred Car Characteristics    Performance, Sustainability

Research in the past, as mentioned in the problem statement, has mainly been focused on improving the recommender's accuracy. But as a simple example by McNee et al. from their paper Being Accurate is Not Enough shows, a good recommender is more than accurate: "Imagine you are using a travel recommender system. Suppose all of the recommendations it gives to you are for places you have already traveled to. Even if the system were very good at ranking all of the places you have visited in order of preference, this still would be a poor recommender system. Would you use such a system?" [8] Their paper focuses on three areas of recommender systems that had not received much attention in research up until then. These areas are the following:

1. Similarity (users only get very similar items recommended)

2. Serendipity ("experiencing an unexpected but fortuitous item")

3. User Experiences and Expectations

In another study [9] Pu et al. summarize the existing HCI work on recommenders in three "crucial interaction activities":

1. the initial preference elicitation

2. the preference refinement process

3. the presentation of the results

This study focuses on the last of these activities, the presentation of the results, which also includes explanations.


Figure 3: Start screen of BMW Financial Services Thailand's recommender system


Table 2: Product constraints of BMW's recommender

Product Constraints (PC)            Example Value
Name                                X7 M50d
Characteristics                     Performance, Motorsport, ...
Applicable Leasing Products         HP, HPB, FL
Monthly Costs (HP, HPB, FL, FC)     110,999 / 76,999 / 79,999 / -

Table 3: Filter constraints of BMW's recommender

Filter Constraints (FC)
The monthly costs need to be equal to or less than the user's defined monthly instalment limit.
The user's preferred car characteristics need to match at least two characteristics of the product.
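To make the constraint matching concrete, the sketch below applies the two filter constraints from Table 3 to a small, hypothetical product table in R (the language used later for the statistical analysis). The car data, column names and helper code are illustrative assumptions, not BMW's actual knowledge base or implementation.

    # Hypothetical product constraints (PC): name, monthly cost and characteristics
    products <- data.frame(
      name         = c("X7 M50d", "330e", "i3"),
      monthly_cost = c(110999, 45999, 39999)
    )
    products$characteristics <- list(
      c("Performance", "Motorsport"),
      c("Performance", "Sustainability", "Comfort"),
      c("Sustainability", "Compact Size")
    )

    # Customer constraints (CC) as entered by the user
    instalment_limit <- 50000
    preferred        <- c("Performance", "Sustainability")

    # Filter constraints (FC): cost within the limit and at least two
    # of the preferred characteristics matched by the product
    matches     <- sapply(products$characteristics,
                          function(ch) sum(preferred %in% ch))
    recommended <- products[products$monthly_cost <= instalment_limit &
                              matches >= 2, c("name", "monthly_cost")]
    print(recommended)   # only the hypothetical 330e satisfies both constraints

A real constraint-based recommender would additionally evaluate the incompatibility constraints (IC) and trigger repair actions when the result set is empty, as described above.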

2.2 Explanations in Recommender Systems

2.2.1 Goals of Explanations

The general function of explanations in recommender systems is to give the user more information about the recommendation result. Sinha and Swearingen showed that users like transparent recommendations, and the more the user has to "pay", the more transparent the system needs to be [15]. A more detailed look reveals that explanations can serve different goals. Tintarev et al. defined seven different goals [16]. These goals give a good overview of the possible functions explanations serve.

1. Transparency (Explain how the system works)

2. Scrutability (Allow users to tell the system it is wrong)

3. Trust (Increase users’ confidence in the system)

4. Effectiveness (Help users make good decisions)

5. Persuasiveness (Convince users to try or buy)

6. Efficiency (Help users make decisions faster)

7. Satisfaction (Increase the ease of usability or enjoyment)

2.2.2 Types of Explanations

There are a multitude of different explanations available. How an explanation can be generated depends on the type of system it is implemented in. A common categorization is to differentiate based on the type of recommender system they are part of.

Gerhard Friedrich and Markus Zanker give an in-depth analysis of explanation types and create a taxonomy for explanations in their report A Taxonomy for Generating Explanations in Recommender Systems [17].

Furthermore, explanations can be personalized as well as unpersonalized, with personalized explanations reflecting aspects of the user's profile. A personalized explanation for a camera recommender system could be: "This was recommended to you because it fits your wish for a mid-range price with a full-frame sensor." Tintarev et al. found in their study Effective Explanations of Recommendations: User-Centered Design [16] how important it is to personalize explanations: personalized explanations seem to be more persuasive.

The explanations used in this study are personalized and use text as well as icons to explain the recommendation. Figure 4 shows the explanations of BMW's recommender.

Figure 4: Explanations of BMW’s recommender with (r) and without active hover state (l)

While the research of Tintarev et al. shows the importance of explanations from a user-centered perspective, the link to established UX constructs like user engagement has not been investigated yet.

2.3 User Engagement

User engagement is a quality of user experience characterized by the depth of an actor's investment when interacting with a digital system (O'Brien, 2016a). Nowadays user engagement is becoming accepted as a key metric to measure the user experience of web applications. Applications always have an advantage if they engage the user. Even when there are no competing applications, making the experience engaging for the user is a good idea, since it increases user satisfaction. And who does not want to have satisfied customers? Research on user engagement in recommender systems has explicitly been done in only three studies [18, 19, 20], while many others investigated areas of the user experience that are thought to have an impact on user engagement.


In [18] Wu et al. present a solution for optimising long-term user engagement by modifying the search algorithm, modelling it to optimise user clicks and return behaviour. Similarly, Zou et al. introduce a framework for optimising long-term engagement with reinforcement learning on social network sites in [20].

An approach more similar to this study can be found in [19], where Freyne et al. demonstrate how adding recommendations during the sign-in process on social media portals increases the user's engagement. Both that study and this one investigate the influence of recommender components on user engagement.

2.3.1 Measuring User Engagement

The most common ways of measuring user engagement can be categorised into:

1. behavioural metrics like visits of a webpage, bounce rate and dwell time

2. neuro-physiological measures like eye tracking

3. self-report measures like questionnaires

All previously mentioned studies on engagement in recommender systems [18, 19, 20] use solely behavioural metrics for measuring engagement. But since the above-mentioned measures each have their own advantages and limitations, it is not uncommon to combine different techniques to counteract the individual limitations of each method. This study combines behavioural metrics with self-report measures to gain an understanding of the user engagement of BMW's recommender.

2.3.2 User Engagement Scale Short Form

The User Engagement Scale Short Form (UES-SF) falls into the category of self-report measures, being a multidimensional questionnaire. It is a derivative of the User Engagement Scale (UES), which consists of 31 items. H.L. O'Brien et al. analysed the UES and concluded that a condensed version which tracks four instead of six variables leads to comparable results. This conclusion led to the development of the UES-SF.

The six variables of the UES are the following:

1. FA: Focused attention, feeling absorbed in the interaction and losing track of time (7 items).

2. PU: Perceived usability, negative affect experienced as a result of the interaction and the degree of control and effort expended (8 items).

3. AE: Aesthetic appeal, the attractiveness and visual appeal of the interface (5 items).

4. EN: Endurability, the overall success of the interaction and users' willingness to recommend an application to others or engage with it in future (5 items).


5. NO: Novelty, curiosity and interest in the interactive task (3 items).

6. FI: Felt involvement, the sense of being “drawn in” and having fun (3 items).

The UES-SF uses four variables. FA, PU and AE are taken from the original UES. The fourth one, the reward factor (RW), has been newly introduced and combines the three remaining UES variables into one.

In the study by Moshfeghi et al. [13] the influence of a timeline and a named-entity element on user engagement is investigated with the UES. The study finds significant improvements when implementing the mentioned UI elements in the web application. This study uses a very similar approach to measure the influence of explanations on a recommender system's engagement scores.

2.3.3 A/B Testing

A/B testing is the process of comparing two versions of a product. Most commonly, two versions of a website or of digital content like advertisements get compared. The goal of an A/B test is to find out which of the tested versions performs better.

Which metric gets measured depends on the goal of the study. Common metrics are the click-through rate (CTR) for advertisements or the average time on page for a website. Typical metrics tracked for a website include:

1. Pageviews

2. Average Time on Page

3. Bounce Rate

4. Exit Rate

5. Pages per Session

The advantage of A/B testing is that it is possible to test large numbers of users with relatively low effort and cost. This large amount of data makes it possible to detect even small differences in performance.

The disadvantage of A/B testing is that it gives no information about why the measured results occur. That is why it is advisable to pair A/B testing with other methods that give insight into these matters.

While A/B testing could tell us that, for example, users tend to bounce less on the results page with explanations, it does not tell us why this happens. Paired with a method like the UES, however, we can find out which of the four tracked variables differ between the versions with and without explanations and thereby explain why users interact more with one version.
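As a concrete illustration of how such an A/B comparison can be evaluated, the sketch below tests whether the bounce rates of two versions differ, using R's standard two-sample test for proportions. The counts are made up for illustration; the study itself interprets the Google Analytics metrics with the help of the MTurk experiment rather than with a significance test.

    # Hypothetical A/B data: bounced sessions and total sessions per version
    bounces  <- c(310, 262)   # version A, version B
    sessions <- c(500, 500)

    # Two-sample test for equality of proportions (chi-squared based)
    prop.test(bounces, sessions)
    # A small p-value would indicate that the bounce rates of the two
    # versions differ by more than random variation would explain.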


2.4 Hypothesis Testing

To gain insights from the data gathered with quantitative tools like the UES, the observations are typically analysed statistically. Depending on the significance of the measured difference, conclusions can be drawn with high probability. This study compares the two data sets with the common statistical method of hypothesis testing. There are multiple versions of hypothesis testing, but all weigh the probability of a null hypothesis against an alternative hypothesis. One of them is Fisher's method, which uses the so-called p-value to judge how plausible a hypothesis is. Fisher's test has been further developed by researchers like Neyman and Pearson. Today hypothesis testing is mainly done with null hypothesis significance testing (NHST).

The core concept of all of these methods is best explained with an example. Let's say we want to test whether a coin is fair, which means that when tossed it shows head (H) 50% of the time and tail (T) the other 50%. To do this we run an experiment, toss the coin 100 times (n = 100) and track whether it shows head or tail. This becomes our experiment data D (D = T, H, ..., T). For each NHST we need a null hypothesis (H0) and an alternative hypothesis (H1). In our case we want to test whether the probability of head is 0.5. From this we get our null hypothesis H0: p(H) = 0.5. The alternative hypothesis is always its opposite, so H1: p(H) ≠ 0.5.

With the data D collected during our experiment we can calculate the probability of obtaining a result at least as extreme as ours under the assumption that H0 is true. This number is called the p-value, or p for short. In our case p = 0.10.

We then look at how small p is and decide whether we reject the null hypothesis, which means we favour the alternative hypothesis, or fail to reject the null hypothesis and keep treating it as plausible.

Typically H0 is rejected if p < 0.05, which means the probability of obtaining the results we saw in our experiment is below 5% if the null hypothesis is true. In our experiment we got p = 0.10, so we fail to reject H0 and therefore believe that the coin is fair.

Figure 5 shows a visualization of the probability p = 0.05 on a normal distribution.
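The coin example can be reproduced in a few lines of R. This is a minimal sketch, assuming simulated tosses; a real analysis would of course use the recorded data D.

    # Simulate 100 tosses of a fair coin and test H0: p(Head) = 0.5
    set.seed(1)                                     # reproducible illustration
    tosses <- sample(c("H", "T"), size = 100, replace = TRUE)
    heads  <- sum(tosses == "H")

    # Exact binomial test of the null hypothesis p = 0.5
    binom.test(heads, n = 100, p = 0.5)
    # If the reported p-value is below 0.05 we reject H0; otherwise we
    # fail to reject it and keep treating the coin as fair.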

3 Method and Implementation

3.1 Link between research questions and methods

The research question of this study is: Do explanations in constraint-based recommender systems influence the user engagement? Since user engagement is multi-faceted and hard to measure, it is common to use multiple methods to investigate it. The methods used in this study are:

1. Application of the User Engagement Scale Short Form (UES-SF) questionnaire and statistical analysis of the gathered data

2. A/B testing of the live application with Google Analytics


Figure 5: Visualization of the probability p = 0.05 on a normal distribution

3.2 Work Process

The work on this study started with the design of the recommender system in January 2020. The knowledge base of the recommender system was built with the help of the management and product planning of BMW Leasing Thailand. A Bangkok-based agency developed and deployed the finished product. The version of the recommender system used in this study was launched on the 8th of May 2020. It can be found under this URL:

www.yourbmwleasingthailand.com/bmw/car-preference

After the launch of the recommender, the application was prepared for the A/B test. A script was installed on the server distributing visitors equally between two versions of the recommender, one with and one without explanations. Google Analytics was already integrated on the website. Also, two links to the application were configured that lead to specific versions of the recommender: one link leads to the version with explanations, the other to the version without explanations. Those links would later be used in the assignment description (HIT) for the Amazon Mechanical Turk study.

Next, the social media advertising campaign was prepared and its assets designed. The campaign, and with it the A/B testing, started on the 4th of June. During this time the Amazon Mechanical Turk study was prepared and finally executed on the 14th of July 2020.

3.3 Approach

Since user engagement is relatively hard to measure with a qualitative approach, I decided to go for a quantitative approach. The ability to drive traffic to the web application via BMW Thailand's channels favoured a case study approach, which allowed studying the effect of explanations in a real-world setting. The quantitative part of this within-subjects study was conducted with the crowdsourcing platform MTurk. Crowdsourcing was used because it gives access to a large workforce at relatively low cost; this access to a large workforce also reduced the risk of a population bias.

3.4 Design

3.4.1 User Engagement Scale Short Form Analysis

To measure the user engagement of each system, the User Engagement Scale Short Form (UES-SF) was utilized. The UES-SF was chosen over the UES because it can be completed in 5-10 minutes instead of 15 minutes [12], while still providing results that match those of the lengthier UES. More information about the differences between the two questionnaires can be found in the section Theoretical Framework.

To answer the research question, an experiment on the crowdworking platform Amazon Mechanical Turk (MTurk) was conducted. MTurk was chosen because it provides access to a large workforce at limited cost. The general idea of the experiment was to give workers the same tasks to complete on the recommender system with explanations as well as on the system without explanations, measure the user engagement with the UES-SF and then compare the engagement scores.

3.4.2 Pilot Studies

To test the setup of the MTurk assignment and the HIT description before the actual study, two pilot studies were conducted. The first pilot followed a between-subjects approach, where each participant was presented with only one version of the recommender. After conducting the first pilot, a second one was launched with a within-subjects approach, where each participant was presented with both versions of the recommender system.

3.4.2.1 Between Subjects Study

The pilot did not aim to gather data to compare the two interfaces, so participants were only sent to the version of the recommender without explanations.


The complete task description can be found in Appendix B. The study was set up on MTurk as seen in Table 4; the participation requirements are listed in Table 5.

Table 4: Setup of Between Subjects Pilot

Reward per response 1.10 USD

Number of respondents 10

Time allotted per Worker 20min

Survey expires in 3 days

Auto-approve and pay Workers in 2 days

Table 5: Participation requirements for Between Subjects Pilot

HIT Approval Rate (%)       > 95%
Number of HITs approved     > 1000
Car Ownership               true

These high requirements should ensure the quality of the gathered data. Car ownership was set as mandatory to ensure identification with the task and an understanding of the topic of acquiring a car.

The pilot study was completed on the 29th of June 2020 and took four hours to complete. The majority of the accepted answers were submitted in the first 30 minutes. The answers of each submission were checked for validity. For tasks two and three the answers were checked against the catalogue of possible answers, while the survey answers were checked for unusual patterns.

The percentage of submissions with answers outside of the catalogue was very high at around 70%. While all participants filled out the survey reasonably, the answers to tasks two and three were more often faulty than correct.

Most of the time participants did not enter cars that would match the defined requirements of the task, but cars either outside the price range or with attributes different from those instructed. Often these were cars in the top price range like the i8 or cars of the M series. One participant even submitted Lexus and Toyota models.

I decided to change the task description to reduce the number of possible answers and thereby simplify validation.

One worker contacted me by mail, mentioning that the allotted time for completing the HIT was not enough, since workers usually work on multiple HITs at once. I decided to remove task three from the HIT in the final study and allot five more minutes for the completion of the survey, making the HIT less demanding in the hope that workers would feel less rushed and therefore submit more conscientious answers.

All results can be downloaded as a CSV file generated by the platform. In this file the submitted results were oddly split across table cells, which made analysis a tedious process.


The submitted survey data was put into a spreadsheet to calculate the means of each sub-scale and the engagement score as the sum of all means. A high engagement score indicates strong engagement. A mistake in the survey setup was spotted here: the scoring of the perceived usability sub-scale needs to be inverted, as instructed by O'Brien et al. [12]. The questions of this sub-scale ask about negative experiences and therefore "strongly disagree" needs to be counted as 5 instead of 1. The HTML of the survey was adjusted accordingly.
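The reverse scoring itself is a one-liner; the sketch below shows it for a hypothetical vector of 5-point responses.

    # Reverse-score Perceived Usability items on a 5-point scale,
    # so that "strongly disagree" (1) counts as 5 and vice versa.
    pu_raw      <- c(1, 2, 5, 4, 3)   # hypothetical item responses
    pu_reversed <- 6 - pu_raw         # 1 -> 5, 2 -> 4, ..., 5 -> 1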

3.4.2.2 Within Subjects Study

The budget of the study limits the number of participants to 120. In a between-subjects approach this would allow for 60 participants in each condition. Since 60 participants might be a sample size too small to demonstrate a statistically significant influence on engagement, this pilot was conducted. With a within-subjects design, each participant gives two measures, which results in 120 data points for each version of the recommender system.

The first pilot was designed as a between-subjects study out of the assumption that participants would take too long to test both systems and fill out the UES-SF twice, so that 1 USD would not be sufficient compensation and respondent fatigue would appear. Because of this, the second pilot's HIT was shortened by removing the first task of pilot one, which only served to familiarise participants with the application. The full task description can be found in Appendix 4.

The within-subjects pilot was conducted on the 4th of July 2020 with 10 participants on Amazon's Mechanical Turk platform and took around five hours to complete. Although no participation requirements were used for this pilot, the rejection rate was lower than that of the previous pilot.

The study was set up on MTurk as follows:

Table 6: Setup of the second pilot

Reward per response 1 USD

Number of respondents 10

Time allotted per Worker 1 hour

Survey expires in 2 days

Auto-approve and pay Workers in 2 days

All lessons learned from the first pilot were applied to this pilot. The biggest lesson from the previous pilot was to allot more time so as not to stress participants. Even though the assignment took participants 17 minutes and 43 seconds on average, the HIT allowed up to one hour for completion. I added the approximate duration of the HIT (15 min.) to its title during setup to let participants know the approximate duration before they commit to it.

Although this pilot did not focus on gathering data but on testing the design of the study, the submitted answers were analysed in an Excel sheet. The mean of the engagement scores of all participants for the application with explanations was 2.3% higher than the mean of the engagement scores for the application without explanations. The full collected data set can be seen in Appendix 4.

As the assumption that participants would take too long to finish the HIT did not prove true, it was decided to conduct the full study with a within-subjects design.

3.4.2.3 Full Study

The study was based on the HIT of the second pilot. An additional question, following Moshfeghi et al., was added to the task description that asks participants which version of the system they prefer. The full task description can be found in Appendix 5.

As the data from the second pilot indicated that the effect of explanations might be small and therefore hard to measure with the UES-SF alone, an additional set of questions was added. These questions were aimed at measuring differences in the attributes that were defined as the goals of explanations by Nava Tintarev and Judit Masthoff.

The following questions were added to the survey, each set aiming to measure a different variable connected to explanations.

1. Effectiveness

(a) I could find cars more suitable to me without the application
(b) The application helped me find the right cars for me

2. Efficiency

(a) I was able to decide quickly which cars are right for me

3. Persuasiveness

(a) I would consider buying an offer I found on this website
(b) The application made me want to buy a car

4. Transparency

(a) I understand why the application recommended me the cars in my results

Like the rest of the survey, these questions were measured on a five-point Likert scale.

An increase in user satisfaction was identified by Nava Tintarev and Judit Masthoff [3] as one of the aims of explanations in recommender systems. Their definition of satisfaction is the increase of usability or enjoyment. A connection to the User Engagement Scale can be drawn here, as one of its subscales measures perceived usability. Usability and usefulness are seen as the backbone of user engagement.

Furthermore, satisfaction is also seen as a component of user engagement. An increase in self-reported user satisfaction has been linked to an increase of user engagement by Chapman [21]. Wiebe et al. analysed the UES long form in the context of video games [22] and came to the conclusion that the six-factor UES can be reduced to a four-factor one, the fourth factor being satisfaction. Wiebe et al.'s observations, in addition to others, led to the creation of the UES-SF. The questions of Wiebe et al.'s sub-scale used to measure satisfaction are taken from different UES sub-scales. The questions are:

1. I would continue to go to this website out of curiosity.

2. I would recommend playing the game on the website to my friends and family.

3. Playing the game on this website was worthwhile.

4. My gaming experience was rewarding.

5. This gaming experience was fun.

6. I felt discouraged while on the website.

Two of these questions are already included in the current questionnaire (Nr. 3, Nr. 4) in their task-adjusted form. I added questions Nr. 2 and Nr. 6 to the final questionnaire.

In their paper A Survey of Explanations in Recommender Systems [3], possible ways of analysing satisfaction are discussed. Tintarev and Masthoff propose to directly ask users which system they preferred, the one with or the one without explanations. This question is already covered in the task description. Another proposition is to simply ask if the system is fun to use.

With these additional questions a more differentiated analysis of the user engagement might be possible.

I concluded the study with the following demographic questions:

• Occupation
• Education
• Age
• Gender

Additionally, two questions were included to measure the attitude towards the brand BMW. The full study description can be found in Appendix 4.

To prevent the influence of decision fatigue and of the task descriptions, four batches of HITs with different task orders and recommender orders were launched. Each batch was released 25 times for a total of 100 participants. Table 7 gives more detail about the configuration of each batch.

The HIT was set up on MTurk for the first two batches as seen in Table 8 and Table 9. The setup of the last two batches had to be adjusted as explained in the section Data Collection.


Table 7: Batch Setup

Batch Task Order Recommender Order

1       A -> B       No Explanations -> Explanations
2       B -> A       Explanations -> No Explanations
3       A -> B       Explanations -> No Explanations
4       B -> A       No Explanations -> Explanations

Table 8: Setup of first two batches

Reward per response 1 USD

Number of respondents 25

Time allotted per Worker 1 hour

Survey expires in 7 days

Auto-approve and pay Workers in 2 days

3.4.3 A/B Testing with Google Analytics

An advertising campaign on Facebook and the messenger application Line, which is Thailand’s most popular chat application and comparable to WeChat, drove visitors to the recommender system. Figure 6 shows screenshots of the published advertisements on Facebook.

A script on the web server hosting the website then distributed the visitors in a ratio of 50/50 between the two versions of the recommender system. Google Analytics tracked the behaviour of the visitors on the results page, which either showed explanations for the recommendations or not, depending on which version the visitor landed on. Then the average time spent on the page as well as the bounce rate of the two versions were compared.

3.5 Data Collection

3.5.1 A/B Testing

The data was collected during the period from the 4th of June to the 19th of July 2020. A social media ad campaign started on the 4th of June and lasted 14 days, until the 17th of June. The majority of visitors came during the campaign period, as can be seen in Figure 7. The later spike of visitors from the 13th to the 15th of July was due to an event of BMW Thailand which drove visitors to the application.

3.5.2 UES-SF

As previously mentioned, the survey was released in four batches of 25 assignments each. The first batch was launched on Monday the 13th of July 2020. The submitted data was checked for validity by checking the answers to the attention questions, looking for unusual patterns, checking for previous participation in the pilot studies and spotting invalid responses that indicate that a worker did not read the instructions.


Table 9: Requirements of first two batches

HIT Approval Rate (%)       > 95%
Number of HITs approved     > 1000
Car Ownership               true

Figure 6: Advertisements on Facebook

Typical rejected submissions included car brands other than BMW. The first batch resulted in 24 approved submissions and 11 rejected ones. All following batches were checked against the same criteria. The second and third batch were launched on the following day, the 14th of July. Batch 2 had submissions of significantly lower quality than the previous batch, which is why the participation requirements were tightened to a 98% approval rate for batches 3 and 4. Batch 4 was launched on Wednesday the 15th of July. The quality of submissions increased after the adjusted approval rate requirement. Table 10 shows a summary of the number of rejected and approved submissions per batch.

The data collection ended on Wednesday evening with a total of 99 approved submissions and 56 rejections across the four batches.


Figure 7: Visitors over the data collection period

Table 10: Overview of batch statistics

Batch Approved (A) Rejected (R) Rejection Rate (R/A) Avg. Time per Assignment

1       24      11      0.46      26 minutes 38 seconds
2       25      23      0.92      21 minutes 43 seconds
3       25      15      0.60      16 minutes 57 seconds
4       25      7       0.28      15 minutes 3 seconds

Total   99      56      0.57      20 minutes 1 second

As Table 10 shows, the average time per assignment decreased considerably with every newly released batch.

The collected data was downloaded as four CSV files, one for each batch. The pilot studies showed that transferring the answers manually into the Excel sheet used for analysis was a time-consuming and error-prone process. That is why the files were parsed by a helper application which removes all fields besides the approved answers. The helper application was written in JavaScript with the React framework. The source code of the application can be found here: https://github.com/fxrl/thesis-analysis-helper. As mentioned before, the submitted answers are delivered in JSON notation by Amazon MTurk. The helper application formats the answers into a human-readable form and exports a CSV file.

3.5.2.1 Participant Demographics

Participants were asked about the following demographics: age group, gender, education and occupation. The possible answers for each demographic can be seen in Table 11.

Table 11: Demographic questions with possible answers

Demographic            Possible Answers
Age                    Prefer not to say, 18-25, 26-35, 36-45, 46-55, 55 and older
Gender                 Prefer not to say, male, female, other
Occupation             Employed, self-employed, student, not employed
Level of education     Prefer not to say, High School degree, Bachelor's degree, Master's degree, PhD

Out of all participants (n=99), 58% were male (n=57) and 42% were female (n=42). The majority of participants were employed by either a company or an organisation (n=71), 16 participants answered that they are self-employed and 3 answered that they are students. 9 participants answered that they are not employed. 51 participants claimed to have a Bachelor's degree or equivalent, 28 a Master's degree or equivalent and 18 a High School degree or equivalent. One did not want to answer the question about their education level and one person answered to have a PhD.


Most of the participants were in the age group 26-35 (n=45); the second biggest age group was 36-45 with 21 participants, followed by 15 participants aged 46-55. 10 participants were 55 years or older and 7 participants were 18-25 years of age.

3.6 Data Analysis

3.6.1 A/B Testing

The analysed data was collected in the period from the 4th of June to the 19th of July 2020. Because the script on the server that distributes the visitors between the recommender versions was already installed before the start of data collection, slight differences in the number of visitors for each version can be observed. Looking at the entire time frame since the implementation of the server-side script, the number of users is identical on both versions.

Before the start of the social media campaign, visitors to the application were mainly myself, my colleagues or the developer. Therefore the data before the ad campaign is highly biased, as it does not represent the browsing behaviour of regular users.

The collected data was interpreted with the help of the MTurk experiment results.

3.6.2 UES-SF

The following have been analysed between the two conditions, with explanations and without explanations:

1. Difference between the engagement score means

2. Difference of the UES subscale means

3. Difference between the goals of explanations

The data was collected in an Excel sheet, calculating the means for every subscale of the UES and an overall engagement score per condition of each submission. As recommended by H.L. O'Brien et al., the scoring of the questionnaire uses the calculated mean of each subscale. The overall engagement score is the sum of the means of all four subscales. As an example:

Table 12: Example calculation of engagement scores

Subscale FA PU AE RW Overall Engagement

Mean 3 3 4 4 14
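A minimal sketch of this scoring in R, for a single submission and one condition; the item responses are hypothetical and three items per sub-scale are assumed for brevity.

    # Hypothetical UES-SF item responses for one participant and one condition,
    # with the PU items already reverse-scored as described in section 3.4.2.1
    responses <- list(
      FA = c(3, 3, 3),
      PU = c(3, 3, 3),
      AE = c(4, 4, 4),
      RW = c(4, 4, 4)
    )
    subscale_means   <- sapply(responses, mean)
    engagement_score <- sum(subscale_means)   # 3 + 3 + 4 + 4 = 14, as in Table 12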

The statistical analysis was done with the statistics software R and RStudio. The scores were compared with a paired two-samples t-test, which returns the probability of observing the measured data under the hypothesis that the means do not differ.

Additionally, the effect size was calculated to gain an idea of the magnitude of the observed effect and to allow an interpretation of how practically important it is. For this, Cohen's d was calculated.
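A minimal sketch of this comparison in R; es_with and es_without stand for the per-participant engagement scores of the version with and without explanations and are filled with made-up values here, not the real study data.

    # Hypothetical engagement scores, one pair per participant
    es_with    <- c(14.2, 13.8, 15.0, 12.6, 14.9)
    es_without <- c(13.9, 13.5, 14.8, 12.8, 14.1)

    # Paired two-samples t-test of the mean difference
    result <- t.test(es_with, es_without, paired = TRUE)
    print(result)

    # Cohen's d for a paired design, using d = t / sqrt(n)
    n <- length(es_with)
    d <- unname(result$statistic) / sqrt(n)
    d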

3.7 Validity and Reliability

All cited academic literature and articles were carefully selected and are peer-reviewed or from reputable sources to ensure the validity of the information. Also, the delimitation of the study has been clearly defined.

The User Engagement Scale underwent extensive testing during its creation by O'Brien and others who developed the survey further. It has been successfully used to measure user engagement in different contexts and has proved to be a robust tool for doing so. Its practical use is clearly described by O'Brien, and I followed these instructions carefully.

The collected data was published on my website for additional transparency and the source code of the helper application was published on GitHub.

The statistical analysis of the collected data has been conducted with NHST, which has a long history of use and is a well-established method in academic statistics. Although NHST has been criticised in the past because its results are often misinterpreted, it is still considered an essential method for comparing two sample means.

Google Analytics, as the industry leader among online analytics platforms, is of high quality and is able to measure over 95% of visits to the website. The measured statistics are therefore considered to be of high accuracy.

The interpretation of the results has been done truthfully and to my best knowledge of academic standards.

4 Empirical Data

4.1 Google Analytics

Over the period from the 4th of June 2020 to the 19th of July 2020, 6838 visitors visited the application. Of these, 663 visited the results page with explanations and 672 visited the results page without explanations. Table 13 shows the collected data from Google Analytics.

Table 13: Google Analytics data for the recommender results page from 4th of June 2020 to 19th of July 2020

Version                Pageviews   Unique Pageviews   Avg. Time on Page (s)   Bounce Rate   % Exit
with explanations      1293        663                43.97                   72.73%        22.74%
without explanations   1251        672                52.83                   63.64%        21.90%
Total                  2544        1335               48.35                   68.18%        22.33%

The column Pageviews counts how often users viewed the results page of each version. Unique Pageviews is the number of users visiting the page; one user can be responsible for multiple pageviews. Avg. Time on Page shows how long users stayed on the page, in seconds. The Bounce Rate shows the rate of users leaving the page without visiting any other page of the application, while the column % Exit shows the percentage of users that left the application from the measured page.

4.2 Amazon Mechanical Turk Experiment

99 participants of the study submitted results that fulfilled all criteria for them to be included in the analysis. The tracked variables of the survey were:

• UES-SF Variables
  1. Aesthetic Appeal
  2. Focused Attention
  3. Perceived Usability
  4. Reward Factor

• Explanation Goals
  1. Satisfaction
  2. Transparency
  3. Efficiency
  4. Effectiveness
  5. Persuasiveness

For each of the listed variables a score between one and five was calculated, with five indicating high engagement for the UES variables or, respectively, a strong positive influence on the corresponding explanation goal. The calculated means of each participant for every sub-scale can be found in Appendix 5 and 6.

4.2.1 UES-SF Variables

Figure 8 shows the means for each tracked variable between the recommender with and without explanations.

As one can see in Figure 8 each variable scored a slightly higher mean on the variation of the recommender system with explanations. The overall difference in the engagement means can be seen in Figure 9. The mean differences for each scale and the total engagement (ES) can be seen in Table 14.

Table 14: Mean Differences of the UES-Subscales and total Engagement Score

Scale AE-S PU-S RW-S FA-S ES

Mean Difference 0.08 0.14 0.09 0.06 0.38


Figure 8: Overview of measured UES-SF variable means. (E = explanations / NE = no explanations)

Figure 9: Overview of measured total engagement scores across versions.

As can be seen in Table 14, the mean difference between the two versions was relatively stable across all sub-scales at around 2%. The only exception is the perceived usability sub-scale, whose mean difference was higher at around 3%.

4.2.2 Goals of Explanations

Figure 10 shows a bar diagram of the measured mean scores for each explanation goal in the version with explanations (E) and without explanations (NE).

Figure 10: Mean scores of each explanation goal.

The abbreviations on the scale can be interpreted as follows:

1. Satisfaction (SA)

2. Transparency (TRA)

3. Efficiency (EFI)

4. Effectiveness (EFE)

5. Persuasiveness (PER)

The biggest measured differences can be seen on the effectiveness (EFE) and efficiency (EFI) scales. Interestingly, the transparency scores of the two versions seem not to differ from each other.

5 Analysis

In this section the research question Do explanations in recommender systems influence the user engagement? is statistically investigated.


5.1 Amazon Mechanical Turk Experiment

Data collected with the MTurk experiment was analysed in the statistics tool RStudio. A null hypothesis significance test was conducted with the t-test function in R. The paired Student's t-test provides a hypothesis test for comparing the means of a pair of random samples. The test statistic is calculated with this formula:

t = m / (s / √n)

where m is the mean of the differences, n is the sample size and s is the standard deviation of the differences. Cohen's d was then calculated to measure the magnitude of the observed effect with this formula:

d = t / √n

All calculations shared the same hypotheses. The null hypothesis is that the mean difference between the values collected for the recommender system with explanations and the version without explanations is 0. The alternative hypothesis is that the mean difference is not 0. In mathematical notation, where H0 is the null hypothesis, H1 is the alternative hypothesis and m is the mean difference:

H0: m = 0
H1: m ≠ 0

5.1.1 UES-SF

As seen in the previous section, the mean scores of all sub-scales were higher on the version of the recommender system with explanations than on the version without explanations. This results in a higher total engagement score for the version with explanations. This analysis investigates whether the observed results are of statistical significance.

The following things were checked for statistical significance.

1. Difference in means of each sub-scale of the UES

2. Difference in means of total engagement score of the UES

5.1.1.1 Difference in means of each sub-scale

All sub-scales were analysed with a paired t-test in R. The results can be seen in Table 15.

As the p-values of the perceived usability (PU) and the aesthetic appeal (AE) are inferior to 0.05 one can conclude statistical significance of the mean differences between the two versions of the recommender. Taking a look at Cohen’s D of both sub-scales the value for PU lies around 0.2, which indicates a small effect, while the value for AE lies at around 0.1 which implies no effect of importance.

For the focused attention and reward factor sub-scales no statistically significant difference in means could be measured.


Table 15: Results of paired t-test analysis of all sub-scales of the UES-SF.

Sub-Scale   t        Degrees of Freedom   p-value   95% Confidence Interval         Cohen's d
PU          2.3398   98                   0.02132   [0.02198765, 0.26757464]        0.1633233
AE          2.111    98                   0.03732   [0.004743185, 0.153505973]      0.1267189
FA          1.0645   98                   0.2897    [-0.05528588, 0.18323200]       0.07831773
RW          1.7297   98                   0.08684   [-0.01289581, 0.18797999]       0.1359432
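The per-sub-scale tests could be run along the following lines. This is only a sketch, assuming hypothetical data frames ues_E and ues_NE with one column per sub-scale (PU, AE, FA, RW) and one row per participant; the same pattern applies to the explanation-goal variables analysed in section 5.1.2.

    subscales <- c("PU", "AE", "FA", "RW")

    results <- lapply(subscales, function(sc) {
      tt <- t.test(ues_E[[sc]], ues_NE[[sc]], paired = TRUE)
      d  <- unname(tt$statistic) / sqrt(nrow(ues_E))  # Cohen's d as defined in section 5.1
      data.frame(subscale = sc,
                 t  = unname(tt$statistic),
                 df = unname(tt$parameter),
                 p  = tt$p.value,
                 d  = d)
    })
    do.call(rbind, results)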

5.1.1.2 Difference in means of total engagement score

Each engagement score was stored in a tab-delimited text file and imported into an R data frame. The results of the paired t-test on the total engagement scores can be seen in Table 16.

Table 16: Results of the paired t-test for the total engagement score

t        Degrees of Freedom   p-value    95% Confidence Interval        Cohen's d
2.6555   98                   0.009243   [0.09487029, 0.65597146]       0.1778632

Since the p-value is below 0.05 we can reject the null hypothesis and accept the alternative hypothesis that the true difference in means is not equal to 0. A Cohen's d of approximately 0.2 is considered a small effect size.
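A minimal sketch of the loading and testing step, assuming a hypothetical tab-delimited file engagement_scores.txt with one row per participant and columns ES_E and ES_NE for the two versions:

    # Read the tab-delimited engagement scores into a data frame
    scores <- read.delim("engagement_scores.txt")

    # Paired t-test on the total engagement scores
    t.test(scores$ES_E, scores$ES_NE, paired = TRUE)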

5.1.2 Explanation Goals

Table 17 shows the results of the paired t-test for all tracked explanation goal variables.

Table 17: Results of paired t-test analysis of all explanation goals.

Goal   t          Degrees of Freedom   p-value   95% Confidence Interval         Cohen's d
EFE    2.7724     98                   0.00666   [0.05167442, 0.31196194]        0.2545133
EFI    2.3994     98                   0.01831   [0.03057029, 0.32296506]        0.2319612
SA     1.649      98                   0.1024    [-0.02157809, 0.23369930]       0.138548
PER    1.2652     98                   0.2088    [-0.06890412, 0.31132836]       0.1619671
TRA    -0.59805   98                   0.5512    [-0.1962831, 0.1053740]         0.1024384

A statistically significant effect was measured for the effectiveness and efficiency goals. Both variables have a Cohen's d of approximately 0.2, which is considered a small effect size.

5.1.3 Preferred Application

The survey also asked participants about their preferred application after finishing the scenario tasks of the study. Figure 11 shows the answers in a pie chart.

As one can see from Figure 11, a majority preferred the version of the recommender with explanations. Tintarev and Masthoff [3] explicitly mention this question as a measure of the explanation goal satisfaction.


Figure 11: Pie chart of the preferred version

5.1.4 Attitude towards BMW

Participants were also asked about their general opinion of the brand BMW. Figure 12 and Table 18 show that the vast majority of participants have either a good or very good opinion of the brand.

Table 18: Number of answers for each attitude

Attitude       very good   good   neutral   bad   very bad
Answer Count   52          37     9         1     0

To investigate whether the overall brand attitude influences the measured engagement score, Kendall's correlation coefficient τ was calculated for the version with explanations and the version without explanations. The results can be seen in Table 19.

Table 19: Correlation between attitude and engagement score

Version                p-value    z        τ
with explanations      0.002761   2.9931   0.2447946
without explanations   0.003818   2.8928   0.2356582


Figure 12: Attitudes visualized in a pie chart

A τ value greater than 0 indicates a positive correlation between the values. Since both correlations have a p-value below 0.05, we can say that there is a small positive correlation between the attitude towards BMW and the measured engagement score.

Figure 13 shows the plots of the correlation between the measured attitude (2 = bad, 3 = neutral, 4 = good, 5 = very good) and the engagement score of the participants for each recommender version.
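A sketch of this correlation check, assuming a hypothetical data frame survey with a numeric attitude column (coded as in Figure 13) and one engagement score column per version:

    # Kendall's tau between brand attitude and engagement score,
    # computed separately for each version of the recommender
    cor.test(survey$attitude, survey$engagement_E,  method = "kendall")
    cor.test(survey$attitude, survey$engagement_NE, method = "kendall")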

5.2 Google Analytics

Over the period from the 4th of June 2020 to the 19th of July 2020, 6838 visitors visited the application. Of these, 663 visited the results page with explanations and 672 visited the results page without explanations. Table 20 shows the collected data from Google Analytics.

5.2.1 Average Time on Page

The data in Table 20 shows that the average time that users spent on the results page with explanations is almost 10 seconds less than on the version without explanations.


Figure 13: Correlation between attitude and engagement score. (a) with explanations, (b) without explanations

Table 20: Google Analytics data for the recommender results page from 4th of June 2020 to 19th of July 2020

Version                Pageviews   Unique Pageviews   Avg. Time on Page (s)   Bounce Rate   % Exit
with explanations      1293        663                43.97                   72.73%        22.74%
without explanations   1251        672                52.83                   63.64%        21.90%
Total                  2544        1335               48.35                   68.18%        22.33%

This observation can be interpreted in the light of the results of the MTurk experiment. Participants found that the recommender with explanations helped them find the right car faster than the version without explanations. This could lead to visitors clicking faster on the car they find interesting and therefore leaving the results page sooner.

5.2.2 Bounce Rate

Table 20 also shows that the bounce rate, meaning the rate of users not interacting with the page before leaving, is higher on the recommender version with explanations. The bounce rate in Google Analytics counts every visitor who did not navigate to any other page before leaving. In our case that could mean every user who scrolls over the results page but does not choose a car or offer to see more details. One way to interpret this data together with the results from MTurk is that the explanations help visitors to more easily distinguish which offers are right for them, without the need to click on a result and investigate further. The MTurk results show that visitors find that explanations help them to make a choice.
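The differences discussed here and in the findings section can be reproduced directly from the figures in Table 20:

    # Values taken from Table 20
    time_E   <- 43.97; time_NE   <- 52.83; time_all <- 48.35  # avg. time on page (s)
    bounce_E <- 72.73; bounce_NE <- 63.64                     # bounce rate (%)

    time_NE - time_E                     # about 8.9 seconds less with explanations
    (time_NE - time_E) / time_all * 100  # about 18.3% of the overall average
    bounce_E - bounce_NE                 # about 9 percentage points higher bounce rate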


6 Discussion and Conclusion

6.1 Findings

The t-test analysis of the gathered Amazon Mechanical Turk data shows that the biggest influence of explanations, specific to this recommender, was on the Perceived Usability of the system. Of all investigated explanation goals, Efficiency and Effectiveness were shown to be influenced the most in the MTurk experiment. Explanations significantly improved the total engagement score. These results were partly anticipated: differences were expected in the Perceived Usability and the Reward Factor, while the Aesthetic Appeal and Focused Attention were not expected to be influenced by adding explanations. Interestingly, the explanations did not prove to influence the Reward Factor. A reason for this could be that the reward of using an application is more related to the overall application workflow, so adding explanations, which only influence a small part of the experience, does not add much to the reward the user perceives.

The explanation goal results are interesting in that only Efficiency and Effectiveness were proven to be influenced by explanations. Surprisingly, no increase in the Transparency of the system was measured, although I expected to see a difference there. This might be a result of the question asking about transparency not being worded clearly enough. Only one question was used to track transparency, and unclear wording of that question might lead to a misleading observation.

Although Cohen's d of the measured effects indicates a small effect size, a larger difference can be seen in the metrics tracked with Google Analytics. The average time spent on the page differed by 18.3% between the versions with and without explanations, and the bounce rate differed by roughly 9 percentage points.

The results of the Google Analytics A/B test are, however, of limited power, as more metrics would be needed to support the interpretation of the results. What would have been interesting are the click events on the results page. In particular, how many users went to the single car page is a metric that could support the interpretation that the average time on page is lower because people find what they want faster. These metrics were not captured.

6.2 Implications

The findings described above have implications for both industry practice and research. It appears that explanations help the user to choose results quickly and effectively. One could argue that efficiency and effectiveness are both attributes that feed into the perceived usability and therefore increase the user's engagement with the recommender system. The study showed that explanations in recommender systems can influence user engagement positively.

In practice, when designing a recommender system, one should include explanations in the application. Integrating explanations will improve the users' engagement with the application. Researchers need to be aware of the fact that


explanations influence the users' engagement. When comparing systems with and without explanations, the measured user engagement will differ.

6.3 Conclusions

This study investigated the research question Do explanations in recommender systems influence the user engagement? The results of the case study show that explanations can have an effect on user engagement. A positive influence on user engagement was observed both in the Amazon Mechanical Turk experiment and in the Google Analytics data. In both cases users seem to have a better experience with the application when explanations for the recommendations are provided. Based on the findings it seems evident that explanations in recommender systems can influence the user experience positively. In a competitive market the user experience can be the property that decides the success or failure of a web application. My recommendation for industry practice is therefore to include explanations in recommender systems whenever possible.

6.4 Further Research

The main limitation of the study lies in the characteristics of the explanations. An interesting direction for further research is therefore the effect of explanations with different goals. The explanations in this recommender mainly influenced the goals efficiency and effectiveness, which in turn seemed to influence the perceived usability of the recommender. Different explanations might influence different properties. Another limitation of this study was the number and demographics of participants; an experiment with more participants and more people under the age of 26 would be interesting.

As the quantitative methods of this study only allow assumptions about why users prefer a system with explanations, it would be interesting to investigate the effect of explanations qualitatively. The magnitude of the effect of explanations would also make for an interesting study, as Cohen's d and the tracked Google Analytics measures imply different effect sizes.

An especially interesting line of research would be to find out how explanations influence the conversion rate of recommender systems. The results of this study indicate that explanations might have a positive influence on the conversion rate, as they allow users to make faster and better purchasing decisions.


References

[1] F. Ricci, L. Rokach, and B. Shapira, “Recommender systems: introduction and challenges,” in Recommender systems handbook. Springer, 2015, pp. 1–34.

[2] A. Felfernig, S. Schippel, G. Leitner, F. Reinfrank, K. Isak, M. Mandl, P. Blazek, and G. Ninaus, “Automated repair of scoring rules in constraint-based recommender systems,” AI communications, vol. 26, no. 1, pp. 15–27, 2013.

[3] N. Tintarev and J. Masthoff, “A survey of explanations in recommender systems,” in 2007 IEEE 23rd international conference on data engineering workshop. IEEE, 2007, pp. 801–810.

[4] I. Grigorik, Introducing Web Vitals: essential metrics for a healthy site, May 2020. [Online]. Available: https://blog.chromium.org/2020/05/introducing-web-vitals-essential-metrics.html

[5] S. M. McNee, S. K. Lam, C. Guetzlaff, J. A. Konstan, and J. Riedl, “Confidence displays and training in recommender systems,” in Proc. INTERACT, vol. 3, 2003, pp. 176–183.

[6] C.-N. Ziegler, S. M. McNee, J. A. Konstan, and G. Lausen, “Improving recommendation lists through topic diversification,” in Proceedings of the 14th International Conference on World Wide Web, ser. WWW ’05. New York, NY, USA: Association for Computing Machinery, 2005, p. 22–32. [Online]. Available: https://doi.org/10.1145/1060745.1060754

[7] H. O’Brien and P. Cairns, Why engagement matters: Cross-disciplinary perspectives of user engagement in digital media. Springer, 2016.

[8] S. M. McNee, J. Riedl, and J. A. Konstan, “Being accurate is not enough: how accuracy metrics have hurt recommender systems,” in CHI’06 extended abstracts on Human factors in computing systems, 2006, pp. 1097–1101.

[9] P. Pu, L. Chen, and R. Hu, “Evaluating recommender systems from the user’s perspective: survey of the state of the art,” User Modeling and User-Adapted Interaction, vol. 22, no. 4-5, pp. 317–355, 2012.

[10] B. P. Knijnenburg, M. C. Willemsen, Z. Gantner, H. Soncu, and C. Newell, “Explaining the user experience of recommender systems,” User Modeling and User-Adapted Interaction, vol. 22, no. 4-5, pp. 441–504, 2012.

[11] J. Carlton, A. Brown, C. Jay, and J. Keane, “Inferring user engagement from interaction data,” in Extended Abstracts of the 2019 CHI Conference on Human Factors in Computing Systems, 2019, pp. 1–6.
