
Towards Trustworthy AI

A proposed set of design guidelines for understandable, trustworthy and actionable AI

Brenda Uga

Subject: Human-Computer Interaction

Corresponds to: 30 HP

Presented: VT 2019

Supervisor: Annika Waern

Examiner: Henrik Åhman


Sammanfattning

Artificial intelligence is used today in both everyday applications and expert systems. In situations where relying on the output of the AI carries a risk of negative consequences, it becomes important to understand why the AI system has produced its output. Previous research in human-computer trust has identified trust antecedents that contribute to the formation of trust in an AI artifact, understanding of the system being one of them.

In the context of Pipedrive, a sales management system, this thesis investigates how AI predictions can be designed to be understandable and trustworthy, and by extension which explanatory aspects can provide guidance on the actions to take, as well as which presentation formats support the formation of trust. Using a research-through-design method, several designs for presenting AI predictions in Pipedrive are explored, leading to a proposed set of design guidelines that support understandability, trustworthiness and actionability. Both the designs and the guidelines have been developed iteratively in collaboration with users and design practitioners.

Abstract

Artificial intelligence is used today in both everyday applications and specialised expert systems. In situations where relying on the output of the AI brings about the risk of negative consequences, it becomes important to understand why the AI system has produced its output. Previous research in human-computer trust has identified trust antecedents that contribute to formation of trust in an AI artifact, understanding of the system being one of them.


Acknowledgements


Contents

Sammanfattning
Abstract
Acknowledgements
1 Introduction
  1.1 AI or ML?
  1.2 Pipedrive
  1.3 Research question
2 Background
  2.1 Definition of trust
  2.2 Trust in AI
  2.3 AI in the sales domain
3 Method and methodology
  3.1 Methodology
  3.2 Methods
4 Design process
  4.1 Low-fi ideation
  4.2 Internal testing of designs
  4.3 Final design concept
  4.4 Usability testing
  4.5 Analysis and results
5 Guidelines
  5.1 Preliminary guidelines
  5.2 First iteration
  5.3 Second iteration
  5.4 Validating guidelines
  5.5 Discussion
6 Conclusion
  6.1 Ethical consequences
  6.2 Strength of results
  6.3 Future work
References
Glossary
A Usability test tasks
B Sources of guidelines
C Preliminary guidelines with trust bases


1 Introduction

Artificial intelligence (AI) and machine learning (ML) are increasingly ingrained in our lives. These technologies are involved when we shop online, regulate the temperature in our homes, order a taxi, and even when we visit the doctor. Nevertheless, despite the widespread use of AI- and ML-enabled products, AI and ML remain ”black boxes” for their users. While there has been significant research on how to build these kinds of products, research on how end users are able to understand and make use of them is currently a key topic in the HCI community.

With the advent of e-commerce and online communities early in this century, research on recommender systems and online trust has been ongoing for almost two decades. More recently, there have been significant advances in AI/ML technologies that allow more sophisticated products, like voice assistants, self-driving cars and systems that can detect cancerous cells better than doctors themselves. These are complex systems that require increasing amounts of trust from users in order to be of use to them. Trust is a key factor that decides whether one is going to adopt a system and continue to use it [1]. As one of the goals of developing AI/ML systems is to provide a solution to users' problems, the designers of these systems must take trustworthiness into account if they are to attract those users in the first place. Thus, efforts to make AI and ML systems trustworthy are increasingly relevant in both industry and academia.

It has been established that trust in AI/ML systems is dependent on these systems being understandable and explainable [2], [3], [4]. Only when the user is able to understand the workings of the system and why it gives the results it gives, is the user able to trust the system at all. Trustworthy AI and explainable AI are active research topics that propose several means of making AI systems explicable, thus allowing trust to be formed in the system. Recently, there have been several incentives for making AI systems explainable, like DARPA's funding programme for explainable AI [5], Google's Human-Centered Machine Learning and PAIR initiatives [6], and the ”right to explanation” as given in the General Data Protection Regulation [7].

While there has been research into applying explainable AI in military and medical domains [8], [9], [10], what remains less researched is the forecasting application domain, especially in the context of sales. There is extensive research on forecasting algorithms, time-series prediction and ML models that perform with high accuracy in fast-changing circumstances. Yet, there is little understanding of how human users make use of the forecasts these models provide. This thesis explores the use of predictive algorithms in the context of sales management, as this domain can greatly benefit from accurate models to forecast market prices, revenue or volume of deals. Understanding the forecasts and being able to gather actionable insights from ML models is something that can make or break a salesperson. Here, the need for explainable AI arises. Acting on AI predictions might bring about severe negative consequences, such as lost opportunities, unhappy clients and even monetary losses for the salesperson themselves. This is why having comprehensible and scrutable prediction systems, which clearly explain why the prediction came to be the way it is, is crucial in this context.

1.1 AI or ML?


of machine learning, which can be of varied natures; for example, image recognition algorithms, computer vision and deep neural networks are all examples of ML. Thus, most AI is composed of ML techniques, but AI adds on top of these the ability to provide the user with solutions to their problems.

An example would be the target feature in this design work, a revenue forecast AI feature. This feature uses a machine learning model to predict the revenue for a given time period, which in turn is consumed by the user in order to solve the problem of knowing how much revenue will likely be brought in. Additionally, the AI of the feature lies in the explanations given for the predictions, which help the user to know which deals to focus on and which deal characteristics are the most salient predictors of success. Therefore, to refer to the target feature, the term AI is more appropriate than ML. Similarly, the term AI-enabled products will be used to refer to other systems that use AI and ML to help solve a user need.

1.2 Pipedrive

To investigate AI and ML systems in the sales domain, this thesis project was carried out in Pipedrive. Pipedrive is a sales management tool that helps salespeople manage their complex sales processes by providing a visual overview of their sales pipeline. As a company, Pipedrive is developing more and more AI/ML-based features in the product, which brings about a need to employ trustworthy and explainable AI. The concrete feature this thesis targets is the Revenue Forecast prediction, which forecasts the sales company's revenue over time and in upcoming periods.

The goal for Pipedrive is to provide actionable insights from this prediction, which help salespeople take actions to improve their sales situation. This needs to be preceded by the AI being able to explain its predictions in a comprehensible and trust-instilling way. In the author's pre-study, participants articulated ways in which they use Pipedrive. For example, some users want to notice patterns in the data that might influence their earnings: ”I want to notice it early if someone has fallen behind or if any patterns have changed that would influence our results”. The Revenue Forecast feature, and forecasting one's revenue in general, is an important task for salespeople, especially sales managers, in order to be able to budget, plan the production of the products they sell, analyse which items sell well and which do not, and even to acquire loans and funding. It was found in the pre-study that some people use the Forecasting features in Pipedrive for prioritising: ”Forecast view in general helps me understanding where the focus needs to be also to see the seasonal trends that I need to be aware of.” The importance and gravity of the actions taken based on forecasts is an indicator of how much trust is demanded of AI-enabled products in the sales domain.

1.3 Research question

Considering the need to investigate the topics of trustworthy and explainable AI in the sales domain, and to provide actionable insights in the Revenue Forecast prediction, an opportunity presents itself to create guidelines on how to satisfy these needs. While there is extensive knowledge on how to build AI systems that perform reliably and are robust enough to be employed in changing situations, there is little knowledge on how to design the interfaces that present the AI to the user. As shown in the paper by Amershi et al., AI/ML systems tend not to adhere to commonplace UI/UX design guidelines nor to AI-specific design recommendations, despite extensive research into making those kinds of systems intelligible, transparent and explainable [12]. The situation is ripe for designing interfaces that are trust-inducing and offer comprehensible explanations and guidance on what actions to take based on the AI predictions. Thus, this thesis explores how to design trustworthy AI predictions in a way that satisfies the above criteria. The goal of this thesis is to suggest design guidelines for the development of trust-inducing explanations for AI-based prediction systems. Design guidelines, as a form of generative intermediate knowledge, seek to provide more abstract knowledge than specific examples of design, but not knowledge as general as theories do [13]. Developing guidelines as the contribution appears appropriate to the situation, as the guidelines could be applied in a variety of contexts where trust becomes critical to warrant usage. As also shown in the paper by Amershi et al., intuitive and applicable design guidelines are needed in the present situation of ever-growing numbers of AI-enabled products that users need to understand, trust and use [12].

Stemming from the goal of exploring how trustworthy AI can be designed for, this thesis considers the following research question:

”How can AI predictions be articulated to be trustworthy and actionable?”

As only the presentation of AI predictions is considered, and not the way the predictions are generated in the first place, the author investigates the articulation of predictions. Focusing on the presentation of the predictions allows the creation of design guidelines for designers without the need to dive deep into the underlying machine learning systems. While the overarching goal is to articulate predictions in a sufficiently trustworthy and actionable way, a need for explanations of the predictions still arises. Thus, a first sub-question is defined as follows:

”What explanatory aspects are important to support trust?”

As AI predictions are mainly used as a basis for decisions, the predictions need to be articulated in ways that bring out the possible actions to take. Therefore, a second sub-question is defined as follows:

”What guidance should the articulation provide towards actions to take?”

A third sub-question comes from the wide array of different interfaces that can host the articulation of predictions, such as a virtual anthropomorphic assistant, a collection of graphs or natural language conversation mechanisms. Presentation formats need to be evaluated from the angle of trustworthiness and comprehensibility, yielding the following sub-question:

”Which presentation formats best support formation of trust?”


2 Background

Trust is an important topic to investigate in research regarding human-computer relationships. There are several different definitions and concepts of trust, from which no widely accepted and used framework has emerged. In this chapter, some theories and framings of trust that consider the formation of trust in AI systems are explored. Additionally, the use of AI in the domain of sales is examined.

2.1 Definition of trust

As trust has been identified as a means of being able to rely on complex systems that cannot be understood or explained completely, trust is evidently a necessity when dealing with AI/ML systems [14]. Therefore, it is important that AI/ML systems are designed in a manner that induces trust in the human user of the system. To accomplish that, a common understanding of what trust is, is needed.

One of the major works on trust, Mayer's model of organisational trust [15], defines trust as ”the willingness of a party [trustor] to be vulnerable to the actions of another party [trustee] based on the expectation that the other will perform a particular action important to the trustor, irrespective of the ability to monitor or control that other party.”

Another well-accepted definition comes from Lee and See [14], defining trust as:

“the attitude that an agent will help achieve an individual’s goals in a situation char-acterized by uncertainty and vulnerability.”

The major distinction between these two definitions is that in Mayer's definition trust is an intention or a willingness to be vulnerable, whereas in Lee and See's it is an attitude or a mental state. An attitude can exist without any actions taken on that party's side, but a willingness to be vulnerable means that there is an action involved in becoming vulnerable. In Mayer's definition the trustor willingly engages in an act of becoming vulnerable, whereas in Lee and See's definition the trustor has an attitude irrespective of whether they engage in any actions with the trustee. The definition of Lee and See acknowledges that attitudes and beliefs are factors that guide a person to have certain intentions, and these intentions transform into behaviours and actions. So in that sense, in Lee and See's definition, trust as an attitude is a prerequisite to any action being taken, especially to the action of submitting oneself to be vulnerable. Lee and See hold that beliefs and attitudes underlie trust, and that different levels of trust elicit different intentions and behaviours.

Both definitions highlight that the relationship between the trustor and the trustee depends on the uncertainty and risk involved in the act. One party is willing to take on some risk, as they cannot be sure that the other party is going to fulfil their end of the deal. This is an important aspect to consider in relations between AI and humans. AI often acts unpredictably, and human users might take AI's predictions and act on them, thus making the human vulnerable to whatever consequences might arise from relying on bad predictions. The uncertainty of the situation here is two-fold: one part is established in the nature of AI being unpredictable, and the other part lies in the consequences of the human actions that are taken based on the AI's output.


These two definitions can be consolidated by looking at one other definition of human-computer trust by Madsen and Gregor [17], where human-computer trust can be defined as

”the extent to which a user is confident in, and willing to act on the basis of, the recommendations, actions, and decisions of an artificially intelligent decision aid.”

This definition acknowledges trust to be based on two aspects - confidence in the system and willingness to act on the system's decisions and advice. Madsen and Gregor further address the more general nature of confidence as an underlying attitude of the user, which can be considered a prerequisite to the user being willing to act on the AI's outputs. Their definition consolidates Lee and See's trust-as-an-attitude and Mayer's trust-as-willingness definitions in this regard, yet leaves out the risk and uncertainty aspects.

Both Lee and See's and Madsen and Gregor's definitions seem to suit the situation between an AI and a human well, with one describing the uncertain nature of the interactions and the risk involved in situations where an agent is able to help fulfil the user's goals, and the other basing willingness on confidence. Madsen and Gregor's definition also suits the research question and sub-questions posed in this thesis with regard to both trustworthiness and actionability. Accepting these definitions of trust leads this chapter to explore some approaches to trust that match the definitions either fully or in part.

2.2 Trust in AI

There exist several approaches to describing the formation of trust in IT systems, which concern themselves with explaining what factors impact the formation of trust relations between humans and computers and the situations in which these relations are formed.

One group of approaches is based on the social structure of interpersonal relationships, adapted to fit relations between an IT system and a human. The framework of trust in technology-mediated relationships by Riegelsberger et al. defines the trust relationship between two actors - a trustor and a trustee - establishing the IT system as one of them [18]. They acknowledge that while IT systems have neither motivations nor the free will to act trustworthily, human actors still treat IT artifacts as human actors, thus the framework is applicable in situations concerning AI as the trustee and a human user as the trustor. Their framework specifies trust-inducing properties that are either intrinsic (ability and motivation, which are based on benevolence and internalised norms) or contextual (temporal, institutional and social embeddedness), stating that these properties need to be present for the trustee to warrant trust in themselves. Riegelsberger et al. acknowledge that trust relationships are relevant in risky situations, and develop several concepts that provide ways for the trustee to signal their trustworthiness and willingness to fulfil their part of the exchange.

Another framework of trust is Lee and See's [14]. This framework, similarly to the one before, identifies individual, organisational and cultural contexts as factors that influence the formation of trust. Trust in their treatment is dependent on the internal factors of the actors, such as their predisposition to trusting and their previous experiences regarding trust; organisational factors like relationships between groups of people, reputation and gossip; and cultural factors such as social norms in the community. Additionally, it specifies three bases of trust: performance, process, and purpose. Performance reflects the what of the system, meaning what the system does to achieve the operator's goals. Process is concerned with the how of the system, in this case the algorithms governing the behaviour of the AI system. Finally, purpose is the extent to which the system behaves according to the intent of the system's designers, reflecting the benevolence and the positive attitude of a trustee towards the trustor. These bases of trust, when interpreted through analytic (rational consideration of the situation), analogical (assigning trust levels based on similar attributes) or affective processes (emotional response), have the ability to influence trust.


Söllner et al. argue that frameworks conceptualised on human-human relationships are not applicable when analysing trust between IT artifacts and humans, as these approaches consider the artifact to be mediating the interpersonal relationships, instead of the artifact itself being one of the actors in the trust relationship [19]. Although Söllner et al. explain that these frameworks rest on the fact that humans perceive IT systems to behave in a human-like way - and thus that the frameworks apply to this relationship as well, because a parallel can be drawn between an IT system and a human actor - they consider these concepts to have a weakness. The authors consider Lee and See's framework of trust in automation to be more suitable for describing trust in human-computer relations. They build on Lee and See's definition of trust and adapt the framework to include measurable trust antecedents. Additionally, de Visser et al. disprove the media equation by identifying differences in how humans and computers evoke different levels of trust in human subjects, and state that Lee and See's framework is more suitable for describing trust in human-agent relationships [20].

In this thesis, the author considers Söllner et al.'s adaptation of Lee and See's framework to be appropriate for investigating trust between AI systems and human users. The adaptation considers the contexts that surround the AI systems, the bases and antecedents of trust and the different processes through which one can analyse the trustworthiness of a system. Additionally, it gives measurable dimensions of trust, with which it is possible to later evaluate the trustworthiness of the proposed designs with regard to the trust bases and antecedents.

Figure 1: Trust antecedents according to Söllner et al. [19]

In addition to the trust mechanics described by Lee and See, trust in IT artifacts is also considered to have antecedents: precursors that affect humans' trust development [19]. Each of the trust bases defined by Lee and See has several antecedents, which affect the formation of trust, as seen in Figure 1. For the performance base, these are competence (the system is able to help achieve the user's goal), information accuracy (the data presented is accurate), reliability over time (the system can be relied on over time) and responsibility (the system has all the functionality needed to help achieve the user's goal). For the process base, these are dependability (the system is consistent), understanding (the user understands the system), control (the user perceives being in control of the system) and predictability (the future behaviour of the system can be anticipated). For the purpose base, these are motives (the purpose of the designers is communicated to the user), benevolence of the designers (the designers have the best interests of the user in mind) and faith (the system can be relied upon in the future).
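To make these antecedents easier to apply when generating or reviewing designs, they can be collected into a simple checklist keyed by trust base. The following Python sketch is purely illustrative (it is not part of the thesis or of Söllner et al.'s framework); it encodes the antecedents listed above and reports which of them a given design leaves unaddressed.

# Söllner et al.'s trust antecedents, grouped by Lee and See's trust bases,
# captured as a design-review checklist (illustrative sketch only).
TRUST_ANTECEDENTS = {
    "performance": ["competence", "information accuracy",
                    "reliability over time", "responsibility"],
    "process": ["dependability", "understanding", "control", "predictability"],
    "purpose": ["motives", "benevolence of designers", "faith"],
}

def unaddressed_antecedents(addressed):
    """Return, per trust base, the antecedents a design does not yet address."""
    return {
        base: [a for a in antecedents if a not in addressed]
        for base, antecedents in TRUST_ANTECEDENTS.items()
    }

# Example: a design offering explanations (understanding), feedback (control)
# and accurate data (information accuracy) still leaves these gaps:
print(unaddressed_antecedents({"understanding", "control", "information accuracy"}))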


As trust in an AI system can be improved by providing explanations of its output, the explainability of the AI system becomes a relevant goal of the designers in order to build trustworthy AI systems [21]. Explanations can explain different aspects of the system: the what the system consists of, the how the system works, and the why the system gave a specific output [22]. Explanations of why a given output was generated increase trust in the system the most, as the explanations promote understandability of the system, which is, in turn, one of the antecedents of trust [3], [19]. Based on this, we can say that providing explanations of why the AI has given such an output is one of the factors that contributes the most to the formation of trust between an AI system and the user. In addition, providing explanations has several other benefits for the user, with the user perceiving less time spent on tasks [21] and increased performance on information retrieval tasks [1], [3].

Trust is largely affected by design features of the artifacts, as established by Hoff and Bashir [10]. More anthropomorphic features have garnered increased trust resilience, meaning that in cases of failures and bad predictions occurring in anthropomorphic systems, the decrease in trust was smaller than in more machine-like systems [10]. However, some studies conflict with the above, with findings that people prefer text interfaces over anthropomorphic interfaces [23], and that anthropomorphic features are very sensitive to different user populations with regard to cultural and individual differences [24].

The consensus is that anthropomorphism must be considered carefully and designed well so that its positive effects on trust are not reversed. Politeness of the system, regardless of anthropomorphic features, has also been identified as promoting trust, as interfaces that exhibit patience and do not interrupt the user elicit greater trust and improved task performance from users [25]. Additionally, explaining system failures and offering confidence levels with predictions have a positive effect on trust formation [10].

Trust in complex IT artifacts is positively affected by the transparency of these artifacts [1]. Transparency can be achieved by explaining the how and the why of the resulting output, yet the level of transparency coming from opening the black boxes of algorithms must be appropriate to the context the system is used in. Although explanations are not easily generated from all ML algorithms, the positive effects on trust resulting from appropriate use of explanations are a strong motivator for opting for algorithms that are able to open their black boxes.

Based on the above, understandability, good user experience and competence of IT systems seem to be largely positive factors contributing to the formation of trust in relationships between users and systems. This is also evident in several applications of AI systems.

2.3 AI in the sales domain

AI is applied in several different products that already permeate our daily lives. Tesla's cars with driver-assisting AI, Amazon's Alexa with its voice interface, and Netflix with its smart suggestions on films one might enjoy are just a few examples of how AI/ML technologies are being put to use in commercial products. They help solve differing use cases, ranging from predicting and recommending products and optimizing driving routes to assisting users with everyday tasks via voice control. While in some of these products the smartness of the product is a selling argument and thus heavily used in promoting the product, in most cases AI is used under the covers.

Studies have shown how people are unaware that their social media feeds are curated by machine learning algorithms [26], and that even some videos of human beings doing things are actually fakes generated by neural networks, also known as deepfakes [27]. These are examples of how users of algorithmic products are not able to distinguish algorithmic features from non-algorithmic ones. Yet, with these seamless products, we still see huge numbers of users using them, even without acknowledging the presence of AI/ML. As explained above, transparency of the IT artifact is one way to promote trust in situations involving some risk or uncertainty. In these situations, however, there is little risk involved for the user. Consuming social media, searching online and watching entertaining videos are actions that people take in their everyday lives without perceiving any risk arising from them. It might be that this is why these systems can still thrive without utilising any trust-promoting techniques.


experimenters with AI systems that detect skin cancer [28], offer personalised healthcare [29] and reduce administrative workload [30]. Similarly, AI is used to aid military personnel in missile detection and combat training, as well as in providing lethal autonomous weapons. These systems often operate in high-risk situations, where relying on AI might bring about severe consequences; therefore these systems usually relinquish control to the user and function more as suggestions. In these systems, in contrast to the IT systems in use in everyday life, trust becomes increasingly important, as the stakes are simply higher. False positives and negatives in the medical domain might lessen the quality of the healthcare people receive from doctors assisted by AIs, while false predictions when browsing movie catalogues affect the user considerably less. Incorrect decisions made by autonomous weapons might, in severe cases, bring about the loss of civilian lives. This distinction in the level of risk and uncertainty involved in the situations where AI is used is important to consider when choosing whether to employ trust-inducing techniques.

In these differing situations, ethics becomes another key question when choosing whether or not to open the black boxes. The previously mentioned example of deepfakes has provoked discussion on ethics. Deepfakes are mostly employed for malicious purposes, like creating fake news or revenge pornography, and these uses have provoked tech giants such as Google, Reddit and Twitter to take steps to ban deepfake content [31]. Several facial recognition algorithms have been developed in an inherently biased way, trained only on data from homogeneous user groups, excluding people of colour and women [32]. These algorithms are used for a wide variety of cases, such as preventing crime and credit evaluation, but also for providing ”filters” on Snapchat and Facebook, with varied negative consequences for the excluded user groups. It seems that transparency in these cases could be used as a means of collaborative accountability for the algorithms.

Another domain with higher risk involved is sales. AI applications also have their place in the sales domain, where recommender systems are transforming the relations between customers and salespeople, forecasting is used to aid decision making and several AI products are focused on gathering insights from vast amounts of sales data.

Risk and uncertainty are present in the sales domain in many forms – salespeople have uncertain incomes when they work on commission, there is risk in lost opportunities and uncertainty about which opportunities one should prioritise. To help manage these risks, several products have been developed to assist in selling. While many of these do not employ AI/ML, some products have found value in AI/ML techniques, such as Salesforce (https://www.salesforce.com/products/einstein/overview/), Zoho (https://www.zoho.com/), Aviso and Clari. As established before, trust is necessary in situations characterised by risk and uncertainty, which sales certainly is. It would, therefore, make sense to look at these products from the aspect of trust and the factors that positively contribute to trust formation.

Salesforce and Zoho, of the above-mentioned products, are examples of customer relationship management (CRM) software, allowing salespeople to manage their sales processes. In Salesforce, AI is used in a considerable number of features, offering lead scoring, insights on opportunities and insights on the whole sales process. Salesforce has developed their AI to resemble Albert Einstein and aptly named it Einstein, in an effort to use anthropomorphism to increase perceived competence by likening their AI to a renowned scientist. In Einstein, several trust-inducing techniques are employed, such as explanations of the predictions, showing model confidence and offering ways to give feedback on insights. Einstein also offers means to uncover details about the ML models used, such as different performance metrics and coefficients, in order to give the user means to verify the AI's outputs with regard to their reliability and dependability. Salesforce's Einstein can be considered to make use of many trust-inducing factors, which could contribute positively to Salesforce's revenue numbers [33].

On the other hand, Zoho, another CRM offering AI/ML features, does not make use of many trust-inducing factors. Explanations are only available for one AI feature; the same feature allows the user to see model confidence levels and offers a way for the user to provide training samples. While this feature seems to be designed with trust as one of the foci, most of Zoho's AI, Zia, does not offer many means to generate trust.

Two other products, Aviso and Clari, are sales insight dashboards that are advertised largely as AI products. Yet, among their AI features, only a couple offer explanations of their behaviour, with no other trust-inducing factors available. In these products, Opportunity Scoring in Clari and WinScore in Aviso both offer explanations of the predictions, yet neither offers any means to see confidence levels or to give feedback on the predictions. As these are just the main features in those products, with many more AI-driven insights available, the number of trust-inducing features used is low. Given how little use these products make of trust-promoting factors, it could be said that most of the trust antecedents are not fulfilled.

Based on these products, we can say that trust-promoting features are not yet widely offered. These features are mainly concerned with the process and purpose bases of trust, improving understandability, control and faith in the systems. As these products are nonetheless successful, they might be fulfilling at least the performance base of trust through good user experience and the competence of the systems. Here, an opportunity presents itself for other companies in the sales domain developing AI features to do well by filling the apparent gaps in the process and purpose bases of trust.


3 Method and methodology

In this section, the methodology and specific methods used in this thesis to answer the posed research question are outlined.

3.1 Methodology

This thesis is focused on creating design guidelines that help design practitioners design AI features to be understandable, trustworthy and actionable. Design guidelines serve this purpose well as they can be used both in generating designs and evaluating them by providing a simple list of rules that can be easily referred to during the design process. In this thesis, the guidelines are used to design an AI feature in Pipedrive and the guidelines themselves are evaluated based on the resulting design prototypes.

A qualitative approach is selected in this thesis because this is a relatively new research topic with much still to discover, meaning that a quantitative, measurement-based approach is not suitable when there is little knowledge of the concepts to be measured. Additionally, as trust and actionability are quite subjective concepts that depend a lot on a person's feelings and previous experiences, a qualitative approach seems suitable for evaluating the trustworthiness and actionability of the designs based on the guidelines.

A research-through-design approach is used because of a need to evaluate the proposed guidelines in context, meaning that the guidelines should be evaluated in the real situation of designing an AI feature based on them. The design of the AI feature serves as a means of research, providing an artifact to help in the evaluation of the guidelines. As opposed to [12], where guidelines are created with a focus on their applicability in evaluating designs, the goal in this work is to propose guidelines that can be used to create designs. The guidelines should aim to be useful in real design practice, and the designs resulting from using the guidelines should provide value to the end user by being understandable, trustworthy and actionable in the user's mind. Therefore, a research-through-design approach is chosen and employed to generate knowledge that can be used to evaluate the usefulness of the guidelines in designing AI products. As stated by Barab and Squire in [34], the knowledge produced by doing research through design lies also in the design artifacts themselves and the design process used. The design artifacts serve as an exemplification of what the guidelines can be used to achieve considering all the constraints resulting from the real-life context of use [34], and as a means of evaluating the guidelines with regard to their aims. The designs are the result of applying the guidelines to a situation constrained by the intricacies of the real-world context of use, with constraints ranging from adhering to the existing product design language at Pipedrive and to the feature the design work is based on, to considering the situation and tasks in which the end user uses the designed feature. As AI features and the value they provide to the user are largely dependent on the situation of use and the task at hand, because the situation and task affect the acceptance and use of the AI, research through design serves well in this real-life setting. It makes use of design that takes into account all the complex circumstances affecting the design, in addition to theories, which under-specify the factors that influence the solution [35], [34].

Additionally, as the thesis work is done in the context of a company, the resulting designs serve as input into future feature development by providing a proposed concept for presenting AI predictions together with documentation on the design itself [34]. Therefore the knowledge lies in the design artifacts - the pen-and-paper drawings and low-fi concepts - both those that ended up being used in the final feature and those that were disregarded, as knowledge of what does not work is valuable input as well. Similarly, the emerging pattern of displaying AI predictions as an extra layer can function as a design pattern in Pipedrive, to be used in future developments of AI predictions, thereby comprising a form of intermediate-level knowledge.


they can be applied in a wider variety of situations and applications.

The research-through-design approach was applied in this thesis in the form of first gathering design guidelines via literature review and affinity diagramming, then designing a proposed feature based on the guidelines using an iterative design process, and finally evaluating the guidelines and the designs in two stages - one with end-users of the designs, and one with the users of the guidelines.

3.2 Methods

A two-level iterative design process was followed, one cycle of iterations for the guidelines and one cycle of iterations for the design of the target feature, as shown in Figure 2. The steps in the process were as follows, in chronological order:

1. Literature search to find insights on trust-affecting factors
2. Affinity diagramming to consolidate the findings from the literature search
3. Deriving preliminary guidelines from the clustered findings
4. Flow mapping to communicate and think through the flow the users take when using the Revenue Prediction AI
5. Storyboarding to scope the target feature and see which interface screens are needed in the flow
6. Low-fi ideation to ideate on the target feature based on the preliminary guidelines
7. Internal testing to eliminate weaker concepts and gain feedback to iterate stronger ones further
8. High-fi ideation to design the final concept based on feedback from internal testing and the preliminary set of guidelines
9. External testing to test the design concept with regard to its understandability, trustworthiness and actionability
10. Iterating the guidelines based on the author's feedback from using them in design work
11. Participatory workshop with design practitioners to test the iterated guidelines' applicability and clarity
12. Iterating the guidelines based on feedback from the workshop to result in the final set of guidelines


Figure 2: Thesis process

Similarly to Amershi et al.'s paper on developing human-AI interaction guidelines [12], first a literature search was done to find relevant articles on trust in human-computer relations, focusing especially on trust in AI products. The sources found through the search were read through and any relevant pieces of information regarding how to promote trust were extracted onto sticky notes. These were then affinity diagrammed to form clusters of related ideas, for example, one cluster about keeping the user in control of the system and another cluster on not explaining the inner workings of the ML. Another pass of clustering was done to group similar clusters into higher-level clusters. From these clusters, preliminary guidelines were formed based on the concepts on the sticky notes, with the group headings derived from the cluster headings. This constituted a bottom-up approach to deriving guidelines, starting from specific instances of insights and ending up with more general ideas on how to promote trust.

In the iterative process for the design of the feature, two iterations were done to reach an interactive high-fidelity prototype. Several methods used in Pipedrive to scope, ideate, and design features were employed, such as flow mapping user flows, storyboarding the story told in the evaluation sessions and Google's Design Sprint exercises, to produce several design concepts in a relatively fast manner. These methods were used to keep the design process user-centred, as the flow maps and storyboard were largely based on existing knowledge and previous research the author had done in the company beforehand. Using widespread methods used by design practitioners in the industry, the target feature was designed with the user's needs in mind, while still being constrained by the existing Pipedrive product in terms of which features are already available and which would be beneficial to develop further from the commercial viability standpoint. According to Zimmerman et al. [36], design researchers should not be concerned with making a commercially viable product, this aspect differentiating them from design practitioners. As this project is carried out in a company setting, whose goal is still to develop commercially successful products, the author strived to design artifacts that can be considered both commercially viable and right. Employing methods mainly used by design practitioners allows the author to consolidate the knowledge from research practice into a design artifact that would be the basis of a marketable product, thus contributing to transferring knowledge from HCI research to the community of design practitioners.

After the concepts were ideated on, they were tested internally with 4 product designers from Pipedrive, to get free-form feedback on the concepts. This feedback served as input to the high-fi concept development, as some weaker concepts were eliminated and some were iterated further to result in the high-fi concept.


The use of these methods and the preliminary set of guidelines resulted in an interactive prototype that was evaluated in usability test sessions with Pipedrive users. There were 4 test sessions conducted over video calls with 4 Pipedrive users of varied sales roles, both managers and salespeople. The target group of users was screened based on their use of forecasting features in Pipedrive, so that the participants would already be familiar with revenue forecasting and see the need for it, in order to save time in the tests (by not having to include an introduction to forecasting). The participants were recruited over email; the total target group consisted of 172 users, of whom 5 signed up for the test sessions (one ended up cancelling, resulting in 4 sessions conducted). The test sessions were semi-structured, following a task list but allowing for further probing on interesting remarks from the participants. These usability tests functioned as input to assessing whether the designs resulting from using the preliminary guidelines achieved the set goals of being understandable, trustworthy and offering actionable insights. The external test sessions thus served as a validation step for the preliminary guidelines, and also allowed the author to iterate the guidelines for the first time based on the author's own feedback from using them in the design work.

Additionally, a participatory design approach was employed in the form of a workshop with 7 product designers from Pipedrive as participants, as the guidelines should be useful for design practitioners in Pipedrive and elsewhere. The author targeted for recruitment all product designers in the design team at Pipedrive; the participants ended up being of varied seniority levels, and most of them did not have experience designing for AI-enabled systems. This workshop was focused on improving the guidelines' structure and understandability, and on finding out whether the guidelines could be used in actual design work by designers. After the workshop, the guidelines were iterated for the second time, based on the feedback on their clarity and applicability from the workshop participants. This resulted in the final set of guidelines, which was based on the two previous versions and the feedback received from both the external tests and the workshop.

The results of the usability tests were analysed by qualitatively coding the transcriptions, allowing the author to see which aspects of the designed feature elicited reactions regarding trustworthiness, understandability and actionability. Structural coding was used to find positive and negative remarks on any aspects of the design, also looking for mentions of aspects the participants would want but that were lacking in the existing design [37]. Coding and thematically analysing the resulting data allows the analysis process to be more rigorous, especially in a qualitative research setting [38]. This qualitative coding and analysis functioned as the validation step for the preliminary guidelines, as the external tests were based on the design resulting from using the preliminary version.


4 Design process

The preliminary guidelines, outlined in Chapter 5.1, were used to design an AI-enabled feature based on Pipedrive's Revenue Forecast Report. The feature aims to help salespeople know how much money they will earn over time by offering revenue predictions based on their historical data inside Pipedrive. The feature is based partly on an existing ML model, which predicts a winning probability and an expected close date for deals, and total revenue for a given period of time. The goal of the design process was to design the feature according to the guidelines, and to test it with users of Pipedrive to see whether they can understand, trust and gain actionable insights from the predictions.
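As a rough illustration of how deal-level predictions of this kind could roll up into a period forecast, consider the Python sketch below. It is an assumption made for illustration only, not a description of Pipedrive's actual model: it treats the expected revenue for a period as the sum of each deal's value weighted by its predicted win probability, counting only deals whose expected close date falls within that period.

from dataclasses import dataclass
from datetime import date

@dataclass
class DealPrediction:
    title: str
    value: float            # deal value in the account currency
    win_probability: float  # predicted probability of winning, between 0 and 1
    expected_close: date    # predicted close date

def forecast_revenue(deals, start, end):
    """Expected revenue for a period: sum of value x win probability
    over deals predicted to close within that period."""
    return sum(
        d.value * d.win_probability
        for d in deals
        if start <= d.expected_close <= end
    )

# Hypothetical deals for a May forecast:
deals = [
    DealPrediction("ACME renewal", 12_000, 0.8, date(2019, 5, 10)),
    DealPrediction("New lead", 5_000, 0.3, date(2019, 6, 2)),
]
print(forecast_revenue(deals, date(2019, 5, 1), date(2019, 5, 31)))  # 9600.0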

4.1 Low-fi ideation


Figure 3: Flow maps


After flow mapping, the design of the target feature was scoped. Considering limited time resources, the design was scoped to include only the existing feature flow map, which means that onboarding and setting up the feature were not designed for. To further scope the design, a storyboard was developed to map out the journey the user would partake in when testing the design solutions later. This was done to clearly delimit and communicate the story told in the usability testing sessions, and to see clearly which interface screens were needed for telling the story. The storyboard was told from the perspective of two users - one who is open and accepting of the prediction right away, and one who at first distrusts the predictions but later gains trust. This was done to create empathy around the two possibilities, and to see whether they would need the same interface screens to complete their story. In the final storyboard, there are 15 frames, with about half representing the interface screens and half representing user thoughts, as seen in Figure 4.

Figure 4: Storyboard

The next step in the process was to start ideating. This was done at first with pen and paper, utilising some exercises from Google's Design Sprint to boost creativity and create as many ideas as possible. The Design Sprint exercises used were Crazy 8s and Solution Sketches. With Crazy 8s, 8 different ideas are sketched out in 8 minutes, essentially creating a seed of ideas for the next steps of ideation. Together with a colleague, the author partook in the exercise, with the result being ideas on two different aspects of the design. Some ideas were centred around the presentation of the explanations for the predictions, and some were centred around the placement and context where the whole AI feature would sit in Pipedrive's product. With the ideas from Crazy 8s on the table, Solution Sketches were drawn to flesh out a couple of the more promising ideas in three frames. This activity allows participants to further develop some concepts from the Crazy 8s, remix others' ideas or come up with entirely new concepts. These two exercises resulted in 6 low-fi ideas for presenting the predictions and explanations. From this also emerged two different approaches to presenting the AI predictions - one with an anthropomorphised AI character and one without. The 6 ideas were then further iterated on paper to flesh out some more details, without going into the specifics of the explanations or the presentation. In the process of generating the ideas, the guidelines were followed roughly, without going into much detail, in order not to stifle creativity and to also bring to the table less-than-ideal ideas that could inspire further ideas. Also, the ideas were not all based on the Revenue Forecast Report, shown in Figure 5, as the author and the company were open to exploring other options for the placement of predictions, both for the revenue prediction and for possible future predictions. Therefore some of the following concepts are based on the Pipeline view, shown in Figure 6.

Figure 5: Revenue Forecast Report


Figure 7: AI predictions tab concept


Figure 8: Chatbot concept


Figure 9: Anthropomorphic assistant concept


Figure 10: Revenue Forecast Report extra data concept


Figure 11: AI mode concept


Figure 12: Prediction hub concept

The sixth and final concept is a notification mechanism combined with a prediction hub, as illustrated in Figure 12. The notification mechanism would deliver notices to the user when new predictions are available, and possibly also other content regarding their use of Pipedrive, functioning as a feed. The user would then have the chance to access the predictions in detail in a central prediction hub that houses all AI content. Designed as a dashboard, the hub would give a clear overview of all important predicted data, with the possibility to dig deeper into each prediction separately, to access explanations and to give feedback. In the prediction hub, one could also set up the whole AI system, control the data used for predictions and see a history of predictions. While it could offer a central location for all AI/ML content, the hub is disconnected from the salespeople's workflows, and it would be harder to relate content in the hub to the respective places the data comes from.

4.2 Internal testing of designs

The first round of testing in the design process was done with these paper sketches. The 6 concepts were presented to 4 people from the company while asking for free-form feedback on them. The goal was to eliminate less promising ideas and to gather insights and input for further rounds of iteration, with the feedback and conclusions summarised in Table 1. Some ideas were received better than others, with the testers appreciating having an overview visible at once, with details accessible on demand. This eliminated the chatbot concept (Figure 8), which did not allow for a quick visual overview. Additionally, it was said that when presenting the predictions in the Revenue Forecast Report, they should appear near the existing graph, as the testers would like to compare predictions to the existing revenue projections. This feedback eliminated the AI tab concept (Figure 7), as the AI tab would require switching between views to see predictions and existing revenue data, not allowing for easy comparison.


Concept | Feedback | Conclusion
1. AI tab | Does not allow comparison with non-AI data | Not selected for further development
2. Chatbot | No visual overview, needs a lot of effort from the user to get useful output | Not selected for further development
3. Anthropomorphic personal assistant | Friendly and accessible, but full-bodied embodiment might be too casual and fun for business environments | Anthropomorphic presentation selected for further development
4. Extra data in existing views | Good connection to underlying data and existing workflows | Selected for further development
5. AI ”mode” | Good connection to existing data, views and workflows. Offers a good overview and requires minimal user input | Toggling AI ”mode” selected for further development
6. Prediction hub | Not immersed in existing workflows and disconnected from underlying data, requires discovering and navigating to it. Good notification mechanism, allowing for contextually surfacing predictions | Notifications selected for further development

Table 1: Internal testing feedback on concepts

when invoked by the user, essentially borrowing from the AI ”mode”, the notification mechanism and the anthropomorphic AI assistant ideas. In the concept, some data in the underlying views could be highlighted by the AI, together with presenting predictions that are relevant for that particular view. In the context of the Revenue Forecast Report, the AI predictions could be presented as extra data on top of the existing graphs and tables, and in the Forecast View, for example, the predictions could appear on the deal cards themselves in order to highlight probable deals. The concept could make use of an anthropomorphic AI that would notify the user about new predictions, offer suggested actions and present insights and patterns from the data.


Figure 13: Forecast view

The AI layer concept was then developed further as high-fi sketches. As the preceding internal testing had not presented strong indications for or against anthropomorphism, two versions were developed further - one anthropomorphic and one not. This was done to gauge the reactions to anthropomorphism in the actual user base, with the usability tests presenting two prototypes. The high-fi sketches were done using Pipedrive's design system Convention UI, which defines the design language, grid system, brand colors and design elements in use in Pipedrive's product, and using Sketch and Marvel App for high-fi sketching and creating interactive prototypes. The two versions of the same concept were developed in two iterations. After the first iteration, design feedback was gathered from designers and researchers in Pipedrive. The feedback was integrated into the second iteration, which was the version used in the final user testing. In addition to the design of the whole feature, an important aspect was to design the anthropomorphic presentation of the AI. For this, personality traits that the AI should convey were developed. These were stated in the form of opposing values in one sentence. The personality traits were as follows. The AI is:

• a mentor, but not a personal assistant
• patient, but not interrupting
• insightful, but not a know-it-all
• straightforward, but not dictating
• trustworthy, but not infallible
• delightful, but not overzealous

These personality traits were taken into account when designing the icon and name used to refer to the AI and when formulating the AI's utterances. The anthropomorphic version was designed to be machine-like, since with diverse user bases a human-like AI would possibly alienate some user groups depending on the demographic embodiment of the AI [39], [40]. With machine-likeness, the perceived capabilities of the AI would match the actual capabilities more closely than with human-likeness, which would trigger overestimation [39]. To adhere to guidelines G22.1 - G22.3, a machine-like representation was chosen, and the above personality traits were developed to convey the essence of the anthropomorphic AI.


The AI's icons were also created by the author, one for the anthropomorphic version and one for the non-anthropomorphic version, as seen in Figure 14. Association mapping was used to generate concepts and metaphors for the non-anthropomorphic version, with the final concept combining machine-likeness with a human touch. The icons, together with the colour purple and, in the anthropomorphic version, the name Piper, were used to fulfil guidelines G22 - G25.

Figure 14: Left: Icon for anthropomorphic AI. Right: Icon for non-anthropomorphic AI.

4.3 Final design concept

Figure 15: Toggle for the AI layer

The final concept, based on the preliminary set of guidelines in Appendix C, is an AI predictions layer that can be invoked to appear on top of existing views in Pipedrive by toggling the layer from a switch in the bottom left corner of the screen (G23, G29), as shown in Figure 15. The layer makes predictions that relate to the underlying views and data appear in the context of those views, allowing the user to see AI predictions in context.


Figure 16: Notification

In the anthropomorphised version, there is also a notification mechanism coupled with a contextual sidebar of predictions and insights from the AI. This notification is designed to appear when new predictions are available, and would appear for a short period of time before disappearing (G24), as shown in Figure 16.


Figure 17: Sidebar


Figure 18: Left: Deal explanations in anthropomorphic version. Right: Total revenue prediction explanations in anthropomorphic version

The explanations for the predicted expected close dates and deal probabilities are shown as characteristics of the deal that affect the predicted numbers. There are 4 explanations shown, with varying numbers of positive and negative factors (G38). The possible explanations are, for example, that the deal has stayed in one pipeline stage for a longer or shorter time than average, or that the deal's fields contain some data, like the decision maker's name or the delivery date, that affects the prediction (G33, G35, G36, G37, G39, G40, G46, G47, G48, G49). All the explanations are shown together with an indicator of the direction in which that factor has affected the predicted number, as shown in Figure 18. This offers the user a way to figure out how their actions in Pipedrive have affected the predictions (G19). The explanation card contains a dropdown menu, which houses options to give feedback on the predictions and the insights, together with an option to let the AI overwrite the existing probability and expected close date values with the values it predicted, and an option to see the data the prediction is based on (G7, G8, G26, G28, G30, G32), as shown in Figure 19.
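To make the structure of these explanations concrete, the sketch below shows one hypothetical way a deal explanation with its factors and direction indicators could be represented for rendering in the explanation card; the field names, factor texts and values are illustrative assumptions and do not reflect Pipedrive's actual data model or ML output.

# A minimal, hypothetical sketch of one deal explanation as it could be
# structured for rendering. Field names, factor texts and values are
# illustrative only, not Pipedrive's actual model.
from dataclasses import dataclass
from typing import Literal

@dataclass
class ExplanationFactor:
    text: str                          # human-readable characteristic of the deal
    direction: Literal["up", "down"]   # direction in which the factor moved the prediction

deal_explanation = {
    "deal_id": 123,
    "predicted_probability": 0.72,
    "predicted_close_date": "2019-06-14",
    # Four factors with a mix of positive and negative influences (cf. G38)
    "factors": [
        ExplanationFactor("Time in current pipeline stage shorter than average", "up"),
        ExplanationFactor("Decision maker's name is filled in", "up"),
        ExplanationFactor("Delivery date set later than usual", "down"),
        ExplanationFactor("No recent activity on the deal", "down"),
    ],
}

# Rendering could then map "up" to a green arrow and "down" to a red arrow
# next to each factor in the card.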


Figure 19: Deal explanation card dropdown

The non-anthropomorphic version functions in the same way, offering prediction data on top of the Forecast view, but it does not feature the notification mechanism or the sidebar, as those features were mostly included to give the anthropomorphic AI an opportunity to express itself.

In the non-anthropomorphic version, the total revenue prediction is shown in a more concise and structured way, without conversational sentences. The explanations for the total revenue prediction are shown similarly to the anthropomorphic version, as 4 deals that affect the total prediction in a positive or a negative way, as shown in Figure 20.

Figure 20: Left: Deal explanations in non-anthropomorphic version. Right: Total revenue prediction explanations in non-anthropomorphic version


it necessary (G43, G44). As the explanations are the same in both versions, the same guidelines that were applied to design the explanations in the anthropomorphic version were also applied here. The same functions for accepting predictions, giving feedback and seeing underlying data are likewise present in this version.

Both of these versions were made into interactive clickable prototypes using Marvel App.

Figure 21: Anthropomorphic version of AI predictions

Figure 22: Non-anthropomorphic version of AI predictions

4.4 Usability testing

To see how well the guideline-based designs work with regards to understanding the feature, trusting it and gaining actionable insights from it, usability tests were conducted. Both proposed versions were tested with 4 Pipedrive users over video calls. The participants were selected from a pool of English-speaking active users of the Forecast View with varied sales roles (sales managers, salespeople) and locations. Users were selected in this way to target those who currently forecast their revenue and sales, and thus see the need for doing so. This selection also saved time during the tests, as the participants were already knowledgeable on the topic of forecasting. In total, 172 invites were sent out via email, and 5 tests were scheduled. Of the 5 scheduled tests, 4 took place; the fifth was cancelled because the tester did not show up. Participants were compensated for their time with a Pipedrive gift bag.

The usability tests served 4 purposes:

• to validate the proposed design guidelines
• to see if users understand, trust and get actionable insights from the feature
• to see if there are strong preferences for or against anthropomorphism
• to see if there is a need for AI predictions in Pipedrive

As the tested feature was designed using as many of the proposed guidelines as possible, the usability tests served as a means of validating whether the resulting design promoted the 3 goals in the research question: understanding, trust and actionability. The goal of validating the proposed guidelines was partly conditioned on the aforementioned goal, yet each guideline was evaluated separately, based on which design elements resulted from it and which of these elements were received well by the testers. The first two goals are connected to the first and second sub-questions on explanatory aspects in support of trust and on guidance towards actions to take. A secondary goal, stemming from the fact that there were two versions of the same feature - one anthropomorphic and one not - was to find out whether users prefer one presentation format to the other. This goal is connected to the third sub-question on presentation formats. As this thesis was carried out at Pipedrive, the tests also served a goal of the company: to gauge whether there is a need for AI predictions inside Pipedrive, and whether people saw added value in the predictions.

As trust in the AI predictions and the understandability and actionability of the predictions are subjective properties that largely depend on the individual user, the usability evaluation was set up as a qualitative test with a within-subjects design. A within-subjects design was used to minimise the number of testers needed, as testers are considerably hard to recruit for usability tests in the company. Several assumptions based on the research question and the feature itself were written down before the tests, and the test tasks were structured to validate those assumptions.

The assumptions were as follows:

• Users are able to understand the meaning behind the predicted revenue number
• Users are able to find and understand the reasoning behind the predicted revenue number
• Users are able to trust the predicted revenue number and the reasoning behind it
• Users are able to find out which deals they should focus on in order to achieve the predicted revenue, and why they should focus on them
• Users are able to agree or disagree with the prediction and the reasoning, and they understand how to give feedback on the predictions

These assumptions were stated to communicate the goals of the feature that needed to be validated by the tests. The first two reflect the usability and understandability of the predictions and the explanations. The third is concerned with the trustworthiness of the predictions and explanations. The fourth assumption is about the actionability of the predictions and insights, and the fifth concerns the extent to which people are able to agree with the predictions and the usability of the feature regarding giving feedback on the predictions.


explanations, conforming to the definition of trust by Madsen and Gregor [17]. Actionability was evaluated by having the users specify which deals they would focus on based on the AI predictions of expected close date and deal probability. To assess whether users were able to form an opinion on agreeing with the AI's output, they were asked to evaluate their confidence in the prediction both before and after the real result, according to the scenario, was revealed.

In order to validate the assumptions and, in turn, achieve the goals of the usability tests, a list of tasks and questions was prepared. There were 13 tasks for each prototype, with 1 or 2 follow-up questions for each task to probe further into the topic if the participant had not already answered them. The tasks were about finding and enabling the prediction feature, finding the predicted data and the explanations for it, and about the tester's confidence and trust in the feature and their willingness to base further actions on the predictions. The tasks were developed to enable the participants to voice their opinions on all the major parts of the prototypes, to see how they react to the design elements. In the tests, a scenario with a negative ending was played through, to emulate the likely situation where the AI predictions are not accurate and to gauge the tester's reactions to this situation. The full task list can be seen in Appendix A. To see how much time would be needed for the test sessions and to clarify any confusing wording of the tasks, a pilot test was conducted before the real test sessions.

Testers had to complete 13 tasks on both prototypes and answer some follow-up questions based on the assumptions. As a within-subjects design was used, each tester was shown both prototypes, with the order counterbalanced across participants. Each tester was first introduced to the topic and the testing protocol was explained to them. The participants were told why the tests were being done and where the results would be used (both in Pipedrive and in this thesis), and their permission to record the session was asked for. Then a couple of introductory questions were asked to give the participant time to get comfortable. The questions were about their sales process, their clients and how they currently forecast their revenue. After the introductory questions, the tester was asked to open the first prototype, and a short scenario was presented of being a salesperson in a weekly team meeting. This scenario was chosen based on previous research on forecasting in Pipedrive, where the weekly meeting emerged as the most frequently occurring situation in which forecasts are looked at. As the participants were urged to think aloud, there was little need to ask follow-up questions, since most had already been answered while thinking aloud. After the tasks were completed for the first prototype, the second prototype was opened and the same tasks were given again. During the second part of the tests, the users were already acquainted with the feature and remembered what they had been asked to do in the previous part, so in some cases they were not explicitly asked to do certain tasks, having already completed them before the author could prompt them. In the post-task debriefing, the participants were asked a couple of follow-up questions regarding their preferences between the presentation formats, and their opinions on whether the feature would be useful in their sales work.

4.5 Analysis and results

The results of the usability tests are detailed in this section.

In general, the testers were positive about both the design of the feature and their trust in it. There were minor problems regarding the usability of the feature, but most of the testers could find and understand the design elements in focus without problems. All of the tasks, with the exception of a few, were completed successfully by all participants. Of 17 tasks, 5 can be considered not completed, as only one participant completed each of them. These tasks were concerned with noticing the AI features, explaining the capabilities the AI has, finding a place to give feedback to the system, and noticing deals changing position on the screen. Problems arose on task 1, with half of the users not noticing anything differing from the current live website in the prototype, meaning 2 people noticed neither the notification in the anthropomorphic version nor the toggle in the other prototype. This could have been caused by the elements not being visually prominent enough (small elements, white background on the notification card) or by the users not having the original live website to compare against in order to spot the differences.


and one noticing it after toggling the AI mode on and off repeatedly. This could have been an issue with the prototype, as it was evident from the screen sharing that the prototype screens used to emulate the transition from one column to the other took long enough to load to trigger a spinner in Marvel App. This kept the prototype from running the animation smoothly, in some sessions cutting out the deal movement entirely, with only the start and end states visible. On the other hand, the prototype was also designed such that only one deal moved, to simplify the test, yet in real-life situations many more deals would move at once, making the transitions more noticeable. In a situation with real data, the number of deals on the screen would on average be in the hundreds, not 22 as in the prototype. Of these hundreds of deals, many more would receive a new expected close date predicted by the AI, making them change columns. This also raises a concern about the scalability of the animation, as it is not feasible to animate every deal that receives a new expected close date from the AI moving between columns in this way.

Users mostly had trouble finding the place to give feedback to the AI, with only 1 participant finding the correct place in the prototype. In two tests, the tasks regarding giving feedback were not done at all, due to time running out and the participant needing to leave on time. This could be grounds for surfacing the feedback option proactively instead of hiding it in a menu. On the other hand, having enough time to explore the feature and proper onboarding to it would possibly improve the chances of finding the feedback options on the cards as they are.

Regarding willingness to accept the predictions and make decisions based on them, the users mostly felt they would be willing to trust the AI, with 2 caveats. One participant mentioned not being willing to trust it on first use, but said that once the AI had established a good track record of being accurate, they would be more willing to trust it. Generating a track record for the AI retroactively on past data, even before the user has started using the AI, could give users more incentive to trust it on first use. This finding is in accordance with the differentiation between initial trust and gained trust.

Another finding is that users had mixed opinions on how the total revenue prediction should be calculated. Given how the current model works, there are two ways of calculating the total. Exactly half of the participants said it would make sense to them for the total to be based on a threshold: all deals with a predicted probability over a certain threshold (60% was mentioned by one of the testers) would count towards the total with 100% of their value, while all deals below the threshold would not be counted at all, as those deals are more likely to be lost and to earn nothing. In contrast, the other half of the participants stated that deals should be counted into the total by their predicted probability, meaning that each deal's value would be multiplied by its predicted probability and that weighted value added to the total. These differing formulas might not be relevant once the feature is deployed using a different ML model that simply outputs the total revenue prediction, without regard to our interpretations of the inner calculations.
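As a worked illustration of the two interpretations the testers described, the sketch below compares a threshold-based total with a probability-weighted total. The deal values, probabilities and the 60% threshold are hypothetical assumptions for illustration only and do not come from Pipedrive's actual model.

# Hypothetical deals: (deal value, AI-predicted win probability).
# The numbers and the 60% threshold are assumptions for illustration only.
deals = [(10_000, 0.85), (5_000, 0.40), (8_000, 0.65), (2_000, 0.10)]

THRESHOLD = 0.60  # threshold mentioned by one of the testers

# Interpretation 1: threshold-based. Deals at or above the threshold count
# with 100% of their value; deals below it are excluded from the total.
threshold_total = sum(value for value, prob in deals if prob >= THRESHOLD)

# Interpretation 2: probability-weighted. Each deal contributes its value
# multiplied by its predicted probability.
weighted_total = sum(value * prob for value, prob in deals)

print(threshold_total)  # 18000   (10 000 + 8 000)
print(weighted_total)   # 15900.0 (8 500 + 2 000 + 5 200 + 200)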

The testers had no trouble finding the explanations or understanding them. One participant, however, interpreted the red and green arrows next to the explanations of the total number as indicating a trend, i.e. that the deals shown as explanations had increased or decreased in probability since a previous period. This was probably because the icons used to indicate the direction in which the deals affect the total number are the same icons used in another part of Pipedrive, where they do indicate a trend.
