
Artificial Intelligence 298 (2021) 103507


“That’s (not) the output I expected!” On the role of end user expectations in creating explanations of AI systems

Maria Riveiro a,*, Serge Thill b

a Department of Computing, School of Engineering, Jönköping University, 551 11 Jönköping, Sweden
b Donders Institute for Brain, Cognition, and Behaviour, Radboud University, 6525 HR Nijmegen, the Netherlands

Article info

Article history: Received 30 April 2020; Received in revised form 14 April 2021; Accepted 15 April 2021; Available online 20 April 2021

Keywords: Expectations; Explanations; Factual; Counterfactual; Contrastive; Explainable AI; Mental models; Machine behaviour; Human-AI interaction

Abstract

Research in the social sciences has shown that expectations are an important factor in explanations as used between humans: rather than explaining the cause of an event per se, the explainer will often address another event that did not occur but that the explainee might have expected. For AI-powered systems, this finding suggests that explanation-generating systems may need to identify such end user expectations. In general, this is a challenging task, not least because users often keep them implicit; there is thus a need to investigate the importance of such an ability.

In this paper, we report an empirical study with 181 participants who were shown outputs from a text classifier system along with an explanation of why the system chose a particular class for each text. Explanations were either factual, explaining why the system produced a certain output, or counterfactual, explaining why the system produced one output instead of another. Our main hypothesis was that explanations should align with end user expectations; that is, a factual explanation should be given when the system’s output is in line with end user expectations, and a counterfactual explanation when it is not.

We find that factual explanations are indeed appropriate when expectations and output match. When they do not, neither factual nor counterfactual explanations appear appropriate, although we do find indications that our counterfactual explanations contained at least some necessary elements. Overall, this suggests that it is important for systems that create explanations of AI systems to infer what outputs the end user expected, so that factual explanations can be generated at the appropriate moments. At the same time, this information is, by itself, not sufficient to also create appropriate explanations when the output and user expectations do not match. This is somewhat surprising given investigations of explanations in the social sciences, and will need more scrutiny in future studies.

©2021 The Author(s). Published by Elsevier B.V. This is an open access article under the CC BY license (http://creativecommons.org/licenses/by/4.0/).

1. Introduction

Modern lives are increasingly shaped by data-driven decisions, often made by systems that use Artificial Intelligence (AI) and Machine Learning (ML) algorithms. These systems have the potential to augment human well-being in many ways [1]; however, they often operate autonomously as “black boxes” that were not designed for transparent interaction with end users [2–6]. In complex cases, users thus have difficulties understanding the behaviour of such systems [7–9], which can result in mistrust and misuse (see, for example, the algorithm aversion problem [10]). To mitigate some of these challenges, authors across relevant disciplines have suggested focusing on transparency and interpretability [9,11,4,5,12–14,1]. Consequently, explainable AI (XAI), which is seen as a way to achieve such transparency and interpretability [15], has recently seen an increase in interest from both the Human Computer Interaction (HCI) and the AI/ML communities [16].¹

This paper is part of the Special Issue on Explainable AI.
* Corresponding author. E-mail address: maria.riveiro@ju.se (M. Riveiro).
https://doi.org/10.1016/j.artint.2021.103507

In terms of content, although many explanation types have been proposed [25], appropriate design remains an open question. In general, this touches not only on what content is relevant, but also on how much information is necessary given specific tasks and users, or even on deciding when (or why) explanations are required [26,27]. Further, while positive effects of providing explanations have repeatedly been demonstrated [5,28–31], there is an associated cost; it is thus important to consider their cost-benefit ratio and to understand in which situations explanations are most likely to be needed and helpful [32–36]. In this context, it is relevant that system requirements (and thus associated aspects such as hardware costs and development time) are, at least in part, dictated by the type of explanation the system is meant to generate. For example, some explanations only need to convey what features of a particular input led an AI system to make a particular decision, and the generating system thus only needs information about the AI system itself. This type of explanation – describing only which features contributed to the system’s output – is termed factual. “This is Wilson Warbler because this bird has a yellow belly and breast with short pointy bill” is thus a factual explanation of a bird classification [37, p. 4] because it highlights the features deemed relevant for the prediction. Such explanations are used, amongst others, by Local Interpretable Model-agnostic Explanations (LIME) [5], Shapley Additive Explanations (SHAP) [38], and saliency methods for highlighting relevant features in images [39].

Explanations can also specifically address events that users might consider unexpected or abnormal [40,41]. In other words, when users expect to see one output, but observe another, such an explanation would address why the observed, rather than the expected, output was produced. Doing so thus requires a counterfactual element; hence this type of explanation is also referred to as counterfactual. As such, “if the car had detected the pedestrian earlier and braked, the passenger would not have been injured” [42, p. 6277] or “[y]ou were denied a loan because your annual income was £30,000. If your income had been £45,000, you would have been offered a loan” [43, p. 884] are both examples of this type. AI systems producing counterfactual explanations have only recently begun to appear in the literature [43–47].

A typical approach to generating counterfactual explanations is to identify the features that would alter the output of the model if minimally changed [48]. However, as we will expand upon in the following section, research in the social sciences suggests that humans often use counterfactual explanations to address the expected outcome rather than the observed one [41]. Identifying such an expected outcome (and thus the appropriate counterfactual) from a possibly very large set of candidates is a more general, and more challenging, problem since it requires an understanding of the user in addition to information about the AI system itself. Creating such counterfactual explanations is likely to be difficult in practice since there are no straightforward ways for a machine to infer user expectations. It also remains unclear in which situations such explanations (as opposed to types that are easier to produce, e.g. factual explanations) are even required; while there is a rich literature in the social sciences discussing the dynamics of human production and understanding of explanations in various contexts, the degree to which these insights apply to machines is much less explored (but see [41] for a recent XAI-focused discussion).

To address this, we conduct an empirical study in which users are given examples of both factual and counterfactual explanations in the context of an AI-assisted decision-making task involving text classification. The examples are designed such that the system’s output either matches user expectations or does not. When it does, the explanation provided is either factual (explaining what keywords led to the text being classified the way it was) or counterfactual (explaining why the classification was chosen over a random incorrect alternative). When, on the other hand, the output is thought to be unexpected from the end user’s perspective, we provide either a factual explanation as before, or one of two types of counterfactual explanations (either explaining why the given output was chosen over what the user most likely expected, or over another unexpected category).

Our core hypothesis is that explanations should align with end user expectations. Thus, the ideal system would provide factual explanations when the output is in line with user expectations and counterfactual explanations when it is not. If the hypothesis is supported, it would demonstrate that the system generating explanations does need a way to assess user expectations (a challenging task, as noted before), both to decide what type of explanation to use on a case-by-case basis and to create suitable content for counterfactual explanations.

The remainder of this paper is structured as follows. The next section reviews related work and discusses the motivations for our study in more detail. We then present the details of our experiment, in particular, details on how we operationalise and measure aspects such as end user expectations. This is followed by both a quantitative and qualitative analysis and a discussion of our results in the context of the wider literature.

¹ In general, explanations can serve many additional purposes [17–19], including trust and trust calibration [20,21], justification [22], debugging [23] or fairness assessment [24].


2. Background

2.1. Types of interpretable and explanatory methods

While the realisation that there is a need to make AI and ML systems understandable and transparent to end users goes back, at least, to the era of expert systems [49,50], a renewed interest in this topic has appeared more recently with the increased availability of AI and ML systems. Consequently, there are now many different approaches to interpretable ML (see, for example, [51–54] for recent reviews). Here, we briefly summarise the state of the art in terms of different strategies by which issues of interpretability and explainability can be tackled.

Lipton [55] and Silva et al. [56] distinguish between models that address transparency – how the model works – and post-hoc explanations – what else the model can tell [55,56]. The former comprises interpretable models that facilitate some sense of understanding of the mechanism by which the model works [55]. This can be achieved at various levels: at the level of the entire model (e.g., simulatability), at the level of individual components (e.g., parameters and decomposability), or at the level of the training algorithm (algorithmic transparency) [55]. Post-hoc explanations, meanwhile, confer useful information for ML users or practitioners without addressing details of the model’s inner workings [55,56]. This is achieved, for instance, through visualisations of models, explanations by example, natural language explanations, or factual explanations [5,38,57,39]. One of the advantages of post-hoc explanations is that interpretations are provided after the fact without sacrificing predictive performance [55].

It is also common to distinguish between local (or instance-level [58]) and global explanations, dividing work on interpretable ML models into three categories [59]. The first category is roughly equivalent to Lipton’s [55] interpretable models. This kind of work learns predictive models that are, somehow, understandable by humans and thus inherently interpretable (for example, [60,61]). The second and third categories comprise approaches that explain more complex models, such as deep neural networks and random forests. In practice, such complex models are often preferred since they often outperform their more interpretable counterparts [5,60]. To nonetheless explain their workings, one strategy is to build local explanations for individual predictions of the black box. Another is to provide a global explanation that describes the black box as a whole, normally employing an interpretable model [59]. Thus, consistency is either maintained locally, in the sense that a given explanation that is true for a data point also applies to its neighbours (see, e.g., [5,38]); or globally, in the sense that an explanation applies to most data points in a class (see, e.g., [62,63]).

In spite of this diversity in approaches and recent attempts to categorize them (see, for instance, taxonomies, ontologies and toolkits in [64–66,25]), there remains a lack of clear guidelines regarding what type of explanation or explanatory method should be used depending on, for example, context, task, or type of end user. There are also no clear guidelines on what content is appropriate, how much information is necessary, and when the explanations are required (if at all) [26,27,64].

2.2. Contrastive and counterfactual explanations

Work from the social sciences demonstrates that human explanations are often contrastive [41,67]: they do not merely address a particular prediction that was made, but rather why it was made instead of another one [68]. For example, when humans expect a particular event Q but observe another event P, the question in their minds is “Why P rather than Q?” Here, P and Q are called “fact” and “foil”, respectively [68]. More specifically, Q represents something that was expected but did not occur [41]. It is thus the counterfactual that the explanation is expected to address. To complicate matters, when humans articulate such a question, they often leave the foil implicit [41], asking simply “Why P?” It is therefore rarely explicit what exactly was expected instead of P.

In XAI, the terms “contrastive” and “counterfactual” have often been used interchangeably. In this paper, we use the term counterfactual, and adopt the terminology from Lipton [68] and Miller [41], using the above definitions of P and Q as fact and foil, respectively. A counterfactual explanation is thus one that addresses the foil.

The main advantage of such counterfactual explanations (addressing why P was chosen over Q) is that they are intuitive and less cognitively demanding for both questioner and explainer [41]: there is no need to reason about (or even know) all causes of P; it is sufficient to address only those relative to the foil Q. However, since humans often leave the foil implicit, it is up to the explainer to infer the intended one. For XAI, this is a significant challenge because the set of candidate foils in any given question is possibly infinite. To give an example from Miller [41, p. 16], the question “Why did Elizabeth open the door?” has many possible foils, including, e.g., “Why did Elizabeth open the door, rather than leave it closed?” and “Why did Elizabeth open the door rather than the window?”

For XAI, using counterfactual explanations thus sidesteps the substantial challenge of having to explain the internal workings of complex ML systems [43]; they can provide information to the user that is both easily digestible and useful for understanding the reasons underlying a decision. This also enables users, if needed, to challenge these decisions and thereby influence the future behaviour of the system such that performance increases. Creating such explanations is, however, a challenging task in itself [69,70,45,71,72], and significant effort is put into devising methods for generating appropriate counterfactuals [44,73,74,43,47,48,46].

One way to generate counterfactual explanations is to identify features that, if minimally changed, would alter the output of the model [48]. To do so, one can frame the search as an optimisation problem, seeking to find the closest hypothetical point that would be classified differently from the point currently in question [48], although defining an appropriate distance metric is far from trivial (see [43,75,76] for examples of this approach). By contrast, a model-agnostic approach – termed SEDC – for generating counterfactuals is presented in [58]. SEDC uses a heuristic best-first search to find evidence that counterfactually explains predictions of any classification model for behavioural and text data. Compared with other approaches such as LIME-Counterfactual [5] and SHAP-Counterfactual [38], SEDC is typically found to be more efficient, although LIME-C and SHAP-C have low and stable computation times [44].
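To make the “minimal change” strategy concrete, the following toy sketch (written in the spirit of SEDC’s best-first search, but not the published implementation) greedily removes the word whose absence most lowers the probability of the predicted class until the prediction flips; `clf` is assumed to be any fitted text classifier exposing scikit-learn-style `predict`, `predict_proba` and `classes_`.

```python
# Toy sketch of counterfactual search by minimal feature change (in the
# spirit of SEDC, not the published implementation). `clf` is assumed to
# be a fitted text classifier with predict / predict_proba / classes_.
def greedy_counterfactual(clf, text, max_removals=10):
    words = text.split()
    fact = clf.predict([text])[0]                  # P: the observed output
    fact_col = list(clf.classes_).index(fact)
    removed = []
    for _ in range(max_removals):
        if not words:                              # nothing left to remove
            break
        # Score every single-word removal by how much it hurts class P
        scored = []
        for i in range(len(words)):
            candidate = " ".join(words[:i] + words[i + 1:])
            scored.append((clf.predict_proba([candidate])[0][fact_col], i, candidate))
        _, i, candidate = min(scored)              # removal with the largest drop
        removed.append(words.pop(i))
        foil = clf.predict([candidate])[0]
        if foil != fact:                           # Q: the prediction has flipped
            return removed, foil
    return removed, None                           # no flip within the budget
```

The removed words then serve as evidence for why the fact P was predicted rather than the foil Q that the search arrives at; note, however, that this foil is determined by the model’s decision boundary, not by what the user expected.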

While different approaches for generating counterfactuals exist and have been shown to be useful in various specific domains (see, for example, [77,78]), many open challenges remain. Here, we address two of them by building on the previously discussed insights from the social sciences [41]: deciding (1) in which situations a counterfactual explanation is appropriate and (2), if it is appropriate, which counterfactual to use in an explanation. We hypothesise that explanations need to align with the expectations of the user, and thus that (1) counterfactuals should only be used if the system output is different from what the user expected and (2), when used, counterfactuals should address the expected output.

These questions highlight potentially significant challenges for work in XAI: if counterfactual explanations are indeed needed when the system’s output does not match what users expected, and the appropriate foil is defined by this expectation, then producing suitable explanations will be, as discussed above, a very difficult task. For that reason, it is important to understand under which conditions this type of explanation is needed (as well as under which conditions it is not).

3. Methods

3.1. Aim and hypotheses

To summarise the previous section, there are good reasons to expect that humans interact with intelligent systems and computers in ways that directly resemble how they interact with other humans [33,79,80]. Further, in human-human interactions, explanations are often needed when an event is unexpected, and they need to explain the unexpected fact in relation to an implicit expected foil [41]. Overall, we expect that the type of explanation given by a system should be congruent with the output that users expected, leading to the following two hypotheses for this study:

H1: Factual explanations are appropriate for correct predictions because the system output is in line with the expected output.

H2: Counterfactual explanations that contain the expected foil are appropriate when the system prediction is incorrect, because this situation mirrors the use of counterfactual explanations in human-human interactions.

To test these hypotheses, we showed participants small pieces of text along with their class as predicted by an AI-based text classification system and an explanation of the system’s output. We chose this scenario because determining the class of simple pieces of text is something that humans also generally do well. It is therefore reasonable to expect that the participant’s expectation of the system’s output corresponds to the true class of the text and, consequently, that the latter can be used to determine the various conditions in the experimental design.

Further, given the current lack of guidelines for what type of explanation or explanatory methods should be provided in a given context [26,27], we also address a few additional questions using the data obtained during the experiment; namely: (1) how factual and counterfactual explanations affect users’ mental models, (2) how the types of explanation used in this study are perceived in terms of content and completeness, and (3) whether there is a need for an interactive explanation system (which also connects to discussions regarding the dynamics of explanations, and whether they are better understood as a product or a process [41]).

3.2. Experimental design

Conditions The factors and levels of the experiment were as follows. The AI-system produced a prediction that was either correct or incorrect (two levels). The Explanation-system, on the other hand, produced three types of explanations: factual (F), counterfactual with the correct (and thus expected) output included (CC), and counterfactual including an incorrect (unexpected) output (CI) (three levels). There were thus two independent variables (type of prediction and type of explanation) and a 2x3 experimental layout. Since we used a between-subject design, this resulted in six subject groups or conditions; see Table 1.

An important assumption in this design is that participants will generally be able to, by themselves, infer the true class of the text to be classified. If this is the case, then we can reasonably expect that a correct classification will be the expected output, and an incorrect one will not be. If this assumption were violated, the number of data points per condition could potentially be heavily unbalanced, to the point where some aspects might not be tested at all. However, it is possible to test whether participants’ expectations in our study were consistent with this assumption, which we address as the first point of our results section below.


Table 1

Experimental conditions and groups: type of explanation if the prediction by the AI-system is correct by type of explanation if the prediction is incorrect (2x3 experiment). Between-group experimental design with six groups. PC = Prediction is Correct, PI = Prediction is Incorrect, EF = Explanation Factual, EC = Explanation Counterfactual, ECC = Explanation Counterfactual with the expected (Correct) class, and ECI = Explanation Counterfactual with a non-expected (Incorrect) class. For example, PC:EF/PI:ECC: if the Prediction is Correct (PC), this group sees an Explanation that is Factual (EF); if the Prediction is Incorrect (PI), this group sees an Explanation that is a Counterfactual with the Correct class (ECC).

                                         Explanation (E) if Prediction is Incorrect (PI)
                                         Factual        Counterfactual (C) with
                                         (F)            Correct class (C)    Incorrect class (I)
Explanation (E) if    Factual (F)        PC:EF/PI:EF    PC:EF/PI:ECC         PC:EF/PI:ECI
Prediction is         Counterfactual (C) PC:EC/PI:EF    PC:EC/PI:ECC         PC:EC/PI:ECI
Correct (PC)

Measures Table 2 summarises the main measures used in this study. They primarily address (1) how satisfied participants were with the given explanations and (2) the degree to which they understood local and global AI-system behaviour and were able to form mental models. We used two strategies to assess explanation satisfaction. First, we included the question “How satisfying did you find the explanation in terms of understanding why the system made its classification?” in the main task, after the participants read the text, prediction, and explanation on each screen. This was answered on a 5-point Likert scale ranging from “Not at all satisfying” to “Highly satisfying”. Second, we designed a questionnaire based on the review and scales presented by Hoffman et al. [81, e.g., Appendix C, pp. 39-40] that contained closed and open-ended questions regarding the explanations provided, assessing satisfaction, completeness, and how useful the explanations were in supporting the understanding of the AI-system’s behaviour.

A second questionnaire focused on assessing participants’ understanding of the behaviour of the AI-system. Following, once again, the summary presented by Hoffman et al. [81], we included questions that assess the understanding of the local (particular prediction) and global behaviour of the AI-system, and of the learned models for each class. Therefore, we asked participants whether they could predict the outcomes of the system (as done recently by, e.g., [82] and listed by Hoffman et al. [81, p. 11] as a method to elicit mental models). They were also asked to rate the AI-system’s performance for each class, showing whether they had discovered how well the AI-system classified each category (see section 3.3 for more details regarding the AI-system’s hidden behaviour).

All questionnaire items were either five-point Likert-type scale questions or open-ended questions, with the exception of rating the AI-system’s performance for the various classes, where the participants had to choose between “Poor”, “Fair”, “No opinion”, “Good”, and “Excellent”. Details of the questions can be found in Table 2 and Fig. B.12.

Pilot Before running our experimental study, we carried out a pilot study with 20 participants that led to some changes in the formulation of both the main task and the questionnaires. The pilot study was also used to determine how many texts could be classified within a reasonable amount of time; we decided that 18 texts would be appropriate for an average completion time of 25 min.

3.3. Dataset, apparatus and stimuli

Dataset We chose the 20 Newsgroups text dataset² [83], a collection of approximately 20,000 newsgroup documents, partitioned into around 20 different newsgroups. This collection has become a popular corpus for carrying out ML experiments for text classification purposes [23,84]. In an effort to keep the classification task simple, yet elaborate enough to allow the construction of all counterfactuals, we selected three existing classes from the collection:

1. politics (talk.politics in the dataset, including texts from the talk.politics.misc, talk.politics.guns, and talk.politics.mideast newsgroups),
2. science (sci in the dataset, including texts from sci.crypt, sci.electronics, sci.med, and sci.space), and
3. leisure (rec in the dataset, including texts from rec.motorcycles, rec.sport.baseball, rec.autos, and rec.sport.hockey).

The AI- and Explanation-systems Participants were told that an AI-system would classify random texts and emails found online into the three classes mentioned above, and that an Explanation-system would generate an explanation regarding why the AI-system made these classifications. Fig. 1 was shown to the participants to illustrate the relation between both systems.

(6)

Table 2
Measures and example questions used for each of the measures. QI refers to the Explanation-system questionnaire and QII to the AI-system questionnaire; see Fig. B.12.

Completeness, overall satisfaction and content of explanations:
  “The explanations provided of how the AI-system works have sufficient detail.” (1-5) [QI]
  “The explanations provided regarding how the AI-system classifies the text seem complete.” (1-5) [QI]
  “The explanations provided of how the AI-system classifies text are satisfying.” (1-5) [QI]
  “Would you have liked for the explanations to contain additional information? If so, what type of information and when/which situations?” (open-ended question) [QI]

Perceived understanding of the inner workings of the AI-system based on local explanations:
  “The explanations provided help me to understand a particular prediction made by the AI-system.” (1-5) [QII]
  “How satisfying did you find the explanation in terms of understanding why the system made its decision?” (1-5) [Main classification task]

Perceived understanding of the AI-system’s global behaviour:
  “The explanations provided help me to understand the global behaviour of the AI-system.” (1-5) [QII]
  “The explanations provided help me to understand a particular prediction made by the AI-system but also the global behaviour of the AI-system.” (1-5) [QII]
  “The explanations provided help me to understand the limitations and mistakes of the AI-system.” (1-5) [QI]

Perceived capability of predicting the AI-system behaviour:
  “I know what will happen the next time I use the AI-system because I understand how it behaves.” (1-5) [QII]
  “Do you think that the AI-system classifies the different types of text equally?” (open-ended question) [QII]

Actual capability of predicting AI-system behaviour and performance:
  “The performance of the AI-system classifying text about politics/science/leisure was:” (Poor, Fair, No opinion, Good, Excellent) [QII]

Perceived need for an interactive explanation system:
  “I would have liked to have an interactive explanation system that would answer my questions.” (1-5) [QI]
  “If you would have liked to have an interactive explanation system, what would you like that system to be like?” (open-ended question) [QII]

The online behavioural science platform Gorilla [85], which includes the Gorilla Experiment Builder platform,³ was used to implement and conduct the experiment.

We selected texts from the 20 Newsgroups dataset and the three classes according to the following two criteria: (1) the texts did not include any personal or controversial information, and (2) they consisted of one or two paragraphs that could fit the design of the web page (thereby discarding texts that were too long or too short). To classify the selected texts and generate the keywords used in the explanations, we used an adapted version of the code (available on GitHub) by Ribeiro et al. [5] that uses a multinomial Naïve Bayes classifier (MultinomialNB) from the scikit-learn library [86]. Using the prediction probabilities of this classifier, we then also discarded texts that were either too easy or too challenging to classify.

Fig. 1. An AI-system classifies text in one of three classes: politics, science or leisure. This classification can be correct or incorrect. The Explanation-system provides either a factual explanation based on the most important words for the predicted class, or a counterfactual explanation based on words missing from the text that would, if present, have led to another class being selected. This other class (the foil) could be either the correct class of the text, or another incorrect class.

Fig. 2. Process of generating explanations. A selection of texts from the three classes (politics, science and leisure) from the 20 Newsgroups dataset was run through LIME [5] to find the most relevant words that contributed positively to the prediction of a class. Some of the words highlighted by LIME were collected to build a global model (bag of words) for each class (right-hand side of the figure). We built the factual explanations based on the highlighted words from a particular text. The counterfactual explanations are built taking words from the global models (highlighted in pink in the example) and the individual highlighted words from each text. (For interpretation of the colours in the figure(s), the reader is referred to the web version of this article.)

Finally, our experimental design required that participants also be given clearly incorrect predictions from the AI-system. We therefore modified the classifier such that the class politics would always be predicted correctly, the class science would randomly be predicted correctly 50% of the time, and the class leisure would always be predicted incorrectly. We balanced classes across the counterfactuals, ensuring that no class was over-represented. Participants were not told about this aspect of the system; we therefore refer to it as the “hidden” global behaviour of the system. This is in contrast with the local behaviour of the system in terms of each particular classification and associated explanations, which participants did see. We again used the class membership probabilities given by the classifier to assign the incorrect class. The purpose of this hidden global behaviour was to test whether the participants built mental models of the system based on the local explanations.
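A minimal sketch of how such hidden behaviour could be layered on top of the classifier from the sketch above; the 50% coin flip and the choice of the most probable other class follow our reading of the description, and the exact implementation may differ:

```python
# Sketch of the "hidden" global behaviour: politics is always predicted
# correctly, science correctly 50% of the time, leisure always incorrectly.
# When an incorrect prediction is forced, the most probable *other* class
# (by predict_proba) is used, as an assumption about the mechanism.
import random

rng = random.Random(0)   # fixed seed for a reproducible hidden behaviour

def hidden_prediction(clf, text, true_class):
    """Class shown to the participant under the hidden global behaviour."""
    proba = clf.predict_proba([text])[0]
    ranked = [c for _, c in sorted(zip(proba, clf.classes_), reverse=True)]
    most_probable_wrong = next(c for c in ranked if c != true_class)
    if true_class == "politics":
        return true_class                                       # always correct
    if true_class == "science":
        return true_class if rng.random() < 0.5 else most_probable_wrong
    return most_probable_wrong                                  # leisure: always incorrect
```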

Stimuli: explanations Two types of explanations, factual and counterfactual, were generated for the selected texts. The foil used in the counterfactual explanations for incorrect predictions could be either the correct text class or another incorrect one. We used LIME [5] and an adapted version of its implementation for the multiclass case provided by the authors on GitHub to help us generate such explanations. Based on the words highlighted by LIME (keywords) and the three models built for each of the classes, we manually built readable and narrative explanations for non-data scientists. For consistency, all explanations followed the same sentence structure, substituting only the keywords from each text. Thus, we built nine different explanations for each text (since each incorrect prediction has two possible outcomes). An illustration of the process is shown in Fig. 2, and a full example including a particular text and all possible explanations is provided in Appendix C. The following are example explanations generated for one of the texts:


Factual explanation, correct prediction: The AI-system classifies the text as politics because words as economy, Americans and percent were found.

Factual explanation, incorrect prediction: [predicted class science] The AI-system classifies the text as science because words as survey and harvard were found in the text.

Counterfactual explanation, correct prediction: [predicted class politics, foil science] The AI-system classifies the text as politics instead of science because words such as experiment or investigation were not found (even though the words survey and harvard were).

Counterfactual explanation with the correct class, incorrect prediction: [predicted class science/leisure, foil politics] The AI-system classifies the text as science/leisure instead of politics because words such as financial or growth were not found (even though the words American and percent were).

Counterfactual explanation with the incorrect class, incorrect prediction: [predicted class science, foil leisure] The AI-system classifies the text as science instead of leisure because words such as travel or vacation were not found (even though the words Japan and worldwide were).

In total, we used 20 texts from each class, and of those, based on the completion time determined in the pilot study, we selected 18 in total for our experimental study (six for each class). For all of them, we generated all the possible explanations given correct and incorrect predictions. For each incorrect prediction, there were two possible outcomes from the AI-system, and in turn, the counterfactual explanation could include the correct class or not. For example, if the correct class of a text was science, incorrect outcomes were politics or leisure. A counterfactual explanation of a correct prediction had to include one of the incorrect classes as the foil, while a counterfactual explanation of an incorrect prediction could include a foil that was either the correct class or an incorrect one, i.e., science if the correct class was used, or the remaining incorrect class (leisure or politics, respectively).
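The sketch below illustrates how LIME keywords and the templated sentences could be combined. The LIME call follows the public lime package API, while in the study the sentences (and the foil words taken from the global bag-of-words models) were assembled manually, so the helpers are illustrative rather than a reproduction of the authors’ procedure.

```python
# Illustrative helpers: extract LIME keywords for a class and fill the
# sentence templates used in the examples above. Assumes the fitted
# pipeline `clf` from the earlier sketch (an assumption, not the authors' code).
from lime.lime_text import LimeTextExplainer

def lime_keywords(clf, text, class_name, num_features=3):
    """Words that LIME finds most supportive of `class_name` for this text."""
    explainer = LimeTextExplainer(class_names=list(clf.classes_))
    label = list(clf.classes_).index(class_name)
    exp = explainer.explain_instance(text, clf.predict_proba,
                                     labels=[label], num_features=num_features)
    return [word for word, weight in exp.as_list(label=label) if weight > 0]

def factual(predicted, keywords):
    return (f"The AI-system classifies the text as {predicted} because words as "
            f"{', '.join(keywords)} were found in the text.")

def counterfactual(predicted, foil, missing_foil_words, present_words):
    return (f"The AI-system classifies the text as {predicted} instead of {foil} "
            f"because words such as {' or '.join(missing_foil_words)} were not found "
            f"(even though the words {' and '.join(present_words)} were).")

# Reproduces the wording of the examples above
print(factual("politics", ["economy", "Americans", "percent"]))
print(counterfactual("politics", "science",
                     ["experiment", "investigation"], ["survey", "harvard"]))
```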

3.4. Procedure

Fig. 3. Experimental design, procedure and participant’s view of the experiment. After the consent, instructions and demographics form, each participant is randomly assigned to one of the six groups (between-subjects). Each group sees a different set of explanations (as shown in Table 1). Illustrative examples of the explanations are given in section 3.3 and at the bottom of the figure. Each participant goes through 18 (randomly presented) texts to classify and, finally, answers two questionnaires.


Fig. 4. Snippets of the online experiment interface: text to classify, prediction by the AI-system, and explanation. The participants needed to answer two questions: (1) “How satisfying did you find the explanation in terms of understanding why the system made its classification?” (1-5) and (2) “Do you agree with the classification made by the AI-system?” (Yes/No). If the answer was No, an additional question (as shown at the bottom) was displayed; the participant would then enter which class he/she thought was the correct one (politics, science, leisure).

The experiment had three stages: (1) introduction, (2) main classification task and (3) evaluation questionnaires (see the overall procedure in Fig. 3). The first stage consisted of a consent and a basic demographics (gender, age and education) form, followed by the instructions for carrying out the task (see a screenshot of the instructions in Appendix A.11). The overall aim of the task presented to the participants was to help improve and evaluate an AI-system that classified random texts and emails found online. The main task, stage 2, consisted of 18 texts to classify (6 per class), one text per screen (Fig. 4). All participants received the same texts in randomised order. After each text, the system predicted one of the three possible classes (politics, science and leisure). Each prediction was followed by an explanation (generated by the Explanation-system) as to why the AI-system made such a classification. The type of explanations participants received depended on their experimental group, as described in Table 1.

On each screen of the main task, participants had to answer two questions, one regarding how satisfied they were with the explanation provided (“How satisfying did you find the explanation in terms of understanding why the system made its classification?”) and one on whether or not they agreed with the classification made by the AI-system. If the participant did not agree, they had to specify which category they thought was the correct one. The instructions indicated that providing the correct class would help the system become better: “you will help the AI-system to learn from you and be better in the future”. All questions had to be answered. The interface during the main classification task is shown in Fig. 4.

Finally, two questionnaires were given to the participants, one focusing on the Explanation-system and one on the AI-system (see Appendix Fig. B.12 and the previous description of measures in Table 2).

3.5. Participants

200 participants were recruited through the online platform Prolific⁴ and 181 were retained for the final analysis. The selection criteria for participating in the study were that participants were fluent in English and aged between 18 and 65. Participants were paid £3.50 for completing the test; the payment was made through Prolific. The age distribution of the participants was as follows: 60 participants were between 18-24 years, 46 between 25-30, 46 between 31-40, 20 between 41-50, 6 between 51-60 and 3 between 61-65. The mean age was 31.15 (σ = 10.1). 91 of the participants were male (50.28%) and 90 were female (49.72%). The highest academic qualification reported by the participants was “high school”: 63, “bachelor’s degree”: 70, “master’s degree”: 38 and “other”: 10. Each participant was randomly assigned (balanced) to one of the six conditions or groups; after removing and cleaning the data, the numbers of valid participants per group were [PC:EF/PI:EF] = 31, [PC:EF/PI:ECC] = 28, [PC:EF/PI:ECI] = 31, [PC:EC/PI:EF] = 30, [PC:EC/PI:ECC] = 32 and [PC:EC/PI:ECI] = 29, with a total of 181. The participants took part in the experiment for 21:09 min on average (σ = 11.35 min).

⁴ Prolific is a platform designed for online participant recruitment by the scientific community [87,88]. This platform allows for gathering data of at least as high quality as experiments carried out in university laboratories, and of higher quality than alternative platforms [88,89].


Table 3

Number of participants and resulting data points (n × 18 questions) per condition, as well as the number and proportion of “consistent” data points, for which the participant agreed with a correct classification or disagreed with an incorrect one (see section 3.2).

                n participants   n data points            prop. consistent
                                 total      consistent
PC:EF/PI:EF     31               558        509           0.91
PC:EF/PI:ECC    28               504        465           0.92
PC:EF/PI:ECI    31               558        481           0.86
PC:EC/PI:EF     30               540        471           0.87
PC:EC/PI:ECC    32               576        510           0.89
PC:EC/PI:ECI    29               522        470           0.90

Table 4

Measured variables and corresponding questions for an overall appraisal of the explanations given by the system, as well as the figures in which the corresponding visualisation of the responses can be found. Note that both Likert scale and open-ended questions were used.

Satisfaction:
  “How satisfying did you find the explanation in terms of understanding why the system made its decision?” (Fig. 5a)
  “The explanations provided of how the AI-system classifies text are satisfying.” (Fig. 7a)
Completeness:
  “The explanations provided regarding how the AI-system classifies the text seem complete.” (Fig. 7b)
  “Would you have liked for the explanations to contain additional information? If so, what type of information and when, i.e., in which situations?” (N/A)
Sufficient detail:
  “The explanations provided of how the AI-system works have sufficient detail.” (Fig. 7c)
Understanding:
  “From the explanations provided, I understand how the AI-system works.” (Fig. 7d)

4. Results

4.1. Collected data

Table 3 summarises the final number of participants per condition and the corresponding number of data points collected (18 per participant). First, we verified that participants are able to determine the true class of each text and that, therefore, it is valid to use the correct output as a proxy for the user-expected output, as assumed in our experimental design. Thus, we checked how often participants agreed with a correct AI-system output and disagreed with an incorrect one (which we refer to as behaviour that is consistent with the assumption underlying the experimental design). We found that users were consistent about 90% of the time (see Table 3), which indicates that the assumption was reasonable. We excluded the inconsistent responses from most of the analysis below because, for these responses, whether the system output was correct or not cannot be used as a proxy for user expectations. While one might consider “reassigning” inconsistent data points to other conditions, this would distribute responses from individual participants over several groups, which should be avoided. We do, however, describe these responses qualitatively in the next section.
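Operationally, the consistency check is a simple per-group aggregation; a minimal sketch with hypothetical file and column names:

```python
# Sketch of the consistency check in section 4.1 (column names are
# hypothetical): a data point is "consistent" if the participant agreed
# with a correct prediction or disagreed with an incorrect one.
import pandas as pd

responses = pd.read_csv("responses.csv")   # one row per participant x text
responses["consistent"] = responses["agreed"] == responses["prediction_correct"]
summary = responses.groupby("group")["consistent"].agg(total="size", consistent="sum", prop="mean")
print(summary.round(2))                    # compare with Table 3
```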

4.2. Satisfaction with the given explanations

To address our main hypotheses, we first analysed participant responses to questions related to their satisfaction with the explanations given by the system (see Table 4). In particular, we considered both satisfaction with individual explanations and the overall impression (how satisfying explanations were overall, whether they contained sufficient detail, how complete they were perceived to be, and to what degree they helped understand the behaviour of the AI-system).

To test for the statistical significance of differences between the explanation types in the Likert scale responses, we used ordinal regression models; specifically, cumulative link models (CLMs [90,91]) provided by the R package ordinal, and we test for significance using Type II Analysis of Deviance (ANODE) tests [90]. Although Likert scale data are sometimes analysed using parametric tests, it is important to note that, while the appropriateness of such tests in this case remains a matter of debate, to err on the side of caution it is recommended not to apply them to ordinal data [92]. Although a Mann-Whitney U test would be a non-parametric alternative, it has recently been argued that ordinal logistic regression, e.g. CLMs, would be preferable [90].

The core assumption behind CLMs is the so-called proportional odds assumption. In the analyses below, we checked this assumption for each test. If the assumption was violated, we modified the model to allow different scales for the variables in question (which is the preferred way to approach this violation of assumptions and is directly supported by the ordinal package).
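The analysis itself was run with clm() from the R package ordinal; as a rough Python analogue of the same idea (a proportional-odds model without scale effects, with a plain likelihood-ratio test standing in for the ANODE test; file and column names are hypothetical):

```python
# Rough analogue of the CLM analysis (not the R 'ordinal' pipeline, and
# without scale effects): fit a cumulative-link logit model to the 1-5
# Likert ratings and test one factor with a likelihood-ratio chi-square.
import pandas as pd
from scipy.stats import chi2
from statsmodels.miscmodels.ordinal_model import OrderedModel

df = pd.read_csv("ratings.csv")   # hypothetical: rating, expl_correct, expl_incorrect
df["rating"] = pd.Categorical(df["rating"], categories=[1, 2, 3, 4, 5], ordered=True)

def fit_clm(columns):
    """Cumulative-link (proportional odds) model on dummy-coded predictors."""
    X = pd.get_dummies(df[columns], drop_first=True).astype(float)
    return OrderedModel(df["rating"], X, distr="logit").fit(method="bfgs", disp=False)

full = fit_clm(["expl_correct", "expl_incorrect"])
reduced = fit_clm(["expl_incorrect"])          # drop the factor under test

lr = 2 * (full.llf - reduced.llf)              # likelihood-ratio statistic
dof = full.model.exog.shape[1] - reduced.model.exog.shape[1]
print(f"LR chi2 = {lr:.3f}, df = {dof}, p = {chi2.sf(lr, dof):.3g}")
```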


Fig. 5. Distribution of satisfaction ratings per group for all data points, as well as for only those instances in which users agreed with the AI-system’s output and those in which users disagreed with the AI-system’s output. Numbers below the x-axis indicate the total number of points in each group (as per Table 3).

Table 5

Summary statistics for the data presented in Fig.5.

How satisfying were the explanations?

PC:EF PC:EF PC:EF PC:EC PC:EC PC:EC

PI:EF PI:ECC PI:ECI PI:EF PI:ECC PI:ECI

Overall

Mean 3.041 3.065 2.99 2.635 2.633 2.738

Std 1.627 1.422 1.537 1.362 1.405 1.32

Median 3.0 3.0 3.0 2.0 2.0 3.0

Mode 1.0 4.0 1.0 1.0 1.0 2.0

Users agree with correct output

Mean 4.394 4.225 4.264 3.543 3.664 3.672

Std 0.81 0.728 0.776 1.132 1.139 1.058

Median 5.0 4.0 4.0 4.0 4.0 4.0

Mode 5.0 4.0 5.0 4.0 4.0 4.0

Users disagree with incorrect output

Mean 1.694 1.919 1.61 1.738 1.688 1.78

Std 0.998 0.923 0.781 0.895 0.852 0.754

Median 1.0 2.0 1.0 1.0 1.0 2.0

Mode 1.0 1.0 1.0 1.0 1.0 2.0

4.2.1. Satisfaction ratings for individual explanations

Fig. 5 presents a visual representation of the Likert scale satisfaction ratings of the individual responses provided during the classification task for each given explanation (i.e., the answers to the question “How satisfying did you find the explanation in terms of understanding why the system made its decision?”). Table 5 contains the corresponding descriptive statistics. An interesting observation is that the rating was strongly dependent on whether or not participants agreed with the system output, as revealed by splitting the responses accordingly (Fig. 5, bottom-row plots).

In this context, it is also possible to consider the inconsistent data points that are excluded from the remaining analysis (Fig. 6). It is, for example, apparent that this observation remains true even when user behaviour is inconsistent, as illustrated in the bottom-row plots of Fig. 6, showing that users’ expectations were likely an important factor in their satisfaction. Although visual inspection of the two figures suggests that the actual system performance was also relevant (for example, satisfaction appeared to be higher when the system was actually correct; see the bottom-row left plots in Figs. 5 and 6), the heavily unbalanced number of data points in each group prevents strong statements to this effect.

Statistically, we found a significant effect only for the explanation type when the system output was correct (Table 6): in this case, factual explanations scored higher. Our first hypothesis (H1) is therefore supported, but H2 is not.


Fig. 6. Distribution of satisfaction ratings per group for all instances in which users showed inconsistent behaviour: disagreeing with a correct output or agreeing with an incorrect output. The figure is the complement to Fig. 5. Numbers below the x-axis indicate the total number of points in each group.

Table 6

Summary table (ANODE Type II tests) for effects of explanation type on the satisfaction rating for individual explanations based on a CLM model. All variables were adjusted for scale effects due to assumption violations.

Explanation type for LRχ2 Df Pr(>χ2)

Correct output 35.544 1 2.5e-09*

Incorrect output 0.987 2 0.610

Correct×Incorrect 2.770 2 0.250

This is also illustrated by the separation into correct and incorrect outputs shown in Fig. 5, in which higher satisfaction can be observed for the groups that received factual explanations for correct outputs, while none of the explanation types for incorrect outputs appeared to modulate satisfaction. Overall, this indicates that none of the explanation designs used in the present study – including, surprisingly, counterfactual explanations – were optimal when end users expected the AI-system to produce an output different from what was actually produced.

4.2.2. Overall satisfaction, completeness and content of explanations

Since the overall impression of explanations can differ from individual responses, we next turned to responses regarding overall aspects of the explanation system. Fig. 7 summarises the responses regarding overall satisfaction with the explanations, whether they had sufficient detail, appeared complete, and helped in understanding the AI-system. The corresponding descriptive statistics are given in Table 7. For all questions, we again found a significant effect only for explanation type when the system output was in line with the true class of the text, with participants, as before, assigning higher scores to these measures for factual explanations; this is in line with H1, while H2 receives no support (see Table 8).

We also analysed whether participants needed additional information in the explanations through a content analysis of the answers to the open-ended question on additional information. Specifically, the comments were first analysed to identify categories and then labelled manually by one coder. The process of extracting the codes was iterative, going from an initial value of 11 to a final value of 5. For groups [PC:EF/PI:EF], [PC:EF/PI:ECC], [PC:EF/PI:ECI] and [PC:EC/PI:ECC], around a third of the participants (11-13) in each group commented that they did not need additional information. For groups [PC:EC/PI:EF] and [PC:EC/PI:ECI], however, only 6 and 4 participants respectively indicated no need for additional information. This is interesting insofar as the groups covered by both our hypotheses (H1: [PC:EF/PI:EF] and [PC:EF/PI:ECI]; H2: [PC:EF/PI:ECC] and [PC:EC/PI:ECC]) showed the least need for additional information. In terms of what additional information participants would have liked, we found that it clustered into the following five categories (see evidence for each category in Table 9):

Context. This suggestion was voiced by participants in all groups and refers to the fact that our explanations only contained single words, while participants would have liked to see sentences and connections between words, colloquialisms, slang, or the various meanings of the words in relation to the context of the text.


Fig. 7. Answers to the Explanation-system questionnaire and classification task. See Table 4 for the complete question phrasing.

Table 7

Summary statistics for the data presented in Fig. 7.

Item                        PC:EF    PC:EF    PC:EF    PC:EC    PC:EC    PC:EC
                            PI:EF    PI:ECC   PI:ECI   PI:EF    PI:ECC   PI:ECI
Explanations of how the system classifies text are satisfying
  Mean                      3.452    3.750    3.290    2.933    2.719    2.828
  Std                       1.121    0.928    0.938    1.081    1.054    1.002
  Median                    3        4        3        3        3        3
  Mode                      3        4        3        3        3        2
Explanations have sufficient detail
  Mean                      3.129    3.321    2.968    2.900    2.844    2.517
  Std                       0.176    0.945    1.197    1.242    1.194    0.949
  Median                    3        3        3        3        3        2
  Mode                      3        3        3        2        4        2
Explanations appear complete
  Mean                      3.387    3.357    2.936    2.633    3.156    2.552
  Std                       1.202    0.911    1.093    1.129    1.221    0.985
  Median                    3        3.5      3        2        3        3
  Mode                      2ᵃ       4        3        2        4        3
Explanations help understand the system
  Mean                      4.290    4.179    4.161    3.767    4.063    3.690
  Std                       0.739    0.548    0.688    0.858    0.878    0.930
  Median                    4        4        4        4        4        4
  Mode                      4        4        4        4        4        4

ᵃ Multiple modes exist; the smallest value is shown.

Factual explanations. All groups that received a counterfactual explanation when the system output did not align with the true class (i.e., all groups except for [PC:EF/PI:EF] and [PC:EC/PI:EF]) had participants who asked for factual explanations, i.e., why the predicted class was chosen instead of the alternative one included in the counterfactual explanation. This supports the previous result that counterfactual explanations, at least by themselves, appear to be insufficient.

Metrics. Some participants asked for quantified information, such as a matching percentage for each class.

Global model, process and behaviour. More information about the classification process at a higher abstraction level, and about why the system selected some words and not others, was requested.

More examples. Some participants would simply have liked to see more words in the explanations, for example, more keywords that contributed positively and negatively to the final prediction.

Other additional suggestions related to the constraints of the study itself. For example, some participants would have liked more refined categories and subcategories for classification.


Table 8

Summary table (ANODE Type II tests) for effects of explanation type on the ratings presented in Fig. 7. Statistically significant effects are indicated in bold. Test assumptions are satisfied in all cases.

Explanation type for        LR χ²    Df   Pr(>χ²)
Explanations of how the system classifies text are satisfying
  Correct output            17.612   1    2.7e-05
  Incorrect output          1.372    2    0.503
  Correct × Incorrect       3.071    2    0.215
Explanations have sufficient detail
  Correct output            4.804    1    0.028
  Incorrect output          2.599    2    0.273
  Correct × Incorrect       0.272    2    0.873
Explanations appear complete
  Correct output            6.344    1    0.012
  Incorrect output          5.730    2    0.057
  Correct × Incorrect       2.329    2    0.312
Explanations help understand the system
  Correct output            8.366    1    0.004
  Incorrect output          1.585    2    0.453
  Correct × Incorrect       4.099    2    0.129

Table 9

Example evidence for the main classes of requests for additional information voiced by participants, as well as the groups this evidence is sampled from.

Context

“The AI seemed to pounce on certain words and not look at the context” (PC:EF/PI:EF)

“I would’ve preferred the explanations to contain more contextual information rather than assumptions made with the absence/presence of certain words” (PC:EC/PI:ECC)

“Some description of whether or not the AI recognises context in anyway. The impression given by the explanations is that the AI relies on simple word matching, which is plainly insufficient” (PC:EC/PI:ECI)

Factual explanations

“The explanations told when it didn’t classify the text as an option because certain words weren’t used. I would’ve liked to have seen more about how it considered which words that WERE used” (PC:EC/PI:ECC)

“I would have liked additional information on why the incorrect choices were made by the AI, and not the reasons why the correct answer was not chosen” (PC:EF/PI:ECC)

“I would have liked to see why the AI system classified texts based on which were included (as opposed to the model of which words were omitted)” (PC:EC/PI:ECI)

Metrics

“I think that the AI should have looked for more keywords to gauge the subject of each email. Perhaps adding a percentage meter showing how much the subject matches its findings would be helpful” (PC:EF/PI:ECI)

“Information on why certain classifications were excluded, if the system is based on ‘scoring’ significant words, information on scores would be useful” (PC:EF/PI:ECI)

Global model, process and behaviour

“Anything that helped me understand better what is the algorithm behind the classification system employed by the AI” (PC:EF/PI:ECC)
“I would like to know how it selected the relevant words and why it ignores other words” (PC:EC/PI:EF)

“The amount of words that the AI has associated for each category” (PC:EF/PI:ECC)

More examples

“More words that it picked up to decide the category in which the text fits into” (PC:EC/PI:EF)

“I would have liked the explanations to include more words that they screened and more words that they ruled out as being relevant to the category” (PC:EF/PI:ECI)

For instance, some felt that the “science” category was too broad (“I think the categories Science, Leisure and Politics could have been broken down further, for example, Science could also be Medicine or Physics, that would have made the AI more accurate in my opinion”; [PC:EF/PI:ECC]) or that there simply could be more categories (“There are certain situations in which more categories would be beneficial. A category for sport, religion, medicine, etc. would likely be beneficial if properly implemented” [PC:EC/PI:ECC]). It is worth bearing in mind that the whole newsgroups dataset contains 6 topics and 20 subcategories, of which 3 topics were used in this study.

4.3. Perception of the AI-system

One of the goals of using XAI solutions can be to support creating an understanding of the inner workings of an AI-system; for example, such that users are able to build an accurate mental model of the system. Here, we analysed the effect that different types of explanations had on this. Table 10 lists the measures and corresponding questions that were used to this effect.

4.3.1. Perceived understanding of the AI-system’s inner workings

We first looked at whether participants felt they could understand the system locally, globally, both locally and globally, and in terms of its limitations and the mistakes it made (see the visualisations of the responses in Fig. 8 and the full questions in Table 10).


Table 10

Measured variables and corresponding questions investigating participants’ perception and understanding of the functionality of the AI-system based on the explanations provided. Note that the questions regarding the perceived performance of the AI-system were open-ended questions.

Understanding of the AI-system:
  “The explanations provided help me to understand a particular prediction made by the AI-system.” (Fig. 8a)
  “The explanations provided help me to understand the global behaviour of the AI-system.” (Fig. 8b)
  “The explanations provided help me to understand a particular prediction made by the AI-system but also the global behaviour of the AI-system.” (Fig. 8c)
  “The explanations provided help me to understand the limitations and mistakes of the AI-system.” (Fig. 8d)

Predictability of the AI-system:
  “I know what will happen the next time I use the AI-system because I understand how it behaves.” (Fig. 9a)
  “The outputs of the AI-system are very predictable.” (Fig. 9b)

Performance of the AI-system:
  “Do you think the AI-system performed well considering the classifications you have seen?” (N/A)
  “From your point of view, does the AI-system need improvement?” (N/A)
  “Do you think that the AI-system classifies the different types of text equally?” (N/A)

Fig. 8. Answers to the AI-system questionnaire measuring whether users believe that explanations help to understand the behaviour of the text classifier (a) locally, (b) globally, (c) locally and globally, as well as whether they help to understand (d) the classifier’s limitations and the causes of its mistakes. See Table 10 for the complete phrasing of the questions used.

We found effects of the explanation type, both for correct and for incorrect outputs with respect to the ground truth, only for local understanding (Table 12), with higher scores given to factual explanations for correct output and to counterfactual-with-correct-class explanations for incorrect output (see the summary statistics in Table 11). This is in line with both H1 and H2. We also found that the explanation type for incorrect output had a significant effect on the perceived ability to understand the system’s limitations and mistakes (Table 12), again with higher scores given to counterfactual-with-correct-class explanations.

Lastly, it is worth noting that the best overall ratings (in terms of most positive and least negative) among [PC:EF/PI:EF]–[PC:EF/PI:ECI] and among [PC:EC/PI:EF]–[PC:EC/PI:ECI] are observed for [PC:EF/PI:ECC] and [PC:EC/PI:ECC], respectively; i.e., for the groups that received counterfactual-with-correct-class explanations when the system output was not in line with the true class of the text. This provides partial qualitative support for H2, although no strong statements can be made given that, statistically speaking, no significant effects were observed.

4.3.2. Ability to predict AI-system behaviour and performance

Prediction tasks are quick windows into users’ mental models of the system [81]. We measured both the participants’ ability to predict the AI-system’s outputs and whether they considered the system to be predictable (Fig. 9 and the corresponding summary statistics in Table 14).

Table 11

Summary statistics for the data presented in Fig. 8.

Item                        PC:EF    PC:EF    PC:EF    PC:EC    PC:EC    PC:EC
                            PI:EF    PI:ECC   PI:ECI   PI:EF    PI:ECC   PI:ECI
Understand local behaviour
  Mean                      3.968    4.071    3.710    3.600    3.844    3.414
  Std                       0.795    0.604    0.938    0.724    0.920    0.825
  Median                    4        4        4        4        4        3
  Mode                      4        4        4        4        4        4
Understand global behaviour
  Mean                      3.839    3.893    3.645    3.567    3.844    3.310
  Std                       0.934    0.685    0.985    1.006    0.884    1.073
  Median                    4        4        4        4        4        3
  Mode                      4        4        4        4        4        4
Understand local and global behaviour
  Mean                      3.516    3.714    3.581    3.300    3.625    3.207
  Std                       1.029    0.600    0.886    0.988    1.008    0.819
  Median                    3        4        4        3.5      4        3
  Mode                      3        4        3        4        4        3ᵃ
Understand limitations and mistakes
  Mean                      3.807    4.286    3.839    4        4.313    3.586
  Std                       1.138    0.854    0.969    1.017    0.644    0.733
  Median                    4        4        4        4        4        4
  Mode                      5        4        4        4        4        4

ᵃ Multiple modes exist; the smallest value is shown.

Table 12
Summary table (ANODE Type II tests) for effects of explanation type on the ratings presented in Fig. 8. Statistically significant effects are indicated in bold. Variables that violated test assumptions and were thus included in scale effects in their respective CLMs are indicated with an asterisk.

                                        Explanation type for       LR χ2    Df   Pr(>χ2)
Understand local behaviour              Correct output             5.187    1    0.023
                                        Incorrect output           6.947    2    0.031
                                        Correct × Incorrect        0.383    2    0.826
Understand global behaviour             Correct output             1.766    1    0.184
                                        Incorrect output           4.046    2    0.132
                                        Correct × Incorrect        1.060    2    0.588
Understand local and global behaviour   Correct output             1.378    1    0.240
                                        Incorrect output           3.190    2    0.203
                                        Correct × Incorrect (*)    1.238    2    0.538
Understand limitations and mistakes     Correct output (*)         0.950    1    0.330
                                        Incorrect output (*)       18.71    2    8.6e-05
                                        Correct × Incorrect (*)    1.993    2    0.369

Fig. 9. Answers to the AI-system questionnaire measuring predictability.

We only found a significant effect of the explanation type for correct output on own ability to predict (Table 13), which is partial support for H1.


Table 13
Summary table (ANODE Type II tests) for effects of explanation type on the ratings presented in Fig. 9. Statistically significant effects are indicated in bold. Test assumptions are satisfied in all cases.

                          Explanation type for      LR χ2    Df   Pr(>χ2)
Own ability to predict    Correct output            6.854    1    0.009
                          Incorrect output          1.064    2    0.587
                          Correct × Incorrect       2.540    2    0.281
System predictability     Correct output            2.399    1    0.121
                          Incorrect output          2.305    2    0.316
                          Correct × Incorrect       1.639    2    0.441

Table 14
Summary statistics for the data presented in Fig. 9.

Item                      PC:EF   PC:EF    PC:EF    PC:EC   PC:EC    PC:EC
                          PI:EF   PI:ECC   PI:ECI   PI:EF   PI:ECC   PI:ECI
Own ability     Mean      3.871   3.643    3.484    3.167   3.469    3.172
to predict      Std       0.991   0.951    1.122    1.117   0.879    1.197
                Median    4       4        4        3       3        3
                Mode      4       3        3 (a)    2       3        4
System          Mean      3.452   3.357    3.161    2.933   3.344    2.966
predictability  Std       1.121   1.129    0.969    1.048   1.035    1.267
                Median    4       3        3        3       3        3
                Mode      4       3 (a)    3        2       3        2

(a) Multiple modes exist; the smallest value is shown.

Fig. 10. Answers to the question “I would have liked to have an interactive explanation system that would answer my questions.”

No statistically significant differences were found (see Table 15).

4.3.3. Perceived performance of the AI-system

Lastly, we examined whether participants picked up on the hidden model of the system (see Methods section; the politics, science and leisure classes had accuracies of 100%, 50% and 0%, respectively) through open-ended questions about the perceived performance (see Table 10). There was a great variety of answers to these questions. Many participants highlighted that the AI-system needed improvement and suggested a wide range of amendments. Some were in line with previous suggestions for additional content for the explanations (e.g., consider context, use refined categories and display a larger number of relevant words).

Overall, we found that between 34% (for [PC:EC/PI:ECC]) and 50% (for [PC:EF/PI:ECC]) of the participants recognised the hidden model, and an additional 14-20% partially recognised it (for example, picking up that the leisure category demonstrated worse performance). Interestingly, participants attempted to rationalise this, speculating, for example, that this category has a broader vocabulary than, e.g., science, which might have more specialised keywords. However, there were no notable differences between the groups. As such, while these open-ended questions yielded insights into participants’ ability to infer hidden models, there was no strong support for either hypothesis.

4.3.4. Perceived need for an interactive explanation system

As a final point of interest, we considered the possibility that participants might have preferred an interactive explanation system rather than the type used here. Likert scale responses to “I would have liked to have an interactive explanation system that would answer my questions” are shown in Fig. 10 and the corresponding summary statistics in Table 16. Although there were variations between the groups, around one third of the participants would have liked an interactive explanation system, while the rest showed no clear preference ([PC:EC/PI:ECC] was the group most inclined to use interactivity).


Table 15
Summary table (ANODE Type II tests) for effects of explanation type on preference for an interactive system (see Fig. 10). No test assumptions were violated.

Explanation type for      LR χ2    Df   Pr(>χ2)
Correct output            1.274    1    0.259
Incorrect output          0.722    2    0.697
Correct × Incorrect       0.624    2    0.732

Table 16
Summary statistics for the data presented in Fig. 10.

Item                  PC:EF   PC:EF    PC:EF    PC:EC    PC:EC    PC:EC
                      PI:EF   PI:ECC   PI:ECI   PI:EF    PI:ECC   PI:ECI
Would prefer  Mean    2.839   3.179    3        3.133    3.125    3.207
interactive   Std     1.241   1.188    1.065    1.279    1.289    0.978
system        Median  3       3        3        3        3.5      3
              Mode    2       3        2        3 (a)    4        3

(a) Multiple modes exist; the smallest value is shown.

Table 17
Example evidence for interactive explanation system designs suggested by participants.

Human in the loop
  “The system could ask whether a preliminary classification based on certain words was accurate and the user could indicate whether it had misinterpreted certain words.” (PC:EF/PI:ECI)
  “If the AI categorized a subject, you have the ability to select certain words in the email to help explain why it fits a different category.” (PC:EC/PI:ECC)

Questions
  “It should be like chat service and answer basic questions.” (PC:EC/PI:EF)
  “I could ask the AI pre-determined questions that would give me deeper insight as to why the AI chose this and this option instead of that.” (PC:EC/PI:ECC)
  “I want to be able to debate the machine.” (PC:EC/PI:ECI)

Show classification process
  “It should be a step by step process, maybe even showing the intermediate steps, what was the idea of the AI at a certain point and why it changes.” (PC:EC/PI:EF)

Interaction with the global model
  “Yes, the system should have a definition for each three of the words and show particular keywords that fall under each category.” (PC:EC/PI:EF)
  “Probably something like having a list of words that were or were not used in the text. Then you could click on a word and it would explain whether or not the word was considered and how that affected the classification.” (PC:EC/PI:ECC)

Interaction with the text
  “Maybe when you highlight certain key words in text it tells you how it classifies them.” (PC:EF/PI:ECC)
  “Perhaps a system with highlighted keywords that, when clicked, provide more details on why this was considered.” (PC:EC/PI:EF)
  “Hovering over words would tell you what category the AI thought the word should fit into.” (PC:EC/PI:EF)
  “If you hover over a sentence I would like to see how it interprets the category it belongs to.” (PC:EC/PI:ECC)

Along the same line, answers to the open-ended question “If you would have liked to have an interactive explanation system, what would you like that system to be like?” showed that around one third (between 9 and 12) of the participants from each group explicitly indicated that they did not need or want an interactive system (min. [PC:EF/PI:ECC] and [PC:EC/PI:ECI] with 9 participants, max. [PC:EF/PI:ECI] with 12). Participants did, however, provide interesting suggestions for future systems, which we summarise here (with example evidence listed in Table 17):

Human-in-the-loop. Several participants expressed a wish to input information to the system, so that the system would learn from them.

Questions. One of the most common requests regarded the possibility of talking to the AI-system and asking questions, as one would with chatbots and current personal assistant devices such as Siri or Alexa.

Show the classification process. Several participants would like to have insight into the classification process itself.

Interaction with the global model. Several participants would like to build intuitions about how the model worked. For instance, they would like to see the words associated with each category and select them to see their influence on the classification outputs.


Interaction with the text. Under this category, we have placed all the proposals that suggested that the most important words should be highlighted for selection and further exploration, and the possibility of hovering over them to see how the AI-system would classify them and show the confidence/accuracy of the prediction (a minimal sketch of such word-level scores follows after this list).

Other suggestions included, for example, a personalised system that matches expectations (“as personalized as possible to my expectations and needs” [PC:EF/PI:ECC]) or the use of voice and sonification (“something with voice-over and maybe like an e-book, so we could relate the speech of the person to the text, since tone of voice matters quite a bit when speaking and/or reading” [PC:EF/PI:EF]).
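To make the “interaction with the global model” and “interaction with the text” suggestions concrete, the sketch below shows one way per-word scores could be computed for a linear bag-of-words classifier: each word’s contribution to a class score is its TF-IDF weight multiplied by the corresponding model coefficient. This is only an illustration under that assumption; it does not reproduce the classifier or explanation method used in the study, and the toy training texts and function names are hypothetical.

    from sklearn.feature_extraction.text import TfidfVectorizer
    from sklearn.linear_model import LogisticRegression

    # Toy training data (hypothetical); the study's own texts are not reproduced here.
    texts = [
        "the senate passed the new budget bill",
        "the physics experiment confirmed the old theory",
        "the team won the cup final in extra time",
    ]
    labels = ["politics", "science", "leisure"]

    vectorizer = TfidfVectorizer()
    clf = LogisticRegression(max_iter=1000).fit(vectorizer.fit_transform(texts), labels)

    def word_contributions(text, target_class):
        """Per-word contribution of `text` to the linear score of `target_class`."""
        row = vectorizer.transform([text])
        k = list(clf.classes_).index(target_class)
        vocab = vectorizer.get_feature_names_out()
        # TF-IDF weight of each word present in the text times its class coefficient.
        return {vocab[j]: float(clf.coef_[k, j] * row[0, j]) for j in row.nonzero()[1]}

    # A hover/highlight interface could colour each word by such a score:
    print(word_contributions("the budget vote in the senate", "politics"))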

5. Discussion

5.1. Main result

Our results first showed that factual explanations given when the system’s output corresponded to the true class of the text received statistically significantly higher scores than counterfactual explanations in nearly all aspects we considered (specifically satisfaction, completeness, detail and system understanding), in line with our first hypothesis.

However, we found no strong evidence to support the second hypothesis, namely that counterfactual explanations that included the expected output were most appropriate when the system output did not match end user expectations. We can note that the groups that received this kind of explanation, [PC:EF/PI:ECC] and [PC:EC/PI:ECC], tended to score higher than the others regarding explanation completeness and system understanding, with [PC:EF/PI:ECC] nearly always providing the best ratings, but there was no evidence to support any claims of statistical significance. We further found evidence that participants in [PC:EF/PI:ECC] may have built the most accurate mental model of the AI-system.

Overall, this indicates that counterfactual explanations (with the correct class) most likely capture part of what users look for in an explanation when the system output does not match their expectations, but are unlikely to be sufficient by themselves. This aspect will require further investigation. In particular, we noticed, in answers to our open-ended questions, that several participants who received counterfactual explanations for incorrect predictions suggested that they would have liked to receive factual explanations. However, data from groups that did receive this ([PC:EF/PI:EF] and [PC:EC/PI:EF]) indicated that this was also not sufficient. In future work, it may be interesting to investigate whether a combination of factual and counterfactual explanations when the AI-system’s outcomes do not align with end users’ expectations might show better results, and to assess whether this hybrid type of explanation is a suitable alternative to the approaches considered here.

Our results therefore suggest that it is necessary for explanation-generating systems to somehow infer the output that users expected of an AI-system. The clear role for this information, as a way to decide when a factual explanation is the most appropriate, is demonstrated by the support we found for our first hypothesis. This is important since, as discussed at the beginning, it implies that explanation systems might therefore need a model of the end user to estimate these expectations. Interestingly, this also aligns with very early research on the design of Intelligent Tutoring Systems (1970s-1980s), as reviewed in [93, pp. 33-34], where the explanation system required both “subject knowledge” and “teaching knowledge,” and thus a method/model for how to interact with the learner.
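To make the practical implication explicit, the following sketch outlines the selection policy these results point to: default to a factual explanation when the predicted class matches the (inferred) user expectation, and otherwise use the expected class as the foil, here combined with a factual explanation as discussed above. It is a simplified illustration, not an implemented system; in particular, inferring the expectation (the `expected` argument) is exactly the open problem discussed in this section, and all names are hypothetical.

    from dataclasses import dataclass
    from typing import Dict, List, Optional

    @dataclass
    class Explanation:
        kind: str   # "factual" or "counterfactual"
        text: str

    def select_explanations(predicted: str, expected: Optional[str],
                            factual: str,
                            counterfactuals: Dict[str, str]) -> List[Explanation]:
        """Pick explanation(s) given the prediction and the user's inferred expectation.

        `factual` explains why `predicted` was produced; `counterfactuals[c]`
        explains why `predicted` was produced instead of class `c`.
        """
        if expected is None or expected == predicted:
            # Expectation unknown or met: factual explanations were rated best (H1).
            return [Explanation("factual", factual)]
        # Expectation violated: a counterfactual whose foil is the expected class
        # captures part of what users want, but our data suggest it is not enough
        # on its own, so a hybrid factual + counterfactual presentation is one
        # candidate (to be evaluated in future work).
        return [
            Explanation("factual", factual),
            Explanation("counterfactual", counterfactuals[expected]),
        ]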

While we did not find support for our second hypothesis, the data collected showed that the most appropriate explanations to give when the system output does not match what users expected likely need to contain more than just a foil addressing the expected output. As such, this provides empirical support for recent theoretical arguments in favour of hybrid approaches to counterfactual explanations [18]. This is also relevant because counterfactuals are thought to be a critical component in human-human interaction [41], and it is commonly thought that such results would translate to interactions with machines [33,79,80].

5.2. Expectations, satisfaction and model accuracy

Research in XAI, or more generally in AI-system acceptance, rarely explores users’ expectations of system outputs [94]. Our results demonstrated that they are indeed a relevant factor in producing satisfying explanations, since our users explicitly reported higher satisfaction if the output of the system was what they expected, even if the AI-system output and the users’ expectation were both wrong. Such a result cannot be fully explained by simply arguing that model accuracy has a strong effect on explanation satisfaction. There are two relevant studies in this regard, albeit focusing on trust. The first [95] studied the influence of model accuracy and explanation fidelity on trust in AI, demonstrating that the systems’ accuracy levels were most decisive for user trust: the higher the accuracy, the higher the user’s trust. Further, stated accuracy was in principle found to affect people’s trust in the model [96]. However, this trust was significantly affected by observed accuracy (i.e., after a chance to observe the model’s accuracy in practice) irrespective of its stated accuracy. Thus, both studies [95,96] demonstrated an impact of perceived model accuracy on trust. This is in line with our results, and highlights the importance of perceived, rather than actual, model accuracy in how users experience AI-systems.

References
