
Visualization of Chatbot Survey Results

Tor Kjäll

Uppsala University

ABSTRACT

Chatbots are an increasingly popular technology that has taken recent steps forward thanks to artificial intelligence. Research on chatbots spans many areas, but one overlooked area is the presentation of the data collected by the chatbot. This study aimed to explore what to think about, in terms of visualizations, when designing an interface to present chatbot results to novice users. Through a user study spanning several design iterations, the thesis addressed the research question: How do you visualize the results of a chatbot survey for novice users to facilitate the understanding of the data? To answer this question, four design iterations with a total of 17 participants were conducted, resulting in a final prototype. The gathered data was then analyzed using thematic analysis. The results did not suffice to answer the research question, but they did give some suggestions on interface visualization, mainly the importance of using visualizations that are carefully selected to improve the understanding of the data presented. Moreover, this study suggests the importance of focusing on participants' deeper levels of understanding and their conceptual models.

Author Keywords

Chatbots; Survey; Visualizations; Human-Computer Interaction; Understanding;

INTRODUCTION

The chatbot industry is predicted to see a 31% compound annual growth rate from 2018 to 2024, exceeding $1.34 billion [13]. There are many uses for chatbots, and the technology is already being used within customer service, medicine, education, information services and more [13,6,20,21,10]. As more systems are developed and the number of use cases grows, the need for knowledge about how to present data in intuitive and understandable ways is bound to increase, and this is what this thesis wishes to explore. Several issues exist with chatbots, one of them being trust: there are, for example, studies on how customers can feel uncomfortable using chatbots and perceive them as less trustworthy than chatting with a human. Presenting the data in a proper and attentive manner could perhaps ease this problem. Another issue with chatbots is that much of the research conducted focuses on the chat interaction itself, even though other parts of a chatbot are also important. For example, after the data is collected by the chatbot, there is often a need to also present that data. This makes it interesting to look at how to visualize the collected data of a chatbot, especially since chatbots make it possible to work with big datasets. Massive datasets can be a challenge to work with, since data can be forgotten or disappear when there is a big quantity of it to be processed, and it is not clear how users interpret the large amounts of data presented to them either.

In this study, one specific chatbot has been chosen as the basis of the research. This chatbot is called Hubert and is a tool for customers who wish to collect data by setting up surveys. Instead of answering a survey in the traditional way, the users of Hubert are asked questions through a chat interface; this data is then collected by Hubert and presented on a results page to the customer who set up the survey. Hubert has some interesting features but is still in development. This specific chatbot was chosen for this study because it has the potential to become a powerful tool, and looking into how to present the results of the collected data can be of general interest for the chatbot research field, especially for attracting new users to adopt this technology. The structure of the data collected by Hubert differs from traditional surveys; for example, Hubert asks different types of follow-up questions depending on different variables, so the collected data can be a mix of qualitative and quantitative data, with different numbers of answers per user. Therefore, a prototype will be designed from the results of a Hubert survey to find ways of visualizing the data.

Research question

In this thesis, the aim is to explore what to think about in terms of visualizations when designing an interface in order to present chatbot results for novice users.

The research question of the thesis is: How do you visualize the results of a chatbot survey for novice users to facilitate the understanding of the data?

BACKGROUND

Chatbots can be used in collaboration with, or as a substitute to, humans. The use cases of chatbots are broad and there are already many different chatbots in different areas. Looking at the research, one previous study examined how chatbots could be used to overcome the response problems of web-based surveys, where, for example, users sometimes do not give sincere answers. The reason for this can be that the user is inattentive, which in turn can be caused by the lack of an interviewer [11]. The results of this study were that users interacting with the chatbot were less prone to giving satisficing answers, which implies that there are benefits to using chatbots over web-based surveys. The users were also more likely to give differentiated responses, compared to the users of the web-based survey [11]. Web-based surveys can collect and analyze large amounts of data quickly and inexpensively, more so than traditional surveys (e.g. face-to-face or telephone interviews). Web-based surveys also avoid problems that can arise in face-to-face interviews and telephone surveys, such as measurement errors [11]. The limitations of web-based surveys are that, since they are a self-administered method, they can lead to unreliable or inaccurate data, and that it is more difficult to get the respondent to clarify or elaborate on their answers. This is where conversational agents come to attention. “Conversational agents exploit natural language technologies to engage users in text-based information-seeking and task-oriented dialogs for a broad range of applications.” [14]. A common type of conversational agent in use today is the chatbot, and there are many different areas of application, like health, marketing, education and information services [4,6].

With the use of chatbots increasing, there is also plenty of current research on chatbots in different areas. One of these areas concerns conversational agents and trust. As some companies are starting to see chatbots as a complement to customer service [7], it is important that the user trusts the chatbot to provide the actual service. One study by Følstad et al. [7] looked at what factors affect users’ trust in customer service chatbots. An interview study was conducted with questions regarding the users’ experience with chatbots and the factors affecting their trust. Among the findings were the following: a chatbot should correctly interpret the user and provide informative and helpful responses, and a chatbot should be human-like in its way of communicating with the user, for example by having some kind of personal or relational style and communicating in a polite and humanlike manner. Another finding was that a chatbot should present itself by clearly communicating what it does and how it can help. This is in line with previous research, as a paper by Muir [15] brought up the issue of trust between humans and machines and discussed how trust impacts the way the user uses the machine: no matter how sophisticated or intelligent a system is, it may be rejected by the user if she does not trust it, and this could reduce the performance of the system, since the user could for example spend lots of time trying to direct the output towards her own decision, instead of trusting the system to lead the way for her. On the other hand, if a user trusts the system too much, there is a risk of the result not being satisfactory, for example by allowing the system to do things that a human actually would do better [15].

Another area where research is being conducted is the motivation to use chatbots. One study by van der Goot and Pilgrim [9] looked at how age differences influence the perception of customer service chatbots. The study suggested that both younger and older groups of users had the same main motivation: to get their customer query answered. Some of the users thought that chatbots were mainly useful for simpler types of questions, and that for more complex, urgent or personal matters it may be better to talk to a real human. Where the two age groups differed was that the older age group seemed to use chatbots more as a first way of getting in contact with a human (live agent), while the younger age group looked at chatbots more as a way to avoid human contact [9]. A similar study, by Brandtzaeg and Følstad [2], also researched the motivation for using chatbots and suggested that for the vast majority of users, productivity was the main reason for using chatbots. A large share of the users also reported using chatbots for entertainment, in the sense that it was fun and entertaining to use them.

These examples show that there is plenty of research going on about chatbots in different research areas, and it is without doubt an emerging (yet not new) technology that has taken steps forward with the advances in AI. An area that seems to be overlooked when it comes to chatbot research is the presentation of the data collected after the interaction with the chatbot. Naturally, an AI chatbot has the potential to collect data in both big and small quantities. The area of application of course influences how this data should be presented to the user, be it healthcare, marketing, customer service or others. Without presenting the collected data in an understandable way to the user, the benefits of using chatbots over other types of data collection methods would perhaps not be as great. Therefore, it is of interest to explore what to think about when visualizing the data collected by chatbots. One specific example of a use case for this is a chatbot called Hubert, which will be discussed in the following section.

Hubert

Hubert is interacted with through a window with a typical chat layout, shown in Figure 1. Here the user has a conversation with Hubert, where Hubert asks questions and, depending on the answers, follow-up questions about the chosen domain. There are around five pre-chosen questions about the specific domain, and these questions are partially customizable.

According to Hubert.ai [1], when a customer uses Hubert, the first step is to choose a domain, and the second step is to choose a question template that Hubert uses as the basis for the survey. During the conversation, Hubert interprets each response and determines whether it contains any useful information. If it does not, Hubert asks the user to be more precise, or to answer a follow-up question. The customer then activates the survey and sends it out via email or a link. Finally, Hubert automatically compiles the essence of the conversations and presents it in the surrounding interface. The data collected by Hubert is presented to the user who set up the survey. In this part of the interface, it is possible to see both an overview of what the users’ responses looked like, and a detailed view with more in-depth information about specific responses.
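To make this flow concrete, the following is a minimal sketch of such a survey loop. Everything here (the function names and the "useful information" heuristic) is a hypothetical illustration, not Hubert's actual implementation:

```python
# A minimal sketch of the survey flow described above. All names and the
# "useful information" heuristic are hypothetical, not Hubert's actual code.

def contains_useful_information(reply: str) -> bool:
    """Naive stand-in for the chatbot's interpretation step."""
    return len(reply.split()) >= 3  # treat very short replies as uninformative

def run_survey(questions, ask):
    """Ask each template question; probe once more if a reply is uninformative."""
    results = []
    for question in questions:
        reply = ask(question)
        if not contains_useful_information(reply):
            # As described above: ask the respondent to be more precise.
            reply = ask("Could you be a bit more precise? " + question)
        results.append({"question": question, "reply": reply})
    return results

if __name__ == "__main__":
    collected = run_survey(
        ["How was your overall experience?", "What could be improved?"],
        ask=lambda q: input(q + " "),
    )
    print(collected)  # the raw material that the results interface summarizes
```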

Figure 1. The chatbot interface of Hubert.

Design

Creating a good design can be challenging. One of the most influential authors when it comes to data visualization is Edward R. Tufte. In his work he describes how to present data in efficient, coherent and effective ways. He brings up how to design to communicate more information per unit, with the goal of presenting clear data. For example, he promotes the use of color to label, to measure, to enliven, and to represent. He clearly states that one should not use too many colors, because that affects the design negatively, and that, for example, mixing light, bright colors with white often makes reading the graphics very unpleasant. Another example concerns layering and separation. Clutter should be avoided, and a design should try to reveal the detail and complexity of the data. It is possible to consciously use layers and separation to arrange or sort the data in a way that makes differences in it stand out clearly [22].

According to Norman (cited in [8,19]), an interface can be looked at in two parts: the system side and the human side. We affect the system side through design, and the human side through training and experience. For novice users, it takes much more effort and planning to go from their intentions to an action sequence than it does for experienced users. This means that where an expert user would use the interface with minimal effort, a novice user is slowed down by the planning process. However, this can be aided if the user has a good conceptual understanding of the system [19]. There are three different concepts: “First, there is the conceptualization of the system held by the designer; second, there is the conceptual model constructed by the user; and third, there is the physical image of the system from which the users develop their conceptual models.” [19]. The conceptual model of the designer is called the Design Model, and the conceptual model held by the user is called the User’s Model [19]. The third concept is called the System Image and is the physical image of the system from which the User’s Model develops. This division into different models helps in understanding that a design can be understood or interpreted differently. The designer is of course the person responsible for designing the system, and this means that she should take into account things such as the user’s previous background, technical experience, and limits in the planning process [8]. The user then develops her model from the System Image, so the User’s Model can and often will differ from the Design Model, which makes it important for the designer to put effort into the design of the System Image. Since the designer only talks to the user through the system itself, it is important that the designer pays attention to everything the user interacts with in the system, for example buttons, graphs, and help sections. It is also important that the designer makes the System Image consistent and explicit [19]. This way of thinking about the different models can help in the successful design of a system.

These design principles and concepts will be considered during the design of this study’s prototype.

METHOD

The study was built around a test scenario, which was about letting the participants take on the role of a hiring manager using Hubert as a survey tool to collect information about a company’s recruitment process. The Hubert survey used in this study was an authentic survey already completed by real users taking part in a recruitment process. One design was created and tested in each iteration.

In total, 17 participants were included in the study, over four iterations. There were three participants per iteration in the first three iterations, and eight participants in the final iteration. The earlier iterations served as formative sessions to arrive at a design, and the last iteration served to validate the design. In each iteration, a user study was conducted in which the users were given tasks to perform in the interface, to see how they used and understood it. Semi-structured interview questions about the tasks were asked in the same session, during the given tasks. The researcher also made observations by noting down interesting things that occurred. After the tasks, a shorter interview took place. All of the sessions were recorded with screen and audio of the participants, so that the data could be revisited for analysis. A thematic analysis [3] was conducted on the final iteration, where the users’ answers to the tasks and interview questions were analyzed and sorted into themes for the results section.

Design process

In the first iteration (see Appendix A), the current look of Hubert was tested. This design contained sections divided into so-called cards to separate the data into different areas. It also contained some summarized data of the answers.

In the second iteration (see Appendix B), a new design of the detailed view was tested. Here the idea was to use the bar chart of one of the answers to highlight different areas with color. For example, clicking the number 4 (which was colored orange) on the scale rating of the bar chart would display all the answers with that rating, while the other answers would be dimmed (but still visible). The use of distinct colors also made it very clear to the participants which answers belonged to which reply. In this iteration, some improvements to the layout of the main page were also made, like making better use of the space by widening the design and separating some of the elements more, but the main focus of this iteration was on the detailed view.
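As a minimal sketch of this highlight-on-click behavior (the data and opacity values below are illustrative assumptions, not the prototype's actual values):

```python
# Hypothetical sketch of the highlight-on-click idea from the second
# iteration: selecting a scale rating keeps matching answers fully visible
# and dims the rest (they remain visible).
answers = [
    {"text": "Okay process overall", "rating": 4},
    {"text": "Very smooth and quick", "rating": 9},
    {"text": "Slow replies from recruiters", "rating": 4},
]

def highlight(answers, selected_rating):
    """Return the answers with an opacity reflecting the selected rating."""
    return [
        {**a, "opacity": 1.0 if a["rating"] == selected_rating else 0.3}
        for a in answers
    ]

for a in highlight(answers, 4):
    print(a)  # the two rating-4 answers get opacity 1.0, the other 0.3
```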

The third iteration (see Appendix C) saw big improvements in the layout. It was found that the participants preferred graphical elements over text, so in many areas text was transformed into graphics to help the participants interpret the data better. Hints that the participant could hover over to get more information about certain sections were also added. In the detailed view, improvements were made in the sections referring to the tagging that Hubert does (sentiment, emotions, topics), by highlighting these and giving them text explanations to make them understandable.

In Figure 2, the final iteration, iteration four, is presented. This iteration improved the design by clearly placing the question above each answer, something that helped the participants make better sense of the answers. Each question is presented under its own headline, and the headline is named so that it relates to the question. Under the questions are the summarized answers. Depending on the type of question, the summary is either a graphical presentation, such as a chart, or a presentation in text. In Figure 3, the detailed view is presented. The detailed view consists of a more in-depth presentation of the answers to one specific question. Here it is possible to see all of the information gathered in the survey about that question, for example the answers from the users and the tags that Hubert makes. The participant gets to this page by clicking one of the summarized answers on the main page. For a bigger version of Figure 2, see Appendix D. For a bigger version of Figure 3, see Appendix E. For another part of the design that is not presented in this section, see Appendix F.


Figure 3. The design of the detailed view.

Data collection

The sample of the last iteration consisted of eight participants. Since the study was conducted remotely, all of the participants used their own computers and stayed in their homes. The sessions were all conducted on either Skype or Zoom. All participants were given a document containing information about the study, followed by a consent form to agree to. Once the participants had understood and confirmed their participation in the study, they were given a document containing the test scenario (see Appendix G). After this was read, the participants were asked about the test scenario to make sure that they actually understood it, and that their understanding of Hubert was correct (see Appendix H). If they did not understand it correctly, they were asked to read the scenario again or, depending on what they were missing, given a verbal explanation. Ethical aspects considered in this study were that all of the participants should be well informed about the layout of the study and about their right to withdraw their participation at any time. This was explained to them in the information document and the consent form. Another important aspect was how the data was collected and handled, so the participants were informed that all of the data would be anonymized and not shared with parties not directly involved in this study.

The tasks (see Appendix I) were given to the participants one at a time, and they contained questions related to them and the participants’ understanding of Hubert. Finally, a semi-structured interview was conducted (see Appendix J).

Data analysis

The audio recordings were transcribed (see Appendix K), and the video recordings were used to supplement the audio transcriptions where it was not fully clear what the participant said. The video recordings were also used to make observations of what the participants did during the tasks.

A thorough thematic analysis was conducted inductively, with a focus on answering the research question. The findings were sorted into themes and sub-themes.
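For illustration only, and not the coding workflow actually used in the study, the bookkeeping behind sorting coded excerpts into themes and sub-themes can be sketched like this (participant labels and codes are hypothetical; the quotes appear in the results below):

```python
from collections import defaultdict

# Transcript excerpts tagged with codes (participant, quote, code);
# the participant labels and codes are hypothetical.
excerpts = [
    ("P2", "the graphics are helping a lot to understand", "graphics-help"),
    ("P5", "the text was a little small and grey", "visual-improvement"),
    ("P7", "I need a scale to compare", "visual-improvement"),
]

# Codes grouped into (theme, sub-theme) pairs, mirroring the results section.
theme_of_code = {
    "graphics-help": ("Graphics", None),
    "visual-improvement": ("Graphics", "Visual improvements"),
}

by_theme = defaultdict(list)
for participant, quote, code in excerpts:
    by_theme[theme_of_code[code]].append((participant, quote))

for (theme, sub_theme), quotes in by_theme.items():
    print(f"{theme} / {sub_theme or '-'}: {len(quotes)} excerpt(s)")
```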

RESULTS

The resulting themes and sub-themes of the thematic analysis are presented here. Figure 4 shows an overview of the themes.

Figure 4. An overview of the results of the thematic analysis. Themes are shown with a black background and white text, and sub-themes with a white background and black text.

The individual themes and sub-themes will be presented in the following segments.

Graphics

This theme emerged from many participants talking about the importance of graphics and visualizations. Several participants talked about how graphics helped their understanding of the data, for example:

So I can see this graphics and numbers as well, and also text like questions and comments, and it’s definitely very visual because the graphics are helping a lot to understand.

Another participant said it was important to use graphics even for information that could be seen as obvious, since for novice users it is still helpful:

The graphics helped a lot, I would say it’s very friendly. To me it’s important to put a little more obvious information or trivial information that you might think is not necessary, but for someone who’s not maybe very familiar with this kind of interface it would help them a lot.

Several participants mentioned the need for clearly visible colors and big enough text to be able to find these visual cues. For example, one participant said:

Now that I’m looking a bit more careful I saw it and it was easy to understand thanks to this arrow next to it, but I think the text was a little small and grey so it’s not easy to find I would say.

Visual cues


One participant mentioned that it was a nice color scheme to have the average score go from red through yellow to green as the score increases:

And also the 5.6 in a nice color, I assume the color also shows how well you’re doing, so 5.6 is orange yellow-ish because it’s not that good, it’s not that bad either, it’s neutral. So yeah nice color scheming.

Something that was clear was that the pie charts and the average score numbers were visible, and when hovering over them the participants saw that they were clickable, leading to more information.

I think it helps a lot with the pie chart and the numbers and if I want to click I can go and see more detailed information.

Visual improvements

Visual improvements is a sub-theme that points to what the participants thought should be improved visually in the system. One thing to improve was the size of the text in some areas, such as the questions of each section and the summary in the detailed question report, since it was too small to read easily. One participant said the following about the function to sort the answers by, for example, date or length:

If you find it it’s easy to understand but it’s all small letters down there and I didn’t expect there to be much more information down there.

Another participant said about the button to expand the bar chart:

Harder than I thought, I wouldn’t have realized there was something to click at if you didn’t ask to be honest. Not big enough. If it was a bit more black then I would have seen it.

Another thing brought up a few times was that there should be some sort of scale against which to compare the results of a question, because it is not clear in which direction the scale goes:

In the case of Recommendation, I think the number, I need a scale to compare, is 10 very likely or 1 very likely? So kind of something that I can compare the number with on a scale

Hints and explanations

Many participants needed hints and explanations to know where and what to click, and to understand the data displayed in the interface. One participant said this about a section that lacked the explanation the previous one had:

Yeah it’s clear, I think with the text thingy in the question mark it would be slightly better because then you would be sure what was measured but other than that it’s really clear, it shows the question and the score, so then fill the rest in and you got your answer.

One participant thought it would be easier to interpret the numbers of a bar chart if there was an explanation:

Especially when focusing on the last bar chart, maybe an explanation about the numbers, but maybe I just looked wrong and it was there.

Clarity of expression

This theme is about the importance of wording or phrasing and writing clear and concise sections of the interface so that they are easily understood. One participant thought that it was clearer reading the actual question that belonged to the answer than reading the headline of that section:

For me the question is more clear than Overall experience, because it explains more about the actual situation, but I can imagine that if you have set up the survey then you already know what you mean with Overall experience/comments/recommendations stuff like that, then it becomes a lot clearer.

The different headlines of the sections came off as not being clear enough. One participant said:

Well I guess I was a bit confused by them, but I guess they are just the topic of the question.

Understanding of the data

There was a lot in the collected data that could be sorted into this theme, so it was split up into two sub-themes.

Ease of understanding

This theme resulted from the different degrees of understanding of the participants. For example, many of them understood what Hubert was showing at first glance, like the average score of one of the questions of the survey:

So it’s like a summary dashboard with this graphic and the results of interacting with Hubert.

Another example is that many of the participants quickly understood the first page was a summary page of all of the data collected in the survey:

I think already these charts on the first page show kind of nicely an overall feeling about how… About the information that it should show us so I think it’s really good.

One participant, when asked which question belonged to which answer, said:

Yeah I think the rest is good, readable, understandable.

There were several things that the participants did not fully understand about Hubert, or understood only after a while. One participant did not understand the bar chart in the detailed question report: the participant did understand the average score of the bar chart, but did not understand what the chart wanted to tell.

Another participant said about the side-by-side bar chart:

I think it shows the topics or things that the candidates talked about but I’m not exactly sure what the number like 12% or 6% mean, maybe it’s just that 12% talk about recruiters and 6% about time.

This implies that the participant did in fact understand that the side-by-side bar chart shows the words the candidates talked about, but had trouble understanding the numbers being shown.

There were things the participants simply did not understand. For example, some participants were not sure about whether or not the scale of the average score rating started at 0 or 1 or some other number.

Yes like was done here, it shows the question that was asked and the scores and ranking, for example here I see 4.8/10 but I’m not sure whether it starts at 0 or at 1 or maybe it even starts at 3, so that’s just…

Some participants really struggled to understand how to interpret the percentages, and how they relate to the number of respondents to that particular question:

So yeah it gives a more detailed report about maybe the percentage of people from my respondents. 5… Recruiters… This kind of confuse me a bit because 12% of 5, I don’t know what means 12%, 6%, 6%, 3%… Because when I read 12% from 5 respondents it will be like 0 point something so I don’t know what it means this percentage.

The participants found it confusing to see the number of candidates answering a question and the number of answers they gave, in relation to the other answers given to the follow-up question regarding the scale rating:

The respondents 5, messages 10 means I assume that every respondent left 2 messages. On average, every person that filled in the questionnaire answered 2 questions or filled in 2 messages to a question.

One participant did not understand what the bar chart was presenting:

Not sure how this graph was made. I don’t understand how this graph here was made and what it wants to say to me.

Depth of understanding

Most participants understood Hubert and the interface on a basic level, for example one participant said:

That more than half of the people are ok with the process, with the recruiting process. But there are a few things that could be improved.

In some places, and from one participant in particular, there were signs of a deeper understanding:

“As a manager you know where to head next, you see that the score is 5.6 which is a passing grade [...] Yeah so it gives a really clear overview that I think a manager could work really well with.”

Observations

During the test sessions, some observations were made. An interesting aspect was that the participants seemed to be drawn to the first section, which contained the pie charts. These pie charts contained many colors while also being in the first section, so they seemed to stand out more than the sections under them, despite those sections also containing graphical elements (although not pie charts). Even though the participants were given a thorough introduction to Hubert and got a chance to ask questions before the test sessions, it seemed as if some of them had a hard time grasping the more advanced functions of Hubert, like the tagging or the filters such as emotions and topics. In some cases, when asked about their understanding of certain aspects of the interface, the participants claimed to understand them, but further questions on the same subject sometimes made it clear that they did not, at least not fully.

DISCUSSION

Clearly there were things about the interface that the participants had no problems understanding; for example, the users understood that the first page was a summary page of all the data collected in the survey. As the design contains different kinds of data, it was important to try to present it in a way that was not too confusing or overwhelming to the participant at once, especially considering novice users as a target group. The idea of the design was to present the data in a clear way through an overview (the first page) and then in a more detailed view when clicking a certain section. This design choice seems to have worked, as the participants found the first page to contain a graphic summary overview of the data, and most of them also understood that they could click certain parts of the interface to open the detailed view. They understood this because of a visual cue: hovering over certain parts of the interface made that part pop out with a slight size increase and shadows. Such a cue should be an obvious part of the design of an interface like this, as it shows the user which parts they can interact with, in this case click on.

In contrast, some participants understood the average score of the bar chart but not what that score wanted to tell. This could be because of the bar chart itself, and not the average score: the bar chart did not properly convey the information as it should have. It was not fully clear that this bar chart belonged to one specific question, so to the participants it was not fully obvious what it wanted to say. This means that a visualization should contain design elements that clearly show the user what it wants to tell them, and if it is related to other elements (like the specific question in this case), that relation should be made very clear. This also goes hand in hand with what Tufte [22] said about layering and separating the data to sort it in a way that enhances the differences in it, so that it stands out clearly for the user.

The participants made it clear that there were several things they did not understand at all. For example, some participants wanted to know whether the scale of the average score rating started at 0 or 1, which of course has a huge impact on the interpretation of the score. Tufte [22] states that to clarify, one should add detail. This could have been done by making the whole 1-10 scale visible around the average score. Further, some participants struggled to understand how to interpret the percentages shown in the bar chart, and this probably goes hand in hand with what was said in the previous paragraph about visualizations needing design elements that clearly show what they want to tell. One participant did not understand what the bar chart in the detailed view wanted to say. This could indicate that the design of the bar chart was not good enough, or that the participant did not understand what the answers actually were. Since they were lone words that Hubert had picked out of full sentences, it was perhaps not clear how these words had been processed. Some participants also found it hard to interpret the percentages in the bar charts. These percentages were meant to show the frequency of the word, but apparently this was not obvious, since these participants tried to relate the percentages to the number of users responding to the survey. According to Tufte [22], clutter and confusion are failures of design, which could very well be the case here. Perhaps a better way of designing this part would be to separate the different data into different areas, as they are not meant to be looked at together.
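The ambiguity can be made concrete with a small sketch. The assumption below, that each percentage is the share of messages mentioning a word, is an illustration only; the thesis does not document Hubert's exact computation:

```python
# Hypothetical data: 5 respondents left 10 messages in total.
messages = [
    "the recruiters were slow", "too much time between steps",
    "recruiters were friendly", "good process overall", "no feedback given",
    "time to decision was long", "recruiters answered quickly",
    "clear instructions", "website was confusing", "nice recruiters",
]

# Reading the interface (presumably) intended: share of messages mentioning a word.
for word in ("recruiters", "time"):
    share = sum(word in m for m in messages) / len(messages)
    print(f"{word}: {share:.0%} of messages")

# Reading some participants attempted: "12% of 5 respondents" = 0.6 people,
# which is why the percentages seemed impossible to them.
```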

Another visual cue used in the design was having the average score in one of the summarized answers on the first page change color depending on how high or low the score was: if the score was low, the color was red; if the score was mediocre, the color was yellow; and if the score was high, the color was green. This was noted by one participant in particular, who pointed out that this color scheme was a good idea. This implies that for visual perception it is a good idea to use colors, as this helps make the data clearer to the user. This also goes hand in hand with what Tufte [22] says about the use of colors. Also related to this were the participants who said it helped a lot to have pie charts, which likewise used colors to highlight certain parts; for example, the sentiment of an answer was red, green or yellow depending on whether it was negative, positive or neutral, respectively.
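The score-to-color cue can be captured in a few lines. The thresholds below are illustrative assumptions, since the exact cut-offs of the design are not stated:

```python
def score_color(score: float, max_score: float = 10.0) -> str:
    """Map an average score to the traffic-light color described above.
    The thresholds are illustrative assumptions, not the design's actual values."""
    ratio = score / max_score
    if ratio < 0.4:
        return "red"     # low score
    if ratio < 0.7:
        return "yellow"  # mediocre score
    return "green"       # high score

print(score_color(5.6))  # -> 'yellow', matching the "orange yellow-ish" 5.6
```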

Another interesting aspect of the graphics of the interface was that graphics, or visualizations, in general seemed to be very important in helping the participants understand things. Several participants clearly said so, and it helped in different ways: for example, it aided their understanding, and it was important to show information even if it was obvious. This is an interesting finding because it implies that the user of an interface like this is helped by information that is perhaps not directly needed for their understanding, but instead indirectly useful, by having the user subconsciously fact-check or make sense of the data. If the user has the information visible, it can ease their understanding of other parts of the interface. It was also observed that the participants seemed drawn to the section containing pie charts, as these seemed to stand out more. Another aspect noted was that some participants thought the design could be improved by having bigger text in some areas and using different colors, for example black instead of gray, to make text easier to read. This is nothing groundbreaking, but it serves as a reminder to think through one’s design choices even in minor details.


The theme Clarity of expression is all about using clear and concise phrasing and wording of the interface, and it became evident that some sections of the design did not live up to this since several of the participants were confused about them. For example, the different headlines of the main page (“Overall Experience”, “Recommendation” etc.) did not provide the participants with enough information to understand what they were trying to convey. Instead, the participants had to read the questions of the specific section before they understood what the headlines tried to express. This indicates that the headlines were not properly worded, which is something that could be easily overlooked, but is nevertheless important.

Some of the results discussed can be explained using Norman’s theory of conceptual models, previously explained in the background section. Since this theory states that the designer is responsible for designing systems that help the user create more coherent mental models [8], it is interesting to reflect on whether the design of Hubert was successful in doing so. For example, applying this theory to the areas where the participants clearly understood the interface, it can be argued that the Design Model and the User’s Model are similar, which means that the user interacted with the System Image in ways similar to how the designer conceptualized the Design Model. In contrast, where the participants did not understand the interface, it can be argued that the mental models of the designer and the user diverged because the System Image was not understood in the same way by the two parties. A concrete example is where the participants wanted to know whether the scale of the average score rating started at 0 or 1. This of course greatly influences how the average score is interpreted, and since this was not considered in the design of the interface, it could be that the designer’s mental model was very different from the user’s model.

This study was conducted to explore what to think about in terms of visualizations when designing an interface to present chatbot results to novice users. The results were not sufficient to be analyzed on a deeper level. It is therefore not fully possible to answer the research question How do you visualize the results of a chatbot survey for novice users to facilitate the understanding of the data?. However, the findings discussed suggest some things that could be valuable to consider when designing an interface for a chatbot’s results.

The reason the results presented and discussed so far do not fully answer the research question seems to be that the participants answered the questions on a shallow level, mostly focusing on the obvious things in the interface, such as the graphical elements, and not on the underlying data and what it actually meant. There are several reasons for this. One reason is that the conceptual models perhaps differed too much. As previously mentioned, a design can be interpreted differently through the different conceptual models looking at it. If the System Image was interpreted differently through the User’s Model compared to the Design Model, which was the case in some places with this design, it could be because the designer did not properly take into account the participants’ previous background and experience. In this case, all of the participants were novice users who had never used Hubert before, and it seemed like some of the design elements were not seen by the participants in the same way as by the designer.

The interview effect was also apparent during the study, as some participants seemed to give an indication that they did understand everything about Hubert when that was not the case. Perhaps making the participants more comfortable by asking warm-up questions before the actual questions could have made them more prone to give honest answers.

Most of the participants’ answers thus remained at the lower levels of the taxonomy model [12] of cognition. There were some exceptions, since a few participants seemed to have a deeper understanding of the data in some places. For example, the participant whose result is presented as the deeper level of understanding in the results section talked about the data at a somewhat more abstract level, by relating the data to the scenario of taking on the role of an HR manager and discussing the data from that point of view, for example what it meant for that specific role. This shows the few cases in this study where the participants demonstrated a deeper level of understanding of the interface, and therefore were at a higher level of the taxonomy model. It is hard to say why this participant seemed to have a deeper level of understanding than the others, but one possible explanation could be that the participant had a conceptual model that was more closely in line with the designer’s conceptual model. What has been discussed here highlights the importance of acknowledging the participants’ different levels of understanding and conceptual models.

One quote that is interesting to look at is this:

The respondents 5, messages 10 means I assume that every respondent left 2 messages. On average, every person that filled in the questionnaire answered 2 questions or filled in 2 messages to a question. Umm yeah.

This quote highlights why chatbot survey results differ from traditional survey results, and why it is interesting to research chatbot surveys specifically. The data from Hubert is less structured than traditional survey data, because Hubert sometimes asks follow-up questions depending on the answer. If the answer is short, shallow, or touches on an interesting topic, Hubert can ask the user to elaborate. The follow-up question can differ, either asking the user to answer in text (qualitative) or on a scale of 1-10 (quantitative). The user can also choose to elaborate or bring up other subjects. All of this makes the collected data unstructured: it becomes a mix of different types of data depending on the user’s answers. This has some consequences; relating back to the quote, one of them is that you can get several answers to the same question. The quoted participant thought that each of the five respondents gave two answers each, making ten responses in total, when in fact a few respondents answered several times in different ways, and a few respondents answered just once. This is an interesting finding, and it shows why chatbot survey results specifically need design attention.
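A small sketch of why the two counts diverge (the data shape and names are assumptions for illustration; the real Hubert data model is not documented here):

```python
# Hypothetical mixed qualitative/quantitative responses to one question.
responses = {
    "respondent_1": [("scale", 7), ("text", "the recruiters were friendly")],
    "respondent_2": [("scale", 4), ("text", "too slow"), ("text", "no feedback")],
    "respondent_3": [("scale", 9)],
    "respondent_4": [("text", "process was fine"), ("scale", 6)],
    "respondent_5": [("scale", 5), ("text", "unclear next steps")],
}

respondents = len(responses)                        # -> 5
messages = sum(len(m) for m in responses.values())  # -> 10

# "5 respondents, 10 messages" does NOT mean two messages each:
# respondent_2 left three messages and respondent_3 only one.
print(respondents, "respondents,", messages, "messages")
```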

Limitations

The tasks and questions given to the participants were too instrumental and focused on the interface rather than on the application of the tool, which meant that the answers from the participants were often shallow. This also meant that it was not possible to analyze the results on a deeper level; instead, the analysis focused more on the graphical elements of the interface and not on the participants’ understanding of the underlying data. The questions were also somewhat too leading and should have focused more on the participants’ understanding of the data. Because of the COVID-19 pandemic, the study suddenly had to be re-planned from an in-person study to a remote study, which changed the format of the study and made it harder, for example, to ask follow-up questions, since it was somewhat harder to communicate everything online via voice only.

Another limitation was that, since all of the tests were done remotely, the participants used their own computers. This could mean that things such as screen resolution or colors differed and impacted how the participants experienced the system and the study.

Future research directions

Future research should look into how to help participants better conceptualize the interface, and how to help them understand the underlying data. There are indications from this study that Bloom’s Taxonomy can be used in a systematic way as inspiration to set up a study that better focuses on these points.

Future research could use participants from different user groups, differing in, for example, age, origin, or prior knowledge, to see if the results would differ. One could also look beyond understanding at things such as perception or efficiency. Another interesting direction would be to research how the chatbot interaction itself should be designed.

CONCLUSION

This study aimed to explore what to think about when designing an interface to present chatbot results to a user. Using Hubert as a specific case, a user study was undertaken with a total of 17 participants over four iterations. These iterations contained user tests and interviews to collect data. The data from the final iteration was then analyzed using thematic analysis. The results did not suffice to answer the research question, but they did give some suggestions on interface visualization, mainly the importance of using visualizations that are carefully selected to improve the understanding of the data presented.


REFERENCES

[1] Anna & Hubert Labs AB. Hubert.ai - How it works. Retrieved May 26, 2020 from https://www.hubert.ai

[2] Petter Bae Brandtzaeg and Asbjørn Følstad. 2017. Why People Use Chatbots. In Internet Science (Lecture Notes in Computer Science), Springer International Publishing, Cham, 377–392. DOI:https://doi.org/10.1007/978-3-319-70284-1_30

[3] Virginia Braun and Victoria Clarke. 2006. Using thematic analysis in psychology. Qual. Res. Psychol. 3, 2 (January 2006), 77–101. DOI:https://doi.org/10.1191/1478088706qp063oa

[4] Minjee Chung, Eunju Ko, Heerim Joung, and Sang Jin Kim. 2018. Chatbot e-service and customer satisfaction regarding luxury brands. J. Bus. Res. (November 2018). DOI:https://doi.org/10.1016/j.jbusres.2018.10.004

[5] Paul R. Daugherty and H. James Wilson. 2018. Human + Machine: Reimagining Work in the Age of AI. Harvard Business Press.

[6] Asbjørn Følstad, Theo Araujo, Symeon Papadopoulos, Effie Lai-Chong Law, Ole-Christoffer Granmo, Ewa Luger, and Petter Bae Brandtzaeg (Eds.). 2020. Chatbot Research and Design: Third International Workshop, CONVERSATIONS 2019, Amsterdam, The Netherlands, November 19–20, 2019, Revised Selected Papers. Springer International Publishing, Cham. DOI:https://doi.org/10.1007/978-3-030-39540-7

[7] Asbjørn Følstad, Cecilie Bertinussen Nordheim, and Cato Alexander Bjørkli. 2018. What Makes Users Trust a Chatbot for Customer Service? An Exploratory Interview Study. In Internet Science (Lecture Notes in Computer Science), Springer International Publishing, Cham, 194–208. DOI:https://doi.org/10.1007/978-3-030-01437-7_16

[8] Dedre Gentner and Albert L. Stevens. 2014. Mental Models. Psychology Press.

[9] Margot J. van der Goot and Tyler Pilgrim. 2020. Exploring Age Differences in Motivations for and Acceptance of Chatbot Communication in a Customer Service Context. In Chatbot Research and Design (Lecture Notes in Computer Science), Springer International Publishing, Cham, 173–186. DOI:https://doi.org/10.1007/978-3-030-39540-7_12

[10] Bob Heller, Mike Proctor, Dean Mah, Lisa Jewell, and Bill Cheung. 2005. Freudbot: An Investigation of Chatbot Technology in Distance Education. Association for the Advancement of Computing in Education (AACE), 3913–3918. Retrieved May 27, 2020 from https://www.learntechlib.org/primary/p/20691/

[11] Soomin Kim, Joonhwan Lee, and Gahgene Gweon. 2019. Comparing Data from Chatbot and Web Surveys: Effects of Platform and Conversational Style on Survey Response Quality. In Proceedings of the 2019 CHI Conference on Human Factors in Computing Systems (CHI ’19), Association for Computing Machinery, Glasgow, Scotland UK, 1–12. DOI:https://doi.org/10.1145/3290605.3300316

[12] David R. Krathwohl. 2002. A Revision of Bloom’s Taxonomy: An Overview. Theory Pract. 41, 4 (November 2002), 212–218. DOI:https://doi.org/10.1207/s15430421tip4104_2

[13] Xueming Luo, Siliang Tong, Zheng Fang, and Zhe Qu. 2019. Frontiers: Machines vs. Humans: The Impact of Artificial Intelligence Chatbot Disclosure on Customer Purchases. Mark. Sci. (September 2019). DOI:https://doi.org/10.1287/mksc.2019.1192

[14] Bradford Mott, James Lester, and Karl Branting. 2004. Conversational Agents. In The Practical Handbook of Internet Computing, Munindar Singh (ed.). Chapman and Hall/CRC. DOI:https://doi.org/10.1201/9780203507223.ch10

[15] Bonnie M. Muir. 1987. Trust between humans and machines, and the design of decision aids. Int. J. Man-Mach. Stud. 27, 5–6 (November 1987), 527–539. DOI:https://doi.org/10.1016/S0020-7373(87)80013-5

[16] Lea Müller, Jens Mattke, Christian Maier, and Tim Weitzel. 2020. Conversational Agents in Healthcare: Using QCA to Explain Patients’ Resistance to Chatbots for Medication. In Chatbot Research and Design (Lecture Notes in Computer Science), Springer International Publishing, Cham, 3–18. DOI:https://doi.org/10.1007/978-3-030-39540-7_1

[17] Tom Nadarzynski, Oliver Miles, Aimee Cowie, and Damien Ridge. 2019. Acceptability of artificial intelligence (AI)-led chatbot services in healthcare: A mixed-methods study. Digit. Health 5 (January 2019), 1–12. DOI:https://doi.org/10.1177/2055207619871808

[18] Mai-Hanh Nguyen. Why the world’s largest tech companies are building machine learning AI bots capable of humanlike communication. Business Insider. Retrieved June 15, 2020 from https://www.businessinsider.com/why-google-microsoft-ibm-tech-companies-investing-chatbots-2017-11

[19] Donald A. Norman and Stephen W. Draper (Eds.). 1986. User Centered System Design: New Perspectives on Human-Computer Interaction. Erlbaum, Hillsdale, N.J.

[20] Mohammad Nuruzzaman and Omar Hussain. 2018. A Survey on Chatbot Implementation in Customer Service Industry through Deep Neural Networks. 54–61. DOI:https://doi.org/10.1109/ICEBE.2018.00019

[21] Kyo-Joong Oh, Dongkun Lee, Byungsoo Ko, and Ho-Jin Choi. 2017. A Chatbot for Psychiatric Counseling in Mental Healthcare Service Based on Emotional Dialogue Analysis and Sentence Generation. In 2017 18th IEEE International Conference on Mobile Data Management (MDM), 371–375. DOI:https://doi.org/10.1109/MDM.2017.64


APPENDIX G

Test Scenario Intro

Hubert is a conversational chatbot that can be used to collect data about a participant's experience of, or opinions on, a certain subject, such as customer experience or human resources. In contrast to traditional surveys, Hubert asks the participant questions with follow-up questions, depending on how the participant replies. So, similarly to a traditional survey, it's easy to send out, but it also allows for deeper insight into the replies.

For example, companies can use Hubert to evaluate their products. They send a link to their customers, who talk to Hubert in a chat screen (see image below [Figure A1]). After this, Hubert automatically analyzes the answers and presents them to the company on a web page.

Figure A1. Hubert chat screen.


Figure A2. Tagging a reply.

Scenario


APPENDIX H

Questions about test Scenario


APPENDIX I

Tasks and follow-up questions

1. Can you explain what you see on this page? What kind of information do you see? (If needed: How do you interpret the data?)

2. What can you tell from the Overall experience?

3. What is your understanding about the different headlines?

4. What can you tell about the Recommendations? (If needed: What do you see here?) How do you interpret this data?

5. Click the left chart under Overall experience. Can you explain what you see on this page? How do you interpret this data?

6. What happens when you click one of the blue bars?

7. What is your understanding of the menu to the left?

8. Go to Sentiment. How would you sort the data so that you show the negative answers and the oldest answer? How easy was that?

9. Go back to the first page by clicking Hubert in the top-left corner. Click the right chart under Overall experience. Can you explain what you see on this page? What is your understanding of this data?


APPENDIX J

Semi-structured interview

1. What is your understanding of this data that you have seen?

2. How easy is it to interpret the information?

3. What are the most important aspects for you in Hubert?

4. What different types of information are in the data?


APPENDIX K

Transcriptions of tasks and interviews

Participant 1

Can you explain what you see here?

There has been a survey of some sorts and probably using the Hubert chatbot with different kinds of experience from people who responded to the survey and they can leave a recommendation or a comment, and you can apparently see different data about the people who responded.

So what is your understanding of this data? Can you tell me what types of data you see?

Well I see pie charts with, I don’t know, different kinds of opinions, positive and negative opinions and what kind of people were reacting to everything like what kind of job they had or I’m not sure.

What about the questions there, do you think they belong to something specific?

Oh yes, on the left side you have the overall experience on average based on this recruiting process and on the right side you have improvements, opinions of improvements.

What is your understanding about the different headlines?

So you mean the headline of overall experience for instance? Well it’s a short and clear explanation of what’s beneath it.

So you think it’s relevant for the questions that come after?

Yes.

If you scroll down a little, you see the Recommendation section. How do you interpret the data there?

Well most people aren’t likely to recommend this position.

Was that easy to understand?

Yea it’s 4.8/10, not great.

If you click the left pie chart. Can you explain what you see on this page?

A more detailed summary of the responses given so you can see how many people responded in what manner so there were two people who responded with a 4 and a 5 out of 10, and one person who responded with a 10.

Is this page easy to understand?

Hmm, yes relatively so.

How could it be more clear?

Umm, maybe have the chart have more detailed lines because I’d imagine if there were more responses it might get.. Not very easy to read, if there’s like 10 people who said 4 and 9 people who said 9 you might maybe think oh they both had 10 people giving the same answer so there’s not really any lines indicating a number only the top, the 2 in this case. And in that way I had to think more about what the.. How many responses the 10 might have.

What happens when you click one of the blue bars?

Ah you get the reactions of the people who responded to the question.

What is your understanding of the menu to the left?

Well it’s basically the the types of answers given, you can select all answers or maybe a specific type of answer you were interested in.

What happens when you click them?

Hmm. So you can get to see a pie chart of the different kinds of answers. So you get a somewhat detailed pie chart about the different types of answers given.

How would you sort the data so that you show the negative answers and the oldest answer?

With the drop-down menu or either selecting a type of answer right here.

How easy was that?

Very easy.

So if you go back and then click the right pie chart. What do you see on this page?


How easy is it to understand this bar chart?

Very easy, it’s clear and on every bar it has total percentages of the bar and it also has the 5%, 10% etcetera.

If you would like to see all the words of the bar chart, how would you do that?

I would click the expand button.

How easy was that?

Very easy, it’s also really clear.

Now I have a few questions.

What is your impression of the data that you have seen so far?

What do you mean?

Was everything clear, was it easy to understand what Hubert tried to tell you?

Yeah, most of it was very clear, there are some improvements possible I think but very little.

Those improvements, are they anything other than what you already said?

Everything else seems relatively clear, but one other thing that might also help: if you were to use a lot of pie charts you should probably stick with a certain type. Like, on the left-hand side you have the positive, which is the biggest and the starting part, but on the right side the biggest is “other”, which makes sense, but it’s also on the other side instead of being the first thing, which I believe is convention for a pie chart, but I am no expert.

Still considering that you are an HR manager, what would you say are the most important aspects of Hubert?

Being able to improve on your own process, I think that’s the most important thing because you get different data and different responses so you know on average what’s most important to improve on according to your responses.

The last question you have already touched on, but I have to ask: Do you have any ideas on how to make you understand Hubert better?


Participant 2

Can you explain what you see on this page?

This is the results dashboard. We have the overall experience, which I think is like a summary of the reaction of the candidate during the recruitment process, as it says. And also maybe some feedback from the candidate as well. So it’s like a summary dashboard with this graphic and the results of interacting with Hubert, and I can see some of the tagging, I guess.

What kind of information do you see?

So I can see these graphics and numbers as well, and also text, like questions and comments, and it’s definitely very visual because the graphics are helping a lot to understand. So it’s a combination of text and graphics and colors as well.

What can you tell from the Overall experience section? Anything specific that you think about?

From the overall experience I think it’s interesting to see that the majority of the people who applied say it’s very positive, and I think it will come with neutral as well, so neutral and positive are good. But the negative is low as well, and I was thinking like, what’s the meaning of the number, like 5.6/10? I guess it’s the experience, oh, so they’re rating the average experience. Also in the second part, from the first question, my opinion is it’s kind of like a very normal recruiting process, so I think there is a lot of space for improvement, and when they say, ok, most say “other”, it means like there are some things that Hubert didn’t tag or maybe didn’t consider, so it means that maybe it’s important to add options. So this second question doesn’t help much; to me as a hiring manager, I can’t see actually what “other” means. So maybe “other” means maybe not as good, because they write the same thing in other, but since it’s not tagged it will show under the tag “other”.

What can you tell me about the different headlines?

Yeah the main headline is Evaluation results and then we have subsections, Overall experience, Comments, Recommendations…

Is that clear?

In comments, yeah, I don’t know if Comments are the most common thing, was it what everyone was saying or was it the most… If everyone is saying, for example, “I lost trust in the company”… If those comments are the most common comments… I don’t know what is in Comments. In the case of Recommendation, I think the number, I need a scale to compare: is 10 very likely or 1 very likely? So, kind of something that I can compare the number with on a scale.

Scroll up and click the left pie chart. What can you tell me about this page? What do you see?

I see another chart, more information about the question, a detailed question report. I have all answers, I can see different sections, ok. I have all the answers here.

How do you interpret this data, for example the bar chart?

So I have a number of responses, ok, so I was confused: I have 10 messages and 5 responses, so I thought here it would be 10, you know? When I looked at the chart I was expecting to see 10 answers, because there are 10 answers down here. Here I can see for example how they rate experience. How was your experience, so I can see like most of the people say it’s kind of like in the middle, and I think I need to see what 1 means and what 2 means, so I don’t know which one is in the middle, what is this one? I strongly disagree, or?

Here I have some summary. So yeah, I can see this is how they rate the experience with the recruitment process. But still, here, ok, number of responses, and there could be some follow-up question that you can see in this part? Here there are only 5 answers specifically to that question, but here, like in answers, there might be some follow-up question, right?

The menu to the left, what is your understanding of that?

From sentiment I will say this is like the tagging mostly. Emotions, inside the tagging, what is the feeling of the participant, and the topics I would say maybe… I don’t know what topics is, what do you mean by topic? Is it the topic of the conversation with them or the topic of the interview or hiring? These I know because I read about them, but topics I don’t know.

How would you sort the data here so that you see the negative answers and the oldest answers?

Here, I check the answers and I see all: positive, neutral, negative. Maybe make it red.


So according to the question, this chart is showing mostly like what aspects could be improved in the hiring process… I guess I can see more answers… So yeah, it gives a more detailed report about maybe the percentage of people from my respondents. 5… Recruiters… This kind of confuses me a bit, because 12% of 5, I don’t know what 12%, 6%, 6%, 3% means… Because when I read 12% from 5 respondents it will be like 0.something, so I don’t know what this percentage means. If it’s like the other one, I can see 1 or 2, like 2 people say this. But I can see what’s the highest topic, “recruiters” is one of the things to improve; I don’t know the term of maybe “recruiters”, maybe you have to go deeper in the topic, like why “recruiters”. And the lowest one would be maybe this 3, description, communication. So it’s pretty clear in terms, but sometimes it’s quite confusing to me because of the number 5.

Go back. The data you have seen so far, what is your understanding of it? How do you interpret it?

Based on the case scenario, I think the data is more like a result or conclusion on how the people… Trying to collect the people’s emotions or perceptions about the hiring process. So I think the data is helping me a lot to see that, and also it gives you some kind of reaction from the participants that we usually… It’s very difficult to see when you rate: when you rate something you say, ok, I give you 5, 2, I give you 7, but some people will say, ok, I will give you 5, it’s positive; I will give you 7, but it’s negative. It could be that. So it’s good to see the sentiment and emotions. And I think all the information is… the graphics help a lot, and yes, maybe when I see recommendations or something, maybe it’s something to compare.

How easy was it to interpret the information?

It was very easy, based on the reading of the information sheet and also the instructions, so it kind of gives you an idea of what to expect, so it was pretty clear, but I will see reactions and sentiment and a lot of tagging based on the responses collected. I think it helps a lot with the pie chart and the numbers, and if I want to click I can go and see more detailed information. The graphics helped a lot; I would say it’s very friendly. To me it’s important to put a little more obvious information, or trivial information that you might think is not necessary, but for someone who’s not maybe very familiar with this kind of interface it would help them a lot.

What are the most important aspects of Hubert?

Since I’m the HR manager, I want to give a good impression to my applicants; I want the interaction in the hiring process to be very smooth and friendly, and also that they can feel that if they weren’t selected this time they can keep applying. So for me, definitely, in the section of overall experience I would like to see their experience, and also what we can improve. I think there’s always room for improvement for everything, so it’s important to see; so, just here, to read more about what the “other” thing is, maybe a list or something.

Do you have any ideas of how to make you understand Hubert better?
