Document Clustering Interface


Linköping Universitet

729G30

Bachelor Thesis

Document Clustering Interface

Author:

Samuel Johnson

Supervisor:

Arne Jönsson

December 17, 2014

LIU-IDA/KOGVET-G14/029SE


Abstract

This project created a first-step prototype interface for a document clustering search engine. The goal is to meet the needs of people with reading difficulties while also being a useful tool for general users trying to find relevant but easy-to-read documents. The hypothesis is that minimizing the amount of text and focusing on graphical representation will make the service easier to use for all users. The interface was developed using previously established personas and evaluated by general users (i.e. not users with reading disabilities) in order to see whether the interface was easy to use and understand without tooltips and tutorials. The results showed that even though the participants understood the interface and found it intuitive, there was still some information they thought was missing, such as an explanation of the readability indexes and how they determine readability.


1 Introduction

In today's information-based society it may sometimes be difficult to find the relevant information you are looking for. Hours upon hours can be spent going through accessible documents and articles on the internet, and you often encounter very similar information and documents. With this in mind, it would be helpful to have a service which gathers the results of your search queries and sorts them into clusters of similar documents. Clusters would then provide a simple way of determining which documents are worth reading and which to skip. Such clusters often show relevance towards the search query; however, this might not be the only relevant information for a user with reading difficulties. Individuals with reading disabilities might find it hard to scan through all the documents and all the text presented in a cluster, and may therefore want to know the readability of a document and not just its relevance towards the query. This is the idea upon which this project is based: creating a document clustering service which not only presents the results by relevance towards the search query but also indicates the readability of the documents with the help of different readability indexes. This paper presents the graphical user interface developed to help users make sense of the data, regardless of disabilities. This project is a continuation of the earlier work Weblättläst, which ranks web pages according to readability.

1.1 Hypothesis

The basis of this thesis is that users with reading disabilities might benefit from an interface focused on a graphical representation of clustered documents instead of the list of documents normally presented when a user searches the internet for information. A further assumption is that users might want to know not only the relevance of each document and/or cluster but also the readability of the search results, so that they can filter out which documents might be an easier read. By providing as much relevant information as possible and presenting it in a visual, easy-to-navigate and comprehensible way through the user interface, this thesis will attempt to create a basic prototype to be used in further development of the service.


1.2 Purpose

The purpose of this thesis is to design a prototype interface mock-up based on knowledge of graphical design and two previously constructed personas. The interface prototype should be easy to use regardless of reading disability and should be a good basis for future development of the service. The interface will present data clusters based on the documents found and their relevance towards the search query, as well as the readability of each document. To do this, one has to explore the options for graphical representation of data clusters that meet the needs of users with reading disabilities while retaining as much information as possible, so that the service remains relevant for other users.

2 Background

Data clustering has existed for decades and is helpful in organizing large amounts of information. Data clusters might consist of anything from statistical data, categorically organized species records, pictures, documents, et cetera. Data clustering refers to the clustering of any kind of data and not necessarily document clustering specifically.

2.1 Data clustering

Modern society provides enormous amounts of information to anyone with an internet connection. In order to organize and categorize data used in, for example, research, clustering the data is a viable way to minimize the need to process all of it. The goal of data clustering is to find naturally occurring groups by looking at patterns, objects, etc. [3]. Jain's definition of the function of data clustering is as follows: given a quantity of objects, find a number of groups based on a measure of similarity, such that the similarity between objects within the same group is high while the similarity between objects from different groups is low. How the data clustering should be done (i.e. what criteria should be used) is a subjective decision based on knowledge and comprehension of the domain in which the data occur, and therefore of the important aspects of the data itself. Data clusters are used for three main purposes [3]:


• Underlying structure: to gain insight into data, generate hypotheses, detect anomalies, and identify salient features.

• Natural classification: to identify the degree of similarity among forms or organisms (phylogenetic relationship).

• Compression: as a method for organizing the data and summarizing it through cluster prototypes.
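The definition of clustering given above (high similarity within a group, low similarity between groups) can be illustrated with a minimal sketch. The Jaccard measure, the greedy assignment strategy, the threshold value and the sample word sets are all assumptions chosen for illustration; the sources cited here do not prescribe them.

```python
def jaccard(a: set, b: set) -> float:
    """Similarity between two sets: shared items divided by all items."""
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)


def greedy_cluster(items: list[set], threshold: float = 0.3) -> list[list[int]]:
    """Put each item in the first cluster whose first member is similar enough,
    otherwise start a new cluster. Returns lists of item indices."""
    clusters: list[list[int]] = []
    for index, item in enumerate(items):
        for cluster in clusters:
            seed = items[cluster[0]]
            if jaccard(item, seed) >= threshold:
                cluster.append(index)
                break
        else:
            # No existing cluster was similar enough.
            clusters.append([index])
    return clusters


# Hypothetical documents, each reduced to a set of keywords.
documents = [
    {"cat", "dog", "pet"},
    {"dog", "pet", "food"},
    {"stock", "market", "price"},
    {"market", "price", "trade"},
]
clusters = greedy_cluster(documents)
print(clusters)
```

With these sample sets the two pet-related documents end up in one group and the two market-related documents in another. A real engine would use a more robust similarity measure and algorithm.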

2.1.1 Graphical representation of data clusters

There are many ways to create graphical representations of data clusters, depending on the type of data and the intended purpose. An effective way of presenting data clusters is by creating graphical diagrams where the different clusters are categorized according to relevance towards the search query, number of individual objects within a cluster (quantity of similar search results), et cetera. Most of the data cluster engines available online use a user interface where the clusters are presented in the form of a tree, with directives and sub-directives [2]. This system of organization is similar to that of computer operating systems, with folders and subfolders. But this kind of representation of data clusters has limitations [2]:

• Tree categories do not show any link between two clusters; two clusters might contain data with a strong semantic connection which, depending on the settings for the data clustering, might not end up in the same cluster. This means that the semantic connection might be underestimated or even lost if not presented in another fashion.

• It is harder to find similarity between clusters with a tree of categories, meaning two clusters might belong to a more general category. It is also harder to identify non-relevant categories with a tree representation of data clusters since all categories are presented on the same level.

• Another problem occurs with large amounts of data in each category or with a large number of cluster categories. If several categories (nodes) were to be expanded at the same time, the tree might not visually fit the monitor screen; the same problem would apply to sub-directives. This would make the graphical representation hard to navigate and limit how many nodes the user can expand at the same time.

A topology-driven approach was defined by Giacomo et al. [2] in order to create a more versatile method for designing the interface for a web clustering engine than a pure tree structure. The topology-driven approach includes three distinct phases [2]:

• 1. In the first phase, a base graph is computed that summarizes the semantic relationships of the snippets returned by the classical search engines; it is called the snippet graph.

• 2. A clustering algorithm is applied on the snippet graph in order to compute a hierarchy of semantic categories, each category grouping a certain subset of snippets. Relationships among categories that are not in an ancestor/descendant relationship are automatically induced by the edges of the snippet graph; the resulting clustered graph is called the graph of categories.

• 3. The graph of categories is visualized to the user along with a set of functionalities to browse and analyze it.
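The first two phases can be sketched in a heavily simplified form. This is not the algorithm from [2]: the shared-word edge rule, the `min_shared` parameter and the use of connected components as flat "categories" are stand-ins assumed purely for illustration, and no hierarchy is computed.

```python
from itertools import combinations


def build_snippet_graph(snippets: list[str], min_shared: int = 2) -> dict[int, set[int]]:
    """Phase 1 (simplified): link two snippets when they share enough words."""
    words = [set(s.lower().split()) for s in snippets]
    graph: dict[int, set[int]] = {i: set() for i in range(len(snippets))}
    for i, j in combinations(range(len(snippets)), 2):
        if len(words[i] & words[j]) >= min_shared:
            graph[i].add(j)
            graph[j].add(i)
    return graph


def connected_components(graph: dict[int, set[int]]) -> list[list[int]]:
    """Phase 2 (stand-in): treat each connected component as one category."""
    seen: set[int] = set()
    components: list[list[int]] = []
    for start in graph:
        if start in seen:
            continue
        stack, component = [start], []
        seen.add(start)
        while stack:
            node = stack.pop()
            component.append(node)
            for neighbour in graph[node]:
                if neighbour not in seen:
                    seen.add(neighbour)
                    stack.append(neighbour)
        components.append(sorted(component))
    return components


# Hypothetical snippets returned by a search engine.
snippets = [
    "jaguar car dealer prices",
    "jaguar car review",
    "jaguar cat habitat",
    "big cat habitat rainforest",
]
graph = build_snippet_graph(snippets)
categories = connected_components(graph)
print(categories)
```

Here the two car snippets form one category and the two animal snippets another. Phase 3, the visualization, is what the rest of this paper is concerned with.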

As can be seen, phase 3, the visualization of the clustered data, is of most interest for this paper. While there are many ways of creating a visual representation of data clusters, there are a few common goals that should be considered when doing so [2].

• The user should not be overwhelmed by information, and the interface should be kept as simple as possible. This means that the content of the different clusters should not be displayed at the same time; rather, the user should decide which clusters to explore. It also means that simplifying the relationships among clusters can make things easier for the user by emphasizing those aspects considered most relevant (in this case readability and relevance towards the search query).

• The visual representation should be aesthetically pleasing and support the user's mental map of the graph during exploration.

2.2 Readability

Since the program will use different readability indexes to rate each document, it is important to understand these readability measurements. Currently the program uses three different measurements: LIX, OVIX and Nominal Ratio. Note that these readability indexes do not consider grammar or semantic content.

2.2.1 LIX

LIX gives an indication of the readability of a text by looking at the length of sentences as well as the frequency of longer words. However, LIX does not take into consideration how common the long and short words are, meaning a word like "mountain" might be considered a long word and therefore increase the final score of the text. Words that are shorter but uncommon still improve the measured readability; a word like "woe" would lower the score even though it might be harder to understand than "mountain". LIX thus gives a good indication of readability in terms of long/short words and sentences. The formula for LIX is as follows:

$$\mathrm{LIX} = \frac{n(\text{words})}{n(\text{sentences})} + \left(\frac{n(\text{words} > 6\ \text{chars})}{n(\text{words})} \times 100\right) \tag{1}$$
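Equation (1) can be turned into a short function. This is a minimal sketch: the way sentences are split (on `.`, `!` and `?`) and the way words are extracted (runs of letters) are simplifying assumptions, not the tokenization used by the actual program.

```python
import re


def lix(text: str) -> float:
    """LIX = words per sentence + percentage of words longer than 6 characters."""
    # Assumed tokenization: sentences end at . ! or ?, words are runs of letters.
    sentences = [s for s in re.split(r"[.!?]+", text) if s.strip()]
    words = re.findall(r"[^\W\d_]+", text)
    if not sentences or not words:
        raise ValueError("text must contain at least one word and one sentence")
    long_words = [w for w in words if len(w) > 6]
    return len(words) / len(sentences) + (len(long_words) / len(words)) * 100


sample = "The mountain is high. Woe to the climber."
score = lix(sample)
print(score)  # 8 words / 2 sentences + (2 long words / 8 words) * 100 = 29.0
```

In the sample, "mountain" and "climber" count as long words while "woe" does not, which mirrors the limitation described above: the score depends on word length, not on how common a word is.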

2.2.2 OVIX

Instead of looking at the length of words and sentences, OVIX provides a readability value based on the range of words used: in essence, the greater the variation of words used, the lower the readability. Basically, by dividing the number of unique words by all words contained in the text you can get an idea of how advanced a text might be. The formula is as follows:

$$\mathrm{OVIX} = \frac{\log(n(\text{words in total}))}{\log\left(2 - \frac{\log(n(\text{unique words}))}{\log(n(\text{words in total}))}\right)} \tag{2}$$

2.2.3 Nominal Ratio

By looking at the quantity of different word classes, especially nominals, NR provides a measurement of information density. A lower value means a lower density of information, which in essence should mean an easier text to read. The formula is as follows:

$$\mathrm{NR} = \frac{n(\text{nouns}) + n(\text{prepositions}) + n(\text{participles})}{n(\text{pronouns}) + n(\text{adverbs}) + n(\text{verbs})} \tag{3}$$
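Equations (2) and (3) can be sketched in the same way. The function names, the lower-casing of words and the dictionary keys for the part-of-speech counts are hypothetical choices made for this example. Part-of-speech tagging itself is assumed to happen elsewhere and is not shown.

```python
import math


def ovix(words: list[str]) -> float:
    """OVIX from equation (2): log(N) / log(2 - log(U) / log(N))."""
    total = len(words)
    unique = len({w.lower() for w in words})
    # With fewer than two words, or no repeated word, a logarithm in the
    # formula becomes zero and the value is undefined.
    if total < 2 or unique == total:
        raise ValueError("OVIX is undefined for this input")
    return math.log(total) / math.log(2 - math.log(unique) / math.log(total))


def nominal_ratio(counts: dict[str, int]) -> float:
    """NR from equation (3), given part-of-speech counts (hypothetical keys)."""
    nominal = counts.get("noun", 0) + counts.get("preposition", 0) + counts.get("participle", 0)
    verbal = counts.get("pronoun", 0) + counts.get("adverb", 0) + counts.get("verb", 0)
    if verbal == 0:
        raise ValueError("NR is undefined without pronouns, adverbs or verbs")
    return nominal / verbal


# Hypothetical input: a tokenized text and hand-made part-of-speech counts.
tokens = "the cat saw the dog and the dog saw the cat".split()
variation = ovix(tokens)
pos_counts = {"noun": 6, "preposition": 2, "participle": 1,
              "pronoun": 3, "adverb": 1, "verb": 5}
density = nominal_ratio(pos_counts)
print(variation, density)
```

Both functions raise an error where the formulas divide by zero, for example a text in which every word is unique.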

3 Existing data cluster engines

There are a few data cluster engines available on the internet already, but none that seem to cater to the needs of those with reading disabilities. This paper will take a closer look at the document cluster engine Carrot2 (http://search.carrot2.org/stable/search).

3.1 Carrot2

Carrot2 is an existing OSS (Open Source Software) data cluster engine which provides users with topologically sorted clusters based on documents found on the web. According to Osinski and Weiss (2005) [7], the primary goal of Carrot2 is to enable rapid research experiments with novel text/web mining techniques. Carrot2 is equipped with several implementations of search results clustering algorithms. Carrot2's interface was improved two years later by Osinski and Weiss (2007) [8], and as they explain: "One important factor determining the success of the latter [whether OSS can effectively replace proprietary end-user applications] is the quality of their user interfaces".

Non-experts, meaning common users, want to complete their tasks quickly with the least amount of effort spent on learning and operating the software. While a low-quality user interface might not discourage experts from using the software, other users might feel reluctant to spend time learning and operating it. Therefore an attempt was made to create a more user-friendly interface for Carrot2 by introducing usability practices from the perspective of an active developer. The inherent aim of Carrot2, since it is OSS, is to provide a framework for development of search results clustering engines.

The development of Carrot2 started in 2002 and focused on its primary goal of enabling rapid experiments with novel web mining techniques [8]. However, some time into the development the creators of Carrot2 decided to focus more on meeting the needs of end-users, which required an easier-to-use interface than the one already implemented. To redesign the UI (user interface), four major stages were set [8]: evaluation of the original design, prototyping of the new design, usability testing, and implementation and evaluation of the redesigned version.

Evaluation of the original design - When the emphasis of the software shifted from attracting text mining researchers towards end-users, they discovered that some elements were obsolete. For instance, the original design forced the user to choose which clustering algorithm to use: "GoogleAPI, English stemmer, Tf Terms weighing, AHC, Dynamic Tree" or "YahooAPI, LINGO, Dynamic Tree", which to a layman would mean little to nothing.

Another problem for non-expert users arose when choosing the search result sorting order. In the original interface, search results were presented with thematic folders (clusters) to the left and the content of the folders (the individual documents) to the right. The chosen sorting order referred to the content of the folders and not the folders themselves, and according to Osinski and Weiss (2007) [8] it was not always clear to users which sorting order was selected or what it actually meant. In addition, end-users would face technical jargon which would be hard for any layman to understand.

Changes made to the Carrot2 interface include a few overhauls, such as presenting search result sources as tabs instead of boxes. The new interface also hid the choice of clustering algorithm, since the majority of end-users would not have the technical knowledge required to choose an appropriate one; this option is now available from an "advanced options" menu. To test the redesigned interface for Carrot2, they performed informal usability tests on a handful of people (exact number not provided) [8].


Since this project focuses on the graphical representation of clusters, let us take a look at the three types of data presentation which Carrot2 provides:

3.1.1 "Folders"

The "Folders" option presents the search results in a familiar design using folders as symbols for clusters. The clusters or "Folders" are presented on the left side, and when a cluster is clicked the documents it contains are listed on the right side. The left side panel with clusters is sorted according to the number of documents within each cluster, meaning the more results belonging to a cluster, the higher the cluster will be on the list. The thematic topic is the label for the cluster and the number is the amount of results belonging to that cluster.

Figure 1: The "Folders" option of graphical visualization for clusters.

3.1.2 "Circles"

The same setup is used for the "Circles" option, with the clusters presented on the left side and the results on the right. The circle is fairly similar to the "Folders" option in that it presents the clusters according to the number of results within. Instead of listing the number of documents within a cluster as a number, however, "Circles" only provides a hint of how large an individual cluster is through the amount of space it is given on the circle. The space given to each cluster depends on the other clusters, since the circle is a finite space. Clicking on a cluster in the circle will generate a list to the right with the contained documents.
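The idea that each cluster's share of a finite circle depends on all the other clusters can be shown with a small calculation. This is only an illustration of proportional allocation; it is an assumption that the share is directly proportional to cluster size, and it is not taken from Carrot2's implementation. The cluster labels and sizes are made up.

```python
def arc_angles(cluster_sizes: dict[str, int]) -> dict[str, float]:
    """Split the 360 degrees of a circle in proportion to cluster size."""
    total = sum(cluster_sizes.values())
    if total == 0:
        raise ValueError("at least one cluster must contain a document")
    return {label: 360 * size / total for label, size in cluster_sizes.items()}


# Hypothetical clusters and their document counts.
angles = arc_angles({"Cars": 10, "Animals": 6, "Other": 4})
print(angles)
```

Adding documents to one cluster shrinks the arcs of all the others, which is why the circle gives only a relative hint of size rather than an absolute count.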

Figure 2: The "Circles" option of graphical visualization for clusters.

3.1.3 "FoamTree"

The third option for cluster visualization is "FoamTree". As with "Circles", the amount of space taken by each cluster represents the number of results within, and as with "Circles", "FoamTree" does not provide a numerical clue as to how many documents belong to a certain cluster.


Figure 3: The "FoamTree" option of graphical visualization for clusters.

These forms of graphical visualization do not provide the user with much beyond the search result order (by selecting "All Topics" in the "Folders" tab) and which cluster contains the most results. This can be explained by the simple fact that it is not part of Carrot2's goal to provide anything other than a visualization of the data clusters. In contrast to Carrot2, this project has the additional goal of providing the user with relevant documents which are easy to read, or at the very least giving the user an indication of which documents should be an easier read. Having this as one of the main goals means that these three forms of graphical representation might not be adequate, since adding more measurements as variables might not be feasible.

4 Reading Disabilities

There are many factors which could contribute to reading difficulties besides an actual disability. For instance, the user might have the target language as a second language, or might be able to read effectively but have a narrow vocabulary; even temporary factors like sleepiness, stress or a rowdy environment might contribute to reading difficulties. However, actual disabilities might require more than finding a calm place to read or spending time learning more difficult/uncommon words. Two common causes of reading disabilities are dyslexia and ADHD (Attention Deficit Hyperactivity Disorder), and they differ quite a lot in the way they affect reading. It is also important to note that neither of these disorders is due to low intelligence or a limited vocabulary. Even though this cluster engine will merely provide hyperlinks to the original web page and not alter the original texts in any way, it is important to understand how these disorders affect users.

4.1 Dyslexia

Dyslexia is a neurobiological learning disability which causes great difficulty in reading and writing. People with dyslexia have trouble accurately and fluently recognizing words in texts and often have trouble spelling words or decoding longer sentences Shaywitz (2005) [6]. Dyslexia is hereditary and the greatest risk factor is the family medical history and according to Shaywitz (2005) [6] ranging from 23 The reading difficulties caused by dyslexia concern the decoding of words; this means that a person suffering from this disorder may not have any problem with comprehension of spoken language (i.e. may have a good vocabulary), but when trying to read a written sentence they might struggle to read each word fluently. Spelling is also a common problem for people with this disorder.

In order to make reading easier for people with dyslexia there are a few guidelines to follow (Centrum för lättläst - appendix). Writing in shorter sentences makes decoding easier since the reader does not need to decode too many words, and using simple, more common words allows the reader to identify each word more easily and therefore decode more fluently. It might also help to write each sentence on a new line so that the reader can clearly see where the sentence begins and ends. It might feel nice to use a good variation of words instead of repeating the same words over and over; however, repetitive use of words makes the text easier to decode and would benefit a reader with dyslexia.

4.2 Attention Deficit (Hyperactivity) Disorder (ADHD)

ADHD (or ADD - Attention Deficit Disorder) is a neurological disorder which often develops during the early years of childhood but might not be diagnosed until adulthood. A few personality traits are commonly linked to ADHD; for instance, hyperactivity, restlessness, mood swings and a short attention span are often portrayed as the main characteristics of the disorder [1]. But the symptoms of ADHD or ADD can vary greatly between individuals; one example is the difference between ADHD and ADD. A person with ADHD might get noticed quite easily since they appear restless and unable to focus their attention, while a person with ADD might be harder to identify since it is common for those people to be at the opposite end of the hyperactivity scale, often being considered lazy or lost in thought. The list of characteristic traits associated with ADHD or ADD is long and often debated; however, the main symptoms are present to at least some degree in everyone afflicted by this disorder: restlessness, a skewed perception of time, an inability to maintain attention for longer periods of time and, in many cases, reading difficulties.

ADHD is not in essence a specific reading disability; however, according to Yoshimasu et al. (2010) [4], young adults with ADHD have, by the age of 19, a significantly higher rate of reading disability incidents than those without ADHD. The rate of reading disabilities for young adults with ADHD is 51

The main hurdle when it comes to reading for a person with this disorder is massive walls of text: it is hard to keep track of where you are in the text, and without breaks such as new paragraphs it is easy to get side-tracked or lost in thought. There is no real issue with actually decoding the text, as there is for people with dyslexia; rather, it is about maintaining interest and attention to the meaning of the words. Long or difficult (i.e. uncommon) words might not be a problem for a person with ADHD, but long sentences might make the reader unintentionally lose interest and therefore make it harder to comprehend the meaning of the sentence.


4.3 Cognitive impairments and information accessibility

There are many ways in which an interface may hinder the user, for instance uncommon terminology or unintuitive menus, but these hindrances apply to all users and not just those with certain impairments. Since this project focuses on reading difficulties, these are the main concern when speaking of cognitive impairments.

A good question to ask before creating an interface is: how much information is necessary? Imagine a commonly used program such as Microsoft Word: how well would you be able to navigate if there were no tabs or submenus, and all commands were instead visible and available at all times? Categorizing commands into subgroups and limiting the number of options, commands and menus shown at any given moment might therefore be vital in helping the user navigate the program. Creating an interface which makes sense might not be as easy as one would expect. For instance, most people know how to shut down a computer with a Windows operating system, but the process is not actually very intuitive: you have to access the "Start" menu in order to choose "Shut down" Gregor (2006) [5]. Even though the process is rather counter-intuitive, most people using computers regard the process of shutting down Windows as natural.

Consider the navigation in Windows: it is consistent, and once you have learned it you have a basic understanding of other Windows-based systems. When it comes to websites, on the other hand, there is no default navigation structure used for every site. There is, however, a certain common navigation within certain categories; for instance, blogs usually have one type of layout while newspapers or social networking websites use other layouts. For every website or program there is a balance between functionality, i.e. making sure all relevant commands and features are present, and the user's ability to actually find and execute the proper command or feature, or to find the relevant information they are looking for.

There are three central guidelines to follow when designing an interface for people with cognitive impairments (not specific to reading difficulties) according to Gregor (2006) [5]:

• Identify the main feature of the software, e.g., to process text, to send emails, and focus on designing for that.

• Remove extraneous and misleading links to alternative features or alternative means of achieving the same end at least at the initial level of the interface.

• Make interaction as direct as possible; memory and sequencing are difficult for most people with cognitive difficulties and the more direct the interaction is, the fewer cognitive resources need to be devoted to it.

Creating an interface thus seems to be less about implementing new features or options and more about making it as simple as possible, with as few choices as possible (in the initial phases of the interface). The information and features presented should focus on the main purpose of the program or website, and the information should be kept as detailed and short as possible.

5 Creating a prototype

To create a viable interface prototype for a document clustering engine and to meet the needs of users who might have difficulties with reading, I took inspiration from existing data cluster engines and used previously created personas to evaluate different types of graphical representations.

5.1 Personas

In order to create a viable service for users with reading disabilities, two personas were used to understand what needs the service should attend to. The personas were created in the previous project EasyReader and consist of Frida, the primary persona, with dyslexia, and Bao Li, the secondary persona, with Swedish as a second language. Even though these personas were not specifically designed for my project, there are a few aspects worthy of consideration. I will list a few noteworthy characteristics of these two personas.


5.1.1 Frida - Dyslexia

Frida is based on several students with reading/writing difficulties. She needs to be able to find articles and literature which are easy to read. She also has trouble identifying differences between search results and feels that a lot of time is spent reading summaries in order to see whether an article is easy to understand or relevant. She also considers the presentation of the text, preferring spacious text with a larger font size. She prefers text which does not contain difficult or long words and would rather choose another article, even one which may be less relevant to her interests. The following is a short summary of the aspects which are interesting with regard to clustering and graphical representation:

• Finds it hard to search for relevant articles and literature since it takes longer to read.

• Feels a need for a simple step-by-step text structure in order to fully understand.

• Considers short summaries useful since she might not have to read the whole article to know if it is relevant.

• Needs a fast and efficient way of finding relevant information (articles/literature).

• Wants to understand the essence of a text more easily.

How document clustering and a graphical representation can meet the needs of Frida: Grouping similar documents together in clusters means that Frida does not have to read through several similar search results in order to find the one she is looking for. It also means that instead of just looking at the top three search results (as in an example persona scenario), she can view the top 25 or maybe even more, increasing the chance of finding what she is looking for. By using a graphical representation she will easily be able to identify which documents suit her needs. With the focus on relevance and readability she can spot the cluster that shows the most promise and then concentrate her energy on those documents/search results. If she quickly wants to know which document is the easier read, she can choose to sort the documents (in the list on the side) according to a readability index of her choosing. Providing a sentence or two as a summary of a document might be a good idea so that she doesn't have to filter through each document just to see what it is about.

5.1.2 Bao Li - Swedish as a second language

Bao Li is based on several exchange students currently studying at Linköping Universitet, Sweden. Li has limited ability to speak and understand Swedish, but enough skill to follow everyday conversations as long as the speaker does not speak too fast or have a difficult accent. He also has a rudimentary understanding of written Swedish and can read simple text without much difficulty. Even though his lectures are in English, Li wishes to learn to speak and read Swedish better. He often uses Swedish instead of English in conversations with Swedish-speaking acquaintances. Li has completed a few Swedish courses, and in order to learn more he often reads technical articles in Swedish in his spare time, since technical articles often use the same type of wording and terms in most languages.

• Li reads technical articles in Swedish in order to learn the language.

• Sometimes has trouble with text being too difficult and then gets bored, even if the subject interests him.

• Wants to learn Swedish for personal satisfaction rather than out of necessity.

• Wishes both to find the relevant information he seeks and to improve his Swedish, and to be able to handle difficult texts in Swedish without getting bored trying.

How document clustering and a graphical representation can meet the needs of Bao Li: Since Li wants to learn the language, the service might be a handy tool for finding interesting articles with an appropriate level of difficulty; as his understanding increases, he can try a harder text which tackles the same area. Increasing the difficulty by reading texts further down the readability scale might provide a useful tool for learning. Since he also wishes to find articles which interest him, the clustering of documents might provide an easier way of finding similar articles on the same subject.


5.2 Interface prototype

In order to create an interface which meets the needs of users with reading difficulties but still maintains a general approach for other users, I created a list of criteria based on the needs outlined in the two personas and ideas from existing cluster engines. The list is as follows:

• As much information as possible should be represented graphically in order to make it easier for the user to identify which documents s/he considers relevant without having to read through all the documents. This will make it easier for Frida to find relevant information without forcing her to read too much.

• Keep the amount of text at any given time to a minimum and in short sentences. This would help both users with dyslexia as well as users with ADHD.

• The interface should be kept as simple as possible in order not to overwhelm the user with too much information [2]; the content of the different clusters should not be displayed at the same time, but the user should be allowed to choose which cluster to explore.

• The visual presentation should be maintained when exploring individual clusters instead of simply listing the documents in text. This helps the user see the relations between documents in a single cluster just as they can see the relations between clusters [2].

• Focus on the main feature of the software and remove any unnecessary/misleading links for alternative solutions. Make the interaction as direct as possible in order to save cognitive resources for the important parts. [5]

• Presenting the data visually might not be enough for some users, and therefore a list of the documents should also be present, with information about which cluster each document belongs to, a short summary (snippet) and maybe even the values attributed to that document (readability/relevance). It should also be possible to sort this list according to relevance or readability, so that the user might be able to see if documents from different clusters have some similarities. This would give users like Bao Li an overview of all documents, and he can more easily find the most readable document without searching every cluster.

The prototype I created presents the clustered documents on a graphical hexagon grid. Hexagons are used because they leave no wasted space (i.e. unused space between the tiles) and because each hexagon has six neighbouring tiles instead of, for example, the four neighbours a square tile would have. The variables which decide where a cluster is positioned on the grid are relevance, which is static (i.e. non-changeable), and the user-specified readability index. Each cluster is given a colour so that users can easily distinguish between different clusters, and the number of hexagons within a cluster represents the number of documents belonging to that cluster. The design involves two slightly different levels: cluster level and document level. Below is an early-stage prototype of the cluster level:

Figure 4: Early interface prototype.
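The neighbour argument for hexagons can be illustrated with a minimal sketch. It assumes axial coordinates (q, r) for the hexagon tiles, which is one common addressing scheme; the thesis does not state which scheme the prototype uses, and the function names are hypothetical.

```python
# Axial coordinates (q, r) are an assumed addressing scheme for the hexagon
# tiles; the prototype's actual grid representation is not described.
HEX_DIRECTIONS = [(1, 0), (1, -1), (0, -1), (-1, 0), (-1, 1), (0, 1)]
SQUARE_DIRECTIONS = [(1, 0), (0, -1), (-1, 0), (0, 1)]


def hex_neighbours(q, r):
    """Return the six tiles that share an edge with hexagon (q, r)."""
    return [(q + dq, r + dr) for dq, dr in HEX_DIRECTIONS]


def square_neighbours(x, y):
    """Return the four tiles that share an edge with square (x, y)."""
    return [(x + dx, y + dy) for dx, dy in SQUARE_DIRECTIONS]


if __name__ == "__main__":
    print(len(hex_neighbours(0, 0)))     # 6
    print(len(square_neighbours(0, 0)))  # 4
```

With six edge-sharing neighbours per tile, a cluster drawn as a group of adjacent hexagons has more ways to stay connected than it would on a square grid.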

The cluster level allows the user to identify the existing clusters and to easily see the (non-semantic) relation between them according to relevance towards the search query as well as the preferred readability index. To the right of the graphical representation is a list of all documents found in the search. This list ranks the documents according to the chosen criterion (readability index or relevance) regardless of which cluster each document belongs to. Either the text will be highlighted with the colour of the parent cluster, or a small box next to the text will display the cluster each document belongs to. The reason for the list is that two documents in the same cluster might differ greatly in readability or relevance; to compensate for this limitation of the graphical representation, the list gives users the means to sort the documents, regardless of cluster, according to their own preferences.

The document level of the interface is accessed by clicking on a cluster. A smaller grid (i.e. one with larger hexagons) then shows the individual documents within that cluster, positioned on the grid according to relevance and readability index. This provides the user with a continuous graphical representation of the documents instead of (as some existing data clustering engines do) breaking the graphical representation in favour of a text-based list. It also quickly shows the differences and similarities between the documents regarding relevance and readability. The same type of list exists to the right of the graphical representation, but without colour coding, and shows only the documents belonging to the current cluster.

5.2.1 Functional Interface Prototype

The interface currently has three different frames. The first frame the user encounters is a simple query box, where the user can also adjust the number of documents viewed. The interface is currently limited to 100 results in order to fit the static hexagon grid; there can, however, be an unlimited number of clusters. Because of this there is no set colour for each cluster; instead, colours are assigned randomly.


Figure 5: Frame 1 - Query search box.

The second frame is the cluster presentation. It contains a graphical presentation in the form of a graph with readability on the x-axis and relevancy on the y-axis, as well as a document list which can be used to sort every document regardless of which cluster it belongs to.


Two readability indexes are currently available for the graphical presentation, LIX and OVIX. The document list can sort documents according to LIX, OVIX, relevancy and cluster. Each cluster gets a randomly assigned colour to help distinguish one cluster from another. The number of hexagons within a cluster indicates the number of documents which belong to that cluster.
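The four sort options of the document list could be sketched as follows. This is only an illustration: the `Document` fields, the `sort_documents` helper and the sort directions (ascending for LIX and OVIX, where a lower score means easier text; descending for relevancy) are assumptions, not the prototype's actual implementation, and the example values are made up.

```python
from dataclasses import dataclass


@dataclass
class Document:
    title: str
    cluster: int
    lix: float
    ovix: float
    relevancy: float


# Assumed directions: easiest text first for the readability indexes,
# most relevant first for relevancy, cluster number ascending.
SORT_KEYS = {
    "lix": (lambda d: d.lix, False),
    "ovix": (lambda d: d.ovix, False),
    "relevancy": (lambda d: d.relevancy, True),
    "cluster": (lambda d: d.cluster, False),
}


def sort_documents(documents, criterion):
    """Return a new list sorted by the chosen criterion, ignoring clusters
    unless the criterion is 'cluster' itself."""
    key, descending = SORT_KEYS[criterion]
    return sorted(documents, key=key, reverse=descending)


if __name__ == "__main__":
    # Made-up example documents, for illustration only.
    docs = [
        Document("a", 1, 42.0, 60.0, 0.4),
        Document("b", 2, 31.0, 72.0, 0.9),
        Document("c", 1, 55.0, 55.0, 0.7),
    ]
    print([d.title for d in sort_documents(docs, "lix")])        # ['b', 'a', 'c']
    print([d.title for d in sort_documents(docs, "relevancy")])  # ['b', 'c', 'a']
```

Sorting across clusters in this way is what lets a user find the easiest document overall even when it sits in a cluster that is not the easiest on average.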

When the user hovers the mouse over a cluster, a snippet from a document within the cluster is shown; each hexagon shows a single snippet. Hovering over a document in the list highlights the cluster it belongs to within the graph. Clicking a cluster brings the user to the document level, which shows a graph of the individual documents within that cluster.

Figure 7: Frame 3 - Document level of graph.

The third frame is the document level of the graph, which allows the user to view the documents in a cluster as a graph; here the document list only sorts the documents inside that cluster. The user accesses this frame by clicking a cluster or by clicking the cluster colour of a document in the document list. Clicking a document, either in the document list or on the document level of the graph, provides a snippet of the document, a hyperlink to the webpage, and the LIX/OVIX and relevancy scores.


6 Evaluation

The focus of the evaluation is to gauge how easy the interface is to use and whether users find it useful. Note that only one participant had a diagnosed reading disability; the reason is that the interface is supposed to be useful to everyone, not just people with reading disabilities, and the evaluation therefore focuses on whether the interface is comprehensible for all users.

6.1 Method

In order to see if the interface is easy to use and comprehend, a small study was performed with six people between the ages of 21 and 62 (mean 39.6, standard deviation 18.2). The participants were first introduced to the program with a statement explaining what a document cluster engine is and what purpose it might have. They were not told that the interface was designed with reading disabilities in mind, so as not to bias the study. The information given before the tasks was:

"This is a document cluster engine. It use search results and create clusters with similar documents, that is, it compare each document and the document that have the highest similarity will become part of the same cluster. Then the readability of each document is evaluated which gives an indication of how easy a document is to read. The result is several collections of documents (clusters) and readability scores on each individual document. The purpose is to provide a service that allows the user to easily scan through a large number of documents in order to find information that is both relevant to the user search as well as finding documents that are easy to read, without the need to read through each individual document."

Each participant then received basic tasks to perform using the interface while the evaluator sat behind and observed, noting the number of mistakes together with a short comment describing each mistake. The tasks were as follows:

• Choose 30 results and press search

• Change to 40 results and press search

• Change the graph to show OVIX readability

• Change back to LIX

• Find the cluster with the highest relevancy

• Find the cluster with the highest readability

• In the document list, make sure it is sorted by LIX readability

• Click the cluster with the most results

• Find the document with the highest readability and relevance on the document level

• Go back to cluster level

• Find the cluster that contains the most relevant document using the document list (change sorting to relevancy)

• Click that cluster and find the document with the highest readability

• Change the sorting in the document list to OVIX

After the test was finished, the participants answered a number of questions regarding their user experience and a few questions to check whether they understood the interface. The participants also answered questions about their level of experience with computers and whether they had a reading disability; as stated above, only one had a reading disability, in the form of ADHD. The questions were asked by the observer, who also wrote down the answers. This was done to minimize the rushed answers that might occur if the participants were handed a questionnaire, and to better understand the participants' experience. The questions asked were:

• Was the interaction intuitive?

• Did the document list help in finding specific documents, and if so, in what way did it help?

• Was there any element that was hard to comprehend?

• Did you find it difficult to navigate the graphical representation?

• Was there any difficulty navigating with the document list?

• Using both the graph and the document list, did you encounter any problems?

• Did you feel there were options missing?

• What do you think this service could be used for, and what would you use it for?

• Do you know what a readability index is (like LIX/OVIX)?

• Is it important to know what LIX/OVIX is?

• Do you think tooltips/help are important in order to understand the website better?

• Is there any information that you felt was missing, for instance explanations or directions?

• What would you like changed?

• Number of mistakes (i.e. "How many mistakes did you make?"):

6.2 Results

In order to analyze the results, two categories were created: technical mistakes, which looks at which tasks were being performed when mistakes were made, and user experience, which is based on the participants' answers regarding how they perceived the usability.

6.2.1 Technical mistakes

In total, quite few mistakes were made by the participants: the highest number was 4 and the lowest 0, with an average of 1.8 and a standard deviation of 1.47 mistakes. The most common mistake, made by five participants, occurred when finding the cluster with the highest readability and was due to the reversal of the x-axis. Most participants corrected this quickly, except for one participant who (by his own account) was confused by the reversal and took a few seconds longer to adapt to the non-standard presentation. This was also the only participant to actually click the wrong cluster; the other participants corrected themselves before clicking, having only verbally identified the wrong cluster. Two of the participants mentioned the reversal when they made their mistakes, while the other three did not notice the reversal and corrected their mistakes more quickly.

The second most common mistake occurred when the participants navigated from the document level back to the cluster level of the graph. Two people did not use the intended navigation but instead used the browser function; a third participant was unsure how to proceed but, instead of using the browser function, took a longer time to identify and use the intended function. The participants who navigated with the browser both said that this was what they were used to doing, rather than that they could not find the navigation within the interface.

6.2.2 User experience

The interview part of the evaluation was conducted in order to see whether the participants were aware of their mistakes, as well as to identify problems with the interface that might not readily show in task-based testing. Three common themes emerged from the interviews: the reversal of the x-axis on the graph, navigation, and what LIX/OVIX is. What follows is a summary of the answers from all participants to the most relevant questions.

Reversal of the x-axis

Most participants did notice the reversal of the x-axis, but only two of them found it disrupting or confusing. The two participants who found it confusing regarded themselves as experienced computer users who were used to looking at graphs, and one gave that as an explanation for why he found it confusing. Of the three participants who made a mistake reading the reversed x-axis but did not comment on the reversal at the time of the task, two admitted in the interview that they had noticed it afterwards but were not bothered by it. What follows is a summary of the answers from all participants to the most relevant questions regarding the reversed x-axis:

• Some participants had a lot of questions and opinions regarding the x- and y-axes and how the x-axis was reversed compared to what one would expect. This opinion was expressed more strongly by participants who regarded themselves as experienced computer users.

• When asked if there were any elements which were difficult to comprehend, two participants mentioned that it was hard to find readability on the graph.

• When asked if the interaction was intuitive, all participants answered yes; however, one participant added that he found it difficult to understand what a cluster was, how clustering worked and why the readability axis was reversed.

The readability indexes

Most participants were unaware of what LIX and OVIX are and did not know which readability index they should use. They all thought it was important to have an explanation of the readability indexes in order to be able to use them properly in searches. The one participant who did know about readability indexes still expressed a wish for an explanation, since he felt unsure exactly how they worked. Another commented that she did not know what LIX and OVIX were or how they worked out which document is easier to read. In order for the participants to be able to choose a suitable readability index for their own purpose, they need to know what criteria the readability indexes use to score the text. The participants also showed curiosity about how readability indexes (as well as clustering) work.

Navigation

Navigation seemed to pose little problem for the participants in general. The technical mistakes were mostly caused by the participants' habits rather than by a lack of understanding of the navigation. However, one participant commented that two of the clusters had the same colour (since colour is assigned randomly, this tends to happen), but that when the mouse hovers over a cluster all the hexagons belonging to that cluster are highlighted, which was a good thing and made the choice of cluster colour less important.

Another user asked during testing whether the colour was an indicator of how "good" or "bad" a cluster is, for example whether a red cluster meant that it is not easy to read or whether a green cluster meant that it was the best cluster. The participants found the interface easy to navigate and understand, but some of them still expressed a wish for tooltips and maybe even a tutorial. Most comments about the navigation were made during the testing, but below is a summary of the answers to the relevant interview questions:

• When asked if the document list helped in finding specific documents, all participants agreed that it would be really useful if you prioritize, for example, readability over relevance. They also found it helpful that the parent cluster lights up whenever the mouse hovers over a document in the list.

• All participants understood the difference between the cluster level and the document level without any problems.

• When asked if there were any difficulties navigating the graph, most participants said that it was easy to navigate. The exception was one user who had problems understanding that hexagons with the same colour belonged to the same cluster, but he stated shortly afterwards that it "made sense".

6.3 Discussion

There is room for improvement in the current interface; however, it will be useful as a template for future development. The evaluation points to a few changes that could improve the usability of the interface, although they might not be important for general use.

Changing the x-axis so that it is displayed the way some users are accustomed to (- to the left and + to the right) would make fewer users confused by the results in the graph.
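The two axis orientations differ only in how a readability score is mapped to a grid column. The sketch below illustrates this with a single flag; the function name, the score range and the number of columns are hypothetical, and the thesis does not describe how the prototype performs this mapping.

```python
def score_to_column(score, lowest, highest, columns, reversed_axis=False):
    """Map a readability score to a column index in [0, columns - 1].

    With reversed_axis=False the lowest score lands in the leftmost column;
    with reversed_axis=True the same score lands in the rightmost column.
    """
    if highest <= lowest:
        raise ValueError("highest must be greater than lowest")
    if columns < 1:
        raise ValueError("columns must be at least 1")
    # Clamp out-of-range scores to the edges of the grid.
    clamped = min(max(score, lowest), highest)
    fraction = (clamped - lowest) / (highest - lowest)
    column = round(fraction * (columns - 1))
    return (columns - 1 - column) if reversed_axis else column


if __name__ == "__main__":
    # Hypothetical range of 20 to 60 on an 11-column grid.
    print(score_to_column(20, 20, 60, 11))                      # 0
    print(score_to_column(20, 20, 60, 11, reversed_axis=True))  # 10
```

Exposing the orientation as an option would let the axis direction be changed without touching the rest of the layout.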

Explanations should be available, maybe as a quick tutorial or as tooltips.

Information about the different reading indexes would help users decide which index they should use. Not much information is required, just a simple explanation of how the reading index works and what the score indicates.
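As an illustration of how little is needed to explain the two indexes, the sketch below computes them from their standard definitions: LIX is the average sentence length plus the percentage of words longer than six letters, and OVIX is log(n) / log(2 - log(u) / log(n)), where n is the number of words and u the number of unique words. For both, a lower score indicates text that is easier to read. The tokenisation and sentence splitting are simplifying assumptions made for this sketch and are not taken from the system evaluated here.

```python
import math
import re


def _words(text):
    """Lower-cased runs of letters; a simplified tokeniser (assumption)."""
    return re.findall(r"[^\W\d_]+", text.lower())


def lix(text):
    """LIX = words per sentence + percentage of words longer than 6 letters."""
    words = _words(text)
    if not words:
        raise ValueError("text contains no words")
    # Simplification: a sentence ends at a run of '.', '!' or '?'.
    sentences = max(1, len(re.findall(r"[.!?]+", text)))
    long_words = sum(1 for word in words if len(word) > 6)
    return len(words) / sentences + 100 * long_words / len(words)


def ovix(text):
    """OVIX = log(n) / log(2 - log(u) / log(n)); measures word variation."""
    words = _words(text)
    n, u = len(words), len(set(words))
    if n < 2 or u == n:
        # The formula divides by zero when every word is unique.
        raise ValueError("OVIX is undefined for this text")
    return math.log(n) / math.log(2 - math.log(u) / math.log(n))


if __name__ == "__main__":
    sample = "The cat sat. The cat ran away quickly."
    print(lix(sample))   # 16.5 (8 words, 2 sentences, 1 long word)
    print(ovix(sample))
```

A tooltip could state the same thing in one sentence each: LIX rewards short sentences and short words, while OVIX rewards a small, repeated vocabulary.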


Static colour choices for the clusters could help users differentiate between clusters, but might be a problem since they would limit the maximum number of clusters allowed.

There are also a few additions that would benefit the interface, such as displaying tag words for individual documents or post-processing of documents (changing font, font size, text structure, et cetera) for easier reading. Also worth considering is an alternative way to view the grid: instead of two layers (cluster and document level), use only colour coding to differentiate between clusters and place each individual document according to its true value. This might, however, require a larger grid and might make it harder for a user to locate entire clusters; it could be added as an option.
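On the colour question raised above, one possible middle ground between random and fixed colours is to generate evenly spaced hues for however many clusters a search returns. The sketch below is only an illustration of that idea, not something the prototype does; the function name and the default saturation and lightness values are assumptions.

```python
import colorsys


def cluster_colours(cluster_count, saturation=0.65, lightness=0.55):
    """Return one hex colour per cluster, with hues spaced evenly around
    the colour wheel so that no two clusters are assigned the same hue."""
    colours = []
    for index in range(cluster_count):
        hue = index / cluster_count
        red, green, blue = colorsys.hls_to_rgb(hue, lightness, saturation)
        colours.append("#{:02x}{:02x}{:02x}".format(
            round(red * 255), round(green * 255), round(blue * 255)))
    return colours


if __name__ == "__main__":
    for colour in cluster_colours(6):
        print(colour)
```

This avoids two clusters receiving the same colour by chance and places no fixed limit on the number of clusters, although neighbouring hues become harder to tell apart as the number of clusters grows, so highlighting a cluster on hover would still be useful.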

7 Summary

The task of developing an interface designed to facilitate the needs of people with reading disabilities is rather complicated. There are several types of reading disabilities with different underlying causes and problems, so in order to create a user-friendly interface you need to understand what the problems are. For instance, people with a lesser understanding of the language might have difficulty with unusual words, and people with ADHD might have problems reading large chunks of text and long words. In order to address most of these issues, a graphical interface seemed most suitable.

Two previously created personas, as well as an understanding of how to design for cognitively impaired users, were used to identify which aspects the design should focus on. The choice of layout was based on aesthetics and function, to allow as much information as possible to be viewed with as little demand on cognitive resources as possible.

Since the program uses two criteria to rank the documents (relevance and readability), a different approach to design was needed than for, say, Carrot2, which only lists relevance. This affected the choice of using a grid graph, and in order to still provide an easy overview of individual documents, a list was implemented to the right of the graph. When the interface was evaluated, the participants found it easy to understand and navigate, with a few exceptions. However, some explanations, tooltips or tutorials might be required in order for users to be able to use the service to its full potential.

The functional interface prototype is not meant as the interface for the final product; rather, it is a mock-up based on the idea of representing as much information graphically as possible. It was created in order to test the design, and it will be a useful guide in the creation of the final interface. There are many factors that have to be taken into account when designing a user interface, especially if the design is also meant for users with impairments. This interface is therefore only the first stage towards the final product. Future work should take a closer look at which readability indexes to use and at additional help for users with reading disabilities (text format changes, use of colours, text simplification, et cetera), as well as at creating a more complete website (perhaps with the ability to save search results or to create a list of reading material from several different searches). This interface only sought to limit the amount of information, or rather the cognitively draining elements, that is not necessary for the main feature. Now it might be time to consider adding those extra functions that might be needed.


References

[1] Fred K. Berger. http://www.nlm.nih.gov/medlineplus/ency/article/001551.htm. 2014.

[2] Emilio Di Giacomo, Walter Didimo, Luca Grilli and Giuseppe Liotta. Graph visualization techniques for web clustering engines. 2007.

[3] Anil K. Jain. Data clustering: 50 years beyond k-means. 2009.

[4] Kouichi Yoshimasu, William J. Barbaresi, Robert C. Colligan, Jill M. Killian, Robert G. Voigt, Amy L. Weaver and Slavica K. Katusic. Gender, attention-deficit/hyperactivity disorder, and reading disability in a population-based birth cohort. 2010.

[5] Peter Gregor and Anna Dickinson. Cognitive difficulties and access to information systems: an interaction design perspective. 2006.

[6] Sally E. Shaywitz and Bennett A. Shaywitz. Dyslexia (specific reading disability). 2005.

[7] Stanisław Osiński and Dawid Weiss. Carrot2: Design of a flexible and efficient web information retrieval framework. 2005.

[8] Stanisław Osiński and Dawid Weiss. Introducing usability practices to OSS: The insiders' experience. 2007.


Contents

1 Introduction

1.1 Hypothesis

1.2 Purpose

2 Background

2.1 Data clustering

2.2 Readability

3 Existing data cluster engines

3.1 Carrot2

4 Reading Disabilities

4.1 Dyslexia

4.2 Attention Decit (Hyperactivity) Disorder(ADHD)

4.3 Cognitive impairments and information accessibility

5 Creating a prototype

5.1 Personas

5.2 Interface prototype

6 Evaluation

6.1 Method

6.2 Results

6.3 Discussion

7 Summary

References

Linköping University Electronic Press


Copyright

The publishers will keep this document online on the Internet, or its possible replacement, from the date of publication barring exceptional circumstances.

The online availability of the document implies permanent permission for anyone to read, to download, or to print out single copies for his/her own use and to use it unchanged for non-commercial research and educational purposes. Subsequent transfers of copyright cannot revoke this permission. All other uses of the document are conditional upon the consent of the copyright owner. The publisher has taken technical and administrative measures to assure authenticity, security and accessibility.

According to intellectual property law, the author has the right to be mentioned when his/her work is accessed as described above and to be protected against infringement.

For additional information about the Linköping University Electronic Press and its procedures for publication and for assurance of document integrity, please refer to its www home page: http://www.ep.liu.se/.
