

Linköping University

Department of Computer and Information Science

Final Thesis

Visualization of live search

by

Olof Nilsson

LIU-IDA/LITH-EX-A--13/002--SE

2013-12-09

Supervisor: Johan Åberg

Examiner: Magnus Bång


Contents

1 Introduction
1.1 Problem
1.2 Sources
1.3 Glossary
1.4 Structure

2 Theory
2.1 Search and information retrieval
2.1.1 Representing documents
2.2 Information visualization
2.3 Document classification
2.3.1 Measuring performance
2.4 Technologies

3 Method
3.1 Overview
3.2 Visualization tool
3.3 Document classification
3.3.1 Selection of algorithm

4 Results
4.1 Architecture
4.2 Web crawling
4.3 Data processing and indexing
4.4 Document classification
4.4.1 Machine learning classifier
4.4.2 Word matching classifier
4.4.3 Classification weighting
4.4.4 Classifier evaluation
4.5 Live search
4.6 Visualization tool
4.6.1 Tool construction
4.6.2 Evaluation of the tool

5 Discussion
5.1 Document classification
5.2 Visualization tool

6 Conclusion
6.1 Problem
6.2 Future work

A Visualization tool screenshots


List of Tables

2.3.1 Multi-label measures
2.3.2 Multi-label measures
2.3.3 Measures notation
3.3.1 Classes
4.3.1 Document fields
4.4.2 Classifier performance results: Feature threshold
4.4.3 Classifier performance results: Labelling threshold
4.4.4 Classifier performance results: Word Matching classifier
4.4.5 Classifier performance results: Weighted classification
5.1.1 Classifier performance results: Comparison


List of Figures

4.1.1 Architecture
4.4.2 Classifier performance results: Feature threshold
4.4.3 Classifier performance results: Labelling threshold
4.6.4 Visualization tool overview
A.1 Case A screenshots
A.2 Case B screenshots


Abstract

The classical search engine result page is used for many interactions with search results. While these are effective at communicating relevance, they do not present the context well. By giving the user an overview in the form of a spatialized display, in a domain that has a physical analog that the user is familiar with, context should become pre-attentive and obvious to the user. A prototype has been built that takes public medical information articles and assigns these to parts of the human body. The articles are indexed and made searchable. A visualization presents the coverage of a query on the human body and allows the user to interact with it to explore the results. Through usage cases the function and utility of the approach is shown.


Chapter 1

Introduction

This thesis was done in close cooperation with Findwise AB, at their Stockholm office. The company supplies enterprise search solutions primarily to large organizations and corporations.

1.1 Problem

The classical search engine result page (SERP) is used for almost all interactions with search results, especially on the web. While these pages can give the user a good idea of what the most relevant results are, they lack context for how the results relate to the collection as a whole. Solutions such as facets with query previews can give the user an idea of the distribution of results, but do not generally show what coverage the query has on the entire search domain. According to the Visual Information-Seeking Mantra, effective visualization tools rely on overviews:

Overview first, zoom and filter, then details-on-demand.

(Shneiderman, 1996)

This mantra also describes a common way of searching: get an overview of the results, filter to get the subset you are interested in, and finally check which items are the most interesting. This process is iterative. With SERPs, users are forced to iterate on the detail items. Search engines have tried to facilitate this by using live search, but this still requires scanning of each list of results. By introducing an overview and moving the real-time response to this level, the mantra and the search process should fit together well. Spatialized displays are effective as overviews, and can exploit pre-attentive properties to quickly give the user an idea of what is being presented. However, such displays run the risk of becoming too abstract if they lack a physical analog. With a limited and specific domain this can be avoided. A well-understood spatialization could also offer the possibility of spatializing the second step of the mantra, the filtering.


The aim of this thesis is to build a system that considers these problems and tries to solve them. To do this, the following questions must be answered:

• How can search results be given positions in a spatialization of a search domain, with a reasonable precision?

• How can facets be represented in the spatialization and allow the user to accurately select them?

A reasonable precision is a result that is comparable with results acquired in other studies, on similarly sized datasets and in similar classification tasks. The precision should measure how well the outcome matches the expectation, entirely or in part.

Accurate selection is when selection of an object yields the result that the user expected.

1.2 Sources

The majority of the articles have been found using the Swedish university library system Libris, Google Scholar, and by looking through bibliographies of previously discovered articles and books. Websites of prominent search and visualization research labs have also been studied to find recommended sources of information. Article credibility has been decided mainly by relevance and citation count, or by whether the article is cited by another credible source.

1.3 Glossary

These are definitions of concepts and words as used in this thesis.

Search Querying a web-based information retrieval system for textual information in a collection of text documents.

Collection A database, website, filesystem or any other collection of documents that can be used as data for an information retrieval system.

Facet A filter to be applied before searching. The search will then only be done within the set of documents that match the facet. (Morville and Callender, 2010)

Domain, search domain Search in a limited set of information, bounded by e.g. vocabulary or topics.

Query completion Automatic completion of a search query while it is being written. Uses query expansion to find possible queries and suggests these to the user so that the user can complete the query without typing it all out. (Morville and Callender, 2010)


Live search Presenting search results dynamically as queries are made or facets are selected.

Visualization The concept, and research area, of using some medium to convey data in a way that allows interpretation and understanding by a human. (Spence, 2007)

Visualization tool An interactive system that uses principles of visualization to convey something to a user. (Spence, 2007)

Spatialized display An image or a visualization tool that uses the map metaphor to show information in the same way as, for example, population density can be shown on a geographical map. (Fabrikant et al., 2006)

Classification Automatically putting a document in a class or category.

Class, label A document belongs to a class if it has a label representing that class assigned to it.

1.4 Structure

I begin by presenting the theoretical background for this thesis, continue with the methods used and the results obtained with them, and then discuss those results. Finally, conclusions are made along with some ideas of how to proceed.


Chapter 2

Theory

This chapter describes the theoretical foundations of this thesis, specifically in the areas of information retrieval, information visualization and document classification.

2.1 Search and information retrieval

Jansen and Rieh (2010) described information retrieval (IR) as a field focusing on systems for extraction of information from content collections, in contrast to information searching, which concerns itself with human interaction with such systems. This thesis covers both of these fields, but takes more basis in information visualization when it comes to interaction. In Jansen and Rieh's nested framework of human information behavior and information systems, this thesis limits itself to the innermost areas, those of information searching behavior and information retrieval systems. Here the IR systems enable the searching behavior, where humans search or browse the systems.

The way search results are displayed has changed little since the arrival of the web and the web search engine. While important interaction functions have been added, the dominant method is still the search engine result page (SERP), where a list of textual items is sorted vertically by relevance. (Hearst, 2009)

There are some usage patterns that are especially common in search. Morville and Callender (2010) identified the following:

Quit Quit without finding anything.

Narrow Start with broad results and narrow down.

Expand Start with a narrow query and expand the amount of results.

Pearl Growing Take a document and search with that as a basis.


Facets, or faceted navigation, is a technique used to allow users to narrow their search by selecting ranges or specific values of some fields (Manning et al., 2008). This can be further enhanced by adding query previews that show how many results the user can expect with a certain facet applied. Graphical representations of facets are also sometimes used, which make selecting facets more natural for the user. Sidi and Marques (2011) described a graphical interface used to navigate medical information, aimed at medical professionals, where the user started by selecting the body part they were interested in as a facet. That interface lacked any type of query preview, however.

Expanding search results is often implemented as an advanced search, or through navigation aids (Morville and Callender, 2010). One reason why these patterns reappear in many search systems is that the query input by the user is not the same as the actual information need (Manning et al., 2008).

There is also the risk of the user getting stuck in anti-patterns (Morville and Callender, 2010). Users can get stuck pogosticking between the results and individual items when the items aren't the ones the user actually wanted. Another risk is users thrashing on the query, rewriting it over and over without receiving the right results. Hearst (2009) writes that with SERPs, users must often learn that they can't expect the first query to yield immediately usable results; instead, they must be ready to scan the result list to find what they were actually looking for, or to reformulate their query to get more specific results.

2.1.1 Representing documents

Representing the importance of a term in a document is commonly done using tf-idf weighting. The weighting is defined as

$\text{tf-idf}_{t,d} = \text{tf}_{t,d} \times \text{idf}_t$

where $t$ is a term and $d$ a document in a collection $D$ of length $n$. The term frequency ($\text{tf}_{t,d}$) is the number of times $t$ occurs in $d$. The inverse document frequency ($\text{idf}_t$) is the inverse of $\text{df}_t$, the number of documents in $D$ that $t$ occurs in. This is defined as

$\text{idf}_t = \log \frac{n}{\text{df}_t}$

With the importance of terms represented as weights, a document can be said to be a vector of these weights. So a document is a vector of weights, or features, for each term in the collection, also known as the feature vector. This is called the vector space model and is a fundamental part of many information retrieval techniques. (Manning et al., 2008)


To compare two documents in the vector space model, a common solution is the cosine similarity:

$\text{sim}(d_1, d_2) = \frac{\vec{d_1} \cdot \vec{d_2}}{|\vec{d_1}||\vec{d_2}|}$
The Solr search platform also gives higher scores for matches in short document fields than in long, as well as for matching multiple terms of a query. (SolrRelevancyFaq, 2012)

2.2 Information visualization

The design ideals behind information visualization can be summed up in the Visual Information-Seeking Mantra:

Overview first, zoom and filter, then details-on-demand.

(Shneiderman, 1996)

This principle has been successful in many cases when applied to search interfaces. (Hearst, 2009)

The type of data used for the visualization greatly influences what can be done with it. Data that is categorical, i.e. discretely divided into different categories such as intervals or file types, requires different solutions than quantitative data (such as numbers). Especially nominal data has proven difficult to visualize. (Hearst, 2009)

There have been a number of systems that show the content of documents as some form of glyph or graph. Some document visualizations have relied on word counts and other non-semantic statistics to generate graphical representations. Direct applications of the vector space model are also common, with the visualization positioning documents according to some similarity metric. This has not proven successful for usability, however:

Adjacency on maps like these is meant to indicate semantic similarity along an abstract dimension, but this dimension does not have a spatial analog that is easily understood.

(Hearst, 2009, p.241)

Bonnel et al. (2005) discuss finding meaningful metaphors for search result visualization, and the design choices that have to be made to find them. They stress the importance of a clear metaphor, a realization of an association between graphical parameters and document information. One approach presented is a three-level hierarchy where the interaction starts on topics, moves to a similarity view and finally to document details; a solution that is analogous to the visual information-seeking mantra.

Skupin and Fabrikant (2003) presented the idea of using spatialization, a common method in the area of geographic visualization, for information visualization. Two main approaches were outlined:


Computational approach: Using cartographic computational methods, or methods from geographic information science, directly on non-geographic data.

Cognitive approach: Building visualizations that function with the human perception of geographic space, and of visual representations of such.

A visualization tool that uses spatialization is also called a spatialized display. Spatialized displays of non-geographic data have been shown to be in many ways as intuitive as geographical maps (Fabrikant et al., 2006). User understanding of these types of spatial metaphors even appears to be unconnected to the user's background and expertise with spatial data (Skupin and Fabrikant, 2003).

One of the strongest effects in spatialization and geographic visualization is the distance-similarity metaphor, also expressed in the so-called first law of geography:

Everything is related to everything else, but closer things are more related than distant things

(Skupin and Fabrikant, 2003)

Overcoming this effect can be done by applying transformations to the objects in the spatialization. The transformations can alter things such as scale, color, connections and boundaries. (Skupin and Fabrikant, 2003)

There are some aspects of human awareness that are especially important for visualization:

Inattentional blindness We only notice what we are paying attention to.

Change blindness We are bad at seeing changes.

Cognitive map We keep an internal model of data.

Pre-attention We notice some things unconsciously.

Pre-attention concerns pre-attentive properties, the things that humans perceive before consciously thinking about them. Commonly, detecting them takes around 200-250 milliseconds, so we can easily see if objects differ in these properties with just a glance. They are properties such as color, shape, size, texture, direction, density and intensity. Detecting these differences is impaired if there is more than one property that differs, however. (Healey and Enns, 2011; Spence, 2007)

This knowledge of human perception has been used in visualization extensively. Varying color based on data is a common approach, as well as texture and motion. (Healey and Enns, 2011)

Work has been done on creating visualizations based on automatic classification of documents, such as by Kules et al. (2006). In tests, it was found that users explored deeper with the overview than without, which could combat some of the problems with search interfaces mentioned earlier.


2.3 Document classification

A classification problem is concerned with automatically assigning labels to documents, or, put differently, deciding the class of a document. Tsoumakas and Katakis (2007) divide classification problems into single-label and multi-label problems. In a single-label problem, a document is assigned a single label $l \in C$, where $C$ is the set of classes. Depending on the size of $C$, the problem is binary ($|C| = 2$) or multi-class ($|C| > 2$). In a multi-label problem, a document is assigned a set of labels $L \subseteq C$.

Kules et al. (2006) describe three dimensions of document classification: lean vs. rich classes, online vs. offline classification and full- vs. fast-feature techniques. Lean and rich classes refer to the depth of the classes; simple classification by document type or language is lean, while extensive taxonomies or ontologies are rich. Online and offline on the other hand consider when the classification is done; online is run at query time and (for search) works with the returned result set, while offline is run prior to any querying and can leverage the full database. Finally, full- and fast-feature refer to the information required by the classification; a full-feature technique needs all information that is available about a document, while a fast-feature one can look solely at metadata.

Approaches to multi-label classification have fallen into two main types: problem transformation and algorithm adaptation. In the first case, the multi-label problem is transformed into one or more single-label problems. In the second case, a single-label algorithm is modified to handle multiple labels. (Tsoumakas and Katakis, 2007)

With documents represented in the vector space model, classification becomes a matter of separating areas within this space. The methods can be divided into linear and non-linear. A linear classification method separates the n-dimensional vector space with an n-dimensional hyperplane. In the single-label binary case, this means everything on one side is a member of the class and everything on the other is not. A non-linear classification method is one that does not separate the space solely via a hyperplane; for example, it can be a combination of a hyperplane and a hypersphere, or some kind of surface. Deciding class membership can therefore become more complex than for linear methods. (Manning et al., 2008)

The nature of these two types of classification methods means they are good at different things. Linear methods will clearly separate the space even if it is somewhat noisy. Some documents will be mis-classified, but the majority will be correct. A non-linear method will adapt more easily to hard-to-separate data, but at the same time will react strongly to noise. Manning et al. (2008) describe the sensitivity to noise as variance, while adapting to the majority is bias. Considering these two is called the bias-variance trade-off and is essential to selecting an appropriate method.
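As a minimal illustration of the linear case (an example constructed for this text, with made-up weights rather than trained ones), class membership reduces to the sign of a dot product with the hyperplane's weight vector:

    // A hyperplane is defined by a weight vector w and a bias b. A document
    // vector x belongs to the class when it lies on the positive side of the
    // hyperplane, i.e. when w . x + b > 0.
    function linearScore(w, b, x) {
      var score = b;
      for (var i = 0; i < w.length; i++) {
        score += w[i] * x[i];
      }
      return score;
    }

    // Hypothetical two-feature example; real weights come from training.
    var w = [0.8, -0.3];
    var b = -0.1;
    console.log(linearScore(w, b, [1.0, 0.2]) > 0);  // true: in the class
    console.log(linearScore(w, b, [0.0, 1.0]) > 0);  // false: not in the class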

Search result or document clustering is a related technique. Documents are assigned to clusters depending on their properties. Contrary to classification though, clusters are generally automatically generated. Kules et al. (2006) argue that while automated clustering can give new insight into how search results are structured, categories or classifications that are stable over time help the user more. Manning et al. (2008) agree with this position, pointing at tests that show good category hierarchies are better both for user experience and search result quality. It has also been shown that users prefer navigating a clearly defined document space (Skupin and Fabrikant, 2003).

2.3.1 Measuring performance

Performance of classification algorithms can be measured in a number of ways. In their overview of multi-label classification, Tsoumakas and Katakis (2007) used four performance measures, see table 2.3.1. The usage of the terms precision and recall differs a bit from the common one, since they do not measure the precision of each assignment but of the set of assignments as a whole.

Table 2.3.1: Multi-label measures

Accuracy (the average exact matches per document):
$\text{Accuracy} = \frac{1}{n} \sum_{i=1}^{n} \frac{|Y_i \cap Z_i|}{|Y_i \cup Z_i|}$

Precision (the average partial matches per document):
$\text{Precision} = \frac{1}{n} \sum_{i=1}^{n} \frac{|Y_i \cap Z_i|}{|Z_i|}$

Recall (the average per-class classification with partial matches):
$\text{Recall} = \frac{1}{n} \sum_{i=1}^{n} \frac{|Y_i \cap Z_i|}{|Y_i|}$

Hamming Loss (the average per-example per-class total error):
$\text{Hamming Loss} = \frac{1}{nl} \sum_{i=1}^{n} |Y_i \,\Delta\, Z_i|$

In these formulas, n is the number of documents, l the number of classes, Y is the set of original labels and Z is the set of assigned labels.

Accuracy, precision and recall have a range of [0.0, 1.0]. Hamming Loss, on the other hand, depends on the absolute number of labels assigned and originally present. Ideally it is zero, but in the worst case it will be equal to the label density of the data set. Label density and the closely related label cardinality are defined as:

Label cardinality The average number of labels.

Label density The cardinality divided by the number of classes.

(Tsoumakas and Katakis, 2007)

Sokolova and Lapalme (2009) used similar measures, but differentiated them from non-multi-label classification measures, see table 2.3.2. (Footnote in the original: the source contains a misprint for Exact Match Ratio, comparing $L^d$ [...] of $L^c_i[j]$.)

Table 2.3.2: Multi-label measures

Exact Match Ratio (the average per-document exact classification):
$\text{EMR} = \frac{1}{n} \sum_{i=1}^{n} I(L^c_i = L^d_i)$

Labelling Fscore (the average per-document classification with partial matches):
$\text{LF} = \frac{1}{n} \sum_{i=1}^{n} \frac{2 \sum_{j=1}^{l} L^c_i[j] L^d_i[j]}{\sum_{j=1}^{l} (L^c_i[j] + L^d_i[j])}$

Retrieval Fscore (the average per-class classification with partial matches):
$\text{RF} = \frac{1}{l} \sum_{j=1}^{l} \frac{2 \sum_{i=1}^{n} L^c_i[j] L^d_i[j]}{\sum_{i=1}^{n} (L^c_i[j] + L^d_i[j])}$

Hamming Loss (the average per-example per-class total error):
$\text{HL} = \frac{1}{nl} \sum_{i=1}^{n} \sum_{j=1}^{l} I(L^c_i[j] \neq L^d_i[j])$

In these formulas, $n$ is the number of documents, $l$ the number of classes, $L^d$ the list of original labels and $L^c$ the list of assigned labels. $L^c_i[j]$ is 1 if document $i$ has label $j$, 0 otherwise. The same ranges as for the previous formulas apply here.
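One reason these sum-based definitions are convenient is that they translate directly into code. The following sketch (Javascript written for this text, not the evaluation script used in the thesis) computes all four measures from 0/1 label matrices:

    // Lc and Ld are n-by-l matrices of 0/1 indicators: Lc[i][j] is 1 if
    // document i was assigned label j; Ld holds the original labels.
    function measures(Lc, Ld) {
      var n = Lc.length, l = Lc[0].length;
      var exact = 0, lf = 0, rf = 0, hl = 0;
      var i, j, inter, total, mismatches;

      for (i = 0; i < n; i++) {           // per-document sums: EMR, LF, HL
        inter = 0; total = 0; mismatches = 0;
        for (j = 0; j < l; j++) {
          inter += Lc[i][j] * Ld[i][j];
          total += Lc[i][j] + Ld[i][j];
          if (Lc[i][j] !== Ld[i][j]) { mismatches++; }
        }
        if (mismatches === 0) { exact++; }        // I(Lc_i = Ld_i)
        lf += total > 0 ? 2 * inter / total : 1;  // empty vs. empty: perfect
        hl += mismatches;
      }

      for (j = 0; j < l; j++) {           // per-class sums: RF
        inter = 0; total = 0;
        for (i = 0; i < n; i++) {
          inter += Lc[i][j] * Ld[i][j];
          total += Lc[i][j] + Ld[i][j];
        }
        rf += total > 0 ? 2 * inter / total : 1;
      }

      return { EMR: exact / n, LF: lf / n, RF: rf / l, HL: hl / (n * l) };
    }

    // Two documents, three classes; the second document is labelled perfectly.
    console.log(measures([[1, 0, 1], [0, 1, 0]],
                         [[1, 1, 0], [0, 1, 0]]));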

In this thesis, the formulas described by Sokolova and Lapalme are used to measure performance. They are more clearly defined for the multi-label case, and their sum-based definition is easily described in program code. In addition to these measures, label density (of the testing data) will be included to make the Hamming Loss value more comparable between test cases where the density differs. A normalized (divided by the label density) Hamming Loss is also included in graphs. The notation used can be seen below.

Table 2.3.3: Measures notation

Shorthand Measure

EMR Exact Match Ratio

LF Labelling Fscore

RF Retrieval Fscore

HL Hamming Loss

LD Label Density

N-HL Normalized Hamming Loss

2.4 Technologies

The system in this thesis uses a number of existing applications, frameworks and libraries. A short description of each one follows.



Nutch The Apache project Nutch is an open-source web crawler. It collects the web pages it encounters by following links from a list of seed URLs.

Solr Solr is an enterprise search platform. It is an open-source project at Apache. Documents in Solr are indexed according to a schema that describes which fields they should contain. When queries are sent to Solr, the relevance of documents is calculated (as described in 2.1) and a result list sorted on this is returned.
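For illustration, a Solr request combining a free-text query with facet counts might look as follows; the host, query and field names here are hypothetical, not the configuration used in the thesis:

    // Hypothetical Solr request: a free-text query plus facet counts for each
    // value of a "labels" field.
    var solrUrl = "http://localhost:8983/solr/select" +
        "?q=article_text:halsont" +    // the user's query
        "&facet=true" +                // enable faceting
        "&facet.field=labels" +        // request one count per label value
        "&wt=json";                    // respond in JSON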

Hydra Hydra is a document pipeline. It is a proprietary platform currently being developed at Findwise. Hydra is based around a core, which handles a database containing the documents, and a number of stages. Each stage is run as a separate process and communicates with the core using REST (Representational State Transfer). The stages fetch documents and perform operations on them, from modifying fields or extracting keywords to outputting them over HTTP. Stages can also input new documents or filter out ones already there.

Since the stages are all their own processes, document fetching is asynchronous. A strict pipeline can be enforced by setting rules or query parameters for the stages, such as only running stage B if stage Q has already fetched and returned ("touched") that document.

Jellyfish Jellyfish is a proprietary front-end framework for search applications, developed at Findwise. It serves as an abstraction layer between the front-end and a back-end search server, allowing for federated search (sending the same queries to multiple search engines and combining the results) as well as switching out the back-end without affecting the front-end. It provides a web service which a user interface can communicate with over REST, receiving answers in JSON (Javascript Object Notation).

Web technologies The main web technologies used are HTML, CSS, SVG and Javascript. While these technologies are standardized, the implementation varies between web browsers. Ensuring proper function in all major browsers is time-consuming, so in this thesis development has been targeted at a single browser: Google Chrome.

Key to building a dynamic interactive web application using these technologies is Document Object Model (DOM) manipulation. When loading a web page, a web browser constructs a DOM tree that stores the structure of the page. The DOM tree is then used for interaction and rendering of the page: changes to the tree cause the rendered page to change. One way to modify the DOM tree is using Javascript, a programming language that most web browsers can interpret.

Two major Javascript libraries are used in this thesis: jQuery and Data-Driven Documents (d3.js). They both offer DOM manipulation but are otherwise intended for different usage. jQuery helps in building complex interactions and simplifies common tasks. d3.js is aimed entirely at constructing visualization tools from data.

Both libraries make selections of nodes in the DOM tree using CSS selectors. This allows a selection to contain one or more nodes, making all operations on the selection apply to all nodes.

With d3.js, nodes can have data associated with them. Assigning data to a selection maps the data items to each of the nodes in the selection, and creates new nodes if needed. Nodes can have data-dependent functions assigned for any attribute, allowing the color, size, position and so on to change as data changes.
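A minimal sketch of this data-binding pattern, using the d3.js version 3 style API that was current at the time (the data and attributes are illustrative):

    // Bind one rect per data item: existing nodes are updated and missing
    // nodes are created through the enter() selection.
    var counts = [3, 14, 7];
    var svg = d3.select("body").append("svg")
        .attr("width", 300).attr("height", 100);

    var bars = svg.selectAll("rect").data(counts);
    bars.enter().append("rect");       // create nodes for unmatched data items

    bars.attr("x", function (d, i) { return i * 30; })
        .attr("y", 0)
        .attr("width", 20)
        .attr("height", function (d) { return d * 5; })  // data-dependent attribute
        .style("fill", function (d) { return d > 10 ? "red" : "blue"; });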

d3.js itself does not draw graphics; instead, it relies on the browser's capability to render Scalable Vector Graphics (SVG), or simply HTML and CSS.


Chapter 3

Method

In this chapter, the method used to address the problem is presented.

3.1 Overview

A visualization tool has been constructed that uses live search to show the coverage of a search query over the search domain as the user types. The user can interact with the tool to select an area of the search domain to apply faceting on the results.

The chosen domain to try this system on is the human body, in particular the public medical care information sites Vårdguiden (http://www.vardguiden.se) and 1177 (http://www.1177.se). Data was acquired from the websites using a web crawler. The data was indexed by Solr and automatically classified by a document classifier. The search interface and visualization was implemented as a web application communicating with Findwise's own search services framework.

The precision of the positions in the spatialization is represented by the precision of the document classifier. The measure used is Labelling F-score which takes multi-label, multi-class classification into account.

The accuracy of the facet selection is motivated by a number of use cases that also highlight the design principles used in the visualization tool.

3.2 Visualization tool

The visualization tool has been constructed based on the theoretical basis described earlier. Problematic anti-patterns have been alleviated using the design patterns suggested by Morville and Callender (2010), and methods from information visualization.



As suggested by several authors (Kules et al., 2006; Hearst, 2009; Bonnel et al., 2005), a visualization tool benefits from having a physical analog, i.e. that the graphical representation is similar to a physical domain that the user is already familiar with. For the search domain used in this thesis, a possible spatialization is the human body, with documents or information about them displayed on different parts of the body.

Evaluation is based on use cases that describe how a user might interact with the visualization tool. The discussion considers how these use cases relate to the patterns and methods the tool is based on.

3.3 Document classification

To properly position the articles in the spatialization, the classes need to have positions, or at least areas that they conform to. For this purpose a subset of the international standard for anatomical nomenclature, Terminologia Anatomica, is used. Since the classes should be easy to understand and recognizable, only major body parts and organs are included. (Wikipedia, 2011b,c,a) The set of classes was reworked during development to improve classification performance. See table 3.3.1 for the classes used and the labels that documents are assigned.

Table 3.3.1: Classes

Class: Label (Swedish)
Head: huvud
Brain: hjärna
Neck: hals
Chest: bröst
Lungs: lungor
Heart: hjärta
Back: rygg
Spine: ryggrad
Shoulder: axel
Arm: arm
Hand: hand
Abdomen: buk
Stomach: mage
Sex organ: könsorgan
Pelvis: höft
Leg: ben
Foot: fot

Two documents for each class are selected by searching for the classes in the full dataset. These documents are then labelled by manually reading their contents and assigning fitting labels. This training data is then split into training and testing parts when running the classifier.


3.3.1 Selection of algorithm

The classification was to be implemented in either of the two proprietary frameworks (Hydra and Jellyfish) provided by the company, to ensure independence from the crawler and indexing systems. The choice of platform for the classification directly influences what limits and possibilities must be taken into consideration:

Hydra Offline, with access to full HTML data. Other data processing is done in Hydra, which keeps the architectural design cleaner. Adds a potentially costly processing stage, making updating the database slower. A stage in Hydra can only access one document at a time, so an algorithm should preferably work incrementally to allow training to be done using Hydra.

Jellyfish Online, access to fields in Solr documents. Classification can be combined with the visualization tool. Classification is done at request, which could slow down responses to the visualization tool. Can request all documents for training purposes.

Additionally, the characteristics of the dataset influence the choice of algorithm:

• The dataset is very small for a classification problem, consisting of around 900 documents.

• The number of training examples for each class is limited to just a few.

Implementing a document classification method is a substantial amount of work, so in the interest of saving time an open-source classification library is used. One option is the Stanford Natural Language Processing library. This has several algorithms to choose from, both linear and non-linear. It is a feature-based classifier that tracks how features relate to each class. For a linear method, training the classifier assigns weights to each such feature. (Manning and Klein, 2003)

Manning et al. (2008) recommends using a classifier with a high bias (thus low variance) for small datasets. A classifier with high variance could be easily influenced by single outlying documents, whose frequency in the dataset would be quite high given the small number of documents. This suggests that one of the linear methods in Stanford NLP is a good fit for this thesis.


Chapter 4

Results

This chapter is a presentation of the resulting prototype and the information received from evaluation of its different parts.

4.1 Architecture

The basic idea behind the system is to take documents through four steps: crawling, processing, indexing and presenting. The architecture can be seen in figure 4.1.1, and the four components below match these four steps:

Nutch Crawls the sites, collecting articles.

Hydra Receives the crawl data, extracts text and classifies articles.

Solr Builds an index with the articles and responds to search queries.

Jellyfish Handles interaction.

Figure 4.1.1: Architecture. [Diagram: crawl data from Nutch enters the Hydra pipeline (SolrInputStage, NodeExtractorStage, StanfordClassifierStage, WordMatchingClassifierStage, ClassifierWeightingStage and SolrOutputStage, around the Hydra Core); complete documents are indexed in Solr, which exchanges queries and responses with Jellyfish.]


Nutch, Hydra and Solr all work on documents, which are data structures with a number of fields. Their particular formats are not interchangeable, however, so documents in one component are converted before being sent to the next. Nutch and Hydra are only needed to collect data and process it; once that is done Solr and Jellyfish run online as the system that the user communicates with.

Besides using the frameworks and systems described in 2.4, this thesis also involved constructing a number of plugins and other software components. These are described below, and more detailed explanations can be found in the coming sections.

Nutch - RawContentParserFilter: Stores the raw HTML of crawled articles.

Nutch - RawContentIndexingFilter: Adds the stored HTML to the document sent out from Nutch.

Hydra - NodeExtractorStage: Extracts text from nodes in the HTML of a document.

Hydra - StanfordClassifierStage: Can be trained to assign labels to documents.

Hydra - WordMatchingClassifierStage: Assigns labels to documents via dictionary lookup.

Hydra - ClassifierWeightingStage: Assigns final labels to documents by weighing previously assigned labels together.

Jellyfish - vis.js: Handles the visualization tool on the web interface.

Jellyfish - livesearch.js: Makes search requests while the user types a query or interacts with the visualization tool.

During all testing and evaluation, this is run on Ubuntu 10.04 and Sun Java 6 1.6.0, with Solr and Jellyfish running as web applications inside a Tomcat 6 instance. The client web browser connects from the same network or the same computer.

4.2 Web crawling

Nutch is set up to do three passes over the data: crawl, merge and index. During crawl, all links that match certain filters are followed and the contents stored. In merge, any documents that shouldn't be in the final dataset (such as article listings or category summaries) are removed. During index, documents are output as Solr documents following a Solr schema and sent to Hydra for processing.


Nutch automatically parses the websites it crawls during the crawl pass in order to find links. This parsed content is put in a field in the final document, while the raw HTML is thrown away. In order to allow Hydra to process arbitrary parts of the articles, this raw HTML has to be stored in the data output from Nutch. The RawContentParserFilter and RawContentIndexingFilter handle this by storing the raw HTML as metadata during crawl, then putting it in a field in the output Solr document during indexing.

The crawls are done once for each data source; one crawl for Vårdguiden and one crawl for 1177.

4.3 Data processing and indexing

Hydra receives data from Nutch by exposing the same interface for adding documents as Solr. This way, no special plugin needed to be built for Nutch to output the documents. The articles are received as documents with raw HTML content stored in the raw content field. The contents are processed by several instances of the NodeExtractorStage, each populating a different field: title, description, keywords and article text. The stage parses the raw HTML and extracts text using an XPath expression. See algorithm 4.3.1 for the implementation.

After all content has been extracted, the documents are classified (see section 4.4). When data processing is done, the articles are output to Solr for indexing. The fields populated by the data processing stages and stored in the Solr index can be seen in table 4.3.1.

Table 4.3.1: Document fields

Field         Content
url           URL to article, also used as ID
article text  Plain text article text
title         Article title
keywords      Comma-delimited list of keywords
labels        Whitespace-delimited list of labels
description   Short description of article


Algorithm 4.3.1: ExtractNode(document, source, target)

    comment: Extracts text from the source field in document using path
             and puts or appends the result to the target field.
    local path, append

    content ← GetField(document, source)
    html ← TrimCdataTags(content)
    if IsHtml(html) then
        fragment ← ParseHtml(html)
        nodes ← EvaluateXpath(path, fragment)
        extractedText ← ""
        for each node ∈ nodes do
            nodeText ← TextFromNode(node)
            Append(extractedText, nodeText)
        if append then
            targetText ← GetField(document, target)
            extractedText ← Append(targetText, extractedText)
        SetField(document, target, extractedText)

4.4 Document classification

The document classification is done in steps by three separate Hydra stages: the machine learning classifier, the word matching classifier and finally classifier weighting.

4.4.1 Machine learning classifier

The machine learning document classifier is implemented using the Stanford Natural Language Processing library classifier (or just Stanford Classifier). The implementation is a stage in Hydra (StanfordClassifierStage) that is fed a set of training documents on startup. The classifier can be instantiated with a few di↵erent learning methods and structures. The one chosen here is LinearClassifier, with the learning method set to the default Quasi-Newton optimization. The default setting is used in the interest of easier development of the stage.

The Stanford Classifier is a multi-class classifier that can output a probability distribution over the classes for each document, but it cannot handle multi-labelled input during training. In order to handle multi-labelled training examples, a problem transformation method (PT5 as described by Tsoumakas and Katakis (2007)) is used. Each document is decomposed into one new document per label. For example, a five-label document is represented by five single-labelled documents with identical text content. These documents are then used to train the classifier. The training is done offline, and the trained classifier is stored as a serialized Java object alongside the stage.
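A sketch of the PT5 decomposition in Javascript (written for this text; the actual stage is implemented in Java):

    // PT5: decompose each multi-labelled training document into one
    // single-labelled copy per label, with identical text content.
    function decompose(trainingDocs) {
      var singleLabelled = [];
      trainingDocs.forEach(function (doc) {
        doc.labels.forEach(function (label) {
          singleLabelled.push({ text: doc.text, label: label });
        });
      });
      return singleLabelled;
    }

    // A two-label document yields two single-label training examples.
    console.log(decompose([{ text: "ont i halsen ...", labels: ["hals", "huvud"] }]));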

When running as a stage connected to Hydra, the stored classifier is read from file on startup. The classifier then takes an incoming document, reads the article text field, strips out non-word characters, lowercases the text and splits it into a list of words. The word list is then used to create a datum that represents the document. This datum stores each unique word as a feature that keeps track of the number of occurrences in that document. The implementation seen in algorithm 4.4.1 has some options for configuration. The feature threshold can be set in order to exclude features that appear fewer than x times in the training dataset. The most important setting is the labelling threshold, which decides what probability is needed to assign a label to a document.

Algorithm 4.4.1: StanfordClassify(document, source, target)

    comment: Assigns labels scoring above threshold by sending the
             source field in document to the trained classifier and
             putting these labels in the target field.
    local classifier, threshold

    labels ← ""
    text ← GetField(document, source)
    Clean(text)
    ToLowercase(text)
    datum ← CreateDatum(text)
    scores ← Score(classifier, datum)
    for each (class, score) ∈ scores do
        if score > threshold then
            AppendLabel(labels, class)
    SetField(document, target, labels)

4.4.2 Word matching classifier

A secondary classifier (WordMatchingClassifierStage) based on word matching was built to improve performance. This classifier uses a dictionary of synonyms for each class. A document is represented as a set of words, and a label is assigned if any of the synonyms are present in the set. No stemming or other linguistic techniques are employed, only a simple lowercasing of the words. See algorithm 4.4.2 for the implementation.


Algorithm 4.4.2: WordMatch(document, source, target)

    comment: Checks source in document for matches to words in
             dictionary and appends labels to target.
    local dictionary

    text ← GetField(document, source)
    words ← MakeWordSet(text)
    labels ← GetField(document, target)
    for each (class, synonyms) ∈ dictionary do
        if WordSetContainsSynonyms(words, synonyms) then
            AppendLabel(labels, class)
    SetField(document, target, labels)

4.4.3 Classification weighting

The labels assigned by the previous classifiers are finally examined by the ClassifierWeightingStage. If the classifiers agree on a label the label is included in the final document; otherwise, it’s ignored. Agreement is defined to be when both of the classifiers have assigned the label to a document. See algorithm 4.4.3 for the implementation.

Algorithm 4.4.3: Weighting(document, sources, target)

    comment: Retrieves all labels from sources in document and counts
             them, assigning any that occur in more than 50% of the
             sources to target.

    occurrences ← InitOccurrences()
    numSources ← Count(sources)
    for each src ∈ sources do
        lbls ← GetField(document, src)
        for each label ∈ lbls do
            AddOccurrence(occurrences, label)
    labels ← ""
    for each (label, count) ∈ occurrences do
        if count / numSources > 0.5 then
            Append(labels, label)
    SetField(document, target, labels)

4.4.4 Classifier evaluation

The multi-label measures detailed in section 2.3.1 were used to evaluate the performance of the classifier.


In order to find a good feature threshold a number of different settings were tried. Ten training-testing cycles were done for each setting and the measures averaged. The training dataset was divided into 90% used for training and 10% used for testing. The results of the tests, seen in table 4.4.2 and figure 4.4.2, show that performance is better at zero feature thresholding.

Table 4.4.2: Classifier performance results: Feature threshold

Measure   FT=0   FT=1   FT=3   FT=5   FT=7   Average
EMR       0.20   0.15   0.10   0.12   0.15   0.14
LF        0.51   0.48   0.40   0.44   0.43   0.45
RF        0.37   0.35   0.25   0.34   0.29   0.32
HL        0.08   0.08   0.06   0.07   0.08   0.07
LD        0.18   0.18   0.16   0.16   0.16   0.18

Figure 4.4.2: Classifier performance results: Feature threshold. [Line chart of EMR, LF, RF and N-HL (0-60%) over feature thresholds 0, 1, 3, 5 and 7.]
With the feature thresholding decided, tests were run to determine a labelling threshold. Each threshold value was tried ten times and the measures averaged. As seen in table 4.4.3, the labelling f-score peaks around a threshold of 0.1 to 0.05, while the retrieval f-score keeps increasing as the threshold is lowered.

To get a high labelling f-score without too much information loss, a threshold of 0.125 (or 12.5% confidence) seems a good pick.

The word matching classifier was tested on the same training data as the previous one. Only one run is presented, since only one dictionary file was used; all runs with the same data-dictionary combination will yield this result.


Table 4.4.3: Classifier performance results: Labelling threshold

Measure   0.5    0.4    0.3    0.2    0.15   0.125  0.1    0.075  0.05   0.025
EMR       0.18   0.11   0.18   0.17   0.09   0.12   0.13   0.10   0.15   0.07
LF        0.30   0.30   0.40   0.43   0.38   0.44   0.45   0.45   0.45   0.44
RF        0.21   0.19   0.23   0.27   0.27   0.28   0.26   0.28   0.32   0.34
HL        0.04   0.03   0.05   0.06   0.05   0.06   0.07   0.08   0.08   0.10
LD        0.18   0.16   0.18   0.17   0.16   0.16   0.18   0.16   0.16   0.18

Figure 4.4.3: Classifier performance results: Labelling threshold. [Line chart of EMR, LF, RF and N-HL (0-60%) over labelling thresholds from 0.5 down to 0.025.]

Finally, the result of weighing together the assignments made by the two classifiers was tested. This test was done by compiling a list of the documents and their final label, then comparing these with the training data with a script. Table 4.4.5 shows the improvement.

One thing to note is that the number of documents with assigned labels has decreased after the weighting. The original dataset contains circa 900 documents, but the labelled documents number only circa 300.

4.5 Live search

The live search implementation (livesearch.js) is done in Javascript using the jQuery library.


Table 4.4.4: Classifier performance results: Word Matching classifier

Measure   Result
EMR       0.06
LF        0.44
RF        0.48
HL        0.06
LD        0.10
N-HL      0.65

Table 4.4.5: Classifier performance results: Weighted classification

Measure   Result
EMR       0.35
LF        0.62
RF        0.68
HL        0.05
LD        0.10
N-HL      0.49

The prototype includes a web-based generic graphical user interface (GUI) with a search box, SERP and an auto-completion jQuery widget (see figure 4.6.4). Auto-complete suggestions and search query results are retrieved from two different web services running in Jellyfish: the autocomplete service and the search service.

Live search is enabled with the help of the search service. Whenever the user enters a query or interacts with the interface (e.g. selects an autocompletion suggestion or uses the visualization tool), an asynchronous search request is made. The request is transformed by the Jellyfish search service into a format fit for Solr, which responds with search results. Again, Jellyfish transforms the answer into a generic form that livesearch.js receives and sends to the result handler (see algorithm 4.6.1).

The script makes some attempts at minimizing zero-result pages by searching for the suggestions that Solr makes; if the response has no results, a new request is made with the suggestion as the query.
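The general shape of such a live search loop, sketched with jQuery (the /search endpoint and the response fields are placeholders written for this text, not the actual Jellyfish service contract):

    // Send an asynchronous search request on every keystroke. If the response
    // is empty but carries a spelling suggestion, retry once with it.
    function handleResult(response) {
      // Update the body-vis or the result list, cf. algorithm 4.6.1.
    }

    function search(query, retried) {
      $.getJSON("/search", { q: query }, function (response) {
        if (response.numFound === 0 && response.suggestion && !retried) {
          search(response.suggestion, true);   // avoid a zero-result page
        } else {
          handleResult(response);
        }
      });
    }

    $("#query").on("keyup", function () {
      search($(this).val(), false);
    });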

4.6 Visualization tool

4.6.1 Tool construction

The visualization tool is built with the Javascript library Data-Driven Documents (d3.js).

The tool consists of two parts: the body-vis and the result list. When a search response is received from the live search, one of these parts is updated. If the user has selected any facets or selected any pagination page other than the first, the result list is updated. Otherwise, the result list is hidden and the body-vis is updated.

Figure 4.6.4: Visualization tool overview. [Screenshot showing the autocomplete widget, the body-vis and the result list.]

Algorithm 4.6.1: HandleResult(result)

    comment: Takes a search result and adds it to the visualization,
             or updates the SERP if browsing or having facetsSelected.
    local facetsSelected, browsing

    if facetsSelected OR browsing then
        UpdateSERP(result)
    else
        ResetLayers()
        SetupListeners()
        data ← GetResultItems(result)
        AddData(data)

In the body-vis, an image of the human body is divided into interactive body parts that allow selection of facets. When the body-vis is updated, each body part is assigned its matching facet item from the response. The color of the body part depends linearly on the relative facet count, ranging from rgb(65, 0, 255) for the facets with the lowest count to rgb(255, 0, 0) for the facets with the highest count. In essence, the more red a body part is the more hits there are, and the more blue it is the fewer. If the facet count is zero, or if the facet is not included in the response, the corresponding body part is hidden to show the non-interactive background image. When body parts are selected, a search request is sent via the live search object, with the corresponding facets selected. Unselected body parts are rendered in grayscale.
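With d3.js, this linear color mapping can be expressed as a scale; a sketch under the assumption that each body part carries its facet count as bound data (d3 v3 API, illustrative values):

    // Map relative facet counts linearly onto the blue-to-red gradient
    // described above.
    var minCount = 0, maxCount = 40;   // hypothetical lowest and highest counts
    var color = d3.scale.linear()
        .domain([minCount, maxCount])
        .range(["rgb(65, 0, 255)", "rgb(255, 0, 0)"]);

    d3.selectAll(".body-part")
        .style("fill", function (d) { return color(d.count); });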

Selection of body parts is done either by clicking them or by making a box selection that encloses them. Disjointed areas of selection can be made by selecting multiple times. Clicking a selected body part deselects it and clicking on the background deselects all body parts. Interaction with the body-vis doesn’t change the focus, so the user can quickly inspect the results and go back to writing the query.

The graphics are handled by SVG rendered directly in the browser, with graphical changes performed by modifying the CSS properties of each body part.

4.6.2 Evaluation of the tool

The following three use cases highlight the function of the interface and the usage patterns it enables. For screenshots, see Appendix A.

Case A

Albert has felt stings of pain while swallowing the last few days. Wondering what he might have, he decides to search on that medical information site he saw yesterday. He starts typing his query, but doesn’t have to finish as the query completion quickly shows what he intended.

He notices there are a lot of hits in the lungs, and he had been worried it might be the flu. He selects the lungs and the results show instantly. Scanning the list, he finds that the first page doesn't really have anything that sounds familiar.

Recalling that the throat was red on the overview, Albert selects that as well and the result list is updated. The general articles about throat pain do not interest Albert very much; he wants to know about diseases. The article about tonsillitis catches his eye: he recognises the name and recalls that his sister had that a few years ago. Intrigued, he clicks through.

Case B

After the third week of feeling down and having diffuse pains in her joints, Brenna searches for some information on her symptoms. She immediately gets an overview of the results.

Unsure about where she wants to start out, she picks the body part with the most hits, the feet. Nothing interests her, so she selects the legs as well. She hasn't felt any particular problems with her knee, so she ignores the artros (osteoarthritis) articles and expands the search with the hip. Nothing more relevant is added. She deselects.


Noticing there are hits on the arm and shoulder, she makes a new selection. The general neck and shoulder pain article might be interesting! She clicks through.

Case C

Cecilia has an itching rash on one of her hands that she’s getting worried about. Figuring there could be something she can do about it herself, she does a search on rashes.

The overview shows nothing on the hands, so Cecilia tries with rashes on hands specifically - which gives only hits in the chest. Stumped, she reconsiders her search term. Thinking the itching might be more relevant, she tries with that instead.

The areas on the overview don’t seem like what she wanted, so she tries with eczema, thinking a more medical term could be good. This shows hits on the hand. Cecilia selects it and clicks the first article.


Chapter 5

Discussion

Here, the results are discussed in relation to the method and the theoretical background.

5.1 Document classification

With the aspects of classification described in section 2.3 in mind, the classes used in the final classification are rich, as they describe the content of the documents themselves. As for the classification method, it is offline and full-feature, as it operates on the article text in the data processing stage.

The behavior seen in table 4.4.2 is expected; retrieval performance should increase when more labels are assigned, just as exact matches should decrease with a lower threshold. The fact that both retrieval and Hamming loss increase but labelling f-score remains constant for the low threshold indicates that more labels are assigned, but many of them are wrong. The results suggest that using feature thresholding at all is a bad idea. However, this might be premature; for a larger dataset, limiting the number of features can be one approach to making machine learning viable. The results from the labelling threshold tests (table 4.4.3) show that on average half of the labels on a document are correct. This means most documents have at least one correct label. Conversely, only 20-30% of the original labels are assigned to the documents. Due to this, even fewer documents (10-18%) are perfectly labelled. The test does have a lower information loss than the other classifiers though, which suggests a majority of the original labels are preserved by the classifier.

Tsoumakas and Katakis (2007) tested other problem transformation methods for multi-label problems, besides the PT5 used in this thesis. They found on a similarly limited dataset (662 documents, 30% for training, label density 0.05) that simpler linear classifiers (such as Naive-Bayes) performed worse than non-linear classifiers (Hamming Loss of 0.057 compared to non-linear methods ranging from 0.004 to 0.046).


Linear classifiers have also been tested by Slonim and Tishby (2001), where a single-label linear classifier (Naive-Bayes) was trained with a comparatively larger dataset (25 documents per class). There, performance was measured in the percentage of test cases that were correctly classified, and found to range 38-69% for the benchmark dataset 20 Newsgroups. This performance metric is comparable to the exact match ratio, which reached 35% only after the weighted classification.

The performance of the word matching classifier seen in table 4.4.4 shows that it is better at assigning many labels than the machine learning classifier, but is otherwise the same or worse. The high Hamming Loss along with the low Exact Match Ratio suggests that the word matcher is liberal with assigning labels, reinforcing the notion that it assigns more labels than the machine learning classifier.

A lack of word stemming and lexical analysis in the word matching classifier also affects the performance negatively.

Table 5.1.1: Classifier performance results: Comparison

Measure   Machine Learning   Word Matching   Weighted
EMR       0.12               0.06            0.35
LF        0.44               0.44            0.62
RF        0.28               0.48            0.68
HL        0.06               0.06            0.05
LD        0.16               0.10            0.10
N-HL      0.39               0.65            0.49

Table 5.1.1 shows a comparison of the three stages of classification.

The weighted classification shows improvement across the board compared to the two classifiers; only the Hamming Loss of the machine learning classifier is better. This indicates that most of the assigned labels are correct, but there are a number of documents that don't get labelled. This is even more clear when looking at the number of documents that have labels assigned to them; the machine learning classifier assigns labels to practically all documents while the weighted classification contains only a third of the original documents. This is natural when considering how the weighting is done, since only the labels that both of the classifiers agree on are included. Again, the Hamming Loss of the weighted classification shows clearly that around half of the original labels are missed.

The variation in label density seen between the machine learning classifier and the other tests can be traced to the data; since only a portion of the documents can be used for testing (the rest being training data), the density increases.

The results suggest that a machine learning classifier's performance can be improved by combining its results with those of a simpler dictionary-based classifier. Doing so can, however, lead to a loss of information.


5.2 Visualization tool

All cases exhibit the visual information-seeking mantra: they start with an overview and then filter to get at individual items. The third step, details-on-demand, corresponds to clicking a result link and thus receiving the rest of the details. A possible improvement would be to display a description when the user hovers over an item, making details-on-demand part of the dynamic interface to a greater degree.

Case A (see section 4.6.2) shows the pre-attentive properties of the visualization. The user can easily differentiate the body parts by color and boundaries. The borders of zero-hit body parts are also hidden, making the ones with color changes stand out. The physical analog of the human body also gives the user an immediate interest in what many hits in one body part could mean, inviting exploration. The same thing can be seen in case B (see section 4.6.2), where the coloring of the body parts gives the user a hint of where to look first.

One risk shown in this case is that the user could get stuck thrashing on the overview, never realizing that the query needs to be reformulated to get the right results. The use of auto-complete and suggestions should mitigate that risk, however.

In case C, the move of iteration from the results to the overview is shown in effect. Cecilia can quickly see that the results of her query will not give her the articles she was looking for. Without even leaving the search box, she can change her query around and see new results. One problem due to the small dataset comes up in this case: a zero-result page. The live search mitigates some of the effect of reaching zero-result pages, though, by allowing Cecilia to quickly change her query.

As can be seen in the cases, the visualization tool spatializes the search domain and gives the user an idea of the coverage of a query. Relations between articles and the article content are not shown graphically, although the labels are shown as text under each item in the result list.


Chapter 6

Conclusion

This chapter presents the conclusions that can be drawn from the discussion and the results, and assesses whether the problem has been properly addressed.

6.1 Problem

Regarding the first question, the approach used in this thesis is to spatialize facets instead of individual documents. An initial idea was to spatialize a continuous dimension (with a physical analog) of the search domain. However, no such dimension was easily identified. Additionally, the articles in the data were not simple to place in a definite position, since many covered broad medical conditions that concerned different parts of the body. While the end classification did not assign multiple labels in most cases, the intent was to obtain such a classification. An interesting solution would have been to give documents a distribution of probabilities over the classes. The document could then be spatialized as an area.
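As an illustration of that idea only (no such mechanism exists in the prototype built here), a minimal sketch of turning a class-probability distribution into per-region weights for an area rendering could look as follows:

    def spatialize_as_area(class_probs, min_prob=0.05):
        """Map a document's class-probability distribution to region weights.

        class_probs: dict mapping body part -> probability (summing to ~1).
        Returns per-region weights that a renderer could use to draw the
        document as an area covering several body parts.
        """
        visible = {part: p for part, p in class_probs.items() if p >= min_prob}
        total = sum(visible.values()) or 1.0
        return {part: p / total for part, p in visible.items()}

    print(spatialize_as_area({"arm": 0.6, "hand": 0.3, "head": 0.02}))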

As described in the discussion (see section 5.1), the main problem for the solution used in this thesis is the size of the dataset. Simpler methods have performed better on larger datasets, and linear classifiers in general have problems when there are few examples to train on. There are of course improvements that can be made, for example by using expert-labelled training data. For a dataset of this size, it would also be feasible to manually tag all articles and disregard automated classification entirely. The improvement gained from combining trained and rule-based classifiers is also noteworthy; while it meant that fewer documents were classified, the labels that were assigned were more relevant. For the user, that trade-off could be worth it.

With a search domain that has facets which can serve as a spatialization, the approach used in this thesis can be an effective way of bringing the advantages of spatialized displays into search systems.

Moving on to the second question: with the facets directly represented in the spatialization, selection becomes a direct interaction with this data.


The approach makes selection intuitive and fast. The direct selection of facets is limited by the available screen space, however; introducing more body parts might make selection more difficult. One improvement could be to classify or structure the data hierarchically, so that selecting a larger body part also selects the parts within it, as sketched below.
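A minimal sketch of such hierarchical selection, with a purely hypothetical body-part hierarchy:

    # Hypothetical body-part hierarchy; selecting a part selects its subtree.
    HIERARCHY = {
        "arm": ["upper arm", "elbow", "forearm", "hand"],
        "hand": ["wrist", "fingers"],
        "leg": ["thigh", "knee", "lower leg", "foot"],
    }

    def expand_selection(part):
        """Return the part plus everything below it in the hierarchy."""
        selected = {part}
        for child in HIERARCHY.get(part, []):
            selected |= expand_selection(child)
        return selected

    print(expand_selection("arm"))
    # {'arm', 'upper arm', 'elbow', 'forearm', 'hand', 'wrist', 'fingers'}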

6.2 Future work

The hope of this author is that the merit of spatialized displays for search and retrieval systems is explored further. The use of graphical facets can be a good start for introducing more such displays. An interesting continuation of this work would be to incorporate its ideas in a customer project or an existing interface, and see whether this gives any measurable effect on usability. It would also be interesting to see how this sort of solution could be packaged and presented for a customer. Another continuation could be to expand the technical approach with a larger dataset, more sophisticated classification techniques and a more advanced visualization. Continuing the work on the combination of trained and rule-based classifiers, and measuring the effect more closely, could also be an approach.


Bibliography

Bonnel, N., Cotarmanac'h, A. and Morin, A. (2005). Meaning metaphor for visualizing search results, Proceedings Ninth International Conference on Information Visualisation, pp. 467–472.
URL: http://ieeexplore.ieee.org/lpdocs/epic03/wrapper.htm?arnumber=1509117

Fabrikant, S. I., Montello, D. R. and Mark, D. M. (2006). The distance-similarity metaphor in region-display spatializations, IEEE Comput. Graph. Appl. 26: 34–44.
URL: http://dx.doi.org/10.1109/MCG.2006.90

Healey, C. G. and Enns, J. T. (2011). Attention and visual memory in visualization and computer graphics, IEEE Transactions on Visualization and Computer Graphics. PrePrint.
URL: http://doi.ieeecomputersociety.org/10.1109/TVCG.2011.127

Hearst, M. A. (2009). Search User Interfaces, Cambridge University Press.

Jansen, B. J. and Rieh, S. Y. (2010). The seventeen theoretical constructs of information searching and information retrieval, Journal of the American Society for Information Science and Technology 61(8): 1517–1534.
URL: http://dx.doi.org/10.1002/asi.21358

Kules, B., Kustanowitz, J. and Shneiderman, B. (2006). Categorizing web search results into meaningful and stable categories using fast-feature techniques, Proceedings of the 6th ACM/IEEE-CS joint conference on Digital libraries, pp. 210–219.

Manning, C. D., Raghavan, P. and Schütze, H. (2008). Introduction to Information Retrieval, Cambridge University Press.
URL: http://nlp.stanford.edu/IR-book/html/htmledition/irbook.html

Manning, C. and Klein, D. (2003). Optimization, maxent models, and conditional estimation without magic, Tutorial at HLT-NAACL 2003 and ACL 2003. Retrieved from URL 2012-01-17.

Morville, P. and Callender, J. (2010). Search Patterns: Design for Discovery, 1st edn, O'Reilly Media, Inc.

Shneiderman, B. (1996). The eyes have it: A task by data type taxonomy for information visualizations, Proceedings of the 1996 IEEE Symposium on Visual Languages, IEEE Computer Society, Washington, DC, USA, pp. 336–.
URL: http://dl.acm.org/citation.cfm?id=832277.834354

Sidi, R. and Marques, D. (2011). The impact of visualization in search and discovery, Slides and recording of a talk at Microsoft Research, Redmond, WA, United States. Retrieved from URL 2011-09-12.
URL: http://research.microsoft.com/apps/video/dl.aspx?id=145179

Skupin, A. and Fabrikant, S. I. (2003). Spatialization methods: a cartographic research agenda for non-geographic information visualization, Cartography and Geographic Information Science 30: 95–119.

Slonim, N. and Tishby, N. (2001). The power of word clusters for text classification, in 23rd European Colloquium on Information Retrieval Research.

Sokolova, M. and Lapalme, G. (2009). A systematic analysis of performance measures for classification tasks, Inf. Process. Manage. 45: 427–437.
URL: http://dl.acm.org/citation.cfm?id=1542545.1542682

SolrRelevancyFaq (2012). Accessed 2012-01-30.
URL: http://wiki.apache.org/solr/SolrRelevancyFAQ

Spence, R. (2007). Information Visualization - Design for Interaction, 2nd edn, Pearson Education.

Tsoumakas, G. and Katakis, I. (2007). Multi-label classification: An overview, Int J Data Warehousing and Mining 2007: 1–13.

Wikipedia (2011a). Human anatomy. Accessed 2011-10-21.
URL: http://en.wikipedia.org/w/index.php?title=Human_anatomy&oldid=456015369#Regional_groups

Wikipedia (2011b). Lista över kroppsdelar. Accessed 2011-10-21.
URL: http://sv.wikipedia.org/w/index.php?title=Lista_%C3%B6ver_kroppsdelar&oldid=13920848

Wikipedia (2011c). Outline of human anatomy. Accessed 2011-10-21.
URL: http://en.wikipedia.org/w/index.php?

Appendix A

Visualization tool screenshots

Figure A.1: Case A screenshots


Figure A.2: Case B screenshots


Figure A.3: Case C screenshots



The publishers will keep this document online on the Internet - or its possible replacement - for a considerable time from the date of publication barring exceptional circumstances.

The online availability of the document implies a permanent permission for anyone to read, to download, to print out single copies for your own use and to use it unchanged for any non-commercial research and educational purpose. Subsequent transfers of copyright cannot revoke this permission. All other uses of the document are conditional on the consent of the copyright owner. The publisher has taken technical and administrative measures to assure authenticity, security and accessibility.

According to intellectual property law the author has the right to be mentioned when his/her work is accessed as described above and to be protected against infringement.

For additional information about the Linköping University Electronic Press and its procedures for publication and for assurance of document integrity, please refer to its WWW home page: http://www.ep.liu.se/
