
Changing a user's search experience by incorporating preferences of metadata

MIRAN ALI


Abstract

Implicit feedback is usually data that comes from users' clicks, search queries and text highlights. It exists in abundance, but it is riddled with noise and requires advanced algorithms to properly make good use of it. Several findings suggest that factors such as click-through data and reading time could be used to create user behaviour models in order to predict the users' information need.

This Master’s thesis aims to use click-through data and search queries together with heuristics to create a model that prioritises metadata-fields of the documents in order to predict the information need of a user. Simply put, implicit feedback will be used to improve the precision of a search engine. The Master’s thesis was carried out at Findwise AB - a search engine consultancy firm.

Documents from the benchmark dataset INEX were indexed into a search engine. Two different heuristics were proposed that increment the priority of different metadata-fields based on the users' search queries and clicks. It was assumed that the heuristics would be able to change the listing order of the search results. Evaluations were carried out for the two heuristics, and the unmodified search engine was used as the baseline for the experiment. The evaluations were based on simulating a user that issues queries and clicks on documents. The queries and documents, with manually tagged relevance, used in the evaluation came from a data set given by INEX. It was expected that the listing order would change in a way that was favourable for the user; the top-ranking results would be documents that truly were in the interest of the user.

The evaluations revealed that both the heuristics and the baseline behave erratically, and the metrics never converged to any specific mean-relevance. A statistical test revealed that there is no difference in accuracy between the heuristics and the baseline. These results mean that the proposed heuristics do not improve the precision of the search engine; several factors, such as the indexing of redundant metadata, could have been responsible for this outcome.


Changing a user's search experience by incorporating metadata preferences

Implicit feedback is usually data that comes from users' clicks, search queries and text highlights. This data exists in abundance, but contains too much noise and requires advanced algorithms for it to be exploited. Several findings suggest that factors such as click data and reading time can be used to create behaviour models in order to predict the user's information need.

This Master's thesis aims to use click data and search queries together with heuristics to create a model that prioritises metadata-fields in documents so that the user's information need can be predicted. In other words, implicit feedback is to be used to improve the precision of a search engine. The thesis work was carried out at Findwise AB - a consultancy firm specialising in search solutions.

Documents from the INEX benchmark dataset were indexed into a search engine. Two different heuristics were created to change the priority of the metadata-fields based on the users' search and click data. It was assumed that the heuristics would be able to change the order of the search results. Evaluations were carried out for both heuristics, and the unmodified search engine was used as the baseline of the experiment. The evaluations consisted of simulating a user who searches for queries and clicks on documents. These queries and documents, with manually tagged relevance data, came from a data set provided by INEX.

The evaluations showed that the behaviour of the heuristics and the baseline is random and erratic. Neither of the heuristics converges towards any specific mean-relevance. A statistical test shows that there is no significant difference in measured accuracy between the heuristics and the baseline. These results mean that the heuristics do not improve the precision of the search engine. This outcome may be due to several factors, such as the indexing of redundant metadata.


Contents

Acknowledgements

1 Introduction
  1.1 Problem Statement
  1.2 Contributions
  1.3 The Project Provider
  1.4 Report Outline

2 Background
  2.1 Ranked Retrieval
    2.1.1 Vector Space Model
  2.2 Solr
  2.3 Implicit Feedback
    2.3.1 Evaluating Differences
    2.3.2 Drawbacks of Implicit Feedback
    2.3.3 Different applications
    2.3.4 User Behaviour Models

3 Methodology
  3.1 Heuristics
    3.1.1 Field value boost
    3.1.2 Field value boost with dampening effect
  3.2 Gathering Data
  3.3 Evaluation
  3.4 Technical Configurations
    3.4.1 Search Engine
    3.4.2 Jellyfish Service
    3.4.3 Searching interface
    3.4.4 Database

4 Results
  4.1 Plots
  4.2 Analysis
    4.2.1 Test case 1
    4.2.2 Test case 2

5 Conclusions
  5.1 Research questions
  5.2 Future improvements


Acknowledgements

This Master’s thesis would not have been a possibility without the help of numerous academics, experts within the field of information retrieval, and loved ones.

I was preoccupied fighting potential dengue risks and extreme humidity in Singapore when I received joyful news from Simon Stenström about conducting my Master's thesis at Findwise AB. They gave me a chance to work with a field that was close to my heart and do so in a very modern and professional work environment. Evidently, I made the wise choice of accepting the offer.

Working at the Findwise office turned out to be an experience that gave me first-hand contact with developers that strived to be as helpful as possible for all of the thesis workers. This behaviour was highly prevalent in my supervisor at Findwise - Meidi Tõnisson. Not only was it possible for me to send my inquiries to her at almost any time, but as a former thesis worker at Findwise herself, Meidi was also able to help me with the organisation of the project as well as give me a skill set to manage some of the technical components of development. Martin Nycander and Simon Stenström were responsible for holding an internal lecture at Findwise that was very helpful, since it covered how Findwise's own products are used and how they could be utilised in the project. I would also like to thank the rest of the employees for giving me several laughs at the lunch table and also for the times they reached out and gave me help with annoying bugs in software.

I feel extremely privileged to have had Hedvig Kjellström as my academic supervisor. She had a constant presence during the course of the Master's thesis through e-mail contact and insisted that I should meet her on a regular basis in order to receive feedback. Integral parts of this report would not have been possible to conceive if not for Hedvig's expertise on the subject.

Finally, I would like to thank my wife Sivan. Truly the light of my life and my kindred spirit, she has always been like a pillar that supports me during tough times and she always has time to listen to me rambling on about my Master’s thesis. Thank you all for your everlasting support and patience!


Chapter 1

Introduction

The field of information retrieval has gone from a primarily academic discipline to being the most common way for people to access information on a daily basis [1]. The most common application of IR is the multitude of search engines that can be found online. While IR started off as a means for information professionals to find relevant data, scientific innovations, superb engineering and a massive price drop in computer hardware have led to search engines that are capable of searching through several million pages with sub-second response times for hundreds of millions of searches every day [1].

While the major search engines are doing a fantastic job at searching through millions of indexed pages, there is always room for improvement. Joachims and Radlinski [2] touch upon this subject and mention that Implicit Feedback (IF) is important if we are to advance beyond the one-ranking-fits-all approach. An application of IF means that a user’s behaviour is used to adapt the search engine to fit that particular user. One way of going beyond the one-ranking-fits-all approach is by incorporating user behaviour models into the search engine [3, 4, 5]. The authors in the aforementioned citations assume that users’ informational needs will change over time and, thus, create user behaviour models that are specific for certain sessions; the user wants to know more about the programming language Java today by typing the query ”Java”, but will ask for information about Java Island tomorrow with the same query. Search engines are to quickly adapt to this kind of behaviour and the aforementioned models help in abstracting this.

This report will delve into subjects like ranked retrieval, implicit feedback and user behaviour models. All of these are deeply intertwined, as the reader of this report will notice. Different ways to permute the ranking of documents given by a search engine will be proposed and implemented. Evaluation of the heuristics will be performed and the data will go through statistical methods to deduce if the proposed heuristics improved the user's search experience.

A lot has been done in the world of IR and the main part of this report focuses on customizing the ranking of a state-of-the-art search engine through different heuristics. The suggested way of performing this customization will stand as the academic contribution of this report.

1.1 Problem Statement

The purposes of this Master's thesis are to investigate, implement and evaluate a search engine that uses click-through data combined with corresponding search queries to prioritise certain fields¹ in the search results. This is done in order to change the ranking of search results and make sought documents appear further up in the list.

The author’s hypothesis for this project is that

Hypothesis 1: the proposed solutions will be able to predict the users’ sought documents significantly better than the unmodified search engine.

The research questions that will guide the author throughout this Master’s thesis are:

1. Do the evaluations of the search engine heuristics indicate that the author's hypothesis might be accurate?

2. What are the desirable/undesirable effects of the solution?

3. Is it reasonable to assume that the proposed solutions actually improve the users' search experience?

1.2 Contributions

This project’s contribution consists of a new way to combine different sources of implicit feedback, namely search queries and click data, into adapting a search engine to a user’s searching patterns.

1.3 The Project Provider

Findwise AB is a consultancy agency that focuses on, among other things, enterprise search [6]. The office is located in the center of Stockholm. Findwise AB has, in the role of project provider, contributed with an abundance of technical components and immense guidance. They have encouraged the thesis workers to use practices such as Scrum, in order to guarantee regular deliveries over a short period of time. It is possible that the findings in this report might be used in an actual Findwise project if they were to create custom search solutions.

¹ The words fields, metadata and metadata-fields will be used interchangeably throughout this report.


1.4 Report Outline

The rest of the document is organized as follows. Chapter 2 covers the previous work made in the field of ranked retrieval and implicit feedback. Chapter 3 covers the methodology and the approach taken to solve the problems at hand and how the experiments were set up for training and testing the heuristics. Chapter 4 illustrates and discusses the results and, finally, Chapter 5 contains the concluding remarks of the Master’s thesis.


Chapter 2

Background

The following sections cover the different techniques that were used for the project and also what previous researchers have accomplished in the field of information retrieval.

2.1 Ranked Retrieval

Early Information Retrieval (IR) systems relied on boolean retrieval. This meant that users would satisfy their information need by constructing complex queries that consisted of boolean ANDs, ORs and NOTs. The drawbacks of these systems were that it was hard for a user to construct queries when their information need was extremely specific and there was no notion of document ranking, according to Singhal [7]. Singhal continues by saying that there are certain power users that still use boolean retrieval (probably librarians), but most users expect IR systems to perform ranked retrieval.

The basis of IR systems with ranked retrieval is to sort the list of found documents by how useful they are with regard to the given query. Documents are usually assigned numeric scores and sorted on these scores in descending order.

Different models exist to shape an IR system that is capable of ranked retrieval. Some of these models are the vector space model and different probabilistic models. But the one that is relevant for this Master’s thesis is the vector space model, since the search engine used in the experiments utilised this kind of model.

2.1.1 Vector Space Model

The vector space model is, according to Manning et al. [1], fundamental to a host of IR operations (e.g. cosine similarity). There are several aspects that comprise the vector space model such as calculation of similarities, dot products and vectors.


Vector

V(d) is the vector that represents document d, where there is one component for each term in the document. The components are comprised of a term and its corresponding weighting score. This score is usually dependent on factors such as the term's number of occurrences in the document and the number of documents that contain this term. A very popular weighting score is the tf-idf score (term frequency - inverse document frequency). The dash between tf and idf should not be mistaken for a subtraction sign; the tf-idf score is calculated by multiplying the factors tf and idf. While tf_{t,d} represents the term frequency of a term t in document d, idf_t represents the inverse document frequency for a term t, or:

    idf_t = log(N / df_t)        (2.1)

The reason for using inverse document frequency in the weighting score is because words are not of equal significance. The few documents that contain the term t1 are to be given a boost when a user searches for the query t1. Similarly, the humongous amount of documents that contain the term t2 should not be given a boost to their total score if a user searches for t2.

Using (2.1), the tf-idf formula can be written as:

    tf-idf_{t,d} = tf_{t,d} * log(N / df_t)        (2.2)
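As a concrete illustration of Equations 2.1 and 2.2, a minimal Python sketch is given below; the toy corpus and the tf_idf helper are illustrative assumptions and not code from the thesis implementation.

import math
from collections import Counter

def tf_idf(term, document, corpus):
    # tf-idf for `term` in `document`, given `corpus` as a list of tokenised documents
    tf = Counter(document)[term]                      # term frequency tf_{t,d}
    df = sum(1 for doc in corpus if term in doc)      # document frequency df_t
    if tf == 0 or df == 0:
        return 0.0
    idf = math.log(len(corpus) / df)                  # Equation 2.1
    return tf * idf                                   # Equation 2.2

corpus = [["solr", "search", "engine"],
          ["search", "ranking", "metadata"],
          ["metadata", "fields", "boost"]]
print(tf_idf("metadata", corpus[1], corpus))          # log(3/2), roughly 0.405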

Weigel et al. shed light on two issues when tf-idf scoring is applied to structured data: (1) which units of the data are to be treated as coherent pieces of information, and (2) the structural similarity problem, which concerns the issue of quantifying the distance of a document to a given query [8]. Both of these issues are somewhat relevant to this thesis, since the data used for the experiments consists of structured data.

A set of documents can be seen as a set of vectors in a vector space, where there is one axis for each term. It is important to note that this representation makes the ordering of terms completely insignificant. This means that the sentence 'Sivan is brighter than Miran' is the equivalent of 'Miran is brighter than Sivan' once both sentences are converted to vectors; semantics are not preserved.

Similarity

A standard way of calculating the similarity between two documents is to use the cosine similarity of the vector representations of the two documents:

    sim(d1, d2) = (V(d1) · V(d2)) / (|V(d1)| |V(d2)|)        (2.3)

The numerator represents the dot product of the vectors that represent documents d1 and d2 and the denominator is the product of the vectors' Euclidean lengths.


The cosine similarity is a central part of this Master's thesis, as it is used to calculate the similarity between the query and different parts of the documents' structure. Queries can also be represented as vectors in the same manner as the document vectors. This means that (2.3) can be used to calculate the cosine similarity between the vector that represents document d and query q:

    sim(q, d) = (V(q) · V(d)) / (|V(q)| |V(d)|)        (2.4)

Equation 2.4 was used by the author when a measurement of the similarity was to be extracted between a query and one of the metadata-fields in the document.
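The cosine similarity of Equation 2.4 can be sketched in a few lines of Python; for brevity the sketch uses raw term counts rather than tf-idf weights as vector components, which is a simplifying assumption rather than the thesis configuration.

import math
from collections import Counter

def cosine_similarity(text_a, text_b):
    # cosine similarity between two whitespace-tokenised strings (Equation 2.4),
    # using raw term counts as vector components
    va, vb = Counter(text_a.lower().split()), Counter(text_b.lower().split())
    dot = sum(va[t] * vb[t] for t in va)                          # V(q) . V(d)
    norm_a = math.sqrt(sum(c * c for c in va.values()))
    norm_b = math.sqrt(sum(c * c for c in vb.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0  # |V(q)| |V(d)|

print(cosine_similarity("harry potter books", "Harry Potter and the Goblet of Fire"))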

2.2 Solr

The Solr search engine played a very central role in the project. It was, together with its accompanying database, used for indexing data and searching through this data with sub-second response times. Below is a description of the most important aspect of the Solr search engine - the query fields (qf) parameter.

The qf parameter enabled the author to search for a query among a certain set of metadata-fields. In addition to this, the qf parameter also allowed the author to weight the fields differently. This means that the parameter can be used to make a query match in one field more significant than a query match in another field; fields’ importance can be differently weighted and this was exactly the kind of behaviour that was pursued during the course of the project.

The qf-string has the following structure:

    qf = field_1^value_1 + field_2^value_2 + ... + field_n^value_n        (2.5)

The + in Equation 2.5 should not be confused with the addition operator. This is merely a way for Solr to append additional data to a parameter.

The official term for what happens in Equation 2.5 is referred to as boosting. As stated previously, this means that more emphasis is put on boosted fields of the documents [9]. In rough terms, the implicit feedback that is acquired from a user is meant to be used to update the values that will be used in the qf-parameter; user behaviour will dictate how the fields' importance is weighted.
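A minimal sketch of how such a qf-string could be assembled and URL-encoded is given below, assuming the eDisMax query parser; the field names and boost values are hypothetical. It also shows why the '+' in Equation 2.5 is an encoding artefact rather than addition.

import urllib.parse

def build_qf(field_boosts):
    # render field boosts as a Solr qf value, e.g. 'title^2.40 author^1.30'
    return " ".join(f"{field}^{value:.2f}" for field, value in field_boosts.items())

boosts = {"title": 2.4, "author": 1.3, "tags": 1.0}          # hypothetical boost values
params = {"q": "harry potter", "defType": "edismax", "qf": build_qf(boosts)}
print("/select?" + urllib.parse.urlencode(params))
# The qf fields are separated by spaces, which become '+' once URL-encoded;
# the '+' in Equation 2.5 is exactly this encoding artefact, not addition.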

2.3 Implicit Feedback

The majority of this section contains descriptions of different implementations of implicit feedback (IF). Apart from being used in different Recommender Systems (RS) and video retrieval systems, IF is almost exclusively used for document retrieval systems and the mentioned applications will mostly touch upon this field of study.


2.3.1 Evaluating Differences

While there are many proceedings and reports discussing IF, most of them have spent much time evaluating the applications that use IF. This means that the benefits of IF are not a given, and authors take great caution when hypothesising positive correlations between document usability and the use of IF. Some articles in the bibliography contain a section dedicated to determining any correlation between the use of IF and search engine improvement or other related metrics.

This is highly present in a report written by Kelly and Belkin [10]. They hypothesized that scrolling frequency, reading time and document interaction were great sources of IF and would help users find documents they thought were relevant. Statistical methods showed that there was no statistically significant difference between relevant and non-relevant documents when using any of the three aforementioned sources.

2.3.2 Drawbacks of Implicit Feedback

Joachims and Radlinski [2] claimed that IF is used in most search engines and IF is important if search engines are to advance beyond the one-ranking-fits-all approach, but IF is also noisy and biased, making simple learning strategies doomed to fail.

The same authors mentioned that IF has a problem with trust-bias - users rely heavily on the ranking given by the search engine. Joachims et al. [11] formed several strategies that incorporate IF and conclude that even the best strategies were less consistent than explicit feedback results, but they mentioned that the virtually infinite amount of implicit feedback combined with good machine learning algorithms should suffice in closing the performance gap between the two types of feedbacks.

IF cannot be negative and is interpreted on a relative scale [12]. The abundance of IF has the downside that it is highly noisy [13].

2.3.3 Different applications

As mentioned in the beginning of the section, IF can also be used in Recommender Systems (RS). The use of RS is not uncommon for services like Netflix and Amazon [14]. Jawaheer et al. [12] used both explicit and implicit feedback to create an RS in a music context. They mentioned that users are usually reluctant to give explicit feedback because of the cognitive effort it requires. In this context, IF is considered to be the number of times a person has listened to a certain song and they devised a method that combines implicit and explicit feedback to calculate ratios of interest and these ratios decide what artist should be recommended to the user.

It is important to note that unlike explicit feedback, IF only gives a relative scale in this case [12]. While a user explicitly down- or upvotes a song to express that he/she dislikes or likes the song, IF can only be interpreted in relative terms. A user listening to a song 10 times does not give much information, but if the same user listened to another song 100 times, one could conclude that the latter song is preferred over the former one. This is an example of IF.

Hopfgartner and Jose [15] evaluated implicit feedback models for their video retrieval engine. Through defining different user behaviour models, they knew what kind of IF-data to expect from users if they were to deploy the search engine and include it in user studies, but the user behaviour models were created to form a simulator that was to access data according to the different models.

Another popular application of IF is in the form of Collaborative Filtering (CF), which lets a majority of users' preferences dictate what the common user likes and dislikes. The drawback of CF lies in something called the cold start problem [13]; in the beginning, there is relatively little data about each user, which results in an inability to draw inferences to recommend items to users [16].

Zigoris and Zhang [17] used Bayesian models together with explicit and implicit feedback to solve the aforementioned cold start problem. The authors claim that the field of IR is moving into a more complex environment with user-centered or personalized adaptive information retrieval. In agreement with the other authors, Zigoris and Zhang wrote that users do not like to provide explicit data, which is why IF receives much attention; user data can be gathered without users explicitly telling the search engine about their information need.

Kim et al. [18] mentioned that explicit feedback is not practical, since users are unwilling to give such feedback. Just as Kelly and Belkin, Kim et al. believed that reading time would not help them in finding which documents users consider relevant. Even though these were Kim et al.'s initial beliefs, they actually found a significant difference in reading time between non-relevant and relevant documents [18]. The author of this Master's thesis believes that the difference in results is due to the fact that the test subjects in Kim's report had been given a document set that they were familiar with. This was not the case with the test subjects in Kelly and Belkin's report [10], and these authors concluded that the lack of a statistically significant difference was mainly because the test subjects had a hard time finding relevant articles, since they were not experts in the field covered by the document set.

Although the most common disadvantage of IF is the fact that it is so noisy, it comes with some great advantages as well. Yuanhua et al. [19] discussed the state of search engines and wrote that they are not good for personalized searches. IF is used to remedy this problem, and they argued that the advantage of IF is that users constantly provide search engines with IF.

There are many factors of user behaviour that can be used as a source of IF. One example is eye movement. Buscher et al. [20] used eye tracking data to analyse how users read documents, what parts were read and how these parts were read. The data was then used for personalization of future search results.

Another factor that can be used as IF is text selection. White and Buscher [21] analysed the text that users highlight and compared this highlighted text to the search query. The calculated similarities were then to be used for the same purpose as stated in the previous article - information retrieval personalization.


2.3.4 User Behaviour Models

Agichtein et al. [3] incorporated user behaviour data into the ranking process and showed that the accuracy of a modern search engine can be significantly improved. Implicit feedback was modeled into several features like click-through features, browsing features, time on page and query-text features. Their tests showed that their ranking algorithms significantly improved their accuracies when incorporating their proposed implicit feedback model.

Shen et al. [4] identified a major deficiency in existing retrieval systems - they lack user modelling and are not adaptive to individual users. One example that was mentioned was that different users might have the same query (e.g. ”Java”) to search for different information (e.g. the Java island in Indonesia or the programming language). They also mentioned that the information need of a user might change over time. The same user that queried ”Java” might have been referring to the island the first time around, but was later referring to the programming language.

Most common IF models take advantage of the query to create a user behaviour model, but since queries are usually short, the model becomes very impoverished [4]. The purpose of Shen et al.'s paper was to create and update a user model based on implicit feedback and use the model to improve the accuracy of ad hoc retrieval, which concerns short-term information needs. The implicit user models created by the authors are highly reminiscent of the strategies that Joachims et al. [11] composed. Among many things, Shen's model [4] accounted for viewed search results that were ignored by the user and re-ranked the search results that were yet to be presented to the user. Evaluation of search results for their user behaviour model was done by calculating precision and recall for documents at certain listings. Both precision and recall were better than Google's search engine for the top documents in the case of 32 query topics.

Liu et al. [22] created a user behaviour model based on the users' dwell time on a webpage. By extracting multiple data points of dwell time coupled with how the user reacted, it was possible for the authors to create bell-curve distributions that could predict the most probable action of the user.

One of the oldest articles in the literature study is Oard and Kim’s [5] article on using implicit feedback in recommender systems. Even in this article, a user behaviour model was proposed to abstract the informational need of the user and how it would implicitly help the system.


Chapter 3

Methodology

This chapter covers the different methods that were used in order to solve the problem at hand. Subjects such as database configurations, heuristics and tests are brought up in this chapter.

One of the main challenges of this Master's thesis was to model and implement the heuristics that were used for ranking document results. When this was accomplished, the author conducted the proposed evaluation to train and test the ranking algorithm with the designed heuristics.

3.1 Heuristics

Two different heuristics were implemented and below are the full descriptions of both heuristics.

3.1.1 Field value boost

This heuristic is the fundamental value-boost heuristic that is the cornerstone of the Master’s thesis. It works by keeping track of decimal values that work as boost-values for the aforementioned Solr qf-parameter. All fields have a decimal value of 1.0 in the beginning, but this changes depending on what kind of metadata the user prefers. If the user were to click on links where the title of these links are highly similar to the user’s query, the title-field is associated with a higher decimal value and this leads to future search results where the title of the top results are highly similar to the query. Below is the formula for calculating the new value for a field, when it has been clicked:

    best_field := choose_most_similar_field(query, chosen_document.fields)        (3.1)

    best_field.value := best_field.value + best_field.value / sum_of_all_field_values        (3.2)

Note that best_field.value on the right-hand side of the second equation refers to the current decimal value of best_field. This current value is overwritten after the second equation is executed.

Furthermore, Equation 3.1 can be written in more detail, since it is meant to calculate which part of the document is the most similar when compared to the user's search query:

Data: chosen_document, user_query
Result: best_field

best_field := NIL;
best_score := NIL;
for each current_field in chosen_document do
    score := cosine_similarity(current_field, user_query);
    if score > best_score then
        best_score := score;
        best_field := current_field;
    end
end
return best_field;

Algorithm 1: Choosing the most similar field
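A minimal Python sketch of Algorithm 1 together with the value update follows, assuming the reconstruction of Equation 3.2 above; the cosine_similarity helper mirrors the sketch in Section 2.1.1, and the field names and texts are hypothetical examples.

import math
from collections import Counter

def cosine_similarity(text_a, text_b):
    # as in the Section 2.1.1 sketch: raw term counts as vector components
    va, vb = Counter(text_a.lower().split()), Counter(text_b.lower().split())
    dot = sum(va[t] * vb[t] for t in va)
    norm_a = math.sqrt(sum(c * c for c in va.values()))
    norm_b = math.sqrt(sum(c * c for c in vb.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0

def choose_most_similar_field(query, fields):
    # Algorithm 1: pick the metadata-field whose text is most similar to the query
    best_field, best_score = None, -1.0
    for name, text in fields.items():
        score = cosine_similarity(text, query)
        if score > best_score:
            best_field, best_score = name, score
    return best_field

def apply_field_value_boost(boosts, query, clicked_fields):
    # Equations 3.1 and 3.2: boost the most similar field by its share of the total
    best = choose_most_similar_field(query, clicked_fields)
    boosts[best] += boosts[best] / sum(boosts.values())
    return best

boosts = {"title": 1.0, "author": 1.0, "tags": 1.0}
fields = {"title": "Harry Potter and the Goblet of Fire",
          "author": "J. K. Rowling",
          "tags": "fantasy wizards boarding school"}
apply_field_value_boost(boosts, "harry potter goblet of fire", fields)
print(boosts)   # the title boost grows, the other fields stay at 1.0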

3.1.2 Field value boost with dampening effect

This heuristic is fundamentally the same as the previous one, but it has another feature to it. This feature is that the value that is to be added to the old field value has a dampening constant. This dampening constant introduces a behaviour where several updates on the same field are not as aggressive as in the previous heuristic. Below is the formula for applying the dampening effect to the value of the best field:

    best_field.value := best_field.value + (best_field.value / sum_of_all_field_values) * dampening

where dampening can be written as:

    dampening = 1 / (field.number_of_times_updated)

The expression above implies that increments of a field's value will diminish when the field has been updated many times. Without the dampening, incrementing the value of a field that has not been of much interest, until now, is not that significant. This is because the formulas take the value of a field and divide it with the sum of all fields' values. This fraction becomes small when the total sum (the denominator) is large and the current value of the selected field (the numerator) is small because of stated reasons. The dampening tries to remedy the problem of aggressively increasing values and the following chapter shows if the dampening introduces any significantly different behaviours.
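A short sketch of the dampened update, under the same assumptions as the previous sketch; the per-field update counter is a hypothetical bookkeeping detail.

from collections import defaultdict

update_counts = defaultdict(int)   # how many times each field has been boosted so far

def apply_dampened_boost(boosts, best_field):
    # dampened variant: the increment shrinks with the number of previous updates
    update_counts[best_field] += 1
    dampening = 1.0 / update_counts[best_field]
    boosts[best_field] += (boosts[best_field] / sum(boosts.values())) * dampening

boosts = {"title": 1.0, "author": 1.0, "tags": 1.0}
for _ in range(3):                 # three consecutive clicks favouring the title field
    apply_dampened_boost(boosts, "title")
    print(round(boosts["title"], 4))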

3.2 Gathering Data

Just as pointed out by the literature study, it is difficult to simulate users' information need [10]. Therefore, it is beneficial for the quality of data if it is possible to procure data where genuine information need is guaranteed. This kind of data was given by the INEX forum in the form of a qrel file accompanied with a file that contains numerous topics and queries.

Figure 3.1. A snippet from the qrel file

The structure of the qrel file is easy to follow and is illustrated in Figure 3.1. The first column represents the IDs of a topic. The topics are extracted from a topic file and an example topic is illustrated in Figure 3.3. The only piece of information in the topic file that is relevant to this Master's thesis is the query found inside each topic, which is found with the help of the topic id. These queries are actual queries written by actual users of the LibraryThing service (LT). The 'Q0' string was disregarded for the experiments.

The next relevant column of data represents the books’ IDs in the LT system. To be able to make use of these book IDs, one had to first convert them to Amazon ISBNs, which was possible since the author was given an LT-Amazon translation table. A snippet of the translation table is illustrated in Figure 3.2.

The last piece of information is listed after the book id. This column represents the relevance scores for the books. The book with the highest score for a certain topic is considered to be the most relevant book in regards to the query that the topic represents.
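A minimal parser for a qrel file of this shape might look as follows, assuming the four whitespace-separated columns described above (topic id, the ignored 'Q0' string, LT book id, relevance); the file name and topic id in the commented usage are hypothetical.

from collections import defaultdict

def parse_qrel(path):
    # columns assumed: topic id, 'Q0', LT book id, relevance score
    relevance = defaultdict(dict)
    with open(path) as handle:
        for line in handle:
            topic_id, _q0, book_id, score = line.split()
            relevance[topic_id][book_id] = int(score)
    return relevance

# relevance = parse_qrel("inex.qrel")        # hypothetical file name
# print(relevance["99999"])                  # hypothetical topic id -> {book id: score}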


Figure 3.2. A snippet from the translation table between ISBNs and LT ids, respectively. Note that one LT id is usually mapped to several ISBNs.

Figure 3.3. A snippet from the topic file

The qrel file was used as a list of queries. These queries were used as input to the search engine and also as a way to find out what books were to be inserted in the database. The qrel file also played an integral part during the evaluation.

Another aspect of data gathering is the structure of the books. The set of books given by the INEX forum consisted of 2.8 million books with metadata from Amazon and LibraryThing. The metadata from Amazon was formal and contains information about title, author, publisher and so forth. The metadata from LibraryThing was user-contributed and usually contained information about awards, book characters and places [23].


Figure 3.4. A snippet from a file that is formatted according to the Solr schema

Figure 3.4 depicts a snippet of a book that has been formatted in a way that made it compatible for upload to the search engine's database. Certain numeric and trivial fields were omitted from all of the books. Figure 3.4 is just a simple example of what the fundamental structure of a formatted document looked like.

3.3 Evaluation

The evaluation was essential for this Master’s thesis. The evaluation was used to find out if the heuristics significantly affect the behaviour of the search engine. The following paragraphs should contain enough information for the reader to be able to reproduce the stages of the evaluation.

Just below this paragraph, the reader can observe the pseudo-code that describes the procedures of the evaluation. As can be seen, not only was the evaluation used to find the benefits of the heuristics, but it was also used to actually train the heuristic by simulating the behaviour of a user.


for n := 1 to 100 do
    shuffle set of queries;
    for each q in queries do
        input q in search engine;
        couple search results with their relevance scores from qrel file;
        calculate mean-relevance@10;
        "click" on search result with highest maximum of relevance;
        update field values with regards to q and "clicked" link;
    end
    extract mean-relevance for latest query;
end

Algorithm 2: Evaluation algorithm
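A skeleton of Algorithm 2 in Python could look as follows; search, qrel and update_boosts are assumed helpers standing in for the Solr query, the parsed qrel file and one of the heuristics, respectively, and are not the thesis code.

import random

def run_evaluation(queries, search, qrel, update_boosts, shuffles=100):
    # `search(q)` returns a ranked list of book ids, `qrel[q]` maps book ids to
    # relevance scores for query q, and `update_boosts(q, book_id)` applies one
    # of the heuristics; all three are assumed helpers.
    final_scores = []
    for _ in range(shuffles):
        random.shuffle(queries)
        mr_at_10 = 0.0
        for q in queries:
            top10 = search(q)[:10]
            scores = [qrel.get(q, {}).get(doc, 0) for doc in top10]
            mr_at_10 = sum(scores) / 10.0                   # mean-relevance@10
            if any(scores):
                clicked = top10[scores.index(max(scores))]  # Hypothesis 2 click
                update_boosts(q, clicked)
        final_scores.append(mr_at_10)   # MR@10 after the last query of this shuffle
    return final_scores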

A thorough description of every part in Algorithm 2 is given below, along with some omitted parts such as plotting the results.

Shuffling the query set   The reason why the query set was shuffled was that the author needed to exclude the risk of the results being dependent on the queries' order in the set.

Using relevance score from qrel file   Once a query was used as input to the search engine, search results were acquired. The qrel file played a very important role in this stage; the documents that were considered relevant to the given search query were given relevance scores. These relevance scores were then coupled with the search results. For instance, if one searched for the query "bible" and also had access to a qrel file where it explicitly said that the book with ISBN "1585161519" has a relevance score of 10 in relation to this query, this score would be transferred from the qrel file and coupled with the corresponding search result. An illustration, unrelated to the aforementioned example, of how scores were used can be seen below. Note that the scores in Figure 3.5 are not actually visible to the end-user, but they are visible in this picture for the purpose of clarification.

Figure 3.5. Picture showing the search GUI in action. Also illustrates the example of how scores are used and what link the user clicked on.

User simulation   One of the crucial parts of the evaluation was that it was also responsible for simulating the behaviour of a user in order to choose a document among the search results. The choice of a document was considered a click and was then used, together with the query, to let the heuristic update the field values. The behaviour of the simulated user was based on the following hypothesis:

Hypothesis 2: the user will click on the document that has the highest maximum of relevance score among the top 10 search results.

If one were to apply this hypothesis to Figure 3.5, it would mean that the user would click on the second search result. The reason for this choice is that the chosen document is among the top 10 search results, the score of 4 is the maximum of the available relevance scores, and this result has the highest position among all results with a score of 4.

Measure of performance   In order to get a perception of how well the heuristic was performing for each query, a measure of performance needed to be proposed. The one that the author proposed is called Mean-Relevance@10 (MR@10). The idea behind this measurement is mostly explained by its name:

    MR@10 = (1/10) * Σ_{i=1..10} document_i.relevance        (3.3)

In other words, MR@10 is simply the mean relevance of the first 10 search results. The relevance used in Equation 3.3 refers to the relevance that was extracted from the qrel file. If a relevance score is not available for a certain search result, its relevance score is simply set to 0.
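Equation 3.3 translates into a one-line helper; the example result list and relevance scores below are hypothetical.

def mr_at_10(result_ids, relevance_for_query):
    # Equation 3.3: mean relevance of the first 10 results; missing entries count as 0
    top10 = result_ids[:10]
    return sum(relevance_for_query.get(doc, 0) for doc in top10) / 10.0

print(mr_at_10(["b1", "b2", "b3"], {"b2": 4, "b9": 10}))   # -> 0.4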

Plotting the results   For the reader to make sense of the evaluation results, the data was visualized into a number of charts.

For every shuffle of the query set, MR@10 was calculated for every query that was handed to the search engine. This allows the author to plot the MR@10-score for every available query, enabling the reader to see how the MR@10-scores varied after every field value update for all of the heuristics.

The aforementioned plotting procedure does not lead to the final result, however. The interesting data was created by the shuffles of the query set. At the end of running all of the queries, the latest calculated MR@10-score was stored as a unique value for that specific query order. This means that after 100 shuffles were performed, the author had acquired 100 unique MR@10-scores. These scores were plotted as bell curve plots, which were constructed for all of the heuristics. Using these bell curve plots and their accompanying data, measurements such as the standard deviation were calculated. Finally, the author performed a paired t-test to find out if the heuristic led to a behaviour that was significantly different from the regular baseline search engine.

3.4 Technical Configurations

The following text goes through the technical components in the system and reveals what they were used for and how they were configured.

3.4.1 Search Engine

The search engine used for this Master's thesis is called Apache Solr. Solr is one of the products that resulted from the Apache Lucene project and offers features like full-text search, hit highlighting and so forth. Full-text search is the most relevant feature for the Master's thesis.

The differences in configuration between the default Solr product and the one that was used for the Master's thesis are not many. The major, and important, differences lie in the schema file, where the main changes concern which fields are to be considered. The documents that were added to the search engine's database needed to mostly comply with the fields listed in the schema. A snippet of the fields contained in the schema can be found below.


Figure 3.6. A snippet of Solr’s schema configuration.

Figure 3.6 shows a few of the fields that can exist in the documents that were submitted to the search engine’s database. A closer inspection of the figure reveals that the field ”isbn” was the only field that was actually required. This means that a document was disregarded if its isbn field did not exist.

The value of the "indexed" attribute indicated whether the field was supposed to be searchable. Most fields had this attribute set to "true". Certain numeric values like "weight" were not searchable, since it seemed highly unlikely that a user would search for books by solely entering their weight. Although some fields were not searchable, they were most likely "stored", with the corresponding attribute set to "true". This means that the value of the field was stored and could be presented if the document was found with the help of a field that was indexed. The "multiValued" attribute was very useful when a field could contain several values. An example of this was the field that contains the author(s). If a book has several authors, it was convenient if the field could support storage of several values.

3.4.2 Jellyfish Service

Jellyfish is a Findwise product that acts as the service layer between the data and front-end layer. Because of Jellyfish, the business logic is abstracted, which leads to a case where the front-end design does not need to care for the business logic and can instead focus on rendering and handling user interactions.

The Jellyfish project was imported to the author's workspace and was configured from there. A template came with the project and minimal configuration was required. The substantial changes were made in an xml-file that was related to how the Solr instance would work. Jellyfish encapsulates Solr and was responsible for creating the queries that take boost values into consideration.


Figure 3.7. A snippet of the JellyFish configuration that changes the Solr query.

The bean in Figure 3.7 that is called ”qf” appended a qf-string to the Solr query. Its input data came from a file called data.properties, but the reader should be informed that this data was overwritten by a Java method that loads the proper qf-values from the author’s own database. The unnecessary input of aforementioned file was due to technical issues. In the end, JellyFish appended this line to the URL and passed it to Solr.

3.4.3 Searching interface

The reader was given a brief glimpse of the search interface in Figure 3.5. The purpose of the Jellyfish component was to take care of the system's business logic in order to make the front-end as lightweight as possible. The search interface was extremely lightweight and users would interact with, at most, three components: the search field, the view where search results are presented, and the buttons that are used to navigate to the next page. Once the "Hitta" button was clicked, the query that was given by the end-user was submitted to the Jellyfish component, which added the parameters specified in the Solr configuration files, i.e. qf.

3.4.4 Database

Field values   The database for the Master's thesis was a simple H2 database that was used as a key-value store for the values used in the qf-parameter. Every time a heuristic updated a field value, it uploaded the value to the appropriate table. How values were stored is illustrated in Figure 3.8.

Figure 3.8. Illustration showing some of the fields and how they are stored in the database.

Click- and search-logging   Another reason for using a database was to log the clicks and searches made by a user. The database also offered a table that revealed which one of the clicks was related to a certain query. This data was used to find out what a user clicked on in response to his/her query, and this tuple of data was used as input data to one of the heuristics.
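In the actual project the H2 database was accessed from Java; the following sketch uses Python's sqlite3 module purely to illustrate the two roles of the database (boost storage and click/search logging), with hypothetical table layouts and values.

import sqlite3

conn = sqlite3.connect(":memory:")   # stand-in for the H2 database
conn.execute("CREATE TABLE field_values (field TEXT PRIMARY KEY, value REAL)")
conn.execute("CREATE TABLE click_log (query TEXT, clicked_isbn TEXT)")

def store_boost(field, value):
    # key-value storage of the qf boost for one metadata-field
    conn.execute("INSERT OR REPLACE INTO field_values VALUES (?, ?)", (field, value))

def log_click(query, isbn):
    # remember which document was clicked in response to which query
    conn.execute("INSERT INTO click_log VALUES (?, ?)", (query, isbn))

store_boost("title", 2.4)
log_click("harry potter", "0000000000")   # hypothetical query and ISBN
print(conn.execute("SELECT * FROM field_values").fetchall())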


Chapter 4

Results

The results shown in this chapter are different illustrations of the data acquired through the evaluation. The first section will illustrate plots that show how the mean-relevance improved after every query and also the plots that illustrate the normal probability distribution of mean-relevances for the heuristics. The final section is dedicated to an analysis of the results. The analysis will be in the form of a paired two-tailed t-test with a high significance level and is used to see if there are any significant differences between the baseline search algorithm and the heuristics created by the author.

4.1 Plots

Below is a group of plots that show how the mean-relevance differed after every query search for all of the heuristics and the baseline algorithm.

Note that the set of queries was shuffled 100 times for each heuristic. Therefore, each of the figures shown in Figures 4.1 - 4.3 was randomly chosen from three different sets of 100 different plots. They are essentially snapshots of large test runs that were run to create the distributions seen in Figures 4.4 - 4.6.

These distributions attempted to illustrate the normal probability distribution for each heuristic. The construction of these plots was previously explained; readers are referred to Section 3.3 for how they were constructed.

Figure 4.1. Plot showing how the mean-relevance differed after every searched query

Figure 4.2. Plot showing how the mean-relevance differed after every searched query

Figure 4.3. Plot showing how the mean-relevance differed after every searched query for the Baseline algorithm. Field values are not boosted, which is the equivalent of a regular Solr search being executed.

Figure 4.4. The normal probability distribution for the mean-relevances when using

Figure 4.5. The normal probability distribution for the mean-relevances when using

Figure 4.6. The normal probability distribution for the mean-relevances when using

Figure 4.7. The normal probability distribution for the mean-relevances of all the previously illustrated plots in the same figure.

By looking at Figure 4.7, the reader should be able to observe that the differences between the heuristics and the baseline algorithm are not large. Observing the non-correlation with the naked eye is not enough. The non-correlation will be further investigated and discussed in Section 4.2.

4.2 Analysis

Observing Figures 4.1 - 4.3, one cannot see any differences in the evolution of the mean-relevance. The behaviour seems erratic for all of the heuristics. Although this data is omitted, the author can mention that most search queries usually returned well over 90 % of the stored document set as search results. These large sets of search results might have led to essential documents being pushed down beyond the top-10 ranking. If the highly relevant documents were placed somewhere beyond the first page several times, it means that the actual heuristics failed to make the mean-relevance converge to a high value. This observation is further touched upon in the concluding remarks.


In order to see how the values are distributed, the probability distribution curves are shown in Figures 4.4 - 4.7. The most probable value of the mean-relevance is 0.6 for all of the heuristics and the similarity of their distributions is clearly seen in Figure 4.7. The similarity can also be observed by comparing the means and standard deviations of all the distribution curves, since these values are also similar. To determine whether there is a significant difference between the distributions of mean-relevances, two-tailed paired t-tests were performed. Below are the different cases that were tested:

Test case 1   Significance between the value boost heuristic without dampening and the baseline algorithm.

Test case 2   Significance between the value boost heuristic with dampening and the baseline algorithm.

When performing a paired two-tailed t-test, one must specify a null hypothesis and an alternative hypothesis before performing the actual test. The null hypothesis states that there is no difference in the means of the sample sets, while the alternative hypothesis states the contrary. The null and alternative hypotheses are mutually exclusive.

The following subsections show the reader the null and alternative hypotheses, and the outcomes of the tests. The chosen significance level for both test cases was 0.05 (5 %).
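For reference, a paired two-tailed t-test of this kind can be computed as sketched below; the score lists are placeholder numbers, not the measurements from the evaluation.

from scipy import stats

# final MR@10 values from the shuffles -- placeholder numbers, not the thesis data
heuristic_scores = [0.61, 0.58, 0.63, 0.59, 0.60]
baseline_scores = [0.60, 0.59, 0.62, 0.58, 0.61]

result = stats.ttest_rel(heuristic_scores, baseline_scores)   # paired two-tailed t-test
alpha = 0.05
if result.pvalue < alpha:
    print(f"p = {result.pvalue:.3f}: reject the null hypothesis")
else:
    print(f"p = {result.pvalue:.3f}: the null hypothesis cannot be rejected")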

4.2.1 Test case 1

Null hypothesis   There is no difference in sample means between samples of the value boost heuristic's mean-relevances and the corresponding samples from the baseline algorithm.

Alternative hypothesis   There is a difference between the sample means due to a non-random cause.

Results   The two-tailed paired t-test showed that there was no significant difference between the sample means, so the null hypothesis cannot be rejected.

4.2.2 Test case 2

Null hypothesis   There is no difference in sample means between samples of the value boost dampening heuristic's mean-relevances and the corresponding samples from the baseline algorithm.

Alternative hypothesis   There is a difference between the sample means due to a non-random cause.


Results   The two-tailed paired t-test showed that there was no significant difference between the sample means, so the null hypothesis cannot be rejected here either. These tests indicate that there were no significant differences in precision between the heuristics and the baseline algorithm. One of the reasons could be that the heuristics are somewhat simple algorithms with no actual scientific support for how boost-values are incremented. Just as Joachims and Radlinski [2] mentioned, simple heuristics are doomed to fail.


Chapter 5

Conclusions

This is the final chapter of the Master’s thesis and aims to answer the research questions that were formulated in the beginning of this report. Once the research questions have been answered, future improvements on this Master’s thesis will be brought up.

5.1 Research questions

Do the evaluations of the search engine heuristics indicate that the author's hypothesis might be accurate?   Hypothesis 1 stated that the proposed solutions would be significantly better at presenting the users' sought documents than the baseline search engine¹. Statistical tests were made using data acquired through the evaluation process. These tests showed, at a significance level of 5 %, that there was no significant difference between the mean-relevances given by the proposed solutions and the baseline search engine. Any differences that might have been observed were simply too small and probably caused by random factors.

¹ The hypothesis actually mentioned "the unmodified search engine", but this is the same as the baseline search engine.

What are the desirable/undesirable effects of the solution?   Looking at Figure 4.7, the reader could see that it was most likely that the mean-relevance for an executed search would have a value of 0.6. This applied for all of the proposed solutions and the baseline search engine. A value of 0.6 equates to a scenario where some of the documents on the first page are not relevant to the search query. In other words, any value that is well below 1.0 is considered to be extremely poor.

This means that one of the undesirable effects of the proposed solutions was that they did not change the search engine's tendency to favour documents that are non-relevant to the search query.

Another undesirable effect was the number of returned documents. A search query containing the word "books" would, in this case, return more than 90 % of the indexed documents, and it probably pushed the relevant documents away from the top-10 list.

Looking at the plots of mean-relevance in Figures 4.1 - 4.3, another undesirable effect that can be observed was that, for all cases, there was no convergence of mean-relevance. The evolution of mean-relevance looks highly random and the author believes that the reason for this lies in the fact that the proposed solutions were too simple. Although actual users were not used for the evaluation, the behaviour of the simulated user still gave rise to the infamous trust bias, since the simulated user only clicked a document that was placed somewhere on the first page.

After the execution of all queries, only around half of the metadata fields were actually updated. Fields like "isbn", "publisher", "dewey" and so forth were never updated and retained their original value after the evaluations. This introduces doubt over the necessity of having to index all of the metadata-fields into the database.

Is it reasonable to assume that the proposed solutions actually improve the users' search experience?   The evaluation in this Master's thesis showed no significant differences in accuracy between the proposed solutions and the baseline search engine. Therefore, the proposed solutions do not improve the users' search experience and there is no point in replacing the baseline search engine in favour of the proposed solutions.

5.2 Future improvements

In this final section, all of the future improvements for this Master’s thesis are listed and explained.

Calculating similarity   For every pair of executed search query and click, similarities between the metadata-fields of the clicked document and the search query were calculated. The metric used to calculate similarity was the common cosine similarity. One way of improving the Master's thesis could be to introduce different ways of calculating similarities between, essentially, two strings.

Another aspect of similarity, that can be changed, is to not just look at what metadata-fields are similar to the query, but also look at how much of a similarity was found. The current setup only allowed one metadata-field to have its boost-value updated, the most similar one, but it is unfair to completely disregard the field that was the second most similar field or even the third most similar field. Therefore, the code used to update the boost-values needs to take into account that several fields might have affected the user’s decision of choosing a document.

Another aspect that can be changed is the increment rate of the fields. If there is a scenario where only one field was found to be similar to the search query, one could introduce a function that reduces or amplifies the increment value based on how much of a similarity could be found between the metadata-field and the search query.


User-centric evaluations   The 380 queries that the author acquired from INEX and used for the evaluations came from 380 different LibraryThing users. The evaluation was carried out by simulating the behaviour of a single user, but since the 380 queries came from several users, the evaluation was actually attempting to model the behaviour of a typical user in the given domain - LibraryThing. In other words, the evaluation attempted to create a search behaviour model for the entire domain of LibraryThing. This did not work well and the proposed solutions did not differ in behaviour compared to the baseline search engine.

Another way of evaluating the heuristics could be to make the evaluations user-centric; the evaluation should try to create individual user behaviour models in order to see if the proposed solutions are good at creating a model of information need for a single user.

Not indexing all of the metadata   Looking at the values of the metadata-fields, some of the fields were never updated during the evaluation and their indexing into the search engine's database might have been redundant. One could research this aspect by only indexing the metadata that is likely to be searched by end-users. This might reduce the number of documents returned and make it more likely that the relevant documents will be closer to the top-10 list of search results.

Taking other factors of implicit feedback into account   The heuristics that were evaluated in this Master's thesis only took the search query and the clicked search result into consideration. Since it has been mentioned that simple heuristics are doomed to fail [2], it would be interesting to evaluate heuristics that take several factors of IF into account. One example is the usage of the reading time of a document. If a user stays on the page of a document, it could mean that the document is of interest for the user.

Different authors in the literature review put this theory to the test and came to different conclusions [10, 18]. The findings that claim that reading time is not a good metric for relevance of a document claim that the conclusion was affected by the fact that test subjects were not experts in the domain where they searched for documents. This is not the case for [18], where users were experts of the search domain and had no issues in finding documents that were relevant to the queries they were to search for. This lead to reading time being used as a metric that would significantly increase the performance of the search engine.

In the light of this finding, future researchers should attempt to extend the proposed solutions of this Master’s thesis to incorporate the factor of reading time.
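One way of incorporating reading time, assuming dwell time per result can be measured, is to let very short visits count for nothing and longer visits contribute a weight that scales the boost increment. The 30-second threshold and 300-second cap below are illustrative choices, not values taken from the cited studies.

    def click_weight(dwell_time_seconds, threshold=30.0, cap=300.0):
        # Ignore clicks followed by a very short visit, and let longer visits
        # contribute proportionally more, capped so that outliers do not dominate.
        if dwell_time_seconds < threshold:
            return 0.0
        return min(dwell_time_seconds, cap) / cap

The returned weight could then multiply the boost increment described earlier, so that a document the user merely glanced at does not influence the metadata preferences at all.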

Different hypothesis for user behaviour    Hypothesis 2 concerned the behaviour of a user when he/she is about to click on a search result. Not enough data was gathered to deduce whether Hypothesis 2 actually increased the quality of the evaluation results. An improvement to this study could involve tests where different models of user behaviour are used for the evaluation, such as the position-biased click simulation sketched below.
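One alternative behaviour model that could be tested is a position-biased one, where the simulated user examines results top-down and the probability of examining a result decays with its rank. The sketch below is only one possible such model; the decay factor and the assumption that an examined relevant result is always clicked are illustrative assumptions, not choices made in this thesis.

    import random

    def simulate_clicks(ranked_doc_ids, relevant_ids, attention_decay=0.7, seed=None):
        # Simulate a user who scans the result list top-down: the probability of even
        # examining a result decays geometrically with rank, and an examined result
        # is clicked only if it is among the relevant documents.
        rng = random.Random(seed)
        clicks = []
        for rank, doc_id in enumerate(ranked_doc_ids):
            if rng.random() < attention_decay ** rank and doc_id in relevant_ids:
                clicks.append(doc_id)
        return clicks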


References

[1] Christopher D. Manning, Prabhakar Raghavan, and Hinrich Schütze. An Introduction to Information Retrieval. Cambridge University Press, Cambridge, England, 2009. URL http://nlp.stanford.edu/IR-book/pdf/irbookonlinereading.pdf [Online; retrieved January 23rd 2014].

[2] Thorsten Joachims and Filip Radlinski. Search engines that learn from implicit feedback. IEEE Computer, 40(8):34–40, August 2007. URL http://luci.ics.uci.edu/websiteContent/weAreLuci/biographies/faculty/djp3/LocalCopy/04292009.pdf [Online; retrieved February 3rd 2014].

[3] Eugene Agichtein, Eric Brill, and Susan Dumais. Improving web search ranking by incorporating user behavior information. In Proceedings of the 29th Annual ACM International Conference on Research and Development in Information Retrieval (SIGIR ’06), 2006. URL http://www.msr-waypoint.com/en-us/um/people/sdumais/SIGIR2006-fp345-Ranking-agichtein.pdf [Online; retrieved January 29th 2014].

[4] Xuehua Shen, Bin Tan, and ChengXiang Zhai. Implicit user modeling for personalized search. In Proceedings of the 14th ACM International Conference on Information and Knowledge Management, CIKM ’05, pages 824–831, New York, NY, USA, 2005. ACM. URL http://doi.acm.org/10.1145/1099554.1099747 [Online; retrieved February 3rd 2014].

[5] Douglas W. Oard and Jinmook Kim. Implicit feedback for recommender systems. In AAAI Technical Report WS-98-08, 1998. URL http://www.aaai.org/Papers/Workshops/1998/WS-98-08/WS98-08-021.pdf [Online; retrieved January 29th 2014].

[6] Findwise AB. The website of Findwise AB, May 2014. URL http://www.findwise.com [Online; retrieved May 1st 2014].

[7] Amit Singhal. Modern information retrieval: A brief overview. IEEE Data Eng. Bull., 24(4):35–43, 2001. URL http://singhal.info/ieee2001.pdf [Online].

[8] Felix Weigel, Klaus U. Schulz, and Holger Meuss. Ranked retrieval of structured documents with the s-term vector space model. In Norbert Fuhr, Mounia Lalmas, Saadia Malik, and Zoltán Szlávik, editors, Advances in XML Information Retrieval, volume 3493 of Lecture Notes in Computer Science, pages 238–252. Springer Berlin Heidelberg, 2005. URL http://dx.doi.org/10.1007/11424550_19 [Online; retrieved July 15th 2014].

[9] Apache. Queryfields function in Solr, May 2014. URL http://wiki.apache.org/solr/ExtendedDisMax#bf_.28Boost_Function.2C_additive.29 [Online; retrieved May 5th 2014].

[10] Diane Kelly and Nicholas J. Belkin. Reading time, scrolling and interaction: Exploring implicit sources of user preference for relevance feedback. In Proceedings of the 24th Annual International ACM Conference on Research and Development in Information Retrieval (SIGIR ’01), 2001. URL http://comminfo.rutgers.edu/etc/mongrel/kelly-belkin-SIGIR2001.pdf [Online; retrieved January 23rd 2014].

[11] Thorsten Joachims, Laura Granka, Bing Pan, Helene Hembrooke, and Geri Gay. Accurately interpreting clickthrough data as implicit feedback. In Proceedings of the Conference on Research and Development in Information Retrieval (SIGIR), 2005. URL http://www.cs.cornell.edu/people/tj/publications/joachims_etal_05a.pdf [Online; retrieved January 23rd 2014].

[12] Gawesh Jawaheer, Martin Szomszor, and Patty Kostkova. Comparison of implicit and explicit feedback from an online music recommendation service. In Proceedings of the 1st International Workshop on Information Heterogeneity and Fusion in Recommender Systems, HetRec ’10, pages 47–51, New York, NY, USA, 2010. ACM. URL http://doi.acm.org/10.1145/1869446.1869453 [Online; retrieved February 3rd 2014].

[13] Yifan Hu, Yehuda Koren, and Chris Volinsky. Collaborative filtering for implicit feedback datasets. In Proceedings of the 2008 Eighth IEEE International Conference on Data Mining, ICDM ’08, pages 263–272, Washington, DC, USA, 2008. IEEE Computer Society. URL http://dx.doi.org/10.1109/ICDM.2008.22 [Online; retrieved February 3rd 2014].

[14] Steffen Rendle and Christoph Freudenthaler. Improving pairwise learning for item recommendation from implicit feedback. In Proceedings of the 7th ACM International Conference on Web Search and Data Mining, WSDM ’14, pages 273–282, New York, NY, USA, 2014. ACM. URL http://doi.acm.org/10.1145/2556195.2556248 [Online; retrieved July 15th 2014].

[15] Frank Hopfgartner and Joemon Jose. Evaluating the implicit feedback models for adaptive video retrieval. In Proceedings of the International Workshop on Multimedia Information Retrieval, MIR ’07, New York, NY, USA, 2007. ACM. URL http://doi.acm.org/10.1145/1290082.1290127 [Online; retrieved February 3rd 2014].

[16] Shaghayegh Sahebi and William Cohen. Community-based recommendations: a solution to the cold start problem. In Workshop on Recommender Systems and the Social Web, RSWEB, 2011. URL http://d-scholarship.pitt.edu/13328/ [Online; retrieved February 3rd 2014].

[17] Philip Zigoris and Yi Zhang. Bayesian adaptive user profiling with explicit & implicit feedback. In Proceedings of the 15th ACM International Conference on Information and Knowledge Management, CIKM ’06, pages 397–404, New York, NY, USA, 2006. ACM. URL http://doi.acm.org/10.1145/1183614.1183672 [Online; retrieved February 3rd 2014].

[18] Jinmook Kim, Douglas W. Oard, and Kathleen Romanik. Using Implicit Feedback for User Modeling in Internet and Intranet Searching. College of Library and Information Services, University of Maryland, College Park, 2000. URL http://books.google.se/books?id=kgdFGwAACAAJ [Online; retrieved February 3rd 2014].

[19] Yuanhua Lv, Le Sun, Junlin Zhang, Jian-Yun Nie, Wan Chen, and Wei Zhang. An iterative implicit feedback approach to personalized search. In Proceedings of the 21st International Conference on Computational Linguistics and the 44th Annual Meeting of the Association for Computational Linguistics, ACL-44, pages 585–592, Stroudsburg, PA, USA, 2006. Association for Computational Linguistics. URL http://dx.doi.org/10.3115/1220175.1220249 [Online; retrieved February 3rd 2014].

[20] Georg Buscher, Andreas Dengel, Ralf Biedert, and Ludger V. Elst. Attentive documents: Eye tracking as implicit feedback for information retrieval and beyond. ACM Trans. Interact. Intell. Syst., 1(2):9:1–9:30, January 2012. URL http://doi.acm.org/10.1145/2070719.2070722 [Online; retrieved July 15th 2014].

[21] Ryen W. White and Georg Buscher. Text selections as implicit relevance feedback. In Proceedings of the 35th International ACM SIGIR Conference on Research and Development in Information Retrieval, SIGIR ’12, pages 1151–1152, New York, NY, USA, 2012. ACM. URL http://doi.acm.org/10.1145/2348283.2348514 [Online; retrieved July 15th 2014].

[22] Chao Liu, Ryen W. White, and Susan Dumais. Understanding web browsing behaviors through Weibull analysis of dwell time. In Proceedings of the 33rd International ACM SIGIR Conference on Research and Development in Information Retrieval, SIGIR ’10, pages 379–386, New York, NY, USA, 2010. ACM. URL http://doi.acm.org/10.1145/1835449.1835513 [Online; retrieved July 15th 2014].


[23] INEX Forum. Information about the data set, May 2014. URL https://inex.mmci.uni-saarland.de/tracks/books/ [Online; retrieved May 5th 2014].

