
Learning to Rank Using Implicit Feedback and

Active Exploration

Applying the Glicko Rating System to a People-Search Context

MEIDI TÕNISSON

Master’s Thesis at CSC
Supervisor: Josephine Sullivan
Examiner: Stefan Carlsson
Project Provider: Martin Nycander


Abstract

Learning to rank using implicit feedback is a relatively new supervised machine learning method, in which training data is used in order to automatically calculate search engine document ranking parameters. This thesis describes the implementation and evaluation of a rank learning algorithm with active learning. The relevance judgments are collected from users in the form of click-through data, and the ranking parameters are learned incrementally over time.

The online learner is implemented as a faculty search engine at the Royal Institute of Technology, and is evaluated both via human usage and user simulation experiments. The Kendall Tau Ranking Correlation Coefficient is used in order to compare the learned ranks with the corresponding true ranks of the experiments.


Rank learning using implicit feedback and active learning

Rank learning using implicit feedback is a relatively new method within supervised machine learning, in which training data is used to automatically calculate ranking parameters for the documents in a search engine corpus. This thesis describes the implementation and evaluation of a rank learning algorithm that makes use of active learning. Relevance judgments are collected from the users of the search engine in the form of click data, and the ranking parameters are calculated incrementally.

The algorithms are implemented in the form of a search engine for faculty members at the Royal Institute of Technology, and are evaluated both with the help of human users and through user simulation experiments. The Kendall Tau Ranking Correlation Coefficient is used to compare the learned document ranks with the correct document ranks.


Contents

I Background

1 Introduction
1.1 The implications of clicking
1.2 Active exploration
1.3 Problem statement
1.4 The project provider

2 Theoretical background
2.1 Model
2.2 Algorithms responsible for learning
2.2.1 Glicko learner
2.2.2 Sigma presenter
2.3 Search context
2.4 Simulations
2.4.1 Click strategies of the simulated user
2.4.2 Relevance model of the documents
2.5 Evaluation of rank correctness
2.5.1 Evaluating an online learner when the true rank is unknown
2.5.2 Evaluating user simulation experiments where the true rank is known
2.5.3 Evaluating the validity of the hypotheses

II Method

3 Implementation
3.1 Frameworks for static search
3.2 Software architecture for online learning implementation
3.2.1 Glicko learner
3.2.2 Sigma presenter
3.3 Data collection
3.4 User base
3.5 Simulations

4 Live experiment results

5 Simulation results
5.1 Simulation overview
5.2 Kendall Tau Ranking Correlation Coefficient
5.3 Clicked indices over time
5.3.1 Click strategy
5.3.2 Relevance model

6 Conclusions
6.1 Rank correctness
6.2 Properties of the learner
6.2.1 Favoring unranked documents
6.3 Sources of error
6.4 Final thoughts
6.5 Future work
6.5.1 Suggestions for future research

IV Appendix
6.6 Appendix 1: Kendall Tau Ranking Correlation Coefficient Values
6.7 Appendix 2: Live experiment user logs


Part I

Background


Chapter 1

Introduction

Among the many tasks a search engine has to perform, ranked retrieval is no doubt one of the most important. Ranked retrieval involves applying some kind of ordering to the documents that the search engine retrieves before presenting them to the user, after the user has posed his or her initial query. The search engine must thus try to deduce the relative importance of the documents according to the believed information need[1] of the user.

There are many ways to estimate the information need of a user, all of which come with various advantages and disadvantages. One such method is to calculate a relevance measure dependent on the frequency of the query terms in the document - called the tf-idf value (term frequency-inverse document frequency)[2] - and to assume that unique terms should be perceived as more relevant than terms that often appear in the document set. This is adequate for many purposes, but requires search engine developers to believe in the underlying assumption that user-formulated queries correctly define the information needs of the users. This is often not the case, as is evidenced by the amount of time users frequently spend on formulating and reformulating queries in order to get more relevant results from the search engine[3].

It has become abundantly clear that common machine learning techniques might be able to bring a new perspective to the calculation of users’ information needs. By tracking which documents users actually choose to view after a set of documents has been retrieved, we can use this as a measure of the quality of the chosen document - after all, we can assume that the user is an expert in terms of their own information need. This method of dynamically updating document relevance values is aptly called relevance feedback[4].

1.1

The implications of clicking


Asking users for explicit relevance judgments makes the interpretation of the feedback straightforward[4]. This method of relevance feedback retrieval is relatively costly, however, and given the abundance of click logging software available to search engine developers today, one would prefer to make use of this data in some way.

The question of whether clickthrough data can be viewed as implicit relevance feedback has been thoroughly explored within the scientific community of Information Retrieval. An eye tracking user study performed by a team of researchers at Cornell and Stanford University has shown that user clicks can indeed be trusted to the same extent as classical explicit relevance feedback - as long as one interprets a user click as a relative pairwise preference judgment[5]. Users commonly scan the document ranking from top to bottom, meaning that a click on the n-th document in the ranking expresses a relative pairwise preference for document n when compared to each of the preceding documents in turn. With this in mind, we can fit a model to our learning problem and deduce apt algorithms for calculating document relevance values based on user-provided clickthrough feedback (see section 2.1).

1.2

Active exploration

An important issue that emerges from use of relevance feedback is a form of bias on the part of the user called trust bias. Trust bias means that users do not fully recognize their own expert status in the field of their own information needs, which leads them to over-trust the rankings produced by the search engine. This can lead to a feedback loop where the user merely affirms the initial ordering of the documents by the search engine by clicking the top ranked results that the search engine has just retrieved, with less regard to the actual relevance of the documents[6].

One way to handle this predicament is to make use of a machine learning tactic called active exploration[7]. In the general case, active exploration entails not limiting oneself to what one currently perceives to be the optimal solution to a problem, but to also explore previously unused solutions. This should theoretically lead to the learner eventually finding a solution that is more closely related to the global optimal solution than what we might have ended up with if we were to be more narrow-minded.


1.3

Problem statement

The aim of this thesis is to investigate, implement and evaluate a specific methodology for learning document rankings by using clickthrough data from search engine users. The work will be heavily focused on implementing previously proposed ideas for learning and active exploration, with the goal of researching how well the proposed methods perform in realistic settings. In order to formalize the goals, the following research questions will constitute the basis for the research conducted during the work of this thesis (with the proposed solution denoting the implementation outlined in chapter 3):

1. In what search contexts does the proposed solution appear to be suitable?

2. Is it reasonable to believe that the proposed solution would improve the experience of a user searching for information?

3. What are the desirable/undesirable consequences of the proposed solution?

Drawing from these research questions, the following hypotheses have been formulated, and they will be tested within the scope of this thesis:

1. The proposed solution learns document ranks that are closer to the underlying true ranks than the initial ranks proposed by the search engine.

2. The proposed solution provides a higher rate of usability than that of a standard, non-learning search engine.

The answer to Research question 1 (in what search contexts the proposed solution appears to be suitable) will also be explored.

1.4

The project provider


Chapter 2

Theoretical background

2.1

Model

In order to know what algorithms are suitable for learning the ranking parameters of a search context, we must choose a model that approximates the live scenario. In an article written by Information Retrieval researchers Filip Radlinski and Thorsten Joachims in 2007, the following Bradley-Terry model is proposed as a good fit for the rank learning problem[7]. The remainder of this section is devoted to giving an overview of the technical details of the findings in the aforementioned article.

We let M∗ = (µ∗_1, . . . , µ∗_|C|) ∈ R_+^|C| be the true relevance values of the documents in our corpus C. These values are obviously unknown to us before learning has started, but what is known is that we want to maintain the following Bayesian posterior of the relevance values given the training data D:

$$P(M \mid D) = \frac{P(D \mid M)\,P(M)}{P(D)}$$

Radlinski and Joachims formalize the learning problem as an optimization problem, and present a loss function that counts the number of misordered pairs of documents in the learned rank. Using this loss function, and operating under the assumption that P(M|D) is a multivariate distribution where M and D are uncorrelated, they show that the mode of P(M|D) is often the ranking that minimizes the expected loss.[7]

The clickthrough data D supplied by the users of the search engine is, as previously stated, best viewed as comparative pairwise relevance judgments. With this in mind, we can use a modified Bradley-Terry model to model the likelihood of the document relevance values:

$$P(d_i \succ d_j) = \frac{rel(d_i)}{rel(d_i) + rel(d_j)}$$

where rel(d_i) is the relevance of d_i and the ≻ operator denotes a pairwise preference judgment.
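As a quick illustration (the numbers are my own and purely illustrative): if rel(d_i) = 3 and rel(d_j) = 1, the model gives

$$P(d_i \succ d_j) = \frac{3}{3 + 1} = 0.75,$$

so the more relevant document is expected to be preferred in three out of four pairwise comparisons against the less relevant one.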


Given this model, Radlinski and Joachims propose the application of a known set of algorithms from a rating system called the Glicko rating system[10] in order to maintain the Bayesian posterior P(M|D) [7].

2.2

Algorithms responsible for learning

2.2.1 Glicko learner

The Glicko learner is a set of learning algorithms based on the Glicko rating system∗, which in turn is a system commonly used for rating chess players. If we view the learning problem as a continuous chess tournament in which each player represents a document, we can see our aforementioned comparative pairwise relevance judgments as chess games within the tournament. This means that a document being preferred over another document in a given rank is analogous to a chess player winning a game against another chess player - in both cases, the rank parameters of the players/documents need to be updated in order to correctly reflect their respective values.

Obviously, winning against a highly rated chess player should increase the skill estimate of the winning chess player more than if he had won a game against an amateur, so we need to take the current rating of the opposing player (document) into account when calculating the new ratings. This is taken into account within the Glicko rating system, as updated rating values are dependent on the rating values of the opposing players.

Within the Glicko rating system, each document is assigned both a relevance measure and a standard deviation measure – the former denoting the global relevance of the document, and the latter denoting the amount of uncertainty that is associated with this relevance value. The uncertainty will typically decline over time if many inter-document comparisons are made, but documents for which few relevance judgments have been made will see their uncertainty slowly rise. The Glicko learner is an online learner[11] - meaning that the rankings are learned incrementally over time.

The update functions to the relevance and standard deviation estimates are given in table 2.1. Each click carried out by a user will correspond to a series of inter-document games, depending on how far down in the ranking the user has elected to click.
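To make the click-to-game mapping concrete, here is a minimal sketch (my own illustration rather than the thesis implementation; the Game record and the class name are assumptions) of how a single click could be expanded into pairwise preference "games", one per document ranked above the clicked one:

    import java.util.ArrayList;
    import java.util.List;

    /** One pairwise preference judgment: the winner was preferred over the loser. */
    record Game(String winnerId, String loserId) {}

    final class ClickToGames {
        /**
         * Expands a click on the document at position clickedIndex (0-based) in the
         * presented ranking into one "game" won by the clicked document against each
         * document ranked above it, following the relative-preference interpretation of clicks.
         */
        static List<Game> expand(List<String> presentedRanking, int clickedIndex) {
            List<Game> games = new ArrayList<>();
            String winner = presentedRanking.get(clickedIndex);
            for (int i = 0; i < clickedIndex; i++) {
                games.add(new Game(winner, presentedRanking.get(i)));
            }
            return games;
        }
    }

A click at rank three of a presented ranking (d1, d2, d3, . . .) would then yield the games d3 ≻ d1 and d3 ≻ d2, which corresponds to the inter-document games described above.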

2.2.2 Sigma presenter

The Sigma presenter, previously proposed by Johan Tidén[8], is responsible for the active exploration of new document rankings. Its principle is very simple: the two documents in the produced rank with the highest uncertainty are placed at ranks three and four, respectively (see figure 2.1). With this strategy, users will be influenced to consider documents with high relevance uncertainty.

An educational explanation of the Glicko rating system can be found at http://www.glicko.net/
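A minimal sketch of the Sigma presenter's reordering step described above (the RankedDocument holder, the field names and the assumption that the learned rank arrives ordered by relevance are mine, not taken from the thesis implementation):

    import java.util.ArrayList;
    import java.util.Comparator;
    import java.util.List;

    /** Hypothetical document holder: id, learned relevance (mu) and uncertainty (sigma). */
    record RankedDocument(String id, double mu, double sigma) {}

    final class SigmaPresenter {
        /**
         * Takes the rank produced by the Glicko learner (ordered by mu, descending)
         * and moves the two documents with the highest uncertainty to ranks three and four.
         */
        static List<RankedDocument> present(List<RankedDocument> learnedRank) {
            if (learnedRank.size() < 4) {
                return new ArrayList<>(learnedRank); // too short to reorder meaningfully
            }
            List<RankedDocument> mostUncertain = learnedRank.stream()
                    .sorted(Comparator.comparingDouble(RankedDocument::sigma).reversed())
                    .limit(2)
                    .toList();
            List<RankedDocument> presented = new ArrayList<>();
            for (RankedDocument d : learnedRank) {
                if (!mostUncertain.contains(d)) {
                    presented.add(d);
                }
            }
            presented.add(2, mostUncertain.get(0)); // rank three (0-based index 2)
            presented.add(3, mostUncertain.get(1)); // rank four
            return presented;
        }
    }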


Glicko update rule

$$\mu_i^{+} = \mu_i + \frac{q}{\frac{1}{\sigma_i^2} + \frac{1}{\delta^2}} \sum_{j=1}^{m} g(\sigma_j)\left(s_j - E(s \mid \mu_i, \mu_j, \sigma_j)\right)$$

$$\sigma_i^{+} = \sqrt{\left(\frac{1}{\sigma_i^2} + \frac{1}{\delta^2}\right)^{-1}}$$

Where:

s_1, . . . , s_m are the outcomes of each game (0 for loss, 1 for win)
µ_i^+ is the updated relevance value of document i
σ_i^+ is the updated standard deviation of document i
µ_i is the relevance value of document i before updates are carried out
σ_i is the standard deviation of document i before updates are carried out
µ_j is the relevance value of opponent document j
σ_j is the standard deviation of opponent document j

$$q = \frac{\ln 10}{400} \qquad g(\sigma) = \frac{1}{\sqrt{1 + 3q^2\sigma^2/\pi^2}} \qquad E(s \mid \mu_i, \mu_j, \sigma_j) = \frac{1}{1 + 10^{-g(\sigma_j)(\mu_i - \mu_j)/400}}$$

$$\delta^2 = \left(q^2 \sum_{j=1}^{m} g(\sigma_j)^2\, E(s \mid \mu_i, \mu_j, \sigma_j)\left(1 - E(s \mid \mu_i, \mu_j, \sigma_j)\right)\right)^{-1}$$

Table 2.1. The update rules for the Glicko learner, as proposed by Mark Glickman [10].
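As a self-contained illustration of how the rules in table 2.1 could be applied in code, the sketch below updates a single document's rating after m games in one rating period. It follows the formulas above directly, but the class, record and method names are my own and the snippet is not the thesis implementation:

    import java.util.List;

    final class GlickoUpdate {
        private static final double Q = Math.log(10) / 400.0;

        /** One game against an opponent document: its rating (mu), deviation (sigma) and the outcome s (1 = win, 0 = loss). */
        record GameResult(double opponentMu, double opponentSigma, double outcome) {}

        static double g(double sigma) {
            return 1.0 / Math.sqrt(1.0 + 3.0 * Q * Q * sigma * sigma / (Math.PI * Math.PI));
        }

        static double expectedScore(double mu, double opponentMu, double opponentSigma) {
            return 1.0 / (1.0 + Math.pow(10, -g(opponentSigma) * (mu - opponentMu) / 400.0));
        }

        /** Returns the updated pair {mu, sigma} for a document with current values (mu, sigma). */
        static double[] update(double mu, double sigma, List<GameResult> games) {
            double deltaInvSquared = 0.0;   // will hold 1 / delta^2
            double weightedSurprise = 0.0;  // sum of g(sigma_j) * (s_j - E_j)
            for (GameResult game : games) {
                double e = expectedScore(mu, game.opponentMu(), game.opponentSigma());
                double gj = g(game.opponentSigma());
                deltaInvSquared += Q * Q * gj * gj * e * (1.0 - e);
                weightedSurprise += gj * (game.outcome() - e);
            }
            double denominator = 1.0 / (sigma * sigma) + deltaInvSquared;
            double newMu = mu + (Q / denominator) * weightedSurprise;
            double newSigma = Math.sqrt(1.0 / denominator);
            return new double[] { newMu, newSigma };
        }
    }

Because the opponent's rating and deviation enter through g and E, a win against a strong opponent moves µ more than a win against a weak one, and σ shrinks as games accumulate, matching the behavior described in section 2.2.1.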


2.3

Search context

When it comes to defining and describing a typical search engine usage scenario, it is apparent that not all usage scenarios were created equal. Depending on the information need∗ of the users as well as on the type of information contained in the search engine index, the behavior of the users as well as the theoretical ideal behavior of the search engine varies. In order to emphasize and encapsulate all of these differing parameters and circumstances, we introduce the concept of a search context.

Search context A context for information retrieval, characterized by assumptions about the type of information contained in the search engine index as well as the behavior of the users.

2.4

Simulations

In this thesis project, user simulations based on behavioral models will be implemented and executed in a series of different experiments, in order to ensure a solid foundation of data for carrying out evaluations. These simulations will be based on carefully chosen assumptions as well as previous conclusions from researchers within the information retrieval community, as outlined in the sections below.

In order to properly simulate the click behavior of a user, one must make assumptions about what documents users are likely to click (given a perceived set of document relevance values) as well as about the nature of the given relevance values. We call the first class of assumptions a click strategy, and the second a relevance model. Combining different valuations of these two assumptions gives us several scenarios to act out and evaluate. This experimental setup should ideally give insight into all three research questions outlined in section 1.3.

2.4.1 Click strategies of the simulated user

Cascade model

As described in section 1.1, users commonly scan a list of documents from top to bottom when assessing the search results of a posed query, and tend not to take into account documents that have been ranked lower than the document they choose to click. The Cascade model, presented by Microsoft researchers in 2008 [12], accounts for the impact of the ranking upon the click behavior of the users, and thus fits our learning scenario reasonably well.

Given a document relevance value r_d, the probability c_{d,i} of clicking document d at rank i can be expressed as c_{d,i} = P(click = true | Document = d, Rank = i). Under the assumptions of the Cascade model, this can be calculated as:

$$c_{d,i} = r_d \prod_{j=1}^{i-1} \left(1 - r_{\text{doc in rank } j}\right)$$

Perfect user

Though it has been shown that a presented rank does have a profound effect on user click behavior, one cannot discount the possibility that some users might be more thorough in their assessment of the documents retrieved by the search engine. For the sake of completeness, experiments involving a perfect user behavioral model have therefore been carried out. This means that the simulated user only clicks the document with the strictly highest relevance value in the presented rank. Using the same notation as in the Cascade model:

$$c_{d,i} = 1 \iff r_d = \max_{i} r_{i}$$

In order to further explore click strategies where the user clicks depend both on document relevance and document rank, two customized strategies have been constructed as detailed below.

Top ranked among top relevant

When using this strategy, users identify the set of five most relevant documents, and then click the document in the set that has the highest rank. This strategy corresponds loosely to a scenario where users evaluate the document rank rather exhaustively until they come across a document above a certain threshold of acceptable relevance.

topRelevant = the set of top five relevant documents in the rank

$$c_{d,i} = 1 \iff i = \min_{i \in topRelevant} i$$

Top relevant decreasing with rank

This strategy further emphasizes the effect of rank on user behavior, but without being as governed by rank as the Cascade model. The consequence of this model is that users find documents lower in the rank less and less appealing, but there is still quite a high probability of the user venturing deep down in the ranking if necessary, since the assumption is that the user only finds the top five relevant documents acceptable enough to click.

topRelevant = the set of top five relevant documents in the rank

$$score_{d,i} = r_d / 1.01^{i}$$

$$c_{d,i} = 1 \iff score_{d,i} = \max_{i \in topRelevant} score_{d,i}$$


Modelling deviations from the click strategy

It is an established concept[13] that users sometimes stray from the models that define their beliefs about document relevance. In order to make sure that our simulations adequately approximate realistic situations, the click strategies outlined above have been supplemented with a degree of randomness which has the effect of generating quasi-random clicks with probability ε (a minimal sketch of how such noisy clicks can be generated is given after the list below). Three different ideas have been taken into account when generating the quasi-random clicks:

• The user clicks a random document drawn from the set of top five most relevant documents.

• The user clicks a random document drawn from the set of top five highest ranked documents.

• The user clicks a random document drawn from the entire document rank.
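The following sketch shows one way the Cascade click strategy and the ε-random deviation could be combined; it is illustrative only (the value of ε, the choice of the top-five-ranked deviation set and all names are assumptions, and relevance values are assumed to lie in [0, 1] as in the uniform model):

    import java.util.List;
    import java.util.Random;

    final class NoisyCascadeClicker {
        private static final double EPSILON = 0.15; // probability of a quasi-random click (assumed value)
        private final Random random = new Random();

        /**
         * Returns the 0-based index of the clicked document, or -1 if the simulated user
         * abandons the ranking without clicking. relevances.get(i) is the relevance of the
         * document presented at rank i + 1.
         */
        int click(List<Double> relevances) {
            if (relevances.isEmpty()) {
                return -1;
            }
            if (random.nextDouble() < EPSILON) {
                // Quasi-random deviation: click somewhere among the top five ranked documents.
                return random.nextInt(Math.min(5, relevances.size()));
            }
            // Cascade model: scan from the top; at each rank the document is clicked with
            // probability r_d, otherwise the user moves on to the next document.
            for (int i = 0; i < relevances.size(); i++) {
                if (random.nextDouble() < relevances.get(i)) {
                    return i;
                }
            }
            return -1; // no document was attractive enough
        }
    }

Scanning top-down and clicking with probability r_d at each step reproduces the Cascade click probability c_{d,i} = r_d ∏_{j<i}(1 − r_j) given earlier.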

2.4.2 Relevance model of the documents

Distributions

The distributions from which we generate our underlying relevance values are described in the following paragraphs.

Uniform In a uniform relevance model, the underlying document relevance values for ten of the documents in the result set are randomly drawn in the interval [0, 1] with uniform probability.

Normal distribution In these experiments, relevance values are randomly drawn from the normal distribution N(1500, 147²).

Exponential The exponential relevance model corresponds to a search context where the document relevancies differ immensely, with one document being the most relevant by far.
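Purely as an illustration of the three distribution families described above (the exact parameters used in the thesis experiments are not reproduced here, and the method names are my own):

    import java.util.Random;

    final class RelevanceModels {
        private final Random random = new Random();

        /** Uniform model: relevance drawn uniformly from [0, 1]. */
        double uniformRelevance() {
            return random.nextDouble();
        }

        /** Normal model: relevance drawn from a normal distribution with illustrative parameters. */
        double normalRelevance(double mean, double stdDev) {
            return mean + stdDev * random.nextGaussian();
        }

        /** Exponential model: a skewed distribution where the largest drawn value tends to stand out. */
        double exponentialRelevance(double rate) {
            // Inverse transform sampling of an exponential distribution with the given rate.
            return -Math.log(1.0 - random.nextDouble()) / rate;
        }
    }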

Global/individual models Depending on the search context, document relevance values may or may not vary as a consequence of user characteristics. In effect, this means that we cannot always assume that all users act on the basis of the same relevance model. Conversely, it seems excessively drastic to assume that there is such a high discrepancy among the individual relevance models as to warrant the assumption that they are independent. To account for both of these possibilities, three assumptions about the nature of the relevance models have been explored:


Table 2.2. The different parameters and their respective values.

Click strategy: Cascade model; Perfect user; Top ranked among top relevant; Top relevant, decreasing with rank
Relevance model: Normal distribution; Uniform distribution; Exponential distribution
Relevance model scope: Global; Individual; Noisy global
Random click strategy: Random click in top five relevant; Random click in top five ranked; Random click in entire rank
Initial rank assumption: All documents have some relevance; Only ten of the documents have non-zero relevance values

• Global relevance models, for which all users in a scenario act based on the same relevance model.

• Individual relevance models, meaning that each user acts according to their own (independent) relevance model.

• Noisy relevance models, which are based on global relevance models, but where every user has individual random noise added to the relevance values.

Assumptions about initial rank


2.5

Evaluation of rank correctness

Evaluations will be performed on the results of an online learner (which has learned its rankings from collected live data) as well as on the results of simulated user experiments. Given that the underlying relevance model is not available to us in the real world scenario, however, we must use other means of evaluating these results.

2.5.1 Evaluating an online learner when the true rank is unknown

In order to evaluate the properties of the learner, a statistical analysis of the entirety of the collected click data will be performed in order to produce a ground truth ranking - in other words, underlying rankings that maximize the likelihood of producing the click behavior we have observed. We will therefore use the entirety of the collected user click data in order to estimate the most likely relevance values for each of the documents. The values of this relevance estimate can then be compared to the values the online learner acquired incrementally during the course of the experiment.

This kind of estimate is called a Maximum Likelihood (ML) estimate, and is defined as the parameter estimation that maximizes the value of the likelihood function. In a Bradley-Terry model, which is the model providing the basis for our pairwise comparison judgments (see section 2.1), the log-likelihood function can be specified as follows[14]∗:

$$l(\mu) = \sum_{i=1}^{m} \sum_{j=1}^{m} \left[ w_{ij} \ln \mu_i - w_{ij} \ln(\mu_i + \mu_j) \right]$$

where w_ij denotes the number of times document i has been deemed more relevant than document j by a user. Given this log-likelihood function, and a few assumptions about our data, it has been shown [14] that the following iterative algorithm can be employed in order to calculate the parameter values of our Bradley-Terry model:

$$\mu_i^{k+1} = W_i \left( \sum_{j \neq i} \frac{N_{ij}}{\mu_i^k + \mu_j^k} \right)^{-1}$$

where W_i denotes the total number of times document i has been deemed more relevant than another document, and N_ij denotes the total number of comparative judgments between document i and document j.
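A minimal sketch of this fixed-point iteration (illustrative only: the starting point, the fixed iteration count and the normalization step are my own choices, and the connectivity condition from [14] discussed below must hold for the estimate to be well defined):

    import java.util.Arrays;

    final class BradleyTerryMLE {
        /**
         * wins[i][j] is the number of times document i was preferred over document j.
         * Returns the estimated Bradley-Terry relevance parameters mu.
         */
        static double[] estimate(int[][] wins, int iterations) {
            int m = wins.length;
            double[] mu = new double[m];
            Arrays.fill(mu, 1.0); // uniform starting point

            for (int iter = 0; iter < iterations; iter++) {
                double[] next = new double[m];
                for (int i = 0; i < m; i++) {
                    double totalWins = 0.0;   // W_i
                    double denominator = 0.0; // sum over j != i of N_ij / (mu_i + mu_j)
                    for (int j = 0; j < m; j++) {
                        if (j == i) continue;
                        int nij = wins[i][j] + wins[j][i];
                        totalWins += wins[i][j];
                        if (nij > 0) {
                            denominator += nij / (mu[i] + mu[j]);
                        }
                    }
                    next[i] = denominator > 0 ? totalWins / denominator : mu[i];
                }
                // The model is invariant to scaling, so normalize to keep the values from drifting.
                double sum = 0.0;
                for (double v : next) {
                    sum += v;
                }
                for (int i = 0; i < m; i++) {
                    mu[i] = m * next[i] / sum;
                }
            }
            return mu;
        }
    }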

In order to be able to calculate the ML estimate with the algorithm given above, it has been shown[14] that we must first assume that ”[i]n every possible partition of the individuals into two nonempty subsets, some individual in the second set beats some individual in the first set at least once” - otherwise we cannot find any parameter configuration that maximizes the log-likelihood. This means that if we were to represent our multi-document intercomparisons as a directed graph, where an edge from document i to document j denotes a recorded user preference for i over j, there must be some path from each node in the graph to every other node in the graph. In other words - the graph has to be a strongly connected component (SCC) (see figure 2.2) if we are to be able to calculate any Maximum Likelihood estimates.

Figure 2.2. Graph where strongly connected components are marked. Digital image. Wikipedia, the Free Encyclopedia. N.p., 16 Feb. 2006. Web. 1 Apr. 2013. http://en.wikipedia.org/wiki/File:Scc.png.

This assumption poses a problem in the realm of user-collected data, as we are not able to force users to specify pairwise preferences for certain documents over others. The consequence of this is that in order to calculate the parameter estimates, we must first split the intercomparison graph into several strongly connected components, and then run the iterative parameter estimation algorithm on each of the SCCs separately. Consequently, we will have produced a number of rankings on several disjoint subsets of the intercomparison graph, and we will have to compare these estimated rankings to their corresponding subsets of the relevance parameters that were produced from the online learning phase.

These subset rankings are then compared to the actual incrementally learned and presented ranks using the Kendall Tau Ranking Correlation Coefficient (a metric for rank comparisons that describes how many element swaps are necessary to transform one ranking into another)[15, 16], in order to evaluate both the learning mechanism and the possible usability drawbacks of using active presentation.
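For reference, the coefficient for two rankings of the same n elements can be written in its standard form [15, 16] as

$$\tau = \frac{n_c - n_d}{n(n-1)/2}$$

where n_c is the number of concordant pairs and n_d the number of discordant pairs, so that τ = 1 for identical rankings, τ = −1 for completely reversed rankings, and values around 0 indicate unrelated rankings.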

2.5.2 Evaluating user simulation experiments where the true rank is known


2.5.3 Evaluating the validity of the hypotheses

In order to evaluate the validity of Research question 1∗, we will have to make use of the data from the structured simulation experiments, as the live experiment does not account for more than one search context and as such does not provide any empirical basis from which to make comparisons.

The Kendall Tau Ranking Correlation Coefficient provides an excellent metric for gauging the validity of Hypothesis 1†, both in the live experiment and in the simulated experiments.

Hypothesis 2‡ can also be evaluated both within the realm of the live experiment and the simulated experiments, by studying the search logs of the search engine and ascertaining whether it seems plausible that users of the search engine are satisfied with the results of their queries. If user satisfaction is high, we should be able to see several trends in the search engine data:

• Users frequently click documents in the document rank returned to them after posing a query. §

• The clicks of the search engine users are commonly distributed among the top ranked documents in the document rank.¶

These assumptions will be evaluated by studying the search and click logs of the search engines at the end of the experiments.

Research question 1: In what search contexts does the proposed solution appear to be suitable?

Hypothesis 1: The proposed solution learns document ranks that are closer to the underlying true ranks than the initial ranks proposed by the search engine.

Hypothesis 2: The proposed solution provides a higher rate of usability than that of a standard, non-learning search engine.

§ This corresponds to situations where user satisfaction with the returned document rank is high enough so that a suitable document can easily be found.


Part II

Method


Chapter 3

Implementation

3.1

Frameworks for static search

For ease of implementation, existing software and frameworks are used as a basis for the implementation of the algorithms of this thesis.

Solr The open source enterprise search engine platform Solr∗ is used for indexing and storing the document collection.

The Solr search engine has been configured to take into account any possible phonetic variations in the user queries, which has the effect that the search engine produces a larger set of results that often do not closely match the query posed by the user. This rather unfavorable search situation leaves much room for improvement, which is desirable when assessing the capabilities of a machine learning algorithm.

The phonetic search functionality is achieved by adding a phonetic analyzer [17] (both at index time and query time) to the indexed text fields in the Solr search engine, encoding both the query and the documents with the DoubleMetaphone algorithm[18]. The DoubleMetaphone algorithm is the second generation of the original phonetic encoding algorithm Metaphone, with the added advantage of being more suited towards international (and not just English) phonemes.
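To illustrate what the phonetic encoding does, the snippet below uses the DoubleMetaphone implementation from Apache Commons Codec, a common Java implementation of the algorithm; it only approximates the index- and query-time behavior described above and is not part of the thesis code:

    import org.apache.commons.codec.language.DoubleMetaphone;

    final class PhoneticDemo {
        public static void main(String[] args) {
            DoubleMetaphone encoder = new DoubleMetaphone();
            // Differently spelled but similar-sounding names tend to receive the same code,
            // which is why a misspelled query can still match the indexed document.
            System.out.println(encoder.doubleMetaphone("Eriksson"));
            System.out.println(encoder.doubleMetaphone("Ericsson"));
        }
    }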

Jellyfish Jellyfish is a layer of abstraction, developed by Findwise, residing between the user interface and the underlying search engine. This layer of abstraction, running as a webapp on the server of choice, greatly simplifies and streamlines all communication between the GUI and the search engine. It provides several standardized communication protocols, and is an efficient way of manipulating both user queries and result sets retrieved from the underlying search engine. The bulk of the project-specific software is implemented in Java as part of either the query or result pipelines in the Jellyfish abstraction layer.


Logging servlet Also developed by Findwise, the logging servlet provides a layer of communication between the client and the logging database. Parameters are URL encoded in the browser whenever a user submits a query or clicks a document link, at which point the request is sent to the servlet, which then parses the parameters before sending them to the database.

3.2

Software architecture for online learning implementation

The user-search engine communication is carried out via a basic web interface. This interface is an adaptation of a standard Jellyfish search template, including a JSP GUI and a click/query logging JavaScript service. The GUI is responsible for sending the user query on to the underlying Jellyfish webservice (which in turn sends it to Solr, where matching documents are extracted) as well as using JavaScript to log user clicks and queries via the logging servlet.


3.2.1 Glicko learner

The set of documents retrieved in response to the user query is processed through the Glicko learner - implemented as a part of the Jellyfish webservice. The Glicko learner then uses the document set and relevant ranking parameters retrieved from a database in order to produce a learned rank, which is then sent as an ordered list of documents to the Sigma presenter (also implemented as a part of the Jellyfish webservice). In order to achieve this, the Glicko learner is implemented as a Java component in both the query and result pipeline of the Jellyfish webservice, using database stored ranking parameters along with new user clicks in order to update and output the learned rank.

3.2.2 Sigma presenter

The Sigma presenter, in turn, produces a rank that is to be presented to the user, based on information passed on by the Glicko learner. As such, the Sigma presenter implementation also resides in the Jellyfish webservice, in the form of a Java component immediately following the Glicko learner. The Sigma presenter is thus supplied with a learned rank from the Glicko learner, along with uncertainty parameters denoting the currently estimated variance of the document relevance values, in order to produce its presented ranks.

When the user has been presented with the rank and made their choice as to which document abstract seems to be the most fitting to their information need, their click (along with the preceding query) is registered in a click logging database. The click logs are used to update the ranking parameters in the ranking database - and so the cycle of learning continues.

3.3

Data collection

The data set forming the basis for the self-learning search engine consists of a collection of user profiles of faculty at the Royal Institute of Technology (KTH). These profiles are available for browsing at http://www.kth.se/directory/, and are supplied with a good number of meta-information tags, making the data easily retrievable. For the purposes of this thesis, a node.js script using the scraper module [19] was constructed, outputting structured XML documents (so as to be able to easily index the documents in Solr).

On the client side, logging of user clicks and queries is implemented via JavaScript in the Jellyfish GUI. Whenever a search event∗ or a click event † occurs, the relevant parameters (such as link destination, query string and presented document ranking) are stored in a logging database for future retrieval, by communicating with the logging servlet described in section 3.1.

The event when a user submits a query to the search engine.


3.4

User base

Users are directed to the search engine through directed social media promotion, which should have the consequence of most users being current or former students at the Royal Institute of Technology. This directed approach is taken to ensure that users have some prior knowledge of the document corpus, which in turn should lead to less noisy clicks.

3.5

Simulations

To carry out the simulation experiments detailed in section 2.4, user behavior models have been implemented in Java. The SimulatedUser class is responsible for approximating the behavior of a given type of user (with a given click strategy and relevance model), whereas the UserScenarioSimulator class is responsible for creating a set of users and continuously prompting the created users to carry out appropriate search-and-click actions.

The run() method in the SimulatedUser class does one of two things: searches for a random query in a predefined set of queries, or clicks a document in a rank that was retrieved from an earlier search. With probability ε = 0.15∗, the user clicks a random document drawn from a subset of the document rank, depending on the previously defined random click strategy (see section 2.4.1). The splitting up of the run() method into two separate atomic methods (search() and click()) is justified by the added level of realism that this interleaving of actions brings to the experiments.

This value of ε is inspired by the damping factor of Google’s PageRank, where d = 0.15 denotes the probability of the random surfer jumping to a random page instead of following a link.
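A rough sketch of the interleaving described above (the class outline mirrors the description of SimulatedUser, but the query set is taken from the footnote in section 5.1, the 50/50 choice between searching and clicking is my own assumption, and the method bodies are stubs):

    import java.util.List;
    import java.util.Random;

    final class SimulatedUserSketch {
        private final Random random = new Random();
        private final List<String> queries = List.of("lars", "mikael", "anders");
        private List<String> pendingRank = null; // rank retrieved by the last search, not yet clicked

        /** Performs one atomic action: either a search, or a click on a previously retrieved rank. */
        void run() {
            if (pendingRank == null || random.nextBoolean()) {
                search(queries.get(random.nextInt(queries.size())));
            } else {
                click(pendingRank);
            }
        }

        private void search(String query) {
            // In the real implementation this would query the search engine and store the returned rank.
            pendingRank = List.of("doc-1", "doc-2", "doc-3");
        }

        private void click(List<String> rank) {
            // In the real implementation this would apply the click strategy (possibly replaced by an
            // epsilon-random click) and log the resulting click event.
            pendingRank = null;
        }
    }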


Part III


Chapter 4

Live experiment results

The live experiment was set up using the frameworks described in 3.1. The interface through which the users interacted with the search engine was a simple web page featuring a short explanation of the project and a search field. The results from the user queries were presented in list form, with thumbnail pictures of each retrieved faculty member (if available). See figure 4.1 for an example screenshot.

Figure 4.1. Screenshot from the working implementation.


Figure 4.2 shows a sample of the logs collected during the live experiment, with the column clicked showing what document the user elected to click after having posed their query (NULL if no document was clicked). The rank column contains the ordered list of IDs of the documents that were shown to the user, and the query column contains the user query.

Figure 4.2. Sample screenshot of the live collected queries.

Because of the small amount of click data collected, the Kendall Tau RCC values have not been possible to calculate for the live experiment, which also means that the validity of Hypothesis 1∗ has not been evaluated for this experiment.

When it comes to Hypothesis 2†, a study of the search engine logs reveals that the majority of the user queries did not have corresponding user clicks. As previously stated, this might be due to the lack of information need on the part of the users, but it might also indicate a general dissatisfaction with the search results. For the user queries with corresponding user clicks, the indices of the clicked documents have been plotted in figure 4.3. Each time step on the x-axis of the graph corresponds to one user click.

Hypothesis 1: The proposed solution learns document ranks that are closer to the underlying true ranks than the initial ranks proposed by the search engine.


Chapter 5

Simulation results

5.1

Simulation overview

As explained in section 2.4, user simulations have been implemented based on an array of different assumptions about user behaviour and document relevance values. The combinations of these experiment parameters give us 216 different experiments to carry out.

Each of these parameter combinations has been simulated using three different possible queries that the simulated users submit searches for∗, yielding 648 distinct experiments in total. The small number of possible queries might not correspond completely to a realistic search context, but is necessary in order to ensure that the simulated click data is not spread too thinly across numerous queries. For an overview of the different parameters and their respective values, see table 2.2.

5.2

Kendall Tau Ranking Correlation Coefficient

The Kendall Tau ranking correlation coefficient† between the underlying true ranks and the learned ranks has been plotted for all experiments‡.

In order to ascertain whether the learned ranks outperform the ranks initially provided by the Solr search engine, and to evaluate the validity of Hypothesis 1§, the Kendall Tau RCC values for the difference between the initial ranks and the underlying true ranks have also been calculated.

The distribution of these values are illustrated as box plots in the following figures.

"lars", "mikael" and "anders"

With −1 denoting maximum dissimilarity, 1 denoting maximum similarity and 0 denoting no corre-lation between ranks.

216 experiments per relevance distribution (648 in total) were carried out. See Appendix (chapter 6.6) for individual Kendall Tau RCC values.

Hypothesis 1: The proposed solution learns document ranks that are closer to the underlying true ranks than the initial ranks proposed by the search engine.


When the data points are separated by underlying document relevance model as in figure 5.1, it is apparent that all three relevance models seem to produce similar results, with the majority of the experiments yielding positive (albeit near-zero) Kendall Tau RCC values. The Kendall Tau RCC values of the baseline ranks, as shown in figure 5.2, are centered around 0, which tells us that they are generally unrelated to the underlying true ranks.

When separating the data points by user click strategy, a similar picture emerges (see figure 5.3), with no strategy being exceptionally better or worse than the others. Again, the Kendall Tau RCC values of the initial ranks are largely homogeneous and centered around 0 (see figure 5.4).

There are more dimensions to the data producing the Kendall Tau RCC values that merit some further exploration.

When we separate the data points by the assumption about the spread of the relevance distributions as outlined in section 2.4.2, a clearer picture emerges. When the relevance distribution is restricted to only encompass ten of the documents in the document set, the Kendall Tau RCC values have a higher tendency to reach values near 1 (see figure 5.5). Simultaneously, this assumption leads to a larger spread in the Kendall Tau RCC values, with a much higher number of values being located far below zero. In general, though, the median of the values lies clearly above the median of the values produced by the models with a larger document spread.

Compared with the Kendall Tau RCC values of the initial ranks organized by initial rank assumption (see figure 5.6), we see that the spread is yet again larger for the experiments with a relevance distribution restricted to ten documents - although the median values yet again lie very close to zero for all experiments.

When further studying the results of the experiments run under the assumption that the relevance distribution only encompasses ten of the documents in the rank, we find a slight difference in the average Kendall Tau RCC values depending on the scope of the relevance model. When the relevance model is global - meaning that all users share the same underlying relevance beliefs - the mean value of the Kendall Tau RCC is slightly higher than that of a completely individual or noisy relevance model setting (see figure 5.7). Again, as shown in figure 5.8, the initial rank Kendall Tau RCC values are centered around 0, denoting very little correlation with the true ranks.

5.3

Clicked indices over time


If the learner is working as intended, users should never have to stray far down in the rank in order to find documents that they perceive as relevant. By studying the clicked indices over time, we can get a sense of the validity of Research question 1∗.

5.3.1 Click strategy

Figures 5.9, 5.10, 5.11 and 5.12 demonstrate the change in clicked indices over time for all experiments organized according to click strategy. The mean index clicked in each time step has been plotted in black, while the individual experiments have been plotted in grey, superimposed over each other. Each time step corresponds to one simulated click.

5.3.2 Relevance model

Figures 5.13, 5.14 and 5.15 demonstrate the change in clicked indices over time for all experiments organized according to document relevance model. The mean index clicked in each time step has been plotted in black, while the individual experiments have been plotted in grey, superimposed over each other.


Chapter 6

Conclusions

6.1

Rank correctness

Although the live experiment did not yield any useful results from which to draw conclusions about rank correctness, the simulated experiments demonstrated the strengths and weaknesses of the Glicko learner and Sigma presenter to a certain extent. As shown in figure 5.5, there is a notable difference in Kendall Tau RCC median values between the experiments with a broader and a narrower relevance model spread. The median value from the experiments with a narrower relevance model scope is clearly closer to 1 than that of the experiments with the relevance model spreading across all documents, leading to the tentative conclusion that the Glicko learner might be more suited to search contexts in which fewer documents have acceptable relevance values.

Furthermore, as shown in figure 5.7, the scope of the relevance model appears to have an influence on the correctness of the learned rank, with a global relevance model scope producing the best results. It appears as though the proposed solution fares best in a search situation where users are likely to share a common view of the relevance values of the documents.

6.2

Properties of the learner

As previously discussed in section 5.3, the clicked indices over time can be seen as a measure of how much the Glicko learner is learning at a given point in time. By studying these values visually, it appears as though the learning rate of the Glicko learner generally decreases over time for all user models, with the Cascade user click model being a notable exception. The reason for this might be that users acting according to the Cascade model generally are very conservative when considering a retrieved document rank, and very unlikely to click documents that appear slightly lower in the rank - no matter their actual relevance value.


In figure 6.1, the clicked indices over time have been plotted for a particular experiment in which users click the most relevant document at all times with probability 1 − ε = 0.85, and a random document drawn from the top five relevant documents with probability ε = 0.15. The relevance model only distributes non-zero relevance values to ten of the documents in the rank - the ten initially highest ranked documents. The bottom line is that the simulated users in this experiment always click on documents that they perceive as highly relevant.

Figure 6.1. Clicked indices over time for the experiment with an exponentially distributed document relevance model, a relevance model distribution spanning ten of the documents in the document set, a ”perfect user” click strategy and random clicks being drawn from the set of top relevant documents. The relevance model is global and not noisy.


6.2.1 Favoring unranked documents

Figure 6.2 demonstrates a scenario in which the Glicko learner has registered one user click on document number two in the rank. All unranked documents start off with an initial relevance value of µ = 1500. After having played their first game against another unranked opponent document, they either end up with a new relevance value µ∗ > 1500 or µ∗ < 1500, depending on whether they win or lose the game.

Figure 6.2. The rank transformation after one click has been registered in the initial rank – the bolded document being the clicked document.


This behavior is likely to confuse users and to harm the user experience in general. A potential remedy to this problem is to make sure to never rank documents that have no implicit relevance judgments above documents that do – although this might also have the unwanted side effect of pushing unrated documents too far down in the document rank.

6.3

Sources of error

As described in section 4, the results of the live experiment have not yielded any insights as to the properties of the Glicko learner. This is most likely a result of the circumstances of the experiment. Since the live experiment was not supervised or guided in any way, users had no real information need to satisfy through usage of the learning search engine, which led to a low rate of clicks and a low rate of repeat queries.

Furthermore, the user simulation experiments were based on simplified models of user behavior, which by definition cannot completely accurately represent the actions of a human user. The assumptions these models are based on do have a scientific basis, but may still not produce results that are guaranteed to be realistic. The relevance models were also simplified, which may have led to click behavior unrepresentative of that of a human user.

6.4

Final thoughts

In conclusion, the solution proposed and implemented within the scope of this thesis does appear to be better suited towards certain types of search contexts (providing us with a tentative answer to Research question 1∗) – namely contexts for which the user base shares a common view of the relevance values of the documents. Furthermore, the Kendall Tau RCC values of the learned ranks – although far from ideal – have been shown to be decidedly better than those of the initial ranks. Based on these findings, we cannot reject Hypothesis 1†.

As it stands, it would be foolish to say that the proposed solution would categorically improve the user experience of a user searching for information, as document ranks change in a way that does not seem completely intuitive to a user (see section 6.2.1 for an example of this type of behavior). The learner seems to suffer from a range of undesirable consequences, and it is unclear if the benefits outweigh the drawbacks at this point. With this in mind, Hypothesis 2‡ has to be rejected.

Given that the solution does manage to learn the underlying ranks to a certain extent, it is safe to say that more research is needed in order to establish the optimal learner configuration.

Research question 1: In what search contexts does the proposed solution appear to be suitable?

Hypothesis 1: The proposed solution learns document ranks that are closer to the underlying true ranks than the initial ranks proposed by the search engine.

Hypothesis 2: The proposed solution provides a higher rate of usability than that of a standard, non-learning search engine.


6.5

Future work

6.5.1 Suggestions for future research

The following list contains suggestions for improvements on the research conducted within the scope of this thesis, both regarding implementation and evaluation.

• Use a cool-down period for the Sigma presenter, making it less and less likely to present highly uncertain documents as learning continues. This should have a positive effect on usability.

• Make sure that unranked documents are not ranked above ranked documents after a certain period of time, so as to counteract the effect outlined in section 6.2.1.

• Conduct directed experiments with several groups of human users so as to simulate an actual information need, in order to avoid the problems described in section 4.


Part IV

Appendix



6.6

Appendix 1: Kendall Tau Ranking Correlation Coefficient Values


6.7

Appendix 2: Live experiment user logs

The following log shows all search and click events processed by the implemented solution during the scope of the live experiment. There are three columns shown in the following data: clicked index (”-” if no document was clicked), user query, and the phonetic encoding of the user query. The logs are shown in order from first to last.

-   Marcus Dicander   MRKS
-   Viggo Kann   FKKN
0   Viggo   FK
0   mats bejhem   MTSP
1   Hedvig   HTFK
-   Säkerhetschef   SKRT
-   Väktare   FKTR
-   Lena   LN
2   dillian   TLN
-   Henrik eriksson   HNRK
-   lennart   LNRT
24  roy   R
3   gärdenäs   KRTN
3   romero   RMR
-   roth   R0
-   falk   FLK
2   maria   MR
-   meidi   MT
-   internationali   ANTR
6   internationali patrik   ANTR
0   göran manneberg   KRNM
0   Hedvig   HTFK
-   fuuu   F
-   henrik   HNRK
-   henrik eriksson   HNRK
-   Henrik ericsson   HNRK
0   viggo kann   FKKN
13  ale-   ALKS
0   minock   MNK
-   michael minock   MKLM
0   olof heden   ALFT
-   ale-   ALKS
-   andre   ANTR


Bibliography

[1] Robert S. Taylor. The process of asking questions. American Documentation, pages 391–396, March 1962. ISSN 1075-2838. URL http://zaphod.mindlab.umd.edu/ docSeminar/pdfs/16863553.pdf.

[2] Wikipedia. tf-idf — Wikipedia, the free encyclopedia, 2012. URL http://en. wikipedia.org/wiki/Tf-idf. [Online; retrieved October 31st].

[3] Bernard J. Jansen, Amanda Spink, Chris Blakely, and Sherry Koshman. Defining a session on web search engines: Research articles. J. Am. Soc. Inf. Sci. Technol., 58(6):862–871, April 2007. ISSN 1532-2882. doi: 10.1002/asi.v58:6. URL http: //dx.doi.org/10.1002/asi.v58:6.

[4] Wikipedia. Relevance feedback — Wikipedia, the free encyclopedia, 2013. URL http://en.wikipedia.org/wiki/Relevance_feedback. [Online; retrieved Octo-ber 31st].

[5] Thorsten Joachims, Laura Granka, Bing Pan, Helene Hembrooke, Filip Radlinski, and Geri Gay. Evaluating the accuracy of implicit feedback from clicks and query reformulations in web search. ACM Trans. Inf. Syst., 25(2), April 2007. ISSN 1046-8188. doi: 10.1145/1229179.1229181. URL http://cs303.stanford.edu/papers/ joachims_etal_07a.pdf.

[6] Thorsten Joachims, Laura Granka, Bing Pan, Helene Hembrooke, and Geri Gay. Accurately interpreting clickthrough data as implicit feedback. In Proceedings of the 28th annual international ACM SIGIR conference on Research and development in information retrieval, SIGIR ’05, pages 154–161, New York, NY, USA, 2005. ACM. ISBN 1-59593-034-5. doi: 10.1145/1076034.1076063. URL http://www.cs.cornell. edu/people/tj/publications/joachims_etal_05a.pdf.

[7] Filip Radlinski and Thorsten Joachims. Active exploration for learning rankings from clickthrough data. In Proceedings of the 13th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, KDD '07, New York, NY, USA, 2007. ACM.

[8] Johan Tiden. Active exploration to improve learning rate from click through data. Master’s thesis, Royal Institute of Technology, 2012. URL http://www.nada.kth.se/utbildning/grukth/exjobb/rapportlistor/2012/ rapporter12/tiden_johan_12044.pdf.

[9] Findwise AB. Findability by findwise. URL http://www.findwise.com/about/ findability-findwise. [Online; retrieved October 31st].

[10] Mark E. Glickman. Parameter estimation in large dynamic paired comparison experiments, 1999. URL http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.151.1706&rep=rep1&type=pdf.

[11] Wikipedia. Online machine learning — Wikipedia, the free encyclopedia, 2013. URL http://en.wikipedia.org/wiki/Online_machine_learning. [Online; re-trieved October 31st].

[12] Nick Craswell, Onno Zoeter, Michael Taylor, and Bill Ramsey. An experimental comparison of click position-bias models. In Proceedings of the 2008 International Conference on Web Search and Data Mining, WSDM ’08, pages 87–94, New York, NY, USA, 2008. ACM. ISBN 978-1-59593-927-2. doi: 10.1145/1341531.1341545. URL http://doi.acm.org/10.1145/1341531.1341545.

[13] Weizhu Chen, Dong Wang, Yuchen Zhang, Zheng Chen, Adish Singla, and Qiang Yang. A noise-aware click model for web search. In Proceedings of the fifth ACM international conference on Web search and data mining, WSDM ’12, pages 313–322, New York, NY, USA, 2012. ACM. ISBN 978-1-4503-0747-5. doi: 10.1145/2124295. 2124335. URL http://doi.acm.org/10.1145/2124295.2124335.

[14] David R. Hunter. MM algorithms for generalized Bradley-Terry models. The Annals of Statistics, 32(1):384–406, 2004.

[15] Wikipedia. Kendall tau rank correlation coefficient — Wikipedia, the free encyclopedia, 2013. URL http://en.wikipedia.org/wiki/Kendall_tau_rank_ correlation_coefficient. [Online; retrieved October 31st].

[16] M. G. Kendall. A new measure of rank correlation. Biometrika, 30(1/2):pp. 81–93, 1938. ISSN 00063444. URL http://www.jstor.org/stable/2332226.

[17] Apache Solr. Solr wiki. URL http://wiki.apache.org/solr/. [Online; retrieved October 31st].

[18] Lawrence Philips. The double metaphone search algorithm. C/C++ Users J., 18(6): 38–43, June 2000. ISSN 1075-2838. URL http://dl.acm.org/citation.cfm?id= 349124.349132.
