Latent variable neural click models for web search

HENRIK SVEBRANT

KTH ROYAL INSTITUTE OF TECHNOLOGY

SCHOOL OF ELECTRICAL ENGINEERING AND COMPUTER SCIENCE


Latent variable neural click models for web search

HENRIK SVEBRANT

Master in Computer Science

Date: July 6, 2018

Supervisor: Jeanette Hällgren Kotaleski

Examiner: Elena Troubitsyna

Swedish title: Neurala klickmodeller med latenta variabler för webbsöksystem

School of Computer Science and Communication


Abstract

User click modeling in web search is most commonly done through probabilistic graphical models. Given the successful use of machine learning techniques in other fields of research, it is interesting to evaluate how machine learning can be applied to click modeling. In this thesis, modeling is done using recurrent neural networks trained on a distributed representation of the state-of-the-art user browsing model (UBM). It is further evaluated how extending this representation with a set of latent variables that are easily derivable from click logs can affect the model's prediction performance.

Results show that a model using the original representation does not perform very well. However, the inclusion of simple variables can drastically increase performance on the click prediction task, for which the model manages to outperform the two chosen baseline models, which are themselves already well performing. It also leads to increased performance on the relevance prediction task, although the results are not as significant. It can be argued that the relevance prediction task is not a fair comparison to the baseline models, since they need significantly larger amounts of data to learn the respective probabilities. However, it is favorable that the neural models manage to perform quite well using smaller amounts of data.

It would be interesting to see how well such models would perform when trained on far greater data quantities than what was used in this project, as well as tailoring the model for the use of LSTM, which could presumably increase performance even further. Evaluating other representations than the one used would also be of interest, as this representation did not perform remarkably on its own.


Sammanfattning

User click modeling in search systems is usually done with the help of probabilistic models. Due to the successes of machine learning in other areas, it is interesting to investigate how these techniques can be applied to click modeling. This thesis investigates click modeling using recurrent neural networks trained on a distributed representation of a popular and well-performing click model called the user browsing model (UBM). It is further investigated how extending this representation with statistical variables that can easily be extracted from click logs affects the model's performance.

The results show that the base representation does not perform particularly well. However, the use of simple variables has been shown to yield drastic performance increases when it comes to predicting a user's clicks. For this purpose, the models manage to outperform the two chosen baseline models, which are already well performing for the task. They have also managed to improve the models' ability to predict relevance, although the differences are not as drastic. Relevance does not constitute as fair a comparison against the baseline models, since these require much larger amounts of data to reach their true performance. It is, however, advantageous that the neural models reach relatively good performance for the amount of data used.

It would be interesting to investigate how these models would perform when trained on much larger amounts of data than what was used in this project, as well as tailoring the models for LSTM, which should be able to increase performance further. Evaluating other representations than the one used in this project is also of interest, since the representation used did not perform remarkably in its base form.


The following notation is used and referred to throughout the content of this thesis.

SERP Search Engine Results Page
PGM Probabilistic Graphical Model
CTR Click-Through Rate
UBM User Browsing Model
PSCM Partially Sequential Click Model
DR Distributed Representation
NDCG Normalized Discounted Cumulative Gain
ANN Artificial Neural Network
RNN Recurrent Neural Network
LSTM Long Short-Term Memory

E a user examines an object on a SERP
A a user is attracted by the object's representation
C an object is clicked
S a user's information need is satisfied
q query representation
u or d document representation
r rank position within a SERP


Contents

1 Introduction
1.1 Definitions
1.1.1 Web search engine
1.1.2 Search engine results page
1.1.3 Click log
1.1.4 Click models
1.1.5 Distributed representation
1.2 Problem definition
1.3 Delimitations
1.4 Ethics and sustainability
1.5 Thesis outline

2 Background
2.1 Click models
2.1.1 Random click model
2.1.2 Position based model
2.1.3 Cascade model
2.1.4 User browsing model
2.1.5 Dynamic Bayesian network model
2.1.6 Partially sequential click model
2.2 Artificial neural networks
2.3 Feedforward neural networks
2.4 Recurrent neural networks
2.4.1 Long short-term memory
2.5 Deep learning in practice
2.5.1 The learning process
2.5.2 Optimization
2.5.3 Regularization
2.6 Click model evaluation
2.6.1 Perplexity
2.6.2 Normalized discounted cumulative gain

3 Related Work
3.1 Neural models
3.2 Latent variable model
3.3 This project

4 Methodology
4.1 Dataset
4.2 Data processing
4.3 Baseline models
4.4 Distributionally represented user browsing model
4.5 Neural network configurations
4.5.1 Recurrent neural network
4.5.2 Long short-term memory
4.6 Learning on distributed models
4.6.1 Optimization of learning
4.6.2 Hyper-parameter selection
4.7 Latent variables
4.8 Distributed representations with latent variables
4.9 Evaluation methodology
4.9.1 Click prediction
4.9.2 Relevance prediction
4.10 Experimental setup
4.10.1 Research questions
4.10.2 Evaluation methodology

5 Results
5.1 Baseline click models
5.2 Neural click models
5.3 Click prediction results
5.3.1 Original representation
5.3.2 Extended representations
5.4 Relevance prediction results
5.4.1 Experiment results
5.4.2 Comparison of single-variable models
5.4.3 Comparison of two-variable models
5.5 Model comparisons
5.6 Long short-term memory performance

6 Discussion
6.1 Summary of findings
6.2 Future work

7 Conclusion

Bibliography

A Prediction accuracy results

B Perplexity results
B.1 Single-variable models
B.2 Two-variable models

C NDCG results


List of Figures

2.1 An example of a feedforward neural network with three input nodes, one hidden layer with four nodes and two output nodes.
2.2 An example of a recurrent neural network with three input nodes, one hidden layer with two nodes and two output nodes. The output layer is now connected to the hidden layer using feedback connections.
2.3 Early stopping. Training is interrupted when validation error starts to increase.
4.1 RNN model configuration. [3]
5.1 Average perplexity comparison between the original representation and single-variable variations.
5.2 Perplexity comparison between the original representation and single-variable variations, for ranks 1-5.
5.3 Perplexity comparison between the original representation and single-variable variations, for ranks 6-10.
5.4 Average perplexity comparison between the original representation and our two-variable variations.
5.5 Perplexity comparison between the original representation and our two-variable variations, for ranks 1-5.
5.6 Perplexity comparison between the original representation and our two-variable variations, for ranks 6-10.
5.7 NDCG comparison between the original representation and our one-variable variations.
5.8 NDCG comparison between the original representation and our two-variable variations.
5.9 Average perplexity comparison between the original representation and our best performing variations.
5.10 Average perplexity comparison between the baseline, original distributed and the two best variable models.
5.11 NDCG comparison between the original representation and our best variable variations.
5.12 Average perplexity comparison between the original representation and our best performing variations now tested with LSTM. Notation has been shortened: i = impressions, d = dwell.
5.13 NDCG comparison between the original representation and our best performing variations now tested with LSTM. Notation has been shortened: i = impressions, d = dwell.
5.14 Average perplexity comparison between the original representation and the three-variable variant as RNN and LSTM configurations.
5.15 NDCG comparison between the original representation and the three-variable variant using RNN and LSTM configurations.


List of Tables

4.1 Statistical Query-Document features. Generated for a document d on SERPs of a query q.
4.2 Statistical Query features. Generated for a query q.
4.3 Statistical Document features. Generated for the individual document d, irrespective of queries.
5.1 Baseline UBM click prediction results.
5.2 Baseline PSCM click prediction results.
5.3 Baseline models relevance prediction results.
5.4 Model parameters used for our experiments.
5.5 Click prediction results for the original distributed model.
5.6 One-variable model experiments. The variables are derived from query-document pairs.
5.7 Two-variable model experiments. The variables are derived from query-document pairs.
5.8 Click prediction results using query-document impressions, click-through rate and average dwelling time.
5.9 NDCG scores for the original representation and our variable variations. Results marked in bold highlight the best performing variations found in a variable set.
5.10 Perplexity gains of DR_dwell and DR_impr,CTR,dwell over our baseline and base representation models.
5.11 NDCG paired t-test results calculated per query session. Measuring significance levels of DR_dwell and DR_impressions,CTR,dwell against DR.
5.12 NDCG paired t-test results calculated per query session. Measuring significance levels of DR_lstm2_impr,CTR,dwell against DR and DR_lstm.
5.13 Perplexity gains of DR_lstm2_impr,CTR,dwell over our baselines and original representation variations.
A.1 Neural models prediction accuracy on the test data.
B.1 Click prediction results using query-document impressions.
B.2 Click prediction results using query-document clicks.
B.3 Click prediction results using query-document click-through rate.
B.4 Click prediction results using query-document average dwell time.
B.5 Click prediction results using query-document average position.
B.6 Click prediction results using query-document impressions and clicks.
B.7 Click prediction results using query-document impressions and click-through rate.
B.8 Click prediction results using query-document impressions and average dwelling time.
B.9 Click prediction results using query-document clicks and click-through rate.
B.10 Click prediction results using query-document clicks and average dwelling time.
B.11 Click prediction results using query-document click-through rate and average dwelling time.
C.1 NDCG scores for all experiments. Measured for ranks up to 1, 3, 5 and 10.


Introduction

It can be argued that as the amount of raw data being stored increases, so do the difficulties in finding the relevant data that suits one's interests and needs. Efficiently finding the correct data in a large quantity is a significant problem, especially in the information age we live in today.

It has many times been predicted that the total amount of data stored throughout society will grow exponentially toward 2020 and beyond, such that the size of the digital universe at least doubles every two years. [19]

The problem researched and evaluated in this thesis concerns information retrieval, in the field of web search systems.

When it comes to research and development in this field, many experiments involve users. Such experiments can differ greatly in size, ranging from small laboratory studies to web search systems with millions of real users. However, advances could not be made without some understanding of user behavior. As in many areas of science, this information can be modeled.

A user model concerning web search allows us to simulate user behavior on a search engine results page (SERP) using assumptions made about user behavioral traits. Such models are commonly called click models, as the main observed user interaction with search systems concerns the user's clicking behavior.

One motivation for click models is that they help in cases where there is a lack of real users to include in experiments for various reasons. Another motive concerns privacy and commercial constraints, as user interaction data is commonly restricted. In these cases, the use of simulated users is highly valuable.

Besides user simulation, click models can be used to improve document ranking and evaluation metrics, and to better understand users by inspecting the click models' respective parameters. [4]

Click models are commonly based upon the concept of probabilistic graphical modeling (PGM). In the last couple of years, there has been an increased effort to develop the concept even further using machine learning. This thesis attempts to combine common click models with the concept of deep learning, using a distributed representation of an existing click model. It is further evaluated how such a representation can be extended with latent variables derived from the click logs in use, and whether such extensions can improve the model's prediction performance.

1.1 Definitions

This section introduces a few central definitions of concepts covered in this report.

1.1.1 Web search engine

A web search engine is a software system designed to search for in- formation on the internet. Most commonly, the search results are pre- sented in a vertical list referred to as a Search engine results page. One of the most commonly known examples of a web search engine is Google.

1.1.2 Search engine results page

A search engine results page is the page displayed by a web search engine in response to a query given by the user. The actual type of the results depends on the web search engine system itself. It commonly contains links to relevant websites, documents or pictures. A search engine results page is referred to as a SERP throughout this thesis.

1.1.3 Click log

A click log is a dataset that contains information about the click behavior of a search engine's users. It commonly contains fields such as a session id, query terms and the corresponding documents returned for a search. Most importantly, it contains information on which results were clicked and when. Other examples are pieces of information more specific to the user, such as their region.

1.1.4 Click models

A click model simulates user behavior on a SERP, based on different assumptions made on the user’s behavior. The search engine presents the user with a SERP, which contains a set of objects which may be directly related to the queried subject, or less so. The user examines the set of results and possibly clicks on one or more from this set. The search is abandoned either by issuing a new query or by ending the interaction with the SERP. The ongoing events between the issuing of a query and abandonment are called a session. A detailed explanation of some common click models is presented in section 2.1.

1.1.5 Distributed representation

A distributed representation (DR) is the concept of expressing a particular object or concept as a set of values represented in a vector of length N. A practical value of a distributed representation is its ability to capture similarity between different concepts and concept combinations.
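As a toy sketch of this idea (the queries and vector values below are invented purely for illustration), similarity between distributed representations can be measured with cosine similarity:

```python
import math

# Toy distributed representations (N = 4) for three queries.
# The vectors are invented for illustration only.
reps = {
    "cheap flights": [0.9, 0.1, 0.3, 0.0],
    "budget airfare": [0.8, 0.2, 0.4, 0.1],
    "python tutorial": [0.0, 0.9, 0.1, 0.8],
}

def cosine(a, b):
    """Cosine similarity between two vectors of equal length."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(x * x for x in b))
    return dot / (na * nb)

# Related queries end up close in the vector space; unrelated ones do not.
sim_related = cosine(reps["cheap flights"], reps["budget airfare"])
sim_unrelated = cosine(reps["cheap flights"], reps["python tutorial"])
```

A learned representation would place queries and documents with similar click behavior near each other in the same way.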


1.2 Problem definition

This thesis investigates the following question:

How can recurrent neural networks (RNNs) and latent variables de- rived from the data source be utilized together to model click behavior in a web search system effectively?

This question is examined by expressing a state-of-the-art PGM-based click model with a distributed representation that can be used to train a recurrent neural network. This representation is extended with a set of latent variables that can easily be derived from click logs. The objective is to construct a model that performs well on both click prediction and relevance prediction when compared to the selected baseline PGM-based versions.

These steps can be expressed as a set of hypotheses that are evaluated in the course of this project:

• A deeply learned, distributionally represented click model performs better than its respective PGM version.

• Latent variables derived from click logs can be used to improve the performance of a deeply learned click model.

1.3 Delimitations

This thesis will not be able to capture the contents of the documents being clicked on. This restriction is due to the data used in this project being fully anonymized, meaning no such information is retrievable.

1.4 Ethics and sustainability

User click modeling in web search does not necessarily have to raise ethical concerns. It is somewhat more likely that the concept is motivated by ethical reasons, as user browsing data is commonly restricted and not publicized. User simulation through click models is not personalized, however, and in its basic form does not include personal data. The actual queries sent to the search engine can also be anonymized rather easily.

The application of machine learning for click modeling does not result in additional ethical restrictions in this case. However, machine learning methods should continually be applied with care, as they constitute a sensitive subject that can affect societies significantly.

The concept does not have any significant effect concerning sustainability. As click models are mainly used in the context of research and development, they have little effect on society's sustainable future.

Their use can, however, be argued to allow for experiments to use smaller user groups, resulting in less traveling required in such cases.

1.5 Thesis outline

The report begins by describing the concept of click modeling and common metrics used to evaluate click models' performance. In addition, it provides some background on the concept of artificial neural networks. A set of related works is presented, introducing what other attempts have been made in the area during recent years. Following these works, the methodology used to answer the chosen research question is described. After the methodology, the experiments' corresponding results are presented. The report ends with a discussion of the results and concluding remarks.


Background

In this chapter, the underlying knowledge required to understand the problem is presented. The chapter begins by describing the concept of click modeling. Following these descriptions, formal definitions of a set of standard and state-of-the-art click models are presented. The chapter continues by explaining the idea of neural networks and the click model evaluation metrics that are used throughout this project.

2.1 Click models

As previously described in section 1.1.4, a click model describes the behavior of a user while browsing results on a search engine results page. When a user issues a query to the search engine, it responds with a SERP containing information based on the user's described needs. The user then examines the list of resulting documents and may choose to click on one or more of the results. The user may also choose to abandon the search session entirely. The process between query and abandonment is called a session.

Click models treat the search behavior of users as a sequence of observable and hidden events. These are described by binary values X, where X = 1 means that the event has occurred and X = 0 means that it has not. The main events considered by most click models are the following:


E : a user examines an object on a SERP.

A : a user is attracted by the object’s representation in the SERP.

C : an object is clicked.

S : a user’s information need has been satisfied, and the query session can conclude.

The models define dependencies between these events to estimate the probabilities of their corresponding random variables. Some probabilities are treated as parameters and depend on features of a SERP and a user's query. [4]

The following sections introduce a selection of common and state of the art PGM-based click models.

2.1.1 Random click model

The random click model (RCM) is the simplest click model, having only one parameter. It is defined as follows:

P(C_u = 1) = ρ    (2.1)

This formula means that every document u has the same probability of being clicked, and this probability is a model parameter ρ. As the model only has one parameter, it can simply be estimated using Maximum Likelihood Estimation (MLE).

Although this model is very simplistic, its performance is often used as a baseline when comparing other models. It can also be assumed to be safe from overfitting, since it only has one parameter. [4]
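As a sketch of how ρ could be estimated from a click log (the sessions below are invented for illustration), the MLE is simply the overall fraction of shown documents that were clicked:

```python
# Hypothetical click log: each session is a list of 0/1 click
# indicators, one per displayed document on the SERP.
sessions = [
    [1, 0, 0, 0, 0],
    [0, 1, 0, 0, 0],
    [1, 0, 1, 0, 0],
    [0, 0, 0, 0, 0],
]

# MLE for the single RCM parameter rho: total clicks divided by
# the total number of displayed documents, across all sessions.
clicks = sum(sum(s) for s in sessions)
shown = sum(len(s) for s in sessions)
rho = clicks / shown  # P(C_u = 1) for every document u
```

Because a single scalar summarizes the whole log, the model cannot distinguish documents, queries or ranks, which is why it serves only as a baseline.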

2.1.2 Position based model

Many models include a so-called examination hypothesis, formally de- scribed as:

C_u = 1 ⇔ E_u = 1 ∩ A_u = 1    (2.2)

This hypothesis means that a user clicks a document u if, and only if, the user both examined the document and was attracted by it. The random variables E_u and A_u are usually considered independent.


The position-based model (PBM) incorporates the assumption that the probability of a user examining a document u, given a query q, depends heavily on its rank or position on the SERP. This probability typically decreases with rank, i.e., for result positions further down the list.

The model incorporates this into a set of parameters. The examination probability at rank r is represented by γ_r, while α_uq represents the attraction probability. This model allows for sessions where more than one click event has occurred. [4]

P(C_u = 1) = P(E_u = 1) · P(A_u = 1)    (2.3)

P(A_u = 1) = α_uq    (2.4)

P(E_u = 1) = γ_{r_u}    (2.5)
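A minimal sketch of equations 2.3-2.5 in code; all parameter values below are invented for illustration, whereas a real PBM would estimate them from a click log:

```python
# gamma[r] is the examination probability at rank r (1-indexed),
# alpha[(u, q)] the attraction probability of document u for query q.
gamma = {1: 0.95, 2: 0.80, 3: 0.60, 4: 0.40, 5: 0.25}
alpha = {("doc_a", "q1"): 0.7, ("doc_b", "q1"): 0.5}

def pbm_click_prob(u, q, r):
    """P(C_u = 1) = gamma_r * alpha_uq under the PBM (eq. 2.3)."""
    return gamma[r] * alpha[(u, q)]

# The same document is predicted to be clicked less often
# when shown further down the SERP.
p_top = pbm_click_prob("doc_a", "q1", 1)  # 0.95 * 0.7
p_low = pbm_click_prob("doc_a", "q1", 5)  # 0.25 * 0.7
```

Note how the rank effect (γ) and the document effect (α) factorize, which is exactly the independence assumption stated in equation 2.2.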

2.1.3 Cascade model

The cascade model (CM) works on the assumption that the user scans the documents listed on the SERP sequentially, from top to bottom, until they find a relevant document. This rests on the foundational assumption that the top-ranked document u_1 is always examined, whereas a document u_r with r ≥ 2 is examined if, and only if, the previous document u_{r-1} was examined and not clicked. This assumption, combined with the examination assumptions from equations 2.3 and 2.4, yields the cascade model:

C_r = 1 ⇔ E_r = 1 ∩ A_r = 1    (2.6)

P(A_r = 1) = α_{u_r q}    (2.7)

P(E_1 = 1) = 1    (2.8)

P(E_r = 1 | E_{r-1} = 0) = 0    (2.9)

P(E_r = 1 | C_{r-1} = 1) = 0    (2.10)

P(E_r = 1 | E_{r-1} = 1, C_{r-1} = 0) = 1    (2.11)

This model can only describe sessions with one click and cannot explain non-linear examination patterns. [4]
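The cascade assumptions above can be sketched as a simple session simulation (the attractiveness values are invented for illustration):

```python
import random

def simulate_cascade(attractiveness, rng):
    """Simulate one cascade-model session (eqs. 2.6-2.11).

    The user scans top to bottom; each examined document is clicked
    with its attractiveness probability, and the first click ends the
    session (eq. 2.10). Returns the 0/1 click vector for the SERP.
    """
    clicks = [0] * len(attractiveness)
    for r, a in enumerate(attractiveness):
        if rng.random() < a:  # attracted => clicked (eq. 2.6)
            clicks[r] = 1
            break             # no examination after a click (eq. 2.10)
    return clicks

rng = random.Random(42)
# Invented attractiveness values alpha_{u_r q} for a 5-document SERP.
session = simulate_cascade([0.3, 0.5, 0.2, 0.1, 0.1], rng)
```

By construction, the simulated session never contains more than one click, mirroring the model's stated limitation.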


2.1.4 User browsing model

The user browsing model (UBM) is an extension of the PBM described in section 2.1.2, and includes some elements of the CM described in section 2.1.3. The model is based on the idea that the examination probability should take previous clicks into account, albeit remain mainly position-based. It depends not only on the rank r of a document, but also on the rank r' of the previously clicked document. This can be formalized as:

P(E_r = 1 | C_1 = c_1, ..., C_{r-1} = c_{r-1}) = γ_{rr'}    (2.12)

where r' is the rank of the previously clicked document, or zero if none has been clicked. In other words:

r' = max{k ∈ {0, ..., r-1} : c_k = 1}    (2.13)

where c_0 is set to 1 for convenience. An alternative formulation of 2.12 is as follows:

P(E_r = 1 | C_{<r}) = P(E_r = 1 | C_{r'} = 1, C_{r'+1} = 0, ..., C_{r-1} = 0) = γ_{rr'}    (2.14)

[4], [5]
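A small sketch of equations 2.12-2.13 in code; the γ values below are invented, whereas a real UBM would estimate them from click logs:

```python
def previous_click_rank(clicks_before_r):
    """Rank r' of the most recently clicked document (eq. 2.13).

    clicks_before_r is the 0/1 click vector c_1..c_{r-1};
    returns 0 if nothing has been clicked yet (c_0 = 1 by convention).
    """
    r_prime = 0
    for k, c in enumerate(clicks_before_r, start=1):
        if c == 1:
            r_prime = k
    return r_prime

# Hypothetical UBM examination parameters gamma[(r, r_prime)];
# the values are invented for illustration.
gamma = {(4, 0): 0.5, (4, 2): 0.7}

clicks = [0, 1, 0]                     # observed clicks at ranks 1..3
r_prime = previous_click_rank(clicks)  # most recent click was at rank 2
p_exam = gamma[(4, r_prime)]           # P(E_4 = 1 | C_1..C_3), eq. 2.12
```

The distance between r and r' is what lets UBM model the intuition that users examine documents close below their last click more often.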

2.1.5 Dynamic Bayesian network model

The dynamic Bayesian network model (DBN) is an extension of the CM described in section 2.1.3. It works on the assumption that the user's perseverance after a click depends only on the actual relevance σ_uq, also called the satisfaction probability, instead of the perceived relevance α_uq. The model can be described as follows:

C_r = 1 ⇔ E_r = 1 ∩ A_r = 1    (2.15)

P(A_r = 1) = α_{u_r q}    (2.16)

P(E_1 = 1) = 1    (2.17)

P(E_r = 1 | E_{r-1} = 0) = 0    (2.18)

P(S_r = 1 | C_r = 1) = σ_{u_r q}    (2.19)

P(E_r = 1 | S_{r-1} = 1) = 0    (2.20)

P(E_r = 1 | E_{r-1} = 1, S_{r-1} = 0) = γ    (2.21)

where γ is the continuation probability for a user who either clicked a document and was not satisfied by it, or did not click any document at all. [4]

2.1.6 Partially sequential click model

The partially sequential click model (PSCM) was introduced by Wang et al. [20] as an attempt to model less sequential user behavior, and has demonstrated great performance. Using the timestamps of the click log in use, the click sequence is organized as C = {C_1, C_2, ..., C_t, ..., C_T}, where t is the relative temporal order of a click and C_t records the result position of the t-th click, with 1 ≤ C_t ≤ M, where M represents the number of documents considered on the SERP, commonly set to 10.

The model is based on two assumptions, the first being the first-order click hypothesis, also used in the UBM model described in section 2.1.4 and the DBN model from section 2.1.5. Through this assumption, the model assumes that the click event at time t+1 is determined only by the click event at time t. This allows the model to divide a click sequence into sub-sequences, or adjacent click pairs, {[C_0, C_1], ..., [C_{t-1}, C_t], ..., [C_T, C_{T+1}]}, where C_0 represents the beginning of the search process and C_{T+1} represents the end.

Additionally, according to the locally unidirectional examination assumption, given an observation of adjacent clicks at a point in time, users tend to examine results without any directional changes, meaning that they follow the path from m to n, where m < n, without deviation. The examination and click sequences between C_{t-1} and C_t can be noted as {E_m, ..., E_j, ..., E_n} and {C_m, ..., C_j, ..., C_n}, respectively. Note that in the adjacent click sequence, only C_m and C_n can have a value of 1; the other positions on the path have value 0. The model can be described as follows:

P(C_t | C_{t-1}, ..., C_1) = P(C_t | C_{t-1})    (2.22)

P(C_t = n | C_{t-1} = m) = P(C_m = 1, ..., C_i = 0, ..., C_n = 1)    (2.23)

P(E_i = 1 | C_{t-1} = m, C_t = n) = γ_{imn} if m ≤ i ≤ n or n ≤ i ≤ m, and 0 otherwise    (2.24)

C_i = 1 ⇔ E_i = 1 ∩ R_i = 1    (2.25)

P(R_i = 1) = α_uq    (2.26)

Equation 2.22 denotes the first-order click hypothesis, and equation 2.23 encodes the unidirectional examination assumption by restricting the examination process to one way, from m to n. The examination probability of E_i is defined as in equation 2.24, since the examination behavior between adjacent clicks may not follow the cascade assumption. The probability of examination depends on the positions of the clicks, similar to UBM but not restricted to strictly sequential behavior. PSCM also follows the examination hypothesis, just like most click models do, as described in equation 2.25. In this notation, the variable α_uq corresponds to the relevance of the document u to the query q, and the variable γ_{imn} represents the examination transition probability in either the upward or downward direction.

2.2 Artificial neural networks

Artificial neural networks (ANNs) are a family of computational methods attempting to model the information processing capabilities found in the biological brain. The concept has been around since 1943 and has since been developed to be increasingly sophisticated [16]. An artificial neural network requires much computational power to achieve good performance. Because this computational power was not available until the last decade, the concept did not gain real traction until recently, made possible by the rapid increase in computational capabilities through cheaper and more efficient hardware.

Modeling methods based on ANNs have since proven successful in many complex tasks, such as speech recognition, translation and image classification.

The goal of an ANN can be simplified to the approximation of some given function using a set of model parameters θ. A common example is a simple classifier where y = f*(x) maps an input x to a category y. An ANN in this context defines a mapping y = f(x; θ) and learns the values of the parameters θ that result in the best function approximation. [6]


2.3 Feedforward neural networks

A feedforward neural network is one of the simplest types of networks in the ANN family. The model typically consists of one input layer, one output layer and some number of intermediate hidden layers. Figure 2.1 serves as an architectural example. The model is called feedforward as information flows through the model in only one direction: from the input layer, through the hidden computational layers, and ending in the output layer, which produces the actual prediction. [6]

Figure 2.1: An example of a feedforward neural network with three input nodes, one hidden layer with four nodes and two output nodes.

As the information flows through the network, the input values are multiplied by a set of weight values at each step. During training, the predictions are compared against the expected output values, which are used to calculate the training error using an appropriate error function. The training error is used to tune the weights in an attempt to increase the network's prediction accuracy. When the output is compared with expected values, the process is referred to as supervised learning. [16]

2.4 Recurrent neural networks

The family of recurrent neural networks (RNNs) is an extension of the traditional feedforward networks in the sense that they include feedback connections, no longer limiting the information to flow in one predetermined direction. Figure 2.2 serves as an architectural example. Through the feedback connections, the outputs of the model are fed back into itself. Training such networks relies on backpropagation, which is commonly used when training models with more than one hidden layer. Such models are often referred to as deep neural networks.

Figure 2.2:An example of a recurrent neural network with three input nodes, one hidden layer with two nodes and two output nodes. The output layer is now connected to the hidden layer using feedback connections.

During backpropagation, the error is fed back into the network and used to adjust the weights, attempting to reduce the error and increase prediction performance. This is a crucial step when training deep networks. Recurrent networks are made possible by the idea of parameter sharing between different parts of the model. This concept enables the possibility to extend and apply the model to examples of different forms and to generalize across them. Where a traditional fully connected feedforward network would require separate parameters for each input feature, a recurrent neural network shares the same weights across several time steps. [6]

After multiple samples have been used to learn a good set of weight values, the network should be able to produce accurate predictions for new samples. Overfitting may occur, however, meaning that the network has been trained too specifically to the training data and does not generalize well enough to predict new data. [16]

One of the most crucial parameters of the model is the learning rate. It decides how quickly the weights should be altered. A small learning rate leads to slower learning, while too large a value may prevent the network from converging if the optimal weight values are accidentally skipped.


2.4.1 Long short-term memory

Long short-term memory, commonly referred to as LSTM, is an extension to the common concept of RNN models.

The model not only allows for self-loops in which the gradient can flow for long durations; through the introduction of gates, the weight of the loops can be conditioned on the context rather than being a fixed feature of the model. The gates themselves are controlled by a hidden unit, which allows the time scale at which new knowledge is integrated to be set dynamically. The use of LSTM has proven highly successful in many applications and can be used to mitigate problems that may occur in regular RNNs, such as exploding or vanishing gradients. [6]

2.5 Deep learning in practice

This section covers some crucial parts to consider when training a deep neural network model.

2.5.1 The learning process

One important aspect when dealing with models consisting of complex neural networks is how to select the model's respective parameter values. The strategy of deep learning is to learn them using iterative, gradient-based optimizers that drive the error function down to very low values. To effectively apply gradient-based learning, a sufficient loss function must be chosen, as well as how to represent the output of the model. Depending on the context, the loss function may also be called the cost or error function.

Activation functions

The choice of non-linear transformation or activation functions has a significant effect on the training and task performance when dealing with neural networks. Some popular and widely used activations are the sigmoid, hyperbolic tangent (tanh) and rectified linear unit (ReLU) functions. ReLU is currently one of the most successful and widely used activations [15], defined as:

f(x) = \max(x, 0)   (2.27)

The sigmoid activation function is effective for transforming the final output to a probability in the range of [0, 1]. For this reason, sigmoid is a common choice for the output layer of systems that are to output a probability distribution. It may, however, result in optimization difficulties if used as the top hidden layer in deep neural networks [1]. The sigmoid activation function is defined as:

f(x) = \frac{1}{1 + e^{-x}}   (2.28)
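As a minimal, library-free sketch (not taken from the thesis implementation), the two activations in Equations 2.27 and 2.28 can be written directly as:

```python
import math

def relu(x):
    # Rectified linear unit (Eq. 2.27): passes positive values through,
    # zeroes out negatives.
    return max(x, 0.0)

def sigmoid(x):
    # Sigmoid (Eq. 2.28): squashes any real input into the range (0, 1).
    return 1.0 / (1.0 + math.exp(-x))
```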

Weight initialization

Weights have to be initialized with care to break the symmetry between hidden units of the same layer. There exists a set of common weight initialization strategies. However, initialization using a zero-mean Gaussian with a small standard deviation of around 0.1 often performs well enough. Biases and weights in connection to the output layer are generally safe to be initialized to zero. [1]
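A sketch of the initialization strategy described above, using NumPy (the function name and layer sizes are illustrative, not from the thesis):

```python
import numpy as np

rng = np.random.default_rng(0)

def init_layer(n_in, n_out, std=0.1):
    # Zero-mean Gaussian weights with a small standard deviation break
    # the symmetry between hidden units; biases can safely start at zero.
    W = rng.normal(0.0, std, size=(n_in, n_out))
    b = np.zeros(n_out)
    return W, b
```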

2.5.2 Optimization

The optimization problem is, in the context of neural networks, the problem of finding the best model parameters θ that significantly reduce the loss function, also called the cost or error function. The loss function is the measure of how well a model performs for the given training sample and the expected output. It may also depend on variables such as weights or biases.

Stochastic gradient descent

Stochastic gradient descent (SGD) is the most commonly used optimization algorithm. The algorithm tries to find minima or maxima by iteration. It takes one important parameter, the learning rate, which is usually decreased gradually over time until some particular point in time, after which it is left constant. The reason for this is the source of noise introduced by the SGD gradient estimator, which does not vanish even when we arrive at a minimum. [6]
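A one-dimensional sketch of this schedule (the decay rule, constants and test function are illustrative assumptions, not the thesis configuration): the learning rate decays linearly for the first `decay_until` steps, then is held constant.

```python
def sgd_minimize(grad, x0, lr0=0.2, decay_until=50, floor=0.1, steps=100):
    # Plain (deterministic) gradient descent with the learning-rate
    # schedule described above: decay gradually, then hold constant.
    x = x0
    for t in range(steps):
        lr = lr0 * max(1.0 - t / decay_until, floor)
        x -= lr * grad(x)
    return x
```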

Adaptive learning

Adaptive learning is the process of gradually altering the learning rate during training. There exists a handful of algorithms used for this purpose. An example of such an algorithm is the Adadelta optimizer, introduced by Zeiler [21]. This optimizer requires minimal additional computational overhead when compared to the conventional SGD algorithm. While it offers a set of additional hyperparameters to be set, it has been found that their selection does not significantly alter the final results.
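A one-parameter sketch of the Adadelta update rule from Zeiler's paper, using running averages of squared gradients and squared updates (the objective in the usage below is an arbitrary illustration):

```python
import math

def adadelta_minimize(grad, x0, rho=0.95, eps=1e-6, steps=2000):
    # Adadelta: the step size is the ratio of the running RMS of past
    # updates to the running RMS of past gradients, so no global
    # learning rate has to be tuned by hand.
    x, eg2, edx2 = x0, 0.0, 0.0
    for _ in range(steps):
        g = grad(x)
        eg2 = rho * eg2 + (1.0 - rho) * g * g
        dx = -math.sqrt(edx2 + eps) / math.sqrt(eg2 + eps) * g
        edx2 = rho * edx2 + (1.0 - rho) * dx * dx
        x += dx
    return x
```

Note how eps both conditions the square roots and bootstraps the very first update, when both running averages are still zero.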

Difficulties in optimization

Plateaus, saddle points and flat regions

The problem of neural network optimization is often framed as the problem of finding a local minimum, but it is not necessarily restricted to local minima discovery. It is recommended to plot the norm of the gradient over time. The reason for this is other shapes that become more prevalent in higher dimensions, such as saddle points and other critical points. [6]

Vanishing and exploding gradients

Vanishing and exploding gradients are other types of problems that occur in neural networks. The vanishing gradient problem occurs when the gradient becomes so small that the weights are unable to be updated, in some cases even causing the network to stop training completely.

Error gradients can also accumulate during learning, resulting in the exploding gradient problem. Exploding gradients mean that large gradient values are used to update the weights, causing an unstable network. [6]

The use of LSTM can reduce these potential problems in the case of recurrent neural networks. [3]


2.5.3 Regularization

A central problem in deep learning is how to construct an algorithm that performs well not only on the training data, but also on new inputs. Regularization techniques are strategies in machine learning that are designed to reduce test set error, possibly at the expense of increased training error.

Some approaches put extra constraints on the model itself, such as adding restrictions on the parameter values. Others add additional terms in the objective function corresponding to a soft constraint on the parameter values.

With regards to deep learning applications, most regularization strategies are based on regularizing estimators. Essentially, they work by increasing bias for a reduced variance. A well-performing regularizer is one that makes a profitable trade, greatly reducing variance while not increasing the bias too much. [6]

Below are a set of popular regularization techniques.

Clipping gradients

Practitioners have used the idea of clipping the gradient for many years. It is a regularization method that works to mitigate the exploding and vanishing gradients problem. It involves clipping the norm of the gradient when it has become too large, just before the following parameter update. This is motivated by the assumption that when gradients explode, so do the curvature and higher-order derivatives [12]. This idea can also be applied element-wise; however, clipping the norm is the more popular approach. [6]
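A sketch of norm clipping (the threshold value is arbitrary):

```python
import numpy as np

def clip_by_norm(grad, threshold=1.0):
    # Rescale the whole gradient vector when its norm exceeds the
    # threshold; the direction is preserved, only the magnitude shrinks.
    norm = float(np.linalg.norm(grad))
    if norm > threshold:
        return grad * (threshold / norm)
    return grad
```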

Dropout

Standard backpropagation in RNNs can cause the model to overfit to the training data and not generalize well. An idea to mitigate this is the use of dropout. It works by the principle of making a particular hidden unit unreliable, meaning that hidden units may be ignored during a training phase. Selections are random, based on chosen probabilities. One drawback to using dropout is that it can increase training time quite heavily: a neural network utilizing dropout generally takes 2 to 3 times longer to train. This is caused by the noise that the parameter updating process introduces. [18]
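A sketch of the mechanism using "inverted" dropout, the variant most libraries implement (survivors are scaled at training time so the expected activation is unchanged); the keep probability is an arbitrary choice:

```python
import numpy as np

rng = np.random.default_rng(42)

def dropout(activations, keep_prob):
    # Zero each unit with probability 1 - keep_prob and scale the
    # survivors by 1 / keep_prob, keeping the expected value unchanged.
    mask = rng.random(activations.shape) < keep_prob
    return activations * mask / keep_prob
```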

Early stopping

Early stopping is likely one of the most commonly used forms of regularization in deep learning. It is often thought of as a very effective hyperparameter selection algorithm, especially concerning the number of training epochs. It requires almost no change in the foundational training process, the objective function or the set of parameter values allowed. It works by interrupting the training procedure once the model's performance on the validation set has become worse, visualized in Figure 2.3. The validation set is a set of examples that are never used for learning, but is representative of future test examples. [6] [13]

Figure 2.3: Early stopping. Training is interrupted when validation error starts to increase.
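A sketch of the stopping rule in its common patience-based form (function and parameter names are illustrative): training stops once the validation loss has failed to beat the best value seen for a fixed number of consecutive epochs.

```python
def early_stopping_epoch(val_losses, patience=3):
    # Walk through validation losses epoch by epoch and stop once the
    # best value has not improved for `patience` consecutive epochs.
    # Returns the epoch of the best model and its loss.
    best_loss, best_epoch, waited = float("inf"), -1, 0
    for epoch, loss in enumerate(val_losses):
        if loss < best_loss:
            best_loss, best_epoch, waited = loss, epoch, 0
        else:
            waited += 1
            if waited >= patience:
                break
    return best_epoch, best_loss
```

In practice the parameters saved at the best epoch would then be restored.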

2.6 Click model evaluation

This section presents the metrics used for evaluating the model’s click and relevance prediction performance. Both metrics are widely adopted for this particular area of research.


2.6.1 Perplexity

To assess our model's click prediction performance, the click perplexity metric is used, as introduced by Dupret and Piwowarski [5]. This metric measures how "surprised" the model is upon observing a document. The higher the value, the worse the model, with an optimal value of 1. The perplexity of the random click model described in section 2.1.1 is 2, meaning that a realistic model should have a value in the range of [1, 2]. The metric is calculated as follows:

p_r(M) = 2^{-\frac{1}{|S|}\sum_{s \in S}\left(c_r^{(s)}\log_2 q_r^{(s)} + (1 - c_r^{(s)})\log_2(1 - q_r^{(s)})\right)}   (2.29)

Where q_r^{(s)} is the probability of a user clicking the document at rank r in the session s, as predicted by the model M. In other words:

q_r^{(s)} = P_M(C_r | q, u)   (2.30)

It is possible to calculate the perplexity averaged across ranks to get an overall measure of model quality, done by the following formula:

p(M) = \frac{1}{n}\sum_{r=1}^{n} p_r(M)   (2.31)

When comparing the perplexity scores of two models A and B, the perplexity gain of A over B can be calculated as follows:

\mathrm{gain}(A, B) = \frac{p_B - p_A}{p_B - 1}   (2.32)

The perplexity is typically higher for top documents and decreases toward the bottom of a SERP, because top documents typically get more clicks. Since it is more difficult to predict a click than to predict its absence, top documents are demanding for a click model to get right. [4]
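A sketch of Equations 2.29 and 2.31 on toy data (the session encoding is an assumption for illustration: one 0/1 click list and one predicted-probability list per session):

```python
import math

def rank_perplexity(clicks, probs, rank):
    # Eq. 2.29: clicks[s][rank] is the observed click (0 or 1) and
    # probs[s][rank] the model's click probability at that rank.
    total = sum(c[rank] * math.log2(q[rank])
                + (1 - c[rank]) * math.log2(1.0 - q[rank])
                for c, q in zip(clicks, probs))
    return 2.0 ** (-total / len(clicks))

def mean_perplexity(clicks, probs, n_ranks):
    # Eq. 2.31: average the per-rank perplexities across ranks.
    return sum(rank_perplexity(clicks, probs, r)
               for r in range(n_ranks)) / n_ranks
```

For a model that always predicts 0.5, every session contributes log2(0.5) = -1 at each rank, recovering the perplexity of 2 of the random click model mentioned above.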

2.6.2 Normalized discounted cumulative gain

Normalized discounted cumulative gain (NDCG) is a measure of ranking quality, standardized in information retrieval and introduced by Järvelin and Kekäläinen [8]. In this thesis, it is used to assess the relevance predictions given by our click models.


The following formula describes the normalized discounted cumulative gain at rank position p, followed by the definitions of its respective numerator and denominator.

NDCG_p = \frac{DCG_p}{IDCG_p}   (2.33)

DCG_p = \sum_{i=1}^{p} \frac{2^{rel_i} - 1}{\log_2(i + 1)}   (2.34)

IDCG_p = \sum_{i=1}^{|REL|} \frac{2^{rel_i} - 1}{\log_2(i + 1)}   (2.35)

The variable rel_i represents the graded relevance of the result at position i, and REL represents the list of relevant documents in the corpus up to position p, ordered by their relevance. The NDCG values for all queries can be averaged in order to obtain a measure of the average performance of an algorithm. Note that for a perfect ranking algorithm, DCG_p will be the same as the ideal, resulting in an NDCG of 1. Results are normally values on the interval [0, 1].
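A sketch of Equations 2.33 to 2.35 where, as a simplifying assumption, the ideal ranking is computed from the same relevance list sorted in decreasing order:

```python
import math

def dcg(relevances):
    # Eq. 2.34: graded relevance discounted by the log2 of the
    # 1-based rank position.
    return sum((2.0 ** rel - 1.0) / math.log2(i + 2)
               for i, rel in enumerate(relevances))

def ndcg(ranked_relevances):
    # Eq. 2.33: DCG of the produced ranking divided by the ideal DCG.
    ideal = dcg(sorted(ranked_relevances, reverse=True))
    return dcg(ranked_relevances) / ideal if ideal > 0 else 0.0
```

With the binary relevance labels used in this thesis, rel_i is simply 0 or 1, and the graded formula reduces accordingly.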


Related Work

This chapter provides an overview of some of the related work that has previously been done in the field of click modeling for web search.

Most of the work referred to utilizes deep learning techniques for either sponsored search or regular web search processes, while one brings up another interesting approach that has affected the work done in this project. The chapter ends by describing what specific information has been used in the course of this project.

3.1 Neural models

A variety of previous work has been done when it comes to applying neural networks to click modeling. Zhang et al. [22] (2013) proposed a click model based on RNNs for sponsored search that directly models the user's current and historical sequential behaviors into a click prediction process. They construct features based on ad impressions for both the training and testing process.

Liu et al. [9] (2015) proposed a model based on convolutional neural networks for sponsored search. Their model utilizes two sets of convolutional and flexible pooling layers, ending with a fully connected output layer. They evaluate significant features of ad impressions and the subsequent impression history to improve the click prediction accuracy.



Liu et al. [10] (2017) proposed a CNN-based click model for web search, which incorporates document content information and the context of the SERP. The proposed network utilizes a single wide convolutional layer, followed by non-linearity and max pooling. After this, the previously generated query and result-feature vectors are used to compute content and context similarities. The features are processed by a set of joint and hidden layers, finally used to make relevance and click probability predictions. Their evaluation section includes the UBM and PSCM click models and also the RNN model of Zhang et al. [22]. This RNN model performed worse than the baseline version of PSCM, whereas the CNN version of PSCM was shown to perform very well.

Borisov et al. [3] (2016) proposed an RNN-based click model for web search which utilizes a distributed representation. This report is the main inspiration for the work done in this thesis. They do not base their representation on the previously known PGM-based click models, although they provide examples of how the UBM and DBN models can be translated for this purpose. Instead, their distributed representations are implemented using three sets of representations for a query q, document d and user interaction i. Each set considers varying amounts of query sessions: either query-document pairs, all query sessions generated by the given query q, or all query sessions given query q whose SERP contains the document d. They found that their distributed models, specifically the one using long short-term memory (LSTM), perform better than the baseline models used.

3.2 Latent variable model

The work of Hu, N. Liu, and Chen [7] is the second biggest inspiration for this thesis. Their approach is not based on neural networks in the same way as the work mentioned above. Instead, they introduce the interesting techniques of feature generation and feature augmentation through latent variables and combine them with existing click models. This report is inspirational mainly for the collection of potential statistical features that can be generated from the click logs.


3.3 This project

This project can be viewed as a combination of the distributed representation concept found in [3] with a set of statistical features found in [7] that can be generated from click logs. Specifically, the distributed representation of UBM, exemplified by [3], is used individually and in combination with various sets of statistical features. The rather simple representation of the UBM model is the primary motivation for its use.

It is interesting to evaluate how such a simple distributed representation can be applied for the use of click modeling, and whether its performance is improved when incorporating statistical features.


Methodology

This chapter describes the methodology used to answer the selected research questions. The chapter starts off by describing the data used and the selected baseline models. The concept of distributed representations and their implementations shortly follows. These representations are later extended by a set of latent variables that can easily be derived from the click logs. The chapter ends by describing the evaluation process used in an attempt to answer the selected set of hypotheses.

4.1 Dataset

The click log data used in this thesis comes from the Yandex Relevance Prediction challenge held in 2011 [14]. The data contains 30,717,251 unique queries and 117,093,258 unique documents sampled from logs of the Russian search engine Yandex. Besides these click logs, it also contains human-generated binary relevance labels for 41,275 query-document pairs, covering 4,991 unique queries. These relevance labels are used to evaluate the ranking performance of the click models used in this project.

Two different types of lines describe the query sessions in the data. The first line type initializes a query session, holding the following values:

SessionID TimePassed TypeOfAction QueryID RegionID ListOfURLs



Where SessionID is the unique number identifying the described query session, and TimePassed is the time at which the session was initialized, always set to 0. TypeOfAction is Q for query, QueryID is the identifying id for the issued query, followed by a list of 10 document ids held in ListOfURLs. The list is ordered from left to right, as the documents were shown to the user from top to bottom in the SERP. RegionID is a unique identifier of the country from which the user is querying. The number of queries sent within a session varies.

The second line type describes a click action made within a query session and is denoted by the values:

SessionID TimePassed TypeOfAction URLID

Where TimePassed represents the time since the query session was initialized until the click was recorded. The resolution of the time measurement is undisclosed. TypeOfAction is C, representing a click event.

URLID represents the id of the document that was clicked. The number of click events for a query may vary between 0 and 10 clicks.

4.2 Data processing

The original data was read and translated into query sessions that could be described in a single line. It was then randomized and split into separate partition files, where each file represents approximately 0.2% of the total number of query sessions. This means that each file holds over 340,000 different query sessions, where the following fields describe a session:

SessionID, QueryID, Result-1, Click-event-1, .. Result-10, Click-event-10

In this structure, a click event is represented by the time since the start of the query session at which the corresponding search result was clicked, or 0 if no click was made.

When partitions are referred to throughout this report, it is done with these partitions in mind. However, due to the time restrictions of this project, the data quantity used had to be restricted considerably. The conducted experiments were done using 10 partitions, representing approximately 2% of the total number of query sessions available in the dataset. This quantity still holds over 3.4 million different query sessions. However, these restrictions may impact the final results found.

The performance comparison of the baseline models used may not be entirely fair due to limitations of iterative variable estimation, which requires greater amounts of data to learn the required parameter values effectively. Regarding the neural models, larger data quantities could allow for more general training and thus decrease the potential for overfitting. The use of larger amounts of data would, therefore, be preferable.

4.3 Baseline models

The baseline models implemented and used to benchmark our neural models were UBM and PSCM. These were selected due to their performance in previous work. The user browsing model was a clear choice, given that it is the model on which the distributed representation used in this project is based. The models' respective probabilistic parameters were learned using the expectation-maximization algorithm, which is a common iterative method used to retrieve maximum likelihood estimates for unobserved variables. It was configured with a maximum of 50 iterations, ending early once the root mean square error (RMSE) was found to be less than 1 × 10^{-8}. These settings were picked because a higher precision did not show any significant improvements in early attempts and instead only increased the total time spent.

The root mean square error is defined as:

RMSE = \sqrt{\frac{\sum_{i=1}^{n}(y_i - \hat{y}_i)^2}{n}}   (4.1)

Where y_i is the predicted value and \hat{y}_i is the value observed.
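Equation 4.1 as a short sketch:

```python
import math

def rmse(predicted, observed):
    # Eq. 4.1: root of the mean squared difference between predicted
    # and observed values.
    n = len(predicted)
    return math.sqrt(sum((y - t) ** 2
                         for y, t in zip(predicted, observed)) / n)
```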


4.4 Distributionally represented user browsing model

The UBM baseline was translated into a distributed representation as Borisov et al. [3] described it. The representation is hereafter referred to as vector state.

The vector state, denoted by s_r, is represented by a tuple of four integer values (q, d, r, r′). In this notation, q denotes the issued query id, d is the id of the document currently being examined and r is the rank of d in the SERP. Finally, r′ denotes the rank of the previously clicked document. Rank in this context refers to a document's placement on the SERP, where a rank of 1 means that it is located at the top of the results page.

The distributed representation of a UBM modeled query session can be formalized as:

I(q) = (\text{QueryID}(q),\, 0,\, 0,\, 0)   (4.2)

U(s_r, i_r, d_{r+1}) = (s_r[0],\, \text{docID}(d_{r+1}),\, s_r[2] + 1,\, h(s_r, i_r))   (4.3)

h(s_r, i_r) = \begin{cases} s_r[2] & \text{if } i_r = 1 \\ s_r[3] & \text{otherwise} \end{cases}   (4.4)

F(s_{r+1}) = \gamma_{s_{r+1}[2],\, s_{r+1}[2] - s_{r+1}[3]} \cdot \alpha_{s_{r+1}[0],\, s_{r+1}[1]}   (4.5)

The mapping I(q) in Equation 4.2 initializes the vector state by setting the first component to the issued query id and the rest to zero. The mapping U(s, i, d) is used to update state s_r to state s_{r+1} by setting the second component to the id of the next examined document. The rank of the currently examined document is incremented by one, as the user has moved down the SERP. The fourth component is set to the third component of s_r if the previous document d_r was clicked. The variable i_r is a boolean value denoting a user interaction, specifically a click event in this project.

The function F(s_{r+1}) computes the probability that a user clicks on the currently examined document. This prediction is the product of the examination probability \gamma_{r, r-r'} and the attractiveness probability \alpha_{q,d}. In the deep learning models, these probabilities correspond to parameters that are to be learned.
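The mappings I and U (Equations 4.2 to 4.4) can be sketched directly as tuple manipulations (function names are illustrative):

```python
def init_state(query_id):
    # I(q), Eq. 4.2: (query id, current doc id, rank, rank of last click).
    return (query_id, 0, 0, 0)

def update_state(state, clicked, next_doc_id):
    # U(s_r, i_r, d_{r+1}), Eqs. 4.3-4.4: move one rank down the SERP,
    # and remember the rank of the previous document if it was clicked.
    query_id, _, rank, last_click_rank = state
    return (query_id, next_doc_id, rank + 1,
            rank if clicked else last_click_rank)
```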


4.5 Neural network configurations

This section presents how the recurrent neural network and long short-term memory models were implemented using Tensorflow [11].

4.5.1 Recurrent neural network

The base RNN model is constructed as depicted in Figure 4.1, based on the implementation of Borisov et al. [3]. It uses a fully connected layer in order to initialize the vector state s_0. From this state it utilizes recurrent connections to propagate information from the current state s_r to the next, s_{r+1}. The states are formalized as follows:

s_0 = f_1(W_{qs} \cdot q + b_1)   (4.6)

s_{r+1} = f_2(W_{ss} \cdot s_r + W_{is} \cdot i_r + W_{ds} \cdot d_{r+1} + b_2)   (4.7)

The functions f_1 and f_2 here refer to non-linear transformations, e.g. the tanh, sigmoid or ReLU functions.

The click probability c_{r+1} is computed using a fully connected layer with one output unit for each search result in the SERP. The sigmoid activation function is used in the output layer to ensure that the output falls in the range [0, 1].

c_{r+1} = \sigma(W_{sc} \cdot s_{r+1} + b_3)   (4.8)

The matrices W_{qs}, W_{ss}, W_{is}, W_{ds}, W_{sc} and bias variables b_1, b_2 and b_3 are the parameters of the functions I, U and F that are to be learned during training.

4.5.2 Long short-term memory

The LSTM configuration was implemented the same way as in Figure 4.1, the only difference being that the states s_r consist of an LSTM cell instead of a regular RNN cell, with hopes that the gated learning process would increase performance.


Figure 4.1: RNN model configuration. [3]

4.6 Learning on distributed models

The problem of click prediction can be viewed as a binary classification problem, since each result in the SERP is predicted a binary value representing whether the result has been clicked. As the output of our models is calculated using the sigmoid activation function, this is as simple as rounding the output to the nearest integer. Prediction accuracy is calculated using the rounded predictions and their respective truth labels.

Both the RNN and LSTM configurations are trained by maximizing the likelihood of observed click events.

4.6.1 Optimization of learning

To effectively train the model and optimize it for our problem, a logarithmic loss function was used. Specifically, the sigmoid cross entropy function [17] available in Tensorflow was selected. This was done in combination with the Adadelta optimization algorithm with default values ε = 10e−6 and ρ = 0.95 in order to adjust the learning rates. The gradient clipping technique [12] was used in an attempt to mitigate the exploding gradient problem, with the threshold set to 1. These settings were used due to their use in the work of Borisov et al. [3].


Dropout regularization was used in both the ingoing and outgoing directions of the hidden layer. Another form of regularization used is early stopping. It was configured to compare the best loss value seen so far with the current one, combined with a set number of steps used to determine whether to stop training or not.

4.6.2 Hyper-parameter selection

The main parameters explored in this project were the learning rate, cell size, the number of hidden layers, the dropout probability, and the activation functions for both the query layer (Equation 4.6) and the results layer (Equation 4.7). The number of epochs for early stopping and the maximum number of training epochs were also explored. The random search algorithm, presented by Bergstra and Bengio [2], was used to find a good starting set of parameter values, after which a more precise exploration was done using the regular grid search approach. Due to the large amounts of data and the time restrictions of this project, it was essential to find a set of parameter values making a good tradeoff between accuracy and active training time.

4.7 Latent variables

The second part of this thesis covers the combinations of latent variables that can be derived from the dataset. It is explored how these variables can be used in combination with the distributed representations to increase prediction performance.

Generally, there are three different groups of latent variables that can be derived from a click log. These are Query-Document features, which are generated by aggregating statistics over the mappings between queries and their resulting documents; Query features, which merely rely on aggregating over the queries of the data source; and Document features, which instead rely on aggregating over the resulting documents.

Some examples of variables that can be generated in each class can be seen in the tables below. The work of Hu, N Liu, and Chen [7] inspired these variables.


Table 4.1: Statistical Query-Document features. Generated for a document d on SERPs of a query q.

# Clicks: Number of times document d is clicked given q
# Impressions: Number of times document d is impressed given q
CTR: # Clicks / # Impressions for d given q
First CTR: CTR on d when it is the first clicked document
Last CTR: CTR on d when it is the last clicked document
Only CTR: CTR on d when it is the only clicked document
AvgDwellTime: Average dwell time on d after being clicked given q
AvgPosition: Average ranking position of d given q

Table 4.2: Statistical Query features. Generated for a query q.

# Clicks: Number of clicks on all SERPs
# Shows: Number of times q is searched
AvgClickPosition: Average position of all clicks made in the SERPs
CTR_k: CTR for the rank position k of the query q (1 ≤ k ≤ 10)
AvgClickNum: Average number of clicks on each SERP

Table 4.3: Statistical Document features. Generated for the individual document d, irrespective of queries.

# Clicks: Number of clicks on d
# Impressions: Number of impressions of d
CTR: # Clicks / # Impressions
LastCTR: CTR on d when it is the last clicked document

Where CTR is the abbreviation for click-through rate, and an impression means that the document has been presented in a SERP.
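A sketch of deriving the first three query-document features of Table 4.1 from click-log sessions (the session tuple layout is an assumption for illustration):

```python
from collections import defaultdict

def query_document_features(sessions):
    # Each session is (query_id, [doc ids on the SERP], [0/1 clicks]).
    # Returns (# Clicks, # Impressions, CTR) per (query, document) pair.
    clicks = defaultdict(int)
    impressions = defaultdict(int)
    for query_id, doc_ids, clicked in sessions:
        for doc_id, c in zip(doc_ids, clicked):
            impressions[(query_id, doc_id)] += 1
            clicks[(query_id, doc_id)] += int(c)
    return {pair: (clicks[pair], n, clicks[pair] / n)
            for pair, n in impressions.items()}
```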


4.8 Distributed representations with latent variables

Once an acceptable selection of hyper-parameters had been identified for the base model, it was extended by incorporating a set of statistical variables into the vector state. In the context of this project, the number of variables tested had to be limited due to time restrictions. It was decided to focus on the query-document features from Table 4.1, due to them holding the most contextual information by mapping relations between queries and documents. The simple sets of individual document and query features are static regardless of context and thereby provide less information.

The idea was to extend the calculations of Equation 4.7 with combinations of latent variables, such that the next states would now be calculated as follows:

s_{r+1} = f_2(W_{ss} \cdot s_r + W_{is} \cdot i_r + W_{ds} \cdot d_{r+1} + W_{v^i s} \cdot v_r^i + b_2)   (4.9)

Where v_r^i represents some variable i from the set of variables in Table 4.1, and r denotes the (queryID, docID) pair at some rank in the SERP. W_{v^i s} is simply the corresponding weight matrix. This additional information was thought to make the neural networks more capable of identifying effective patterns, in hopes of increasing their predictive power.

4.9 Evaluation methodology

To effectively evaluate our neural models, the click prediction and relevance prediction tasks were considered. The neural models are all trained by maximizing the likelihood of observed click events. All models were trained on 10 partitions and evaluated on 2 partitions.

4.9.1 Click prediction

Click prediction is the task of predicting a user's clicks given a SERP. The metric used for this is described in section 2.6.1. For our neural models, the click probability is simply the output of the model, making it easy to incorporate in our evaluation process. The click probability of document d given a query q is formalized as the following, where r denotes the ranking position on the SERP:

P(C_r = 1 \mid q, d_r)   (4.10)

4.9.2 Relevance prediction

Relevance prediction is the task of predicting the relevance of a document given a query. The metric used for this is described in section 2.6.2. The relevance of a document d to a query q is estimated using the click probability of d when it appears in the first position of the SERP. This can be denoted as:

R(q, d) = P(C_1 = 1 \mid q, d_1)   (4.11)

4.10 Experimental setup

This section describes the experimental process used in this thesis.

It begins by reintroducing the research question, also found in section 1.2, followed by describing the experiments conducted in an attempt to answer it.

4.10.1 Research questions

The research question evaluated throughout this thesis is:

How can recurrent neural networks (RNNs) and latent variables derived from the data source be utilized together to effectively model click behavior in a web search system?

This question is evaluated using the following set of hypotheses.

H1 A deeply learned, distributionally represented click model performs better than its respective PGM version.

H2 Latent variables derived from click logs can be used to improve the performance of a deeply learned click model.
