Latent variable neural click models for web search

HENRIK SVEBRANT

KTH ROYAL INSTITUTE OF TECHNOLOGY

SCHOOL OF ELECTRICAL ENGINEERING AND COMPUTER SCIENCE


Latent variable neural click models for web search

HENRIK SVEBRANT

Master in Computer Science

Date: July 6, 2018

Supervisor: Jeanette Hällgren Kotaleski

Examiner: Elena Troubitsyna

Swedish title: Neurala klickmodeller med latenta variabler för webbsöksystem

School of Computer Science and Communication


Abstract

User click modeling in web search is most commonly done through probabilistic graphical models. Given the successful use of machine learning techniques in other fields of research, it is interesting to evaluate how machine learning can be applied to click modeling. In this thesis, modeling is done using recurrent neural networks trained on a distributed representation of the state-of-the-art user browsing model (UBM). It is further evaluated how extending this representation with a set of latent variables that are easily derivable from click logs can affect the model's prediction performance.

Results show that a model using the original representation does not perform very well. However, the inclusion of simple variables can drastically increase performance on the click prediction task, for which the model manages to outperform the two chosen baseline models, which are themselves already well performing. It also leads to increased performance on the relevance prediction task, although the results are not as significant. It can be argued that the relevance prediction task is not a fair comparison to the baseline models, since they need significantly larger amounts of data to learn the respective probabilities. However, it is favorable that the neural models manage to perform quite well using smaller amounts of data.

It would be interesting to see how well such models would perform when trained on far greater data quantities than what was used in this project, as well as tailoring the model for the use of LSTM, which could presumably increase performance even further. Evaluating other representations than the one used would also be of interest, as this representation did not perform remarkably on its own.


Sammanfattning

User click modeling in search systems is usually done with the help of probabilistic models. Due to the successes of machine learning in other areas, it is interesting to investigate how these techniques can be applied to click modeling. This thesis investigates click modeling using recurrent neural networks trained on a distributed representation of a popular and well-performing click model called the user browsing model (UBM). It is further investigated how extending this representation with statistical variables that can easily be extracted from click logs affects the model's performance.

The results show that the base representation does not perform particularly well. However, the use of simple variables has been shown to yield drastic performance increases when it comes to predicting a user's clicks. For this purpose, the models manage to outperform the two chosen baseline models, which are already well performing for the task. They have also managed to improve the models' ability to predict relevance, although the differences are not as drastic. Relevance does not constitute as fair a comparison against the baseline models, since these require much larger amounts of data to reach their true performance. It is, however, advantageous that the neural models reach relatively good performance for the amount of data used.

It would be interesting to investigate how these models would perform when trained on much larger amounts of data than what was used in this project, as well as tailoring the models for LSTM, which should be able to increase performance further. Evaluating other representations than the one used in this project is also of interest, since the representation used did not perform remarkably in its base form.


The following notation is used and referred to throughout the content of this thesis.

SERP Search Engine Results Page
PGM Probabilistic Graphical Model
CTR Click-Through Rate
UBM User Browsing Model
PSCM Partially Sequential Click Model
DR Distributed Representation
NDCG Normalized Discounted Cumulative Gain
ANN Artificial Neural Network
RNN Recurrent Neural Network
LSTM Long Short-Term Memory

E a user examines an object on a SERP
A a user is attracted by the object's representation
C an object is clicked
S a user's information need is satisfied
q query representation
u or d document representation
r rank position within a SERP


Contents

1 Introduction
1.1 Definitions
1.1.1 Web search engine
1.1.2 Search engine results page
1.1.3 Click log
1.1.4 Click models
1.1.5 Distributed representation
1.2 Problem definition
1.3 Delimitations
1.4 Ethics and sustainability
1.5 Thesis outline

2 Background
2.1 Click models
2.1.1 Random click model
2.1.2 Position based model
2.1.3 Cascade model
2.1.4 User browsing model
2.1.5 Dynamic Bayesian network model
2.1.6 Partially sequential click model
2.2 Artificial neural networks
2.3 Feedforward neural networks
2.4 Recurrent neural networks
2.4.1 Long short-term memory
2.5 Deep learning in practice
2.5.1 The learning process
2.5.2 Optimization
2.5.3 Regularization
2.6 Click model evaluation
2.6.1 Perplexity
2.6.2 Normalized discounted cumulative gain

3 Related Work
3.1 Neural models
3.2 Latent variable model
3.3 This project

4 Methodology
4.1 Dataset
4.2 Data processing
4.3 Baseline models
4.4 Distributionally represented user browsing model
4.5 Neural network configurations
4.5.1 Recurrent neural network
4.5.2 Long short-term memory
4.6 Learning on distributed models
4.6.1 Optimization of learning
4.6.2 Hyper-parameter selection
4.7 Latent variables
4.8 Distributed representations with latent variables
4.9 Evaluation methodology
4.9.1 Click prediction
4.9.2 Relevance prediction
4.10 Experimental setup
4.10.1 Research questions
4.10.2 Evaluation methodology

5 Results
5.1 Baseline click models
5.2 Neural click models
5.3 Click prediction results
5.3.1 Original representation
5.3.2 Extended representations
5.4 Relevance prediction results
5.4.1 Experiment results
5.4.2 Comparison of single-variable models
5.4.3 Comparison of two-variable models
5.5 Model comparisons
5.6 Long short-term memory performance

6 Discussion
6.1 Summary of findings
6.2 Future work

7 Conclusion

Bibliography

A Prediction accuracy results

B Perplexity results
B.1 Single-variable models
B.2 Two-variable models

C NDCG results


List of Figures

2.1 An example of a feedforward neural network with three input nodes, one hidden layer with four nodes and two output nodes.
2.2 An example of a recurrent neural network with three input nodes, one hidden layer with two nodes and two output nodes. The output layer is now connected to the hidden layer using feedback connections.
2.3 Early stopping. Training is interrupted when validation error starts to increase.
4.1 RNN model configuration. [3]
5.1 Average perplexity comparison between the original representation and single-variable variations.
5.2 Perplexity comparison between the original representation and single-variable variations, for ranks 1-5.
5.3 Perplexity comparison between the original representation and single-variable variations, for ranks 6-10.
5.4 Average perplexity comparison between the original representation and our two-variable variations.
5.5 Perplexity comparison between the original representation and our two-variable variations, for ranks 1-5.
5.6 Perplexity comparison between the original representation and our two-variable variations, for ranks 6-10.
5.7 NDCG comparison between the original representation and our one-variable variations.
5.8 NDCG comparison between the original representation and our two-variable variations.
5.9 Average perplexity comparison between the original representation and our best performing variations.
5.10 Average perplexity comparison between the baseline, original distributed and the two best variable models.
5.11 NDCG comparison between the original representation and our best variable variations.
5.12 Average perplexity comparison between the original representation and our best performing variations now tested with LSTM. Notation has been shortened: i = impressions, d = dwell.
5.13 NDCG comparison between the original representation and our best performing variations now tested with LSTM. Notation has been shortened: i = impressions, d = dwell.
5.14 Average perplexity comparison between the original representation and the three-variable variant as RNN and LSTM configurations.
5.15 NDCG comparison between the original representation and the three-variable variant using RNN and LSTM configurations.


List of Tables

4.1 Statistical Query-Document features. Generated for a document d on SERPs of a query q.
4.2 Statistical Query features. Generated for a query q.
4.3 Statistical Document features. Generated for the individual document d, irrespective of queries.
5.1 Baseline UBM click prediction results.
5.2 Baseline PSCM click prediction results.
5.3 Baseline models relevance prediction results.
5.4 Model parameters used for our experiments.
5.5 Click prediction results for the original distributed model.
5.6 One-variable model experiments. The variables are derived from query-document pairs.
5.7 Two-variable model experiments. The variables are derived from query-document pairs.
5.8 Click prediction results using query-document impressions, click-through rate and average dwelling time.
5.9 NDCG scores for the original representation and our variable variations. Results marked in bold highlight the best performing variations found in a variable set.
5.10 Perplexity gains of DR_dwell and DR_impr,CTR,dwell over our baseline and base representation models.
5.11 NDCG paired t-test results calculated per query session. Measuring significance levels of DR_dwell and DR_impressions,CTR,dwell against DR.
5.12 NDCG paired t-test results calculated per query session. Measuring significance levels of DR_lstm2_impr,CTR,dwell against DR and DR_lstm.
5.13 Perplexity gains of DR_lstm2_impr,CTR,dwell over our baselines and original representation variations.
A.1 Neural models prediction accuracy on the test data.
B.1 Click prediction results using query-document impressions.
B.2 Click prediction results using query-document clicks.
B.3 Click prediction results using query-document click-through rate.
B.4 Click prediction results using query-document average dwell time.
B.5 Click prediction results using query-document average position.
B.6 Click prediction results using query-document impressions and clicks.
B.7 Click prediction results using query-document impressions and click-through rate.
B.8 Click prediction results using query-document impressions and average dwelling time.
B.9 Click prediction results using query-document clicks and click-through rate.
B.10 Click prediction results using query-document clicks and average dwelling time.
B.11 Click prediction results using query-document click-through rate and average dwelling time.
C.1 NDCG scores for all experiments. Measured for ranks up to 1, 3, 5 and 10.


Introduction

It can be argued that as the amount of raw data being stored increases, so do the difficulties in finding the relevant data that suits one's interests and needs. Efficiently finding the correct data in a large quantity is a significant problem, especially in the information age we live in today.

It has many times been predicted that the total amount of data stored throughout society will grow exponentially toward 2020 and beyond, such that the size of the digital universe at least doubles every two years. [19]

The problem researched and evaluated in this thesis concerns information retrieval, in the field of web search systems.

When it comes to research and development in this field, many experiments involve users. Such experiments can differ greatly in size, ranging from small laboratory studies to web search systems with millions of real users. However, advances could not be made without some understanding of user behavior. As in many areas of science, this information can be modeled.

A user model concerning web search allows us to simulate user behavior on a search engine results page (SERP) using assumptions made about user behavioral traits. Such models are commonly called click models, as the main observed user interaction with search systems concerns the user's clicking behavior.

One motivation for click models is that they help in cases where there is a lack of real users to include in experiments for various reasons. Another motive concerns privacy and commercial constraints, as user interaction data is commonly restricted. In these cases, the use of simulated users is highly valuable.

Besides user simulation, click models can be used to improve document ranking and evaluation metrics, and to better understand users by inspecting the click models' respective parameters. [4]

Click models are commonly based upon the concept of probabilistic graphical modeling (PGM). In the last couple of years, there has been an increased effort to develop the concept even further using machine learning. This thesis attempts to combine common click models with the concept of deep learning, using a distributed representation of an existing click model. It is further evaluated how such a representation can be extended with latent variables derived from the click logs in use, and whether such extensions can improve the model's prediction performance.

1.1 Definitions

This section introduces a few central definitions of concepts covered in this report.

1.1.1 Web search engine

A web search engine is a software system designed to search for in- formation on the internet. Most commonly, the search results are pre- sented in a vertical list referred to as a Search engine results page. One of the most commonly known examples of a web search engine is Google.

1.1.2 Search engine results page

A search engine results page is the page displayed by a web search engine in response to a query given by the user. The actual type of the results depends on the web search engine system itself. It commonly contains links to relevant websites, documents or pictures. A search engine results page is referred to as a SERP throughout this thesis.

1.1.3 Click log

A click log is a dataset that contains information about the click behavior of a search engine's users. It commonly contains fields such as a session id, query terms and the corresponding documents returned for a search. Most importantly, it contains information on which results were clicked and when. Other examples are pieces of information more specific to the user, such as their region.

1.1.4 Click models

A click model simulates user behavior on a SERP, based on different assumptions made on the user’s behavior. The search engine presents the user with a SERP, which contains a set of objects which may be directly related to the queried subject, or less so. The user examines the set of results and possibly clicks on one or more from this set. The search is abandoned either by issuing a new query or by ending the interaction with the SERP. The ongoing events between the issuing of a query and abandonment are called a session. A detailed explanation of some common click models is presented in section 2.1.

1.1.5 Distributed representation

A distributed representation (DR) is the concept of expressing a particular object or concept as a set of values represented in a vector of length N. A practical value of a distributed representation is its ability to capture similarity between different concepts and concept combinations.
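As a toy sketch of this idea (the queries and vector values below are invented purely for illustration), similarity between distributed representations can be measured with cosine similarity:

```python
import math

# Toy distributed representations (N = 4) for three queries.
# The vectors are invented for illustration only.
reps = {
    "cheap flights": [0.9, 0.1, 0.3, 0.0],
    "budget airfare": [0.8, 0.2, 0.4, 0.1],
    "python tutorial": [0.0, 0.9, 0.1, 0.8],
}

def cosine(a, b):
    """Cosine similarity between two vectors of equal length."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(x * x for x in b))
    return dot / (na * nb)

# Related queries end up close in the vector space; unrelated ones do not.
sim_related = cosine(reps["cheap flights"], reps["budget airfare"])
sim_unrelated = cosine(reps["cheap flights"], reps["python tutorial"])
```

A learned representation would place queries and documents with similar click behavior near each other in the same way.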


1.2 Problem definition

This thesis investigates the following question:

How can recurrent neural networks (RNNs) and latent variables de- rived from the data source be utilized together to model click behavior in a web search system effectively?

This question is examined by expressing a state-of-the-art PGM-based click model with a distributed representation that can be used to train a recurrent neural network. This representation is extended with a set of latent variables that can easily be derived from click logs. The objective is to construct a model that performs well on both click prediction and relevance prediction when compared to the selected baseline PGM-based versions.

These steps can be expressed as a set of hypotheses that are evaluated in the course of this project:

• A deeply learned, distributionally represented click model performs better than its respective PGM version.

• Latent variables derived from click logs can be used to improve the performance of a deeply learned click model.

1.3 Delimitations

This thesis will not be able to capture the contents of the documents being clicked on. This restriction is due to the data used in this project being fully anonymized, meaning no such information is retrievable.

1.4 Ethics and sustainability

User click modeling in web search does not necessarily have to raise ethical concerns. It is somewhat more likely that the concept is motivated by ethical reasons, as user browsing data is commonly restricted and not publicized. User simulation through click models is not personalized, however, and in its basic form does not include personal data. The actual queries sent to the search engine can also be anonymized rather easily.

The application of machine learning for click modeling does not result in additional ethical restrictions in this case. However, machine learning methods should continually be applied with care, as they constitute a sensitive subject that can affect societies significantly.

The concept does not have any significant effect concerning sustainability. As click models are mainly used in the context of research and development, they have little effect on society's sustainable future.

Their use can, however, be argued to allow for experiments to use smaller user groups, resulting in less traveling required in such cases.

1.5 Thesis outline

The report begins by describing the concept of click modeling and common metrics used to evaluate click models' performance. In addition, it provides some background on the concept of artificial neural networks. A set of related works is presented, introducing what other attempts have been made in the area during recent years. Following these works, the methodology used to answer the chosen research question is described. After the methodology, the experiments' corresponding results are presented. The report ends with a discussion of the results and concluding remarks.


Background

In this chapter, the underlying knowledge required to understand the problem is presented. The chapter begins by describing the concept of click modeling. Following these descriptions, formal definitions of a set of standard and state-of-the-art click models are presented. The chapter continues by explaining the idea of neural networks and the click model evaluation metrics that are used throughout this project.

2.1 Click models

As previously described in section 1.1.4, a click model describes the behavior of a user while browsing results on a search engine results page. When a user issues a query to the search engine, it responds with a SERP containing information based on the user's described needs. The user then examines the list of resulting documents and may choose to click on one or more of the results. The user may also choose to abandon the search session entirely. The process between query and abandonment is called a session.

Click models treat the search behavior of users as a sequence of observable and hidden events. These are described by binary values X, where X = 1 means that the event has occurred and X = 0 means that it has not. The main events considered by most click models are the following:


E : a user examines an object on a SERP.

A : a user is attracted by the object’s representation in the SERP.

C : an object is clicked.

S : a user’s information need has been satisfied, and the query session can conclude.

The models define dependencies between these events to estimate the probabilities of their corresponding random variables. Some probabilities are treated as parameters and depend on features of a SERP and a user's query. [4]

The following sections introduce a selection of common and state of the art PGM-based click models.

2.1.1 Random click model

The random click model (RCM) is the simplest click model, having only one parameter. It is defined as follows:

P(C_u = 1) = ρ    (2.1)

This formula means that every document u has the same probability of being clicked, and this probability is a model parameter ρ. As the model only has one parameter, it can simply be estimated using Maximum Likelihood Estimation (MLE).

Although this model is very simplistic, its performance is often used as a baseline when comparing other models. It can also be assumed to be safe from overfitting, since it only has one parameter. [4]
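As a sketch of how ρ could be estimated from a click log (the sessions below are invented for illustration), the MLE is simply the overall fraction of shown documents that were clicked:

```python
# Hypothetical click log: each session is a list of 0/1 click
# indicators, one per displayed document on the SERP.
sessions = [
    [1, 0, 0, 0, 0],
    [0, 1, 0, 0, 0],
    [1, 0, 1, 0, 0],
    [0, 0, 0, 0, 0],
]

# MLE for the single RCM parameter rho: total clicks divided by
# the total number of displayed documents, across all sessions.
clicks = sum(sum(s) for s in sessions)
shown = sum(len(s) for s in sessions)
rho = clicks / shown  # P(C_u = 1) for every document u
```

Because a single scalar summarizes the whole log, the model cannot distinguish documents, queries or ranks, which is why it serves only as a baseline.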

2.1.2 Position based model

Many models include a so-called examination hypothesis, formally de- scribed as:

C_u = 1 ⇔ E_u = 1 ∩ A_u = 1    (2.2)

This hypothesis means that a user clicks a document u if, and only if, the user both examined the document and was attracted by it. The random variables E_u and A_u are usually considered independent.


The position-based model (PBM) incorporates the assumption that the probability of a user examining a document u, given a query q, depends heavily on its rank or position on the SERP. This probability typically decreases with rank, i.e., for result positions further down the list.

The model incorporates this into a set of parameters. The examination probability at rank r is represented by γ_r, while α_uq represents the attraction probability. This model allows for sessions where more than one click event has occurred. [4]

P(C_u = 1) = P(E_u = 1) · P(A_u = 1)    (2.3)

P(A_u = 1) = α_uq    (2.4)

P(E_u = 1) = γ_{r_u}    (2.5)
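A minimal sketch of equations 2.3-2.5 in code; all parameter values below are invented for illustration, whereas a real PBM would estimate them from a click log:

```python
# gamma[r] is the examination probability at rank r (1-indexed),
# alpha[(u, q)] the attraction probability of document u for query q.
gamma = {1: 0.95, 2: 0.80, 3: 0.60, 4: 0.40, 5: 0.25}
alpha = {("doc_a", "q1"): 0.7, ("doc_b", "q1"): 0.5}

def pbm_click_prob(u, q, r):
    """P(C_u = 1) = gamma_r * alpha_uq under the PBM (eq. 2.3)."""
    return gamma[r] * alpha[(u, q)]

# The same document is predicted to be clicked less often
# when shown further down the SERP.
p_top = pbm_click_prob("doc_a", "q1", 1)  # 0.95 * 0.7
p_low = pbm_click_prob("doc_a", "q1", 5)  # 0.25 * 0.7
```

Note how the rank effect (γ) and the document effect (α) factorize, which is exactly the independence assumption stated in equation 2.2.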

2.1.3 Cascade model

The cascade model (CM) works on the assumption that the user scans the documents listed on the SERP sequentially, from top to bottom, until they find a relevant document. This rests on the foundational assumption that the top-ranked document u_1 is always examined, whereas a document u_r with r ≥ 2 is examined if, and only if, the previous document u_{r-1} was examined and not clicked. This assumption, combined with the examination assumptions from equations 2.3 and 2.4, yields the cascade model:

C_r = 1 ⇔ E_r = 1 ∩ A_r = 1    (2.6)

P(A_r = 1) = α_{u_r q}    (2.7)

P(E_1 = 1) = 1    (2.8)

P(E_r = 1 | E_{r-1} = 0) = 0    (2.9)

P(E_r = 1 | C_{r-1} = 1) = 0    (2.10)

P(E_r = 1 | E_{r-1} = 1, C_{r-1} = 0) = 1    (2.11)

This model can only describe sessions with one click and cannot explain non-linear examination patterns. [4]
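The cascade assumptions above can be sketched as a simple session simulation (the attractiveness values are invented for illustration):

```python
import random

def simulate_cascade(attractiveness, rng):
    """Simulate one cascade-model session (eqs. 2.6-2.11).

    The user scans top to bottom; each examined document is clicked
    with its attractiveness probability, and the first click ends the
    session (eq. 2.10). Returns the 0/1 click vector for the SERP.
    """
    clicks = [0] * len(attractiveness)
    for r, a in enumerate(attractiveness):
        if rng.random() < a:  # attracted => clicked (eq. 2.6)
            clicks[r] = 1
            break             # no examination after a click (eq. 2.10)
    return clicks

rng = random.Random(42)
# Invented attractiveness values alpha_{u_r q} for a 5-document SERP.
session = simulate_cascade([0.3, 0.5, 0.2, 0.1, 0.1], rng)
```

By construction, the simulated session never contains more than one click, mirroring the model's stated limitation.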


2.1.4 User browsing model

The user browsing model (UBM) is an extension of the PBM described in section 2.1.2, and includes some elements of the CM described in section 2.1.3. The model is based on the idea that the examination probability should take previous clicks into account, albeit remain mainly position-based. It depends not only on the rank r of a document, but also on the rank r' of the previously clicked document. This can be formalized as:

P(E_r = 1 | C_1 = c_1, ..., C_{r-1} = c_{r-1}) = γ_{rr'}    (2.12)

where r' is the rank of the previously clicked document, or zero if none has been clicked. In other words:

r' = max{k ∈ {0, ..., r-1} : c_k = 1}    (2.13)

where c_0 is set to 1 for convenience. An alternative formulation of 2.12 is as follows:

P(E_r = 1 | C_{<r}) = P(E_r = 1 | C_{r'} = 1, C_{r'+1} = 0, ..., C_{r-1} = 0) = γ_{rr'}    (2.14)

[4], [5]
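A small sketch of equations 2.12-2.13 in code; the γ values below are invented, whereas a real UBM would estimate them from click logs:

```python
def previous_click_rank(clicks_before_r):
    """Rank r' of the most recently clicked document (eq. 2.13).

    clicks_before_r is the 0/1 click vector c_1..c_{r-1};
    returns 0 if nothing has been clicked yet (c_0 = 1 by convention).
    """
    r_prime = 0
    for k, c in enumerate(clicks_before_r, start=1):
        if c == 1:
            r_prime = k
    return r_prime

# Hypothetical UBM examination parameters gamma[(r, r_prime)];
# the values are invented for illustration.
gamma = {(4, 0): 0.5, (4, 2): 0.7}

clicks = [0, 1, 0]                     # observed clicks at ranks 1..3
r_prime = previous_click_rank(clicks)  # most recent click was at rank 2
p_exam = gamma[(4, r_prime)]           # P(E_4 = 1 | C_1..C_3), eq. 2.12
```

The distance between r and r' is what lets UBM model the intuition that users examine documents close below their last click more often.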

2.1.5 Dynamic Bayesian network model

The dynamic Bayesian network model (DBN) is an extension of the CM described in section 2.1.3. It works on the assumption that the user's perseverance after a click depends only on the actual relevance σ_uq, also called the satisfaction probability, instead of the perceived relevance α_uq. The model can be described as follows:

C_r = 1 ⇔ E_r = 1 ∩ A_r = 1    (2.15)

P(A_r = 1) = α_{u_r q}    (2.16)

P(E_1 = 1) = 1    (2.17)

P(E_r = 1 | E_{r-1} = 0) = 0    (2.18)

P(S_r = 1 | C_r = 1) = σ_{u_r q}    (2.19)

P(E_r = 1 | S_{r-1} = 1) = 0    (2.20)

P(E_r = 1 | E_{r-1} = 1, S_{r-1} = 0) = γ    (2.21)

where γ is the continuation probability for a user who either clicked a document and was not satisfied by it, or did not click any document at all. [4]

2.1.6 Partially sequential click model

The partially sequential click model (PSCM) was introduced by Wang et al. [20] as an attempt to model less sequential user behavior, and has demonstrated great performance. Using the timestamps of the click log in use, the click sequence is organized as C = {C_1, C_2, ..., C_t, ..., C_T}, where t is the relative temporal order of a click and C_t records the result position of the t-th click, with 1 ≤ C_t ≤ M, where M represents the number of documents considered on the SERP, commonly set to 10.

The model is based on two assumptions, the first being the first-order click hypothesis, also used in the UBM model described in section 2.1.4 and the DBN model from section 2.1.5. Through this assumption, the model assumes that the click event at time t+1 is determined only by the click event at time t. This allows the model to divide a click sequence into sub-sequences, or adjacent click pairs, {[C_0, C_1], ..., [C_{t-1}, C_t], ..., [C_T, C_{T+1}]}, where C_0 represents the beginning of the search process and C_{T+1} represents the end.

Additionally, according to the locally unidirectional examination assumption, given an observation of adjacent clicks at a point in time, users tend to examine results without any directional changes, meaning that they follow the path from m to n, where m < n, without deviation. The examination and click sequences between C_{t-1} and C_t can be noted as {E_m, ..., E_j, ..., E_n} and {C_m, ..., C_j, ..., C_n}, respectively. Note that in the adjacent click sequence, only C_m and C_n can have a value of 1; the other positions on the path have value 0. The model can be described as follows:

P(C_t | C_{t-1}, ..., C_1) = P(C_t | C_{t-1})    (2.22)

P(C_t = n | C_{t-1} = m) = P(C_m = 1, ..., C_i = 0, ..., C_n = 1)    (2.23)

P(E_i = 1 | C_{t-1} = m, C_t = n) = γ_{imn} if m ≤ i ≤ n or n ≤ i ≤ m, and 0 otherwise    (2.24)

C_i = 1 ⇔ E_i = 1 ∩ R_i = 1    (2.25)

P(R_i = 1) = α_uq    (2.26)

Equation 2.22 denotes the first-order click hypothesis, and equation 2.23 encodes the unidirectional examination assumption by restricting the examination process to one way, from m to n. The examination probability of E_i is defined as in equation 2.24, since the examination behavior between adjacent clicks may not follow the cascade assumption. The probability of examination depends on the positions of the clicks, similar to UBM but not restricted to strictly sequential behavior. PSCM also follows the examination hypothesis, just like most click models do, as described in equation 2.25. In this notation, the variable α_uq corresponds to the relevance of the document u to the query q, and the variable γ_{imn} represents the examination transition probability in either the upward or downward direction.

2.2 Artificial neural networks

Artificial neural networks (ANNs) are a family of computational methods attempting to model the information processing capabilities found in the biological brain. The concept has been around since 1943 and has since been developed to be increasingly sophisticated [16]. An artificial neural network requires much computational power to achieve good performance. Because this computational power was not available until the last decade, the concept did not gain real traction until recently, made possible by the rapid increase in computational capabilities through cheaper and more efficient hardware.

Modeling methods based on ANNs have since proven successful in many complex tasks, such as speech recognition, translation and image classification.

The goal of an ANN can be simplified to the approximation of some given function using a set of model parameters θ. A common example is a simple classifier where y = f*(x) maps an input x to a category y. An ANN in this context defines a mapping y = f(x; θ) and learns the values of the parameters θ that result in the best function approximation. [6]


2.3 Feedforward neural networks

A feedforward neural network is one of the simplest types of networks in the ANN family. The model typically consists of one input layer, one output layer and some number of intermediate hidden layers. Figure 2.1 serves as an architectural example. The model is called feedforward as information flows through the model in only one direction: from the input layer, through the hidden computational layers, and ending in the output layer, which produces the actual prediction. [6]

Figure 2.1: An example of a feedforward neural network with three input nodes, one hidden layer with four nodes and two output nodes.

As the information flows through the network, the input values are multiplied by a set of weight values at each step. During training, the predictions are compared against the expected output values, which are used to calculate the training error using an appropriate error function. The training error is used to tune the weights in an attempt to increase the network's prediction accuracy. When the output is compared with expected values, the process is referred to as supervised learning. [16]

2.4 Recurrent neural networks

The family of recurrent neural networks (RNNs) is an extension of the traditional feedforward networks in the sense that they include feedback connections, no longer limiting the information to flow in one predetermined direction. Figure 2.2 serves as an architectural example. Through the feedback connections, the outputs of the model are fed back into itself. Training such networks relies on backpropagation, which is commonly used when training models with more than one hidden layer. Such models are often referred to as deep neural networks.

Figure 2.2:An example of a recurrent neural network with three input nodes, one hidden layer with two nodes and two output nodes. The output layer is now connected to the hidden layer using feedback connections.

During backpropagation, the error is fed back into the network and used to adjust the weights, attempting to reduce the error and increase prediction performance. This is a crucial step when training deep networks. Recurrent networks are made possible by the idea of parameter sharing between different parts of the model. This concept enables the possibility to extend and apply the model to examples of different forms and to generalize across them. Where a traditional fully connected feedforward network would require separate parameters for each input feature, a recurrent neural network shares the same weights across several time steps. [6]

After multiple samples have been used to learn a good set of weight values, the network should be able to produce accurate predictions for new samples. Overfitting may occur, however, meaning that the network has been trained too specifically to the training data and does not generalize well enough to predict new data. [16]

One of the most crucial parameters of the model is the learning rate. It decides how quickly the weights should be altered. A small learning rate leads to slower learning, while too large a value may prevent the network from converging if the optimal weight values are accidentally skipped.


2.4.1 Long short-term memory

Long short-term memory, commonly referred to as LSTM, is an extension to the common concept of RNN models.

The model not only allows for self-loops in which the gradient can flow for long durations; through the introduction of gates, the weight of the loops can be conditioned on the context rather than being a fixed feature of the model. The gates themselves are controlled by a hidden unit, which allows the time scale at which new knowledge is integrated to be set dynamically. The use of LSTM has proven highly successful in many applications and can be used to mitigate problems that may occur in regular RNNs, such as exploding or vanishing gradients. [6]

2.5 Deep learning in practice

This section covers some crucial parts to consider when training a deep neural network model.

2.5.1 The learning process

One important aspect when dealing with models consisting of complex neural networks is how to select the model's respective parameter values. The strategy of deep learning is to learn them using iterative, gradient-based optimizers that drive the error function down to very low values. To effectively apply gradient-based learning, a sufficient loss function must be chosen, as well as how to represent the output of the model. Depending on the context, the loss function may also be called the cost or error function.

Activation functions

The choice of non-linear transformation or activation functions has a significant effect on the training and task performance when dealing with neural networks. Some popular and widely used activations are the sigmoid, hyperbolic tangent (tanh) and rectified linear unit (ReLU) functions. ReLU is currently one of the most successful and widely used activations [15], defined as:

f(x) = \max(x, 0)   (2.27)

The sigmoid activation function is effective for transforming the final output to a probability in the range of [0, 1]. For this reason, sigmoid is a common choice for the output layer of systems that are to output a probability distribution. It may, however, result in optimization difficulties if used as the top hidden layer in deep neural networks [1]. The sigmoid activation function is defined as:

f(x) = \frac{1}{1 + e^{-x}}   (2.28)
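As a minimal, library-free sketch (not taken from the thesis implementation), the two activations in Equations 2.27 and 2.28 can be written directly as:

```python
import math

def relu(x):
    # Rectified linear unit (Eq. 2.27): passes positive values through,
    # zeroes out negatives.
    return max(x, 0.0)

def sigmoid(x):
    # Sigmoid (Eq. 2.28): squashes any real input into the range (0, 1).
    return 1.0 / (1.0 + math.exp(-x))
```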

Weight initialization

Weights have to be initialized with care to break the symmetry between hidden units of the same layer. There exists a set of common weight initialization strategies. However, initialization using a zero-mean Gaussian with a small standard deviation of around 0.1 often performs well enough. Biases and weights in connection to the output layer are generally safe to be initialized to zero. [1]
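A sketch of the initialization strategy described above, using NumPy (the function name and layer sizes are illustrative, not from the thesis):

```python
import numpy as np

rng = np.random.default_rng(0)

def init_layer(n_in, n_out, std=0.1):
    # Zero-mean Gaussian weights with a small standard deviation break
    # the symmetry between hidden units; biases can safely start at zero.
    W = rng.normal(0.0, std, size=(n_in, n_out))
    b = np.zeros(n_out)
    return W, b
```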

2.5.2 Optimization

The optimization problem is, in the context of neural networks, the problem of finding the best model parameters θ that significantly reduce the loss function, also called the cost or error function. The loss function is the measure of how well a model performs for the given training sample and the expected output. It may also depend on variables such as weights or biases.

Stochastic gradient descent

Stochastic gradient descent (SGD) is the most commonly used optimization algorithm. The algorithm tries to find minima or maxima by iteration. It takes one important parameter, the learning rate, which is usually decreased gradually over time until some particular point in time, after which it is left constant. The reason for this is the source of noise introduced by the SGD gradient estimator, which does not vanish even when we arrive at a minimum. [6]
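A one-dimensional sketch of this schedule (the decay rule, constants and test function are illustrative assumptions, not the thesis configuration): the learning rate decays linearly for the first `decay_until` steps, then is held constant.

```python
def sgd_minimize(grad, x0, lr0=0.2, decay_until=50, floor=0.1, steps=100):
    # Plain (deterministic) gradient descent with the learning-rate
    # schedule described above: decay gradually, then hold constant.
    x = x0
    for t in range(steps):
        lr = lr0 * max(1.0 - t / decay_until, floor)
        x -= lr * grad(x)
    return x
```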

Adaptive learning

Adaptive learning is the process of gradually altering the learning rate during training. There exists a handful of algorithms used for this purpose. An example of such an algorithm is the Adadelta optimizer, introduced by Zeiler [21]. This optimizer requires minimal additional computational overhead when compared to the conventional SGD algorithm. While it offers a set of additional hyperparameters to be set, it has been found that their selection does not significantly alter the final results.
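A one-parameter sketch of the Adadelta update rule from Zeiler's paper, using running averages of squared gradients and squared updates (the objective in the usage below is an arbitrary illustration):

```python
import math

def adadelta_minimize(grad, x0, rho=0.95, eps=1e-6, steps=2000):
    # Adadelta: the step size is the ratio of the running RMS of past
    # updates to the running RMS of past gradients, so no global
    # learning rate has to be tuned by hand.
    x, eg2, edx2 = x0, 0.0, 0.0
    for _ in range(steps):
        g = grad(x)
        eg2 = rho * eg2 + (1.0 - rho) * g * g
        dx = -math.sqrt(edx2 + eps) / math.sqrt(eg2 + eps) * g
        edx2 = rho * edx2 + (1.0 - rho) * dx * dx
        x += dx
    return x
```

Note how eps both conditions the square roots and bootstraps the very first update, when both running averages are still zero.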

Difficulties in optimization

Plateaus, saddle points and flat regions

The problem of neural network optimization is often framed as the problem of finding a local minimum, but it is not necessarily restricted to local minima discovery. It is recommended to plot the norm of the gradient over time. The reason for this is other shapes that become more prevalent in higher dimensions, such as saddle points and other critical points. [6]

Vanishing and exploding gradients

Vanishing and exploding gradients are other types of problems that occur in neural networks. The vanishing gradient problem occurs when the gradient becomes so small that the weights are unable to be updated, in some cases even causing the network to stop training completely.

Error gradients can also accumulate during learning, resulting in the exploding gradient problem. Exploding gradients mean that large gradient values are used to update the weights, causing an unstable network. [6]

The use of LSTM can reduce these potential problems in the case of recurrent neural networks. [3]


2.5.3 Regularization

A central problem in deep learning is how to construct an algorithm that performs well not only on the training data, but also on new inputs. Regularization techniques are strategies in machine learning that are designed to reduce test set error, possibly at the expense of increased training error.

Some approaches put extra constraints on the model itself, such as adding restrictions on the parameter values. Others add additional terms in the objective function corresponding to a soft constraint on the parameter values.

With regards to deep learning applications, most regularization strategies are based on regularizing estimators. Essentially, they work by increasing bias for a reduced variance. A well-performing regularizer is one that makes a profitable trade, greatly reducing variance while not increasing the bias too much. [6]

Below are a set of popular regularization techniques.

Clipping gradients

Practitioners have used the idea of clipping the gradient for many years. It is a regularization method that works to mitigate the exploding and vanishing gradients problem. It involves clipping the norm of the gradient when it has become too large, just before the following parameter update. This is motivated by the assumption that when gradients explode, so do the curvature and higher-order derivatives [12]. This idea can also be applied element-wise; however, clipping the norm is the more popular approach. [6]
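A sketch of norm clipping (the threshold value is arbitrary):

```python
import numpy as np

def clip_by_norm(grad, threshold=1.0):
    # Rescale the whole gradient vector when its norm exceeds the
    # threshold; the direction is preserved, only the magnitude shrinks.
    norm = float(np.linalg.norm(grad))
    if norm > threshold:
        return grad * (threshold / norm)
    return grad
```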

Dropout

Standard backpropagation in RNNs can cause the model to overfit to the training data and not generalize well. An idea to mitigate this is the use of dropout. It works by the principle of making a particular hidden unit unreliable, meaning that hidden units may be ignored during a training phase. Selections are random, based on chosen probabilities. One drawback to using dropout is that it can increase training time quite heavily: a neural network utilizing dropout generally takes 2 to 3 times longer to train. This is caused by the noise that the parameter updating process introduces. [18]
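A sketch of the mechanism using "inverted" dropout, the variant most libraries implement (survivors are scaled at training time so the expected activation is unchanged); the keep probability is an arbitrary choice:

```python
import numpy as np

rng = np.random.default_rng(42)

def dropout(activations, keep_prob):
    # Zero each unit with probability 1 - keep_prob and scale the
    # survivors by 1 / keep_prob, keeping the expected value unchanged.
    mask = rng.random(activations.shape) < keep_prob
    return activations * mask / keep_prob
```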

Early stopping

Early stopping is likely one of the most commonly used forms of regularization in deep learning. It is often thought of as a very effective hyperparameter selection algorithm, especially concerning the number of training epochs. It requires almost no change in the foundational training process, the objective function or the set of parameter values allowed. It works by interrupting the training procedure once the model's performance on the validation set has become worse, visualized in Figure 2.3. The validation set is a set of examples that are never used for learning, but is representative of future test examples. [6] [13]

Figure 2.3: Early stopping. Training is interrupted when validation error starts to increase.
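A sketch of the stopping rule in its common patience-based form (function and parameter names are illustrative): training stops once the validation loss has failed to beat the best value seen for a fixed number of consecutive epochs.

```python
def early_stopping_epoch(val_losses, patience=3):
    # Walk through validation losses epoch by epoch and stop once the
    # best value has not improved for `patience` consecutive epochs.
    # Returns the epoch of the best model and its loss.
    best_loss, best_epoch, waited = float("inf"), -1, 0
    for epoch, loss in enumerate(val_losses):
        if loss < best_loss:
            best_loss, best_epoch, waited = loss, epoch, 0
        else:
            waited += 1
            if waited >= patience:
                break
    return best_epoch, best_loss
```

In practice the parameters saved at the best epoch would then be restored.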

2.6 Click model evaluation

This section presents the metrics used for evaluating the model’s click and relevance prediction performance. Both metrics are widely adopted for this particular area of research.


2.6.1 Perplexity

To assess our model's click prediction performance, the click perplexity metric is used, as introduced by Dupret and Piwowarski [5]. This metric measures how "surprised" the model is upon observing a document. The higher the value, the worse the model, with an optimal value of 1. The perplexity of the random click model described in section 2.1.1 is 2, meaning that a realistic model should have a value in the range of [1, 2]. The metric is calculated as follows:

p_r(M) = 2^{-\frac{1}{|S|}\sum_{s \in S}\left(c_r^{(s)}\log_2 q_r^{(s)} + (1 - c_r^{(s)})\log_2(1 - q_r^{(s)})\right)}   (2.29)

Where q_r^{(s)} is the probability of a user clicking the document at rank r in the session s, as predicted by the model M. In other words:

q_r^{(s)} = P_M(C_r | q, u)   (2.30)

It is possible to calculate the perplexity averaged across ranks to get an overall measure of model quality, done by the following formula:

p(M) = \frac{1}{n}\sum_{r=1}^{n} p_r(M)   (2.31)

When comparing the perplexity scores of two models A and B, the perplexity gain of A over B can be calculated as follows:

\mathrm{gain}(A, B) = \frac{p_B - p_A}{p_B - 1}   (2.32)

The perplexity is typically higher for top documents and decreases toward the bottom of a SERP, because top documents typically get more clicks. Since it is more difficult to predict a click than to predict its absence, top documents are demanding for a click model to get right. [4]
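A sketch of Equations 2.29 and 2.31 on toy data (the session encoding is an assumption for illustration: one 0/1 click list and one predicted-probability list per session):

```python
import math

def rank_perplexity(clicks, probs, rank):
    # Eq. 2.29: clicks[s][rank] is the observed click (0 or 1) and
    # probs[s][rank] the model's click probability at that rank.
    total = sum(c[rank] * math.log2(q[rank])
                + (1 - c[rank]) * math.log2(1.0 - q[rank])
                for c, q in zip(clicks, probs))
    return 2.0 ** (-total / len(clicks))

def mean_perplexity(clicks, probs, n_ranks):
    # Eq. 2.31: average the per-rank perplexities across ranks.
    return sum(rank_perplexity(clicks, probs, r)
               for r in range(n_ranks)) / n_ranks
```

For a model that always predicts 0.5, every session contributes log2(0.5) = -1 at each rank, recovering the perplexity of 2 of the random click model mentioned above.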

2.6.2 Normalized discounted cumulative gain

Normalized discounted cumulative gain (NDCG) is a measure of ranking quality, standardized in information retrieval and introduced by Järvelin and Kekäläinen [8]. In this thesis, it is used to assess the relevance predictions given by our click models.


The following formula describes the normalized discounted cumulative gain at rank position p, followed by the definitions of its respective numerator and denominator.

NDCG_p = \frac{DCG_p}{IDCG_p}   (2.33)

DCG_p = \sum_{i=1}^{p} \frac{2^{rel_i} - 1}{\log_2(i + 1)}   (2.34)

IDCG_p = \sum_{i=1}^{|REL|} \frac{2^{rel_i} - 1}{\log_2(i + 1)}   (2.35)

The variable rel_i represents the graded relevance of the result at position i, and REL represents the list of relevant documents in the corpus up to position p, ordered by their relevance. The NDCG values for all queries can be averaged in order to obtain a measure of the average performance of an algorithm. Note that for a perfect ranking algorithm, DCG_p will be the same as the ideal, resulting in an NDCG of 1. Results are normally values on the interval [0, 1].
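A sketch of Equations 2.33 to 2.35 where, as a simplifying assumption, the ideal ranking is computed from the same relevance list sorted in decreasing order:

```python
import math

def dcg(relevances):
    # Eq. 2.34: graded relevance discounted by the log2 of the
    # 1-based rank position.
    return sum((2.0 ** rel - 1.0) / math.log2(i + 2)
               for i, rel in enumerate(relevances))

def ndcg(ranked_relevances):
    # Eq. 2.33: DCG of the produced ranking divided by the ideal DCG.
    ideal = dcg(sorted(ranked_relevances, reverse=True))
    return dcg(ranked_relevances) / ideal if ideal > 0 else 0.0
```

With the binary relevance labels used in this thesis, rel_i is simply 0 or 1, and the graded formula reduces accordingly.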


Related Work

This chapter provides an overview of some of the related work that has previously been done in the field of click modeling for web search.

Most of the work referred to utilizes deep learning techniques for either sponsored search or regular web search processes, while one brings up another interesting approach that has affected the work done in this project. The chapter ends by describing what specific information has been used in the course of this project.

3.1 Neural models

A variety of previous work has been done when it comes to applying neural networks to click modeling. Zhang et al. [22] (2013) proposed a click model based on RNNs for sponsored search that directly models the user's current and historical sequential behaviors into a click prediction process. They construct features based on ad impressions for both the training and testing process.

Liu et al. [9] (2015) proposed a model based on convolutional neural networks for sponsored search. Their model utilizes two sets of convolutional and flexible pooling layers, ending with a fully connected output layer. They evaluate significant features of ad impressions and the subsequent impression history to improve the click prediction accuracy.



Liu et al. [10] (2017) proposed a CNN-based click model for web search, which incorporates document content information and the context of the SERP. The proposed network utilizes a single wide convolutional layer, followed by non-linearity and max pooling. After this, the previously generated query and result-feature vectors are used to compute content and context similarities. The features are processed by a set of joint and hidden layers, finally used to make relevance and click probability predictions. Their evaluation section includes the UBM and PSCM click models and also the RNN model of Zhang et al. [22]. This RNN model performed worse than the baseline version of PSCM, whereas the CNN version of PSCM was shown to perform very well.

Borisov et al. [3] (2016) proposed an RNN-based click model for web search which utilizes a distributed representation. This report is the main inspiration for the work done in this thesis. They do not base their representation on the previously known PGM-based click models, although they provide examples of how the UBM and DBN models can be translated for this purpose. Instead, their distributed representations are implemented using three sets of representations for a query q, document d and user interaction i. Each set considers varying amounts of query sessions: either query-document pairs, all query sessions generated by the given query q, or all query sessions given query q whose SERP contains the document d. They found that their distributed models, specifically the one using long short-term memory (LSTM), perform better than the baseline models used.

3.2 Latent variable model

The work of Hu, N. Liu, and Chen [7] is the second biggest inspiration for this thesis. Their approach is not based on neural networks in the same way as the work mentioned above. Instead, they introduce the interesting techniques of feature generation and feature augmentation through latent variables and combine them with existing click models. This report is inspirational mainly for the collection of potential statistical features that can be generated from the click logs.


3.3 This project

This project can be viewed as a combination of the distributed representation concept found in [3] with a set of statistical features found in [7] that can be generated from click logs. Specifically, the distributed representation of UBM, exemplified by [3], is used individually and in combination with various sets of statistical features. The rather simple representation of the UBM model is the primary motivation for its use.

It is interesting to evaluate how such a simple distributed representation can be applied for the use of click modeling, and whether its performance is improved when incorporating statistical features.


Methodology

This chapter describes the methodology used to answer the selected research questions. The chapter starts off by describing the data used and the selected baseline models. The concept of distributed representations and their implementations shortly follows. These representations are later extended by a set of latent variables that can easily be derived from the click logs. The chapter ends by describing the evaluation process used in an attempt to answer the selected set of hypotheses.

4.1 Dataset

The click log data used in this thesis comes from the Yandex Relevance Prediction challenge held in 2011 [14]. The data contains 30,717,251 unique queries and 117,093,258 unique documents sampled from logs of the Russian search engine Yandex. Besides these click logs, it also contains human-generated binary relevance labels for 41,275 query-document pairs, covering 4,991 unique queries. These relevance labels are used to evaluate the ranking performance of the click models used in this project.

Two different types of lines describe the query sessions in the data. The first line type initializes a query session, holding the following values:

SessionID TimePassed TypeOfAction QueryID RegionID ListOfURLs



Where SessionID is the unique number identifying the described query session, and TimePassed is the time at which the session was initialized, always set to 0. TypeOfAction is Q for query, QueryID is the identifying id for the issued query, followed by a list of 10 document ids held in ListOfURLs. The list is ordered from left to right, as the documents were shown to the user from top to bottom in the SERP. RegionID is a unique identifier of the country from which the user is querying. The number of queries sent within a session varies.

The second line type describes a click action made within a query session and is denoted by the values:

SessionID TimePassed TypeOfAction URLID

Where TimePassed represents the time since the query session was initialized until the click was recorded. The resolution of the time measurement is undisclosed. TypeOfAction is C, representing a click event.

URLID represents the id of the document that was clicked. The number of click events for a query may vary between 0 and 10 clicks.

4.2 Data processing

The original data was read and translated into query sessions that could be described in a single line. It was then randomized and split into separate partition files, where each file represents approximately 0.2% of the total number of query sessions. This means that each file holds over 340,000 different query sessions, where the following fields describe a session:

SessionID, QueryID, Result-1, Click-event-1, .. Result-10, Click-event-10

In this structure, a click event is represented by the time since the start of the query session at which the corresponding search result was clicked, or 0 if no click was made.

When partitions are referred to throughout this report, it is done with these partitions in mind. However, due to the time restrictions of this project, the data quantity used had to be restricted considerably. The conducted experiments were done using 10 partitions, representing approximately 2% of the total number of query sessions available in the dataset. This quantity still holds over 3.4 million different query sessions. However, these restrictions may impact the final results found.

The performance comparison of the baseline models used may not be entirely fair due to limitations of iterative variable estimation, which requires greater amounts of data to learn the required parameter values effectively. Regarding the neural models, larger data quantities could allow for more general training and thus decrease the potential for overfitting. The use of larger amounts of data would, therefore, be preferable.

4.3 Baseline models

The baseline models implemented and used to benchmark our neural models were UBM and PSCM. These were selected due to their performance in previous work. The user browsing model was a clear choice, given that it is the model on which the distributed representation used in this project is based. The models' respective probabilistic parameters were learned using the expectation-maximization algorithm, which is a common iterative method used to retrieve maximum likelihood estimates for unobserved variables. It was configured with a maximum of 50 iterations, ending early once the root mean square error (RMSE) was found to be less than 1 × 10^{-8}. These settings were picked because a higher precision did not show any significant improvements in early attempts and instead only increased the total time spent.

The root mean square error is defined as:

RMSE = \sqrt{\frac{\sum_{i=1}^{n}(y_i - \hat{y}_i)^2}{n}}   (4.1)

Where y_i is the predicted value and \hat{y}_i is the value observed.
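Equation 4.1 as a short sketch:

```python
import math

def rmse(predicted, observed):
    # Eq. 4.1: root of the mean squared difference between predicted
    # and observed values.
    n = len(predicted)
    return math.sqrt(sum((y - t) ** 2
                         for y, t in zip(predicted, observed)) / n)
```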


4.4 Distributionally represented user browsing model

The UBM baseline was translated into a distributed representation as Borisov et al. [3] described it. The representation is hereafter referred to as vector state.

The vector state, denoted by s_r, is represented by a tuple of four integer values (q, d, r, r′). In this notation, q denotes the issued query id, d is the id of the document currently being examined and r is the rank of d in the SERP. Finally, r′ denotes the rank of the previously clicked document. Rank in this context refers to a document's placement on the SERP, where a rank of 1 means that it is located at the top of the results page.

The distributed representation of a UBM modeled query session can be formalized as:

I(q) = (\text{QueryID}(q),\, 0,\, 0,\, 0)   (4.2)

U(s_r, i_r, d_{r+1}) = (s_r[0],\, \text{docID}(d_{r+1}),\, s_r[2] + 1,\, h(s_r, i_r))   (4.3)

h(s_r, i_r) = \begin{cases} s_r[2] & \text{if } i_r = 1 \\ s_r[3] & \text{otherwise} \end{cases}   (4.4)

F(s_{r+1}) = \gamma_{s_{r+1}[2],\, s_{r+1}[2] - s_{r+1}[3]} \cdot \alpha_{s_{r+1}[0],\, s_{r+1}[1]}   (4.5)

The mapping I(q) in Equation 4.2 initializes the vector state by setting the first component to the issued query id and the rest to zero. The mapping U(s, i, d) is used to update state s_r to state s_{r+1} by setting the second component to the id of the next examined document. The rank of the currently examined document is incremented by one, as the user has moved down the SERP. The fourth component is set to the third component of s_r if the previous document d_r was clicked. The variable i_r is a boolean value denoting a user interaction, specifically a click event in this project.

The function F(s_{r+1}) computes the probability that a user clicks on the currently examined document. This prediction is the product of the examination probability \gamma_{r, r-r'} and the attractiveness probability \alpha_{q,d}. In the deep learning models, these probabilities correspond to parameters that are to be learned.
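The mappings I and U (Equations 4.2 to 4.4) can be sketched directly as tuple manipulations (function names are illustrative):

```python
def init_state(query_id):
    # I(q), Eq. 4.2: (query id, current doc id, rank, rank of last click).
    return (query_id, 0, 0, 0)

def update_state(state, clicked, next_doc_id):
    # U(s_r, i_r, d_{r+1}), Eqs. 4.3-4.4: move one rank down the SERP,
    # and remember the rank of the previous document if it was clicked.
    query_id, _, rank, last_click_rank = state
    return (query_id, next_doc_id, rank + 1,
            rank if clicked else last_click_rank)
```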


4.5 Neural network configurations

This section presents how the recurrent neural network and long short-term memory models were implemented using Tensorflow [11].

4.5.1 Recurrent neural network

The base RNN model is constructed as depicted in Figure 4.1, based on the implementation of Borisov et al. [3]. It uses a fully connected layer in order to initialize the vector state s_0. From this state it utilizes recurrent connections to propagate information from the current state s_r to the next, s_{r+1}. The states are formalized as follows:

s_0 = f_1(W_{qs} \cdot q + b_1)   (4.6)

s_{r+1} = f_2(W_{ss} \cdot s_r + W_{is} \cdot i_r + W_{ds} \cdot d_{r+1} + b_2)   (4.7)

The functions f_1 and f_2 here refer to non-linear transformations, e.g. the tanh, sigmoid or ReLU functions.

The click probability c_{r+1} is computed using a fully connected layer with one output unit for each search result in the SERP. The sigmoid activation function is used in the output layer to ensure that the output falls in the range [0, 1].

c_{r+1} = \sigma(W_{sc} \cdot s_{r+1} + b_3)   (4.8)

The matrices W_{qs}, W_{ss}, W_{is}, W_{ds}, W_{sc} and bias variables b_1, b_2 and b_3 are the parameters of the functions I, U and F that are to be learned during training.

4.5.2 Long short-term memory

The LSTM configuration was implemented the same way as in Figure 4.1, the only difference being that the states s_r consist of an LSTM cell instead of a regular RNN cell, with hopes that the gated learning process would increase performance.


Figure 4.1: RNN model configuration. [3]

4.6 Learning on distributed models

The problem of click prediction can be viewed as a binary classification problem, since each result in the SERP is predicted a binary value representing whether the result has been clicked. As the output of our models is calculated using the sigmoid activation function, this is as simple as rounding the output to the nearest integer. Prediction accuracy is calculated using the rounded predictions and their respective truth labels.

Both the RNN and LSTM configurations are trained by maximizing the likelihood of observed click events.

4.6.1 Optimization of learning

To effectively train the model and optimize it for our problem, a logarithmic loss function was used. Specifically, the sigmoid cross entropy function [17] available in Tensorflow was selected. This was done in combination with the Adadelta optimization algorithm with default values ε = 10e−6 and ρ = 0.95 in order to adjust the learning rates. The gradient clipping technique [12] was used in an attempt to mitigate the exploding gradient problem, with the threshold set to 1. These settings were used due to their use in the work of Borisov et al. [3].


Dropout regularization was used in both the ingoing and outgoing directions of the hidden layer. Another form of regularization used is early stopping. It was configured to compare the best loss value seen so far with the current one, combined with a set number of steps used to determine whether to stop training or not.

4.6.2 Hyper-parameter selection

The main parameters explored in this project were the learning rate, cell size, the number of hidden layers, the dropout probability, and the activation functions for both the query layer (Equation 4.6) and the results layer (Equation 4.7). The number of epochs for early stopping and the maximum number of training epochs were also explored. The random search algorithm, presented by Bergstra and Bengio [2], was used to find a good starting set of parameter values, after which a more precise exploration was done using the regular grid search approach. Due to the large amounts of data and the time restrictions of this project, it was essential to find a set of parameter values making a good tradeoff between accuracy and active training time.

4.7 Latent variables

The second part of this thesis covers the combinations of latent variables that can be derived from the dataset. It is explored how these variables can be used in combination with the distributed representations to increase prediction performance.

Generally, there are three different groups of latent variables that can be derived from a click log. These are Query-Document features, which are generated by aggregating statistics over the mappings between queries and their resulting documents; Query features, which merely rely on aggregating over the queries of the data source; and Document features, which instead rely on aggregating over the resulting documents.

Some examples of variables that can be generated in each class can be seen in the tables below. The work of Hu, N Liu, and Chen [7] inspired these variables.


Table 4.1: Statistical Query-Document features. Generated for a document d on SERPs of a query q.

# Clicks: Number of times document d is clicked given q
# Impressions: Number of times document d is impressed given q
CTR: # Clicks / # Impressions for d given q
First CTR: CTR on d when it is the first clicked document
Last CTR: CTR on d when it is the last clicked document
Only CTR: CTR on d when it is the only clicked document
AvgDwellTime: Average dwell time on d after being clicked given q
AvgPosition: Average ranking position of d given q

Table 4.2: Statistical Query features. Generated for a query q.

# Clicks: Number of clicks on all SERPs
# Shows: Number of times q is searched
AvgClickPosition: Average position of all clicks made in the SERPs
CTR_k: CTR for the rank position k of the query q (1 ≤ k ≤ 10)
AvgClickNum: Average number of clicks on each SERP

Table 4.3: Statistical Document features. Generated for the individual document d, irrespective of queries.

# Clicks: Number of clicks on d
# Impressions: Number of impressions of d
CTR: # Clicks / # Impressions
LastCTR: CTR on d when it is the last clicked document

Where CTR is the abbreviation for click-through rate, and an impression means that the document has been presented in a SERP.
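A sketch of deriving the first three query-document features of Table 4.1 from click-log sessions (the session tuple layout is an assumption for illustration):

```python
from collections import defaultdict

def query_document_features(sessions):
    # Each session is (query_id, [doc ids on the SERP], [0/1 clicks]).
    # Returns (# Clicks, # Impressions, CTR) per (query, document) pair.
    clicks = defaultdict(int)
    impressions = defaultdict(int)
    for query_id, doc_ids, clicked in sessions:
        for doc_id, c in zip(doc_ids, clicked):
            impressions[(query_id, doc_id)] += 1
            clicks[(query_id, doc_id)] += int(c)
    return {pair: (clicks[pair], n, clicks[pair] / n)
            for pair, n in impressions.items()}
```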


4.8 Distributed representations with latent variables

Once an acceptable selection of hyper-parameters had been identified for the base model, it was extended by incorporating a set of statistical variables into the vector state. In the context of this project, the number of variables tested had to be limited due to time restrictions. It was decided to focus on the query-document features from Table 4.1, due to them holding the most contextual information by mapping relations between queries and documents. The simple sets of individual document and query features are static regardless of context and thereby provide less information.

The idea was to extend the calculations of Equation 4.7 with combinations of latent variables, such that the next states would now be calculated as follows:

s_{r+1} = f_2(W_{ss} \cdot s_r + W_{is} \cdot i_r + W_{ds} \cdot d_{r+1} + W_{v^i s} \cdot v_r^i + b_2)   (4.9)

Where v_r^i represents some variable i from the set of variables in Table 4.1, and r denotes the (queryID, docID) pair at some rank in the SERP. W_{v^i s} is simply the corresponding weight matrix. This additional information was thought to make the neural networks more capable of identifying effective patterns, in hopes of increasing their predictive power.

4.9 Evaluation methodology

To effectively evaluate our neural models, the click prediction and relevance prediction tasks were considered. The neural models are all trained by maximizing the likelihood of observed click events. All models were trained on 10 partitions and evaluated on 2 partitions.

4.9.1 Click prediction

Click prediction is the task of predicting a user's clicks given a SERP. The metric used for this is described in section 2.6.1. For our neural models, the click probability is simply the output of the model, making it easy to incorporate in our evaluation process. The click probability of document d given a query q is formalized as the following, where r denotes the ranking position on the SERP:

P(C_r = 1 \mid q, d_r)   (4.10)

4.9.2 Relevance prediction

Relevance prediction is the task of predicting the relevance of a document given a query. The metric used for this is described in section 2.6.2. The relevance of a document d to a query q is estimated using the click probability of d when it appears in the first position of the SERP. This can be denoted as:

R(q, d) = P(C_1 = 1 \mid q, d_1)   (4.11)

4.10 Experimental setup

This section describes the experimental process used in this thesis.

It begins by reintroducing the research question, also found in section 1.2, followed by describing the experiments conducted in an attempt to answer it.

4.10.1 Research questions

The research question evaluated throughout this thesis is:

How can recurrent neural networks (RNNs) and latent variables derived from the data source be utilized together to effectively model click behavior in a web search system?

This question is evaluated using the following set of hypotheses.

H1 A deeply learned, distributionally represented click model performs better than its respective PGM version.

H2 Latent variables derived from click logs can be used to improve the performance of a deeply learned click model.
