
U.U.D.M. Project Report 2018:26

Degree project in mathematics (Examensarbete i matematik), 30 credits. Supervisor: Kristiaan Pelckmans. Subject reviewer: Raazesh Sainudiin. Examiner: Denis Gaidashev. June 2018

Department of Mathematics Uppsala University

To learn and evaluate a system for recommending business intentions based on customer behaviour

Niklas Fastlund


Typeset in LaTeX


©Niklas Fastlund, 2018


Abstract

The linear dynamic recommender algorithm DR-TRON, together with a selective sampling scheme, has been evaluated as a means to recommend the business intentions of users phoning in after visiting a business homepage. Furthermore, an artificial analysis of the selective sampling method is presented. The sampler seeks to improve the rate at which DR-TRON learns by selecting which users to receive feedback from, based on empirical margins between recommended items.

The artificial analysis of the selective sampling method was done by training two models on a generated user sample of size n according to a particular scheme. The first model uses the sampling method to query k users out of n; the second model then trains on the first k users of the same sample. After training, new test users are generated and the percentage of cases in which the most relevant item is correctly recommended for these new users is noted. This is repeated r times to see how the percentage of mistakes is distributed.

The artificial analysis of the selective sampler shows promise on the different types of generated data. Albeit by a small amount, the selective sampling algorithm was around 2-3 percentage points better at recommending the most relevant item.

Because the dataset received for the case study on business intentions was too small, a typical evaluation could not be done. An attempt was still made: the case-oriented analysis was done by first enlarging this dataset via sampling with uniform probability from the original dataset. A typical evaluation was then done on this enlarged dataset by splitting it randomly into a training set and a test set, and a comparison between DR-TRON and a naive solution was made. The small dataset only allowed the conclusion that DR-TRON performed significantly better than the naive solution of simply guessing the business intention.


Contents

1 Introduction
1.1 Background and Motivation
1.2 Recommender Systems
1.3 Outline of Thesis

2 Theory
2.1 Preliminaries
2.1.1 Trace
2.1.2 Matrix Norm
2.1.3 Conditional Expectation
2.2 DR-TRON
2.3 First Theoretical Result
2.4 Active Learning
2.5 Second Theoretical Result

3 Artificial Evaluation
3.1 IID Gaussian Encoding
3.1.1 Analysis
3.1.2 Results
3.1.3 Discussion
3.2 Correlated Gaussian and Discrete Encoding
3.2.1 Analysis
3.2.2 Results
3.2.3 Discussion
3.3 Selecting Parameter b

4 FreeSpee Case Study
4.1 Encoding
4.1.1 First Experimental Results
4.2 Evaluation
4.2.1 Description of Dataset
4.2.2 Results of Experiment
4.3 Discussion of Case Study

5 General Discussion

6 Acknowledgements

Appendices
A Analysis Gaussian IID
B Analysis Correlated Gaussian
C Analysis Discrete Encoding


Chapter 1 Introduction

1.1 Background and Motivation

When a consumer contacts a business today, the business may have only one contact address but many different departments in which the customer may end up. An example of this could be a consumer contacting a bank to discuss a housing mortgage loan. The customer obviously would like to reach the department where such loans are handled and not any other department, say credit card issues. The reason for this setup could for example be to avoid overwhelming the customer with different contact information and to make the experience of the customer as smooth as possible. However, one still wishes to direct users properly, and this is usually solved by cumbersome methods such as, in the case of phone calls, asking the user for keywords or asking the consumer to traverse a phone menu.

FreeSpee is a company that helps businesses track their online customers as they convert from online to offline channels, of which the above example could be an illustration. To achieve this, FreeSpee provides a platform for their clients that presents statistics and analytics for business phone contact points, as well as additional services to prevent lead loss and improve the customer experience surrounding phone calls. FreeSpee is always focused on improving the customer experience surrounding phone calls and making the phone call experience as seamless as possible. There is therefore an interest in a system which starts from contextual information about customers and, based on this, produces a ranked list of possible business intentions of the customer. This list could then be used, in general, to recommend to the customer contact addresses which are closer to the goals of the customer. If there is only one department to end up in, it could be used to prioritize incoming calls or to prepare the answering party for the intentions of the caller. The first usage of the list would translate, in the above example, to one where the system suggests the business intention of 'housing mortgage loan' over anything else, and the bank, if they have separate phone numbers, could recommend the number of the department dealing with loans instead of only presenting the number leading to a phone menu. The interest is to see if the methods of Machine Learning can be used to make this process more transparent.

To meet this interest, the intention of the thesis is to, in collaboration with Kristiaan Pelckmans and FreeSpee, prototype a recommender system. The system will rank the possible business intentions based on customer behaviour, and the performance of this system will be evaluated.

1.2 Recommender Systems

Recommendation systems are computer software that recommend items to users in different contexts. The act of recommending is motivated by wanting to optimize some predefined objective. For instance, the objective could be to maximize user engagement in a product and increase the total number of subscriptions. Imagine for example a product which offers a stream of music to its users, where a recommender system could then suggest different songs or playlists to keep the user more engaged.

To rank items over each other, a scoring function is usually learned based on the available input. Said inputs are typically information about the users, items, contexts and feedback. Looking at the above example with streaming music, the feedback information could be which songs and playlists the user listens to. Item information could be the different properties of the songs, for instance rhythm, genre and year. Information about the users could be previously liked songs. Usually one starts by recommending items based on the available information, and then when feedback is received it is possible to learn from this and update the recommender system. Below is a mindmap of the flow of information; later on in the thesis this will be put into a more formal mathematical format [1].

[Figure 1.1: Flow of information for a typical recommender system. User information, item information and contextual information feed into the recommender system, which produces a ranking; user feedback flows back into the recommender system.]

1.3 Outline of Thesis

In the next chapter, preliminary theory will be reviewed and a specific recommender system will be presented in a more mathematical format, along with some theoretical results. In the same chapter a sampling scheme will also be described. This sampling scheme handles how to pick the users to ask for feedback, to potentially learn faster than by learning on arbitrarily picked users.

FreeSpee had information about the caller but not the crucial piece of information about where the caller ended up after traversing the phone menu. So initially FreeSpee had to find a suitable business to which FreeSpee offers their services and motivate them to collaborate by handing over the information about which department the caller ended up in. This would be done in exchange for a prototype recommender system that could be used to present a recommended list of contacts through a widget on their homepage, possibly letting the users avoid the phone menu entirely.

FreeSpee had difficulties acquiring this collaboration throughout the thesis work. To deal with this setback and not let the whole project be delayed, it was decided to devote chapter 3 to investigating the potential gain of using the sampling scheme mentioned in chapter 2. For these investigations, the data was generated by ourselves. Chapter 3 is therefore a more general analysis of the filter, and not necessarily with the same encoding and format as the FreeSpee data.

After FreeSpee failed to find a collaboration, a very late compromise was made, which led to recommending business intentions of users phoning FreeSpee after using FreeSpee's own homepage. FreeSpee began to label the callers manually, so the data format could at least be settled upon, and an analysis similar to that of chapter 3 was repeated. This is presented in chapter 4. It turned out that, for this compromise, FreeSpee could not produce enough rows of data to be useful for producing an actual prototype recommender system. Instead an artificial dataset was made based on the received data, and a short analysis was made to try to make use of this input. This is also presented in chapter 4.

Chapter 5 will be devoted to conclusions and a short outlook.


Chapter 2 Theory

2.1 Preliminaries

In this section we will go through and remind the reader about the trace operator, matrix norms and other mathematical objects. The purpose of this is to help the reader understand the material presented later in the thesis. First let us go through some notation:

- Boldface x is used to emphasize that it is a vector.

- Let A be any event. $1_A$ is used as an indicator function which is one if A occurred and zero otherwise.

- $A \in \mathbb{R}^{n \times m}$ denotes an $n \times m$ matrix with real numbers as entries.

- Let X be a random variable, $\mathcal{F}$ be a sigma field over the set $\Omega$ and P be a probability measure. Then we use the standard notation of a probability space $(\Omega, \mathcal{F}, P)$. For a more formal explanation see [2].

- We adopt the shorthand notation $X \in \mathcal{F}$ to say that X is measurable with respect to the sigma algebra $\mathcal{F}$.

- Let $X \in (\Omega, \mathcal{F}, P)$ and $\mathcal{G} \subset \mathcal{F}$ be a sub-sigma field. Then $E(X)$ is the expected value, defined as the Lebesgue integral of X over $\Omega$, and $E(X \mid \mathcal{G})$ is the conditional expectation of X on the sigma field $\mathcal{G}$. For more see [2].

- Let X be a random variable. Then $X \in L^1$ or $X \in L^1(\Omega, \mathcal{F}, P)$ means that $E(|X|) < \infty$.

- Let $Z_1, Z_2, \dots, Z_t$ be a sequence of random variables. Then $\sigma(Z_1, \dots, Z_t)$ refers to the sigma algebra generated by these random variables.

2.1.1 Trace

The trace of an $n \times n$ matrix A is defined as the sum of the diagonal elements. That is,

$$\mathrm{tr}(A) = \sum_{i=1}^{n} a_{ii} = a_{1,1} + \dots + a_{n,n}. \qquad (2.1)$$


In the proofs and in other parts we will use some basic properties of the trace operator.

Firstly, it is a linear mapping. That is, let A and B be square matrices and c a scalar. Then we have that

$$\mathrm{tr}(cA + B) = c\,\mathrm{tr}(A) + \mathrm{tr}(B). \qquad (2.2)$$

The proof is straightforward and is omitted. We also have that, for suitable matrices A, B, C, D,

$$\mathrm{tr}(ABCD) = \mathrm{tr}(DABC) = \mathrm{tr}(CDAB) = \mathrm{tr}(BCDA). \qquad (2.3)$$

This means that the trace is invariant under cyclic permutations. The proof is omitted. Since the transpose of a matrix does not change the elements on the diagonal, it is clear that

$$\mathrm{tr}(A) = \mathrm{tr}(A^T). \qquad (2.4)$$

2.1.2 Matrix Norm

There are several different matrix norms but here a matrix norm called the Frobenius norm will be used. First let us define a matrix norm.

Definition. Let K denote the field of real or complex numbers. A function $\|\cdot\| : K^{m \times n} \to \mathbb{R}$ is a matrix norm on $m \times n$ matrices if it satisfies

(i) Positivity: $\|A\| \geq 0$, and $\|A\| = 0$ iff $A = 0$.

(ii) Homogeneity: $\|cA\| = |c|\,\|A\|$.

(iii) Triangle inequality: $\|A + B\| \leq \|A\| + \|B\|$.

Some norms, but not all, satisfy

$$\|AB\| \leq \|A\|\,\|B\|. \qquad (2.5)$$

This property is usually referred to as being submultiplicative. The Frobenius norm is defined as

$$\|A\|_F = \|A\|_2 = \sqrt{\sum_{i,j} a_{i,j}^2} = \sqrt{\mathrm{tr}(A^T A)} \qquad (2.6)$$

and is submultiplicative. It is further true that

$$\langle A, B \rangle = \mathrm{tr}(A^T B) = \sum_{i,j} a_{i,j} b_{i,j} \qquad (2.7)$$

is an inner product between $\mathbb{R}^{m \times n}$ matrices. Let us recall the definition of an inner product space.

Definition. Let K denote the field of real or complex numbers. An inner product space is a vector space V over the field K together with an inner product

$$\langle \cdot, \cdot \rangle : V \times V \to K$$

that satisfies the following three properties for all vectors x, y, z in V and scalars $c \in K$:

(i) Conjugate symmetry: $\langle x, y \rangle = \overline{\langle y, x \rangle}$.

(ii) Linearity in the first argument: $\langle x + y, z \rangle = \langle x, z \rangle + \langle y, z \rangle$ and $\langle cx, y \rangle = c\langle x, y \rangle$.

(iii) Positive-definiteness: $\langle x, x \rangle \geq 0$, with equality iff $x = 0$.

So we have that (2.7) is an inner product on the space $V = \mathbb{R}^{m \times n}$ of matrices. We remind ourselves of this since we will later use the expansion

$$\langle A + B, A + B \rangle = \langle A, A \rangle + \langle A, B \rangle + \langle B, A \rangle + \langle B, B \rangle, \qquad (2.8)$$

which follows from (i) and (ii) in the definition. Since we are working with real numbers we have $\langle A, B \rangle = \langle B, A \rangle$, and the above becomes

$$\langle A + B, A + B \rangle = \langle A, A \rangle + \langle B, B \rangle + 2\langle A, B \rangle. \qquad (2.9)$$
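As a quick numerical sanity check of these identities (added for this text and not part of the original report, assuming only NumPy), the following sketch verifies the cyclic property of the trace, the Frobenius-norm identity (2.6), and the expansion (2.9) on random matrices.

```python
import numpy as np

rng = np.random.default_rng(0)
A = rng.normal(size=(3, 4))
B = rng.normal(size=(3, 4))
M = rng.normal(size=(3, 3))
N = rng.normal(size=(3, 3))

# Cyclic invariance of the trace, cf. eq. (2.3): tr(MN) = tr(NM)
assert np.isclose(np.trace(M @ N), np.trace(N @ M))

# Frobenius norm via the trace, eq. (2.6): ||A||_F = sqrt(tr(A^T A))
assert np.isclose(np.linalg.norm(A, "fro"), np.sqrt(np.trace(A.T @ A)))

# Inner product <A, B> = tr(A^T B), eq. (2.7), and the expansion (2.9)
inner = lambda X, Y: np.trace(X.T @ Y)
assert np.isclose(inner(A + B, A + B), inner(A, A) + inner(B, B) + 2 * inner(A, B))
```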

2.1.3 Conditional Expectation

We will not state the definition of conditional expectation with respect to a sigma algebra, but only state a few properties of it that we need. For a complete picture, please refer to [2]. Suppose $X \in L^1(\Omega, \beta, P)$ and let $\mathcal{G} \subset \beta$ be a sub $\sigma$-field. These are the properties we need:

(i) Product rule. Let X, Y be random variables satisfying $X, YX \in L^1$. If $Y \in \mathcal{G}$, then

$$E(XY \mid \mathcal{G}) \overset{a.s.}{=} Y\,E(X \mid \mathcal{G}). \qquad (2.10)$$

(ii) Tower property. If $\mathcal{G}_1 \subset \mathcal{G}_2 \subset \mathcal{G}$, then for $X \in L^1$

$$E\big(E(X \mid \mathcal{G}_2) \mid \mathcal{G}_1\big) = E(X \mid \mathcal{G}_1), \qquad (2.11)$$
$$E\big(E(X \mid \mathcal{G}_1) \mid \mathcal{G}_2\big) = E(X \mid \mathcal{G}_1). \qquad (2.12)$$

The first property can be thought of as taking out what is known. The second property is also referred to as 'the smallest sigma algebra wins', or smoothing, depending on the literature. For proofs see [2].
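As an informal illustration of the tower property (an added sketch, not part of the original text; the variable names are chosen only for this example), conditional expectations given finitely many discrete variables can be approximated by group averages over a simulated sample:

```python
import numpy as np

rng = np.random.default_rng(1)
n = 200_000
z1 = rng.integers(0, 2, size=n)          # Z1, generating G1 = sigma(Z1)
z2 = rng.integers(0, 2, size=n)          # Z1, Z2 together generate G2 = sigma(Z1, Z2)
x = z1 + z2 + rng.normal(size=n)         # an integrable random variable X

def cond_exp(values, *conditioning):
    """Empirical conditional expectation: replace each sample by the mean of its group."""
    keys = np.stack(conditioning, axis=1)
    out = np.empty_like(values)
    for key in np.unique(keys, axis=0):
        mask = (keys == key).all(axis=1)
        out[mask] = values[mask].mean()
    return out

e_x_g1 = cond_exp(x, z1)                 # E[X | G1]
e_x_g2 = cond_exp(x, z1, z2)             # E[X | G2]
tower = cond_exp(e_x_g2, z1)             # E[ E[X | G2] | G1 ]

# Tower property (2.11): both sides agree (here essentially exactly, since group means nest).
print(np.max(np.abs(tower - e_x_g1)))
```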

2.2 DR-TRON

As mentioned earlier there is a motivation to learn some kind of scoring function to be able to recommend one item over another, i.e. recommend the item with the higher score.

Since the area of machine learning is vast there are many different ways one could model this scoring function. This thesis will focus on a specific algorithm called DR-TRON which was presented to me by my supervisor Kristiaan Pelckmans. DR-TRON extends a well known algorithm called the Perceptron but everything is fully explained in this thesis so no background knowledge is necessary from the reader.

Formally, consider a set of m objects $a_1, \dots, a_m$ which one wants to rank in relevance to a context c. Each object $a_i$ is characterised by a vector $x_{a_i} \in \mathbb{R}^{n_a}$. The context is also characterised by a vector $x_c \in \mathbb{R}^{n_c}$. We then seek to learn a mapping from (a, c) to the relevance of a in c. This score or relevance is predicted as

$$\hat{r}(a, c) = x_c^T W x_a, \qquad (2.13)$$

where the matrix $W \in \mathbb{R}^{n_c \times n_a}$ is the part of the model that needs to be learned. We will throughout the thesis use 'object' and 'item' interchangeably to mean the same thing.

Let us illustrate by writing out the relevance of $a_i$ to c. Before the illustration, a short explanation of the notation is due. When writing out the elements of an item $a_i$ we move the subscript i to a superscript instead, i.e. $a_i$ has the elements $a^i_1, a^i_2, \dots, a^i_{n_a}$. This is because, throughout the thesis, we will rarely write out the elements explicitly unless we are doing an illustration, and it is furthermore more convenient to keep the index as a subscript in $a_i$.

$$\hat{r}(a_i, c) = \begin{pmatrix} x_1 & x_2 & \cdots & x_{n_c} \end{pmatrix} \begin{pmatrix} w_{1,1} & w_{1,2} & \cdots & w_{1,n_a} \\ w_{2,1} & w_{2,2} & \cdots & w_{2,n_a} \\ \vdots & & & \vdots \\ w_{n_c,1} & w_{n_c,2} & \cdots & w_{n_c,n_a} \end{pmatrix} \begin{pmatrix} a^i_1 \\ a^i_2 \\ \vdots \\ a^i_{n_a} \end{pmatrix}$$

Furthermore, consider for the sake of readability that $n_a = n_c = 2$. This gives us

$$\hat{r}(a_i, c) = \begin{pmatrix} x_1 & x_2 \end{pmatrix} \begin{pmatrix} w_{1,1} & w_{1,2} \\ w_{2,1} & w_{2,2} \end{pmatrix} \begin{pmatrix} a^i_1 \\ a^i_2 \end{pmatrix} = \begin{pmatrix} x_1 w_{1,1} + x_2 w_{2,1} & x_1 w_{1,2} + x_2 w_{2,2} \end{pmatrix} \begin{pmatrix} a^i_1 \\ a^i_2 \end{pmatrix} = a^i_1 x_1 w_{1,1} + a^i_2 x_1 w_{1,2} + a^i_1 x_2 w_{2,1} + a^i_2 x_2 w_{2,2}.$$

So the intuition is that features of an object $a_i$ and features of the context c act multiplicatively. For further illustration, assume that the object $a_i$ is a movie characterized by playtime, $a^i_1$, and by whether the movie has a female or male lead, indicated by $a^i_2$. The context could then be to recommend movies to users characterized by age, $x_1$, and gender, $x_2$. Then, for example, by looking at the term

$$a^i_1 x_1 w_{1,1}$$

we see that if $w_{1,1}$ is positive it would imply, since $a^i_1, x_1 > 0$, that a movie with a longer playtime will be preferred over a shorter one. However, if $w_{1,1} < 0$, a shorter playtime is preferred instead. If no such connection is observed in the data, it can be switched off by letting $w_{1,1}$ equal zero. So the interaction of any pair of features can be switched off by letting the associated weight $w_{i,j}$ equal zero. Furthermore, consider the following term

$$a^i_2 x_2 w_{2,2}$$

where $a^i_2$ and $x_2$ are -1 for female and 1 for male. If $w_{2,2} > 0$ it would imply that a female user prefers a movie with a female lead over a movie with a male lead, and likewise for a male user. If instead $w_{2,2} < 0$, it would imply that a female user prefers a movie with a male lead over a movie with a female lead, and similarly a male user would prefer a female lead. This is just an example, and the reader might object that -1 and 1 is perhaps not the best encoding of the 'gender feature'. Why should a female lead be encoded as -1 and a male lead as 1, or vice versa, when there is no natural ordering between male and female? There is, however, a common way to encode categorical variables which do not have a natural ordering. This is done by adding another element, denoted $a^i_3$, and then using $\{a^i_2, a^i_3\}$ to encode gender as $\{1, 0\}$ for female and $\{0, 1\}$ for male instead. Likewise for encoding the gender of the users. One could replace gender with countries and have the same situation but with more categories. This is brought up again in chapter 4.

So how the features interact (or not) is encoded in the matrix $W \in \mathbb{R}^{n_c \times n_a}$, and the DR-TRON algorithm learns this matrix W from repeated experimentation.

Consider the following setup. Say we have items $a_1, \dots, a_m$, and at time step t the system receives a query from a certain user. This information is encoded in $c_t$ as a vector $x_{c_t} \in \mathbb{R}^{n_c}$. Then, using equation (2.13), we can score all items $a_1, \dots, a_m$ in terms of relevance to $c_t$, given a matrix W of appropriate size. We will make use of the common subscript notation $a_{(1)}$ to indicate a permutation of the above items in which $a_{(1)}$ is scored highest by $\hat{r}$. Later we will explain how to deal with ties. So we get a ranking, $\hat{r}(a_{(1)}, c_t) \geq \hat{r}(a_{(2)}, c_t) \geq \dots \geq \hat{r}(a_{(m)}, c_t)$, which is presented to the user in order of relevance. Say the user clicks on the second item, which corresponds to item $a_j$, while we gave the largest relevance to the first item $a_i$. We have then predicted wrongly, and the algorithm learns from this. This protocol is represented formally in Algorithm 1, and we will continuously use the notation that $a_i$ corresponds to the most relevant presented item while $a_j$ corresponds to the user's choice, unless specified otherwise.

Algorithm 1 DR-TRON

Require: Initiate $W_0 = 0$ and compute the characterisations $\{x_{a_i} \in \mathbb{R}^{n_a}\}$ of the m objects.
for t = 1, 2, 3, . . . do
  (1) A context $c_t$ (encoded as $x_{c_t} \in \mathbb{R}^{n_c}$) is provided.
  (2) All m objects {a} are ordered in terms of predicted relevance

  $$\hat{r}_t(a, c_t) = x_{c_t}^T W_{t-1} x_a, \qquad (2.14)$$

  say as $\hat{r}_t(a_{(1)}, c_t) \geq \hat{r}_t(a_{(2)}, c_t) \geq \dots \geq \hat{r}_t(a_{(m)}, c_t)$.
  (3) The user is asked for feedback on this ranking.
  (4) If there was a mistake at t on the preference between items $(a_i, a_j)$, then the solution is updated as

  $$W_t = W_{t-1} + x_{c_t}(x_{a_j} - x_{a_i})^T, \qquad (2.15)$$

  else the solution stays as $W_t = W_{t-1}$.
end for
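To make the protocol concrete, below is a minimal NumPy sketch of the update (2.14)-(2.15). It is an illustration written for this text rather than the thesis implementation: the feedback callback and the ground-truth matrix W_true are hypothetical stand-ins for a real user's choice, and ties are broken by sort order rather than by the randomization described later.

```python
import numpy as np

def dr_tron(contexts, items, feedback, n_c, n_a):
    """Run DR-TRON over a stream of contexts.

    contexts: array of shape (T, n_c), one encoded context x_{c_t} per time step.
    items:    array of shape (m, n_a), the characterisations x_{a_i}.
    feedback: callable (t, ranking) -> index of the item the user actually chose.
    Returns the learned matrix W and the number of mistakes made.
    """
    W = np.zeros((n_c, n_a))
    mistakes = 0
    for t, x_c in enumerate(contexts):
        scores = x_c @ W @ items.T                     # eq. (2.14) for all items at once
        ranking = np.argsort(-scores)                  # items sorted by predicted relevance
        i = ranking[0]                                 # a_i: our most relevant item
        j = feedback(t, ranking)                       # a_j: the user's choice
        if j != i:                                     # mistake on the preference (a_i, a_j)
            W = W + np.outer(x_c, items[j] - items[i]) # eq. (2.15)
            mistakes += 1
    return W, mistakes

# Example usage on synthetic data where a hypothetical W_true simulates the feedback.
rng = np.random.default_rng(0)
n_c, n_a, m, T = 10, 10, 5, 400
items = rng.normal(size=(m, n_a))
contexts = rng.normal(size=(T, n_c))
W_true = rng.normal(size=(n_c, n_a))
true_best = lambda t, ranking: int(np.argmax(contexts[t] @ W_true @ items.T))
W, mistakes = dr_tron(contexts, items, true_best, n_c, n_a)
print("mistakes:", mistakes)
```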

Because DR-TRON is a linear recommender, meaning that the decision boundary between recommending one item over another is linear, it has a drawback on data which is not separated by linear boundaries. However, it is still of interest to analyze the performance of this kind of recommender because of its 'lightweight nature' in comparison with more complex models. In different applications, and especially in certain online settings such as websites, more demanding algorithms might not be suitable when being fast and learning online is important. In addition to this, the model has pleasant theoretical properties as well as easy interpretability, due to the features acting multiplicatively. DR-TRON also allows users, contexts and items to change over time, and this dynamic feature allows us to not fix the number of items beforehand, but to let items come and go depending on the needs of the application. With this feature one extends the recommender system to a dynamic recommender system. Notice that, in contrast, W is assumed to be invariant for a given system.

Let $\sigma_t \in \{0, 1\}$ be the indicator of a mistake at time t, that is, $\sigma_t = 1$ if a mistake was made by the algorithm and 0 otherwise. We can then replace equation (2.15) with

$$W_t = W_{t-1} + \sigma_t x_{c_t}(x_{a_{j_t}} - x_{a_{i_t}})^T. \qquad (2.16)$$

The number of mistakes made by the algorithm up to time t is then

$$m_t = \sum_{s=1}^{t} \sigma_s. \qquad (2.17)$$

Let us look at a particular example, where the dimensions of the items and users are $n_a = 1$ and $n_c = 2$. Furthermore, let $x_a = (1\ {-1})^T$ and $x_c = (c_1\ c_2)^T$ with $c_1, c_2 \in \mathbb{R}$. This is a reduced case in which the information about the items is limited to them being different objects, and there are only two items. With the items encoded as $\{1, -1\}$ we end up with a binary classification situation. In the example figure below, a cross indicates that a user belongs to group 1 and a star indicates a user belonging to group -1.

In subfigure (a) one can see the 'true' matrix $\bar{W}$ that separates the groups. Using (2.13), it can be seen that in this case the vector perpendicular to $\bar{W}$ or $W_t$ is the decision boundary of the model. The circled user is the next user being queried. The first user was chosen arbitrarily, and the following users were picked to make the illustration easier. In practice, the group to which the first user is assigned is randomized, but for simplicity we make a wrong prediction on the first user to force the model to learn. In subfigure (e), querying the remaining users in any order will not alter the model, since they are all correctly classified because $W_4$ is very close to $\bar{W}$.

[Figure (subfigures (a)-(e)): DR-TRON in the binary case. Subfigure (a) shows the true matrix $\bar{W}$ separating the two groups of users (crosses for group 1, stars for group -1); subfigures (b)-(e) show the successive estimates $W_1, W_2, W_3, W_4$ after each update, with $W_4$ ending up close to $\bar{W}$.]

Another interesting example is when $n_a = m = 3$ and $n_c = 2$, giving $x_a = (a_1\ a_2\ a_3)$ and $x_c = (c_1\ c_2)^T$ with $c_1, c_2 \in \mathbb{R}$. Furthermore, let $a_1 = (1, 0, 0)^T$, $a_2 = (0, 1, 0)^T$, $a_3 = (0, 0, 1)^T$. This is again an example where the only information is that the items are different. To see this, let us write out (2.14):

$$\hat{r}(a, c) = \begin{pmatrix} c_1 & c_2 \end{pmatrix} \begin{pmatrix} w_{1,1} & w_{1,2} & w_{1,3} \\ w_{2,1} & w_{2,2} & w_{2,3} \end{pmatrix} \begin{pmatrix} a^1_1 & a^2_1 & a^3_1 \\ a^1_2 & a^2_2 & a^3_2 \\ a^1_3 & a^2_3 & a^3_3 \end{pmatrix} = \begin{pmatrix} c_1 & c_2 \end{pmatrix} \begin{pmatrix} w_{1,1} & w_{1,2} & w_{1,3} \\ w_{2,1} & w_{2,2} & w_{2,3} \end{pmatrix} \begin{pmatrix} 1 & 0 & 0 \\ 0 & 1 & 0 \\ 0 & 0 & 1 \end{pmatrix} = \begin{pmatrix} c_1 w_{1,1} + c_2 w_{2,1} \\ c_1 w_{1,2} + c_2 w_{2,2} \\ c_1 w_{1,3} + c_2 w_{2,3} \end{pmatrix}^T. \qquad (2.18)$$

For each item we could equivalently see the above as

$$\hat{r}(a_1, c) = \overrightarrow{\begin{pmatrix} c_1 \\ c_2 \end{pmatrix} \begin{pmatrix} a^1_1 & a^1_2 & a^1_3 \end{pmatrix}} \cdot \overrightarrow{\begin{pmatrix} w_{1,1} & w_{1,2} & w_{1,3} \\ w_{2,1} & w_{2,2} & w_{2,3} \end{pmatrix}} = \begin{pmatrix} c_1 a^1_1 & c_1 a^1_2 & c_1 a^1_3 & c_2 a^1_1 & c_2 a^1_2 & c_2 a^1_3 \end{pmatrix} \cdot \begin{pmatrix} w_{1,1} & w_{1,2} & \cdots & w_{2,3} \end{pmatrix} = \begin{pmatrix} c_1 & 0 & 0 & c_2 & 0 & 0 \end{pmatrix} \cdot \begin{pmatrix} w_{1,1} & w_{1,2} & \cdots & w_{2,3} \end{pmatrix} = c_1 w_{1,1} + c_2 w_{2,1}, \qquad (2.19)$$

where the arrow means that we turn an $n \times m$ matrix into an $nm \times 1$ vector by going through the matrix row by row. For example,

$$\overrightarrow{\begin{pmatrix} w_{1,1} & w_{1,2} & w_{1,3} \\ w_{2,1} & w_{2,2} & w_{2,3} \end{pmatrix}} = \begin{pmatrix} w_{1,1} & w_{1,2} & \cdots & w_{2,3} \end{pmatrix}^T. \qquad (2.20)$$

So we could, for every item, view (2.18) as an inner product between two vectors of dimension $n_c \cdot n_a = 6$, as in (2.19). Here we see that we only know that the items are different, since every item in this example uses its own set of weights, corresponding to a column vector of $W_t$. This is in contrast to an example where we had, say, songs as items, which naturally would share features such as song length, genre and tempo. So in this example one could view it as if we had three separate 2-dimensional subspaces which together rank the items. To visualize this, assume that $\bar{W} = (\bar{w}_1\ \bar{w}_2\ \bar{w}_3)$, where $\bar{w}_i = (w_{1,i}\ w_{2,i})^T$, is our true solution. Then we can draw these three planes for very simple $\bar{w}_i$'s and see how they combine to separate the items based on the information about the users.
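The observation that one-hot encoded items pick out individual columns of W can be checked directly; the following short sketch (an illustration added for this text, assuming NumPy) scores the three one-hot items for a context and compares against the columns of W.

```python
import numpy as np

rng = np.random.default_rng(0)
W = rng.normal(size=(2, 3))            # n_c = 2, n_a = 3
x_c = rng.normal(size=2)               # a context (c_1, c_2)
items = np.eye(3)                      # a_1, a_2, a_3 as one-hot vectors

scores = np.array([x_c @ W @ a for a in items])   # eq. (2.13) for each item
# With one-hot items the score of a_k is just x_c dotted with the k-th column of W,
# e.g. r(a_1, c) = c_1 w_{1,1} + c_2 w_{2,1} as in eq. (2.19).
assert np.allclose(scores, x_c @ W)
print("ranking:", np.argsort(-scores) + 1)
```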


[Figure 2.2: A visualization of an example of DR-TRON. Subfigure (a) shows the partition of the user space for the top recommended item and the entire recommended list of items induced by $\bar{W}$; subfigures (b)-(d) show, for $\bar{w}_1$, $\bar{w}_2$ and $\bar{w}_3$ respectively, the half-planes where $(c_1\ c_2) \cdot \bar{w}_i$ is positive or negative.]

We will later, in chapter 4, use this kind of model to recommend a list of intentions of a user who is currently phoning FreeSpee, but with a larger dimension $n_c$.


2.3 First Theoretical Result

If the data is separable by a linear model then any good algorithm should converge to this model. This is what the first theorem is about.

Theorem 2.3.1. Assume that a matrix $\bar{W} \in \mathbb{R}^{n_c \times n_a}$ exists such that for any context $c_t$ and items $a_{i_t}, a_{j_t}$ one has that

$$r_t(c_t, a_{j_t}) > r_t(c_t, a_{i_t}) \;\Leftrightarrow\; x_{c_t}^T \bar{W} x_{a_{j_t}} \geq x_{c_t}^T \bar{W} x_{a_{i_t}} + 1. \qquad (2.21)$$

Assume further that there is a finite $R < \infty$ such that

$$\max_t \left\| x_{c_t}(x_{a_{j_t}} - x_{a_{i_t}})^T \right\|_2^2 \leq R^2. \qquad (2.22)$$

Then

$$m_t \leq \|\bar{W}\|_2^2 R^2. \qquad (2.23)$$

Remark 1. Note the presence of '1' in eq. (2.21). This is equivalent to requiring that

$$r_t(c_t, a_{j_t}) > r_t(c_t, a_{i_t}) \;\Leftrightarrow\; x_{c_t}^T \bar{W} x_{a_{j_t}} > x_{c_t}^T \bar{W} x_{a_{i_t}}, \qquad (2.24)$$

or that there exists an $\epsilon > 0$ so that

$$r_t(c_t, a_{j_t}) > r_t(c_t, a_{i_t}) \;\Leftrightarrow\; x_{c_t}^T \bar{W} x_{a_{j_t}} \geq x_{c_t}^T \bar{W} x_{a_{i_t}} + \epsilon, \qquad (2.25)$$

while imposing $\|\bar{W}\|_2 = 1$. That is, by multiplying $\bar{W}$ by a positive constant, one can convert this $\epsilon$ into the '1' as desired. The first assumption (2.21) excludes ties in the algorithm: in the case that two items have the same relevance, they should not be presented to the algorithm. We will handle the case of ties later in the thesis. The second assumption (2.22) is a boundedness assumption on the data.
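As a small worked complement to Remark 1 (added here for clarity, not part of the original text), the rescaling argument can be written out explicitly:

$$x_{c_t}^T \bar{W} x_{a_{j_t}} \geq x_{c_t}^T \bar{W} x_{a_{i_t}} + \epsilon \;\Longrightarrow\; x_{c_t}^T \Big(\tfrac{1}{\epsilon}\bar{W}\Big) x_{a_{j_t}} \geq x_{c_t}^T \Big(\tfrac{1}{\epsilon}\bar{W}\Big) x_{a_{i_t}} + 1,$$

so $\bar{W}' = \bar{W}/\epsilon$ satisfies (2.21), and applying (2.23) to $\bar{W}'$ gives the bound $m_t \leq \|\bar{W}\|_2^2 R^2 / \epsilon^2$: a smaller margin $\epsilon$ allows for more mistakes.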

Theorem 2.3.1 gives us that the number of mistakes made up to time t is bounded by a constant. This is only possible if fewer and fewer mistakes are being made.

Proof. We begin by unfolding the recursion (2.16), that is,

$$W_t = \sum_{s=1}^{t} \sigma_s x_{c_s}(x_{a_{j_s}} - x_{a_{i_s}})^T. \qquad (2.26)$$

We then bound the number of mistakes from above by studying $\mathrm{tr}(\bar{W}^T W_t)$. We get

$$\mathrm{tr}(\bar{W}^T W_t) = \mathrm{tr}\!\left(\bar{W}^T \sum_{s=1}^{t} \sigma_s x_{c_s}(x_{a_{j_s}} - x_{a_{i_s}})^T\right) = \sum_{s=1}^{t} \sigma_s\,\mathrm{tr}\!\left(\bar{W}^T x_{c_s}(x_{a_{j_s}} - x_{a_{i_s}})^T\right) = \sum_{s=1}^{t} \sigma_s\,\mathrm{tr}\!\left((x_{a_{j_s}} - x_{a_{i_s}}) x_{c_s}^T \bar{W}\right) = \sum_{s=1}^{t} \sigma_s\,\mathrm{tr}\!\left(x_{c_s}^T \bar{W}(x_{a_{j_s}} - x_{a_{i_s}})\right) \geq \sum_{s=1}^{t} \sigma_s. \qquad (2.27)$$


From the first equality to the second we pull out the scalar and use the linearity (2.2) of the trace operator. Second to third we use that tr(A) = tr(AT) and that for matrices (AB)T = BTAT. From the third to the fourth we use the cyclic property (2.3) of trace and the inequality is given from assumption (2.21).

By the Cauchy-Schwarz inequality we have that

$$\mathrm{tr}(\bar{W}^T W_t) \leq |\mathrm{tr}(\bar{W}^T W_t)| = |\langle \bar{W}, W_t \rangle| \leq \sqrt{\langle \bar{W}, \bar{W} \rangle \langle W_t, W_t \rangle} = \|\bar{W}\|_2 \|W_t\|_2. \qquad (2.28)$$

Moreover, with the use of (2.7) and (2.9) from the preliminaries we get that

$$\|W_t\|_2^2 = \|W_{t-1} + \sigma_t x_{c_t}(x_{a_{j_t}} - x_{a_{i_t}})^T\|_2^2 = \|W_{t-1}\|_2^2 + \sigma_t^2 \|x_{c_t}(x_{a_{j_t}} - x_{a_{i_t}})^T\|_2^2 + 2\sigma_t\,\mathrm{tr}\!\left(W_{t-1}^T x_{c_t}(x_{a_{j_t}} - x_{a_{i_t}})^T\right) = \|W_{t-1}\|_2^2 + \sigma_t^2 \|x_{c_t}(x_{a_{j_t}} - x_{a_{i_t}})^T\|_2^2 + 2\sigma_t x_{c_t}^T W_{t-1}(x_{a_{j_t}} - x_{a_{i_t}}) \leq \|W_{t-1}\|_2^2 + \sigma_t^2 R^2. \qquad (2.29)$$

From the second equality to the third we again use the cyclic property (2.3) of the trace. From the last equality to the inequality we use the second assumption in Theorem 2.3.1 and the fact that the third term is non-positive: when a mistake is made, the model ranked $a_{i_t}$ above $a_{j_t}$, so $x_{c_t}^T W_{t-1}(x_{a_{j_t}} - x_{a_{i_t}}) \leq 0$.

So, unfolding $W_t$ gives us

$$\|W_t\|_2^2 \leq \|W_{t-1}\|_2^2 + \sigma_t^2 R^2 \leq \|W_{t-2}\|_2^2 + \sigma_t^2 R^2 + \sigma_{t-1}^2 R^2 \leq \dots \leq R^2 \sum_{s=1}^{t} \sigma_s^2 = R^2 m_t. \qquad (2.30)$$

Putting together (2.27) with (2.28) and (2.30) we have

$$m_t = \sum_{s=1}^{t} \sigma_s \leq \mathrm{tr}(\bar{W}^T W_t) \leq \|\bar{W}\|_2 \|W_t\|_2 \leq \|\bar{W}\|_2 \sqrt{R^2 m_t}, \qquad (2.31)$$

and rearranging gives us the result

$$m_t \leq \|\bar{W}\|_2^2 R^2. \qquad (2.32)$$

2.4 Active Learning

Consider a sequence $\{c_1, c_2, \dots, c_t, \dots\}$ of queries, still encoded as $x_{c_t} \in \mathbb{R}^{n_c}$, from which the DR-TRON algorithm learns. In what way was this sequence generated? Arbitrarily? Is there any reason to engineer this sequence in some way to benefit us? A reasonable motive would be to engineer the $c_t$ to achieve fast learning using the smallest number of queries possible; that is, by learning on such an engineered sequence, the matrix $W_t$ evolves as fast as possible towards the solution $\bar{W}$. This is relevant for applications where one must train the model within a controlled, brief timespan before deploying it in real situations. Active learning addresses the question of how to design the queries $\{c_t\}$.

A selective sampling algorithm refers to an algorithm which decides whether to query the current user, in our case $x_{c_t}$, for feedback, based on previously observed data. Through this selective sampling algorithm, users are queried selectively to create a sequence of queries $\{c_1, \dots, c_t, \dots\}$ from which the DR-TRON algorithm learns. Much of the work involving the selective sampling algorithm is inspired by similar work done in [4], and the results of this section will be compared to results from that paper.

In this thesis, however, we will be given a sequence $\{c_1, c_2, \dots, c_t, \dots\}$ from FreeSpee on which to train our model. We will refer to this sequence as our 'vanilla' sequence. This means we have already missed our chance to use a selective sampling method to generate the sequence. Regardless, it is still interesting to see how the performance differs if we learn on a subsequence of the vanilla one, picked out with the selective sampler, versus training an unfiltered model on the same number of queries. Performance of this kind will first be investigated on generated data in chapter 3, to see if the coming sampling method shows any promise.

To motivate a certain sampling method for the queries, we first define what will be called the empirical margin between two items $a_k$ and $a_l$ for a certain context c as

$$\mathrm{emp}(a_k, a_l, c) := |\hat{r}(a_k, c) - \hat{r}(a_l, c)| = \left| x_c^T W (x_{a_k} - x_{a_l}) \right|. \qquad (2.33)$$

This is the difference in score between items $a_k$ and $a_l$, and it can be thought of as how confident the model is in ranking either of them above the other. The intuition behind the sampling method is that the model is likely to learn and update on queries for which the empirical margin between two items is small: the model is not very 'confident' in the ranking between these two items, so it makes sense to receive feedback from this particular user. The question then becomes: if a sampling method makes use of the empirical margin, which items should one consider? When there are more than two items, a natural choice is the top two currently ranked items of the model, and this is also the choice made throughout this thesis. Of course one could try different schemes, such as randomizing the items once or anew for each t. Another scheme could be to pick, for each t, the smallest empirical margin between two adjacently ranked items. The formal sampling algorithm will later be stated with general $a_k$ and $a_l$.
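The empirical margin of the top two currently ranked items is easy to compute from the current model. The helper below is an illustrative sketch added for this text (assuming NumPy and the same encoding as above); the same quantity reappears in the sampling rule (2.36).

```python
import numpy as np

def top_two_margin(W, x_c, items):
    """Empirical margin, eq. (2.33), between the two currently top-ranked items.

    W:     current model matrix of shape (n_c, n_a)
    x_c:   encoded context, shape (n_c,)
    items: item characterisations, shape (m, n_a)
    """
    scores = x_c @ W @ items.T
    k, l = np.argsort(-scores)[:2]      # indices of the top two items a_k and a_l
    return abs(scores[k] - scores[l])
```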

Before proceeding any further, we need to say something about the situation of ties in the ranked list and how to handle it. If there are any ties in the ranked list given by (2.14), we will, for every block of ties, randomize the order within the block while keeping the block in the same position in the list. For example, assume that in our ranked list the third to fifth items are ranked equally and the ninth and tenth items are ranked equally. One block then consists of items 3-5 and the other block consists of items 9-10. We then randomize the positions within these blocks separately, but do not change the position of the blocks in the list. This is done whenever needed throughout the thesis.

Now let us see if the model really makes a large portion of its updates when the empirical margin of the top two ranked items is small. Consider therefore a training run of DR-TRON (see Algorithm 1) on users encoded as independent and identically distributed (IID) multivariate Gaussian random variables of dimension $n_c = 10$, with zero mean vector $\mu$ and covariance matrix $\Sigma$ equal to the identity matrix; in short, $x_{c_t} \sim N_{n_c}(\mu, \Sigma)$. Further, let the characterizations of the objects, $x_{a_i}$, be distributed as $x_{a_i} \sim N_{n_a}(\mu, \Sigma)$ with $n_a = 10$. Finally, let our true solution, denoted $W_{\mathrm{true}}$, be an $n_c \times n_a$ matrix with IID N(0, 1) entries. The true ranking is given by inserting $W_{\mathrm{true}}$ into equation (2.13), and this is used to simulate a user's feedback on the suggested ranking. Beginning with two items, i.e. m = 2, and a sample of 200 users, we plot the empirical margin for each query and mark each query that leads to an update of the model. Below is the plot of this.

Figure 2.3: Empirical margin of each query between the only two items with a sample size of 200 users.

We do the same for m = 10 and a sample of 400 users. As mentioned earlier ak and al are the top two currently ranked items at each time point t.


Figure 2.4: Empirical margin of the top two currently ranked items for each query when m = 10, with a sample size of 400 users.

We can see in these figures that a large portion of the updates made by the model are made when the empirical margin of the currently top two ranked items is relatively small in comparison with the rest of the queries. This is what the following selective sampling method will take advantage of.

The selective sampling algorithm gives rise to the following form of $W_t$:

$$W_t = W_{t-1} + \sigma_t Z_t x_{c_t}(x_{a_{j_t}} - x_{a_{i_t}})^T, \qquad (2.34)$$

where $x_{a_{i_t}}, x_{a_{j_t}}$ again correspond to our predicted most relevant item and the user's choice respectively, $W_0 = 0$, and $Z_t$ is a Bernoulli distributed random variable with parameter $Q_t$. The outcome of $Z_t$ decides whether we query the user for feedback or not, and we query whenever $Z_t$ equals 1. Equivalently we have

$$W_t = \sum_{s=1}^{t} \sigma_s Z_s x_{c_s}(x_{a_{j_s}} - x_{a_{i_s}})^T. \qquad (2.35)$$


In more detail:

Algorithm 2 Selective sampling for the DR-TRON

Require: Initiate $W_0 = 0$ and compute the characterisations $\{x_{a_i} \in \mathbb{R}^{n_a}\}$ of the m objects. Let $b > 0$ be a constant. Let $m_0 = 0$.
for t = 1, 2, 3, . . . do
  (1) A context $c_t$ (encoded as $x_{c_t} \in \mathbb{R}^{n_c}$) is provided, and two possibly relevant items $a_l$ and $a_k$ are selected.
  (2) Sample $Z_t \in \{0, 1\}$ as a Bernoulli random variable with

  $$0 < Q_t = \frac{b}{b + \left|\mathrm{tr}\!\left(x_{c_t}^T W_{t-1}(x_{a_l} - x_{a_k})\right)\right|} \leq 1 \qquad (2.36)$$

  and $P(Z_t = 1) = Q_t$.
  if $Z_t = 1$ then
    All m objects {a} are ordered in terms of predicted relevance

    $$\hat{r}_t(a, c_t) = x_{c_t}^T W_{t-1} x_a, \qquad (2.37)$$

    say as $\hat{r}_t(a_{(1)}, c_t) \geq \hat{r}_t(a_{(2)}, c_t) \geq \dots \geq \hat{r}_t(a_{(m)}, c_t)$.
    - The user is asked for feedback on this ranking.
    if there was a mistake at t on the preference between items $(a_i, a_j)$ then
      the solution is updated as

      $$W_t = W_{t-1} + x_{c_t}(x_{a_j} - x_{a_i})^T, \qquad (2.38)$$

    else
      the solution stays $W_t = W_{t-1}$.
    end if
  end if
end for
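A compact sketch of Algorithm 2 is given below (again an illustration added for this text, assuming NumPy; the feedback is simulated by a hypothetical W_true as before). The only difference from the earlier DR-TRON sketch is the Bernoulli filter Z_t with parameter Q_t from (2.36), computed here from the top two currently ranked items, matching the choice of a_l and a_k used in the thesis.

```python
import numpy as np

def selective_dr_tron(contexts, items, feedback, n_c, n_a, b=1.0, rng=None):
    """DR-TRON with the selective sampling filter of Algorithm 2 (illustrative sketch)."""
    if rng is None:
        rng = np.random.default_rng()
    W = np.zeros((n_c, n_a))
    queries = 0
    for t, x_c in enumerate(contexts):
        scores = x_c @ W @ items.T
        ranking = np.argsort(-scores)
        k, l = ranking[0], ranking[1]              # the two currently most relevant items
        margin = abs(scores[k] - scores[l])        # |x_c^T W_{t-1}(x_{a_l} - x_{a_k})|
        q_t = b / (b + margin)                     # eq. (2.36)
        if rng.random() < q_t:                     # Z_t = 1: query this user
            queries += 1
            i = ranking[0]                         # a_i: our top recommendation
            j = feedback(t, ranking)               # a_j: the user's choice
            if j != i:                             # mistake -> update, eq. (2.38)
                W = W + np.outer(x_c, items[j] - items[i])
    return W, queries

# Example usage: larger b means Q_t closer to 1, i.e. more users are queried.
rng = np.random.default_rng(0)
n_c, n_a, m, T = 10, 10, 5, 400
items = rng.normal(size=(m, n_a))
contexts = rng.normal(size=(T, n_c))
W_true = rng.normal(size=(n_c, n_a))
true_best = lambda t, ranking: int(np.argmax(contexts[t] @ W_true @ items.T))
W, queries = selective_dr_tron(contexts, items, true_best, n_c, n_a, b=1.0, rng=rng)
print("queries made:", queries, "out of", T)
```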

Note that $a_i$ still refers to the most relevant presented item and $a_j$ to the user's choice. We do not make any assumptions on the sequence $(c_1, a_{l_1}, a_{k_1}), (c_2, a_{l_2}, a_{k_2}), \dots$, but we do require that the user cannot take the outcome of $Z_t$ into account when determining their preferences. This means that if we use (2.37) and choose $a_l$ and $a_k$ based on it, then the requirement implies that $\sigma_t$ can be determined using only the information of $Z_1, Z_2, \dots, Z_{t-1}$, since the user preference is already generated before $Z_t$. In summary, $\sigma_t$ is measurable with respect to the $\sigma$-algebra generated by $Z_1, Z_2, \dots, Z_{t-1}$. This property will be used later on.

The choice of $a_k$ and $a_l$, combined with every previous query, affects the probability of querying the user at time t. This is decided by $Z_t$, which acts as a kind of filter, and we will refer to the model using this sampling method as a filtered model. We also see that as b goes to infinity we recover the unfiltered algorithm. So b affects the number of queries, and a natural question is whether there is any default value or strategy for initially setting this constant. This will be investigated and analyzed in chapter 3. Again, if we assume $\bar{W}$ exists, do we have any result regarding the performance of the algorithm? The second theoretical result addresses this.
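As a small worked example (added for illustration, with numbers chosen only for this text), suppose the empirical margin of the top two items at some time t is $|x_{c_t}^T W_{t-1}(x_{a_l} - x_{a_k})| = 2$. Then

$$Q_t = \frac{b}{b + 2} = \begin{cases} 0.2 & \text{for } b = 0.5, \\ 0.5 & \text{for } b = 2, \\ \approx 0.91 & \text{for } b = 20, \end{cases}$$

so small values of b query users almost only when the model is uncertain, while large values of b approach the unfiltered algorithm that queries everyone.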


2.5 Second Theoretical Result

Before stating the second theoretical result, an important distinction must be made between the following two quantities:

$$\sum_{s=1}^{t} \sigma_s \qquad \text{and} \qquad \sum_{s=1}^{t} \sigma_s Z_s.$$

The second quantity is the number of updates made by the selective sampling algorithm (Algorithm 2) up to time t, while the first is the number of mistakes made by the selective sampling algorithm up to time t. The important thing to note here is that $\sigma_s$ can be 1 without $Z_s$ being 1, i.e. a mistake can be made by the current model at time s without the available user at time s actually being queried. The updates and mistakes are, in contrast to the first algorithm (2.17), not necessarily coupled with each other anymore.

Assuming the same conditions as in Theorem 2.3.1, the number of updates the selective sampling algorithm makes up to time t is bounded by the same bound as in Theorem 2.3.1, while using fewer queries. Before stating this result, we define explicitly the number of updates made by the selective sampling algorithm,

$$U_t := \sum_{s=1}^{t} \sigma_s Z_s. \qquad (2.39)$$

Theorem 2.5.2. Assume the assumptions of Theorem 2.3.1. Then

$$U_t \leq \|\bar{W}\|_2^2 R^2 \quad a.s. \qquad (2.40)$$

and the expected number of queries is $\sum_{s=1}^{t} E(Q_s)$.

Proof. The expected number of queries follows easily from the tower property of conditional expectation mentioned in section 2.1.3. That is,

$$E\left(\sum_{s=1}^{t} Z_s\right) = \sum_{s=1}^{t} E\big(E[Z_s \mid \sigma(Z_1, \dots, Z_{s-1})] \mid \{\Omega, \emptyset\}\big) = \sum_{s=1}^{t} E(Q_s \mid \{\Omega, \emptyset\}) = \sum_{s=1}^{t} E(Q_s). \qquad (2.41)$$

The bound is analogous to the proof of Theorem 2.3.1. The only difference is that $Z_s$ appears in the sums, and in the same way as $\sum_{s=1}^{t} \sigma_s = \sum_{s=1}^{t} \sigma_s^2$ we have that $\sum_{s=1}^{t} \sigma_s Z_s = \sum_{s=1}^{t} \sigma_s^2 Z_s^2$.

This bound on the updates is also a bound for the mistakes that are coupled with these updates. We would like a bound on the total number of mistakes, i.e. not only the mistakes associated with an update but also the mistakes made without actually querying the user. This was not achieved, but a bound in expectation on the mistakes associated with an additional condition was achieved and is stated below as a corollary. Before stating the corollary, let us recall that in Algorithm 2, $Q_t$ was defined as

$$Q_t = \frac{b}{b + \left|\mathrm{tr}\!\left(x_{c_t}^T W_{t-1}(x_{a_l} - x_{a_k})\right)\right|} \qquad (2.42)$$

for some possibly relevant items $a_l$ and $a_k$; recall that earlier we took the two most relevant items suggested by the model as an example. The mistakes that will be bounded are the mistakes $\sigma_t$ for which $Q_t$ has the property

$$F_t := \left|\mathrm{tr}\!\left(x_{c_t}^T W_{t-1}(x_{a_l} - x_{a_k})\right)\right| < a \qquad (2.43)$$

for some positive constant a > 0. So we identify the mistakes $\sigma_t$ which are coupled with the property (2.43) and count those. This implies for $Q_t$ that

$$Q_t > \frac{b}{b + a} > 0, \qquad (2.44)$$

and this will be used in the corollary.

Corollary 2.5.2.1. Assume the assumptions of Theorem 2.5.2. Then

$$E\left(\sum_{s \leq t \,:\, F_s < a} \sigma_s\right) \leq \|\bar{W}\|_2^2 R^2 \left(\frac{b + a}{b}\right). \qquad (2.45)$$

Proof. We begin by recalling that $\sigma_s$ is measurable with respect to the sigma algebra $\sigma(Z_1, \dots, Z_{s-1})$, i.e. $\sigma_s$ can be evaluated knowing only the information up to time s - 1. Also note that $Z_s$ is sampled independently given the information up to time s - 1, i.e.

$$E[Z_s \mid \sigma(Z_1, \dots, Z_{s-1})] = Q_s. \qquad (2.46)$$

Using this and the tower property of conditional expectation mentioned in section 2.1.3, we can write

$$E\left[\sum_{s=1}^{t} \sigma_s Z_s\right] = \sum_{s=1}^{t} E[\sigma_s Z_s] = \sum_{s=1}^{t} E\big[E(\sigma_s Z_s \mid \sigma(Z_1, \dots, Z_{s-1}))\big] = \sum_{s=1}^{t} E\big[\sigma_s E(Z_s \mid \sigma(Z_1, \dots, Z_{s-1}))\big] = \sum_{s=1}^{t} E[\sigma_s Q_s] = E\left[\sum_{s=1}^{t} \sigma_s Q_s\right]. \qquad (2.47)$$

Now we split the sum inside the expectation operator into two parts and make use of (2.44):

$$E\left[\sum_{s=1}^{t} \sigma_s Q_s\right] = E\left[\sum_{s \leq t \,:\, F_s < a} \sigma_s Q_s + \sum_{s \leq t \,:\, F_s \geq a} \sigma_s Q_s\right] \geq E\left[\sum_{s \leq t \,:\, F_s < a} \sigma_s \frac{b}{a + b} + \sum_{s \leq t \,:\, F_s \geq a} \sigma_s Q_s\right] \geq \frac{b}{a + b}\, E\left[\sum_{s \leq t \,:\, F_s < a} \sigma_s\right], \qquad (2.48)$$

where the last step drops the second, non-negative sum. Finally, using this together with Theorem 2.5.2 we get that

$$\frac{b}{a + b}\, E\left[\sum_{s \leq t \,:\, F_s < a} \sigma_s\right] \leq E\left[\sum_{s=1}^{t} \sigma_s Q_s\right] \leq \|\bar{W}\|_2^2 R^2. \qquad (2.49)$$

Multiplying both sides by $\frac{b + a}{b}$ gives the result.


With the classifier and selective sampling algorithm used in [4], the same bound was achieved, in expectation, for the total number of mistakes, whereas in our setting the achieved result does not bound all the mistakes made up to time t > 0. Note that if one adjusts the definition of $Q_t$ to, for example, $Q_t' := \max(d, Q_t)$, then, using the same arguments as above, a bound on all the mistakes can be found, the bound being $\|\bar{W}\|_2^2 R^2 \frac{b+d}{b}$.

An implication of Theorem 2.5.2 is that if we let both the filtered and the unfiltered algorithm query K times on a sequence of queries $\{c_1, c_2, \dots, c_t, \dots\}$, we should expect the filtered algorithm to make, in expectation, more updates than the unfiltered one. We will see this happen on our own generated data in the next chapter.


Chapter 3

Artificial Evaluation

In this chapter we will analyze the filter and see if we can observe a difference in favour of the filter on different types of data that we generate ourselves. As mentioned previously in the outline of the thesis, FreeSpee had difficulties delivering a suitable data set. Therefore, to make progress on the thesis work and also to prepare for the actual data, a preliminary analysis was done on general data, and these results are presented here.

This analysis will be done for different dimensions of the items $x_{a_i} \in \mathbb{R}^{n_a}$ and contexts $x_c \in \mathbb{R}^{n_c}$, as well as for different parameters such as the number of objects, denoted m, and the constant b. Again, we fix $a_l$ and $a_k$ in (2.36) to be the two most relevant items recommended at each time t.

We begin by explaining how we intend to analyze a possible difference between the unfiltered and the filtered algorithm. The idea is, given all necessary parameters, to first train a model using Algorithm 2 on a sample of n users. We save how many queries were made, say l, and then train another model using Algorithm 1 on the first l users from the same sample. This is the scenario in which we have queries $\{c_1, c_2, \dots, c_t, \dots\}$ in a pipeline and can afford to query l users within a window of time [1, T]. We then generate k new users, for which both models recommend items, and count the number of mistakes made by both models on the top relevant item. We do this r times and see how the percentage of mistakes is distributed.

In a real application one might not have an ending time T, but rather end the querying when l queries are reached or when some other terminating condition is met. The reason that the analysis is structured this way is that we expect to be given a sequence $\{c_1, c_2, \dots, c_n\}$ of queries from FreeSpee which the filter should ideally have been involved in generating from the beginning. What we can do instead is apply the analysis scheme below to check the potential of the filter on such a sequence; a code sketch of one configuration of this scheme is given after Algorithm 3.


Algorithm 3 Scheme for analyzing

Require: Set b > 0, r, m, $n_a$, $n_c$, k and n.
Generate the m items $X_a \in \mathbb{R}^{n_a \times m}$ and $W_{\mathrm{true}} \in \mathbb{R}^{n_c \times n_a}$.
for i = 1, 2, 3, . . . , r do
  Generate the n users $X_c \in \mathbb{R}^{n \times n_c}$.
  for t = 1, 2, 3, . . . , n do
    (1) Train a model using Algorithm 2, the sample $X_c$ and the items $X_a$, with $W_{\mathrm{true}}$ acting as the user feedback. Save the number of queries l made.
    (2) Train a model using Algorithm 1 and $X_a$ on the first l users in $X_c$, using $W_{\mathrm{true}}$ as the user feedback.
    (3) Generate k new users and count the number of times both models recommended the wrong most relevant item.
  end for
end for
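Below is a self-contained NumPy sketch of one repetition of this comparison, added for illustration only (the helper names are chosen for this text, ties are broken by index rather than randomized, and, for compactness, each repetition regenerates the items and W_true, whereas Algorithm 3 fixes them across the r repetitions): the filtered model decides how many queries l it spends, the unfiltered model gets the first l users of the same sample, and both are scored on k fresh users.

```python
import numpy as np

def score(W, x_c, items):
    return x_c @ W @ items.T

def run_comparison(b, m, n_a, n_c, k, n, rng):
    """One iteration of the comparison scheme: filtered vs. unfiltered DR-TRON."""
    items = rng.normal(size=(m, n_a))
    W_true = rng.normal(size=(n_c, n_a))
    users = rng.normal(size=(n, n_c))

    # (1) Filtered training pass (Algorithm 2); count the queries l it spends.
    W_f = np.zeros((n_c, n_a))
    l = 0
    for x_c in users:
        s = score(W_f, x_c, items)
        top = np.argsort(-s)[:2]
        q_t = b / (b + abs(s[top[0]] - s[top[1]]))        # eq. (2.36)
        if rng.random() < q_t:
            l += 1
            i = int(np.argmax(s))
            j = int(np.argmax(score(W_true, x_c, items))) # simulated feedback
            if j != i:
                W_f += np.outer(x_c, items[j] - items[i])

    # (2) Unfiltered training pass (Algorithm 1) on the first l users of the same sample.
    W_u = np.zeros((n_c, n_a))
    for x_c in users[:l]:
        i = int(np.argmax(score(W_u, x_c, items)))
        j = int(np.argmax(score(W_true, x_c, items)))
        if j != i:
            W_u += np.outer(x_c, items[j] - items[i])

    # (3) Score both models on k fresh users: fraction of wrong top-1 recommendations.
    test = rng.normal(size=(k, n_c))
    truth = np.argmax(test @ W_true @ items.T, axis=1)
    err_f = np.mean(np.argmax(test @ W_f @ items.T, axis=1) != truth)
    err_u = np.mean(np.argmax(test @ W_u @ items.T, axis=1) != truth)
    return l, err_f, err_u

rng = np.random.default_rng(0)
results = [run_comparison(b=1.0, m=10, n_a=10, n_c=10, k=500, n=400, rng=rng) for _ in range(20)]
print("mean error (filtered, unfiltered):",
      np.mean([r[1] for r in results]), np.mean([r[2] for r in results]))
```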

As previously mentioned, this will be done for different configurations of the parameters b, r, m, $n_a$, $n_c$ and n.
