
U.U.D.M. Project Report 2018:26

Degree project in mathematics (Examensarbete i matematik), 30 credits. Supervisor: Kristiaan Pelckmans. Subject reviewer: Raazesh Sainudiin. Examiner: Denis Gaidashev. June 2018

Department of Mathematics Uppsala University

To learn and evaluate a system for recommending business intentions based on customer behaviour

Niklas Fastlund


Typeset in LaTeX


©Niklas Fastlund, 2018


Abstract

The linear dynamic recommender algorithm DR-TRON, together with a selective sampling scheme, has been evaluated as a means to recommend the business intentions of users phoning in after visiting a business homepage. Furthermore, an artificial analysis of the selective sampling method is presented. The sampler seeks to improve the rate at which DR-TRON learns by selecting which users to receive feedback from, based on empirical margins between recommended items.

The artificial analysis of the selective sampling method was done by training two models on a generated user sample of size n according to a particular scheme. The first model uses the sampling method to query k users out of n; the second model then trains on the first k users of the same sample. After training, new test users are generated and the percentage of cases in which the most relevant item is correctly recommended for these new users is noted. This is repeated r times to see how the percentage of mistakes is distributed.

The artificial analysis of the selective sampler shows promise on the different types of generated data. Albeit by a small amount, the selective sampling algorithm was around 2-3 percentage points better at recommending the most relevant item.

Because the dataset received for the case study on business intentions was too small, a typical evaluation could not be done. An attempt was still made: the case-oriented analysis was done by first enlarging this dataset via sampling with uniform probability from the original dataset. A typical evaluation was then done on this enlarged dataset by splitting it randomly into a training set and a test set, and a comparison between DR-TRON and a naive solution was made. The small dataset only allowed the conclusion that DR-TRON performed significantly better than the naive solution of simply guessing the business intention.


Contents

1 Introduction
1.1 Background and Motivation
1.2 Recommender Systems
1.3 Outline of Thesis

2 Theory
2.1 Preliminaries
2.1.1 Trace
2.1.2 Matrix Norm
2.1.3 Conditional Expectation
2.2 DR-TRON
2.3 First Theoretical Result
2.4 Active Learning
2.5 Second Theoretical Result

3 Artificial Evaluation
3.1 IID Gaussian Encoding
3.1.1 Analysis
3.1.2 Results
3.1.3 Discussion
3.2 Correlated Gaussian and Discrete Encoding
3.2.1 Analysis
3.2.2 Results
3.2.3 Discussion
3.3 Selecting Parameter b

4 FreeSpee Case Study
4.1 Encoding
4.1.1 First Experimental Results
4.2 Evaluation
4.2.1 Description of Dataset
4.2.2 Results of Experiment
4.3 Discussion of Case Study

5 General Discussion

6 Acknowledgements

Appendices
A Analysis Gaussian IID
B Analysis Correlated Gaussian
C Analysis Discrete Encoding


Chapter 1 Introduction

1.1 Background and Motivation

When a consumer contacts a business today, the business may have only one contact address but many different departments in which the customer may end up. An example of this could be a consumer contacting a bank to discuss a housing mortgage loan. The customer obviously would like to reach the department where such loans are handled and not any other department, say credit card issues. The reason for this setup could for example be to avoid overwhelming the customer with different contact information and to make the experience of the customer as smooth as possible. However, one still wishes to direct users properly, and this is usually solved by cumbersome methods such as, in the case of phone calls, asking the user for keywords or asking the consumer to traverse a phone menu.

FreeSpee is a company that helps businesses track their online customers as they convert from online to offline channels, of which the above example could be an illustration. To achieve this, FreeSpee provides a platform for their clients that presents statistics and analytics for business phone contact points, as well as additional services to prevent lead loss and improve the customer experience surrounding phone calls. FreeSpee is always focused on improving the customer experience surrounding phone calls and making the phone call experience as seamless as possible. There is therefore an interest in a system which starts from contextual information about customers and, based on this, produces a ranked list of possible business intentions of the customer. This list could then be used, in general, to recommend to the customer contact addresses which are closer to the goals of the customer. If there is only one department to end up in, it could be used to prioritize incoming calls or to prepare the answering party for the intentions of the caller. The first usage of the list would translate, in the above example, to one where the system suggests the business intention of 'housing mortgage loan' over anything else, and the bank, if they have separate phone numbers, could recommend the number of the department dealing with loans instead of only presenting the number leading to a phone menu. The interest is to see if the methods of Machine Learning can be used to make this process more transparent.

To meet this interest, the intention of the thesis is to, in collaboration with Kristiaan Pelckmans and FreeSpee, prototype a recommender system. The system will rank the possible business intentions based on customer behaviour, and the performance of this system will be evaluated.

1.2 Recommender Systems

Recommendation systems are computer software that recommend items to users in different contexts. The act of recommending is motivated by wanting to optimize some predefined objective. For instance, the objective could be to maximize user engagement in a product and increase the total number of subscriptions. Imagine for example a product which offers a stream of music to its users, where a recommender system could then suggest different songs or playlists to keep the user more engaged.

To rank items over each other, a scoring function is usually learned based on the available input. Said inputs are typically information about the users, items, contexts and feedback. Looking at the above example with streaming music, the feedback information could be which songs and playlists the user listens to. Item information could be the different properties of the songs, for instance rhythm, genre and year. Information about the users could be previously liked songs. Usually one starts by recommending items based on the available information, and then when feedback is received it is possible to learn from this and update the recommender system. Below is a mindmap of the flow of information; later on in the thesis this will be put into a more formal mathematical format [1].

[Figure 1.1: Flow of information for a typical recommender system. User information, item information and contextual information feed into the recommender system, which produces a ranking; user feedback flows back into the recommender system.]

1.3 Outline of Thesis

In the next chapter, preliminary theory will be reviewed and a specific recommender system will be presented in a more mathematical format, along with some theoretical results. In the same chapter a sampling scheme will also be described. This sampling scheme handles how to pick the users to ask for feedback, to potentially learn faster than by learning on arbitrarily picked users.

FreeSpee had information about the caller but not the crucial piece of information about where the caller ended up after traversing the phone menu. So initially FreeSpee had to find a suitable business to which FreeSpee offers their services and motivate them to collaborate by handing over the information about which department the caller ended up in. This would be done in exchange for a prototype recommender system that could be used to present a recommended list of contacts through a widget on their homepage, possibly letting the users avoid the phone menu entirely.

FreeSpee had difficulties acquiring this collaboration throughout the thesis work. To deal with this setback and not let the whole project be delayed, it was decided to devote chapter 3 to investigating the potential gain of using the sampling scheme mentioned in chapter 2. For these investigations, the data was generated by ourselves. Chapter 3 is therefore a more general analysis of the filter, and not necessarily with the same encoding and format as the FreeSpee data.

After FreeSpee failed to find a collaboration, a very late compromise was made, which led to recommending business intentions of users phoning FreeSpee after using FreeSpee's own homepage. FreeSpee began to label the callers manually, so the data format could at least be settled upon, and an analysis similar to that of chapter 3 was repeated. This is presented in chapter 4. It turned out that, for this compromise, FreeSpee could not produce enough rows of data to be useful for producing an actual prototype recommender system. Instead an artificial dataset was made based on the received data, and a short analysis was made to try to make use of this input. This is also presented in chapter 4.

Chapter 5 will be devoted to conclusions and a short outlook.


Chapter 2 Theory

2.1 Preliminaries

In this section we will go through and remind the reader about the trace operator, matrix norms and other mathematical objects. The purpose of this is to help the reader understand the material presented later in the thesis. First let us go through some notation:

- Boldface x is used to emphasize that it is a vector.

- Let A be any event. $1_A$ is used as an indicator function which is one if A occurred and zero otherwise.

- $A \in \mathbb{R}^{n \times m}$ denotes an $n \times m$ matrix with real numbers as entries.

- Let X be a random variable, $\mathcal{F}$ be a sigma field over the set $\Omega$ and P be a probability measure. Then we use the standard notation of a probability space $(\Omega, \mathcal{F}, P)$. For a more formal explanation see [2].

- We adopt the shorthand notation $X \in \mathcal{F}$ to say that X is measurable with respect to the sigma algebra $\mathcal{F}$.

- Let $X \in (\Omega, \mathcal{F}, P)$ and $\mathcal{G} \subset \mathcal{F}$ be a sub-sigma field. Then $E(X)$ is the expected value, defined as the Lebesgue integral of X over $\Omega$, and $E(X \mid \mathcal{G})$ is the conditional expectation of X on the sigma field $\mathcal{G}$. For more see [2].

- Let X be a random variable. Then $X \in L^1$ or $X \in L^1(\Omega, \mathcal{F}, P)$ means that $E(|X|) < \infty$.

- Let $Z_1, Z_2, \dots, Z_t$ be a sequence of random variables. Then $\sigma(Z_1, \dots, Z_t)$ refers to the sigma algebra generated by these random variables.

2.1.1 Trace

The trace of an $n \times n$ matrix A is defined as the sum of the diagonal elements. That is,

$$\mathrm{tr}(A) = \sum_{i=1}^{n} a_{ii} = a_{1,1} + \dots + a_{n,n}. \qquad (2.1)$$


In the proofs and in other parts we will use some basic properties of the trace operator.

Firstly, it is a linear mapping. That is, let A and B be square matrices and c a scalar. Then we have that

$$\mathrm{tr}(cA + B) = c\,\mathrm{tr}(A) + \mathrm{tr}(B). \qquad (2.2)$$

The proof is straightforward and is omitted. We also have that, for suitable matrices A, B, C, D,

$$\mathrm{tr}(ABCD) = \mathrm{tr}(DABC) = \mathrm{tr}(CDAB) = \mathrm{tr}(BCDA). \qquad (2.3)$$

This means that the trace is invariant under cyclic permutations. The proof is omitted. Since the transpose of a matrix does not change the elements on the diagonal, it is clear that

$$\mathrm{tr}(A) = \mathrm{tr}(A^T). \qquad (2.4)$$

2.1.2 Matrix Norm

There are several different matrix norms but here a matrix norm called the Frobenius norm will be used. First let us define a matrix norm.

Definition. Let K denote the field of real or complex numbers. A function $\|\cdot\| : K^{m \times n} \to \mathbb{R}$ is a matrix norm on $m \times n$ matrices if it satisfies

(i) Positivity: $\|A\| \geq 0$, and $\|A\| = 0$ iff $A = 0$.

(ii) Homogeneity: $\|cA\| = |c|\,\|A\|$.

(iii) Triangle inequality: $\|A + B\| \leq \|A\| + \|B\|$.

Some norms, but not all, satisfy

$$\|AB\| \leq \|A\|\,\|B\|. \qquad (2.5)$$

This property is usually referred to as being submultiplicative. The Frobenius norm is defined as

$$\|A\|_F = \|A\|_2 = \sqrt{\sum_{i,j} a_{i,j}^2} = \sqrt{\mathrm{tr}(A^T A)} \qquad (2.6)$$

and is submultiplicative. It is further true that

$$\langle A, B \rangle = \mathrm{tr}(A^T B) = \sum_{i,j} a_{i,j} b_{i,j} \qquad (2.7)$$

is an inner product between $\mathbb{R}^{m \times n}$ matrices. Let us recall the definition of an inner product space.

Definition. Let K denote the field of real or complex numbers. An inner product space is a vector space V over the field K together with an inner product

$$\langle \cdot, \cdot \rangle : V \times V \to K$$

that satisfies the following three properties for all vectors x, y, z in V and scalars $c \in K$:

(i) Conjugate symmetry: $\langle x, y \rangle = \overline{\langle y, x \rangle}$.

(ii) Linearity in the first argument: $\langle x + y, z \rangle = \langle x, z \rangle + \langle y, z \rangle$ and $\langle cx, y \rangle = c\langle x, y \rangle$.

(iii) Positive-definiteness: $\langle x, x \rangle \geq 0$, with equality iff $x = 0$.

So we have that (2.7) is an inner product on the space $V = \mathbb{R}^{m \times n}$ of matrices. We remind ourselves of this since we will later use the expansion

$$\langle A + B, A + B \rangle = \langle A, A \rangle + \langle A, B \rangle + \langle B, A \rangle + \langle B, B \rangle, \qquad (2.8)$$

which follows from (i) and (ii) in the definition. Since we are working with real numbers we have $\langle A, B \rangle = \langle B, A \rangle$, and the above becomes

$$\langle A + B, A + B \rangle = \langle A, A \rangle + \langle B, B \rangle + 2\langle A, B \rangle. \qquad (2.9)$$
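As a quick numerical sanity check of these identities (added for this text and not part of the original report, assuming only NumPy), the following sketch verifies the cyclic property of the trace, the Frobenius-norm identity (2.6), and the expansion (2.9) on random matrices.

```python
import numpy as np

rng = np.random.default_rng(0)
A = rng.normal(size=(3, 4))
B = rng.normal(size=(3, 4))
M = rng.normal(size=(3, 3))
N = rng.normal(size=(3, 3))

# Cyclic invariance of the trace, cf. eq. (2.3): tr(MN) = tr(NM)
assert np.isclose(np.trace(M @ N), np.trace(N @ M))

# Frobenius norm via the trace, eq. (2.6): ||A||_F = sqrt(tr(A^T A))
assert np.isclose(np.linalg.norm(A, "fro"), np.sqrt(np.trace(A.T @ A)))

# Inner product <A, B> = tr(A^T B), eq. (2.7), and the expansion (2.9)
inner = lambda X, Y: np.trace(X.T @ Y)
assert np.isclose(inner(A + B, A + B), inner(A, A) + inner(B, B) + 2 * inner(A, B))
```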

2.1.3 Conditional Expectation

We will not state the definition of conditional expectation with respect to a sigma algebra, but only state a few properties of it that we need. For a complete picture, please refer to [2]. Suppose $X \in L^1(\Omega, \beta, P)$ and let $\mathcal{G} \subset \beta$ be a sub $\sigma$-field. These are the properties we need:

(i) Product rule. Let X, Y be random variables satisfying $X, YX \in L^1$. If $Y \in \mathcal{G}$, then

$$E(XY \mid \mathcal{G}) \overset{a.s.}{=} Y\,E(X \mid \mathcal{G}). \qquad (2.10)$$

(ii) Tower property. If $\mathcal{G}_1 \subset \mathcal{G}_2 \subset \mathcal{G}$, then for $X \in L^1$

$$E\big(E(X \mid \mathcal{G}_2) \mid \mathcal{G}_1\big) = E(X \mid \mathcal{G}_1), \qquad (2.11)$$
$$E\big(E(X \mid \mathcal{G}_1) \mid \mathcal{G}_2\big) = E(X \mid \mathcal{G}_1). \qquad (2.12)$$

The first property can be thought of as taking out what is known. The second property is also referred to as 'the smallest sigma algebra wins', or smoothing, depending on the literature. For proofs see [2].
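As an informal illustration of the tower property (an added sketch, not part of the original text; the variable names are chosen only for this example), conditional expectations given finitely many discrete variables can be approximated by group averages over a simulated sample:

```python
import numpy as np

rng = np.random.default_rng(1)
n = 200_000
z1 = rng.integers(0, 2, size=n)          # Z1, generating G1 = sigma(Z1)
z2 = rng.integers(0, 2, size=n)          # Z1, Z2 together generate G2 = sigma(Z1, Z2)
x = z1 + z2 + rng.normal(size=n)         # an integrable random variable X

def cond_exp(values, *conditioning):
    """Empirical conditional expectation: replace each sample by the mean of its group."""
    keys = np.stack(conditioning, axis=1)
    out = np.empty_like(values)
    for key in np.unique(keys, axis=0):
        mask = (keys == key).all(axis=1)
        out[mask] = values[mask].mean()
    return out

e_x_g1 = cond_exp(x, z1)                 # E[X | G1]
e_x_g2 = cond_exp(x, z1, z2)             # E[X | G2]
tower = cond_exp(e_x_g2, z1)             # E[ E[X | G2] | G1 ]

# Tower property (2.11): both sides agree (here essentially exactly, since group means nest).
print(np.max(np.abs(tower - e_x_g1)))
```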

2.2 DR-TRON

As mentioned earlier there is a motivation to learn some kind of scoring function to be able to recommend one item over another, i.e. recommend the item with the higher score.

Since the area of machine learning is vast there are many different ways one could model this scoring function. This thesis will focus on a specific algorithm called DR-TRON which was presented to me by my supervisor Kristiaan Pelckmans. DR-TRON extends a well known algorithm called the Perceptron but everything is fully explained in this thesis so no background knowledge is necessary from the reader.

Formally, consider a set of m objects $a_1, \dots, a_m$ which one wants to rank in relevance to a context c. Each object $a_i$ is characterised by a vector $x_{a_i} \in \mathbb{R}^{n_a}$. The context is also characterised by a vector $x_c \in \mathbb{R}^{n_c}$. We then seek to learn a mapping from (a, c) to the relevance of a in c. This score or relevance is predicted as

$$\hat{r}(a, c) = x_c^T W x_a, \qquad (2.13)$$

where the matrix $W \in \mathbb{R}^{n_c \times n_a}$ is the part of the model that needs to be learned. We will throughout the thesis use 'object' and 'item' interchangeably to mean the same thing.

Let us illustrate by writing out the relevance of $a_i$ to c. Before the illustration, a short explanation of the notation is due. When writing out the elements of an item $a_i$ we move the subscript i to a superscript instead, i.e. $a_i$ has the elements $a^i_1, a^i_2, \dots, a^i_{n_a}$. This is because, throughout the thesis, we will rarely write out the elements explicitly unless we are doing an illustration, and it is furthermore more convenient to keep the index as a subscript in $a_i$.

$$\hat{r}(a_i, c) = \begin{pmatrix} x_1 & x_2 & \cdots & x_{n_c} \end{pmatrix} \begin{pmatrix} w_{1,1} & w_{1,2} & \cdots & w_{1,n_a} \\ w_{2,1} & w_{2,2} & \cdots & w_{2,n_a} \\ \vdots & & & \vdots \\ w_{n_c,1} & w_{n_c,2} & \cdots & w_{n_c,n_a} \end{pmatrix} \begin{pmatrix} a^i_1 \\ a^i_2 \\ \vdots \\ a^i_{n_a} \end{pmatrix}$$

Furthermore, consider for the sake of readability that $n_a = n_c = 2$. This gives us

$$\hat{r}(a_i, c) = \begin{pmatrix} x_1 & x_2 \end{pmatrix} \begin{pmatrix} w_{1,1} & w_{1,2} \\ w_{2,1} & w_{2,2} \end{pmatrix} \begin{pmatrix} a^i_1 \\ a^i_2 \end{pmatrix} = \begin{pmatrix} x_1 w_{1,1} + x_2 w_{2,1} & x_1 w_{1,2} + x_2 w_{2,2} \end{pmatrix} \begin{pmatrix} a^i_1 \\ a^i_2 \end{pmatrix} = a^i_1 x_1 w_{1,1} + a^i_2 x_1 w_{1,2} + a^i_1 x_2 w_{2,1} + a^i_2 x_2 w_{2,2}.$$

So the intuition is that features of an object $a_i$ and features of the context c act multiplicatively. For further illustration, assume that the object $a_i$ is a movie characterized by playtime, $a^i_1$, and by whether the movie has a female or male lead, indicated by $a^i_2$. The context could then be to recommend movies to users characterized by age, $x_1$, and gender, $x_2$. Then, for example, by looking at the term

$$a^i_1 x_1 w_{1,1}$$

we see that if $w_{1,1}$ is positive it would imply, since $a^i_1, x_1 > 0$, that a movie with a longer playtime will be preferred over a shorter one. However, if $w_{1,1} < 0$, a shorter playtime is preferred instead. If no such connection is observed in the data, it can be switched off by letting $w_{1,1}$ equal zero. So the interaction of any pair of features can be switched off by letting the associated weight $w_{i,j}$ equal zero. Furthermore, consider the following term

$$a^i_2 x_2 w_{2,2}$$

where $a^i_2$ and $x_2$ are -1 for female and 1 for male. If $w_{2,2} > 0$ it would imply that a female user prefers a movie with a female lead over a movie with a male lead, and likewise for a male user. If instead $w_{2,2} < 0$, it would imply that a female user prefers a movie with a male lead over a movie with a female lead, and similarly a male user would prefer a female lead. This is just an example, and the reader might object that -1 and 1 is perhaps not the best encoding of the 'gender feature'. Why should a female lead be encoded as -1 and a male lead as 1, or vice versa, when there is no natural ordering between male and female? There is, however, a common way to encode categorical variables which do not have a natural ordering. This is done by adding another element, denoted $a^i_3$, and then using $\{a^i_2, a^i_3\}$ to encode gender as $\{1, 0\}$ for female and $\{0, 1\}$ for male instead. Likewise for encoding the gender of the users. One could replace gender with countries and have the same situation but with more categories. This is brought up again in chapter 4.

So how the features interact (or not) is encoded in the matrix $W \in \mathbb{R}^{n_c \times n_a}$, and the DR-TRON algorithm learns this matrix W from repeated experimentation.

Consider the following setup. Say we have items $a_1, \dots, a_m$, and at time step t the system receives a query from a certain user. This information is encoded in $c_t$ as a vector $x_{c_t} \in \mathbb{R}^{n_c}$. Then, using equation (2.13), we can score all items $a_1, \dots, a_m$ in terms of relevance to $c_t$, given a matrix W of appropriate size. We will make use of the common subscript notation $a_{(1)}$ to indicate a permutation of the above items in which $a_{(1)}$ is scored highest by $\hat{r}$. Later we will explain how to deal with ties. So we get a ranking, $\hat{r}(a_{(1)}, c_t) \geq \hat{r}(a_{(2)}, c_t) \geq \dots \geq \hat{r}(a_{(m)}, c_t)$, which is presented to the user in order of relevance. Say the user clicks on the second item, which corresponds to item $a_j$, while we gave the largest relevance to the first item $a_i$. We have then predicted wrongly, and the algorithm learns from this. This protocol is represented formally in Algorithm 1, and we will continuously use the notation that $a_i$ corresponds to the most relevant presented item while $a_j$ corresponds to the user's choice, unless specified otherwise.

Algorithm 1 DR-TRON

Require: Initiate $W_0 = 0$ and compute the characterisations $\{x_{a_i} \in \mathbb{R}^{n_a}\}$ of the m objects.
for t = 1, 2, 3, . . . do
  (1) A context $c_t$ (encoded as $x_{c_t} \in \mathbb{R}^{n_c}$) is provided.
  (2) All m objects {a} are ordered in terms of predicted relevance

  $$\hat{r}_t(a, c_t) = x_{c_t}^T W_{t-1} x_a, \qquad (2.14)$$

  say as $\hat{r}_t(a_{(1)}, c_t) \geq \hat{r}_t(a_{(2)}, c_t) \geq \dots \geq \hat{r}_t(a_{(m)}, c_t)$.
  (3) The user is asked for feedback on this ranking.
  (4) If there was a mistake at t on the preference between items $(a_i, a_j)$, then the solution is updated as

  $$W_t = W_{t-1} + x_{c_t}(x_{a_j} - x_{a_i})^T, \qquad (2.15)$$

  else the solution stays as $W_t = W_{t-1}$.
end for
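To make the protocol concrete, below is a minimal NumPy sketch of the update (2.14)-(2.15). It is an illustration written for this text rather than the thesis implementation: the feedback callback and the ground-truth matrix W_true are hypothetical stand-ins for a real user's choice, and ties are broken by sort order rather than by the randomization described later.

```python
import numpy as np

def dr_tron(contexts, items, feedback, n_c, n_a):
    """Run DR-TRON over a stream of contexts.

    contexts: array of shape (T, n_c), one encoded context x_{c_t} per time step.
    items:    array of shape (m, n_a), the characterisations x_{a_i}.
    feedback: callable (t, ranking) -> index of the item the user actually chose.
    Returns the learned matrix W and the number of mistakes made.
    """
    W = np.zeros((n_c, n_a))
    mistakes = 0
    for t, x_c in enumerate(contexts):
        scores = x_c @ W @ items.T                     # eq. (2.14) for all items at once
        ranking = np.argsort(-scores)                  # items sorted by predicted relevance
        i = ranking[0]                                 # a_i: our most relevant item
        j = feedback(t, ranking)                       # a_j: the user's choice
        if j != i:                                     # mistake on the preference (a_i, a_j)
            W = W + np.outer(x_c, items[j] - items[i]) # eq. (2.15)
            mistakes += 1
    return W, mistakes

# Example usage on synthetic data where a hypothetical W_true simulates the feedback.
rng = np.random.default_rng(0)
n_c, n_a, m, T = 10, 10, 5, 400
items = rng.normal(size=(m, n_a))
contexts = rng.normal(size=(T, n_c))
W_true = rng.normal(size=(n_c, n_a))
true_best = lambda t, ranking: int(np.argmax(contexts[t] @ W_true @ items.T))
W, mistakes = dr_tron(contexts, items, true_best, n_c, n_a)
print("mistakes:", mistakes)
```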

Because DR-TRON is a linear recommender, meaning that the decision boundary between recommending one item over another is linear, it has a drawback on data which is not separated by linear boundaries. However, it is still of interest to analyze the performance of this kind of recommender because of its 'lightweight nature' in comparison with more complex models. In different applications, and especially in certain online settings such as websites, more demanding algorithms might not be suitable when being fast and learning online is important. In addition to this, the model has pleasant theoretical properties as well as easy interpretability, due to the features acting multiplicatively. DR-TRON also allows users, contexts and items to change over time, and this dynamic feature allows us to not fix the number of items beforehand, but to let items come and go depending on the needs of the application. With this feature one extends the recommender system to a dynamic recommender system. Notice that, in contrast, W is assumed to be invariant for a given system.

Let $\sigma_t \in \{0, 1\}$ be the indicator of a mistake at time t, that is, $\sigma_t = 1$ if a mistake was made by the algorithm and 0 otherwise. We can then replace equation (2.15) with

$$W_t = W_{t-1} + \sigma_t x_{c_t}(x_{a_{j_t}} - x_{a_{i_t}})^T. \qquad (2.16)$$

The number of mistakes made by the algorithm up to time t is then

$$m_t = \sum_{s=1}^{t} \sigma_s. \qquad (2.17)$$

Let us look at a particular example, where the dimensions of the items and users are $n_a = 1$ and $n_c = 2$. Furthermore, let $x_a = (1\ {-1})^T$ and $x_c = (c_1\ c_2)^T$ with $c_1, c_2 \in \mathbb{R}$. This is a reduced case in which the information about the items is limited to them being different objects, and there are only two items. With the items encoded as $\{1, -1\}$ we end up with a binary classification situation. In the example figure below, a cross indicates that a user belongs to group 1 and a star indicates a user belonging to group -1.

In subfigure (a) one can see the 'true' matrix $\bar{W}$ that separates the groups. Using (2.13), it can be seen that in this case the vector perpendicular to $\bar{W}$ or $W_t$ is the decision boundary of the model. The circled user is the next user being queried. The first user was chosen arbitrarily, and the following users were picked to make the illustration easier. In practice, the group to which the first user is assigned is randomized, but for simplicity we make a wrong prediction on the first user to force the model to learn. In subfigure (e), querying the remaining users in any order will not alter the model, since they are all correctly classified because $W_4$ is very close to $\bar{W}$.

[Figure (subfigures (a)-(e)): DR-TRON in the binary case. Subfigure (a) shows the true matrix $\bar{W}$ separating the two groups of users (crosses for group 1, stars for group -1); subfigures (b)-(e) show the successive estimates $W_1, W_2, W_3, W_4$ after each update, with $W_4$ ending up close to $\bar{W}$.]

Another interesting example is when $n_a = m = 3$ and $n_c = 2$, giving $x_a = (a_1\ a_2\ a_3)$ and $x_c = (c_1\ c_2)^T$ with $c_1, c_2 \in \mathbb{R}$. Furthermore, let $a_1 = (1, 0, 0)^T$, $a_2 = (0, 1, 0)^T$, $a_3 = (0, 0, 1)^T$. This is again an example where the only information is that the items are different. To see this, let us write out (2.14):

$$\hat{r}(a, c) = \begin{pmatrix} c_1 & c_2 \end{pmatrix} \begin{pmatrix} w_{1,1} & w_{1,2} & w_{1,3} \\ w_{2,1} & w_{2,2} & w_{2,3} \end{pmatrix} \begin{pmatrix} a^1_1 & a^2_1 & a^3_1 \\ a^1_2 & a^2_2 & a^3_2 \\ a^1_3 & a^2_3 & a^3_3 \end{pmatrix} = \begin{pmatrix} c_1 & c_2 \end{pmatrix} \begin{pmatrix} w_{1,1} & w_{1,2} & w_{1,3} \\ w_{2,1} & w_{2,2} & w_{2,3} \end{pmatrix} \begin{pmatrix} 1 & 0 & 0 \\ 0 & 1 & 0 \\ 0 & 0 & 1 \end{pmatrix} = \begin{pmatrix} c_1 w_{1,1} + c_2 w_{2,1} \\ c_1 w_{1,2} + c_2 w_{2,2} \\ c_1 w_{1,3} + c_2 w_{2,3} \end{pmatrix}^T. \qquad (2.18)$$

For each item we could equivalently see the above as

$$\hat{r}(a_1, c) = \overrightarrow{\begin{pmatrix} c_1 \\ c_2 \end{pmatrix} \begin{pmatrix} a^1_1 & a^1_2 & a^1_3 \end{pmatrix}} \cdot \overrightarrow{\begin{pmatrix} w_{1,1} & w_{1,2} & w_{1,3} \\ w_{2,1} & w_{2,2} & w_{2,3} \end{pmatrix}} = \begin{pmatrix} c_1 a^1_1 & c_1 a^1_2 & c_1 a^1_3 & c_2 a^1_1 & c_2 a^1_2 & c_2 a^1_3 \end{pmatrix} \cdot \begin{pmatrix} w_{1,1} & w_{1,2} & \cdots & w_{2,3} \end{pmatrix} = \begin{pmatrix} c_1 & 0 & 0 & c_2 & 0 & 0 \end{pmatrix} \cdot \begin{pmatrix} w_{1,1} & w_{1,2} & \cdots & w_{2,3} \end{pmatrix} = c_1 w_{1,1} + c_2 w_{2,1}, \qquad (2.19)$$

where the arrow means that we turn an $n \times m$ matrix into an $nm \times 1$ vector by going through the matrix row by row. For example,

$$\overrightarrow{\begin{pmatrix} w_{1,1} & w_{1,2} & w_{1,3} \\ w_{2,1} & w_{2,2} & w_{2,3} \end{pmatrix}} = \begin{pmatrix} w_{1,1} & w_{1,2} & \cdots & w_{2,3} \end{pmatrix}^T. \qquad (2.20)$$

So we could, for every item, view (2.18) as an inner product between two vectors of dimension $n_c \cdot n_a = 6$, as in (2.19). Here we see that we only know that the items are different, since every item in this example uses its own set of weights, corresponding to a column vector of $W_t$. This is in contrast to an example where we had, say, songs as items, which naturally would share features such as song length, genre and tempo. So in this example one could view it as if we had three separate 2-dimensional subspaces which together rank the items. To visualize this, assume that $\bar{W} = (\bar{w}_1\ \bar{w}_2\ \bar{w}_3)$, where $\bar{w}_i = (w_{1,i}\ w_{2,i})^T$, is our true solution. Then we can draw these three planes for very simple $\bar{w}_i$'s and see how they combine to separate the items based on the information about the users.
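The observation that one-hot encoded items pick out individual columns of W can be checked directly; the following short sketch (an illustration added for this text, assuming NumPy) scores the three one-hot items for a context and compares against the columns of W.

```python
import numpy as np

rng = np.random.default_rng(0)
W = rng.normal(size=(2, 3))            # n_c = 2, n_a = 3
x_c = rng.normal(size=2)               # a context (c_1, c_2)
items = np.eye(3)                      # a_1, a_2, a_3 as one-hot vectors

scores = np.array([x_c @ W @ a for a in items])   # eq. (2.13) for each item
# With one-hot items the score of a_k is just x_c dotted with the k-th column of W,
# e.g. r(a_1, c) = c_1 w_{1,1} + c_2 w_{2,1} as in eq. (2.19).
assert np.allclose(scores, x_c @ W)
print("ranking:", np.argsort(-scores) + 1)
```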


[Figure 2.2: A visualization of an example of DR-TRON. Subfigure (a) shows the partition of the user space for the top recommended item and the entire recommended list of items induced by $\bar{W}$; subfigures (b)-(d) show, for $\bar{w}_1$, $\bar{w}_2$ and $\bar{w}_3$ respectively, the half-planes where $(c_1\ c_2) \cdot \bar{w}_i$ is positive or negative.]

We will later, in chapter 4, use this kind of model to recommend a list of intentions of a user who is currently phoning FreeSpee, but with a larger dimension $n_c$.


2.3 First Theoretical Result

If the data is separable by a linear model then any good algorithm should converge to this model. This is what the first theorem is about.

Theorem 2.3.1. Assume that a matrix $\bar{W} \in \mathbb{R}^{n_c \times n_a}$ exists such that for any context $c_t$ and items $a_{i_t}, a_{j_t}$ one has that

$$r_t(c_t, a_{j_t}) > r_t(c_t, a_{i_t}) \;\Leftrightarrow\; x_{c_t}^T \bar{W} x_{a_{j_t}} \geq x_{c_t}^T \bar{W} x_{a_{i_t}} + 1. \qquad (2.21)$$

Assume further that there is a finite $R < \infty$ such that

$$\max_t \left\| x_{c_t}(x_{a_{j_t}} - x_{a_{i_t}})^T \right\|_2^2 \leq R^2. \qquad (2.22)$$

Then

$$m_t \leq \|\bar{W}\|_2^2 R^2. \qquad (2.23)$$

Remark 1. Note the presence of '1' in eq. (2.21). This is equivalent to requiring that

$$r_t(c_t, a_{j_t}) > r_t(c_t, a_{i_t}) \;\Leftrightarrow\; x_{c_t}^T \bar{W} x_{a_{j_t}} > x_{c_t}^T \bar{W} x_{a_{i_t}}, \qquad (2.24)$$

or that there exists an $\epsilon > 0$ so that

$$r_t(c_t, a_{j_t}) > r_t(c_t, a_{i_t}) \;\Leftrightarrow\; x_{c_t}^T \bar{W} x_{a_{j_t}} \geq x_{c_t}^T \bar{W} x_{a_{i_t}} + \epsilon, \qquad (2.25)$$

while imposing $\|\bar{W}\|_2 = 1$. That is, by multiplying $\bar{W}$ by a positive constant, one can convert this $\epsilon$ into the '1' as desired. The first assumption (2.21) excludes ties in the algorithm: in the case that two items have the same relevance, they should not be presented to the algorithm. We will handle the case of ties later in the thesis. The second assumption (2.22) is a boundedness assumption on the data.
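As a small worked complement to Remark 1 (added here for clarity, not part of the original text), the rescaling argument can be written out explicitly:

$$x_{c_t}^T \bar{W} x_{a_{j_t}} \geq x_{c_t}^T \bar{W} x_{a_{i_t}} + \epsilon \;\Longrightarrow\; x_{c_t}^T \Big(\tfrac{1}{\epsilon}\bar{W}\Big) x_{a_{j_t}} \geq x_{c_t}^T \Big(\tfrac{1}{\epsilon}\bar{W}\Big) x_{a_{i_t}} + 1,$$

so $\bar{W}' = \bar{W}/\epsilon$ satisfies (2.21), and applying (2.23) to $\bar{W}'$ gives the bound $m_t \leq \|\bar{W}\|_2^2 R^2 / \epsilon^2$: a smaller margin $\epsilon$ allows for more mistakes.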

Theorem 2.3.1 gives us that the number of mistakes made up to time t is bounded by a constant. This is only possible if fewer and fewer mistakes are being made.

Proof. We begin by unfolding the recursion (2.16), that is,

$$W_t = \sum_{s=1}^{t} \sigma_s x_{c_s}(x_{a_{j_s}} - x_{a_{i_s}})^T. \qquad (2.26)$$

We then bound the number of mistakes from above by studying $\mathrm{tr}(\bar{W}^T W_t)$. We get

$$\mathrm{tr}(\bar{W}^T W_t) = \mathrm{tr}\!\left(\bar{W}^T \sum_{s=1}^{t} \sigma_s x_{c_s}(x_{a_{j_s}} - x_{a_{i_s}})^T\right) = \sum_{s=1}^{t} \sigma_s\,\mathrm{tr}\!\left(\bar{W}^T x_{c_s}(x_{a_{j_s}} - x_{a_{i_s}})^T\right) = \sum_{s=1}^{t} \sigma_s\,\mathrm{tr}\!\left((x_{a_{j_s}} - x_{a_{i_s}}) x_{c_s}^T \bar{W}\right) = \sum_{s=1}^{t} \sigma_s\,\mathrm{tr}\!\left(x_{c_s}^T \bar{W}(x_{a_{j_s}} - x_{a_{i_s}})\right) \geq \sum_{s=1}^{t} \sigma_s. \qquad (2.27)$$


From the first equality to the second we pull out the scalar and use the linearity (2.2) of the trace operator. Second to third we use that tr(A) = tr(AT) and that for matrices (AB)T = BTAT. From the third to the fourth we use the cyclic property (2.3) of trace and the inequality is given from assumption (2.21).

By the Cauchy-Schwarz inequality we have that

$$\mathrm{tr}(\bar{W}^T W_t) \leq |\mathrm{tr}(\bar{W}^T W_t)| = |\langle \bar{W}, W_t \rangle| \leq \sqrt{\langle \bar{W}, \bar{W} \rangle \langle W_t, W_t \rangle} = \|\bar{W}\|_2 \|W_t\|_2. \qquad (2.28)$$

Moreover, with the use of (2.7) and (2.9) from the preliminaries we get that

$$\|W_t\|_2^2 = \|W_{t-1} + \sigma_t x_{c_t}(x_{a_{j_t}} - x_{a_{i_t}})^T\|_2^2 = \|W_{t-1}\|_2^2 + \sigma_t^2 \|x_{c_t}(x_{a_{j_t}} - x_{a_{i_t}})^T\|_2^2 + 2\sigma_t\,\mathrm{tr}\!\left(W_{t-1}^T x_{c_t}(x_{a_{j_t}} - x_{a_{i_t}})^T\right) = \|W_{t-1}\|_2^2 + \sigma_t^2 \|x_{c_t}(x_{a_{j_t}} - x_{a_{i_t}})^T\|_2^2 + 2\sigma_t x_{c_t}^T W_{t-1}(x_{a_{j_t}} - x_{a_{i_t}}) \leq \|W_{t-1}\|_2^2 + \sigma_t^2 R^2. \qquad (2.29)$$

From the second equality to the third we again use the cyclic property (2.3) of the trace. From the last equality to the inequality we use the second assumption in Theorem 2.3.1 and the fact that the third term is non-positive: when a mistake is made, the model ranked $a_{i_t}$ above $a_{j_t}$, so $x_{c_t}^T W_{t-1}(x_{a_{j_t}} - x_{a_{i_t}}) \leq 0$.

So, unfolding $W_t$ gives us

$$\|W_t\|_2^2 \leq \|W_{t-1}\|_2^2 + \sigma_t^2 R^2 \leq \|W_{t-2}\|_2^2 + \sigma_t^2 R^2 + \sigma_{t-1}^2 R^2 \leq \dots \leq R^2 \sum_{s=1}^{t} \sigma_s^2 = R^2 m_t. \qquad (2.30)$$

Putting together (2.27) with (2.28) and (2.30) we have

$$m_t = \sum_{s=1}^{t} \sigma_s \leq \mathrm{tr}(\bar{W}^T W_t) \leq \|\bar{W}\|_2 \|W_t\|_2 \leq \|\bar{W}\|_2 \sqrt{R^2 m_t}, \qquad (2.31)$$

and rearranging gives us the result

$$m_t \leq \|\bar{W}\|_2^2 R^2. \qquad (2.32)$$

2.4 Active Learning

Consider a sequence $\{c_1, c_2, \dots, c_t, \dots\}$ of queries, still encoded as $x_{c_t} \in \mathbb{R}^{n_c}$, from which the DR-TRON algorithm learns. In what way was this sequence generated? Arbitrarily? Is there any reason to engineer this sequence in some way to benefit us? A reasonable motive would be to engineer the $c_t$ to achieve fast learning using the smallest number of queries possible; that is, by learning on such an engineered sequence, the matrix $W_t$ evolves as fast as possible towards the solution $\bar{W}$. This is relevant for applications where one must train the model within a controlled, brief timespan before deploying it in real situations. Active learning addresses the question of how to design the queries $\{c_t\}$.

A selective sampling algorithm refers to an algorithm which decides whether to query the current user, in our case $x_{c_t}$, for feedback, based on previously observed data. Through this selective sampling algorithm, users are queried selectively to create a sequence of queries $\{c_1, \dots, c_t, \dots\}$ from which the DR-TRON algorithm learns. Much of the work involving the selective sampling algorithm is inspired by similar work done in [4], and the results of this section will be compared to results from that paper.

In this thesis, however, we will be given a sequence $\{c_1, c_2, \dots, c_t, \dots\}$ from FreeSpee on which to train our model. We will refer to this sequence as our 'vanilla' sequence. This means we have already missed our chance to use a selective sampling method to generate the sequence. Regardless, it is still interesting to see how the performance differs if we learn on a subsequence of the vanilla one, picked out with the selective sampler, versus training an unfiltered model on the same number of queries. Performance of this kind will first be investigated on generated data in chapter 3, to see if the coming sampling method shows any promise.

To motivate a certain sampling method for the queries, we first define what will be called the empirical margin between two items $a_k$ and $a_l$ for a certain context c as

$$\mathrm{emp}(a_k, a_l, c) := |\hat{r}(a_k, c) - \hat{r}(a_l, c)| = \left| x_c^T W (x_{a_k} - x_{a_l}) \right|. \qquad (2.33)$$

This is the difference in score between items $a_k$ and $a_l$, and it can be thought of as how confident the model is in ranking either of them above the other. The intuition behind the sampling method is that the model is likely to learn and update on queries for which the empirical margin between two items is small: the model is not very 'confident' in the ranking between these two items, so it makes sense to receive feedback from this particular user. The question then becomes: if a sampling method makes use of the empirical margin, which items should one consider? When there are more than two items, a natural choice is the top two currently ranked items of the model, and this is also the choice made throughout this thesis. Of course one could try different schemes, such as randomizing the items once or anew for each t. Another scheme could be to pick, for each t, the smallest empirical margin between two adjacently ranked items. The formal sampling algorithm will later be stated with general $a_k$ and $a_l$.
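The empirical margin of the top two currently ranked items is easy to compute from the current model. The helper below is an illustrative sketch added for this text (assuming NumPy and the same encoding as above); the same quantity reappears in the sampling rule (2.36).

```python
import numpy as np

def top_two_margin(W, x_c, items):
    """Empirical margin, eq. (2.33), between the two currently top-ranked items.

    W:     current model matrix of shape (n_c, n_a)
    x_c:   encoded context, shape (n_c,)
    items: item characterisations, shape (m, n_a)
    """
    scores = x_c @ W @ items.T
    k, l = np.argsort(-scores)[:2]      # indices of the top two items a_k and a_l
    return abs(scores[k] - scores[l])
```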

Before proceeding any further, we need to say something about the situation of ties in the ranked list and how to handle it. If there are any ties in the ranked list given by (2.14), we will, for every block of ties, randomize the order within the block while keeping the block in the same position in the list. For example, assume that in our ranked list the third to fifth items are ranked equally and the ninth and tenth items are ranked equally. One block then consists of items 3-5 and the other block consists of items 9-10. We then randomize the positions within these blocks separately, but do not change the position of the blocks in the list. This is done whenever needed throughout the thesis.

Now let us see if the model really makes a large portion of its updates when the empirical margin of the top two ranked items is small. Consider therefore a training run of DR-TRON (see Algorithm 1) on users encoded as independent and identically distributed (IID) multivariate Gaussian random variables of dimension $n_c = 10$, with zero mean vector $\mu$ and covariance matrix $\Sigma$ equal to the identity matrix; in short, $x_{c_t} \sim N_{n_c}(\mu, \Sigma)$. Further, let the characterizations of the objects, $x_{a_i}$, be distributed as $x_{a_i} \sim N_{n_a}(\mu, \Sigma)$ with $n_a = 10$. Finally, let our true solution, denoted $W_{\mathrm{true}}$, be an $n_c \times n_a$ matrix with IID N(0, 1) entries. The true ranking is given by inserting $W_{\mathrm{true}}$ into equation (2.13), and this is used to simulate a user's feedback on the suggested ranking. Beginning with two items, i.e. m = 2, and a sample of 200 users, we plot the empirical margin for each query and mark each query that leads to an update of the model. Below is the plot of this.

Figure 2.3: Empirical margin of each query between the only two items with a sample size of 200 users.

We do the same for m = 10 and a sample of 400 users. As mentioned earlier ak and al are the top two currently ranked items at each time point t.


Figure 2.4: Empirical margin of the top two currently ranked items for each query when m = 10, with a sample size of 400 users.

We can see in these figures that a large portion of the updates made by the model are made when the empirical margin of the currently top two ranked items is relatively small in comparison with the rest of the queries. This is what the following selective sampling method will take advantage of.

The selective sampling algorithm gives rise to the following form of $W_t$:

$$W_t = W_{t-1} + \sigma_t Z_t x_{c_t}(x_{a_{j_t}} - x_{a_{i_t}})^T, \qquad (2.34)$$

where $x_{a_{i_t}}, x_{a_{j_t}}$ again correspond to our predicted most relevant item and the user's choice respectively, $W_0 = 0$, and $Z_t$ is a Bernoulli distributed random variable with parameter $Q_t$. The outcome of $Z_t$ decides whether we query the user for feedback or not, and we query whenever $Z_t$ equals 1. Equivalently we have

$$W_t = \sum_{s=1}^{t} \sigma_s Z_s x_{c_s}(x_{a_{j_s}} - x_{a_{i_s}})^T. \qquad (2.35)$$


In more detail:

Algorithm 2 Selective sampling for the DR-TRON

Require: Initiate $W_0 = 0$ and compute the characterisations $\{x_{a_i} \in \mathbb{R}^{n_a}\}$ of the m objects. Let $b > 0$ be a constant. Let $m_0 = 0$.
for t = 1, 2, 3, . . . do
  (1) A context $c_t$ (encoded as $x_{c_t} \in \mathbb{R}^{n_c}$) is provided, and two possibly relevant items $a_l$ and $a_k$ are selected.
  (2) Sample $Z_t \in \{0, 1\}$ as a Bernoulli random variable with

  $$0 < Q_t = \frac{b}{b + \left|\mathrm{tr}\!\left(x_{c_t}^T W_{t-1}(x_{a_l} - x_{a_k})\right)\right|} \leq 1 \qquad (2.36)$$

  and $P(Z_t = 1) = Q_t$.
  if $Z_t = 1$ then
    All m objects {a} are ordered in terms of predicted relevance

    $$\hat{r}_t(a, c_t) = x_{c_t}^T W_{t-1} x_a, \qquad (2.37)$$

    say as $\hat{r}_t(a_{(1)}, c_t) \geq \hat{r}_t(a_{(2)}, c_t) \geq \dots \geq \hat{r}_t(a_{(m)}, c_t)$.
    - The user is asked for feedback on this ranking.
    if there was a mistake at t on the preference between items $(a_i, a_j)$ then
      the solution is updated as

      $$W_t = W_{t-1} + x_{c_t}(x_{a_j} - x_{a_i})^T, \qquad (2.38)$$

    else
      the solution stays $W_t = W_{t-1}$.
    end if
  end if
end for
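A compact sketch of Algorithm 2 is given below (again an illustration added for this text, assuming NumPy; the feedback is simulated by a hypothetical W_true as before). The only difference from the earlier DR-TRON sketch is the Bernoulli filter Z_t with parameter Q_t from (2.36), computed here from the top two currently ranked items, matching the choice of a_l and a_k used in the thesis.

```python
import numpy as np

def selective_dr_tron(contexts, items, feedback, n_c, n_a, b=1.0, rng=None):
    """DR-TRON with the selective sampling filter of Algorithm 2 (illustrative sketch)."""
    if rng is None:
        rng = np.random.default_rng()
    W = np.zeros((n_c, n_a))
    queries = 0
    for t, x_c in enumerate(contexts):
        scores = x_c @ W @ items.T
        ranking = np.argsort(-scores)
        k, l = ranking[0], ranking[1]              # the two currently most relevant items
        margin = abs(scores[k] - scores[l])        # |x_c^T W_{t-1}(x_{a_l} - x_{a_k})|
        q_t = b / (b + margin)                     # eq. (2.36)
        if rng.random() < q_t:                     # Z_t = 1: query this user
            queries += 1
            i = ranking[0]                         # a_i: our top recommendation
            j = feedback(t, ranking)               # a_j: the user's choice
            if j != i:                             # mistake -> update, eq. (2.38)
                W = W + np.outer(x_c, items[j] - items[i])
    return W, queries

# Example usage: larger b means Q_t closer to 1, i.e. more users are queried.
rng = np.random.default_rng(0)
n_c, n_a, m, T = 10, 10, 5, 400
items = rng.normal(size=(m, n_a))
contexts = rng.normal(size=(T, n_c))
W_true = rng.normal(size=(n_c, n_a))
true_best = lambda t, ranking: int(np.argmax(contexts[t] @ W_true @ items.T))
W, queries = selective_dr_tron(contexts, items, true_best, n_c, n_a, b=1.0, rng=rng)
print("queries made:", queries, "out of", T)
```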

Note that $a_i$ still refers to the most relevant presented item and $a_j$ to the user's choice. We do not make any assumptions on the sequence $(c_1, a_{l_1}, a_{k_1}), (c_2, a_{l_2}, a_{k_2}), \dots$, but we do require that the user cannot take the outcome of $Z_t$ into account when determining their preferences. This means that if we use (2.37) and choose $a_l$ and $a_k$ based on it, then the requirement implies that $\sigma_t$ can be determined using only the information of $Z_1, Z_2, \dots, Z_{t-1}$, since the user preference is already generated before $Z_t$. In summary, $\sigma_t$ is measurable with respect to the $\sigma$-algebra generated by $Z_1, Z_2, \dots, Z_{t-1}$. This property will be used later on.

The choice of $a_k$ and $a_l$, combined with every previous query, affects the probability of querying the user at time t. This is decided by $Z_t$, which acts as a kind of filter, and we will refer to the model using this sampling method as a filtered model. We also see that as b goes to infinity we recover the unfiltered algorithm. So b affects the number of queries, and a natural question is whether there is any default value or strategy for initially setting this constant. This will be investigated and analyzed in chapter 3. Again, if we assume $\bar{W}$ exists, do we have any result regarding the performance of the algorithm? The second theoretical result addresses this.
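As a small worked example (added for illustration, with numbers chosen only for this text), suppose the empirical margin of the top two items at some time t is $|x_{c_t}^T W_{t-1}(x_{a_l} - x_{a_k})| = 2$. Then

$$Q_t = \frac{b}{b + 2} = \begin{cases} 0.2 & \text{for } b = 0.5, \\ 0.5 & \text{for } b = 2, \\ \approx 0.91 & \text{for } b = 20, \end{cases}$$

so small values of b query users almost only when the model is uncertain, while large values of b approach the unfiltered algorithm that queries everyone.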


2.5 Second Theoretical Result

Before stating the second theoretical result, an important distinction must be made between the following two quantities:

$$\sum_{s=1}^{t} \sigma_s \qquad \text{and} \qquad \sum_{s=1}^{t} \sigma_s Z_s.$$

The second quantity is the number of updates made by the selective sampling algorithm (Algorithm 2) up to time t, while the first is the number of mistakes made by the selective sampling algorithm up to time t. The important thing to note here is that $\sigma_s$ can be 1 without $Z_s$ being 1, i.e. a mistake can be made by the current model at time s without the available user at time s actually being queried. The updates and mistakes are, in contrast to the first algorithm (2.17), not necessarily coupled with each other anymore.

Assuming the same conditions as in Theorem 2.3.1, the number of updates the selective sampling algorithm makes up to time t is bounded by the same bound as in Theorem 2.3.1, while using fewer queries. Before stating this result, we define explicitly the number of updates made by the selective sampling algorithm,

$$U_t := \sum_{s=1}^{t} \sigma_s Z_s. \qquad (2.39)$$

Theorem 2.5.2. Assume the assumptions of Theorem 2.3.1. Then

$$U_t \leq \|\bar{W}\|_2^2 R^2 \quad a.s. \qquad (2.40)$$

and the expected number of queries is $\sum_{s=1}^{t} E(Q_s)$.

Proof. The expected number of queries follows easily from the tower property of conditional expectation mentioned in section 2.1.3. That is,

$$E\left(\sum_{s=1}^{t} Z_s\right) = \sum_{s=1}^{t} E\big(E[Z_s \mid \sigma(Z_1, \dots, Z_{s-1})] \mid \{\Omega, \emptyset\}\big) = \sum_{s=1}^{t} E(Q_s \mid \{\Omega, \emptyset\}) = \sum_{s=1}^{t} E(Q_s). \qquad (2.41)$$

The bound is analogous to the proof of Theorem 2.3.1. The only difference is that $Z_s$ appears in the sums, and in the same way as $\sum_{s=1}^{t} \sigma_s = \sum_{s=1}^{t} \sigma_s^2$ we have that $\sum_{s=1}^{t} \sigma_s Z_s = \sum_{s=1}^{t} \sigma_s^2 Z_s^2$.

This bound on the updates is also a bound for the mistakes that are coupled with these updates. We would like a bound on the total number of mistakes, i.e. not only the mistakes associated with an update but also the mistakes made without actually querying the user. This was not achieved, but a bound in expectation on the mistakes associated with an additional condition was achieved and is stated below as a corollary. Before stating the corollary, let us recall that in Algorithm 2, $Q_t$ was defined as

$$Q_t = \frac{b}{b + \left|\mathrm{tr}\!\left(x_{c_t}^T W_{t-1}(x_{a_l} - x_{a_k})\right)\right|} \qquad (2.42)$$

for some possibly relevant items $a_l$ and $a_k$; recall that earlier we took the two most relevant items suggested by the model as an example. The mistakes that will be bounded are the mistakes $\sigma_t$ for which $Q_t$ has the property

$$F_t := \left|\mathrm{tr}\!\left(x_{c_t}^T W_{t-1}(x_{a_l} - x_{a_k})\right)\right| < a \qquad (2.43)$$

for some positive constant a > 0. So we identify the mistakes $\sigma_t$ which are coupled with the property (2.43) and count those. This implies for $Q_t$ that

$$Q_t > \frac{b}{b + a} > 0, \qquad (2.44)$$

and this will be used in the corollary.

Corollary 2.5.2.1. Assume the assumptions of Theorem 2.5.2. Then

$$E\left(\sum_{s \leq t \,:\, F_s < a} \sigma_s\right) \leq \|\bar{W}\|_2^2 R^2 \left(\frac{b + a}{b}\right). \qquad (2.45)$$

Proof. We begin by recalling that $\sigma_s$ is measurable with respect to the sigma algebra $\sigma(Z_1, \dots, Z_{s-1})$, i.e. $\sigma_s$ can be evaluated knowing only the information up to time s - 1. Also note that $Z_s$ is sampled independently given the information up to time s - 1, i.e.

$$E[Z_s \mid \sigma(Z_1, \dots, Z_{s-1})] = Q_s. \qquad (2.46)$$

Using this and the tower property of conditional expectation mentioned in section 2.1.3, we can write

$$E\left[\sum_{s=1}^{t} \sigma_s Z_s\right] = \sum_{s=1}^{t} E[\sigma_s Z_s] = \sum_{s=1}^{t} E\big[E(\sigma_s Z_s \mid \sigma(Z_1, \dots, Z_{s-1}))\big] = \sum_{s=1}^{t} E\big[\sigma_s E(Z_s \mid \sigma(Z_1, \dots, Z_{s-1}))\big] = \sum_{s=1}^{t} E[\sigma_s Q_s] = E\left[\sum_{s=1}^{t} \sigma_s Q_s\right]. \qquad (2.47)$$

Now we split the sum inside the expectation operator into two parts and make use of (2.44):

$$E\left[\sum_{s=1}^{t} \sigma_s Q_s\right] = E\left[\sum_{s \leq t \,:\, F_s < a} \sigma_s Q_s + \sum_{s \leq t \,:\, F_s \geq a} \sigma_s Q_s\right] \geq E\left[\sum_{s \leq t \,:\, F_s < a} \sigma_s \frac{b}{a + b} + \sum_{s \leq t \,:\, F_s \geq a} \sigma_s Q_s\right] \geq \frac{b}{a + b}\, E\left[\sum_{s \leq t \,:\, F_s < a} \sigma_s\right], \qquad (2.48)$$

where the last step drops the second, non-negative sum. Finally, using this together with Theorem 2.5.2 we get that

$$\frac{b}{a + b}\, E\left[\sum_{s \leq t \,:\, F_s < a} \sigma_s\right] \leq E\left[\sum_{s=1}^{t} \sigma_s Q_s\right] \leq \|\bar{W}\|_2^2 R^2. \qquad (2.49)$$

Multiplying both sides by $\frac{b + a}{b}$ gives the result.


With the classifier and selective sampling algorithm used in [4], the same bound was achieved, in expectation, for the total number of mistakes, whereas in our setting the achieved result does not bound all the mistakes made up to time t > 0. Note that if one adjusts the definition of $Q_t$ to, for example, $Q_t' := \max(d, Q_t)$, then, using the same arguments as above, a bound on all the mistakes can be found, the bound being $\|\bar{W}\|_2^2 R^2 \frac{b+d}{b}$.

An implication of Theorem 2.5.2 is that if we let both the filtered and the unfiltered algorithm query K times on a sequence of queries $\{c_1, c_2, \dots, c_t, \dots\}$, we should expect the filtered algorithm to make, in expectation, more updates than the unfiltered one. We will see this happen on our own generated data in the next chapter.


Chapter 3

Artificial Evaluation

In this chapter we will analyze the filter and see if we can observe a difference in favour of the filter on different types of data that we generate ourselves. As mentioned previously in the outline of the thesis, FreeSpee had difficulties delivering a suitable data set. Therefore, to make progress on the thesis work and also to prepare for the actual data, a preliminary analysis was done on general data, and these results are presented here.

This analysis will be done for different dimensions of the items $x_{a_i} \in \mathbb{R}^{n_a}$ and contexts $x_c \in \mathbb{R}^{n_c}$, as well as for different parameters such as the number of objects, denoted m, and the constant b. Again, we fix $a_l$ and $a_k$ in (2.36) to be the two most relevant items recommended at each time t.

We begin by explaining how we intend to analyze a possible difference between the unfiltered and the filtered algorithm. The idea is, given all necessary parameters, to first train a model using Algorithm 2 on a sample of n users. We save how many queries were made, say l, and then train another model using Algorithm 1 on the first l users from the same sample. This is the scenario in which we have queries $\{c_1, c_2, \dots, c_t, \dots\}$ in a pipeline and can afford to query l users within a window of time [1, T]. We then generate k new users, for which both models recommend items, and count the number of mistakes made by both models on the top relevant item. We do this r times and see how the percentage of mistakes is distributed.

In a real application one might not have an ending time T, but rather end the querying when l queries are reached or when some other terminating condition is met. The reason that the analysis is structured this way is that we expect to be given a sequence $\{c_1, c_2, \dots, c_n\}$ of queries from FreeSpee which the filter should ideally have been involved in generating from the beginning. What we can do instead is apply the analysis scheme below to check the potential of the filter on such a sequence; a code sketch of one configuration of this scheme is given after Algorithm 3.


Algorithm 3 Scheme for analyzing

Require: Set b > 0, r, m, $n_a$, $n_c$, k and n.
Generate the m items $X_a \in \mathbb{R}^{n_a \times m}$ and $W_{\mathrm{true}} \in \mathbb{R}^{n_c \times n_a}$.
for i = 1, 2, 3, . . . , r do
  Generate the n users $X_c \in \mathbb{R}^{n \times n_c}$.
  for t = 1, 2, 3, . . . , n do
    (1) Train a model using Algorithm 2, the sample $X_c$ and the items $X_a$, with $W_{\mathrm{true}}$ acting as the user feedback. Save the number of queries l made.
    (2) Train a model using Algorithm 1 and $X_a$ on the first l users in $X_c$, using $W_{\mathrm{true}}$ as the user feedback.
    (3) Generate k new users and count the number of times both models recommended the wrong most relevant item.
  end for
end for
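Below is a self-contained NumPy sketch of one repetition of this comparison, added for illustration only (the helper names are chosen for this text, ties are broken by index rather than randomized, and, for compactness, each repetition regenerates the items and W_true, whereas Algorithm 3 fixes them across the r repetitions): the filtered model decides how many queries l it spends, the unfiltered model gets the first l users of the same sample, and both are scored on k fresh users.

```python
import numpy as np

def score(W, x_c, items):
    return x_c @ W @ items.T

def run_comparison(b, m, n_a, n_c, k, n, rng):
    """One iteration of the comparison scheme: filtered vs. unfiltered DR-TRON."""
    items = rng.normal(size=(m, n_a))
    W_true = rng.normal(size=(n_c, n_a))
    users = rng.normal(size=(n, n_c))

    # (1) Filtered training pass (Algorithm 2); count the queries l it spends.
    W_f = np.zeros((n_c, n_a))
    l = 0
    for x_c in users:
        s = score(W_f, x_c, items)
        top = np.argsort(-s)[:2]
        q_t = b / (b + abs(s[top[0]] - s[top[1]]))        # eq. (2.36)
        if rng.random() < q_t:
            l += 1
            i = int(np.argmax(s))
            j = int(np.argmax(score(W_true, x_c, items))) # simulated feedback
            if j != i:
                W_f += np.outer(x_c, items[j] - items[i])

    # (2) Unfiltered training pass (Algorithm 1) on the first l users of the same sample.
    W_u = np.zeros((n_c, n_a))
    for x_c in users[:l]:
        i = int(np.argmax(score(W_u, x_c, items)))
        j = int(np.argmax(score(W_true, x_c, items)))
        if j != i:
            W_u += np.outer(x_c, items[j] - items[i])

    # (3) Score both models on k fresh users: fraction of wrong top-1 recommendations.
    test = rng.normal(size=(k, n_c))
    truth = np.argmax(test @ W_true @ items.T, axis=1)
    err_f = np.mean(np.argmax(test @ W_f @ items.T, axis=1) != truth)
    err_u = np.mean(np.argmax(test @ W_u @ items.T, axis=1) != truth)
    return l, err_f, err_u

rng = np.random.default_rng(0)
results = [run_comparison(b=1.0, m=10, n_a=10, n_c=10, k=500, n=400, rng=rng) for _ in range(20)]
print("mean error (filtered, unfiltered):",
      np.mean([r[1] for r in results]), np.mean([r[2] for r in results]))
```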

As previously mentioned, this will be done for different configurations of the parameters b, r, m, $n_a$, $n_c$ and n.
