
Predicting Redemption Probability of Gift Cards

Combining Rating and Demographic Data in a Hybrid User Based Recommender System

A G N E S S K A T T M A N U D D

Master of Science Thesis Stockholm, Sweden 2013


DD221X, Master's Thesis in Computer Science (30 ECTS credits)
Degree Programme in Computer Science and Engineering (300 credits)
Royal Institute of Technology, 2013

Supervisor at CSC: Anders Lansner
Examiner: Jens Lagergren
TRITA-CSC-E 2013:009
ISRN-KTH/CSC/E--13/009--SE
ISSN-1653-5715

Royal Institute of Technology
School of Computer Science and Communication
KTH CSC
SE-100 44 Stockholm, Sweden
URL: www.kth.se/csc


Abstract

Recommender systems try to facilitate the decision-making process of users by recommending products such as movies, music and news articles. This work uses a user-based recommender system to predict redemption probabilities of different gift cards, that is, the probability that a user redeems a gift card in a store, given that he or she receives it. This work is a basis for ranking gift cards in the future.

Two collaborative filtering algorithms are evaluated, both based on neighbourhood recommender methods. The data are provided by the digital gift-giving company Wrapp. The nearest neighbours are chosen by similarity, based on the ratings of gift cards and the demographic data of the users.

The results show that it is possible to predict redemption probabilities of gift cards with these data. They also show that it is important to include certain user behaviors when predicting the redemption probabilities, for example whether a user tends to redeem more or fewer gift cards than other users. This work does not explicitly show that demographic data improve the result compared to a rating-only approach, even though the results with demographic data seem promising.


Referat

Predicting redemption of gift cards, by combining redemption and demographic data in a user-based recommender system

Recommender systems strive to facilitate users' decision making by recommending products such as films, music or newspaper articles. This work uses a user-based recommender system to predict the redemption probability of different gift cards. By redemption probability is meant the probability that a user redeems a gift card in a store, given that he or she has received the gift card. This work is a foundation for ranking gift cards in the future. Two collaborative filtering algorithms are evaluated, both based on neighbourhood recommendation methods.

The data are provided by the digital gift card company Wrapp. The nearest neighbours are determined by similarity, based on the users' ratings of the gift cards and their demographic data. The results show that it is possible to predict the redemption probability of gift cards with the help of these data. The work also shows that it is important to take certain user behaviors into account, such as that some users tend to redeem more or fewer gift cards than others. It cannot be said with certainty that the demographic data improve the results, compared to basing the results on ratings alone, but the results look promising.


This work is a master's thesis at the School of Computer Science and Communication, Royal Institute of Technology, based on a problem provided by Wrapp.

Contents

Preface

1 Introduction
  1.1 Recommender Systems
    1.1.1 The Function of Recommender Systems
    1.1.2 Objectives of This Work
  1.2 About Wrapp
  1.3 The Data
  1.4 Problem Definition
    1.4.1 Restrictions

2 Background
  2.1 Data Mining Methods for Recommender Systems
    2.1.1 Data Preprocessing
    2.1.2 Classification
    2.1.3 Cluster Analyses
  2.2 Recommender System Strategies
  2.3 Collaborative Filtering
    2.3.1 The Cold Start Problem
    2.3.2 Explicit and Implicit Feedback
    2.3.3 The Netflix Prize Competition
  2.4 Neighbourhood-based Recommendation Methods
    2.4.1 Advantages of Neighbourhood-based Recommendation
    2.4.2 User-based Recommendation
    2.4.3 Similarity Weight Computation
    2.4.4 Selection of the Neighbourhood
    2.4.5 Limited Coverage and Sparse Data
  2.5 Latent Factor Models
    2.5.1 Baseline Predictors
    2.5.2 A User-User Neighbourhood Model

3 Related Work

4 Methods
  4.1 Preprocessing
    4.1.1 The Data and Restriction
    4.1.2 Similarity Measures
    4.1.3 Rating Similarity
    4.1.4 Demographic Similarity
  4.2 The Algorithms
    4.2.1 Similarity
    4.2.2 Rating
    4.2.3 Algorithm 1
    4.2.4 Algorithm 2
  4.3 Experiments
    4.3.1 Evaluation

5 Results
  5.1 Demographic Properties
    5.1.1 Similarity Weights Based on One Property
    5.1.2 Similarity Weights Based on Different Weighted Properties
  5.2 Different Alphas
  5.3 Different Sizes of the Neighbourhood
  5.4 Comparing the Algorithms

6 Discussion
  6.1 Analyzing the Results
    6.1.1 Demographic Properties
    6.1.2 Different Alphas
    6.1.3 Different Sizes of the Neighbourhood
    6.1.4 Comparing the Algorithms
  6.2 Implementing the Algorithms
  6.3 Further Work

7 Conclusions

Bibliography


Chapter 1

Introduction

This section gives a short introduction to this work and describes the problem.

1.1 Recommender Systems

Recommender systems analyse patterns of user interest to provide suggestions for items or products [11, 15]. Many companies provide a huge selection of products, and recommender systems strive to facilitate the user's decision-making process. A user faces various decisions, such as what item to buy, what article to read, what music to listen to or what film or TV show to watch [15].

1.1.1 The Function of Recommender Systems

Recommender systems are used to meet special needs and tastes [11]. Some purposes are [15]:

• to increase the number of items sold,

• to introduce users to new items,

• to increase customer satisfaction,

• to increase and keep user fidelity for recurring customers, and

• to better understand what a user wants.


CHAPTER 1. INTRODUCTION

1.1.2 Objectives of This Work

To fully understand the aim of this work, some information about Wrapp is needed; this is presented in section 1.2. In short, the purpose of this work is to find a way to predict redemption probability, which is the probability that a user uses a gift card, given that he or she receives it. The idea is that if the redemption probability of a user could be determined for each gift card in a set, these values could be used to order a list of gift cards. This type of ordering or ranking could be used when presenting available gift cards to a Wrapp user who is selecting a gift card for a friend. More explicitly, this ranking of the gift cards could:

• let users get gift cards more suited for their needs and interests

• facilitate the decision making process when a user chooses a gift for a friend

• increase Wrapp's profit by giving users gift cards that they are more likely to redeem.

However, the ranking itself is not included in this work. The aim is to find a way to determine the redemption probability; if this is possible, gift cards can be ranked in the future.

1.2 About Wrapp

Wrapp is a social gifting service, launched September 12, 2011, that enables friends to give digital gift cards to one another on a daily basis1. Some gift cards need to be paid for, and some are sponsored by the merchant that issued the gift card, as a way to attract more customers to the store. Users are connected to their friends via Facebook. When a user receives a gift card, he or she may redeem it in a retailer's store or its online shop, using Wrapp's web application for online shops or one of Wrapp's mobile applications directly in the store.

In order to give away a gift card, a user selects a friend to celebrate. The giving user then gets a list of available gift cards to give to his or her friend, the receiving user. The receiving user is not necessarily a Wrapp user; he or she may only be a Facebook friend of the giving user. By using the information about which gift cards similar persons have been given in the past, together with which of these gift cards these persons have redeemed, we can conclude which gift card is expected to be most appreciated. Since a store pays Wrapp for each gift card that is redeemed, a gift card that a user wants is also profitable for Wrapp.

1 https://www.wrapp.com/


One main difference between sponsored and paid gift cards is the time for which they are active, that is, before they expire: sponsored and paid gift cards are active for a month and a year, respectively. Each gift card belongs to a campaign that is active for a certain period of time. Sponsored gift card campaigns also have a daily limit on the number of gifts that can be given away. Another difference is that sponsored gift cards have a predefined value, which varies depending on the target group of the campaign that the gift card belongs to. The giver can add to this value in increments of a certain predefined amount, called payment steps, determined by the campaign. Paid gift cards can be bought in different values defined by the payment steps. The availability of gift cards varies between different target groups, and also during the day, according to the daily limit. Sponsored gift cards are also limited if you have received a gift from that merchant before: if you have received a gift card from a merchant that is still active, or if you have already redeemed it, you cannot receive it again. There is also a limit on how many sponsored gift cards you can send per day. The value of a redeemed gift card determines how much Wrapp is paid by the merchant of that campaign.

The purpose of sponsored gift cards is to get customers within the target group to buy products at the retailer. These gift cards cannot be refunded when not used to their full value. A user may choose to buy something for exactly the same amount as, or less than, the amount of the gift card. These users are not preferred by the retailer, which wants users that buy for more than the amount of the gift card. Such users are in this work called non-preferred users, and their opposite are called preferred users. The amount purchased for in a store is called the transaction amount.

A user may have various reasons not to redeem a gift. The user may not have time or may not be in the mood for shopping during the period in which the gift card is active, or a new user may be suspicious of Wrapp, to give two examples. A user may also redeem a gift card for non-personal use; the purchase may be a gift for someone else. These ambiguous user behaviors will be called indeterminate. If a user does not want to use a gift, there is no way to erase it from the gift card wallet. Some users may let it expire, and some users may redeem the gift at home without buying something. Some users may also redeem a gift at home only to try Wrapp and see how it works. These types of redemptions are called false redemptions.

The order in which the gift cards are presented to the user can be improved by ranking them. The current ranking only distinguishes sponsored and paid gift cards, ordered by value. This master's thesis will try to find a better method for this problem.



1.3 The Data

The data used in this master's thesis are provided by Wrapp. The data were collected and anonymized from Wrapp's database, generated in February 2012, during which time Wrapp was available in Sweden only. The data hold information about all gift cards that have been given away, received and redeemed, and by which person, since Wrapp's launch in September 2011. The persons in the database are both users that actually have signed up for Wrapp and Facebook friends of these Wrapp users.

The Facebook friends of the Wrapp users may sometimes have received a gift card, but they have never signed up and logged in to Wrapp. The persons are associated with demographic information such as age, gender and location. However, not all of these persons have given Wrapp permission to receive all this information.

The data consist of 6.5 million user ids, of which 88,000 are signed-up users.

1.4 Problem Definition

This work will choose and evaluate two recommender system algorithms to predict the redemption probability. With these two algorithms, the following questions are addressed:

1. Is it possible to predict a redemption probability value, that is, a measure of how likely a user is to redeem a gift, given that he or she has received it?

2. Which of the two algorithms makes the best predictions?

1.4.1 Restrictions

Since this work is limited to a Master's thesis, there are limitations to how the data can be explored. The general restriction for this work is that the algorithms will be tested with data that are complete. Only the users with the most information will be used, and only campaigns with sufficient information to make a prediction will be considered. Here follow more specific restrictions:

• This work will only evaluate the predictions of the redemption probabilities, and not if the predictions can be used for ranking the gift cards.

• This work will only study user behaviors from users that have received at least one gift card.

• The gift cards that are considered have an expiration date before the data were collected. This is to be sure that a gift card that was not redeemed could not be redeemed later. Redeemed gift cards are chosen with the same condition.

• A user that redeems a gift card will be said to appreciate it, regardless of indeterminate behavior or false redemptions.

• Users with insufficient data, for example missing gender information, will be omitted.

• Only users with a current location in Sweden will be considered.

• Only sponsored gift cards are considered. Paid gift cards lack a sufficient amount of data and are active for too long to make good predictions.


Chapter 2

Background

2.1 Data Mining Methods for Recommender Systems

There are various techniques to collect, prepare, analyse and interpret data. In recommender systems the most important methods are sampling, dimensionality reduction and distance functions [2]. In this section an introduction to recommender system techniques is presented; only some of these techniques are used in this work. In section 2.2 the techniques used in this work are more thoroughly described.

2.1.1 Data Preprocessing

Real-life data typically need to be preprocessed [2]. In order to use the data in the analysis step of a recommender system, they may be cleansed, filtered and transformed. In this section some methods to do this are presented. Data are defined as a collection of objects with attributes, where an attribute is defined as a characteristic of an object.

Similarity Measures

Most classifiers and clustering techniques are highly dependent on defining an appropriate similarity or distance measure. Recommender systems have traditionally used cosine similarity or the Pearson correlation [2]. These are defined as follows:

$$\cos(x, y) = \frac{x \cdot y}{\|x\|\,\|y\|} \qquad (2.1)$$

where $\cdot$ denotes the vector dot product and $\|x\|$ is the norm of the vector $x$.

$$\mathrm{Pearson}(x, y) = \frac{\mathrm{cov}(x, y)}{\sigma_x \times \sigma_y} \qquad (2.2)$$

which measures the linear relationship between the objects $x$ and $y$; $\sigma_x$ is the standard deviation of $x$.

Sampling

Sampling is a technique used in data mining for selecting a subset from a larger data set. It is used both in the preprocessing of the data and for interpreting the final data [2]. It may also be used for creating training and testing datasets. The key issue of sampling is to find a subset that is representative of the entire set. The most common technique is random sampling, where there is an equal probability of selecting any item. Another technique is stratified sampling, which splits the data into several partitions based on a particular feature, and then uses random sampling on each partition.

The common approach is to sample without replacement, where an item is removed from the population when selected. The alternative is sampling with replacement, which allows items to be selected more than once.
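The sampling variants above can be sketched in Python. This is an illustrative sketch only; the function names and the toy user list are not from the thesis:

```python
import random

def sample_without_replacement(items, n, seed=0):
    """Random sampling without replacement: each item can be picked at most once."""
    return random.Random(seed).sample(items, n)

def sample_with_replacement(items, n, seed=0):
    """Random sampling with replacement: the same item may be picked repeatedly."""
    rng = random.Random(seed)
    return [rng.choice(items) for _ in range(n)]

def stratified_sample(items, key, frac, seed=0):
    """Stratified sampling: partition on `key`, then random-sample each partition."""
    rng = random.Random(seed)
    partitions = {}
    for item in items:
        partitions.setdefault(key(item), []).append(item)
    sample = []
    for members in partitions.values():
        sample.extend(rng.sample(members, max(1, round(frac * len(members)))))
    return sample

# Toy population: 50 female and 50 male users.
users = [{"id": i, "gender": "F" if i % 2 else "M"} for i in range(100)]
subset = stratified_sample(users, key=lambda u: u["gender"], frac=0.2)
print(len(subset))  # 20: ten users from each gender stratum
```

Stratifying on gender guarantees that both strata stay represented in proportion, which plain random sampling does not.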

Common practice when creating training and testing datasets with random sampling is to select 20% of the data to evaluate the model; the rest is used for training. This can lead to overfitting of the dataset. To reduce the risk of overfitting, selection, training and evaluation are performed K times, and in the end the average performance of the K trained models is computed. This process is called cross-validation.
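The K-fold cross-validation procedure can be sketched as follows. The toy "model" that predicts the mean training rating is a hypothetical stand-in, not the thesis's algorithm:

```python
import random

def k_fold_splits(data, k, seed=0):
    """Shuffle the data and yield k (train, test) splits; each fold is the test set once."""
    items = data[:]
    random.Random(seed).shuffle(items)
    folds = [items[i::k] for i in range(k)]
    for i in range(k):
        test = folds[i]
        train = [x for j, fold in enumerate(folds) if j != i for x in fold]
        yield train, test

def cross_validate(data, k, train_fn, error_fn):
    """Average the test error over the k trained models."""
    errors = [error_fn(train_fn(train), test)
              for train, test in k_fold_splits(data, k)]
    return sum(errors) / k

# Toy binary "ratings" (1 = redeemed, 0 = not redeemed) and a model that
# simply predicts the mean training rating.
ratings = [1, 0, 1, 1, 0, 1, 0, 1, 1, 1]
mean_mae = cross_validate(
    ratings, k=5,
    train_fn=lambda train: sum(train) / len(train),
    error_fn=lambda model, test: sum(abs(model - r) for r in test) / len(test),
)
print(round(mean_mae, 3))
```

With k = 5 each item is tested exactly once, so the averaged error uses all the data without ever evaluating a model on its own training set.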

Dimensionality Reduction

Datasets used in recommender systems have features that define a high-dimensional space. These data are also often very sparse, with few features having known values [2]. Distances and densities, for example as used in clustering, are much less meaningful in high-dimensional spaces. Therefore dimensionality reduction is used to transform the data set into a lower-dimensional space. The two most common algorithms used in recommender systems are Principal Component Analysis (PCA) and Singular Value Decomposition (SVD).


CHAPTER 2. BACKGROUND

Denoising

Denoising is used during the preprocessing step to remove unwanted effects in the data and maximize its information value. Missing values and outliers are two examples of noise that can occur [2].

2.1.2 Classification

A classifier maps between a feature space and a label space, where features represent characteristics of the elements to classify and labels represent classes. There are two main types of classifiers: supervised and unsupervised [2]. Supervised classifiers have predefined labels or categories that are known in advance, and there is a set of labeled examples which can be used as a training set. In unsupervised classification the task is to organize the elements at hand in a suitable way; there are no predefined labels. Here follow some examples of supervised classification.

k-Nearest Neighbour Classifier

Given a point to be classified, the nearest neighbour classifier finds the k closest points among the training records and assigns a class label according to those neighbours. The idea is that if a new point belongs to a neighbourhood where one class label is dominant, it is likely to belong to that class. One of the most challenging issues in the nearest neighbour classifier is how to choose the value of k: if k is small, the classifier will be very sensitive to noise, and if k is too big, it will consider points from other classes as well as those intended. The nearest neighbour classifier is considered a lazy learner and does not explicitly build a model. It leaves many decisions to the classification step, and classifying unknown records can therefore be very expensive. However, since it is very good at finding like-minded users and items, it is one of the most common approaches in recommender systems.
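A minimal nearest neighbour classifier might look like this. The 2-D points and labels are invented for illustration; this is a sketch, not the thesis's implementation:

```python
import math
from collections import Counter

def knn_classify(point, training, k):
    """Classify `point` by majority vote among its k nearest training records.

    `training` is a list of (features, label) pairs; distance is Euclidean.
    """
    by_distance = sorted(training, key=lambda rec: math.dist(point, rec[0]))
    votes = Counter(label for _, label in by_distance[:k])
    return votes.most_common(1)[0][0]

train = [((0, 0), "A"), ((0, 1), "A"), ((1, 0), "A"),
         ((5, 5), "B"), ((5, 6), "B"), ((6, 5), "B")]
print(knn_classify((1, 1), train, k=3))  # A: all three nearest points are class A
print(knn_classify((4, 4), train, k=3))  # B
```

Note the "lazy learner" property in the code: there is no training step at all, and every query sorts the whole training set, which is what makes classification expensive.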

Evaluating Classifiers

The most commonly used evaluation measures in recommender systems are the Mean Absolute Error (MAE) and the Root Mean Square Error (RMSE) [2, 13]. The mean absolute error is defined as the average difference between predicted ratings and actual ratings, given by:

$$\mathrm{MAE} = \frac{1}{|R_{test}|} \sum_{r \in R_{test}} |p - r| \qquad (2.3)$$

where $R_{test}$ is the set of data used for evaluating the learned model, and $p$ and $r$ are the predicted and actual rating respectively. The root mean square error also considers the size of the absolute error, according to:

$$\mathrm{RMSE} = \sqrt{\frac{1}{|R_{test}|} \sum_{r \in R_{test}} (p - r)^2} \qquad (2.4)$$

Some recommender systems do not try to predict user preference (such as a rating) [16], but rather try to recommend items the user may use. In this case the result can be classified according to table 2.1. The fraction of correct classifications then becomes:

$$\text{Correct classifications} = \frac{|tp| + |tn|}{|tp| + |fn| + |fp| + |tn|} \qquad (2.5)$$

where $|tp|$, $|tn|$, $|fp|$ and $|fn|$ are the numbers of the different classifications respectively.

To measure the recommendations that were actually suitable for the user, a precision value can be calculated:

$$\text{Precision} = \frac{|tp|}{|tp| + |fp|} \qquad (2.6)$$

By using a receiver operating characteristic curve, or ROC curve, one can see if the model can select samples of test instances that have a high proportion of positives [19]. The ROC curve plots the false positive rate of the samples on the x-axis against the true positive rate on the y-axis [16, 19]. The true positive rate and false positive rate are defined as follows:

$$\text{True Positive Rate} = \frac{|tp|}{|tp| + |fn|} \qquad (2.7)$$

$$\text{False Positive Rate} = \frac{|fp|}{|fp| + |tn|} \qquad (2.8)$$
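The evaluation measures of equations 2.3–2.8 are straightforward to compute. The prediction pairs and classification counts below are invented for illustration:

```python
import math

def mae(pairs):
    """Mean absolute error over (predicted, actual) pairs, eq. (2.3)."""
    return sum(abs(p - r) for p, r in pairs) / len(pairs)

def rmse(pairs):
    """Root mean square error, eq. (2.4); penalizes large errors more than MAE."""
    return math.sqrt(sum((p - r) ** 2 for p, r in pairs) / len(pairs))

def precision(tp, fp):
    """Fraction of recommended items that were actually used, eq. (2.6)."""
    return tp / (tp + fp)

def true_positive_rate(tp, fn):
    """Eq. (2.7): fraction of used items that were recommended."""
    return tp / (tp + fn)

def false_positive_rate(fp, tn):
    """Eq. (2.8): fraction of unused items that were recommended anyway."""
    return fp / (fp + tn)

pairs = [(0.8, 1.0), (0.3, 0.0), (0.6, 1.0)]
print(round(mae(pairs), 3))   # 0.3
print(round(rmse(pairs), 3))  # 0.311
print(precision(tp=8, fp=2))  # 0.8
```

Sweeping a decision threshold and plotting false_positive_rate against true_positive_rate at each setting yields the ROC curve described above.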

Rule-based Classifiers

             Recommended           Not Recommended
Used         True Positive (tp)    False Negative (fn)
Not Used     False Positive (fp)   True Negative (tn)

Table 2.1. Classification of the possible results of recommending an item to a user.

Rule-based classifiers classify data according to predefined rules [2]. Each rule has a condition which determines if the consequent of the rule is a positive or negative classification. The rules operate on the attributes of the data without transformation, which makes the method very expressive. Rule-based classification is also easy to interpret, easy to generate and can classify new instances efficiently. However, it is very hard to implement a recommender system based on rules only. The method requires some prior knowledge of the decision making, which makes it quite unpopular in recommender systems. It may improve performance when combined with specific domain knowledge or business rules, though.

2.1.3 Cluster Analyses

A possible solution for finding the k nearest neighbours is reducing dimensionality, but there might still be very many objects to compute the distance to. The solution is to cluster the objects, so that objects in the same group are more similar to each other than to objects in other groups [2]. This is unlikely to improve accuracy, but it improves efficiency. The two main categories of clustering are partitional clustering and hierarchical clustering. Partitional clustering divides the data objects into non-overlapping clusters, such that every object belongs to exactly one cluster. In hierarchical clustering, the clusters are organized as a hierarchical tree.
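Partitional clustering can be illustrated with a simple k-means sketch. k-means is one common partitional method, chosen here for illustration; the thesis does not name a specific algorithm, and the points are invented:

```python
import math

def k_means(points, k, iterations=20):
    """Partitional clustering: assign each point to its nearest centroid,
    move each centroid to the mean of its cluster, and repeat."""
    centroids = points[:k]  # naive initialization from the first k points
    clusters = []
    for _ in range(iterations):
        clusters = [[] for _ in range(k)]
        for p in points:
            nearest = min(range(k), key=lambda c: math.dist(p, centroids[c]))
            clusters[nearest].append(p)
        centroids = [
            tuple(sum(xs) / len(xs) for xs in zip(*cluster)) if cluster else centroids[c]
            for c, cluster in enumerate(clusters)
        ]
    return clusters

points = [(0, 0), (0, 1), (1, 0), (9, 9), (9, 8), (8, 9)]
clusters = k_means(points, k=2)
print(sorted(len(c) for c in clusters))  # [3, 3]
```

Once the clusters are fixed, a neighbour search only needs to compute distances within one cluster rather than over all users, which is the efficiency gain mentioned above.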

2.2 Recommender System Strategies

There are various types of recommender systems. The two main types are content-based and collaborative filtering [8, 9, 11, 15]. Content-based recommender systems are focused on the items in the system; the technique recommends items that are similar to those a user has liked in the past. The alternative, collaborative filtering, recommends what other similar users have liked in the past. This identifies user-item associations and can draw on a lot of data [11]. This is also the type of recommender system that is relevant for this work, mainly because there is a limited number of items (gift cards), but also since there are many different types of gift cards that are of interest to the same user. The reasoning is that you cannot tell if a person that redeems a gift card at a shoe retailer only wants gift cards for buying shoes. It is likely that the person would enjoy gift cards at retailers that sell other things as well, and therefore it is important to suggest different kinds of gift cards to the user. Here follows a short introduction to other types of recommender systems [15].

Demographic recommender systems: recommend items based on a demographic profile of the user.

Knowledge-based: relies on knowledge about how certain item features meet the user's needs and preferences to make recommendations.

Community-based: makes recommendations based on what the friends of the user prefer.

Hybrid: uses a mixture of different types of recommender systems.

2.3 Collaborative Filtering

There are two primary areas of collaborative filtering [11, 15]: neighbourhood-based methods and latent factor models. These are described in sections 2.4 and 2.5 respectively. This section describes some central concepts of collaborative filtering; the parameters used in the following sections are listed in figure 2.1.

2.3.1 The Cold Start Problem

The cold start problem is about having insufficient information about a product or a user [11, 13]. If a new or obscure item has been rated by very few or no users, it is very hard to know whom to recommend the item to [13].

There is also a problem with new and inactive users that have rated few items, or no items at all. If a user has rated no or few items, it is impossible to determine which types of products he or she prefers, without any other data.

The cold start problem occurs in collaborative filtering only [11]. In this respect a content-based recommender system is superior.

2.3.2 Explicit and Implicit Feedback

Recommender systems rely on input from the users to evaluate their recommendations. The most convenient type of input is explicit feedback, where the users explicitly state their interest in a product (like ratings) [7, 8, 9, 11]. When explicit feedback is not available, a more ambiguous method has to be employed: implicit feedback, which reflects the opinion of the users by observing their behavior. Examples of implicit feedback are purchase history, browsing history, search patterns and mouse movements. In Wrapp's case the redemption rate could be regarded as a type of implicit feedback, since the user behavior may be indeterminate. However, in this study the redemptions are regarded as explicit feedback, since there is no good way to determine the reason why a gift was redeemed or not redeemed.

2.3.3 The Netflix Prize Competition

In October 2006, Netflix released a dataset containing 100 million anonymous movie ratings and promised a grand prize to the team that improved the rating prediction of the movies by 10% [5]. As a consequence, data mining, machine learning and computer science communities all over the world took an interest in recommender systems. On September 21, 2009, a team was awarded $1M for winning the competition1. Consequently, much of the available literature has developed algorithms to optimize predictions on the Netflix data, and much of it has also used the Netflix data for evaluating the algorithms in its experiments.

2.4 Neighbourhood-based Recommendation Methods

Neighbourhood-based recommendation methods focus on relationships between items or users [9, 11, 15]. The problem is to estimate the response of a user to a new item, based on what other similar users have rated or on what similar items the user has rated. The types of responses can be categorized into scalar, binary and unary [6]. In this work, a user response from ranking gift cards could be said to be scalar: if, for example, a giving user selects the highest ranked gift card, the user strongly agrees with the ranking; if the user selects the second ranked gift card, the user quite strongly agrees with the ranking, and so on. The response to a received gift is instead binary: if the receiving user likes a gift, he or she redeems it; if not, the receiving user just lets the gift expire (false redemptions excluded). One can also say that a received gift is unary with respect to indeterminate behavior: you do not know for sure if the response was positive or negative.

2.4.1 Advantages of Neighbourhood-based Recommendation

The main advantage of a neighbourhood-based recommender system is that it is more likely to introduce a user to new items, since it recommends what the neighbourhood likes and not items similar to what the user likes [6, 9]. It is also relatively simple to implement, it is efficient, and it does not require any training phase. It is also stable and handles the constant addition of new users, items and ratings.

1 http://www.netflixprize.com/


$u, v$ – users
$i$ – an item
$r_{ui}$ – rating of item $i$ made by user $u$
$\hat{r}_{ui}$ – prediction of rating $r$ for item $i$ by user $u$
$\bar{r}_u$ – average rating made by user $u$
$N(u)$ – the $k$ nearest neighbours of user $u$
$N_i(u)$ – the $k$ nearest neighbours of user $u$ that have rated item $i$
$I_u$ – all items rated by user $u$
$I_{uv}$ – all items rated by both user $u$ and user $v$
$w_{uv}$ – the similarity between user $u$ and user $v$, also denoted $sim_{uv}$
$R_u$ – set of items liked by $u$ (only used with binary data)
$d_u$ – the demographic properties of $u$
$\alpha$ – a weight to determine influence when combining similarities

Figure 2.1. List of parameters used in this section.

2.4.2 User-based Recommendation

Neighbourhood-based methods can, as mentioned, be based on relationships between items or between users. Since the gift cards in Wrapp's service are very few and hard to compare to each other, only user-based methods are relevant. There are two types of neighbourhood-based methods: regression and classification [6]. Regression gives a result on a continuous scale, which in this case is superior to classification, which gives a result with only a few discrete values. A continuous scale is essential for producing a recommendation order (ranking). As a consequence, this work mainly uses user-based recommendation based on regression; classification is only used for evaluation.

User-based recommender methods predict the rating $r_{ui}$ of a user $u$ for a new item $i$ [4, 6]. This is done by finding the k nearest neighbours of $u$, here denoted $N(u)$: the $k$ users $v \neq u$ with the highest similarity $w_{uv}$ to $u$. Only users in the neighbourhood who already have rated item $i$, denoted $N_i(u)$, are used in the prediction. We can estimate the rating given to $i$ by:

$$\hat{r}_{ui} = \frac{1}{|N_i(u)|} \sum_{v \in N_i(u)} r_{vi} \qquad (2.9)$$

One problem with equation 2.9 is that the neighbours have different levels of similarity to $u$. The solution is to weight the contribution of each neighbour by its similarity weight $w_{uv}$. This adjustment, together with a normalization of the weights, gives [4, 6, 8]:

$$\hat{r}_{ui} = \frac{\sum_{v \in N_i(u)} w_{uv} r_{vi}}{\sum_{v \in N_i(u)} |w_{uv}|} \qquad (2.10)$$

Mean Centering

Ratings depend on the personal scale of each user: given a rating scale, some users tend to rate higher or lower than other users. In Wrapp's case, some users may redeem more or fewer gift cards than others. In user-based recommendation, a raw rating $r_{ui}$ is transformed to a mean-centered $h(r_{ui})$ by subtracting the average rating $\bar{r}_u$ of the ratings given by user $u$ to the items $I_u$ rated by $u$ [6]:

$$h(r_{ui}) = r_{ui} - \bar{r}_u \qquad (2.11)$$

Using this approach for equation 2.10, we can predict $r_{ui}$ as:

$$\hat{r}_{ui} = \bar{r}_u + \frac{\sum_{v \in N_i(u)} w_{uv} (r_{vi} - \bar{r}_v)}{\sum_{v \in N_i(u)} |w_{uv}|} \qquad (2.12)$$
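The mean-centered prediction of equation 2.12 can be sketched directly in code. The rating dictionaries, weights and card names below are invented; for Wrapp's binary data a "rating" of 1 could mean redeemed and 0 received but not redeemed:

```python
def mean_rating(ratings):
    """Average rating r̄_u over the items a user has rated."""
    return sum(ratings.values()) / len(ratings)

def predict_rating(u, i, ratings, weights, neighbours):
    """Predict r_ui with mean centering, following eq. (2.12).

    ratings:    {user: {item: rating}}
    weights:    {(u, v): similarity w_uv}
    neighbours: the neighbours of u that have rated i, i.e. N_i(u)
    """
    numer = sum(weights[(u, v)] * (ratings[v][i] - mean_rating(ratings[v]))
                for v in neighbours)
    denom = sum(abs(weights[(u, v)]) for v in neighbours)
    return mean_rating(ratings[u]) + numer / denom

ratings = {
    "u":  {"card1": 1, "card2": 0},
    "v1": {"card1": 1, "card2": 0, "card3": 1},
    "v2": {"card1": 0, "card2": 0, "card3": 0},
}
weights = {("u", "v1"): 0.9, ("u", "v2"): 0.4}
p = predict_rating("u", "card3", ratings, weights, ["v1", "v2"])
print(round(p, 3))  # 0.731 = 0.5 + 0.3/1.3
```

Note how the prediction starts from user u's own mean (0.5) and is nudged upward because the more similar neighbour v1 redeemed card3 more than v1's own average.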

2.4.3 Similarity Weight Computation

Similarity weights are used both to select trusted neighbours, whose ratings are used in the prediction, and to determine which neighbours should have more or less importance in the prediction. Two types of similarities between users are local similar and global similar neighbours [3]. Local similar neighbours are users who have rated at least one item in common. Similarity between user pairs that do not share any common items is called global similarity and is for example based on item similarities. Global similarity is often used to enhance sparse data where local similarity is hard to find. Since this work discards data with insufficient information, as described in section 1.4.1, only local similar neighbours are considered. This section describes some ways to calculate similarity weights.

Local-based Similarity

A measure of similarity between two users u and v, wuv, can be calculated by computing the Cosine Vector (CV) [3, 6], introduced in equation 2.1:


$$w^{\mathrm{cos}}_{uv} = \cos(\mathbf{r}_u, \mathbf{r}_v) = \frac{\sum_{i \in I_{uv}} r_{ui} \, r_{vi}}{\sqrt{\sum_{i \in I_u} r_{ui}^2 \sum_{j \in I_v} r_{vj}^2}} \qquad (2.13)$$

where I_uv denotes the items rated by both u and v, and r_ui and r_vi are the ratings of i by u and v respectively. Another popular measure, presented in equation 2.2, is the Pearson correlation [3, 6], which defines the similarity between users u and v as:

$$w^{\mathrm{pear}}_{uv} = \frac{\sum_{i \in I_{uv}} (r_{ui} - \bar{r}_u)(r_{vi} - \bar{r}_v)}{\sqrt{\sum_{i \in I_{uv}} (r_{ui} - \bar{r}_u)^2 \sum_{i \in I_{uv}} (r_{vi} - \bar{r}_v)^2}} \qquad (2.14)$$

where r̄_u is the mean rating of user u. When the preference information is binary, the Jaccard coefficient can instead be used to compute the similarity [3]. It is defined as:

$$w^{\mathrm{jacc}}_{uv} = \frac{|R_u \cap R_v|}{|R_u \cup R_v|} \qquad (2.15)$$

where R_u is the set of items liked by user u.
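The three similarity measures can be sketched as below. This is an illustrative version of equations 2.13–2.15, assuming ratings stored as item-to-rating dicts and liked-item sets; it is not the thesis implementation.

```python
from math import sqrt

def cosine_sim(ru, rv):
    """Eq. 2.13: cosine between two users' rating vectors (dicts item -> rating)."""
    common = set(ru) & set(rv)
    num = sum(ru[i] * rv[i] for i in common)
    den = sqrt(sum(r * r for r in ru.values())) * sqrt(sum(r * r for r in rv.values()))
    return num / den if den else 0.0

def pearson_sim(ru, rv):
    """Eq. 2.14: Pearson correlation over the co-rated items I_uv."""
    common = set(ru) & set(rv)
    if not common:
        return 0.0
    mu, mv = sum(ru.values()) / len(ru), sum(rv.values()) / len(rv)
    num = sum((ru[i] - mu) * (rv[i] - mv) for i in common)
    den = (sqrt(sum((ru[i] - mu) ** 2 for i in common))
           * sqrt(sum((rv[i] - mv) ** 2 for i in common)))
    return num / den if den else 0.0

def jaccard_sim(Ru, Rv):
    """Eq. 2.15: overlap of the liked-item sets, for binary preferences."""
    union = Ru | Rv
    return len(Ru & Rv) / len(union) if union else 0.0

j = jaccard_sim({"a", "b"}, {"b", "c"})   # one shared item of three -> 1/3
```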

Demographic Similarity

Demographic information can also be used to characterize users [12, 14, 17]. Vozalis et al. [17] compute the similarity weight as a combination of demographic data and rating similarity. The demographic properties of two users u and v are described by their demographic vectors d_u and d_v, and the demographic correlation is determined by the dot product of the vectors:

$$w^{\mathrm{demo}}_{uv} = \mathbf{d}_u^T \mathbf{d}_v \qquad (2.16)$$

The weights used in 2.4.2 can then be calculated by:

$$w^{\mathrm{demo1}}_{uv} = w^{\mathrm{rating}}_{uv} \cdot w^{\mathrm{demo}}_{uv} \qquad (2.17)$$

or



$$w^{\mathrm{demo2}}_{uv} = w^{\mathrm{rating}}_{uv} + w^{\mathrm{rating}}_{uv} \cdot w^{\mathrm{demo}}_{uv} \qquad (2.18)$$

Combining Similarities

A combined prediction from two different sets of neighbours can be calculated as a linear combination of the predictions of each set [3]. It is defined as:

$$\hat{r} = (1 - \alpha) \cdot \hat{r}_1 + \alpha \cdot \hat{r}_2 \qquad (2.19)$$

where r̂_1 and r̂_2 are the predictions of the two sets respectively, and α is the weight given to the prediction based on set 2.
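Equations 2.16–2.19 reduce to a few arithmetic operations, sketched below with illustrative inputs (the plain-list demographic vectors and all values are assumptions, not data from this work).

```python
# Sketch of the demographic-combination schemes in eqs. 2.16-2.19.

def demographic_sim(du, dv):
    """Eq. 2.16: dot product of two users' demographic vectors."""
    return sum(a * b for a, b in zip(du, dv))

def combined_weight(w_rating, w_demo, variant=1):
    """Eq. 2.17 (variant 1) or eq. 2.18 (variant 2)."""
    if variant == 1:
        return w_rating * w_demo
    return w_rating + w_rating * w_demo

def blend_predictions(r1, r2, alpha):
    """Eq. 2.19: linear combination of two predictions."""
    return (1 - alpha) * r1 + alpha * r2

w = combined_weight(0.5, 0.8, variant=2)   # 0.5 + 0.5 * 0.8 = 0.9
```

Variant 2 keeps the rating similarity as a base term, so a demographic correlation of zero does not annihilate the weight the way variant 1 does.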

2.4.4 Selection of the Neighbourhood

The selection of neighbours is normally made in two steps: first a global filtering step, where only the most likely candidates are kept, and then a prediction step, which chooses the best candidates for the prediction. In a large system it is not feasible to store all (non-zero) similarities between each pair of users and items in the pre-filtering step, so a reduction must be made. There are several ways to reduce the number of similarity weights to store:

Top-N filtering: for each user, only the N nearest neighbours and their similarity weights are kept. The hard part is to choose an adequate value of N.

Threshold filtering: instead of keeping a fixed number of nearest neighbours, every neighbour with a similarity weight greater than a threshold is kept.
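The two filtering strategies can be sketched in a few lines; the dictionary representation of the similarity weights is an illustrative assumption.

```python
# Sketch of the two pre-filtering strategies for similarity weights.

def top_n_filter(weights, n):
    """Top-N filtering: keep the N neighbours with the largest weights."""
    return dict(sorted(weights.items(), key=lambda kv: kv[1], reverse=True)[:n])

def threshold_filter(weights, threshold):
    """Threshold filtering: keep every neighbour above the threshold."""
    return {v: w for v, w in weights.items() if w > threshold}

weights = {"v1": 0.9, "v2": 0.1, "v3": 0.6}
kept_n = top_n_filter(weights, 2)
kept_t = threshold_filter(weights, 0.5)
```

Note that the two strategies trade off differently: top-N bounds storage per user but may keep weak neighbours, while a threshold guarantees a minimum similarity but gives neighbourhoods of varying size.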

2.4.5 Limited Coverage and Sparse Data

When only similarities between users that have rated the same items are considered, the coverage gets very limited. Problems also arise when few users rate items, or when new users have not rated any items at all (the cold-start problem). To alleviate this, singular value decomposition of the rating matrix may be used.


2.5 Latent Factor Models

Latent factor models are presented here because the original plan was to implement both a neighbour-based recommender model and a latent factor model. The neighbour-based model was implemented first, since it is better suited to the user-to-user relations this work is based on. Latent factor models also have some computational issues, described further in this section. As there was not enough time to implement both, the latent factor model was omitted; this section is kept for the interest of the reader.

Latent factor models aim to uncover latent features from both user and item factors inferred from rating patterns [8, 9, 10, 11]. Some of the most successful models are based on matrix factorization, which characterizes items and users by vectors of factors of rating patterns; high correspondence between the vectors leads to a recommendation. Since these methods have both very good scalability and accuracy, they have become very popular in recent years. This section describes user-user factorization methods. Figure 2.2 below lists all new parameters used in this section.

m - the number of users
n - the number of items
µ - the average rating of all items
b_i - observed deviation of item i
b_u - observed deviation of user u
b_ui - the baseline prediction of an unknown rating
K - all pairs of users u and items i where r_ui is known
λ - regularization term to avoid overfitting
p_u - property vector of user u
z_u - additional property vector of user u
γ - learning rate
e_ui - prediction error of item i for user u
R(u) - items rated by user u
R(i) - users that rated item i
R^k(i) - the k nearest neighbours that have rated item i

Figure 2.2. Parameters used in this section, see figure 2.1 for previously used parameters

2.5.1 Baseline Predictors

Many of the observed ratings in a recommender system are affected by users that tend to give higher or lower ratings than other users, or by items that tend to receive



higher or lower ratings than other items. To make the ratings in a system comparable to each other, one tries to encapsulate these effects in baseline predictors [9, 10].

A baseline prediction for an unknown rating r_ui is denoted b_ui and accounts for the user and item effects:

$$b_{ui} = \mu + b_u + b_i \qquad (2.20)$$

where the parameters b_u and b_i indicate the observed average deviations of user u and item i respectively, and µ is the average rating of all items. To estimate b_u and b_i, the following least squares problem is solved:

$$\min_{b} \sum_{(u,i) \in K} (r_{ui} - \mu - b_u - b_i)^2 + \lambda_1 \Big( \sum_u b_u^2 + \sum_i b_i^2 \Big) \qquad (2.21)$$

over the set K = {(u, i) | r_ui is known}. The term $\lambda_1 (\sum_u b_u^2 + \sum_i b_i^2)$ avoids overfitting.
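Rather than solving the full least squares problem of eq. 2.21, a common decoupled approximation estimates the item deviations first and then the user deviations given the items. The sketch below uses that approximation; the shrinkage constants and the toy data are assumptions, not values from this work.

```python
# Decoupled approximation of the baseline predictors in eqs. 2.20-2.21:
# estimate b_i from (r - mu), shrunk toward zero, then b_u given b_i.

def baseline_predictors(ratings, lam_i=5.0, lam_u=5.0):
    """ratings: list of (user, item, r_ui). Returns (mu, b_u, b_i)."""
    mu = sum(r for _, _, r in ratings) / len(ratings)
    s_i, n_i = {}, {}
    for _, i, r in ratings:
        s_i[i] = s_i.get(i, 0.0) + (r - mu)
        n_i[i] = n_i.get(i, 0) + 1
    b_i = {i: s / (lam_i + n_i[i]) for i, s in s_i.items()}
    s_u, n_u = {}, {}
    for u, i, r in ratings:
        s_u[u] = s_u.get(u, 0.0) + (r - mu - b_i[i])
        n_u[u] = n_u.get(u, 0) + 1
    b_u = {u: s / (lam_u + n_u[u]) for u, s in s_u.items()}
    return mu, b_u, b_i

data = [("u1", "i1", 1.0), ("u1", "i2", 0.0), ("u2", "i1", 1.0)]
mu, b_u, b_i = baseline_predictors(data)
pred = mu + b_u["u2"] + b_i["i1"]   # the baseline prediction b_ui of eq. 2.20
```

The shrinkage denominators (λ + count) play the role of the regularization term in eq. 2.21: deviations estimated from few ratings are pulled toward zero.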

2.5.2 A User-User Neighbourhood Model

Most matrix factorization models map users and items to a joint latent factor space of dimensionality f [10], where user-item interactions are usually modeled as inner products in that space. In this work, however, only user-user interactions are considered. Each user u is associated with a vector p_u ∈ R^f, and each related user v is associated with a vector z_v ∈ R^f. The dot product p_u^T z_v captures the interaction between them.

A user-user neighbourhood model predicts ratings based on the ratings that similar users have given the items. The model is derived and presented as follows [10]:

$$\hat{r}_{ui} = \mu + b_u + b_i + |R(i)|^{-\frac{1}{2}} \sum_{v \in R(i)} (r_{vi} - b_{vi}) \, w_{uv} \qquad (2.22)$$

The set R(i) contains all users that have rated item i, b_vi is the (constant) baseline prediction of equation 2.20, and w_uv is the similarity weight between users u and v. This model enables predicting the rating of an item i for a user u without requiring u to have rated any items similar to i. The major disadvantage is that it is very computationally intensive: it is impractical to store all user-user relations, which requires O(m²) space, where m is the number of users. The time complexity is also much higher than for a corresponding item-item model, since there are usually many more users than items; it is O(Σ_i |R(i)|²).

To avoid these computational issues, the model is rewritten to use the factorization


model. To do this, the user-user relation is factorized as w_uv = p_u^T z_v. A factorized version of equation 2.22 can then be written as:

$$\hat{r}_{ui} = \mu + b_u + b_i + |R(i)|^{-\frac{1}{2}} \sum_{v \in R(i)} (r_{vi} - b_{vi}) \, p_u^T z_v \qquad (2.23)$$

To make this computationally efficient, the terms that depend on i but are independent of u are collected in a separate sum. The prediction rule then takes the following form:

$$\hat{r}_{ui} = \mu + b_u + b_i + p_u^T \, |R(i)|^{-\frac{1}{2}} \sum_{v \in R(i)} (r_{vi} - b_{vi}) \, z_v \qquad (2.24)$$

The parameters are learned by minimizing the associated regularized squared error function:

$$\min_{p, z, b} \sum_{(u,i) \in K} \Big( r_{ui} - \mu - b_u - b_i - p_u^T \, |R(i)|^{-\frac{1}{2}} \sum_{v \in R(i)} (r_{vi} - b_{vi}) \, z_v \Big)^2 + \lambda_{11} \Big( b_u^2 + b_i^2 + \|p_u\|^2 + \sum_{v \in R(i)} \|z_v\|^2 \Big) \qquad (2.25)$$

The term λ_11 is usually determined by cross validation, and one way to solve the minimization is stochastic gradient descent. The algorithm loops through the data and, for every given rating r_ui, makes a prediction r̂_ui. For every training case a prediction error e_ui = r_ui − r̂_ui is calculated, and the parameters are updated in the direction opposite to the gradient, according to figure 2.3.

As for the corresponding item-item neighbourhood models, a local neighbourhood approach can be developed [9, 10]. Applying a local neighbourhood to equation 2.24, the model can be rewritten as:

$$\hat{r}_{ui} = \mu + b_u + b_i + p_u^T \, |R^k(i)|^{-\frac{1}{2}} \sum_{v \in R^k(i)} (r_{vi} - b_{vi}) \, z_v \qquad (2.26)$$

Here R^k(i) denotes the users among the k nearest neighbours that have rated i.



function LearnFactorizedNeighbourhoodModel
    % For each user compute p_u, z_u ∈ R^f,
    % which form a neighbourhood model
    % Const parameters: iterations, γ, λ
    % Gradient descent sweeps:
    for count = 1, ..., iterations do
        for i = 1, ..., n do
            % Compute the component independent of u:
            t ← |R(i)|^(-1/2) · Σ_{v ∈ R(i)} (r_vi − b_vi) · z_v
            sum ← 0
            for all u ∈ R(i) do
                r̂_ui ← µ + b_u + b_i + p_u^T · t
                e_ui ← r_ui − r̂_ui
                % Accumulate information for gradient steps on z_v:
                sum ← sum + e_ui · p_u
                % Perform gradient steps on p_u, b_u, b_i:
                p_u ← p_u + γ · (e_ui · t − λ · p_u)
                b_u ← b_u + γ · (e_ui − λ · b_u)
                b_i ← b_i + γ · (e_ui − λ · b_i)
            end for
            % Perform gradient steps on z_v:
            for all v ∈ R(i) do
                z_v ← z_v + γ · (|R(i)|^(-1/2) · (r_vi − b_vi) · sum − λ · z_v)
            end for
        end for
    end for
    return {p_u, z_u | u = 1, ..., m}
end function

Figure 2.3. Shows how the parameters are updated in user-user factor model.
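An interpretive Python sketch of the gradient sweeps in figure 2.3 follows. The random initialization, the values of the learning rate gamma and regularization lam, and the toy data are all illustrative assumptions, and the constant baseline b_vi of eq. 2.24 is approximated here by the current value of µ + b_v + b_i.

```python
import random

def train(ratings, n_factors=2, iterations=50, gamma=0.05, lam=0.02):
    """SGD sweeps for the factorized user-user model of eq. 2.24."""
    users = {u for u, _, _ in ratings}
    mu = sum(r for _, _, r in ratings) / len(ratings)
    rng = random.Random(0)
    p = {u: [rng.uniform(-0.1, 0.1) for _ in range(n_factors)] for u in users}
    z = {u: [rng.uniform(-0.1, 0.1) for _ in range(n_factors)] for u in users}
    b_u = {u: 0.0 for u in users}
    b_i, by_item = {}, {}
    for u, i, r in ratings:
        by_item.setdefault(i, []).append((u, r))
        b_i.setdefault(i, 0.0)
    for _ in range(iterations):
        for i, raters in by_item.items():
            norm = len(raters) ** -0.5
            # component of eq. 2.24 that is independent of the target user u
            t = [norm * sum((r - (mu + b_u[v] + b_i[i])) * z[v][f]
                            for v, r in raters) for f in range(n_factors)]
            acc = [0.0] * n_factors  # accumulates e_ui * p_u for the z_v steps
            for u, r in raters:
                pred = mu + b_u[u] + b_i[i] + sum(a * b for a, b in zip(p[u], t))
                e = r - pred
                acc = [a + e * pf for a, pf in zip(acc, p[u])]
                b_u[u] += gamma * (e - lam * b_u[u])
                b_i[i] += gamma * (e - lam * b_i[i])
                p[u] = [pf + gamma * (e * tf - lam * pf) for pf, tf in zip(p[u], t)]
            for v, r in raters:
                coef = norm * (r - (mu + b_u[v] + b_i[i]))
                z[v] = [zf + gamma * (coef * af - lam * zf)
                        for zf, af in zip(z[v], acc)]
    return mu, b_u, b_i

ratings = [("u1", "i1", 1.0), ("u2", "i1", 1.0), ("u2", "i2", 0.0), ("u3", "i2", 0.0)]
mu, b_u, b_i = train(ratings)
```

On this toy data, the item deviation of the always-redeemed campaign i1 should end up above that of the never-redeemed i2, since their errors push the biases in opposite directions.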


Related Work

This chapter presents some of the previous work on the topic.

Sarwar et al. (2002) [1] address the performance issues of k-nearest-neighbour collaborative filtering by using a clustering technique. The clustering provided comparable prediction quality and improved the time performance significantly.

Anand et al. (2010) enhanced sparse neighbourhood data by combining local similarity with global similarity measures [3]. A variable α was determined by estimating the sparsity and then used to combine the similarities in a linear combination. The experiments that estimated α from the sparsity outperformed those where it was kept constant.

In 2004, Vozalis et al. [17] explored whether existing collaborative filtering algorithms could be enhanced with demographic data. The experiments were executed on MovieLens data and concluded that the demographic data in some cases lead to more accurate predictions. In 2007, Vozalis et al. continued the research by enhancing a latent factor model with demographic data [18]. They claim that the approach had the potential to tackle problems such as scalability and sparsity, but it did not yield notably lower error values.


Chapter 4

Methods

Wrapp is a growing start-up company, and the data used in this study are therefore quite sparse. Many new users have only received their first gift card and have not yet had time to redeem it. Wrapp does, however, have additional demographic data. By enhancing the redemption data with demographic data, it is possible to estimate the redemption probability even for a new user.

To predict the redemption probability for a user and a certain gift card, two methods are evaluated. Since the time constraints did not allow implementing a latent factor model, both algorithms are based on the neighbour recommender methods presented in section 2.4. The first is based on the neighbourhood recommender method in equation 2.10, and the second uses mean centering as in equation 2.12, both in section 2.4.2. These algorithms are called the neighbour algorithm and the mean centering algorithm respectively. Both extend the user similarity calculations with demographic data. This chapter describes the preprocessing of the data and how the two methods are adapted to Wrapp's data in more detail.

4.1 Preprocessing

4.1.1 The Data and Restriction

This work is restricted (see section 1.4.1) to data that hold all necessary information. Users missing demographic information such as gender, location or age are therefore omitted. The work is also restricted to sponsored gift cards. As a first approach, only users who have logged in at least once are considered. To be of interest in this study, these users should have received at least one gift card and be located in Sweden. With all these restrictions, approximately 26 000 of the original 88 000 users can be used. All of these users have received a gift card whose expiration date preceded the collection of the data. Together, these 26 000 users have received approximately 100 000 sponsored gift cards.

4.1.2 Similarity Measures

Neighbourhood-based recommender methods are based on the k nearest neighbours N_i(u) of a user u that have received a campaign i. The neighbour similarity, used to determine which users are the nearest, is calculated in the same way for both algorithms, by combining demographic and rating similarity. The following sections describe how the demographic and rating similarities are calculated and then combined.

4.1.3 Rating Similarity

To measure rating similarity, the Jaccard coefficient described in equation 2.15 in section 2.4.3 is used, since it is adapted to binary data. A user that has redeemed a campaign is defined to have liked it. The rating similarity is a value between 0 and 1.

4.1.4 Demographic Similarity

Instead of using the dot product of the demographic properties as in equation 2.16 in section 2.4.3, the demographic properties are defined independently in this work. This enables different properties to be scaled or weighted. The final demographic similarity weight is calculated by summing all demographic properties.

The weight is then normalized to a value between 0 and 1, as for the rating similarity. It is defined as follows:

$$w^{\mathrm{demo}}_{uv} = \frac{\sum_{d^j_{uv} \in D} d^j_{uv}}{|D|} \qquad (4.1)$$

where D is the set of all demographic properties and d^j_uv is the similarity between users u and v for a demographic property j in D. The demographic properties are defined as follows:



$$d_{\mathrm{gender}} = \begin{cases} 1 & \text{if same} \\ 0 & \text{otherwise} \end{cases} \qquad d_{\mathrm{region}} = \begin{cases} 1 & \text{if same} \\ 0 & \text{otherwise} \end{cases} \qquad d_{\mathrm{city}} = \begin{cases} 1 & \text{if same} \\ 0 & \text{otherwise} \end{cases}$$

$$d_{\mathrm{age}} = \begin{cases} 1 - 0.2 \cdot \text{age\_difference} & \text{if age\_difference} \le 5 \\ 0 & \text{if age\_difference} > 5 \end{cases} \qquad (4.2)$$

Combining Similarity

To combine the rating and demographic similarity, a linear combination as in equation 2.19 in section 2.4.3 is used. The similarity weight w_uv between users u and v becomes:

$$w_{uv} = \alpha \cdot w^{\mathrm{rating}}_{uv} + (1 - \alpha) \cdot w^{\mathrm{demo}}_{uv} \qquad (4.3)$$

where α is a value between 0 and 1. This makes it possible to control how much influence the rating and the demographic similarity each have.

It also keeps the final similarity weight w_uv between 0 and 1. The k nearest neighbours of a user u are determined by calculating the similarity to all other users and selecting the k users with the largest similarity weights. If several users have the same similarity weight and compete for the last (or all) neighbour spots, the neighbours among them are selected at random.
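The similarity used in this work (equations 4.1–4.3) can be sketched as follows. The user records are illustrative dicts, not Wrapp's data model, and the field names are assumptions.

```python
# Sketch of eqs. 4.1-4.3: demographic properties scored as in eq. 4.2,
# averaged over |D| properties, and blended with the rating similarity.

def demo_sim(u, v):
    """Eq. 4.1 with the four properties of eq. 4.2 (gender, region, city, age)."""
    d = [
        1.0 if u["gender"] == v["gender"] else 0.0,
        1.0 if u["region"] == v["region"] else 0.0,
        1.0 if u["city"] == v["city"] else 0.0,
    ]
    diff = abs(u["age"] - v["age"])
    d.append(1.0 - 0.2 * diff if diff <= 5 else 0.0)
    return sum(d) / len(d)

def combined_sim(u, v, w_rating, alpha):
    """Eq. 4.3: linear blend of rating and demographic similarity."""
    return alpha * w_rating + (1 - alpha) * demo_sim(u, v)

u = {"gender": "f", "region": "Stockholm", "city": "Stockholm", "age": 25}
v = {"gender": "f", "region": "Stockholm", "city": "Solna", "age": 27}
s = demo_sim(u, v)   # (1 + 1 + 0 + 0.6) / 4 = 0.65
```

Since every property lies in [0, 1] and the sum is divided by |D|, the blended weight of eq. 4.3 also stays in [0, 1] for any α in [0, 1].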

4.2 The Algorithms

4.2.1 Similarity

Both the neighbour algorithm and the mean centering algorithm use the same similarity weights, calculated for the k nearest neighbours as described in section 4.1.


4.2.2 Rating

A user v that redeems a gift card of a campaign i gives a gift card rating r_vi of 1. If the user lets the gift card expire, the rating r_vi is 0. A user may receive, and let expire, gift cards from the same campaign several times; this still counts as a rating of 0. If a gift card is redeemed after one or several expirations, the rating is instead 1.
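The rating rule reduces to: redeemed at least once gives 1, otherwise 0. A minimal sketch, with illustrative event strings:

```python
def campaign_rating(events):
    """events: outcomes ('redeemed' or 'expired') for one user-campaign pair."""
    return 1 if "redeemed" in events else 0

r = campaign_rating(["expired", "expired", "redeemed"])  # rated 1 despite earlier expirations
```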

4.2.3 Algorithm 1

The neighbour algorithm predicts the redemption probability value r̂_ui for a receiving user u and a campaign i according to equation 2.10 in section 2.4.2, using the similarity weights of section 4.2.1 and the neighbourhood ratings of section 4.2.2. The redemption probability value spans from 0 to 1.

4.2.4 Algorithm 2

Like the neighbour algorithm, the mean centering algorithm predicts the redemption probability value r̂_ui for a receiving user u and a campaign i, using the similarity weights and ratings in the neighbourhood. The difference is that the mean centering algorithm also uses the mean ratings of both the receiving user and the users in the neighbourhood, calculated from the rating definition in section 4.2.2. The predicted redemption value is then calculated by equation 2.12 in section 2.4.2, which gives a value in the interval −1 to 2.

4.3 Experiments

The experiments evaluate how many neighbours should be used in the neighbourhood, which value of α to use, and which demographic properties are the most important. The experiments are performed with both algorithms.

4.3.1 Evaluation

The algorithms are evaluated using cross validation. The approximately 100 000 received gift cards are randomly divided into 80% training data and 20% evaluation data. The training data are then used to predict the redemption probability value for each user-campaign pair in the evaluation data. If the training data do not



               Classified as redeemed    Classified as not redeemed
Redeemed       True Positive (tp)        False Negative (fn)
Not Redeemed   False Positive (fp)       True Negative (tn)

Table 4.1. A table describing the classification

have enough neighbours to predict a user-campaign pair, that prediction is discarded. A prediction with a value higher than the threshold is classified as redeemed, and a prediction with a lower value is classified as not redeemed. The classified predictions are then compared to the real ratings in the evaluation data. This is repeated ten times for each experiment, so that approximately 200 000 predictions are made.

The results will be based on table 4.1 and the concepts presented in equations 2.5, 2.6, 2.7 and 2.8.
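The thresholding and the classification counts of table 4.1 can be sketched as below; the threshold and the toy predictions are illustrative values, and accuracy stands in for the share of correct classifications reported in the results.

```python
# Sketch of the evaluation step: threshold each prediction and tally
# the confusion matrix of table 4.1.

def evaluate(predictions, actuals, threshold):
    tp = fp = tn = fn = 0
    for pred, actual in zip(predictions, actuals):
        classified = pred > threshold
        if classified and actual == 1:
            tp += 1
        elif classified and actual == 0:
            fp += 1
        elif not classified and actual == 0:
            tn += 1
        else:
            fn += 1
    accuracy = (tp + tn) / (tp + fp + tn + fn)
    return tp, fp, tn, fn, accuracy

preds = [0.9, 0.2, 0.7, 0.4]
actuals = [1, 0, 0, 1]
tp, fp, tn, fn, acc = evaluate(preds, actuals, 0.5)   # tp=1, fp=1, tn=1, fn=1
```

Sweeping the threshold and keeping the one with the highest accuracy corresponds to how the results per similarity weight are reported in chapter 5.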


Results

All experiments in this chapter are performed on the same ten randomly generated partitions of training and evaluation data. The user-campaign pairs in the evaluation data give a maximum of 204 000 predictions, if none are discarded due to too few neighbours. This chapter presents results based on different demographic similarity weights, different values of α and different neighbourhood sizes. Some comparisons between the algorithms are also presented.

5.1 Demographic Properties

The results in this section are determined from the ratings of the neighbours with the largest similarity weights, based on demographic properties only. All results based on more than 25 neighbours were almost identical, and therefore 75 neighbours were chosen for presentation; see section 5.3 for more details. During the calculation of the predictions, each user-campaign pair that did not have 75 neighbours was discarded. This means that the results can be based on different numbers of predictions, depending on the similarity weight calculation. The results are presented as the percentage of correctly classified predictions for different similarity weights. The result for each similarity weight is taken at the threshold that gave the highest percentage of correct classifications.

5.1.1 Similarity Weights Based on One Property

To test the influence of the different demographic properties, the predictions were calculated from similarity weights based on one selected demographic property.

This was repeated for all the demographic properties. Due to discarding predictions with too few neighbours, the location, age and gender results are based on 128 000, 188 000 and 198 000 predictions respectively, for both algorithms.

The results of the neighbour algorithm are shown in table 5.1 as 'Based on the property only'. Compared to table 5.2, which shows the results for randomly selected neighbours, there is little or no difference. The percentage of correctly classified predictions based on location is slightly lower than the other results, but it is also based on the fewest predictions.

The results of the mean centering algorithm are presented in table 5.3 as 'Based on the property only'. The predictions based on gender and age are equivalent to the results for the randomly selected neighbours in table 5.4, while the similarity weight based on location gives a slightly better result.

The results of the randomly selected neighbours are based on 202 000 predictions.

5.1.2 Similarity Weights Based on Different Weighted Properties

Results from similarity weights based on the original, equally weighted demographic properties (equation 4.2) are shown in table 5.2, together with the randomly selected neighbours. This result is based on 201 000 predictions, and shows a small improvement over the randomly selected neighbours. A final test was made by taking the equally weighted demographic properties and re-weighting one of the properties according to:

$$d_{\mathrm{gender}} = \begin{cases} 2 & \text{if same} \\ 0 & \text{otherwise} \end{cases} \qquad d_{\mathrm{location}} = \begin{cases} 4 & \text{if same region and city} \\ 2 & \text{if same region} \\ 0 & \text{otherwise} \end{cases}$$

$$d_{\mathrm{age}} = \begin{cases} 2 - 0.4 \cdot \text{age\_difference} & \text{if age\_difference} \le 5 \\ 0 & \text{if age\_difference} > 5 \end{cases} \qquad (5.1)$$

The similarity weight was still calculated according to equation 4.1, but the number of demographic properties |D| was increased by one if the age or gender was weighted and by two if the location was weighted, to normalize the similarity weight between 0 and 1.

The results from the weighted demographic properties are presented in tables 5.1 and 5.3 as 'Weighted property' for the two algorithms respectively. All results are based on 201 000 predictions. The results of both algorithms are equivalent to the equally


Neighbour Recommender Algorithm

Correct classifications (%)     Location   Age     Gender
Based on the property only      73.42      73.80   73.81
Weighted property               74.08      74.07   74.02

Table 5.1. The influence of different demographic properties in the neighbour recommender algorithm. The first row shows the percentage of correctly classified predictions with the similarity weight based on each demographic property alone. The second row shows the percentage when that demographic property was given double weight.

Neighbour Recommender Algorithm

                                          Correct classifications (%)
Randomly selected neighbours              73.95
Equally weighted demographic properties   74.07

Table 5.2. A table showing the percentage of correctly classified predictions for randomly selected neighbours (all with the same weight) and the original similarity weights based on equally weighted demographic properties.

Neighbour Recommender Algorithm with Mean Centering

Correct classifications (%)     Location   Age     Gender
Based on the property only      91.97      91.83   91.83
Weighted property               92.04      92.02   91.98

Table 5.3. The influence of different demographic properties in the neighbour recommender algorithm with mean centering. The first row shows the percentage of correctly classified predictions with the similarity weight based on each demographic property alone. The second row shows the percentage when that demographic property was given double weight.

Neighbour Recommender Algorithm with Mean Centering

                                          Correct classifications (%)
Randomly selected neighbours              91.83
Equally weighted demographic properties   92.01

Table 5.4. A table showing the percentage of correctly classified predictions for randomly selected neighbours (all with the same weight) and the original similarity weights based on equally weighted demographic properties.
