
Recommending new items to customers

A COMPARISON BETWEEN COLLABORATIVE FILTERING AND ASSOCIATION RULE MINING

HENRIK SOHLBERG

KTH ROYAL INSTITUTE OF TECHNOLOGY

(2)

Recommending new items to customers - a comparison between Collaborative Filtering and Association Rule Mining

HENRIK SOHLBERG

hsoh@kth.se

Master’s Thesis at NADA
Supervisor: Jens Lagergren
External Supervisor: Markus Nilsson

Examiner: Anders Lansner

June 2015


Abstract

E-commerce is an ever-growing industry as the internet infrastructure continues to evolve. A recommendation system offers several benefits to any online retail store: it can help customers find what they need and increase sales by enabling accurately targeted promotions. Among the many techniques that can form recommendation systems, this thesis compares Collaborative Filtering against Association Rule Mining, both implemented in combination with clustering. The suggested implementations are designed with the cold start problem in mind and are evaluated on a data set from an online retail store which sells clothing. The results indicate that Collaborative Filtering is the preferable technique, while association rules may still offer business value to stakeholders. However, the strength of the results is undermined by the fact that only a single data set was used.

Referat

Recommending new products to customers - a comparative study between Collaborative Filtering and Association Rule Mining

E-commerce is a growing market as the Internet evolves and its number of users steadily increases. There are several benefits that online stores can draw from recommendation systems. While such a system can help customers find what they are looking for, it can also provide a basis for targeted campaigns, something that can increase sales. There are many different techniques on which recommendation systems can be built. This thesis focuses on the two techniques Collaborative Filtering and Association Rule Mining and compares them against each other. Both methods were combined with clustering and designed to remedy the cold start problem. The two proposed implementations were then tested on a real data set from an online store with clothing in its assortment. The results suggest that Collaborative Filtering is the superior technique, while there is still value in association rules. Drawing general conclusions is, however, made difficult by the fact that only one data set was used.

(5)

Contents

1 Background
   1.1 Problem Statement
       1.1.1 Goal
       1.1.2 Thesis Delimitations

2 Theory
   2.1 Recommendation Systems
       2.1.1 Techniques
       2.1.2 Challenges
       2.1.3 Types of Data
       2.1.4 Existing Solutions
   2.2 Clustering
       2.2.1 Techniques
       2.2.2 Existing Research in the Field of Recommendation Systems
   2.3 Association Rule Mining
       2.3.1 The Apriori Principle
       2.3.2 Existing Research in the Field of Recommendation Systems
   2.4 Collaborative Filtering
       2.4.1 Techniques
   2.5 The Cold Start Problem
   2.6 Classification
       2.6.1 Evaluating Binary Classification Models

3 Methodology
   3.1 Data Set
       3.1.1 Tiers
       3.1.2 Training and Test Set
   3.2 Methods
       3.2.1 Preprocessing Phase
       3.2.2 Association Rule Mining
       3.2.3 Collaborative Filtering
   3.3 Results

4 Results
   4.1 Association Rule Mining
       4.1.1 ROC-plot
       4.1.2 Accuracy
       4.1.3 Tables
       4.1.4 Coverage
   4.2 Collaborative Filtering
       4.2.1 ROC-plot
       4.2.2 Accuracy
       4.2.3 Tables
       4.2.4 Coverage

5 Discussion
   5.1 General
   5.2 Evaluated Methods
       5.2.1 Association Rule Mining
       5.2.2 Collaborative Filtering

6 Conclusions

7 Future work
   7.1 Collaborative Filtering
   7.2 Association Rule Mining
   7.3 Clustering

Bibliography


Chapter 1

Background

1.1 Problem Statement

The Internet is an ever-growing source where users can find information, read news, watch videos, buy articles, and more. The e-commerce industry, with Amazon and eBay at the front, keeps expanding, and for some customers it is today more common to buy articles online than in a physical store. From a customer's perspective there are several benefits of the online platform compared to traditional stores: finding out what a retail store has to offer and comparing prices without leaving home used to be a far more time-consuming process. From a seller's perspective, the advantages are also numerous. A company selling products can reach any customer who has access to an internet device, and for new actors the threshold for entering the market has been lowered substantially, as publishing a web site is far cheaper than opening a physical store.

The greatness of this information channel comes at a price, and there are still many challenges that online stores need to overcome. It is certainly harder for customers to find what they want (or need) when there are hundreds of thousands of different products in the store. This problem is not unique to e-commerce systems; it affects any system that offers a very large amount of content or items to its users (e.g. YouTube). The lack of the customized service that shop assistants provide adds to the difficulty of finding products in an online store.

A recommendation system is any system that seeks to provide recommendations to users in an online environment. There are numerous different techniques and implementations with varying performance depending on the context.

1.1.1 Goal

This thesis is focused on recommendation systems and their different techniques, with an in-depth comparison between two popular alternatives, namely Association Rule Mining and Collaborative Filtering. Both techniques suffer from a well-known issue called the cold start problem. The ultimate goal of this thesis is to find out which of the two techniques is preferable with the cold start problem in mind.

1.1.2 Thesis Delimitations

The recommendation system field is very wide and involves topics such as machine learning and data mining. The number of methods and theories is overwhelming, and the scope of this thesis is delimited to the two previously mentioned techniques.

There are many aspects of the performance of a recommendation system, e.g. accuracy, scalability, memory, and time consumption. The primary focus of this thesis is accuracy, although other aspects are briefly discussed.


Chapter 2

Theory

2.1 Recommendation Systems

Recommendation systems are systems that seek to provide accurate recommendations to a user [10]. A recommendation may be a single item or a top-N list of items that are predicted to be of interest to a user. Recommendation systems are commonly used in e-commerce systems, such as Amazon, but can be found in many other applications that have users and resources, e.g. YouTube [4]. There are several reasons why a recommendation system is valuable to the stakeholders of such a system. For an end-user, e.g. a customer in an e-commerce environment, it can act as a guide, helping the user find interesting information in an (oftentimes) overwhelmingly large data set. As e-commerce has evolved, it has created a greater need to help customers find products in a store [16]. An ordinary physical superstore may have tens of thousands of books while an online book store may have millions. A concrete example is Amazon.com, an online store which in December 2013 had over 230 million products [7]. YouTube in 2010 is another example: users uploaded more than 24 hours of video per minute every day, and easily finding interesting videos in that ever-growing video set is most likely difficult and time consuming. Apart from end-users and their gains, stakeholders with business in mind may also draw benefits from a recommendation system. Higher sales, greater customer loyalty, and benefits from targeted promotions are all possible outcomes [21].

2.1.1 Techniques

A recommendation system can be built using many different techniques and concepts within machine learning, and the techniques can be combined. The algorithms can be divided into three groups: memory-based, content-based, and model-based [21]. A memory-based algorithm uses the database containing users and items to make predictions. A content-based or search-based algorithm uses content such as text, search queries, or user profiles to find out what items might be interesting to a user. Model-based algorithms build models using training data, which can then be used to generate recommendations, predict behaviours, etc. Continuously provided with new data, a model can also learn over time to adjust itself and thereby make more up-to-date and accurate recommendations. This paper will focus on the following three techniques:

Clustering

Clustering is a model-based technique that works by dividing users into segments based on similarity [10]. When a recommendation shall be generated for a specific user, the model finds the most suitable group for that user and uses that group's data to form a recommendation. There are numerous clustering and unsupervised learning algorithms to choose from in a clustering method.

Association Rule Mining

Mining for association rules is the task of finding patterns within an item set, more precisely looking for items that occur together in some context [22]. The method can be seen as a memory-based algorithm.

Collaborative Filtering

Collaborative Filtering (CF) is a recommender technique that encompasses different algorithms and approaches. In general terms, CF can be described as a method where similar behaviours are found through collaboration between recipients, while behaviours that are not relevant are filtered out [21]. There are both memory-based and model-based CF techniques, as well as hybrid-based ones, which combine more than one method.

2.1.2 Challenges

There are several challenges for a recommendation system. Naturally, they depend on the environment in which the system operates, but this section covers the most common ones: accuracy, performance, scalability, lack of data, over-specialization, and privacy [21].

Accuracy

Accuracy is of great importance, since providing personalized recommendations that a user will find interesting is one of the main goals of a recommendation system. Suggesting content that is close to a user's preferences while avoiding narrow and limited suggestions, i.e. also recommending content that is new and still interesting to the user, is challenging to achieve [4].


Performance

A recommendation system needs to be fast. In many applications the result is required in real time [10], so the algorithm must run quickly while still maintaining good accuracy. Forcing the user to wait for a considerable amount of time while browsing an online store is not acceptable, but neither is performing fast while providing inaccurate recommendations.

Scalability

From a data mining and machine learning perspective, large amounts of data are generally a good thing; lack of data is an issue described in the section below. However, as the amount of data available to a recommendation system grows, a challenge arises in how well the system will scale [19]. There are different ways to solve the scalability issue and keep performance at an acceptable level, but they will in general lower the accuracy.

Lack of Data

Many recommendation systems suffer when there is a lack of data [19]. Data sparsity refers to a problem specific to CF techniques. Especially in systems with thousands or millions of items and thousands of users, data sparsity is very common, as the percentage of items that a user has bought or rated is extremely low (<1%).

The Cold Start Problem

Another phenomenon, related to the previous challenge, is called the cold start problem and occurs in two scenarios: when a user is new to the system and when a new item is introduced. It is difficult to make accurate predictions when there is little or no information about a user the first time he or she uses the system. The same applies to a new item. The cold start problem is further described in section 2.5.

Over-specialization

It is a challenge for a recommendation system to offer accurate recommendations that are not over-specialized [19]. To illustrate the problem, consider an online movie store where a user has bought “Star Wars 1” and “Star Wars 2” and the system recommends a single item, “Star Wars 3”. This is not a very useful recommendation for the user; diversity is desirable.

Privacy

Personal data is vital to a recommendation system. From a user's perspective, this may be a privacy concern [11]. In some domains, customers consider purchase records to be sensitive information, yet this data is oftentimes used and stored in e-commerce systems. The challenge lies in getting users to trust the system with such information and in keeping that data secure.

2.1.3 Types of Data

There are many data types that can be used to support a recommendation system. Customer data can be divided into four groups: transaction, rating, demographic, and behaviour data [11]. Transaction data, also referred to as binary purchase data, consists of customer records of bought products. Rating data consists of user ratings of items [10], typically on a scale of 1-5. Binary purchase data and rating data are common inputs to CF techniques [21]. Demographic data, i.e. information such as age, gender, profession, and interests, can be used to generate user profiles and to find clusters of users with similar interests [9]. Web-mining techniques are used to capture behaviour data for users, e.g. how they browse a site [9], what movie a user watched, and whether the user watched a large portion of the video or not [4]. Production data or content data is metadata for items, e.g. movie title, clothing brand, or author, and is widely used in content-based techniques [21]. In addition to gathering data generated by users, user-agents or “bots” can be used to generate synthetic data to populate models [6].

2.1.4 Existing Solutions

Today, e-commerce is an ever-increasing market where recommender systems have a natural place [16]. As described earlier, they can also be found in other information technology ecosystems. This section depicts existing solutions in two recognized and widely known applications.

Amazon

Amazon is today one of the world's most famous online retail stores. The company uses its recommendation system as a powerful targeted marketing tool in email campaigns and in product suggestions on its web site. Amazon currently has around 250 million active users [8] (an active user is a customer who has made a purchase within the last 12 months) and over 230 million products [7]. These numbers place extreme demands on the scalability and robustness of the system. However, Amazon has created a system that meets these criteria, based on an algorithm they call item-to-item collaborative filtering [10]. The algorithm finds similar items, i.e. items that customers tend to buy together, instead of similar users. The following pseudocode describes how the similarities between all products are calculated:

For each item in product catalog, I1
    For each customer C who purchased I1
        For each item I2 purchased by customer C
            Record that a customer purchased I1 and I2
    For each item I2
        Compute the similarity between I1 and I2

This is an expensive computation, in the worst case O(N²M) where N is the number of items and M the number of customers, and it is therefore done offline. A recommendation is generated by inspecting the items that a customer has purchased or rated, finding similar items using the similar-item table, aggregating those items, and finally presenting the most popular ones.

Item-to-item collaborative filtering scales independently of the number of users and items, whereas traditional CF suffers when these numbers increase. The computation of finding similar items for a specific customer is very cheap and depends only on the number of items the customer has purchased or rated.
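The pseudocode above can be fleshed out into a small Python sketch. Cosine similarity over binary purchase vectors is one plausible choice for the similarity step; the input format, item names, and function name are illustrative assumptions, not Amazon's actual implementation:

```python
from collections import defaultdict
from math import sqrt

def item_similarities(purchases):
    """Offline item-to-item similarity table from binary purchase data.

    `purchases` maps a customer id to the set of items they bought
    (a hypothetical input format, for illustration only).
    """
    pair_counts = defaultdict(int)   # customers who bought both I1 and I2
    item_counts = defaultdict(int)   # customers who bought the item at all
    for items in purchases.values():
        for i1 in items:
            item_counts[i1] += 1
            for i2 in items:
                if i1 != i2:
                    pair_counts[i1, i2] += 1
    # Cosine similarity over binary purchase vectors:
    #   sim(I1, I2) = |bought both| / sqrt(|bought I1| * |bought I2|)
    return {(i1, i2): n / sqrt(item_counts[i1] * item_counts[i2])
            for (i1, i2), n in pair_counts.items()}

purchases = {
    "alice": {"bread", "milk"},
    "bob": {"bread", "milk", "beer"},
    "carol": {"bread", "beer"},
}
similar = item_similarities(purchases)
```

As in the offline step described above, the table is built once over all customers; looking up recommendations for one customer then only touches the rows for the items that customer bought.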

YouTube

YouTube was founded in 2005 and has grown to be one of the world's most popular video sites [4]. Users can view videos and movies as well as upload their own content. How users navigate the page and find videos can be divided into three categories: direct navigation, search-oriented, and unarticulated want. Direct navigation refers to a user watching a single video that they have found elsewhere. Discovering content by searching within topics using queries is called search-oriented. Unarticulated want is interesting content that has been found through recommendations, namely from a recommendation system.

The types of data available to YouTube, or possible to collect, make it an interesting and sometimes difficult setting from a recommendation system's perspective. As there are very few restrictions when users upload content, metadata is oftentimes untrustworthy, if it exists at all for a video. User interaction, i.e. how many times a video was watched or how large a portion of it, can be useful but can also be noisy. YouTube's solution to these challenges has been successful, as video recommendation is one of the most important sources of information [23]. The solution combines many data sources, such as content data and user activity data, in an association rule mining approach [4]. It can be seen as building a directed graph where each node is a video and two related videos are connected by a weighted edge, the weight being determined by the relatedness of the videos. When a list of recommended items shall be presented to a user, the system selects a set of videos from the graph based on what videos the user has seen, liked, favorited, etc. Each video is then ranked by an algorithm which takes the number of views, number of likes, metadata, etc. into account.

Like Amazon, YouTube has adopted an offline precomputation approach, since its application also handles enormous amounts of data; precomputing related videos is necessary to obtain satisfactory performance. At the same time, recommendation generation is pipelined to update the data set several times a day to stay up to date.

2.2 Clustering

The task of cluster analysis is to divide data into segments (clusters) that are meaningful or useful [22]. Grouping similar objects together is applicable in a wide variety of fields, and cluster analysis is an important tool in social science, biology, statistics, business, pattern recognition, data mining, and machine learning. In biology, for instance, it is used to find groups of genes that have similar functions. In business, it can be used to find groups of customers, which provides additional analysis capabilities and business opportunities.

2.2.1 Techniques

There are different types of clusterings. Partitional clustering means that objects are divided into separate non-overlapping clusters: an object is a member of exactly one subset. In hierarchical clustering, objects instead belong to a nested set of clusters. Figure 2.1 illustrates partitional versus hierarchical clustering.

Figure 2.1: Partitional versus hierarchical clustering

Clusterings can furthermore be exclusive, non-exclusive, or fuzzy. In exclusive (non-overlapping) clusterings, objects may only belong to a single cluster. The opposite is non-exclusive, or overlapping, clustering, where objects may be grouped into multiple segments. In fuzzy clusterings, every object belongs to every cluster and has a value which describes how close or how strong each membership is.

Another distinction is whether a clustering is complete or partial. For some applications, assigning every object in a data set to a cluster (complete clustering) is not preferable. Noisy data and outliers are reasons to use partial clustering, where objects may be omitted and left unclustered.

K-means

The k-means technique is a partitional, prototype-based clustering method. A prototype is often a centroid, i.e. the average of all data points in a group, or a medoid, the most representative point in a cluster. The basic outline of the algorithm is to first specify a number K, the number of prototypes (clusters). Every point in the data set is then assigned to the closest cluster, i.e. grouped together with the most similar prototype. After a point has been assigned to a cluster, that cluster's prototype is updated. The basic k-means algorithm can be defined as [22]:

Select K points as initial centroids
Repeat
    Form K clusters by assigning each point to its closest centroid
    Recompute the centroid of each cluster
Until centroids do not change

Different distance measures can be used to find the closest prototype for each point; common metrics are Euclidean distance and Manhattan distance. The result of k-means is evaluated by calculating the sum of the squared error (SSE) over the clusters, which makes k-means an optimization problem: the optimal solution is the one that minimizes the SSE.

SSE = Σ_{i=1}^{K} Σ_{x∈S_i} ‖x − c_i‖²    (2.1)

where c_i is the centroid of cluster S_i. Minimizing this objective is an NP-hard problem [2], even for K = 2. The basic k-means algorithm described above instead stops when the centroids no longer change, i.e. when a local optimum is found; there is no guarantee that the global optimum is found. This results in a time complexity of O(IKmd), where m is the number of points, d the number of dimensions, and I the number of iterations. I is usually small, which makes the basic k-means algorithm linear in m [22].

K-means is simple, applicable to many data types, and quite efficient. One disadvantage is that K needs to be specified beforehand. Furthermore, the method does not guarantee finding a global optimum and is sensitive to noise and outliers.
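As an illustration, the basic algorithm above can be sketched in a few lines of Python. This is a toy implementation on made-up 2-D points, using the standard library only; it is not the implementation evaluated in this thesis:

```python
import random

def kmeans(points, k, iters=100, seed=0):
    """Minimal sketch of the basic (Lloyd's) k-means algorithm in 2-D."""
    rng = random.Random(seed)
    centroids = rng.sample(points, k)              # select K initial centroids
    for _ in range(iters):
        clusters = [[] for _ in range(k)]
        for x, y in points:                        # assign each point to its
            d2 = [(x - cx) ** 2 + (y - cy) ** 2    # closest centroid
                  for cx, cy in centroids]
            clusters[d2.index(min(d2))].append((x, y))
        new = [(sum(x for x, _ in c) / len(c), sum(y for _, y in c) / len(c))
               if c else centroids[i]              # keep a centroid whose
               for i, c in enumerate(clusters)]    # cluster went empty
        if new == centroids:                       # local optimum reached
            break
        centroids = new
    # SSE, equation (2.1): squared distance from each point to its centroid
    sse = sum(min((x - cx) ** 2 + (y - cy) ** 2 for cx, cy in centroids)
              for x, y in points)
    return centroids, sse

points = [(0, 0), (0, 1), (1, 0), (10, 10), (10, 11), (11, 10)]
centroids, sse = kmeans(points, k=2)
```

With two well-separated groups of three points each, the loop converges in a few iterations regardless of which initial centroids are sampled.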

An extended version of k-means is called bisecting k-means. Starting by dividing the points into two clusters, the algorithm continues splitting clusters until K clusters have been found. The cluster to split can be chosen in different ways, e.g. the largest cluster or the cluster with the largest SSE.

Agglomerative Hierarchical Clustering

In addition to bisecting k-means, which can be used to retrieve a hierarchical structure for a data set, there is a class of clustering techniques called hierarchical clustering algorithms. Two basic approaches to generating hierarchical clusters are divisive and agglomerative. Divisive clustering is similar to bisecting k-means: start with a single complete cluster and split clusters until only singleton clusters are left. Agglomerative clustering is the opposite: start with singleton clusters and merge clusters until a single cluster remains.

When two clusters are to be merged in an agglomerative approach, the proximity between clusters can be defined in different ways; MIN, MAX, and group average are commonly used. MIN, or single-link, defines proximity as the smallest distance between two points in different clusters. MAX, or complete-link, is the opposite, measuring how close two clusters are by the largest distance between any two points in the two clusters. Group average, also called average-link, uses the average distance over all pairs of points from the two clusters.

The time complexity of the basic agglomerative algorithm described above is O(m³), but it can be reduced to O(m² log m), where m is the number of data points.
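The three proximity definitions can be sketched directly; representing clusters as lists of points and the function name are illustrative choices:

```python
from itertools import product
from math import dist  # Euclidean distance, available since Python 3.8

def proximity(a, b, linkage="min"):
    """Proximity between clusters a and b (lists of points) under the
    three linkage criteria described above."""
    d = [dist(p, q) for p, q in product(a, b)]  # all pairwise distances
    if linkage == "min":                        # single-link: closest pair
        return min(d)
    if linkage == "max":                        # complete-link: farthest pair
        return max(d)
    return sum(d) / len(d)                      # group average

a = [(0, 0), (0, 2)]
b = [(3, 0), (5, 0)]
```

For these two clusters, single-link gives the distance 3 between (0, 0) and (3, 0), while complete-link gives the distance between (0, 2) and (5, 0).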

2.2.2 Existing Research in the Field of Recommendation Systems

Clustering techniques have proven useful in recommendation systems, either as standalone methods or combined with other techniques. In applications where resources are tagged, e.g. with a word or sentence describing a photo, document, or news article, hierarchical clustering can form the basis of an efficient recommendation system and also work as a remedy for tag ambiguity and redundancy [20]. Another application is to use k-means within a collaborative filtering system: clustering users and items has been shown to improve the performance of a CF system compared to a traditional, non-clustered solution [5]. In several studies, performance (throughput, in this context) and scalability seem to benefit from combining clustering with CF, while accuracy tends to get worse [15].

2.3 Association Rule Mining

The task of association analysis is to find patterns in an item set, i.e. items that occur together to a specific extent [22]. Searching for association rules can be valuable to any commerce company, as the rules show which items are purchased together. To exemplify the technique, table 2.1 illustrates a widely known example called the market basket transactions.

Transaction ID   Items
1                {Bread, Milk}
2                {Bread, Diapers, Beer, Eggs}
3                {Milk, Diapers, Beer, Cola}
4                {Bread, Milk, Diapers, Beer}
5                {Bread, Milk, Diapers, Cola}

Table 2.1: Market basket transactions

Support and confidence are two measurements used when mining for association rules. Support determines how often a rule holds in a given data set and is defined as:

s(X → Y) = count(X ∪ Y) / N    (2.2)

where X and Y are disjoint itemsets, count(X ∪ Y) is the number of transactions in which the itemsets X and Y occur together, and N is the number of transactions. Confidence is defined as:

c(X → Y) = count(X ∪ Y) / count(X)    (2.3)

and measures how frequently the items in Y appear in transactions that contain X. Reviewing table 2.1, the association rule {Bread} → {Milk} has support = 3/5 and confidence = 3/4.
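The hand calculation for {Bread} → {Milk} can be checked directly in Python; exact fractions are used so the values match the fractions above. The function names are illustrative:

```python
from fractions import Fraction

transactions = [
    {"Bread", "Milk"},
    {"Bread", "Diapers", "Beer", "Eggs"},
    {"Milk", "Diapers", "Beer", "Cola"},
    {"Bread", "Milk", "Diapers", "Beer"},
    {"Bread", "Milk", "Diapers", "Cola"},
]

def support(X, Y=frozenset()):
    """Equation (2.2): fraction of transactions containing X ∪ Y."""
    both = set(X) | set(Y)
    return Fraction(sum(both <= t for t in transactions), len(transactions))

def confidence(X, Y):
    """Equation (2.3): count(X ∪ Y) / count(X), via the supports."""
    return support(X, Y) / support(X)
```

Bread and Milk occur together in transactions 1, 4, and 5, giving support 3/5, and Bread occurs in four transactions, giving confidence 3/4.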

Based on the support and confidence metrics, many association rule mining approaches involve two steps: generating frequent itemsets and generating rules. The first step is to find all itemsets whose support is higher than a specified threshold. The second step is to extract, from those frequent itemsets, the rules whose confidence is higher than a specified threshold. The first step is the most computationally expensive one, since every possible itemset needs to be evaluated. The total number of possible itemsets for a data set is 2^k − 1, where k is the number of items. This yields a time complexity of O(NMw), where N is the number of transactions, M the number of possible itemsets, and w the maximum transaction width. The brute-force approach is thereby not applicable to larger data sets, but there are algorithms resolving this issue. One of them builds on the Apriori principle.

2.3.1 The Apriori Principle

Inspecting every possible itemset to find the frequent ones is an exponentially costly task. Frequent itemsets are itemsets with a support value higher than a specified threshold. The Apriori principle is defined as:


Definition. (The Apriori principle) If an itemset is frequent, then all of its subsets must also be frequent.

This means that if the itemset {a, b, c} is frequent, then the subsets {a, b}, {a, c}, and {b, c} are also frequent. The principle also works in the opposite direction: if the support of the itemset {a, b} is lower than the specified threshold, i.e. {a, b} is not a frequent itemset, then every superset of {a, b} (e.g. {a, b, c}) is infrequent. This makes it possible to prune all supersets of infrequent itemsets, greatly reducing the time needed to generate frequent itemsets.
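A minimal level-wise sketch of Apriori-style frequent-itemset generation, run on the transactions from table 2.1, could look as follows (a toy illustration, not the thesis implementation; real Apriori implementations also use more careful candidate generation):

```python
from itertools import combinations

def apriori(transactions, minsup):
    """Level-wise frequent-itemset generation with Apriori pruning."""
    n = len(transactions)

    def sup(itemset):
        return sum(itemset <= t for t in transactions) / n

    items = {i for t in transactions for i in t}
    frequent = {frozenset([i]) for i in items if sup(frozenset([i])) >= minsup}
    result = set(frequent)
    k = 2
    while frequent:
        # Candidate k-itemsets are unions of frequent (k-1)-itemsets; any
        # candidate with an infrequent (k-1)-subset is pruned before its
        # support is ever counted -- this is the Apriori principle at work.
        candidates = {a | b for a in frequent for b in frequent if len(a | b) == k}
        candidates = {c for c in candidates
                      if all(frozenset(s) in frequent
                             for s in combinations(c, k - 1))}
        frequent = {c for c in candidates if sup(c) >= minsup}
        result |= frequent
        k += 1
    return result

transactions = [
    {"Bread", "Milk"},
    {"Bread", "Diapers", "Beer", "Eggs"},
    {"Milk", "Diapers", "Beer", "Cola"},
    {"Bread", "Milk", "Diapers", "Beer"},
    {"Bread", "Milk", "Diapers", "Cola"},
]
frequent_itemsets = apriori(transactions, minsup=0.6)
```

With a support threshold of 3/5, the frequent itemsets are the four singletons {Bread}, {Milk}, {Diapers}, {Beer} and the four pairs {Bread, Milk}, {Bread, Diapers}, {Milk, Diapers}, {Diapers, Beer}; no triple survives.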

2.3.2 Existing Research in the Field of Recommendation Systems

Association rules are popular for producing top-N recommendations in e-commerce systems [14]. A basic approach is to find all generated rules that are supported by a customer's purchased items, i.e. rules whose left-hand sides occur among those items, and then rank the rules, usually by confidence value. The technique is also part of the YouTube recommender system [4]. Association rule mining performs on par with, or slightly worse than, traditional CF [14].

2.4 Collaborative Filtering

Collaborative Filtering (CF) is one of the most successful recommendation system concepts [14]. The core of CF methods is to find similarities among users and/or items in order to generate recommendations. The fundamental assumption is that if two or more users are similar, e.g. they have purchased the same items or rated items similarly, they can collaborate within that neighbourhood of users to provide recommendations to each other [21]. The underlying data structure in a typical CF solution is an n-by-m matrix, where n is the number of users and m the number of items.

2.4.1 Techniques

There are different techniques within the concept of CF, and this section gives an overview of them.

Memory-based

Memory-based techniques, also referred to as traditional CF, use the user/item database to generate predictions. A basic user-memory-based algorithm recommends items to the current user in two steps [21]. The first step is to find X similar users (neighbours) of the current user. The second step is to find N items that the current user has not purchased or rated, and to predict ratings, or likelihoods of purchase, for those items using the neighbours' records.


Similarity measures Similarities between users or items can be calculated using different measures. Two commonly applied methods are vector cosine-based similarity and Pearson correlation. Vector cosine-based similarity treats a row or a column of the matrix, depending on whether the method is user-based or item-based, as a vector and calculates the cosine of the angle between two such vectors. It is given by [21]:

w_{i,j} = cos(i⃗, j⃗) = (i⃗ · j⃗) / (‖i⃗‖ · ‖j⃗‖)    (2.4)

where i⃗ and j⃗ are the user or item vectors. The Pearson correlation technique calculates how two variables relate linearly to each other and is given by [21]:

w_{u,v} = Σ_{i∈I} (r_{u,i} − r̄_u)(r_{v,i} − r̄_v) / ( √(Σ_{i∈I} (r_{u,i} − r̄_u)²) · √(Σ_{i∈I} (r_{v,i} − r̄_v)²) )    (2.5)

where I is the collection of items that both users u and v have rated, r_{u,i} is user u's rating of item i, and r̄_u is user u's average rating over the co-rated items.

Generating predictions Predicting the rating of an item i for a user u can be done with anything from simple to complex aggregate functions. A simple function takes the sum of the neighbours' ratings of item i and divides it by the number of neighbours.
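The two memory-based steps can be sketched as follows, using the vector cosine similarity of equation (2.4) to pick neighbours and the simple averaging predictor just described. The rating data and function names are made up for illustration:

```python
from math import sqrt

def cosine(u, v):
    """Vector cosine similarity, equation (2.4), for sparse rating
    dicts (missing items count as zero-valued vector components)."""
    common = set(u) & set(v)
    num = sum(u[i] * v[i] for i in common)
    den = (sqrt(sum(r * r for r in u.values()))
           * sqrt(sum(r * r for r in v.values())))
    return num / den if den else 0.0

def predict(ratings, user, item, n_neighbours=2):
    """Predict user's rating of item as the plain average over the
    n most similar neighbours who have rated the item."""
    neighbours = sorted(
        (v for v in ratings if v != user and item in ratings[v]),
        key=lambda v: cosine(ratings[user], ratings[v]),
        reverse=True,
    )[:n_neighbours]
    if not neighbours:
        return None  # cold start: no neighbour has rated the item
    return sum(ratings[v][item] for v in neighbours) / len(neighbours)

ratings = {                       # hypothetical user -> {item: rating}
    "u1": {"a": 5, "b": 3},
    "u2": {"a": 4, "b": 3, "c": 4},
    "u3": {"a": 1, "b": 5, "c": 2},
}
prediction = predict(ratings, "u1", "c")
```

Note how the `None` branch is exactly the cold start situation discussed in section 2.5: a brand-new item appears in no neighbour's record, so this kind of predictor has nothing to average over.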

Model-based

Model-based CF methods involve concepts from the machine learning field, such as Bayesian theory. A naive Bayesian model (NBM), also known as a simple Bayesian model, assumes that features are independent given the class, and a class can be predicted given the observed features [21]. A feature can be a user, a user's ratings can be the feature values, and classes can be defined as rating intervals [12]. Research has shown that NBM may outperform correlation-based CF, although this result was obtained in an experiment on a sparse data set [12].

Hybrid-based

Hybrid-based CF is a term referring to approaches where CF is combined with other techniques. Content-based recommendation techniques involve metadata, tags, and other data about resources, which can be used to support and improve the performance of traditional CF methods [21]. Using an NBM to populate a sparse matrix is another method that may improve a recommender system.

2.5 The Cold Start Problem

As mentioned in 2.1.2, the cold start problem is a well-known problem for recommendation systems, especially traditional CF approaches, which rely on existing data such as item ratings or customer purchase records [18]. The cold start problem arises when a new user enters a system or when a new item is introduced. The issue affects not only CF but any method that relies on customer/item records. Association rule mining approaches that do not utilize content data, such as the type (e.g. horror, action) of a newly introduced movie, also suffer from the cold start problem.

Research has been conducted to overcome this common challenge and several methods have been suggested. The cold start problem has two aspects: new users and new items. This thesis relates to the latter. Probabilistic models built and supported by content data are one suggestion for resolving cold start recommendations [17]. In such a model, the content data can be, for example, actors that have starred in previously rated movies; once a new movie is introduced, data related to those actors can be used to make predictions for it. Another approach is a collaborative mixture model that uses representative users and content data to recommend new items [3]. Representative users are users considered to have great influence over other users, so their ratings can be used to estimate ratings for items. This was applied in a movie rating context, where the correlation between movie genres was calculated. Recommending a new movie, having at least one genre, involved combining its genres with the calculated genre correlation and the representative users' ratings.

The sparsity problem (2.1.2) is closely related to the cold start problem. The issues with data sparsity in a CF recommender system can be approached with different methods. One successful technique is a cluster-based hybrid solution that transforms the sparse CF matrix into a dense, meaningful one [1]. A method called spectral clustering, involving k-means and several mathematical concepts, is applied to users and items to form clusters of similar users and items. The algorithm proceeds by assigning explicit (ratings) or implicit (e.g. purchase records) values to the "unrated" items by finding similar clusters for each item, applying an aggregation function, and then repeating the process until the matrix is stable (i.e. the predicted values are fixed). This iterative cluster-based approach has proven superior to, for example, item-based CF in terms of mean absolute error.

2.6 Classification

A classification problem refers to the task of assigning an object to one of N categories [22]. If N equals 2, it is called a binary classification problem. Classification is a supervised learning problem, as the labels (categories) are known, in contrast to unsupervised learning (e.g. clustering) where the labels are unknown. Classification is likely the most widely used machine learning technique and can be applied to a wide variety of challenging problems [13]. Document classification is one example, where the goal is to classify a document; detecting spam emails and dividing web pages into categories involve document classification. Another real-world application is image classification, such as detecting a sunset or recognizing a face.

2.6.1 Evaluating Binary Classification Models

This thesis presents the problem:

For a new product P, which customers will buy it?

which subsequently leads to the classification problem:

Will customer ci buy product pi?

There are two categories (yes and no); hence it is a binary classification problem.

In a binary classification problem, there are four types of outcome when a prediction model predicts the class of an object: true positive (TP), false positive (FP), true negative (TN), and false negative (FN) [13]. The true cases are those where the predicted outcome is correct, and the false cases those where it is incorrect. Positive and negative are the two labels for outcomes, in this case yes and no. These four values can be combined in equations to obtain different measurements which describe the performance of the prediction model. The true positive rate (TPR), also known as sensitivity, recall, or hit rate, is defined as TP/N+, where N+ is the true number of positives (TP + FN). The false positive rate (FPR), also known as the false alarm rate, is calculated as FP/N−, where N− is the true number of negatives (TN + FP).

A common way to evaluate the performance of a binary classification model is to study a receiver operating characteristics (ROC) curve [13]. The x-axis represents the FPR value and the y-axis the TPR value. Classification models usually have a threshold parameter π which affects the outcome of predictions; in a binary classification model, this parameter generally defines the boundary between positive and negative predictions. The ROC curve is thereby a tool for finding the preferable threshold, as every point on the curve represents the resulting outcome for a certain threshold. Any system can perform at the bottom left corner of the ROC plot (TPR = 0, FPR = 0), which corresponds to a model that classifies everything as negative. Similarly, classifying everything as positive is equally simple, which corresponds to the top right corner of the plot (TPR = 1, FPR = 1). The goal for any classification model is to perform at the top left corner of the ROC plot, where there are no false positives and all true positives are found.

Precision is another measurement to study when evaluating a binary classification system. Precision is defined as TP/N̂+, where N̂+ is the total number of predicted positives (TP + FP), and describes the fraction of the predicted positives that actually were true positives [13]. Precision together with recall forms another useful plot called the precision-recall curve. The x-axis represents recall and the y-axis precision. The optimal result lies in the top right corner of the plot.

Accuracy is another indicator that expresses overall performance. It is defined as the number of correct predictions (TP + TN) divided by the total number of predictions.
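The four measurements above can be computed from the outcome counts as follows (an illustrative helper with made-up counts, not thesis code):

```python
def binary_metrics(tp, fp, tn, fn):
    """TPR (recall), FPR, precision, and accuracy from the four
    outcome counts of a binary classifier."""
    tpr = tp / (tp + fn)                      # TP / N+
    fpr = fp / (fp + tn)                      # FP / N-
    precision = tp / (tp + fp)                # TP / predicted positives
    accuracy = (tp + tn) / (tp + fp + tn + fn)
    return tpr, fpr, precision, accuracy

print(binary_metrics(tp=8, fp=2, tn=85, fn=5))
```

For the example counts, TPR is 8/13, FPR 2/87, precision 8/10, and accuracy 93/100.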


Chapter 3

Methodology

This thesis presents a comparison between two suggested approaches for predicting which customers will buy a newly introduced product:

• Collaborative filtering combined with clustering

• Association rule mining combined with clustering

This section contains a description of the data set used, the two hybrid recommendation methods, and how the results were derived.

3.1 Data Set

The data used was a real-world data set from an e-commerce clothing company, with a customer base in Europe. The implementation handled historical data (offline) and was not launched or evaluated in any live environment. The data set contained 52,752 unique customers, 51,040 unique products, and a total number of 207,022 purchase entries. Purchase data was used as no ratings existed in the data set.

3.1.1 Tiers

The structure of the products can be described as a layered structure with three tiers. The two top tiers are categories of clothing types, and distinct products exist on the third, bottom, layer. Figure 3.1 illustrates this structure. There are 10 different categories in the top tier (tier 2) and 54 different categories in the second tier (tier 1). Every product belongs to a group in both tier 1 and tier 2. To illustrate this relationship in the context of the data set used: a green pair of jeans of a specific brand (tier 0) belongs to the category "Jeans" (tier 1) and to the category "Pants" (tier 2).

This structure makes it possible to assign products to predefined clusters. Every category in tier 1 forms a cluster, as does every category in tier 2. This clustering approach is used in the user-based collaborative filtering method and in the association rule mining method.

Figure 3.1: Tier illustration. Blue boxes (top level) represent two categories on tier 2 (Pants, Accessories). Red boxes (middle level) represent tier 1 categories (Jeans, Shorts, Shades, Belts, Earrings). The green box (bottom level) represents a distinct product, a green pair of jeans, at the bottom layer.

3.1.2 Training and Test Set

The data set was divided into a training set and a test set. Roughly 90% of the data was used as the training set and 10% was left as the test set. The criterion for the test set was that it should only contain newly introduced products, since the goal of this thesis is to evaluate methods for predicting which customers will buy new products, i.e. products that are released at some point. The boundary satisfying roughly a 90%/10% ratio was found to be the date 2014-04-15. Every purchase record made before 2014-04-15 was assigned to the training set, and the rest of the purchase records, containing only products that were introduced from 2014-04-16, were included in the test set. Furthermore, only customers (and their purchases) that had bought two or more products before 2014-04-15 were included in the training set, since a customer that has only bought one product cannot contribute to the evaluated recommendation methods. The resulting number of customers was 33,200. These customers were used in the training set and were considered the only customers in the test set; any other customers (e.g. first-time buyers) were left out. The total number of products in the test set was 3,941.
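The date-based split and the two-purchase filter can be sketched as follows. The records, field layout, and dates below are made-up examples (and, as a simplification, test membership is approximated by purchase date rather than product introduction date):

```python
from collections import Counter
from datetime import date

# Hypothetical purchase records: (customer_id, product_id, purchase_date).
records = [
    ("c1", "p1", date(2013, 11, 2)),
    ("c1", "p2", date(2014, 2, 10)),
    ("c2", "p3", date(2014, 1, 5)),
    ("c2", "p4", date(2014, 3, 20)),
    ("c2", "p9", date(2014, 5, 1)),   # p9 appears after the boundary
    ("c3", "p1", date(2014, 3, 3)),
]
BOUNDARY = date(2014, 4, 15)

# Purchases before the boundary form the training set; later purchases form the test set.
train = [r for r in records if r[2] < BOUNDARY]
test = [r for r in records if r[2] > BOUNDARY]

# Keep only customers with two or more training purchases.
counts = Counter(c for c, _, _ in train)
eligible = {c for c, n in counts.items() if n >= 2}
train = [r for r in train if r[0] in eligible]
test = [r for r in test if r[0] in eligible]
print(len(train), len(test))  # → 4 1
```

Customer c3 is dropped entirely, having only one purchase before the boundary.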


3.2 Methods

This section describes the data preprocessing phase and the two suggested methods of this thesis.

3.2.1 Preprocessing Phase

The data set primarily held purchase records, where every record contained a unique customer and a unique product. The main task of the preprocessing step was to transform the set of records into a map between a customer and a list of that customer's bought products. This map was duplicated into two versions, in which all the products were translated to their tier category: one version represented all products on tier 1, and the second represented all products on tier 2. The benefits of this transformation were several. In the association rule mining approach, a list of a customer's bought products represents a transaction (see the example in table 2.1). All the customers' transaction lists together form the data structure which is used as the source in the rule mining phase. The mappings were also used in both methods for finding customers that have or have not bought a specific product.
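The preprocessing step can be sketched as below. The products, category names, and lookup tables are made-up examples, not the thesis data:

```python
# Purchase records -> customer -> list of bought products, plus two
# duplicated maps with products translated to tier-1 / tier-2 categories.
records = [("c1", "green_jeans"), ("c1", "belt_x"), ("c2", "green_jeans")]
tier1 = {"green_jeans": "Jeans", "belt_x": "Belts"}        # product -> tier-1 category
tier2 = {"green_jeans": "Pants", "belt_x": "Accessories"}  # product -> tier-2 category

purchases = {}
for customer, product in records:
    purchases.setdefault(customer, []).append(product)

purchases_t1 = {c: [tier1[p] for p in ps] for c, ps in purchases.items()}
purchases_t2 = {c: [tier2[p] for p in ps] for c, ps in purchases.items()}
print(purchases_t1["c1"], purchases_t2["c1"])
```

Each customer's list in `purchases_t1` or `purchases_t2` is the transaction list used later in the rule mining phase.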

3.2.2 Association Rule Mining

A traditional association rule mining approach is unable to find patterns for new items. Let I be a new item that no one has bought yet, and let Ptier0 and Qtier0 be non-empty itemsets containing products on tier 0. Since I does not exist in a single transaction, there are no frequent itemsets containing I; hence there are no rules Ptier0 → Qtier0 where I ∈ Qtier0. However, by utilizing the fact that a new item belongs to two clusters (tier 1 and tier 2), association rules can, in theory, be used to find potential buyers for a new item. Let J be a new item that no one has bought yet, let Ptierx and Qtierx be non-empty itemsets containing products (represented as categories) on tier x, where x is 1 or 2, and let J belong to a category C on tier x. If there exists at least one purchase record for a product that belongs to category C, then there exists a frequent itemset with support > 0. It follows that there must exist at least one association rule Ptierx → Qtierx, where C ∈ Qtierx, with support > 0 and confidence > 0. Hence, the proposed principle for predicting which customers will buy a new product, given such a rule, is: customers that have bought all products in Ptierx are likely to buy J.

Two steps were required to make predictions. In the first step, rules were generated using the data structure retrieved from the preprocessing phase, at a specified tier. The parameters used during this stage are described in the section below. In the next step, predicting which customers (among all customers in the data set) will buy a new product P was performed by:

1. Translate P to tier 1 or tier 2


2. Find the rule, with support larger than a specified threshold, that has the highest confidence value and P on its right side

3. Use the left-side products (in tier 1 or 2 space) from that rule and find cus- tomers that have purchased those products

4. Classify the customers found as buyers

All versus Selected

In step 3, customers can be selected in two different ways. Apart from the requirement that the selected customers have bought all the products in the itemset on the left side of the rule, a distinction is made based on whether they have bought a product of the same category as the one on the right side. All refers to selecting all such customers, whether or not they have bought any product of the same category as P (which is on the right side of the rule). Selected means selecting only the customers that have not bought a product of the same category as P.

Parameters

In the rule mining phase, where rules that have a tier-translated P on the right side are generated, the confidence parameter c is used to find the highest ranked rule to apply. The set of confidence values was chosen as [0.8, 0.6, 0.4, 0.2] to evaluate how the results are affected by different confidence thresholds. A support threshold is initially set to 0.9 when mining for rules with confidence ≥ c. If no rules are found, the support threshold is decremented by 0.1 until a rule is found. If the support threshold reaches 0.01 and no rules are found, the approach fails to make any predictions for the product P. This decision was based on the judgment that a rule with a support lower than 1% is too weak to be considered a useful pattern.
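The rule-selection policy can be sketched as follows, given pre-mined rules as (antecedent, consequent, support, confidence) tuples. The rules, categories, and exact threshold schedule below are illustrative assumptions, not the thesis implementation:

```python
# Start at a support threshold of 0.9 and lower it stepwise down to 0.01
# until a rule with the target category on the right side and
# confidence >= c exists; among candidates, pick the highest confidence.
def find_rule(rules, category, c):
    thresholds = [round(0.9 - 0.1 * k, 1) for k in range(9)] + [0.01]
    for support in thresholds:
        candidates = [r for r in rules
                      if category in r[1] and r[3] >= c and r[2] >= support]
        if candidates:
            return max(candidates, key=lambda r: r[3])  # highest confidence
    return None  # no usable rule: no prediction for this product

rules = [
    ({"Jeans"}, {"Shorts"}, 0.05, 0.7),
    ({"Jeans", "Belts"}, {"Shorts"}, 0.02, 0.9),
]
print(find_rule(rules, "Shorts", 0.6))
```

With c = 0.6, both example rules only qualify once the support threshold has dropped to 0.01, and the second rule wins on confidence; with c = 0.95 no rule is found.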

3.2.3 Collaborative Filtering

The core principle of the suggested CF method is to transform the regular user/item matrix into a user/item-cluster matrix. Customers represent users, unique products represent items, and tier 1 and 2 categories form item clusters. The approach for deciding whether a customer C will buy a product P, after the two matrices (the tier 1 and tier 2 user/item-cluster matrices) have been set up, can be summarized as:

1. Translate P to its tier 1 or 2 representation

2. Find N neighbours of C that satisfy a similarity threshold s

3. Apply an aggregation function to the neighbours' transaction lists to retrieve a list of products T; the aggregation function yields a product score for each product


4. If P is in T and that score is higher than a specified threshold r, classify C as a buyer

Vector cosine-based similarity (2.4) was used to calculate the similarities between customers. Top-N refers to the N neighbours with the greatest similarity.

The aggregation function used to obtain $T$ from the $N$ neighbours was:

$$T = \Bigl(\sum_{i=0}^{N} S_i\Bigr)\frac{1}{N} \qquad (3.1)$$

where $S_i$ is the transaction list (a vector) for neighbour $n_i$ (row $r_i$ in the user/item-cluster matrix) multiplied by the scalar $1/M$, where $M$ is the total number of products $n_i$ has bought. This normalization prevents neighbours that have bought many products from having too much influence on $T$ compared to neighbours that have bought fewer. Multiplying $\sum_{i=0}^{N} S_i$ by the scalar $1/N$ is done to obtain product scores between 0 and 1. The length of $T$ equals the width of the user/item-cluster matrix, i.e. the number of categories. A value $T_i$ represents the product score for a specific category. To exemplify: if a category $C$ in $T$ has a score of 0.7, this means that 70% of the purchases within the neighbourhood that $T$ represents were products of category $C$.
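Equation 3.1 can be sketched as follows, assuming each neighbour's transaction list is given as a purchase-count vector over the categories (the counts below are made up):

```python
# Each neighbour vector holds purchase counts per category; it is first
# normalized by that neighbour's total purchases (1/M), then the
# neighbourhood average (1/N) gives a per-category score in [0, 1].
def product_scores(neighbour_vectors):
    n = len(neighbour_vectors)
    width = len(neighbour_vectors[0])  # number of categories
    scores = [0.0] * width
    for vec in neighbour_vectors:
        total = sum(vec)  # M: total products this neighbour bought
        for j, count in enumerate(vec):
            scores[j] += count / total
    return [s / n for s in scores]

# Two neighbours over three categories (hypothetical counts)
T = product_scores([[3, 1, 0], [2, 2, 0]])
print(T)  # → [0.625, 0.375, 0.0]
```

The first category gets a score of 0.625 since 3/4 of the first neighbour's purchases and 2/4 of the second neighbour's purchases fall in it.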

All versus Selected

The distinction between All and Selected (3.2.2) is implemented in the CF method as well. This criterion is applied after step 1: if Selected is active, customers that have bought a product of the same category as P are not passed through to step 2 and are automatically classified as non-buyers.

Parameters

There are at least three parameters that can be adjusted when this method is used to classify buyers. These are:

• Number of top-neighbors, N

• Neighbour similarity threshold, s

• Product score threshold, r

Due to the limited time scope of this thesis, different values for s and r are evaluated while N is fixed at 5. The set of values for s is [0.95, 0.9, 0.8, 0.6, 0.4], chosen to study what impact the neighbour similarity threshold has. In step 2, neighbours with similarity < s are invalid neighbours for a customer c. If c has no valid neighbours, c is classified as a non-buyer.

The set of r values evaluated is [0.7, 0.5, 0.3, 0.1], chosen to see how a changing product score threshold affects the result. If a newly introduced product P has a score ≥ r, P is considered a product that the current customer is likely to buy, and that customer is classified as a buyer of P.

3.3 Results

The test set (3.1.2) consisted of 3,984 products. Every customer (33,200 in total) was classified as a buyer or a non-buyer of every product. This led to a binary classification (2.6.1) result set of size 33,200 for each product, and the results were put together as the average result over all products.


Chapter 4

Results

This section presents results for the implemented methods described in chapter 3. The chapter is divided into two subsections, one for each method. Each subsection contains ROC plots, Euclidean distances, accuracy, and TPR/FPR presented in tables or charts. Details on how the results were derived can be found in sections 2.6.1 and 3.3.

4.1 Association Rule Mining

This section presents the results for the Association Rule Mining method, which is evaluated using different parameters. T1 or T2 refers to rules that are defined on tier 1 or tier 2 (section 3.1.1). All and Selected refer to the buyer/non-buyer distinction; see section 3.2.2 for a richer explanation. Confidence represents the specified confidence threshold (section 3.2.2).

4.1.1 ROC-plot

Figure 4.1 illustrates how the confidence threshold affects TPR and FPR. A complete summary of these results, together with accuracy, can be found in section 4.1.3.


Figure 4.1: A ROC-plot illustrating how TPR and FPR are affected by changing confidence, for the association rule classifier. The four series are T1 All, T1 Selected, T2 All, and T2 Selected; both axes (FPR on x, TPR on y) range from 0 to 1.

Euclidean Distance

The Euclidean Distance to the point (0, 1) is calculated for every point in the ROC-plot (fig. 4.1) and is presented in table 4.1.
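The distance computation itself is straightforward. A sketch, using as an example the T1 All point at confidence 0.2 (TPR 0.45, FPR 0.26 from table 4.2):

```python
import math

def distance_to_ideal(fpr, tpr):
    """Euclidean distance from a ROC point (FPR, TPR) to the ideal corner (0, 1)."""
    return math.hypot(fpr, 1.0 - tpr)

print(round(distance_to_ideal(0.26, 0.45), 3))  # → 0.608
```

This reproduces the corresponding entry in table 4.1.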

Confidence T1 All T1 Selected T2 All T2 Selected

0.2 0.608 0.920 0.588 0.927

0.4 0.671 0.947 0.588 0.938

0.6 0.933 0.994 0.731 0.973

0.8 0.948 0.995 0.922 0.998

Table 4.1: Euclidean Distances to the point (0, 1) in the ROC-plot (fig. 4.1).


4.1.2 Accuracy

Accuracy measures the fraction of predictions that were correct, and is presented in figure 4.2.

Figure 4.2: Accuracy for the association rule classifier (c equals confidence), shown for T2 All, T2 Selected, T1 All, and T1 Selected at c = 0.2, 0.4, 0.6, and 0.8.

4.1.3 Tables

The underlying values presented in the previous sections are summarized in tables 4.2 and 4.4, together with the corresponding standard deviations shown in tables 4.3 and 4.5.

            T1 All              T1 Selected
Confidence  TPR   FPR   ACC     TPR    FPR    ACC
0.2         0.45  0.26  0.74    0.09   0.16   0.83
0.4         0.36  0.20  0.80    0.06   0.12   0.88
0.6         0.07  0.03  0.97    0.006  0.009  0.991
0.8         0.05  0.01  0.99    0.005  0.003  0.997

Table 4.2: Tier 1 - Results in fractional numbers for TPR, FPR, and accuracy (ACC) for the association rule classifier.

(33)

            T1 All                    T1 Selected
Confidence  TPRstd  FPRstd  ACCstd    TPRstd  FPRstd  ACCstd
0.2         0.37    0.06    0.06      0.22    0.04    0.04
0.4         0.36    0.09    0.09      0.17    0.05    0.05
0.6         0.19    0.01    0.01      0.06    <0.01   <0.01
0.8         0.17    <0.01   <0.01     0.06    <0.01   <0.01

Table 4.3: Tier 1 - Standard deviations for TPR, FPR, and accuracy (ACC) for the association rule classifier.

            T2 All              T2 Selected
Confidence  TPR   FPR   ACC     TPR    FPR    ACC
0.2         0.59  0.42  0.58    0.1    0.23   0.77
0.4         0.55  0.37  0.63    0.08   0.19   0.80
0.6         0.28  0.12  0.88    0.03   0.05   0.95
0.8         0.08  0.02  0.98    0.002  0.004  0.996

Table 4.4: Tier 2 - Results in fractional numbers for TPR, FPR, and accuracy (ACC) for the association rule classifier.

            T2 All                    T2 Selected
Confidence  TPRstd  FPRstd  ACCstd    TPRstd  FPRstd  ACCstd
0.2         0.37    0.06    0.06      0.22    0.06    0.06
0.4         0.38    0.12    0.12      0.20    0.08    0.08
0.6         0.34    0.05    0.05      0.12    0.02    0.02
0.8         0.19    <0.01   <0.01     0.03    <0.01   <0.01

Table 4.5: Tier 2 - Standard deviations for TPR, FPR, and accuracy (ACC) for the association rule classifier.


4.1.4 Coverage

In the Association Rule Mining method, thresholds for support and confidence are used when rules are generated. Rules with support lower than 0.01, or with a confidence lower than the specified value of c, are never used (see section 3.2.2 for more information). For some products at some tier, no rules meeting these criteria were found, and the method therefore fails to predict any potential buyers for those products. Figure 4.3 presents the percentage of products for which the association rule classifier was able to predict customers.

      c = 0.2   c = 0.4   c = 0.6   c = 0.8
T2    92%       91%       82%       71%
T1    78%       73%       58%       33%

Figure 4.3: The percentage of products where a satisfying rule could be used.


4.2 Collaborative Filtering

This section presents the results for the Collaborative Filtering method, which is evaluated using different parameters. T1 or T2 refers to clusters defined on tier 1 or tier 2 (section 3.1.1). The variables s and r were introduced in section 3.2.3 and represent the neighbour similarity threshold (s) and the product score threshold (r). The results using the Selected approach are omitted, as that approach failed to find any customer meeting the criteria to be categorized as a buyer (equivalent to classifying everything as negative). Additionally, varying s for T2 had no effect, making it redundant to include more than one value of s in this section. As for s in the T1 iterations, the results were equal for s being 0.8, 0.6, or 0.4; therefore 0.6 and 0.4 are left out.

4.2.1 ROC-plot

Figure 4.4 illustrates how performance is affected as r changes.

Figure 4.4: A ROC-plot illustrating how TPR and FPR are affected by changing r. The four series are T2 s = 0.95, T1 s = 0.95, T1 s = 0.9, and T1 s = 0.8; both axes (FPR on x, TPR on y) range from 0 to 1.


Euclidean Distance

The Euclidean distance between each point in the ROC-plot (fig. 4.4) and (0, 1) is calculated and presented in table 4.6.

r     T1 s=0.95   T1 s=0.9   T1 s=0.8   T2 s=0.95
0.1   0.634       0.521      0.468      0.441
0.3   0.701       0.638      0.618      0.546
0.5   0.777       0.765      0.764      0.702
0.7   0.895       0.894      0.893      0.857

Table 4.6: Euclidean distances to the point (0, 1) in the ROC-plot (fig. 4.4).

4.2.2 Accuracy

Accuracy is presented in figure 4.5.

Figure 4.5: Accuracy for the CF classifier, shown for T2 s = 0.95, T1 s = 0.95, T1 s = 0.9, and T1 s = 0.8 at r = 0.1, 0.3, 0.5, and 0.7.


4.2.3 Tables

Tables 4.7 and 4.9 summarize TPR, FPR, and accuracy (ACC) from sections 4.2.1-4.2.2. Standard deviations for the result series are shown in tables 4.8 and 4.10.

      T1 s=0.95          T1 s=0.9           T1 s=0.8
r     TPR   FPR   ACC    TPR   FPR   ACC    TPR   FPR   ACC
0.1   0.38  0.15  0.85   0.51  0.19  0.81   0.58  0.21  0.79
0.3   0.31  0.12  0.88   0.38  0.14  0.86   0.40  0.14  0.86
0.5   0.23  0.08  0.92   0.24  0.09  0.91   0.24  0.09  0.91
0.7   0.11  0.03  0.97   0.11  0.03  0.97   0.11  0.03  0.97

Table 4.7: Tier 1 - Results in fractional numbers for TPR, FPR, and accuracy (ACC) for the CF classifier.

      T1 s=0.95          T1 s=0.9           T1 s=0.8
r     TPR   FPR   ACC    TPR   FPR   ACC    TPR   FPR   ACC
0.1   0.37  0.09  0.09   0.39  0.11  0.11   0.39  0.12  0.12
0.3   0.35  0.07  0.07   0.37  0.08  0.08   0.38  0.08  0.05
0.5   0.32  0.05  0.05   0.32  0.05  0.05   0.32  0.05  0.02
0.7   0.23  0.02  0.02   0.23  0.02  0.02   0.23  0.02  0.02

Table 4.8: Tier 1 - Standard deviations for TPR, FPR, and accuracy (ACC) for the CF classifier.


      T2 s=0.95
r     TPR   FPR   ACC
0.1   0.75  0.36  0.64
0.3   0.52  0.25  0.75
0.5   0.32  0.17  0.83
0.7   0.14  0.06  0.94

Table 4.9: Tier 2 - Results in fractional numbers for TPR, FPR, and accuracy (ACC) for the CF classifier.

      T2 s=0.95
r     TPR   FPR   ACC
0.1   0.34  0.12  0.12
0.3   0.38  0.09  0.09
0.5   0.35  0.06  0.06
0.7   0.26  0.03  0.03

Table 4.10: Tier 2 - Standard deviations for TPR, FPR, and accuracy (ACC) for the CF classifier.

4.2.4 Coverage

When a product has never been bought before, a traditional CF method fails to generate any predictions. In the suggested approach, the CF classifier would fail if the category (tier 1 or 2) of a new product had never been purchased by any customer. This case did not occur, which means that coverage can be considered 100% for all configurations.


Chapter 5

Discussion

This section discusses the results and how to interpret them in the context of recommendation systems. To better understand the results in a broader perspective, certain aspects of the data set used in the implementation are outlined. Furthermore, the two methods and their results are discussed.

5.1 General

The data set used in this thesis is genuine and originates from a (today) active e-commerce company with a customer base in Europe. This makes the results more relevant than if the data had been artificial. However, the fact that only a single data set was used undermines the strength of the results. Had the implementation included more data sets, the results might have been different, but the confidence in them would have increased. Running the methods against multiple training and test sets was not feasible within the scope of this thesis.

At this time, the e-commerce company is somewhat niched towards a specific type of clothes, while still offering a full collection of apparel types (e.g. hats, jackets, shirts, pants, and accessories). This became apparent during the implementation, as it turned out that one category on tier 2 was overrepresented among all purchases, although purchases of every category still existed. In practice, this means that the average customer likely uses this e-commerce shop to buy clothes of a specific type. That could explain why the Selected distinction was not successful in either of the implemented methods, and especially not in CF, where it was utterly useless. The Selected distinction was included in the suggested methods because it would stimulate recommendations to be less over-specialized (section 2.1.2). Again, the result may be significantly different in another domain, where customers do not mainly consume products of the same category; hence it would be incorrect to reject the idea of the Selected distinction in general.

(41)

The fact that customers tend to buy similar products does not extinguish the business value of the suggested recommendation system. Even though customers may know beforehand what they are buying, recommendations may still offer some value. Additionally, the suggested methods are able to identify potential buyers among the existing customers, which can be used to improve sales, e.g. through targeted promotions.

If the split of the data set into a training and a test set had been different, it would most likely have affected the result. Shrinking the test set would presumably increase the performance, and vice versa. The 90%/10% division was chosen arbitrarily but can be seen as motivated, since in a real scenario 100% of the historical data would be used to predict potential buyers of new products.

In tables 4.2-4.5 and tables 4.7-4.10, one can observe that FPR and ACC are strongly correlated. For the performance values, ACC = 1 − FPR, and for the standard deviations, ACC = FPR. This behaviour is explained by the fact that, for a new product, the average number of actual buyers was very small compared to the whole set of customers (33,200). The average number of customers that bought a product in the test set turned out to be 4.1, which makes TPR have a negligible impact on the accuracy.
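A quick numerical check of this effect, using the averages from the text and a hypothetical FPR value:

```python
# With on average ~4 buyers per product out of 33,200 customers, the
# positives are negligible, so ACC = (TP + TN) / 33,200 ≈ TN / (TN + FP)
# = 1 - FPR regardless of how the few positives are classified.
pos, total = 4, 33200
neg = total - pos
tp, fn = 2, 2            # hypothetical split of the positives
fpr = 0.15               # hypothetical false positive rate
fp = round(neg * fpr)
tn = neg - fp
acc = (tp + tn) / total
print(round(acc, 3), round(1 - fpr, 3))  # → 0.85 0.85
```

Changing the split of the four positives barely moves the accuracy, confirming the observation above.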

5.2 Evaluated Methods

Based on the overall results, Collaborative Filtering (CF) is superior to Association Rule Mining (ARM) for predicting which customers will buy a new product. Comparing table 4.1 to table 4.6, the different configurations for CF form a steeper pattern towards the golden spot in the top-left corner, whilst ARM performs between slightly better than and worse than a random prediction model. A random prediction model would simply select some percentage of the customer set (e.g. 50% = 16,600 customers in this case) and categorize them as positive and the rest as negative. Such a model would perform on the line (in a ROC-plot) which intersects (0, 0) and (1, 1). ARM in some cases achieves higher accuracy than CF, but as noted in the previous section, accuracy turned out to be a misleading measurement for this data set (since a model that classified everything as negative would have, on average, roughly 4 false outcomes out of 33,200, which gives an accuracy of 99.99%).

Even though performance in terms of execution time and memory has not been the primary focus of this thesis, it is still a relevant topic. The difference in execution time depends on what the suggested methods would be used for. In the implementation, the expensive calculations are done in advance. In ARM, all the possible rules are generated before potential buyers are searched for. In CF, the neighbour-similarity matrix is calculated before any prediction occurs. If the methods were used to find all potential buyers (e.g. find customers for a targeted promotion
