
DEGREE PROJECT IN COMPUTER SCIENCE AND ENGINEERING,
SECOND CYCLE, 300 CREDITS
STOCKHOLM, SWEDEN 2016

Evaluating recommendation systems for a sparse boolean dataset

JONAS DANIELS

KTH ROYAL INSTITUTE OF TECHNOLOGY

SCHOOL OF COMPUTER SCIENCE AND COMMUNICATION


Evaluating recommendation systems for a sparse boolean dataset

Evaluering av rekommendationssystem för ett glest booleskt dataset

JONAS DANIELS
JONDA@KTH.SE

Degree Project in Computer Science and Communication (30 ECTS credits)
Degree Programme in Computer Science and Engineering

School of Computer Science and Communication
KTH Royal Institute of Technology, Stockholm, Sweden

Supervisor: Arvind Kumar
Examiner: Erik Fransén

February 10, 2016


Abstract

Recommendation systems are an area within machine learning that has become increasingly relevant with the expansion of the daily usage of technology. The most popular approaches when building a recommendation system are collaborative filtering and content-based filtering. Collaborative filtering in turn contains two major sub-approaches: memory-based and model-based. This thesis explores both content-based and collaborative filtering for use as a recommendation system on a sparse boolean dataset. For the content-based filtering approach, the term frequency-inverse document frequency (TF-IDF) algorithm was implemented. As a memory-based approach, the K-nearest neighbours method was implemented. For the model-based approach, two different algorithms were implemented: singular value decomposition and alternating least squares. To evaluate, a cross-approach evaluator was used that treats the recommendations as a search, a search that the users were not aware of. Key values such as the number of test users who could receive a recommendation, time consumption, F1 score (precision and recall) and the dataset size were used to compare the methods and reach conclusions. The finding of the study was that collaborative filtering was the most accurate choice for sparse datasets. The implemented model-based collaborative filtering algorithm that performed most accurately was singular value decomposition without any regularization against overfitting. A further step for this thesis would be to evaluate the different methods in an online environment with active users giving feedback in real time.

(5)

Referat

Evaluering av rekommendationssystem för ett glest booleskt dataset

Recommendation systems are an area within machine learning that has become increasingly common with the expansion of the daily use of technology. The most popular methods when building a recommendation system are "collaborative filtering" and "content-based filtering". Collaborative filtering also contains two subcategories, "memory-based" and "model-based". This thesis investigates both content-based and collaborative filtering for use as a recommendation system for a sparse boolean dataset. As the content-based strategy, the term frequency-inverse document frequency (TF-IDF) algorithm was implemented. As a memory-based strategy, the K-nearest neighbours (K-NN) method was implemented. For the model-based approach, two different algorithms were implemented: singular value decomposition (SVD) and alternating least squares (ALS). To be able to evaluate the methods against each other, recommendations were viewed as a search, a search that the users were not aware they had made. Key values such as the number of test users who could have received a recommendation, time consumption, "F1 score" (precision and recall) and dataset size were used to compare the methods and draw conclusions. The result of the study shows that collaborative filtering performed best on a sparse dataset. The implemented model-based collaborative filtering algorithm that proved most accurate was SVD without regularization against overfitting. A future extension of this report is to evaluate the methods in an online environment with active users who can give feedback in real time.


Contents

1 Introduction
1.1 Background
1.1.1 Related work
1.2 Question

2 Theory
2.1 Recommendation
2.2 Datasource
2.2.1 Implicit
2.2.2 Explicit
2.2.3 Item-item
2.2.4 User-user
2.2.5 User-item
2.3 Collaborative filtering
2.3.1 Memory-based
2.3.2 Model-based
2.3.3 Hybrid
2.3.4 User distance
2.4 Content-based filtering
2.4.1 Term frequency-inverse document frequency
2.5 Common problems in recommendation systems
2.5.1 Shilling attacks
2.5.2 Cold start
2.6 Evaluation methods
2.6.1 Root mean square error and mean absolute error
2.6.2 Sign test
2.6.3 A/B testing
2.6.4 Trade-offs
2.7 Hybrid between content-based filtering and collaborative filtering
2.8 Machine learning
2.8.1 Feature selection

3 Methods
3.1 Datasource
3.2 Choice of methods
3.2.1 Content-based filtering
3.2.2 Memory-based collaborative filtering
3.2.3 Model-based collaborative filtering
3.3 Content-based filtering
3.3.1 Evaluation
3.4 Collaborative filtering
3.4.1 Memory-based
3.4.2 Model-based
3.4.3 Singular value decomposition
3.4.4 Alternating least squares
3.5 Hardware and software
3.5.1 Math.net
3.5.2 Hardware
3.5.3 Visual studio

4 Results
4.1 Experiments
4.2 Method comparison
4.3 Term frequency-inverse document frequency
4.4 K-nearest neighbours
4.5 Singular value decomposition
4.6 Alternating least squares

5 Discussion and Conclusion
5.1 Discussion
5.1.1 Ethical and sustainability usage of the dataset
5.2 Conclusion
5.3 Future work

Bibliography


Chapter 1

Introduction

In the introduction section a background to the problem will be presented, as well as the problem formulation. Finally, related work will be discussed in order to put the thesis into the context of what has been done before.

1.1 Background

Recommendation systems have continuously evolved in the last couple of decades.

Along with the revolution of computers, people started to use websites as a source of information. The main difference between a physical store and an online store is the number of items the store can display to the customer. In a physical store space is limited and the customer can only encounter so many products, so the shop makes a decision on what items to display. In an online store the customer has the whole stock to browse through. With the ability to display everything came the problem of having too many choices. A recommendation system helps the user with this problem by trying to estimate the probability of a user's appreciation of a product. It then displays the items with the highest estimated appeal to the user as suggestions.

With computers came the possibility to collect user-specific information, which can later be used to make a smaller list of suggestions for the user. In general, a recommendation system can focus on many different aspects. You can focus on the user, find similarities that the user has with other users, and give suggestions based on these correlations. The other way is to look at all the different items, in this case courses, and evaluate how closely they are related to each other. If users like a specific course they might also like another course because it relates to the one that they just displayed an interest in.

In this study, I have investigated the possibility of calculating recommendations for a user based on previous interactions. Multiple methods with different approaches were implemented, tested and compared based on their recommendation qualities.

The different approaches focused on various aspects of the dataset to determine what best suited the case of a sparse boolean dataset. Previously used algorithms


and approaches were considered. Many of the methods have been used with other datasets with varying success. With multiple attempts and adaptations, some of the algorithms could be used as a recommendation system on the dataset at hand.

1.1.1 Related work

In this section an overview of the work that has been done in the field of recommender systems will be presented. Some of the more closely related works will be briefly explained to give insight into the research field of recommender systems.

The area of recommendation systems has boomed with the expansion of the personal computer. A problem that rose along with the personal computer is the Long Tail problem[2, 3]. Part of the long tail problem is having more items than you can display for the user, a problem that did not occur on the same level before the personal computer. Furthermore, a recommendation system can have a few different goals; these are a few of the most common goal areas that have been studied.

• Product recommendations

• News articles

• Movie recommendations

Product recommendations

The area that gets the most attention when it comes to recommendation systems is retail stores[21]. Most of the larger online stores have some kind of recommendation system in place to give each customer a unique experience on the site[13]. Larger stores also have a lot more user data to build the recommendation models on. In some cases customers prefer a recommendation system online rather than a more old-fashioned recommendation from, say, a store clerk or another customer[22]. Even when the recommendation from another customer or a store clerk was more suitable for the customer's needs, more trust was put in the recommendation system's suggestion. Product recommendations follow the same pattern as movie recommendations and news article recommendations. Item-to-item collaborative filtering is a common way to group items together and make recommendations within the same group[13]. Item-to-item recommendation might be effective in some cases but disastrous in others. For example, if you just purchased a new camera you are unlikely to buy another camera, so recommendations for more cameras would be unnecessary; but collaborative filtering would also group in items that other customers who bought a camera also bought. In most cases these items are complementary to the camera, for example a camera bag or a lens. Hence each user's experience should be so unique that it differs largely from other users' experiences, as stated by Jeff Bezos:


“If I have 3 million customers on the Web, I should have 3 million stores on the Web.”

— Jeff Bezos, CEO of Amazon.com

News articles

Another area where recommendations are used is news articles. The main idea behind news article recommendation is to evolve based on the feedback the system gets from the users. It is therefore harder to test a recommendation algorithm for news articles offline, as the recommendations shown to the user will also influence which articles the user clicks on[12]. In a study made for Yahoo, the authors managed to increase the total number of news articles clicked by 12.5% by adding contextual recommendation based on the users and the items[12]. This study was performed on the Yahoo front page, giving it high traffic to back up the result. The authors also highlighted the issue that a normal collaborative filtering algorithm is too static to work in the dynamic world of news articles.

Netflix

Netflix has had a big impact on the research development in the recommendation field. This is due to the Netflix Prize competition that started in 2006[16]. The Netflix Prize was a competition that challenged anyone to make a recommendation system that was 10% better than Netflix's own recommendation system Cinematch[15], for a grand prize of one million dollars. It took about three years for a team of researchers to beat the 10% improvement mark. The winner was the multinational team of seven people called BellKor's Pragmatic Chaos, and their solution was a hybrid combining many different algorithms[4]. The Netflix recommendation works in a similar environment to this thesis, an environment where over 99% of the entries in a user-to-item matrix are empty. The advantage that Netflix has is their rating scale and the fact that they have a lot more data concerning each user.

1.2 Question

There are two explored approaches to developing a user recommendation system: content-based filtering and collaborative filtering. How can the two categories of recommendation system approaches be compared in order to choose the one best suited for a sparse boolean dataset?


Chapter 2

Theory

The theory section gives an informative view of the area in which this thesis was conducted. I briefly discuss what a recommendation is and the underlying theory about data sources. The section also covers recommendation approaches and evaluation methods. Finally, it explains the theoretical parts of the recommendation algorithms later implemented in this study.

2.1 Recommendation

Generally speaking, a recommendation system is a way to guess how a user would rate certain things. If you give a recommendation to another person, you guess how that person would rate that item. A successful recommendation is when the guessed rating is close to the person's opinion of the product after using it. A recommendation can be based on different aspects such as the content of the items, previous history, other users, and countless more. These aspects can also be combined to make more qualified guesses. How the calculations are made to arrive at the recommendation divides the different methods into two approaches: content-based and collaborative filtering.

2.2 Datasource

The first step in a recommendation system is deciding what data to look at and take into account when making the recommendation. There are different types of item- and user-related data, and different ways in which the data can be collected. In order for a recommendation system to work properly, one needs to know what type of data is available and how the data is related. As a generalization, data can be divided into two types of sources: implicit and explicit[17].


2.2.1 Implicit

An implicit datasource is a source that was not created by the user intentionally. An implicit datasource is created in the database from actions the user made in the system that involve no user feedback. These implicit actions are hard to interpret for the designer of the recommendation system, since it is impossible to tell if an encounter was positive or negative. For example, if a user visits a specific course URL on a website this is implicit data, as we cannot be sure whether this was a positive or a negative interaction for that particular user: did the user like the information on that site or not? An implicit data source will generally weigh less than an explicit data source; however, you usually have a larger amount of implicit data since it does not require user feedback, which makes it easier to collect. Therefore a single implicit data point is hard to make a prediction from, but with access to many implicit sources you might be able to make some valid predictions from these data points.

2.2.2 Explicit

In an explicit datasource the user gives direct feedback. A source of direct information would, for example, be a user giving a rating of a course that they participated in. In a recommendation system, the type of datasource that gives back the most valid information is explicit data. But most of the data that is available for a recommendation system to work with is implicit, and in order to get a less sparse datasource the two types can be combined and weighted by importance.

2.2.3 Item-item

In an item-to-item recommendation system it is the items that are in focus. The base of the recommendation will come from the relationship between different items. Most people who have purchased something online have encountered this type of system, usually after putting something in the shopping cart and heading for the checkout. The site then suggests that an item might be of interest as it shares a relationship with the item you have chosen to purchase. An item-to-item recommendation system is really cheap to calculate in real time, since the system can pre-calculate an item-to-item scheme among all the items in the store. This scheme only needs to be updated when new items are added to the store.

2.2.4 User-user

Apart from the item-item recommendations, a user-to-user recommendation ignores the actual items in the system and focuses on similarities between users. User-to-user will find the user or users closest to your own likings. A typical example would be a dating site algorithm, where the user wants to find someone that is as similar as possible to oneself. In the example from the item-to-item recommendation, a user-to-user recommendation would be if the suggestion comes after you fill in your profile on the website and you get users that “match” you. This kind of system is more expensive to calculate, since users using the system trigger the need for an update. Filling in your “profile” requires users to voluntarily enter information. This is also an example of explicit data, since it comes from user feedback and can be used to make a better recommendation.

2.2.5 User-item

User-item recommendation is based on the relation between users and items. Knowing the relation between a certain user and a certain item gives a recommendation system a good starting point to decide if that user desires that item. The big problem with this starting point is that most of the relations between items and users are unknown: if you have a database with thousands of items, a user will not have seen all of them. An example of a system that uses this approach would be supermarkets, where you have member cards and later get a reduced price on your “favorite”, or most frequently purchased, items.

2.3 Collaborative filtering

Collaborative filtering is a recommendation approach where you focus on the distance between users or items based on actions from other users in the system. Collaborative filtering is a method where the recommender gains its knowledge from the users' actions. Later the recommender will use this information to aid other users' recommendations. You can interpret this approach as a large matrix between users and items; in this thesis the items are represented as courses. The matrix rows contain all the users and the columns contain the different items. Each entry in the matrix contains information about how a user regards an item. The goal of the collaborative filter is to calculate the entries for items that the user has not yet interacted with. The calculations can be put into two main categories: memory-based and model-based.

2.3.1 Memory-based

Memory-based collaborative filtering is a method where you use the collected data directly to make a prediction of the user's preferences. Memory-based methods can use the entire dataset or a part of it to make their predictions. One of the most common memory-based approaches is K-nearest neighbours. The memory-based approach is a valid and feasible method for smaller datasets; however, the memory used by the user × item matrix grows out of proportion for a normal computer.


K-Nearest neighbours

In this algorithm the distance between the user we want to make a prediction for and the rest of the users in the dataset is calculated. After that you have a list of all the users, ordered by their “distance” from the target user. The prediction is achieved by taking the K first users in that list and basing the recommendation on those users. To make the recommendation, the top N items with the highest predicted scores are selected.
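A minimal sketch of this neighbour selection, assuming users are represented as boolean item vectors and using cosine similarity as the (inverse) distance; the names are illustrative, not the thesis's own code:

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

static class Knn
{
    // Cosine similarity between two boolean vectors: larger value = smaller distance.
    static double CosineSimilarity(bool[] a, bool[] b)
    {
        int dot = 0, na = 0, nb = 0;
        for (int i = 0; i < a.Length; i++)
        {
            if (a[i] && b[i]) dot++;
            if (a[i]) na++;
            if (b[i]) nb++;
        }
        return (na == 0 || nb == 0) ? 0.0 : dot / (Math.Sqrt(na) * Math.Sqrt(nb));
    }

    // Indices of the K users most similar to `target`.
    public static int[] NearestNeighbours(bool[] target, IList<bool[]> users, int k) =>
        Enumerable.Range(0, users.Count)
                  .OrderByDescending(u => CosineSimilarity(target, users[u]))
                  .Take(k)
                  .ToArray();
}
```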

2.3.2 Model-based

Model-based collaborative filtering methods are a category of methods where the goal is to build a predictor. This predictor can require different kinds of input depending on what method you are using. The model can be built on everything from a single user's data to the whole set of users. If the model is built on a single user's data, the input would be new courses or old ones that have not yet been viewed or considered. If the predictor is constructed on the whole set of users, the input would be a new user.

Singular value decomposition

Singular value decomposition is a factorization of a matrix. In the case of recommendation systems, SVD can be used to reduce the dimensionality of the problem and identify features of the users and the items. An advantage of SVD is that you do not have to understand what the features represent; it is usually hard to interpret them unless the SVD is done on fabricated data designed to provoke a certain behavior.

The SVD of a matrix $X$ is $X = USV^T$, where $X$ is an $M \times N$ matrix, $U$ is an $M \times R$ matrix, $S$ is an $R \times R$ matrix, and $V^T$ is an $R \times N$ matrix. $S$ is a diagonal matrix that contains all the singular values of $X$ on the diagonal.

If used in a recommendation system, you can look at the singular values as the features: the larger the value, the more the data points towards this being a genuine feature. If the singular value is a small number, the $X$ matrix does not contain enough data pointing to it actually being a feature in the data. Here comes the possibility to reduce the dimensions of the matrix by eliminating features. To eliminate a feature, the corresponding singular value (the square root of an eigenvalue of $X^TX$) is set to zero, allowing this feature no impact on the predicted user ratings[7]. Implementing SVD with feature elimination would be considered a machine learning algorithm with feature selection between rounds, where the feature selection process is an example of an embedded feature selection approach: the algorithm removes features and evaluates the prediction rate for the new matrix.

The matrix $U$ can be interpreted as each user's affinity to the different features. Each row in the $U$ matrix represents a user and each column a feature, so the importance of feature $a$ for user $b$ can be found in $U$ at $[b, a]$.

In the same way that $U$ represents the user-to-feature connection, the $V^T$ matrix represents the feature-to-item connection and shows how important the different features are for the items. In a surrogate dataset a feature might for example be computer usage. It is easy to understand why this feature would be important for a course in C# or a course in computer hardware, but perhaps not for, say, a gardening course.

$$
X_{m,n} =
\begin{pmatrix}
u_{1,1} & u_{1,2} & \cdots & u_{1,r} \\
u_{2,1} & u_{2,2} & \cdots & u_{2,r} \\
\vdots & \vdots & \ddots & \vdots \\
u_{m,1} & u_{m,2} & \cdots & u_{m,r}
\end{pmatrix}
\times
\begin{pmatrix}
s_{1,1} & s_{1,2} & \cdots & s_{1,r} \\
s_{2,1} & s_{2,2} & \cdots & s_{2,r} \\
\vdots & \vdots & \ddots & \vdots \\
s_{r,1} & s_{r,2} & \cdots & s_{r,r}
\end{pmatrix}
\times
\begin{pmatrix}
v_{1,1} & v_{1,2} & \cdots & v_{1,n} \\
v_{2,1} & v_{2,2} & \cdots & v_{2,n} \\
\vdots & \vdots & \ddots & \vdots \\
v_{r,1} & v_{r,2} & \cdots & v_{r,n}
\end{pmatrix}
\quad (2.1)
$$

$$
X_{m,n} = U_{m,r} \times S_{r,r} \times V_{r,n}^T \quad (2.2)
$$

The matrix $X$ is the one that needs to be constructed and then factorized using SVD. This can be done in many ways, but to keep the statements about what the $U$ and $V$ matrices contain, the $X$ matrix needs to contain the information given to us by the users about each of the items. The matrix is therefore a user-item matrix, where each entry is the “rating” or appeal that a user expressed towards that item.

Alternating least squares

ALS, also known as alternating least squares, is a method that you can use when you have more than one overdetermined equation system that depend on each other[25]. In an overdetermined equation system, where you have more equations than unknown variables, there is in general no exact solution. However, you can always find the best approximation. Least squares is a method where you find the approximation for the overdetermined equation system such that the sum of squared errors is minimized. Alternating least squares (ALS) combines the idea of least squares with the problem of systems that depend on each other. The method locks all but one system from changing and then runs the least squares algorithm on that system, finding a least squares solution. After the least squares solution is found, the method alternates to the system that was locked and finds a new least squares solution. ALS loops until the error improvement between runs is low enough. The stop margin is highly dependent on the dataset and its size; if an average error is used instead of a total error, the dataset size no longer affects the error.
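Written out for a rating-matrix factorization $X \approx UI^T$ ($U$: users × features, $I$: items × features), the two alternating updates in the unregularized case are the standard normal-equation solutions (a textbook formulation, not a quote from this thesis):

$$
U_u \leftarrow (I^T I)^{-1} I^T x_u, \qquad
I_i \leftarrow (U^T U)^{-1} U^T x_{\cdot i}
$$

where $x_u$ is user $u$'s row of $X$ and $x_{\cdot i}$ is item $i$'s column. Each update is the exact least squares solution for one side while the other side is held fixed.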


2.3.3 Hybrid

Some of the more advanced collaborative filtering systems are hybrid-based. This is done to eliminate the shortcomings of the different approaches, such as the scalability and cold start problems described below. However, with the combination of two different approaches come new problems, and hybrids are therefore much more complex to implement than a single-method approach.

2.3.4 User distance

In collaborative filtering the biggest decision is choosing the method of calculating the user distances. The user distance is, as the name suggests, the distance between two users: the calculable difference between them. The user distance can be calculated in a number of ways. The easiest way to measure the distance would be to take the sum over all the items the two users have rated and normalize it by the number of items they have both rated[5]. User distance could also be considered a similarity measurement.

2.4 Content-based filtering

Content-based filtering has its roots in search-related research. If you look at recommendation systems, a recommendation is really just a search for something you like. What content-based approaches do is look at the content of the items or users. In general, one can look at a content-based recommendation system as an information retrieval approach that also takes the user's interests into the equation[14]. The central part of a content-based system is the way you measure distance. The distance between items and/or users is based on the content of the items or users, hence the name. However, there is more than one way to measure the distance between content. In this thesis the items are mostly text-based, and distance must therefore be calculated using an algorithm that measures the distance between text snippets. The method used in this paper is TF-IDF, also known as term frequency-inverse document frequency.

2.4.1 Term frequency-inverse document frequency

TF-IDF (term frequency-inverse document frequency) is a measuring method for calibrating the importance of a word in a text document[11]. TF-IDF consists of two parts, TF and IDF. TF is the term frequency indicator.

$$TF_{i,j} = \frac{f_{i,j}}{\max_o f_{o,j}} \quad (2.3)$$

Here $TF_{i,j}$ is the term frequency of a word $i$ in an item $j$. The denominator $\max_o f_{o,j}$ is for normalization and represents the number of times the word with the most occurrences, word $o$, occurs in item $j$. That means the largest TF score a word in an item can ever receive is 1.


The next part is IDF, the inverse document frequency. The IDF is calculated only once for every term. The IDF for a term $i$ is based on $n_i$, the number of items that contain the term $i$, and $N$, the number of items in the dataset.

$$IDF_i = \log_2(N/n_i) \quad (2.4)$$

The IDF part reduces the stop words' TF-IDF scores, and the terms with the highest TF-IDF are the terms that best describe the item and also set it apart from other items.

The score for term $i$ in item $j$ is then the combination of equation 2.3 and equation 2.4:

$$TF_{i,j} \times IDF_i \quad (2.5)$$

Stop words

TF-IDF reduces the importance of stop words, words such as “and, the, if, at”. You can consider stopwords to be the most common words used when writing a text: words that are not really item-specific and do not provide any insight about the text.
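A minimal sketch of the scoring described by equations 2.3-2.5, assuming items are already tokenized and stopword-filtered (illustrative helper, not the thesis's implementation):

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

static class TfIdf
{
    // Returns one score map per item: word -> TF * IDF. Items must be non-empty.
    public static List<Dictionary<string, double>> Score(List<string[]> items)
    {
        int n = items.Count;
        var docFreq = new Dictionary<string, int>();      // n_i per term
        foreach (var item in items)
            foreach (var word in item.Distinct())
                docFreq[word] = docFreq.GetValueOrDefault(word) + 1;

        var scores = new List<Dictionary<string, double>>();
        foreach (var item in items)
        {
            var counts = item.GroupBy(w => w).ToDictionary(g => g.Key, g => g.Count());
            double max = counts.Values.Max();             // f of the most frequent word
            scores.Add(counts.ToDictionary(
                kv => kv.Key,                             // TF (eq 2.3) * IDF (eq 2.4)
                kv => (kv.Value / max) * Math.Log(n / (double)docFreq[kv.Key], 2)));
        }
        return scores;
    }
}
```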

2.5 Common problems in recommendation systems

2.5.1 Shilling attacks

A shilling attack is where the owner of a product rates their own items highly and gives competitors' items low ratings[10]. The approaches are affected differently by shilling attacks. The collaborative filtering approaches can suffer a larger impact, since the user input affects the outcome of the recommendation system. On the other hand, the content-based methods use the information about the items and not the input from the users, which makes content-based filtering immune to these types of attacks.

2.5.2 Cold start

The cold start problem is one of the hardest and most common problems in a recommendation system[24]. It occurs when you have no, or a limited amount of, data about a user. Usually this happens when a new user enters the system. Users that we have obtained a significant amount of information from or about provide a much better chance of recommending a “correct” suggestion. This problem is easy to understand if you apply it to the real-life problem of giving a recommendation to a friend or to a stranger: for a friend you can make a qualified suggestion, but for a stranger who has not provided any information one can only guess.

Many different approaches have been tried in an attempt to reduce the cold start problem. The simplest way to handle it is to take the average score for the items and display the highest rated items to a new user. This recommendation is not always a good one, but it will be more accurate than random recommendations. Some better-performing examples are semi-supervised learning methods, but they can be hard to implement and are highly dependent on what type of data the recommendation is based on.

2.6 Evaluation methods

In order to decide which method to use for the recommendation, you first need to benchmark or evaluate the methods. The evaluation part is one of the most important parts if you want to find out how accurate an algorithm really is. In search you often talk about precision and recall; this can also, with a little modification, be applied to recommendations[8].

            Actually good        Actually bad
Rated good  TP (true positive)   FP (false positive)
Rated bad   FN (false negative)  TN (true negative)

Table 2.1. Prediction table

From the table above we can formulate the precision as all valid recommendations divided by all recommendations.

$$\text{precision} = \frac{TP}{TP + FP} \quad (2.6)$$

The recall can be defined as the correctly recommended items divided by all the actually good items.

$$\text{recall} = \frac{TP}{TP + FN} \quad (2.7)$$

Precision and recall are usually trade-offs: if one increases, the other is reduced. The goal is to reduce both $FP$ and $FN$ to zero. This is almost impossible to achieve, so you have to make trade-offs. Is it better or worse to get an $FN$ or an $FP$? Is it better to recommend something that the user does not find appealing? That might decrease the trustworthiness of the recommendation system. On the other hand, if the user is not shown something that he or she would be interested in, the recommendation loses its purpose.

A good score for how well the recommendation system performs is obtained by combining precision and recall.

$$F_1 = \frac{2 \times \text{precision} \times \text{recall}}{\text{precision} + \text{recall}} \quad (2.8)$$

This is really easy to implement and gives back a score that focuses on the trade-off between precision and recall. More widely used methods are RMSE (root mean square error) and MAE (mean absolute error), which are still really simple but more customized for recommendation systems.
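As a quick sketch, computing these scores from the confusion counts of Table 2.1 (plain helper; denominators are assumed non-zero):

```csharp
// Precision (eq 2.6), recall (eq 2.7) and F1 (eq 2.8) from confusion counts.
static (double Precision, double Recall, double F1) Score(int tp, int fp, int fn)
{
    double precision = tp / (double)(tp + fp);
    double recall    = tp / (double)(tp + fn);
    return (precision, recall, 2 * precision * recall / (precision + recall));
}
```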


2.6.1 Root mean square error and mean absolute error

The first thing to consider when evaluating a recommendation system is the difference between ranked and unranked systems[8]. A ranked system has items rated by the users, while an unranked system only has liked or disliked information. In both cases RMSE and MAE work. Root mean square error (RMSE) and mean absolute error (MAE) both calculate the deviation between the predicted rating and the actual rating. If the system is an unranked system the value ranges from zero to one, and if it is a ranked system it ranges from the lowest to the highest rating.

MAE and RMSE are almost the same algorithm, but RMSE puts greater weight on large deviations. Both take the sum of the deviations between the predicted rating $p_i$ and the actual rating $r_i$ over all $n$ items.

$$MAE = \frac{1}{n} \sum_{i=1}^{n} |p_i - r_i| \quad (2.9)$$

$$RMSE = \sqrt{\frac{1}{n} \sum_{i=1}^{n} (p_i - r_i)^2} \quad (2.10)$$
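A direct transcription of equations 2.9 and 2.10 (assuming p and r have equal, non-zero length):

```csharp
using System;
using System.Linq;

static (double Mae, double Rmse) Errors(double[] p, double[] r)
{
    double mae  = p.Zip(r, (pi, ri) => Math.Abs(pi - ri)).Average();               // eq 2.9
    double rmse = Math.Sqrt(p.Zip(r, (pi, ri) => (pi - ri) * (pi - ri)).Average()); // eq 2.10
    return (mae, rmse);
}
```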

2.6.2 Sign test

The ability to generate a good recommendation is also the ability to generate a more accurate option than a random choice. As soon as a choice is better than chance, it is possible to consider that you have a working model. In this paper the different recommendation methods are compared to one another. There are different ways to compare how well a system is doing in comparison to another system. One of the most popular is the per-user case[23]: in this method you count the number of users for which a certain system gave the best result. This count is done over all the different users and can later easily be compared in order to decide the best performing system. This is also called the sign test. The sign test gives the probability that recommender A is better than B by taking a part of the user base and deciding, for every user, which system performed most accurately for that user. A live implementation of the sign test would be A/B testing, a popular method to evaluate different choices, which is discussed in the next section.

2.6.3 A/B testing

In web development, a common method for evaluating different choices is A/B testing[9]. It is a simple but effective way to test a feature such as a recommendation system. The idea is to display different options to users, under the assumption that the types of users who see the different options are equally distributed. For a recommendation system evaluation, this means displaying different recommendation systems to users and seeing which recommendation system generated recommendations that were selected by the user. The A/B test would then be able to choose the best recommendation system and display that to all the users, or start a new A/B test with the best option and a new option. A/B testing is highly dependent on the assumption of a statistical distribution over which users view the different pages. As in all statistical measurements, the more data that is collected, the higher the confidence level of the result. The result from an A/B test would be the best assurance that a certain recommendation algorithm is the superior one.

2.6.4 Trade-offs

“You cannot eat the cake and have it as well” is a common saying that describes the trade-off dilemma quite well. In the case of search algorithms the precision-recall trade-off is crucial, i.e. you cannot get all the search results that are relevant if you want to get only the relevant search results. In recommendation systems there are also trade-offs. To understand what they are, you first need to understand the concept of recommendation. One of the most common trade-offs is the novelty-and-serendipity aspect versus the most-similar aspect. Giving the user a serendipitous experience is only relevant if the user is interested in experiencing something new at that moment. On the other side of the spectrum is giving the user a “safe” option that the user does not rate as inaccurate. Many of the trade-offs for the various recommendation systems can be removed, or at least reduced, by combining the various recommendation systems. This is not so hard to grasp if you imagine the scenario of giving a friend advice about a movie. First you could give the friend a recommendation based on what movies the friend has seen. This would force you to choose between movies that are similar to the movies the friend has seen: a content-based suggestion. This might not give very serendipitous suggestions, but if you also look at what other people have watched or what is popular (collaborative filtering), you can give a suggestion based both on what movies are similar and on what other people who like the same movies also preferred.

2.7 Hybrid between content-based filtering and collaborative filtering

A hybrid approach between content-based and collaborative filtering would most likely be the best method of recommendation in real life. This is based on the ability to gather the positive aspects from both sides. The worst case would be performing the same as a single system, since there are more relations in the data available to base the recommendation on.

2.8 Machine learning

Data mining and machine learning are closely related areas, and in this thesis machine learning will be applied in order to make better recommendations for users of a web application. Machine learning is an area that can “learn” from previous data and make predictions, based on experience, about new data. In the recommendation case, the previous data is information from the database and the new data is the active users presently on the website. Within the timespan of the thesis work it was not possible to perform online tests, so the new data is data from the database that was hidden from the recommender during the learning phase. The goal of the machine learning is to acquire the knowledge of what the user will do before the user performs the action, also known as a prediction, and in this thesis a prediction is handled as a recommendation.

2.8.1 Feature selection

One of the most challenging parts of machine learning is the feature selection problem. Feature selection is, as the name suggests, the process of selecting the features to be used in the machine learning algorithm. Features are aspects of an item or a user. There are many different ways of selecting features, but they can be divided into the categories wrappers, filters, and embedded methods[6]. The filter method filters the features based on statistical measurements, where the features are considered to be independent. In the filter approach, domain knowledge can help make a qualified guess on what features to use, reducing the need to search through the whole feature set. In the wrapper method, the problem of feature selection is considered a search problem. One such method is recursive feature elimination, where a feature is removed from the predictor and the result is compared to the predictor before the elimination: if the result improved, the feature stays removed and the next feature is tested; otherwise it is returned to the feature set before the next feature is removed and tested. Embedded methods make the feature selection while the machine learning algorithm is running. The most common machine learning algorithms that use this kind of system are regularization algorithms, which can evaluate the importance of a feature between iterations.
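The wrapper-style recursive elimination described above can be sketched as follows, with an assumed evaluate delegate that trains and scores a predictor on a given feature subset:

```csharp
using System;
using System.Collections.Generic;

static class FeatureSelection
{
    // Keeps a feature removed only if the predictor scores at least as well without it.
    public static List<int> RecursiveElimination(
        List<int> features, Func<List<int>, double> evaluate)
    {
        double best = evaluate(features);
        foreach (int f in new List<int>(features))  // iterate over a copy
        {
            features.Remove(f);
            double score = evaluate(features);
            if (score >= best) best = score;        // improvement: keep it removed
            else features.Add(f);                   // no improvement: restore it
        }
        return features;
    }
}
```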


Chapter 3

Methods

This section of the thesis discusses the methods implemented and the motivations behind choosing them, as well as the tools used for the implementation.

3.1 Datasource

The datasource used in this thesis consists of two parts. The first part is connections between users and items; a connection is created when a user requests more information about a specific item. Figure 3.1 shows a graphical view of that data. The full dataset consists of about 3 million connections. The transformation shown in the figure also illustrates how the collaborative filtering concept works: users B and C both have something in common with A and can get recommendations from other items that user A has preferred. The second datasource is the item descriptions, which are also visible to all users visiting the site.

3.2 Choice of methods

The three main approaches to a recommendation system (content-based, memory-based CF and model-based CF) each contain many different categories of algorithms. To make this thesis possible to carry through in the given time frame, one method from each of the areas was selected and implemented.

3.2.1 Content-based filtering

The recommendation problem can be viewed as a search problem, where you find an item for the user that the user was not aware they wanted to search for. In the area of content-based filtering, a method with its roots in search was therefore chosen and implemented. TF-IDF was chosen for its wide use in the field of ranking pages in search[19], applying it to the content-based recommendation field where highly ranked items are the more important ones[1].


Figure 3.1. From data source to basic data structure

3.2.2 Memory-based collaborative filtering

The other methods that were implemented are in the area of collaborative filtering, also known as CF. The reason multiple implementations were done for collaborative filtering and only one for content-based is that the methods within collaborative filtering are much more diverse and can be divided into two subcategories; one algorithm from each subcategory was implemented. In the subcategory memory-based collaborative filtering, a K-nearest neighbours algorithm was implemented. The memory-based algorithm could consider all the users in the system[5]; this happens when K equals the size of the user base minus one. To make the algorithm scalable and feasible to calculate, a limit on K was added. Since the dataset was really sparse, a close approximation was achieved even though the whole dataset was not used, the main reason being that the number of connections a user has to other users was rarely larger than ten for the dataset size that was used. The average number of connections grows as the dataset increases, and a smaller K might be needed to handle the load of larger datasets.


3.2.3 Model-based collaborative filtering

The final methods that were implemented are the model-based methods belonging to the collaborative filtering part of the recommender spectrum. As with the other areas of collaborative filtering, model-based also has a lot of different methods to choose from. The selection of the first method was based on the fact that the dataset was sparse. Looking for a method that reduces the dimensions and finds a solution in a lower-dimensional space, I found SVD to be relevant. SVD has previously been used successfully to reduce the dimension space of features in recommendations[20]. SVD also fits well with the problem at hand, since it only requires knowing the status or rating between users and items, which happens to be what we are working with. The second method that was implemented is closely related to “normal” SVD, but the unknown values are calculated rather than guessed as they are in the “normal” SVD algorithm. The algorithm used is alternating least squares, or ALS for short[25]. It was used on rating scores and numeric scale values by its authors and was implemented here with a twist to make it possible to use on boolean values.

3.3 Content-based filtering

The method implemented for the content-based approach was TF-IDF (term frequency-inverse document frequency) scoring of items based on the items' description texts. This gave each item a vector containing the scores for all the words in every item's description.

$$item = [w_1, w_2, \dots, w_n] \quad (3.1)$$

where $n$ is the number of unique words in the dataset.

Stopwords were removed, and regular expressions helped remove HTML code, numbers and URLs and split the descriptions before the TF-IDF scoring algorithm scored the words for the items.

$$1 = [w_1, w_2, \dots, w_n] \times [w_1, w_2, \dots, w_n]^T \quad (3.2)$$

A normalization of the scores was made so that each course had a 1.0 correlation with itself (equation 3.2). This normalization comes from using the cosine similarity. The cosine similarity will have at most a 90 degree angle between two document vectors.

A 90 degree angle means zero correlation and a zero degree angle means a correlation of 1.0, in other words the same context. To get a recommendation for a user from this correlation matrix, the items that a user liked were put into a user vector, and each item that correlated with this user vector was interpreted as a recommendation. There was no information about which items the user liked more than others, so all of the items were weighted the same. If such information were available it could easily be added into the equation before the summation.


Figure 3.2. Item-item correlations

$$userVector = \left[ \sum_{t=1}^{p} t_1, \; \sum_{t=1}^{p} t_2, \; \dots, \; \sum_{t=1}^{p} t_n \right] \quad (3.3)$$

where $p$ is the number of liked items. The correlation to other items was calculated as the dot product between the user vector and the other items.

$$A \cdot B = \sum_{i=1}^{n} A_i B_i \quad (3.4)$$

The item with the highest correlation is the top recommendation for the user.

The items that build up the user vector are removed from the candidate items before the correlations are computed, so the system does not recommend items that were already considered.
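A minimal sketch of this recommendation step, assuming each item already has a TF-IDF vector (illustrative names, not the thesis's code):

```csharp
using System.Collections.Generic;
using System.Linq;

static class ContentBased
{
    // Sum the liked items' vectors into a user vector (eq 3.3), then rank the
    // remaining items by dot product with it (eq 3.4).
    public static int[] Recommend(
        double[][] itemVectors, HashSet<int> likedItems, int topN)
    {
        int n = itemVectors[0].Length;
        var userVector = new double[n];
        foreach (int item in likedItems)
            for (int w = 0; w < n; w++)
                userVector[w] += itemVectors[item][w];

        return Enumerable.Range(0, itemVectors.Length)
            .Where(i => !likedItems.Contains(i))   // skip already-considered items
            .OrderByDescending(i =>
                itemVectors[i].Zip(userVector, (a, b) => a * b).Sum())
            .Take(topN)
            .ToArray();
    }
}
```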

3.3.1 Evaluation

How well the recommendations performed is determined by splitting the “liked” items into a training set and a test set, where the training set constructs the user vector described in the previous section. When a prediction is made from the user vector, the goal is for the prediction to include all the items in the test set. The evaluation of the algorithm is done with precision/recall scoring.

3.4 Collaborative filtering

A large difference between content-based and collaborative filtering is the dataset that can be utilized. Since the differences between the approaches are so large, the same dataset cannot be used for all of them. For the collaborative filtering part, the dataset consisted of user-item data.


3.4.1 Memory-based

In the memory-based implementation the method of choice fell on the K-nearest neighbours algorithm. The distance was calculated using the dot product, normalized as the cosine similarity. K-NN was applied on the user-item information, calculating the distance between users based on what items they have interacted with. The distance between two users is later directly related to how much the users affect each other's recommendations. After the K-NN algorithm has calculated the distance between all the users, the recommendation step is introduced. In the same manner as the content-based recommendations, the top ten correlating users are chosen. The users connected to correlating users can now be introduced to items that these correlating users have interacted with. Each item that another user has liked is weighted in as a recommendation for the user; the weight for a certain item is the importance of that item for the user “suggesting” it. A graphical interpretation for a smaller set of data points is shown in figure 3.3, where the darkness of a point represents the correlation: a darker point exhibits a higher correlation to that user. Since the image is a representation of a user-to-user matrix, the diagonal represents the correlations of the users to themselves.

Figure 3.3. User-user correlations
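The weighted scoring from neighbours can be sketched as follows, given the nearest neighbours and their similarities (assumed inputs; the neighbour selection itself was sketched in section 2.3.1):

```csharp
using System.Collections.Generic;
using System.Linq;

static class MemoryBased
{
    // neighbours: user id -> similarity to the target user.
    // interactions: user id -> item ids that user has interacted with.
    public static int[] Recommend(
        Dictionary<int, double> neighbours,
        Dictionary<int, HashSet<int>> interactions,
        HashSet<int> alreadySeen, int topN)
    {
        var scores = new Dictionary<int, double>();
        foreach (var pair in neighbours)
            foreach (int item in interactions[pair.Key])
                if (!alreadySeen.Contains(item))   // never re-recommend seen items
                    scores[item] = scores.GetValueOrDefault(item) + pair.Value;

        return scores.OrderByDescending(kv => kv.Value)
                     .Take(topN).Select(kv => kv.Key).ToArray();
    }
}
```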

3.4.2 Model-based

The model-based solution was first implemented just as an SVD (singular value decomposition), reducing the dimensions of the dataset with a certain threshold. Later the first implementation was extended, giving the method a machine learning ability by using ALS to reduce the error between the recommendations and the actual values.


3.4.3 Singular value decomposition

The singular value decomposition was implemented in a simple manner that was not very memory- or computation-efficient. This limited how many users and items could be included in the computations in order not to run out of memory. Please see the theory section 2.3.2 for a complete explanation of the method; in short, you break a matrix down and build it back up again while removing some dimensions. The implementation was made in C# with the help of the Math.NET Numerics library, which provides good support for the matrix computations and SVD construction. A matrix between the users and the items was completed. Most of the entries are missing, since users only interacted with a smaller portion of the item set. The unseen entries in the training set are given a value that is the average of that particular item and the average of that particular user. This gives the items that have had more interaction with the users a higher starting value; a start value given in this way presumes that an item most people like will on average also be liked by the rest. The “standard” SVD was implemented with the help of Math.NET. The Math.NET version decomposes the matrix into three matrices: one matrix for the items to features, one matrix for the users to features, and a third for the “feature correctness”. This last matrix contains the singular values of the decomposition on the diagonal and zeros elsewhere. Reducing one of the non-zero values to zero removes a feature, or a dimension, when the matrix is put back together.
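A minimal sketch of this truncation with Math.NET Numerics (matrix construction and the averaging of unseen entries are omitted; a sketch, not the thesis's exact code):

```csharp
using MathNet.Numerics.LinearAlgebra;

static class ModelBased
{
    // Rebuild X from its SVD with all but the first k singular values set to zero.
    public static Matrix<double> TruncatedSvd(Matrix<double> X, int k)
    {
        var svd = X.Svd(computeVectors: true);
        var W = svd.W;                          // diagonal matrix of singular values
        for (int i = k; i < W.RowCount && i < W.ColumnCount; i++)
            W[i, i] = 0.0;                      // eliminate feature i (section 2.3.2)
        return svd.U * W * svd.VT;              // lower-rank approximation of X
    }
}
```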

3.4.4 Alternating least squares

Alternating least squares was implemented, as with the rest of the methods, in Visual Studio. The ALS implementation was parallelized so that it could run within an acceptable timeframe; even so, it was still the slowest and most time consuming method implemented. ALS has been used with success in the recommendation area before[25]. The main difference from previous implementations is that this application only works with a sparse boolean dataset. The predecessor implementation used a regularization term to prevent the method from overfitting. That term was not used in the implementation for this thesis, since it was found to be unnecessary when working with a boolean dataset as sparse as the one at hand.

ALS was implemented with two matrices, U and I. Both of these contain information about features: U maps the features to the users and I maps the features to the items.
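An unregularized ALS loop in the spirit of the description above can be sketched with Math.NET (illustrative; the thesis's parallel implementation is not shown, and convergence checking is replaced by a fixed iteration count):

```csharp
using MathNet.Numerics.Distributions;
using MathNet.Numerics.LinearAlgebra;

static class Als
{
    // X: user-item matrix, f: number of features. The prediction for (u, i)
    // is the dot product U.Row(u) * I.Row(i).
    public static (Matrix<double> U, Matrix<double> I) Factorize(
        Matrix<double> X, int f, int iterations)
    {
        var rnd = new ContinuousUniform(0.0, 0.1);
        var U = Matrix<double>.Build.Random(X.RowCount, f, rnd);     // users x features
        var I = Matrix<double>.Build.Random(X.ColumnCount, f, rnd);  // items x features

        for (int it = 0; it < iterations; it++)
        {
            // Lock I; exact least squares solution for every user row of U.
            var solveU = I.TransposeThisAndMultiply(I).Inverse() * I.Transpose();
            for (int u = 0; u < X.RowCount; u++)
                U.SetRow(u, solveU * X.Row(u));

            // Lock U; exact least squares solution for every item row of I.
            var solveI = U.TransposeThisAndMultiply(U).Inverse() * U.Transpose();
            for (int i = 0; i < X.ColumnCount; i++)
                I.SetRow(i, solveI * X.Column(i));
        }
        return (U, I);
    }
}
```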
