

DEGREE PROJECT IN TECHNOLOGY, FIRST CYCLE, 15 CREDITS

STOCKHOLM, SWEDEN 2016

Evaluating if an analysis of test result could be used when using gradient boosted decision tree in recommender systems.

A QUANTITATIVE STUDY BASED ON THE METHODS DEVELOPED BY BELLKOR'S PRAGMATIC CHAOS TEAM FROM THE NETFLIX PRIZE COMPETITION.

JACOB GULLBRING


May 11, 2016


Abstract

This essay is intended to serve as a template for future creation of recommender systems. The main purpose of the study is to provide additional information that can be beneficial when building recommender systems, an emerging topic in the field of data mining, where it is important to analyze large collections of data and compute patterns. In the beginning these systems were only useful for simpler tasks, but with the help of this contest they have evolved into much more complex systems, and this study mainly focuses on demonstrating leading methods for creating such recommender systems. First the methods used are explained in more detail and some main concepts are brought forward, ending with a description of the datasets that were released. The results from the winning team, BellKor's Pragmatic Chaos, demonstrate the difference between each year for the corresponding method, and lead to the conclusion that, by analyzing the testing on the data, some of the final predictors could be found using this technique. This further reduces the quantity of combinations that need to be tested, and where time pressure or financial constraints exist, this is a strong argument for using this analysis as a template for future creation of recommender systems.


Abstract

This essay is intended to be used as a basis for future creation of recommender systems. The main purpose of the study is to provide additional information that can be beneficial when a recommender system is to be created; it is an emerging topic within the field of data mining, where it is important to analyze large collections of data and compute patterns. In the beginning these systems were only useful for simpler tasks, but with the help of this competition they have developed into much more complex systems, and this study will mainly focus on demonstrating leading methods for creating such recommender systems. First the methods are explained in more depth and their main concepts are brought forward, ending with a description of the datasets that were released. The results from the winning team BellKor's Pragmatic Chaos will show the difference between each year for each method, and lead to a conclusion that, through an analysis of the testing of the data, some of the predictors can be identified using this technique. This will further reduce the number of combinations that must be tested, and if time pressure or financial difficulty is involved, this is a strong argument for using this analysis as a basis for future studies.


Contents

1 Introduction
  1.1 Problem Statement
2 Background
  2.1 Recommendation systems
    2.1.1 Motivation
    2.1.2 Concept
    2.1.3 Main approaches
  2.2 The Netflix Prize
    2.2.1 How it worked
    2.2.2 The teams
    2.2.3 RMSE - Root-Mean-Square Error
3 Method
  3.1 CF - Collaborative Filtering
    3.1.1 Asymmetric Factor Models
    3.1.2 RBM - Restricted Boltzmann Machines
    3.1.3 k-NN - Neighborhood-based model
  3.2 Gradient Boosted Decision Trees - GBDT
    3.2.1 Blending
  3.3 Dataset
    3.3.1 Subsets
    3.3.2 Training Set
  3.4 Graph Analysis
4 Result
  4.1 Asymmetric SVD factor models
  4.2 Regression models
  4.3 RBM with Gaussian visible units
  4.4 RBM
  4.5 Matrix factorization
  4.6 kNN
  4.7 Combinations
  4.8 Imputation of Qualifying predictions
  4.9 Specials
  4.10 Bellkor RMSE Timeline
5 Discussion
  5.1 Method Discussion
  5.2 Conclusion
6 References
  6.0.1 Electronic


1 Introduction

Recommender systems have been in widespread use since the advent of the Internet; at early stages they were especially useful in e-commerce, but in recent times they have been used in many different areas.1 In a competition held between 2006 and 2009, created by the website www.Netflix.com, participants competed against each other to create the most optimized system for generating recommendations for its users. New areas were discovered in the field of recommender systems and in the methods used for creating them. The community around the competition, which enabled participants to share ideas and problems with each other via forums and other discussions on the web, also proved to have a positive impact on how quickly a recommender system of this sort could be created.

1.1 Problem Statement

Could an analysis of the performance of testing on the datasets find some or all of the final blending predictors, and be used as a template for future creations of recommender systems that use gradient boosted decision trees?

2 Background

2.1 Recommendation systems

Recommender systems try to give suggestions for items (movies, music, websites, products, etc.) to a specific user based on the information available. The main goal is to give a potential customer the perception that the service being used has knowledge of the person's interests. With this kind of recommender system, users get the perception that the system "knows" which movies they could be interested in and recommends similar ones; here the items are movies, and each user has rated at least one item (a few rated only one movie).

2.1.1 Motivation

Recommendation systems are a key future component for discovering a specific user's preferences and for finding patterns in a user's behavior. This is based entirely on large datasets containing different kinds of collected data about users and items, and gives a logical recommendation depending on which method is used. It is important for the company to gain this knowledge, because it furthers the understanding of which movies and series the users want from the service, and the company also learns which items are popular or unpopular, watched most or least, clicked most, and so on. This type of recommendation system could also build up a user profile based on its recommendations, but none of the competitors created a system that made this possible; still, this could be done by the company afterwards, because they hold this large amount of information.

1 Sarwar, Badrul, et al. Application of Dimensionality Reduction in Recommender System - A Case Study. No. TR-00-043. Minnesota Univ Minneapolis Dept of Computer Science, 2000.

2.1.2 Concept

At first, recommender systems were used in simple advertisements on web pages, giving the viewing user a recommended advertisement based on their browsing history, or suggesting interesting items based on previous purchases, as seen on the website www.Amazon.com. Initially they used recommender systems only for advertisement, but later, with the help of new techniques and methods learned at that time, they created a recommender system that went beyond simple item-based advertisement. This new type of recommender system took a specific relation between a user and an item and created a suggestion for that user based on this relation. www.Youtube.com implemented this on their website: they created a recommender system based on the comparison between a user's video preferences and the overall preferences, i.e. examples such as top rated, most popular, most recent, etc. were compared to the specific user's ratings, popularity, recency, etc. The initial design of such a recommender system was similar to the www.Youtube.com system, but that required too much data to comprehend, leading to a system based on relationships and patterns in the user data, such as "other users that rated this movie also rated that movie" and "other users that viewed this one also watched that one", and so forth.

2.1.3 Main approaches

There are many approaches to creating a recommender system, depending on what the system needs to fulfill: should the system give recommendations based on a user's preferences, ratings, comments or other content information, or should it give recommendations based on a user's behavior, relations with other users' behaviors, etc.? The answer to the first alternative is an approach called content-based (CB), which is suitable when creating systems that give a more concrete recommendation rather than a contextual recommendation to the user. Contextual recommendation is also known as collaborative filtering (CF) and was the most used technique during the competition. Its idea is to create predictions about the interests of a user by using the information of other users, which is why it is called collaborative filtering: it uses cooperating features to filter the predictions. CF was chosen as the final approach by all teams in 2009, because it could handle this comparatively small amount of data, i.e. it is more appropriate for smaller datasets overall. However, in research by Adomavicius and Tuzhilin (2005), a hybrid recommender system is described, combining CB and CF; at this time a hybrid system would have been too demanding on computer resources, and because of this it was not suitable for the amount of data that was released. Later some adaptations from CB were made, resulting in a hybrid recommender system, but many years after the competition.

2.2 The Netflix Prize

2.2.1 How it worked

In October 2006, www.Netflix.com released a dataset containing 100 million anonymous movie ratings and challenged the data mining, machine learning and computer science communities to develop systems that could beat the accuracy of its recommendation system, Cinematch.2 This dataset became the training set for the competition, but the dataset used for determining winners had fewer ratings and a different layout than the training set (see Section 3.3). The recommendation system used by www.Netflix.com scored an RMSE of 0.9525 on the test set and 0.9514 on the quiz set, and the goal of the competition was to improve this by 10%. The competition ran for three years, awarding a progress prize each year as long as no one had produced a grand prize solution; each year's winner received a smaller prize and contributed to the community by discussing parts of their solution on a web-based forum.3 The progress prize was awarded each year until the grand prize solution was created by BellKor's Pragmatic Chaos, which ended the competition abruptly. They had an RMSE of 0.8567 on the test set and 0.8554 on the quiz set, a 10% improvement as required.

2.2.2 The teams

The competition resulted in more than 20,000 teams from over 150 countries, 2,000 of which submitted a final solution, for a total of 13,000 prediction sets overall. The top four teams competing for the 2007 progress prize came from different parts of the world, and their names were as follows:

• WXYZConsulting - Two competitors, Wei Xu and Yi Zhang, front runners from November to December of 2006.

• ML@UToronto A - A team from the University of Toronto, front runners from October to December of 2006.

• Gravity - A team from Budapest University of Technology, front runners from January to May of 2007.

• BellKor - A collaborative team from AT&T Labs, consisting of Robert Bell, Chris Volinsky and Yehuda Koren.

The discussions continued on the forums, and this led to a conference being held in San Jose, California; simultaneously a workshop was set up, which included presentations by all four top leaders describing their techniques. The idea of the workshop was to have two tasks, task one and task two, with task one being the predominant one for deciding the winner of the workshop competition, though without any prize money involved. The IBM Research team, consisting of Yan Liu, Saharon Rosset, Claudia Perlich and Zhenzhen Kou, came in third place on task one but finished in first place on task two. In year two, only three teams were competing for the number one position:

2 https://www.cs.uic.edu/~liub/KDD-cup-2007/NetflixPrize-description.pdf

3 http://www.netflixprize.com/community/

• BellKor - A collaborative team from AT&T Labs.

• BigChaos - A team from the Austrian company Commendo Research & Consulting.

• BellKor in BigChaos - The final joint team between the AT&T Labs team and the Commendo Research & Consulting team.

Lastly, the winning team was a merger of BellKor in BigChaos and a team created in 2009 called Pragmatic Theory, with Martin Piotte and Martin Chabbert, forming BellKor's Pragmatic Chaos.

2.2.3 RMSE - Root-Mean-Square Error

The root-mean-square error is an average model-prediction error expressed in the units of the variable of interest. It has also been used to represent average difference, and it is one of the most widely reported, and misinterpreted, error measures in the climatic and environmental literature.4 RMSE represents the difference between predicted values and already known ones, which is also called the prediction error. Prediction error is used as the measure here because it is relevant when evaluating forecasting errors of different models for a particular value, not on pairs, which fits the description of the datasets. The resulting RMSE varied each year, as the different approaches produced different RMSE values for all participants (see Table 1). Some of the teams shown here were only created in 2009, and little information about them is available beyond their competition name, but the idea is to give an overview of how the final RMSE values differed.
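Formally, over a set T of held-out (user, item) pairs with known ratings r_{ui} and predicted ratings \hat{r}_{ui}, the RMSE is

\mathrm{RMSE} = \sqrt{ \frac{1}{|T|} \sum_{(u,i) \in T} \left( \hat{r}_{ui} - r_{ui} \right)^2 }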

4 Willmott, Cort J., and Kenji Matsuura. "Advantages of the mean absolute error (MAE) over the root mean square error (RMSE) in assessing average model performance." Climate Research 30.1 (2005): 79.


Table 1: History of the RMSE on the Test set; each column represents a final result for each team. A missing value means no solution was submitted at that time. 2009 shows a tie between Team Bellkor and The Ensemble, but Team Bellkor won by submitting faster, thus ending the competition.

Team                                  2007     2008     2009
Team Bellkor                          0.8712   0.8616   0.8567
The Ensemble                          -        -        0.8567
Grand Prize Team                      0.8675   -        0.8582
Opera Solutions and Vandelay United   -        -        0.8588
Vandelay Industries                   -        -        0.8591
Dinosaur Planet                       0.8769   -        -
Gravity                               0.8785   0.8643   -

Source: http://www.netflixprize.com/leaderboard

3 Method

This chapter is divided into two sections: the first demonstrates the methods used by the winning team and explains the datasets' layout and function, and the second demonstrates the analysis of the graphs. The graph analysis is the last part, and it summarizes the competition's method discovery, which was based on finding patterns in the beginning but evolved into essentially testing all cases before drawing a conclusion, i.e. testing all predictors and their combinations. The idea of this chapter is to give a quick overview of some important aspects of the methods and to explain the graph analysis as far as possible. It ends with a summary of the last method used in the winning solution; this is an overview of the method rather than a full explanation, but it also demonstrates the predictors chosen for use in the final blending scheme.

3.1 CF - Collaborative Filtering

Collaborative filtering (CF) is a popular recommendation algorithm that bases its predictions and recommendations on the ratings or behavior of other users in the system. This was the main approach for the solutions created, because it uses the aggregated behavior or taste of a large number of users to suggest relevant items to specific users. The majority of collaborative filtering algorithms operate by first generating predictions of the user's preference and then producing recommendations by ranking candidate items by predicted preference. The content-based (CB) approach was considered in the beginning because it bases recommendations on an item's values or attributes, for example taking one movie's rating and comparing it with other movies with the same rating, or basing recommendations on browsing history. The main issue is that CB algorithms cannot comprehend a large amount of data, leading to the conclusion that the fastest approach is CF if the purpose of the system is best accuracy, i.e. lowest RMSE. The system www.Netflix.com uses today is a hybrid of both CF and CB: it gives recommendations both based on item-to-item relationships and by viewing each item by itself and comparing it with similar items.

3.1.1 Asymmetric Factor Models

Asymmetric factor models establish parameters that provide a symmetric view of users and movies: by viewing each user as "a bag of movies", the factorization model parameterizes only the movie factors, using a transformed function. In linear algebra, the singular value decomposition (SVD) is a factorization of a real or complex matrix. The goal of this method is to create a matrix that satisfies all the criteria for being used in the implementation. The key step for the SVD was to put the predicted user ratings in a matrix and then use its transpose, creating a reversed matrix with a different layout. This simplifies the computation: the users are now horizontal instead of vertical, so the ratings are stacked after each other instead of below each other. This type of modeling will be referred to as [NSVD1/2] and [SIGMOID1/2/3]; it is the main method used for factor models, but it is also implemented in other approaches. SVD started with an equation of the following form and was one of the first predictors used in the recommender system:

\hat{r}_{ui} = \mu + b_u + b_i + p_u^T q_i    (1)

where \hat{r}_{ui} is the predicted value of r_{ui}, the rating of item i given by user u, i.e. the preference of user u for item i on a scale of 1-5 stars; \mu is the overall average rating; b_u and b_i are the observed deviations of user u and item i; and the last term is, as explained earlier, the product of the user factor vector and the item factor vector taken from the transposed matrix of the items inside the test set (ratings, user). The asymmetric-SVD is the regular SVD evolved; the extension is a variable that computes each factor separately while performing shrinkage. The corresponding methods are [IncFctr] and [SimuFctr], where the latter is described further in the section about regression models. In the asymmetric-SVD the user features are estimated directly from the ratings, and its factor model is similar to Equation 1, with a few additions:

\hat{r}_{ui} = \mu + b_u + b_i + q_i^T \Big( |R(u)|^{-1/2} \sum_{j \in R(u)} (r_{uj} - b_{uj}) x_j + |N(u)|^{-1/2} \sum_{j \in N(u)} y_j \Big)    (2)

where the only addition is that the transposed item factor is multiplied with a neighborhood-based model, bridging the neighborhood and factor models. The last of these factor models is an even more accurate asymmetric SVD, still based on the same models as Equation 1 and Equation 2, but with a few additions:

\hat{r}_{ui} = \mu + b_u + b_i + q_i^T \Big( p_u + |N(u)|^{-1/2} \sum_{j \in R(u)} y_j \Big)    (3)

The difference here is that the term |R(u)|^{-1/2} \sum_{j \in R(u)} (r_{uj} - b_{uj}) x_j is removed and its role is folded into the last sum over y_j; this changes the range of relationships the model goes through: instead of adding all N(u) relationships, it adds the R(u) relationships. The remaining user-specific part is instead added as a variable of its own, p_u, because it is a relationship that is not changing, i.e. it is a "known fact over many items".
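A minimal sketch of how the parameters of Equation 1 could be fitted by stochastic gradient descent on the regularized squared error follows; the function name train_svd, the learning rate, regularization strength and factor count are illustrative assumptions, not BellKor's settings:

```python
import numpy as np

def train_svd(ratings, n_users, n_items, f=40, lr=0.005, reg=0.02, epochs=20):
    """Fit mu + b_u + b_i + p_u^T q_i (Equation 1) by stochastic gradient descent."""
    mu = sum(r for _, _, r in ratings) / len(ratings)   # global average rating
    bu = np.zeros(n_users)                              # user biases b_u
    bi = np.zeros(n_items)                              # item biases b_i
    P = 0.1 * np.random.randn(n_users, f)               # user factors p_u
    Q = 0.1 * np.random.randn(n_items, f)               # item factors q_i
    for _ in range(epochs):
        for u, i, r in ratings:
            e = r - (mu + bu[u] + bi[i] + P[u] @ Q[i])  # prediction error
            bu[u] += lr * (e - reg * bu[u])
            bi[i] += lr * (e - reg * bi[i])
            P[u], Q[i] = (P[u] + lr * (e * Q[i] - reg * P[u]),
                          Q[i] + lr * (e * P[u] - reg * Q[i]))
    return mu, bu, bi, P, Q

# Example: three (user, item, rating) triples over two users and two movies.
model = train_svd([(0, 0, 5), (0, 1, 3), (1, 0, 4)], n_users=2, n_items=2)
```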

3.1.2 RBM - Restricted Boltzmann Machines

Restricted Boltzmann machines (RBM) is a technique that starts by assigning a low-dimensional feature vector to each user and a low-dimensional feature vector to each movie, so that the rating each user assigns to each movie is modeled by the scalar product of the two feature vectors. This technique was used because the RBM proved to have low sensitivity to parameter settings, as it uses a model that is an extension of maximum likelihood learning methods, named contrastive divergence. Contrastive divergence is most efficient because it greatly reduces the variance of the estimates used for learning, and it also provides the RBM with better accuracy for the method using Gaussian visible units. The difference between Gaussian visible units and Gaussian hidden units is that the former replaces the binary visible units with linear units carrying independent Gaussian noise, while the latter replaces the binary hidden units with units carrying independent Gaussian noise; all of this is within the scope of an energy function named E(v, h), where v and h are the visible and hidden units.5
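For reference, following the footnoted Hinton guide, the energy of a binary RBM with visible units v, hidden units h, biases a and b, and weights w is

E(v, h) = - \sum_{i} a_i v_i - \sum_{j} b_j h_j - \sum_{i,j} v_i w_{ij} h_j

and with Gaussian visible units of variance \sigma_i^2 the visible terms become

E(v, h) = \sum_{i} \frac{(v_i - a_i)^2}{2 \sigma_i^2} - \sum_{j} b_j h_j - \sum_{i,j} \frac{v_i}{\sigma_i} w_{ij} h_j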

3.1.3 k-NN - Neighborhood-based model

The most common approach to CF is the neighborhood-based approach. Its original form, shared by virtually all earlier CF systems, is the user-oriented approach; such user-oriented methods estimate unknown ratings based on the recorded ratings of like-minded users. Prior to computing interpolation weights, one has to choose the set of neighbors. Some methods are based on pairing these weights, while others use correlation distances; examples are named [Corr-kNN], [Fctr-kNN] and [Slow-kNN], with the last one called slow because, being the first neighborhood approach, it takes the longest to compute. As mentioned in the asymmetric factor model section, this approach is implemented in the SVD models, but there is also a general model which they used independently:

\hat{r}_{ui} = \mu + b_u + b_i + |R(u)|^{-1/2} \sum_{j \in R(u)} (r_{uj} - b_{uj}) w_{ij} + |N(u)|^{-1/2} \sum_{j \in N(u)} c_{ij}    (4)

5 Hinton, Geoffrey. "A practical guide to training restricted Boltzmann machines." Momentum 9.1 (2010): 926.


This model is similar to the factor models, with the following difference: Equation 3, for example, folds the R(u) relationships into the N(u) sum, whereas in this model all relationship functions are calculated individually and then added up.

3.2 Gradient Boosted Decision Trees - GBDT

Regression models are used to create two separate ways of building the matrix for testing: one is the user-centric approach, where the data is all movies rated by a specific user, and the other is the movie-centric approach, where the data is all users that rated a specific movie. These two approaches came with some minor issues that had to be solved, because in some cases the user-centric approach needed movie-based predictors, and for the movie-centric regression user-based predictors had to be created. However, by deriving either one from the other, new methods were found, later referred to as [PCA], [STRESS], [BIN-SVD1/2/3] and a few more (see Section 4.2). The gradient boosted decision tree (GBDT) is the most useful method for generating effective models from the regression models and other tasks. It has been used in previous recommender systems and has proven useful for testing different combinations of predictors; in this system all of the different methods are called predictors, because they predict a certain value. The main purpose of GBDT here is to test many predictors at one time instead of testing one at a time.
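As an illustration of this blending idea, the sketch below fits a gradient boosted tree ensemble over a matrix of predictor outputs, using scikit-learn rather than the team's own implementation; the synthetic data, shapes and hyperparameters are illustrative assumptions, not the competition settings:

```python
import numpy as np
from sklearn.ensemble import GradientBoostingRegressor

# X: one column per predictor (e.g. kNN, RBM, SVD outputs on a probe set),
# y: the true probe ratings. Here both are synthetic stand-ins.
rng = np.random.default_rng(0)
y = rng.uniform(1, 5, size=1000)                        # stand-in true ratings
X = np.column_stack([y + rng.normal(0, s, 1000)         # fake predictor outputs
                     for s in (0.9, 0.95, 1.0)])

blend = GradientBoostingRegressor(n_estimators=200, max_depth=3,
                                  learning_rate=0.05)
blend.fit(X, y)                                         # learn the blend
rmse = np.sqrt(np.mean((blend.predict(X) - y) ** 2))    # in-sample RMSE
print(f"blend RMSE: {rmse:.4f}")
```

In a real blend the ensemble would be evaluated on held-out data rather than on the same ratings it was fitted to.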

3.2.1 Blending

The sets of predictors included in the GBDT blending were: 454 predictors from the BellKor's Pragmatic Chaos team, 75 predictors from the BigChaos team, and lastly 24 predictors from the original BellKor team. The last 24 predictors were chosen using random testing, i.e. testing all possible combinations, and the first 7 of these are explained in more detail in the 2009 report. This section will mainly demonstrate the remaining 17 predictors and their corresponding numbers, because it is these predictors that could have been chosen using the analysis described later (see Section 3.4). They are divided into two separate parts, the predictors from 2007 and the ones from 2008; as mentioned, these are single predictors, but tested with different techniques: some are combinations of methods and some are just a single method. The 2007 predictors were #3 in RBM, #8 in RBM, #5 in matrix factorization, #27 in matrix factorization, #3 in kNN, #5 in kNN, #11 in kNN, #13 in kNN, #15 in kNN, and #2 in specials, and these are the ones that the graph analysis will reveal later in the report, without using random testing, i.e. only observing known values and plotting them in a diagram for analysis (see Section 3.4).


3.3 Dataset

A complete dataset featuring 100 million ratings from different users was released initially. The Hold-out set6 was a combined collection of three subsets, Probe, Quiz and Test, based on 4.2 million ratings consisting of the last nine movies rated by each user; the remaining data was used as the training set. Divided into three separate text files, the layouts were similar but the outputs varied: an example of a row in Quiz is "1,2003,Dinosaur Planet", while an example in Probe is "1:30878..." (see Subsets). Both use "1" as the movie identification number (MovieID), but Quiz then gives the movie release year and the movie title, while Probe gives the customer identification number (CustomerID) after the "1". The final evaluation set, called the Qualifying set, was a combination of Quiz and Test, because Test also had MovieID but with the addition of CustomerID. The Probe set was later attached to the training set (see Figure 1), forming the final training set they used for testing, named Probe-Qualifying.

6 Yehuda Koren. "The BellKor Solution to the Netflix Grand Prize." (2009).

Figure 1: Model of the datasets for each user. The last block, with user test data, was not released, because solutions were tested on this data. Each of these blocks also has a time input, which could be either when a movie was rated or when a movie was released; for clarification, the arrow to the right represents that time is a factor in all three blocks.

3.3.1 Subsets

Three different datasets were created as subsets, including the Qualifying subset, which as mentioned is a combination of Quiz and Test but is still used in the model for constructing the training dataset (see Section 3.1.2):

• "Quiz": A,B,C, with the parameters A=MovieID, B=Year released and C=Name of the movie; for example: 1,2003,Dinosaur Planet.

• "Probe": A:Rn, with the parameters A=MovieID and Rn=CustomerID number n; for example: 1:1046323,1080030,...

• "Test": A:Rn and Dn:Rn, with the parameters A=MovieID, Rn=CustomerID number n and Dn=Date of rating; for example: 1:1046323,1080030,... 2005-12-19:1046323,1080030

3.3.2 Training Set

The training set consisted of 17,700 movies of varying size, depending on how many ratings a specific movie had, for example: "1:1488844,3,2005-09-06 822109,5,2005-05-13 885013...". The first number is the movie number (MovieID), followed by the user number (CustomerID), the rating, and lastly the date when it was rated. Probe was attached because it facilitated an easier overview, with MovieID and CustomerID together instead of collections of only the initial movie titles and their ratings. Each text file represented a movie and its ratings, and the final training set had a similar layout to the subsets (a parsing sketch follows the list below):

• "Training": A:Rn,R,Dn, with the parameters A=MovieID, Rn=CustomerID, R=Rating and Dn=Date of rating.

• "Qualifying": A:Rn,D,Rn..., with the same parameters.
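As an illustration, a small parser for the training layout quoted above might look as follows; parse_training_file is a hypothetical helper, and it assumes each movie block opens with a "MovieID:" line followed by one "CustomerID,Rating,Date" line per rating:

```python
def parse_training_file(path):
    """Yield (movie_id, customer_id, rating, date) tuples.

    Assumes the layout quoted above: a 'MovieID:' line opens each block,
    followed by one 'CustomerID,Rating,Date' line per rating.
    """
    movie_id = None
    with open(path) as f:
        for line in f:
            line = line.strip()
            if line.endswith(":"):               # new movie block, e.g. '1:'
                movie_id = int(line[:-1])
            elif line:
                cust, rating, date = line.split(",")
                yield movie_id, int(cust), int(rating), date
```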

3.4 Graph Analysis

This section explains how the graphs should be analyzed. By estimating where the highest and lowest points are for each year (or by looking at the tables' highest and lowest values, though this takes more time), a span can be created between them: for 2007, create a span between the highest and lowest value, and likewise for 2008. Between these separate spans lies the area "around" which the searched-for predictors are found. For example, simplifying with arbitrary numbers: say the imaginary RMSE has a span between methods one and five for the first attempt and between methods two and six for the second attempt; then the searched-for predictors should be found between methods one and six. An analysis of behaviors in the graphs will also reveal predictors: by observing where the graphs cross at one method, i.e. where the RMSE value is approximately the same, some of the predictors are found. Another behavior is when there is, for example, a high and a low RMSE, i.e. a peak at the beginning of the graph, followed directly by small crossings of methods that are not larger than the first high and low values; then the span is much more sparse, and it runs from where this peak started to the end.
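The span heuristic described above can be sketched in a few lines; candidate_span is a hypothetical helper, and it assumes one RMSE value per method number and simply takes the envelope of the per-year high-low spans:

```python
def candidate_span(rmse_2007, rmse_2008):
    """Sketch of the span heuristic in Section 3.4 (method numbers are 1-based).

    For each year, take the method numbers of the highest and lowest RMSE
    and form a span; the predictors are then searched for inside the
    envelope of the two yearly spans.
    """
    spans = []
    for series in (rmse_2007, rmse_2008):
        hi = max(range(len(series)), key=series.__getitem__) + 1
        lo = min(range(len(series)), key=series.__getitem__) + 1
        spans.append((min(hi, lo), max(hi, lo)))
    return min(s[0] for s in spans), max(s[1] for s in spans)

# With the arbitrary numbers from the text, spans of methods 1-5 and 2-6
# combine into a search span of methods 1 to 6.
```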


4 Result

4.1 Asymmetric SVD factor models

Table 2: Asymmetric SVD factor models results.

Number  Method 2008   RMSE 2008   Method 2007   RMSE 2007
1.      [SIGMOID2]    0.9286      [SIGMOID3]    0.9194
2.      [NSVD2]       0.9383      [SIGMOID2]    0.9286
3.      [NSVD1]       0.9236      [NSVD2]       0.9383
4.      [NSVD1]       0.9259      [SIGMOID1]    0.9245
5.      [NSVD1]       0.9260      [NSVD1]       0.9114
6.      [SIGMOID1]    0.9225      [NSVD1]       0.9236
7.      -             -           [NSVD1]       0.9259
8.      -             -           [NSVD1]       0.9134
9.      -             -           [NSVD1]       0.9260

Figure 2: Asymmetric SVD factor models results - error vs #method. The plot shows how the accuracy for each year varied depending on which method was used.

4.2 Regression models

Table 3: Regression models.

Number  Method 2008       RMSE 2008   Method 2007       RMSE 2007
1.      [BIN-SVD3]        0.9223      [PCA-USER]        0.9269
2.      [PCA]             0.9212      [STRESS]          0.9302
3.      [PCA]             0.9241      [BIN-SVD-USER]    0.9335
4.      [BIN-SVD-USER]    0.9335      [BIN-SVD-USER]    0.9335
5.      [BIN-SVD3-USER]   0.9290      [BIN-SVD-USER]    0.8996
6.      [BIN-SVD-USER]    0.9437      [BIN-SVD3-USER]   0.9290
7.      [BIN-SVD-USER]    0.9610      [BIN-SVD3-USER]   0.9394
8.      [BIN-SVD3-USER]   0.9414      [PCA]             0.9241
9.      [BIN-SVD-USER]    0.9067      [PCA]             0.9212
10.     [BIN-SVD-USER]    0.9030      [BIN-SVD-USER]    0.9451
11.     [PCA-USER]        0.9269      [BIN-SVD-USER]    0.9610
12.     [STRESS]          0.9302      [BIN-SVD3-USER]   0.9414
13.     -                 -           [BIN-SVD-USER]    0.9067
14.     -                 -           [BIN-SVD-USER]    0.9020
15.     -                 -           [BIN-SVD-USER]    0.9030
16.     -                 -           [BIN-SVD3-USER]   0.9223

Figure 3: Regression models - error vs #method. The plot shows how the accuracy for each year varied depending on which method was used.

4.3 RBM with Gaussian visible units

Table 4: Restricted Boltzmann Machines with Gaussian visible units.

Number  Method 2008   RMSE 2008   Method 2007   RMSE 2007
1.      RBMG          0.9052      RBMG          0.9052
2.      RBMG          0.9044      RBMG          0.9044
3.      RBMG          0.9056      RBMG          0.9056
4.      RBMG          0.9429      RBMG          0.9068
5.      RBMG          0.9074      RBMG          0.9121
6.      RBMG          0.9267      RBMG          0.9429
7.      RBMG          -           RBMG          0.9489
8.      RBMG          -           RBMG          0.9267

Figure 4: Restricted Boltzmann Machines with Gaussian visible units - error vs #method. The plot shows how the accuracy for each year varied depending on which method was used.

4.4 RBM

Table 5: Restricted Boltzmann Machines.

Number  Method 2008   RMSE 2008   Method 2007   RMSE 2007
1.      RBM           0.9029      RBM           0.9029
2.      RBM           0.9029      RBM           0.9029
3.      RBM           0.9087      RBM           0.9093
4.      RBM           0.9093      RBM           0.9206
5.      RBM           0.8960      RBM           0.8960
6.      RBM           0.8905      RBM           0.8905
7.      RBM           0.8904      RBM           0.8904
8.      -             -           RBM           0.8888

Figure 5: Restricted Boltzmann Machines - error vs #method. The plot shows how the accuracy for each year varied depending on which method was used.

4.5 Matrix factorization

Table 6: Matrix factorization.

Number  Method                 RMSE 2008   RMSE 2007
1.      Matrix factorization   0.8992      0.9135
2.      Matrix factorization   0.9070      0.8992
3.      Matrix factorization   0.9050      0.9042
4.      Matrix factorization   0.9026      0.9083
5.      Matrix factorization   0.8963      0.9002
6.      Matrix factorization   0.8986      0.9050
7.      Matrix factorization   0.9807      0.9035
8.      Matrix factorization   0.8970      0.9515
9.      Matrix factorization   1.1561      0.9347
10.     Matrix factorization   0.9039      0.9084
11.     Matrix factorization   0.8955      0.9073
12.     Matrix factorization   0.9072      0.9094
13.     Matrix factorization   0.9018      0.9018
14.     Matrix factorization   0.9426      0.8986
15.     Matrix factorization   0.9327      0.9026
16.     Matrix factorization   0.8998      0.8963
17.     Matrix factorization   0.9070      0.8986
18.     Matrix factorization   0.9098      0.9807
19.     Matrix factorization   -           0.8970
20.     Matrix factorization   -           0.8978
21.     Matrix factorization   -           0.8985
22.     Matrix factorization   -           1.1561
23.     Matrix factorization   -           0.9039
24.     Matrix factorization   -           0.8955
25.     Matrix factorization   -           0.9426
26.     Matrix factorization   -           0.9327
27.     Matrix factorization   -           0.9016
28.     Matrix factorization   -           0.8998
29.     Matrix factorization   -           0.9070
30.     Matrix factorization   -           0.9098

Figure 6: Matrix factorization - error vs #method. The plot shows how the accuracy for each year varied depending on which method was used.

4.6 kNN

Table 7: Neighborhood-based model (kNN).

Number  Method 2008                  RMSE 2008   Method 2007                  RMSE 2007
1.      50 neighbors Fctr-kNN        0.9309      30 neighbors kNN             0.8953
2.      75 neighbors Slow-kNN        0.9037      50 neighbors kNN             0.9105
3.      30 neighbors kNN             0.8953      50 neighbors kNN             0.9082
4.      50 neighbors kNN             0.9105      25 neighbors kNN             0.9496
5.      25 neighbors kNN             0.9496      60 neighbors kNN             0.8979
6.      60 neighbors kNN             0.8979      50 neighbors Bin-kNN         0.9247
7.      50 neighbors Bin-kNN         0.9215      50 neighbors Bin-kNN         0.9215
8.      25 neighbors Fctr-kNN        0.9097      50 neighbors Fctr-kNN        0.9309
9.      100 neighbors User-kNN       0.9290      25 neighbors Fctr-kNN        0.9097
10.     100 neighbors User-kNN       0.9097      50 neighbors Fctr-kNN        0.9290
11.     30 neighbors User-MSE-kNN    0.9112      100 neighbors User-kNN       0.9097
12.     Corr-kNN                     0.9248      30 neighbors User-MSE-kNN    0.9248
13.     Corr-kNN                     0.9170      50 neighbors Slow-kNN        0.9057
14.     Corr-kNN                     0.9079      Corr-kNN                     0.9170
15.     MSE-kNN                      0.9237      MSE-kNN                      0.9237
16.     Supp-kNN                     0.9085      Supp-kNN                     0.9110
17.     Supp-kNN                     0.9110      Supp-kNN                     0.9440
18.     Supp-kNN                     0.9440      Supp-kNN                     0.9335
19.     Supp-kNN                     0.9335      -                            -

Figure 7: Neighborhood-based model (kNN) - error vs #method. The plot shows how the accuracy for each year varied depending on which method was used.

4.7 Combinations

Table 8: Combinations.

Number  Method 2008                       RMSE 2008   Method 2007             RMSE 2007
1.      #4(RBM 2007) + factor model       0.8876      #27(MF) + #3(RBM)       0.8976
2.      #10(kNN) + #6(kNN)                0.8977      #4(RBM) + 60 factors    0.8876
3.      #5(MF 2007) + #5                  0.8906      #11(kNN) + #5(kNN)      0.8977
4.      #62 + User-kNN                    0.9078      #5(MF) + User-kNN       0.8909
5.      #5(MF 2007) + #10(MF 2007)        0.8967      #10(MF) + 20 factors    0.9003
6.      #5(MF 2007) + NNMF                0.8957      #5(MF) + #3(kNN)        0.8906
7.      #53 + User-kNN                    0.9017      #3(kNN) + Slow-kNN      0.9024
8.      #5(MF 2007) + #4(kNN)             0.8937      #14(kNN) + User-kNN     0.9078
9.      #5(MF 2007) + 30 neighbors kNN    0.8904      #4(kNN) + #26(kNN)      0.9046

Figure 8: Combinations - error vs #method. The plot shows how the accuracy for each year varied depending on which method was used.

4.8 Imputation of Qualifying predictions

Table 9: Imputation of Qualifying predictions.

Number  Method 2008        RMSE 2008   Method 2007       RMSE 2007
1.      MSE-kNN            0.8952      MSE-kNN           0.8952
2.      SimuFctr           0.9057      IncFctr           0.9100
3.      SimuFctr           0.9056      IncFctr           0.9039
4.      IncFctr            0.9093      SimuFctr          0.9056
5.      MSE-kNN            0.9005      IncFctr           0.9093
6.      50 neighbors kNN   0.9082      MSE-kNN           0.9005
7.      -                  -           #2 and SimuFctr   0.8975

Figure 9: Imputation of Qualifying predictions - error vs #method. The plot shows how the accuracy for each year varied depending on which method was used.

4.9 Specials

Table 10: Specials.

Number  Method 2008                RMSE 2008   Method 2007     RMSE 2007
1.      40 factors binary matrix   1.1263      -               1.1263
2.      -                          -           Special         0.9162
3.      -                          -           Similar to #2   0.9134

Figure 10: Specials - error vs #method. The plot shows how the accuracy for each year varied depending on which method was used. As the diagram shows, the RMSE decreased as more methods were used in 2007, which led to a removal of methods for the next year, resulting in only one RMSE for 2008.

4.10 Bellkor RMSE Timeline

Figure 11: Bellkor RMSE Timeline - error vs #method. The plot shows how the accuracy for each year varied depending on which method was used.

5 Discussion

During this three-year period the results differed a lot depending on whether there were any new discoveries in the area of recommender systems, i.e. the competitors had to invent by themselves, or wait for something to be discovered by someone else and then use it themselves. This process created an environment in which the competitors could find the most optimized algorithm using forums and general programming techniques, combined with feedback from employees throughout the competition. In the study done by Robert M. Bell and Chris Volinsky et al. (2007), before the final BellKor in BigChaos team was assembled, they found that combining different methods gives the most optimized solution, even though only collaborative filtering and content-based approaches were known at that time. Initially their conclusion revealed that combining 11 results of different combinations was the best solution, and this research reveals that, just by analyzing past test results for the data for each method, this analysis of graphs is applicable when creating such a recommender system. Further, it also reduces factors outside the technical aspects, such as time, money and manpower. Future recommendation systems may show results indicating that this technique was considered when deciding combinations for their final blending schemes, and even if the algorithm is not a GBDT, the technique could function for others as well. The general idea should be to analyze the top and bottom values of the test results and then test the blending technique only on the interval between these two. The results demonstrated that some of the initial 24 predictors from the BellKor team could be found using this technique: the methods #4 and #8 in RBM in 2007 are the first predictors they used in the blend, i.e. observe Figure 5 and see that the highest value is around method #4 and the lowest is at #8, and the same holds for 2008, which demonstrates that the final predictor should lie between those two for RBM, matching the criteria. Next, turning to the matrix factorization, where methods #5 and #27 were used in the blend, observing Figure 6 there is a span between #5 and #27 for their corresponding year, which matches the criteria. The last two interesting graphs are those for the neighborhood-based model and the specials. The first, Figure 7, had its lowest value in 2008 at method number three, which is the lowest value in the graph, but the highest value comes directly after, which makes the span too small; the span therefore runs from three to nineteen methods, reducing the search for predictors by only one or two methods. The last graph, Figure 10, is a case where three methods were used in 2007 and only one method in 2008, and the span was only method two, which was also the lowest value for any year.

5.1 Method Discussion

The datasets that were released had a size of 5 gigabytes, so initially just the size of the data made them difficult to comprehend or use without solid computing power. The strongest argument for using the already presented data is that the datasets used had 99.8 percent of the information missing, so only a minor part of the complete dataset was released; they may also have shortened the data because it was too big to comprehend. CF was the main approach for this data, and that limits the algorithm models available, i.e. there was a strong argument for CF because of this reduction of data, and in turn the methods for the CF approach had to be evolved for a small amount of data. This idea of reducing data because it is easier to comprehend should also be applied when looking for the blending predictors in the GBDT blending scheme: instead of reducing data, this method of analyzing previous test results reduces both the time and effort of finding all the final predictors. In addition, the CF approach became the foundation for the winning algorithm choice, GBDT, but the period between discovering CF as an implementation and using GBDT was a whole three years. This is the idea of the graph analysis: shorten the time spent testing all cases and indicate approximately where the predictors are located, so that testing can be done on fewer methods if there is a time limit or similar in the creation of future recommendation systems. It is also based on the idea of using odd values as input when choosing predictors for a blend, but it is preferable to look more at the span of methods, i.e. some of the final GBDT blending predictors lie somewhere between these values.

5.2 Conclusion

To answer the problem statement: yes. The graphs demonstrate that, by reducing the work of having to find all 24 final predictors, this analysis should be used when using GBDT in recommender systems if time is a key constraint on finding all the final predictors.

Both a time aspect and an economic aspect are reduced, because when choosing the first predictors for the first blend set, some of the predictors can be found using this technique, which may result in less money spent on upgrading hardware or on obtaining advanced blending schemes. However, this research does not reveal whether predictors could be found for larger datasets, but it is a template for future testing, which could reveal that it is a general principle rather than perhaps just a coincidence. In addition: if this technique were tested on the remaining predictors used in the GBDT blending, could the results reveal even more interesting spans in this graph analysis? Or, if more than three blending sets had been tested, could the resulting graph be even more accurate, because the span could be shortened further? These questions could be answered by future studies that have better-performing equipment and that also consider basic graph analysis before using complex blending schemes.

6 References

Bennett, James, and Stan Lanning. "The Netflix Prize." Proceedings of KDD Cup and Workshop. Vol. 2007. 2007, page 1.

Bennett, James, and Stan Lanning. "The Netflix Prize." Proceedings of KDD Cup and Workshop. Vol. 2007. 2007, page 3.

Schafer, J. Ben, Joseph Konstan, and John Riedl. "Recommender systems in e-commerce." Proceedings of the 1st ACM Conference on Electronic Commerce. ACM, 1999.

Qin, Song, Ronaldo Menezes, and Marius Silaghi. "A recommender system for YouTube based on its network of reviewers." Social Computing (SocialCom), 2010 IEEE Second International Conference on. IEEE, 2010.

Zhou, Yunhong, et al. "Large-scale parallel collaborative filtering for the Netflix Prize." Algorithmic Aspects in Information and Management. Springer Berlin Heidelberg, 2008. 337-348.

Jolliffe, Ian. Principal Component Analysis. John Wiley & Sons, Ltd, 2002.

De Lathauwer, L., et al. "Singular value decomposition." Proc. EUSIPCO-94, Edinburgh, Scotland, UK. Vol. 1. 1994.

Salakhutdinov, Ruslan, Andriy Mnih, and Geoffrey Hinton. "Restricted Boltzmann machines for collaborative filtering." Proceedings of the 24th International Conference on Machine Learning. ACM, 2007.

Landauer, Thomas K., Peter W. Foltz, and Darrell Laham. ”An introduction to latent semantic analysis.” Discourse processes 25.2-3 (1998): 259-284.

Koren, Yehuda. "Factorization meets the neighborhood: a multifaceted collaborative filtering model." Proc. 14th ACM Int. Conference on Knowledge Discovery and Data Mining (KDD'08). ACM Press, 2008.

Bell, Robert M., Yehuda Koren, and Chris Volinsky. ”The BellKor solution to the Netflix prize.” (2007).


Sarwar, Badrul, et al. Application of dimensionality reduction in recommender system-a case study. No. TR-00-043. Minnesota Univ Minneapolis Dept of Computer Science, 2000.


6.0.1 Electronic

http://www.netflixprize.com/

http://stuyresearch.googlecode.com/hg-history/b17661bbfaf905a2078902f1abe6b795d4a29137/blake/resources/p293-davidson.pdf

http://www.jmlr.org/papers/volume10/takacs09a/takacs09a.pdf
