
Comparison and Improvement of Collaborative Filtering Algorithms

VICTOR HANSJONS VEGEBORN
HAKIM RAHMANI

Degree Project in Technology, First Cycle, 15 credits
Stockholm, Sweden 2017

KTH
School of Computer Science and Communication


Degree Project in Computer Science, DD142X
Supervisor: Jens Lagergren
Examiner: Örjan Ekeberg

Swedish title: Jämförelse och förbättring av kollaborativa filtreringsalgoritmer

CSC, KTH, 2017-06-05


Abstract

Recommender systems are a well-researched topic in computer science. With today's e-commerce and widespread Internet access, companies try to maximize their profit by utilizing various recommender algorithms. One methodology used in such systems is Collaborative Filtering.

The objective of this paper is to compare four Collaborative Filtering algorithms, k-Nearest-Neighbour, Slope One, Singular Value Decomposition and Alternating Least Squares, in order to find out which algorithm produces the best prediction rates. In addition, the paper uses two mathematical models, the Arithmetic Mean and the Weighted Arithmetic Mean, to determine whether they can improve the prediction rates.

Singular Value Decomposition performed the best out of the four algorithms and Alternating Least Squares performed the worst. However, the Arithmetic Mean performed slightly better than Singular Value Decomposition, and the Weighted Arithmetic Mean performed the worst of all.

Sammanfattning

The objective of this thesis is to compare four Collaborative Filtering algorithms, k-Nearest-Neighbour, Slope One, Singular Value Decomposition and Alternating Least Squares, in order to find out which algorithm produces the best ratings. The thesis also uses two mathematical models, the Arithmetic Mean and the Weighted Arithmetic Mean, to determine whether they can improve the ratings.

Singular Value Decomposition performed best while Alternating Least Squares performed worst of the four algorithms. However, the Arithmetic Mean performed slightly better than Singular Value Decomposition, and the Weighted Arithmetic Mean performed the worst of all.


Contents

1 Introduction
  1.1 Purpose of this study
  1.2 Problem statement
2 Background
  2.1 Recommender Systems
    2.1.1 Collaborative filtering
      2.1.1.1 User-Based filtering
      2.1.1.2 Item-Based filtering
  2.2 Algorithms
    2.2.1 K-Nearest Neighbour
    2.2.2 Slope One
    2.2.3 Singular Value Decomposition
    2.2.4 Alternating Least Squares
  2.3 MovieLens data sets
  2.4 Error measurement
  2.5 Improving results
    2.5.1 Arithmetic mean
    2.5.2 Weighted arithmetic mean
3 Method
  3.1 Data validation and subset generation
  3.2 Software
  3.3 Test suite
  3.4 Hardware
4 Results
  4.1 Algorithms
  4.2 Improvement models
  4.3 Comparison
5 Discussion and conclusion
  5.1 Discussion
  5.2 Methodology criticism
  5.3 Conclusion


Chapter 1

Introduction

Recommender systems play an important role in today's era. Every day, people have to make choices based on the vast amount of data they are exposed to [1]: what to buy, what movie to watch, where to travel, and much more. In the past, and even today, people have relied on the recommendations of friends, family and co-workers when making a decision about an item. For example, Netflix has over 17,000 movies in its collection. It is impossible to effectively expose all 17,000 movies to a user, so how can the exposure be limited to a comprehensible level with relevant suggestions? This is where recommender systems play an important role.

Recommender systems recommend items to users based on data gathered from users' consumption patterns as well as their behavioural similarities with other users [2]. The data can be obtained implicitly or explicitly. Acquiring data implicitly usually means keeping track of users' behaviour, such as tracking web sites visited or movies watched, while explicit data is the result of collecting user ratings [3]. In addition, a recommender system must collect reliable data in order to make reliable recommendations [1].

What is the best way to collect reliable data? One might argue that consumption patterns and previous behaviour are among the most reliable sources of data, since they reflect people's preferences. For example, the clothes someone buys or the music they listen to might tell something about their future consumption patterns.

Netflix held a competition offering one million dollars to anyone who could improve the performance of their recommender system by at least 10% [1]. The importance of these systems is growing, as their impact may increase profits for companies all around the world.

1.1 Purpose of this study

The purpose of this paper is to investigate four different Collaborative Filtering algorithms and find out which one predicts ratings best. The four algorithms are K-Nearest Neighbour, Slope One, Alternating Least Squares and Singular Value Decomposition. Moreover, using the data produced by the algorithms, this paper investigates whether ratings can be improved by two mathematical models, the arithmetic mean and the weighted arithmetic mean.


Chapter 2

Background

In this chapter, recommender systems are presented. Common terminology is defined and the main implementations of such systems are described, followed by explanations of the four algorithms used in this paper. Afterwards, the data used by these algorithms and its pre-processing are described. Lastly, a brief overview of how algorithm performance is measured is given and the mathematical improvement methods are defined.

2.1 Recommender Systems

2.1.1 Collaborative filtering

A widely used method for recommender systems is Collaborative Filtering (CF). CF filters items based on the recommendations of others: it bases its recommendations and predictions on the ratings or behaviour of other users, gathered from those users' opinions [1]. Based on the data it gathers, it can make a reasonable prediction of a user's preferences. There are two different CF methods, User-Based Collaborative Filtering and Item-Based Collaborative Filtering.

2.1.1.1 User-Based filtering

User-Based Filtering is based on the similarities between users. If a user has not yet rated an item, the user-based filtering system bases its predictions on similarities between that user and other users [1]. For example, to predict Joe's rating for an item Joe has not yet rated, User-Based CF finds users that have rated items similarly to Joe, and then suggests items those users have rated. The users' ratings for the item in question are weighted by their level of agreement with Joe's ratings to make a prediction in accordance with Joe's preferences [1]. A more comprehensive description can be seen in figure 2.1 [4].


Figure 2.1: Joe likes the three movies on the left. To make a prediction for him, the system finds similar users who also liked those movies, and then determines which other movies they liked. In this case, all three liked Saving Private Ryan, so that is the first recommendation. Two of them liked Dune, so that is next, and so on [4].
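The weighting step can be made concrete with a small sketch. The following is a minimal, hypothetical realization of the similarity-weighted average described above; the method name and the dense-array representation are illustrative assumptions, not taken from any particular library, and real systems may normalize differently.

double predictUserBased(double[] similarities, double[] neighbourRatings) {
    // Weighted average of the neighbours' ratings for the target item,
    // where each rating is weighted by that neighbour's similarity to the user.
    double numerator = 0, denominator = 0;
    for (int i = 0; i < similarities.length; i++) {
        numerator += similarities[i] * neighbourRatings[i];
        denominator += Math.abs(similarities[i]);
    }
    return numerator / denominator;
}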

2.1.1.2 Item-Based filtering

Instead of finding similarities between users' behaviour, Item-Based CF bases its recommendations on similarities between items [1]. One way to accomplish this is to calculate the similarity between two items from the ratings of users who have rated both items, as illustrated in figure 2.2 [2].

Figure 2.2: Item-Based filtering

Moreover, there are several mathematical formulas for calculating the similarity between two items, such as cosine-based similarity, Pearson-based similarity and adjusted cosine similarity [1]. Each one uses a different approach to calculate the prediction ratings.
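As a sketch of one of these measures, the Pearson-based similarity between two items can be computed over the users who have rated both. The representation below, two aligned arrays of co-ratings, is an assumption made for illustration:

double pearsonSimilarity(double[] a, double[] b) {
    // a[i] and b[i] are the ratings of co-rating user i for item A and item B.
    int n = a.length;
    double meanA = 0, meanB = 0;
    for (int i = 0; i < n; i++) { meanA += a[i]; meanB += b[i]; }
    meanA /= n;
    meanB /= n;
    double cov = 0, varA = 0, varB = 0;
    for (int i = 0; i < n; i++) {
        cov += (a[i] - meanA) * (b[i] - meanB);
        varA += (a[i] - meanA) * (a[i] - meanA);
        varB += (b[i] - meanB) * (b[i] - meanB);
    }
    return cov / Math.sqrt(varA * varB);
}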


2.2 Algorithms

2.2.1 K-Nearest Neighbour

K-Nearest Neighbour (KNN) is one of the most common recommendation algorithms. KNN is based on finding the nearest elements in a given collection: the algorithm finds the K most similar users (nearest neighbours) to a given user [5]. Generally, KNN uses the Euclidean distance to find the nearest neighbours. Based on the information from the closest neighbours, KNN predicts the rating.

Figure 2.3: K-Nearest Neighbour

In figure 2.3, the KNN algorithm tries to classify the unknown data point (the red dot) by comparing it with its K nearest neighbours. The unknown data point is assigned to the class that dominates within the boundary [5]. When k=3 there are two "x" and one "square", so the algorithm will classify the unknown data point as an "x". However, when k=5 the "squares" outnumber the "x", so the algorithm will classify the unknown data point as a "square".
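A minimal sketch of the neighbour search follows, assuming dense rating vectors and a k smaller than the number of users; LibRec's UserKNNRecommender, used in this study (see Appendix A.1), instead operates on sparse data with a configurable similarity.

int[] kNearestUsers(double[][] users, int target, int k) {
    // Sort all user indices by Euclidean distance to the target user's vector.
    Integer[] order = new Integer[users.length];
    for (int i = 0; i < users.length; i++) order[i] = i;
    java.util.Arrays.sort(order, (i, j) -> Double.compare(
            euclidean(users[target], users[i]), euclidean(users[target], users[j])));
    // Keep the k closest users, skipping the target itself.
    int[] neighbours = new int[k];
    for (int i = 0, taken = 0; taken < k; i++)
        if (order[i] != target) neighbours[taken++] = order[i];
    return neighbours;
}

double euclidean(double[] a, double[] b) {
    double sum = 0;
    for (int i = 0; i < a.length; i++) sum += (a[i] - b[i]) * (a[i] - b[i]);
    return Math.sqrt(sum);
}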

2.2.2 Slope one

Slope One produces predictions by looking at the average difference between a user's previously rated items and other users' ratings for the same items [6]. To predict a user's rating for an item X, Slope One finds an item the user has already rated, say item Y, and looks for other users that have rated both X and Y. From those users' ratings, the average difference between X and Y is computed. Finally, Slope One adds this average difference to the user's rating of Y, which yields the prediction for X [6].


Figure 2.4: To predict B's rating for item J, the Slope One algorithm calculates the difference between A's ratings for item J and item I, and adds that difference to B's rating for item I [6].
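The core of the scheme fits in a few lines. Below is a minimal sketch for a single co-rated item Y, assuming aligned arrays of the other users' ratings; the full algorithm averages over all items the user has rated.

double slopeOnePredict(double userRatingY, double[] othersX, double[] othersY) {
    // Average difference between item X and item Y over users who rated both.
    double deviation = 0;
    for (int i = 0; i < othersX.length; i++)
        deviation += othersX[i] - othersY[i];
    deviation /= othersX.length;
    // Shift the user's rating of Y by that average difference.
    return userRatingY + deviation;
}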

2.2.3 Singular Value Decomposition

Singular Value Decomposition (SVD) is a well-known algorithm based on the matrix factorization (MF) technique, which utilizes a latent factor model [7, 8]. Instead of calculating similarities to predict ratings, the latent factor model is trained on some known data by transforming both items and users to the same latent factor space, thus making them directly comparable [8]. For example, if a number of users like movie Y and also like movie Z, the latent factor model will group the two movies together in the same latent factor space to form an aggregate movie, or feature [9]. This makes it possible to compare two users by evaluating their ratings for features instead of individual movies.

2.2.4 Alternating Least Squares

Alternating Least Squares (ALS) is also based on MF and is very similar to SVD. The ALS algorithm utilizes a rating matrix r to predict ratings [10]. Matrix r contains the ratings given by $n_u$ users (rows) to $n_m$ movies (columns) [11]. However, matrix r will have many missing values, corresponding to unrated movies. In order to predict the missing ratings, ALS first factorizes the rating matrix into a product of two latent factor matrices, $r = um^T$, where the user matrix u is an $n_u \times n_{factor}$ matrix and the movie matrix m is an $n_m \times n_{factor}$ matrix. ALS then applies a two-step iterative optimization approach [12]: in the first step, u is fixed in order to estimate m using the least squares method; in the second step, the order is reversed and m is kept fixed in order to estimate u. This iteration continues until the matrices u and m no longer change, or the change is very small [13].
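To make the alternation concrete, here is a minimal rank-one sketch (a single latent factor), assuming a dense matrix where negative entries mark missing ratings and lambda is an L2 regularization weight; these representational choices are assumptions for illustration. The MFALSRecommender used in this study (Appendix A.1) implements the general $n_{factor}$ case.

void alsStep(double[][] r, double[] u, double[] m, double lambda) {
    // Step 1: keep m fixed and solve each u[i] by least squares
    // over that user's observed ratings.
    for (int i = 0; i < u.length; i++) {
        double num = 0, den = lambda;
        for (int j = 0; j < m.length; j++)
            if (r[i][j] >= 0) { num += r[i][j] * m[j]; den += m[j] * m[j]; }
        u[i] = num / den;
    }
    // Step 2: keep u fixed and solve each m[j] the same way.
    for (int j = 0; j < m.length; j++) {
        double num = 0, den = lambda;
        for (int i = 0; i < u.length; i++)
            if (r[i][j] >= 0) { num += r[i][j] * u[i]; den += u[i] * u[i]; }
        m[j] = num / den;
    }
}

The step is repeated until u and m stop changing, at which point the prediction for user i and movie j is u[i] * m[j].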

2.3 MovieLens data sets

The MovieLens data sets consist of users' ratings of movies. There are multiple sizes of data sets available, ranging from 100K ratings up to 24M ratings by some 260K users [14]. The data sets consist of two types of data: rating data and movie data. This paper uses the rating data, which consists of list entries of user ids, timestamps, movie ids and movie ratings between 1 and 5. The timestamps are disregarded, as this paper solely focuses on algorithms producing predictions from user/item pairs, i.e. user ids paired with a movie id and a 1-5 rating. Moreover, this paper produces subsets of data from the 20M data set, as this allows for avoiding typical recommender system problems such as the cold start problem, i.e. when a new user is added to the system with no previous rating data [15].

2.4 Error measurement

The Root-Mean-Square Error (RMSE) measurement is a widely used method to quantify a recommendation algorithm's performance. In 2009, Netflix awarded a team of researchers $1M in their quest to improve the Netflix movie recommendation system by reaching an RMSE of 0.8567, a 10% improvement of the system [16].

If an error is defined as $e = r_{predicted} - r_{actual}$, i.e. the difference between the predicted rating and the actual rating, then for a non-empty set of errors $e_1, e_2, \ldots, e_n$ the RMSE is calculated by:

$RMSE = \sqrt{\frac{1}{n}\sum_{i=1}^{n} e_i^2}$    (2.1)

Other measurements, such as the Mean Absolute Error (MAE), are also used as standard statistical metrics for performance in recommender systems. In general, the MAE is always smaller than the RMSE, since the errors are not squared when calculating the MAE [17]. As this paper focuses on improving predicted ratings using known mathematical formulas, no further measurements are needed: the RMSE values from the algorithms are directly comparable to the improved RMSE values calculated after the algorithms have run.
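Both measurements are straightforward to compute from paired predictions and actual ratings. A minimal sketch of equation 2.1, with MAE for comparison, follows; the study itself used LibRec's RMSEEvaluator (see Appendix A.1).

double rmse(double[] predicted, double[] actual) {
    // Square each error, average, then take the square root (equation 2.1).
    double sum = 0;
    for (int i = 0; i < predicted.length; i++) {
        double e = predicted[i] - actual[i];
        sum += e * e;
    }
    return Math.sqrt(sum / predicted.length);
}

double mae(double[] predicted, double[] actual) {
    // Average of the absolute errors; never larger than the RMSE.
    double sum = 0;
    for (int i = 0; i < predicted.length; i++)
        sum += Math.abs(predicted[i] - actual[i]);
    return sum / predicted.length;
}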

2.5 Improving results

Each algorithm produces predictions for the same users, and the RMSE is computed, which indicates how well the algorithm performed. As the RMSE is the quality measurement of the algorithms, improving the predictions is equivalent to improving the RMSE.

For every user/item pair, the four algorithms predicted a rating where the user had not rated the item. Therefore, new predictions can be computed by iterating through the produced predictions while applying the following mathematical models to all user/item pairs.

2.5.1 Arithmetic mean

The arithmetic mean (AM) sums the ratings produced by the algorithms and divides by the number of algorithms. If R is the set of predicted ratings r for a particular user/item pair, the new rating $R_{AM}$ is calculated by:

$R_{AM} = \frac{1}{|R|}\sum_{r \in R} r$    (2.2)

2.5.2 Weighted arithmetic mean

The weighted arithmetic mean (WAM) weights each algorithm's prediction by that algorithm's share of the total error: each weight w is the algorithm's RMSE divided by the sum of the four algorithms' RMSE values. If E is the set of weights and R the set of predicted ratings r for a particular user/item pair, the new rating $R_{WAM}$ is calculated by:

$R_{WAM} = \frac{1}{|R|}\cdot\frac{\sum_{r \in R,\, w \in E}(1-w)\,r}{\sum_{w \in E} w}$    (2.3)
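The following is a small sketch of how the weights and a WAM rating are formed, mirroring the WAM method in the test suite of Appendix A.1. The RMSE values are the 100K results from Table 4.1, while the per-pair predictions r are made-up numbers for illustration.

double weightedArithmeticMean() {
    // RMSE per algorithm (SVD, Slope One, KNN, ALS) from Table 4.1 (100K).
    double[] rmse = {0.9004, 0.9177, 0.9341, 1.2027};
    double rmseSum = 0;
    for (double v : rmse) rmseSum += v;

    // Each weight is the algorithm's share of the total error.
    double[] w = new double[rmse.length];
    for (int i = 0; i < rmse.length; i++) w[i] = rmse[i] / rmseSum;

    // Hypothetical predictions for one user/item pair, in the same order.
    double[] r = {4.0, 3.8, 4.1, 3.2};
    double wam = 0;
    for (int i = 0; i < r.length; i++) wam += (1 - w[i]) * r[i];
    // The weights sum to 1, so the denominator in (2.3) reduces to |R|.
    return wam / r.length;
}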


Chapter 3

Method

In the following chapter the methodology used is presented. We begin by explaining the pre-processing of the data set provided by MovieLens. Thereafter, the software that was used is presented, together with an explanation of how the algorithms were tested and how the calculations for improved ratings were done. Lastly, a brief summary of the utilized hardware is given.

3.1 Data validation and subset generation

The data sets used in this study were generated from the MovieLens 20M data set. Due to the large amount of data, a small program was written [Appendix A.2 MovieLensConverter] to validate the data and generate smaller data sets. The 20 million ratings were stripped down to 100K, 500K and 1M data sets. The data entries consist of a user id, a movie id and a rating from 1 to 5.

The above mentioned program iterated through the 20M ratings and validated each data entry. By verifying the increasing order of the data with respect to user ids, and by verifying all data entries and the values within, the data was considered valid. The final generated subsets are listed below.

Table 3.1: Summary of data sets used in this study

                    100K       500K       1M
Number of users     ∼700       ∼3300      ∼6750
Number of ratings   ∼100000    ∼500000    ∼1000000
Number of movies    ∼27000     ∼27000     ∼27000

3.2 Software

The software used in this study was the LibRec Java library (version 2.0), which is open source under the GPL licence [18]. All of the recommendation algorithms were executed with the default LibRec configurations. The data sets were split into two sets, a training set and a test set. This was done with the LibRec data splitter ratio set to 0.8, i.e. generating a training set consisting of 80% of the ratings and a test set of the remaining 20%. Moreover, additions to the library were made to enable usage of the library's internal evaluator for the calculated AM and WAM ratings [Appendix A.3 getRawRecommendedList].
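As a minimal sketch of this setup, the split and data format are configured through string keys on a LibRec Configuration, exactly as in the full test suite of Appendix A.1; the data path here is shortened for illustration, and the buildDataModel() call is the LibRec 2.0 data model API as used in the appendix.

Configuration conf = new Configuration();
conf.set("dfs.data.dir", "/path/to/movielens");      // illustrative path
conf.set("data.input.path", "1M");                   // which subset to load
conf.set("data.column.format", "UIR");               // user, item, rating
conf.set("data.model.format", "text");
conf.set("data.model.splitter", "ratio");
conf.set("data.splitter.ratio", "rating");
conf.set("data.splitter.trainset.ratio", "0.8");     // 80% training, 20% test
conf.set("rec.random.seed", "1");
TextDataModel dataModel = new TextDataModel(conf);
dataModel.buildDataModel();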


3.3 Test suite

The AM and WAM RMSE values were calculated after the execution of the algorithms, using the stored data from the algorithm testing. For each user/item pair, an AM and a WAM rating were calculated from the predictions generated by the algorithms. These new ratings were then fed into LibRec's internal RMSE evaluator.

Each sequential test run was done ten times, over all data sets (100K, 500K, 1M) with the default configurations of the algorithms. This was done to enforce consistency in the results.

3.4 Hardware

These tests were run on a MacBook Pro (unibody, mid 2012) running OS X 10.10.5 with a 2.6 GHz Intel Core i7-3720QM and 16 GB of 1600 MHz DDR3 memory.


Chapter 4

Results

In the following chapter the results from the execution of the algorithms Slope One, KNN, SVD and ALS are presented. In addition, the results of the improvement models AM and WAM are presented, followed by a comparison of the results per data set. All numerical values presented are the mean of ten independent tests.

4.1 Algorithms

Figure 4.1: RMSE values from the testing of Slope One with data sets 100K, 500K and 1M as input.

The Slope One algorithm produced RMSE values for each data set: 100K=0.9177, 500K=0.8668, 1M=0.8598. Rating performance increased by 5.872% from 100K to 500K, and by 6.733% from 100K to 1M.


Figure 4.2: RMSE values from the testing of KNN with data sets 100K, 500K and 1M as input.

The KNN algorithm produced RMSE values for each data set: 100K=0.9341, 500K=0.8779, 1M=0.8690. Rating performance increased by 6.395% from 100K to 500K, and by 7.489% from 100K to 1M.


Figure 4.3: RMSE values from the testing of SVD with data sets 100K, 500K and 1M as input.

The SVD algorithm produced RMSE values for each data set: 100K=0.9004, 500K=0.8366, 1M=0.8190. Rating performance increased by 7.622% from 100K to 500K, and by 9.936% from 100K to 1M.


Figure 4.4: RMSE values from the testing of ALS with data sets 100K, 500K and 1M as input.

The ALS algorithm produced RMSE values for each data set: 100K=1.2027, 500K=0.9473, 1M=0.8854. Rating performance increased by 33.50% from 100K to 500K, and by 35.84% from 100K to 1M.


4.2 Improvement models

Figure 4.5: RMSE values from the improvement model of AM with data sets 100K, 500K and 1M.

The AM mathematical improvement model produced RMSE values for each data set: 100K=0.8918, 500K=0.8274, 1M=0.8125. Rating performance increased by 9.581% from 100K to 500K, and by 9.757% from 100K to 1M.


Figure 4.6: RMSE values from the improvement model of WAM with data sets 100K, 500K and 1M.

The WAM mathematical improvement model produced RMSE values for each data set: 100K=1.2668, 500K=1.2230, 1M=1.2148. Rating performance increased by 4.249% from 100K to 500K, and by 4.278% from 100K to 1M.


4.3 Comparison

Figure 4.7: RMSE values for AM, SVD, Slope One, KNN, ALS and WAM over the 100K data set, ordered by decreasing rating performance from left to right.

As can be seen from the diagram, SVD produced the best predictions among the algorithms over the 100K data set, closely followed by Slope One and KNN. ALS performed poorly in comparison to the other algorithms. The improvement models performed both best and worst: the AM predictions were slightly better than those of the best performing algorithm, SVD, while WAM produced the worst predictions.

Table 4.1: RMSE results and differences relative to AM and WAM over the 100K data set

       AM       SVD      Slope One  KNN      ALS      WAM
RMSE   0.8918   0.9004   0.9177     0.9341   1.2027   1.2668
ΔAM    0        0.0086   0.0259     0.0423   0.3109   0.3750
ΔWAM   0.3750   0.3664   0.3491     0.3327   0.0641   0


Figure 4.8: RMSE values for AM, SVD, Slope One, KNN, ALS and WAM over the 500K data set, ordered by decreasing rating performance from left to right.

As with the 100K data set, the resulting RMSE values show consistency in the performance of the prediction methods relative to one another. Apart from an overall increase in performance, the ALS algorithm shows a markedly greater improvement when fed the larger 500K data set.

Table 4.2: RMSE results and differences relative to AM and WAM over the 500K data set

       AM       SVD      Slope One  KNN      ALS      WAM
RMSE   0.8274   0.8366   0.8668     0.8779   0.9473   1.2230
ΔAM    0        0.0092   0.0394     0.0505   0.1199   0.3956
ΔWAM   0.3956   0.3864   0.3562     0.3451   0.2757   0


Figure 4.9: RMSE values for AM, SVD, Slope One, KNN, ALS and WAM over the 1M data set, ordered by decreasing rating performance from left to right.

As in the 100K and 500K comparisons, AM, SVD, Slope One, KNN and ALS show an increase in the performance of the predicted ratings over the 1M data set. All prediction methods perform better when fed larger amounts of data, although WAM shows poor performance regardless of data set.

Table 4.3: RMSE results and differences relative to AM and WAM over the 1M data set

       AM       SVD      Slope One  KNN      ALS      WAM
RMSE   0.8125   0.8190   0.8598     0.8690   0.8854   1.2148
ΔAM    0        0.0065   0.0472     0.0565   0.0728   0.4023
ΔWAM   0.4023   0.3958   0.3551     0.3458   0.3295   0

Chapter 5

Discussion and conclusion

5.1 Discussion

The results show an overall increase in the performance of both the algorithms and the mathematical improvement models as the data size grows. Moreover, the individual increase in performance of each algorithm implies a great difference in their effectiveness, especially between the matrix factorization algorithms SVD and ALS.

SVD produced the best ratings among the algorithms, with an RMSE of 0.8190 on the 1M data set and a performance increase of 9.936% from the 100K to the 1M data set. This was expected, as the winners of the Netflix Prize used a modified version of the SVD algorithm in their solution [19].

The worst performing algorithm was ALS, with an RMSE of 0.8854 on the 1M data set. On the other hand, ALS had the largest increase in performance, 35.84% from the 100K to the 1M data set. It is possible that ALS would produce better predictions on larger data sets.

The less complex algorithms, i.e. the non-matrix-factorization algorithms Slope One and KNN, performed relatively well in comparison to ALS. Slope One produced an RMSE of 0.8598 and KNN an RMSE of 0.8690 on the 1M data set. Slope One had the smallest increase in performance, 6.733% from the 100K to the 1M data set, closely followed by KNN with an increase of 7.489%.

The RMSE values indicate how well these algorithms predict ratings for non-rated items. Based on these results, SVD is the recommended choice among the four algorithms for a recommender system in production. However, when complexity becomes a factor, these results cannot be the sole basis for deciding which algorithm to put into production. Instead, for an effective system, both execution time and input data size should be considered when implementing a recommender system. This was the case with the Netflix Prize winning algorithm, which was never put into production although it improved the RMSE by 10% [20].

The mathematical improvement models AM and WAM used the resulting predictions from the four algorithms to produce new ratings. Of the two, the AM improvement model produced the overall best ratings, with an RMSE of 0.8125 on the 1M data set, while WAM performed significantly worse than AM and the algorithms in all test cases, with an RMSE of 1.2148 on the same data set.

Since the mathematical improvement models did not have any major beneficial impact on the resulting RMSE values, one might argue that these models are useless. However, a possible area for future research would be to investigate whether the AM model produces proportionally better results, both complexity and RMSE considered, than the more complex matrix factorization algorithms.

5.2 Methodology criticism

In this study, only a small portion of the data available from MovieLens was used. As the algorithms showed different performance increases depending on the input data size, this study's conclusion might have been different with larger data sets. Unfortunately, limitations of the LibRec library forced the study to use smaller data sets.

During execution of the test suite on data sets larger than 1M, the KNN algorithm used more than the available 16 GB of memory. This put the Java Virtual Machine into constant garbage collection and the execution terminated. By executing the KNN algorithm in isolation, an attempt was made to rule out a poor implementation of the algorithm; the issue was still present during that execution. As a result, the smaller data sets were used.

5.3 Conclusion

This paper focused on comparing four different collaborative filtering algorithms, with the aim of finding out which one produced the best prediction rate. The four algorithms were KNN, SVD, ALS and Slope One. This paper also used two mathematical models, the arithmetic mean and the weighted arithmetic mean, in order to determine whether the mathematical models could produce better prediction ratings than the four algorithms.

Out of the four algorithms, SVD had the best prediction rate whereas ALS had the worst. However, ALS had the highest performance improvement when the data increased from 100K ratings to 1M ratings. AM had a slightly better prediction rate than SVD, and WAM had the overall worst prediction rate.

Bibliography

[3] J. Bobadilla, F. Ortega, A. Hernando, and A. Gutiérrez. Recommender systems survey. Knowledge-Based Systems, 2013. URL https://pdfs.semanticscholar.org/34b4/9026d61b3c308bce9ba5cf4ae8305f09d0e2.pdf.

[4] Yehuda Koren, Robert Bell, and Chris Volinsky. Matrix factorization techniques for recommender systems, 2009. URL https://datajobs.com/data-science-repo/Recommender-Systems-%5BNetflix%5D.pdf.

[5] Computer Science Comps Project. k nearest neighbour, 2011. URL http://cs.carleton.edu/cs_comps/0910/netflixprize/final_results/knn/index.html.

[6] Lyndsey Clevesy. Slope-one recommender - exclusive article from Mahout in Action, 2010. URL https://dzone.com/articles/slope-one-recommender.

[7] Latent factor models: Matrix factorization methods, 2014. URL http://recommender.no/algorithms/latent-factor-models-matrix-factorization-methods/.

[8] Yehuda Koren. Factorization meets the neighborhood: a multifaceted collaborative filtering model, 2008. URL http://www.cs.rochester.edu/twiki/pub/Main/HarpSeminar/Factorization_Meets_the_Neighborhood-_a_Multifaceted_Collaborative_Filtering_Model.pdf.

[9] Espy Yonder. Singular value decomposition to create a bench marking data set from MovieLens data, 2014. URL http://maheshakya.github.io/gsoc/2014/05/18/preparing-a-bench-marking-data-set-using-singula-value-decomposition-on-movielens-data.html.

[10] Russell Albright, James Cox, David Duling, Amy N. Langville, and Carl D. Meyer. Algorithms, initializations, and convergence for the nonnegative matrix factorization, 2014. URL http://www.dm.unibo.it/~simoncin/NMFInitAlgConv.pdf.

[11] Bregt Verreet. The alternating least squares algorithm in recommenderlab, 2016. URL http://www.infofarm.be/articles/alternating-least-squares-algorithm-recommenderlab.

[12] Jesse Steinweg-Woods. A gentle introduction to recommender systems with implicit feedback, 2016. URL https://jessesw.com/Rec-System/.

[13] Bugra Akyildiz. Alternating least squares method for collaborative filtering, 2014. URL https://bugra.github.io/work/notes/2014-04-19/alternating-least-squares-method-for-collaborative-filtering/.

[14] MovieLens, 2017. URL https://movielens.org/.

[15] Blerina Lika, Kostas Kolomvatsos, and Stathes Hadjiefthymiades. Facing the cold start problem in recommender systems. Expert Systems with Applications, 2013. URL http://www.sciencedirect.com.focus.lib.kth.se/science/article/pii/S0957417413007240.

[16] Netflix, 2009. URL http://www.netflixprize.com/.

[17] T. Chai and R. R. Draxler. Root mean square error (RMSE) or mean absolute error (MAE)? – Arguments against avoiding RMSE in the literature, 2014. URL http://www.geosci-model-dev.net/7/1247/2014/gmd-7-1247-2014.pdf.

[18] LibRec: A Java library for recommender systems, ver 2.0 [software], 2017. URL https://www.librec.net/.

[19] Yehuda Koren. The BellKor solution to the Netflix Grand Prize, 2009. URL http://www.netflixprize.com/assets/GrandPrize2009_BPC_BellKor.pdf.

[20] Xavier Amatriain and Justin Basilico. Netflix recommendations: Beyond the 5 stars (part 1), 2012. URL https://medium.com/netflix-techblog/netflix-recommendations-beyond-the-5-stars-part-1-55838468f429.

Appendix A

Utilized code and additions to LibRec library

A.1 TestSuite

import java.io.FileNotFoundException;
import java.io.FileWriter;
import java.io.IOException;
import java.io.UnsupportedEncodingException;
import java.util.HashMap;
import java.util.Iterator;
import java.util.LinkedList;

import net.librec.common.LibrecException;
import net.librec.conf.Configuration;
import net.librec.data.model.TextDataModel;
import net.librec.eval.RecommenderEvaluator;
import net.librec.eval.rating.RMSEEvaluator;
import net.librec.recommender.Recommender;
import net.librec.recommender.RecommenderContext;
import net.librec.recommender.cf.UserKNNRecommender;
import net.librec.recommender.cf.rating.MFALSRecommender;
import net.librec.recommender.cf.rating.SVDPlusPlusRecommender;
import net.librec.recommender.ext.SlopeOneRecommender;
import net.librec.recommender.item.RecommendedItemList;
import net.librec.recommender.item.RecommendedList;
import net.librec.recommender.item.UserItemRatingEntry;
import net.librec.similarity.PCCSimilarity;
import net.librec.similarity.RecommenderSimilarity;

public class TestSuite {

    // Algorithms
    final int KNN = 0;
    final int ALS = 1;
    final int SVD = 2;
    final int ONE = 3;
    final int nAlgorithms = 4;

    Configuration conf = null;
    TextDataModel MotherDataModel = null;
    OutputData data = new OutputData();
    final String dataFolder = "1M";
    final String dataPath = "/Users/victorwegeborn/Documents" +
            "/KTH/vt17/kexdata/movielens/20M";

    class OutputData {
        double[] rmse = new double[nAlgorithms];
        double[] interp_rmse = new double[2];
        double rmseSum = 0;
        int numberOfUsers = 0;
        HashMap<Integer, RecommendedList> output =
                new HashMap<Integer, RecommendedList>();
    };

    public static void main(String[] args) throws Exception {
        TestSuite t = new TestSuite();
        t.setupData();
        t.SlopeOne();
        t.KNN();
        t.SVD();
        t.MFALS();
        t.EvaluteAlgorithms();
        t.OutputResults();
    }

    void setupData() throws LibrecException {
        conf = new Configuration();
        conf.set("dfs.data.dir", dataPath);
        conf.set("dfs.result.dir",
                "/Users/victorwegeborn/Documents/KTH/vt17/kexdata/libec/result");
        conf.set("data.input.path", dataFolder);
        conf.set("data.column.format", "UIR");
        conf.set("data.model.splitter", "ratio");
        conf.set("data.splitter.ratio", "rating");
        conf.set("data.model.format", "text");
        conf.set("data.splitter.trainset.ratio", "0.8");
        conf.set("rec.random.seed", "1");
        conf.set("data.convert.binarize.threshold", "-1.0");

        // build the shared data model used by all four algorithms
        MotherDataModel = new TextDataModel(conf);
        MotherDataModel.buildDataModel();
    }

    void SlopeOne() throws Exception {
        Configuration ONEconf = new Configuration();
        ONEconf.set("rec.recommender.isranking", "false");
        ONEconf.set("rec.recommender.class", "slopeone");
        ONEconf.set("rec.eval.enable", "true");
        ONEconf.set("rec.iterator.maximum", "50");
        ONEconf.set("rec.factory.number", "30");
        ONEconf.set("rec.iterator.learn.rate", "0.001");
        ONEconf.set("rec.recommender.lambda.user", "0.05");
        ONEconf.set("rec.recommender.lambda.item", "0.05");

        // build recommender context
        TextDataModel dataModel = MotherDataModel;
        RecommenderContext context = new RecommenderContext(ONEconf, dataModel);

        // build recommender
        Recommender recommender = new SlopeOneRecommender();
        recommender.setContext(context);

        // run recommender algorithm
        recommender.recommend(context);

        // evaluate the recommended result
        RecommenderEvaluator evaluator = new RMSEEvaluator();
        data.rmse[ONE] = recommender.evaluate(evaluator);
        data.output.put(ONE, recommender.getRawRecommendedList());
    }

    void MFALS() throws Exception {
        Configuration ALSconf = new Configuration();
        ALSconf.set("rec.recommender.similarity.key", "user");
        ALSconf.set("rec.recommender.class", "mfals");
        ALSconf.set("rec.iterator.learnrate", "0.01");
        ALSconf.set("rec.iterator.learnrate.maximum", "0.01");
        ALSconf.set("rec.iterator.maximum", "100");
        ALSconf.set("rec.user.regularization", "0.01");
        ALSconf.set("rec.item.regularization", "0.01");
        ALSconf.set("rec.factor.number", "10");
        ALSconf.set("rec.learnrate.bolddriver", "false");
        ALSconf.set("rec.learnrate.decay", "1.0");

        // set recommender context
        TextDataModel dataModel = MotherDataModel;
        RecommenderContext context = new RecommenderContext(ALSconf, dataModel);

        // build similarity
        RecommenderSimilarity similarity = new PCCSimilarity();
        similarity.buildSimilarityMatrix(dataModel);
        context.setSimilarity(similarity);

        // build recommender
        Recommender recommender = new MFALSRecommender();
        recommender.setContext(context);

        // run recommender algorithm
        recommender.recommend(context);

        // evaluate the recommended result
        RecommenderEvaluator evaluator = new RMSEEvaluator();
        data.rmse[ALS] = recommender.evaluate(evaluator);
        data.output.put(ALS, recommender.getRawRecommendedList());
    }

    void SVD() throws Exception {
        Configuration SVDconf = new Configuration();
        SVDconf.set("rec.recommender.similarity.key", "user");
        SVDconf.set("rec.recommender.class", "svdpp");
        SVDconf.set("rec.iterator.learnrate", "0.01");
        SVDconf.set("rec.iterator.learnrate.maximum", "0.01");
        SVDconf.set("rec.iterator.maximum", "13");
        SVDconf.set("rec.user.regularization", "0.01");
        SVDconf.set("rec.item.regularization", "0.01");
        SVDconf.set("rec.impItem.regularization", "0.001");
        SVDconf.set("rec.factor.number", "10");
        SVDconf.set("rec.learnrate.bolddriver", "false");
        SVDconf.set("rec.learnrate.decay", "1.0");

        // build recommender context
        TextDataModel dataModel = MotherDataModel;
        RecommenderContext context = new RecommenderContext(SVDconf, dataModel);

        // build similarity
        RecommenderSimilarity similarity = new PCCSimilarity();
        similarity.buildSimilarityMatrix(dataModel);
        context.setSimilarity(similarity);

        // build recommender
        Recommender recommender = new SVDPlusPlusRecommender();
        recommender.setContext(context);

        // run recommender algorithm
        recommender.recommend(context);

        // evaluate the recommended result
        RecommenderEvaluator evaluator = new RMSEEvaluator();
        data.rmse[SVD] = recommender.evaluate(evaluator);
        data.output.put(SVD, recommender.getRawRecommendedList());
    }

    void KNN() throws Exception {
        Configuration KNNconf = new Configuration();
        KNNconf.set("rec.similarity.class", "pcc");
        KNNconf.set("rec.neighbors.knn.number", "80");
        KNNconf.set("rec.recommender.class", "userknn");
        KNNconf.set("rec.recommender.similarities", "user");
        KNNconf.set("rec.recommender.isranking", "false");
        KNNconf.set("rec.recommender.ranking.topn", "10");
        KNNconf.set("rec.filter.class", "generic");
        KNNconf.set("rec.similarity.shrinkage", "25");
        KNNconf.set("rec.recommender.verbose", "true");

        // build recommender context
        TextDataModel dataModel = MotherDataModel;
        RecommenderContext context = new RecommenderContext(KNNconf, dataModel);

        // build similarity
        RecommenderSimilarity similarity = new PCCSimilarity();
        similarity.buildSimilarityMatrix(dataModel);
        context.setSimilarity(similarity);

        // build recommender
        Recommender recommender = new UserKNNRecommender();
        recommender.setContext(context);

        // run recommender algorithm
        recommender.recommend(context);

        // evaluate the recommended result
        RecommenderEvaluator evaluator = new RMSEEvaluator();
        data.rmse[KNN] = recommender.evaluate(evaluator);
        data.output.put(KNN, recommender.getRawRecommendedList());
    }

    void EvaluteAlgorithms() throws Exception {
        System.err.println("============= RESULTS =============");
        System.out.println("RMSE VALUES");
        System.out.println("SLOPE ONE :: RMSE = " + data.rmse[ONE]);
        System.out.println("KNN       :: RMSE = " + data.rmse[KNN]);
        System.out.println("SVD++     :: RMSE = " + data.rmse[SVD]);
        System.out.println("ALS       :: RMSE = " + data.rmse[ALS]);

        data.rmseSum = data.rmse[ONE] + data.rmse[ALS] +
                data.rmse[KNN] + data.rmse[SVD];

        int ALSsize = data.output.get(ALS).size();
        int SVDsize = data.output.get(SVD).size();
        int KNNsize = data.output.get(KNN).size();
        int ONEsize = data.output.get(ONE).size();

        if (!(ALSsize == SVDsize && SVDsize == KNNsize && KNNsize == ONEsize))
            throw new Exception("OUTPUT DATA LISTS ARE NOT OF SAME LENGTH");

        Iterator<UserItemRatingEntry> ALSiterator =
                data.output.get(ALS).entryIterator();
        Iterator<UserItemRatingEntry> KNNiterator =
                data.output.get(KNN).entryIterator();
        Iterator<UserItemRatingEntry> SVDiterator =
                data.output.get(SVD).entryIterator();
        Iterator<UserItemRatingEntry> ONEiterator =
                data.output.get(ONE).entryIterator();

        RecommendedList AMrating = new RecommendedItemList(data.numberOfUsers);
        RecommendedList WAMrating = new RecommendedItemList(data.numberOfUsers);

        double[] factors = {data.rmse[ALS] / data.rmseSum,
                data.rmse[KNN] / data.rmseSum,
                data.rmse[SVD] / data.rmseSum,
                data.rmse[ONE] / data.rmseSum};

        while (ALSiterator.hasNext() && KNNiterator.hasNext()
                && SVDiterator.hasNext() && ONEiterator.hasNext()) {
            // collect the four algorithms' entries for the current user/item pair
            LinkedList<UserItemRatingEntry> entries =
                    new LinkedList<UserItemRatingEntry>();
            entries.add(ALSiterator.next());
            entries.add(KNNiterator.next());
            entries.add(SVDiterator.next());
            entries.add(ONEiterator.next());

            int userID = entries.getFirst().getUserIdx();
            int itemIdx = entries.getFirst().getItemIdx();
            LinkedList<Double> ratings = new LinkedList<Double>();

            for (UserItemRatingEntry e : entries) {
                if (e.getUserIdx() == userID && e.getItemIdx() == itemIdx) {
                    ratings.add(e.getValue());
                } else
                    throw new Exception("Iteration through recommendation lists are non uniform");
            }

            AMrating.addUserItemIdx(userID, itemIdx, AM(ratings));
            WAMrating.addUserItemIdx(userID, itemIdx, WAM(ratings, factors));
        }

        // Make sure no more users are left in the lists
        if (KNNiterator.hasNext()) throw new Exception("User/item pairs not processed");
        if (SVDiterator.hasNext()) throw new Exception("User/item pairs not processed");
        if (ONEiterator.hasNext()) throw new Exception("User/item pairs not processed");

        RMSEEvaluator evaluator = new RMSEEvaluator();
        data.interp_rmse[0] =
                evaluator.evaluate(MotherDataModel.getDataSplitter().getTestData(), AMrating);
        data.interp_rmse[1] =
                evaluator.evaluate(MotherDataModel.getDataSplitter().getTestData(), WAMrating);

        System.out.println("~ AM  :: RMSE = " + data.interp_rmse[0]);
        System.out.println("~ WAM :: RMSE = " + data.interp_rmse[1]);
    }

    double AM(LinkedList<Double> values) {
        double sum = 0;
        for (Double value : values)
            sum += value;
        return sum / values.size();
    }

    double WAM(LinkedList<Double> values, double[] factors) {
        double sum = 0;
        for (int i = 0; i < values.size(); i++) {
            sum += ((1 - factors[i]) * values.get(i));
        }
        return sum / (values.size());
    }

    void OutputResults() throws UnsupportedEncodingException,
            FileNotFoundException, IOException {
        try (FileWriter writer = new FileWriter(dataPath +
                "/results/results.txt", true)) {
            StringBuilder sb = new StringBuilder();
            sb.append("================= BEGIN ===================\n");
            sb.append("RESULTS :: " + dataFolder + "\n");
            sb.append("SLOPE ONE :: " + data.rmse[ONE] + "\n");
            sb.append("KNN       :: " + data.rmse[KNN] + "\n");
            sb.append("SVD       :: " + data.rmse[SVD] + "\n");
            sb.append("MFALS     :: " + data.rmse[ALS] + "\n");
            sb.append("AM        :: " + data.interp_rmse[0] + "\n");
            sb.append("WAM       :: " + data.interp_rmse[1] + "\n");
            sb.append("================= END =====================\n\n");
            writer.append(sb.toString());
        }
    }
}

A.2 MovieLensConverter

import java.io.BufferedReader;
import java.io.BufferedWriter;
import java.io.FileNotFoundException;
import java.io.FileOutputStream;
import java.io.FileReader;
import java.io.IOException;
import java.io.OutputStreamWriter;
import java.io.UnsupportedEncodingException;
import java.io.Writer;
import java.util.Collections;
import java.util.HashMap;
import java.util.Iterator;
import java.util.LinkedList;

public class MovieLensConverter {

    class DataMatrix {
        LinkedList<Integer> userIds = new LinkedList<Integer>();
        LinkedList<Integer> movieIds = new LinkedList<Integer>();
        LinkedList<Double> ratings = new LinkedList<Double>();
        HashMap<Integer, Integer> ratingsPerUser =
                new HashMap<Integer, Integer>();
    }

    static final String path20MilInput =
            "/Users/victorwegeborn/Documents/KTH/vt17/kexdata/movielens/20M/csv/ratings.csv";
    static final String pathOutput =
            "/Users/victorwegeborn/Documents/KTH/vt17/kexdata/movielens/20M/txt/ratings_";
    static final String path100KilInput =
            "/Users/victorwegeborn/Documents/KTH/vt17/kexdata/movielens/100K/csv/ratings.csv";
    static final String path100KilOutput =
            "/Users/victorwegeborn/Documents/KTH/vt17/kexdata/movielens/100K/txt/ratings20M.txt";

    DataMatrix data = new DataMatrix();

    public static void main(String[] args) throws Exception {
        MovieLensConverter mlc = new MovieLensConverter();
        mlc.cvsToDataMatrix();
        mlc.examineData();
        mlc.writeDataToFile();
    }


    void cvsToDataMatrix() throws IOException {
        int failedReads = 0;
        // Read each line and put inside matrix
        try (BufferedReader br = new BufferedReader(new FileReader(path20MilInput))) {
            String line;
            while ((line = br.readLine()) != null) {
                // columns[0] := user id
                // columns[1] := movie id
                // columns[2] := rating
                // columns[3] := timestamp
                String[] columns = line.split(",");
                try {
                    int userID = Integer.parseInt(columns[0].trim());
                    int movieID = Integer.parseInt(columns[1].trim());
                    double rating = Double.parseDouble(columns[2].trim());
                    data.userIds.addLast(userID);
                    data.movieIds.addLast(movieID);
                    data.ratings.addLast(rating);
                } catch (Exception e) {
                    // escapes the top line with strings (the csv header)
                    failedReads++;
                    System.err.println(e);
                }
            }
        }
        System.err.println("Number of failed reads: " + (failedReads - 1));
    }

    void examineData() throws Exception {
        // Examine dimensions of dataMatrix
        int dimCol1 = data.userIds.size();
        int dimCol2 = data.movieIds.size();
        int dimCol3 = data.ratings.size();
        System.out.println("Dimensions of columns. 1: " + dimCol1 +
                ", 2: " + dimCol2 + ", 3: " + dimCol3 + ".");

        // Throw exception if dimensions are off
        if (dimCol1 != dimCol2 || dimCol1 != dimCol3 || dimCol3 != dimCol2)
            throw new Exception("<DataMatrix> Column sizes are not equal");

        // Examine the largest movie id
        int LargestMovieIdNumber = Collections.max(data.movieIds);
        System.out.println("Largest movie id: " + LargestMovieIdNumber);

        // Examine if all users have rated
        // and calculate each time a user has rated
        Iterator<Integer> it = data.userIds.listIterator();
        while (it.hasNext()) {
            int ID = it.next();
            if (data.ratingsPerUser.containsKey(ID))
                data.ratingsPerUser.put(ID, 1 + data.ratingsPerUser.get(ID));
            else
                data.ratingsPerUser.put(ID, 1);
        }
    }

    void writeDataToFile() throws UnsupportedEncodingException,
            FileNotFoundException, IOException {
        Iterator<Integer> userIterator = data.userIds.iterator();
        Iterator<Integer> movieIterator = data.movieIds.iterator();
        Iterator<Double> ratingIterator = data.ratings.iterator();
        int counter = 0;
        for (int i = 1; ; i++) {
            // write at most 50000 entries per output file
            try (Writer writer = new BufferedWriter(new OutputStreamWriter(
                    new FileOutputStream(pathOutput + i + ".txt"), "utf-8"))) {
                while (userIterator.hasNext()) {
                    int user = userIterator.next();
                    int movie = movieIterator.next();
                    double rating = ratingIterator.next();
                    StringBuilder sb = new StringBuilder();
                    sb.append(user);
                    sb.append(" ");
                    sb.append(movie);
                    sb.append(" ");
                    sb.append(rating);
                    sb.append("\n");
                    writer.write(sb.toString());
                    counter++;
                    if ((counter % 50000) == 0) {
                        writer.close();
                        break;
                    }
                }
                if (!userIterator.hasNext()) {
                    break;
                }
            }
        }
        System.err.println("Write to file done");
    }
}

A.3 getRawRecommendedList

RecommendedList getRawRecommendedList();

The line above was added to the interface Recommender.java inside the LibRec library. The implementation of the method, which was added to the class AbstractRecommender.java, follows below.

public RecommendedList getRawRecommendedList() {
    if (recommendedList != null)
        return recommendedList;
    return null;
}
