
Matrix factorization in recommender systems

How sensitive are matrix factorization models to sparsity?

By Zakris Strömqvist

Department of Statistics

Uppsala University

Supervisor: Patrik Andersson

2018


Abstract

One of the most popular methods in recommender systems is matrix factorization (MF). In this paper, the sensitivity of these models to sparsity is investigated using a simulation study. Using the MovieLens dataset as a base, several dense matrices are created.

These dense matrices are then made sparse in two different ways to simulate different kinds of data. The accuracy of MF is then measured on each of the simulated sparse matrices.

The results show that matrix factorization models are sensitive to the degree of information available. For high levels of sparsity MF performs badly, but as the information level increases the accuracy of the models improves, for both samples.

Keywords: Recommender systems, Collaborative filtering, Matrix factorization


Contents

1 Introduction
  1.1 Purpose and outline
2 Background
  2.1 Recommender systems
  2.2 Collaborative filtering
  2.3 Matrix factorization
    2.3.1 Bias
    2.3.2 MF with bias
  2.4 Estimation
3 Method
  3.1 Data
    3.1.1 Descriptive statistics
  3.2 Design
  3.3 Generating sparse matrices
  3.4 Evaluation
  3.5 Models
4 Results
5 Discussion and conclusions
References


1 Introduction

In the age of information, recommendations have become a large part of how we are provided with new content on the internet. Most internet users have by now been in touch with a recommender system in some form. Many webpages use recommender systems to predict content they think the user would like, buy, and so on. As a research area of its own, recommender systems is quite new; the research really took off in the 1990s (Ricci et al., 2011), and the area was popularized by the famous Netflix prize in 2005 (Netflix, 2005). This was a competition held by Netflix with the aim of improving its recommendation algorithm, and a million-dollar prize was awarded in 2009, which shows how highly these systems and their performance are valued by companies. Large sites like Netflix and YouTube try to predict which movie or video the user would like to view next, and other large sites that rely heavily on recommender systems include Amazon and Facebook.

In essence, a recommender system tries to predict how much a user likes an item. To do this, information about the user's past behaviour is used. For example, Amazon uses your ratings on previous items, together with other users' ratings, when trying to predict whether you will like a book or not. The general problem for recommender systems is that they often have very little data. The problem can be seen as a user-item rating matrix with users on the rows and items on the columns. Since users in most cases have not seen, read or used the majority of the items, most of the data is missing: the matrix is sparse. The level of sparsity is the share of the data that is missing, and often over 95% of the data can be missing. This is especially the case for new companies or products with very little data. How hard should the users be pushed for more feedback in such cases? Knowing whether it is worth pushing the users for more ratings and feedback is an important decision, and knowing how much the new information improves the recommendations can be crucial.

One of the most popular models in recommender systems is matrix factorization. It is categorized as a latent factor model that tries to find latent structures and dynamics in the observed ratings.


Matrix factorization in recommender systems is today a much-studied field, but few, if any, have studied how well these models perform under different levels of sparsity. This is in part because of the problem of generating data: simulating user and item preferences/factors directly might be impossible. In this thesis we try to solve this problem by using an existing dataset for recommender systems as a basis for creating a dense matrix. From this dense matrix we create sparse matrices for different levels of sparsity and evaluate how well the matrix factorization models perform.

1.1 Purpose and outline

The aim of this study is to investigate if matrix factorization models for recommender systems are sensitive to the level of sparsity in the data.

The outline is as follows. A background to recommender systems and their different methods is given in section 2, which also includes the theoretical background for the matrix factorization models used in recommender systems. In section 3 the methodology of the simulation study is described, as well as the data. The results are then presented in section 4, followed by conclusions and a discussion in section 5.


2 Background

2.1 Recommender systems

Recommender systems are software designed to recommend items to users of the platform where the system operates; for example, items for sale, books or articles to read, or videos to watch (Ricci et al., 2011). The most common recommendations are personalized for a specific user. The aim of a recommender system can be to increase the number of purchased items in a store, but it can also be implemented to enhance the user experience. The data and information used in recommender systems can be quite diverse.

The information can be user- or item-specific data such as age, gender and location, it can be ratings, or it can be interactions between users and items. What kind of information is available determines what kind of recommender system can be used. The problem of recommending an item can be put in two ways: prediction and ranking (Aggarwal, 2016).

In the prediction version, the recommender system tries to predict the rating of a user-item pair. This is often thought of as an n × m matrix, with n users and m items. With the existing ratings the system tries to predict the missing ones; this is often called the matrix completion problem (Aggarwal, 2016). In the ranking version it is not the ratings themselves that are important but the relative ranking of the items. Here the recommender system tries to recommend the top-k items, where k is the number of items to recommend. Of course the ranking problem can be solved indirectly via the prediction version, but in many cases it can be easier to just rank the items relative to each other for each user.

In the matrix completion problem the user-item rating matrix has missing data. How much data is missing is called the level of sparsity of the matrix. The sparsity level of the rating matrix is defined as (Sarwar et al., 2000)

sparsity = 1 − (number of non-zero elements) / (total number of elements).
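As a concrete illustration, the sparsity of a small rating matrix can be computed like this (a NumPy sketch; the toy matrix and variable names are ours, with 0 denoting a missing rating):

```python
import numpy as np

# Toy user-item rating matrix: rows are users, columns are items,
# and 0 denotes a missing rating.
R = np.array([
    [5, 0, 0, 1],
    [0, 3, 0, 0],
    [4, 0, 0, 2],
])

non_zero = np.count_nonzero(R)   # number of observed ratings
total = R.size                   # total number of user-item pairs
sparsity = 1 - non_zero / total
print(round(sparsity, 3))        # 5 of 12 entries observed -> 0.583
```

Real rating matrices are of course far sparser; for the MovieLens data used later in this thesis the sparsity is about 98.4%.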


2.2 Collaborative filtering

The two basic types of recommender systems that deal with this problem are content-based systems and collaborative filtering systems. Content-based systems use information about the users and the items to try to match users with items, whereas collaborative filtering systems use the past behaviour of the users, often in terms of explicit or implicit ratings.

Explicit ratings are ratings on a specific scale, for example 1 to 5. Implicit ratings can be binary, measuring whether a user has bought, listened to or watched an item or not; sometimes, the mere fact that the user has browsed the page of an item can be enough (Aggarwal, 2016). In general, the idea in collaborative filtering is that if two users have liked the same items in the past, they will like the same items in the future (Ricci et al., 2011). It can be thought of as if the existing ratings contain enough information about interactions and dynamics to predict the missing ratings. These interactions can be seen as correlations or similarities (Aggarwal, 2016). These systems use the collaborative capability of the existing ratings, hence the name. The problem can be thought of as a user-item matrix with users in the rows and items in the columns.

The main issue in collaborative filtering is that the matrix is sparse. A user has used and/or rated only a small part of the existing items, so most of the cells in the matrix are empty. The problem at hand is therefore essentially an imputation problem. The methods used in collaborative filtering can be divided into two types: memory-based methods and model-based methods. This distinction is sometimes confusing and not clear-cut, since the memory-based methods can also be seen as models, but this way of categorising the methods is still often made.

Memory-based methods are also called neighborhood-based methods, as they often use the neighborhood of a user or item for predicting the ratings. The neighborhood, the most similar users or items, is calculated with some similarity measure. This similarity is calculated using the existing ratings, and predictions for a user are made using the neighborhood of users with highly correlated preferences (Jannach et al., 2010). Many similarity measures and statistical distances have been used, but the most common are the Pearson correlation and the cosine similarity; the Pearson correlation has been shown to often outperform the others in earlier research (Jannach et al., 2010). To make the computations more feasible, the neighborhood is often limited to a specific number of users or items, making k-nearest neighbors the most popular memory-based method.

The neighborhood-based methods can be divided into two types: user-based and item-based collaborative filtering. This division essentially amounts to different ways of computing the neighborhood: in user-based collaborative filtering only the users are used for computing the neighborhood, and in item-based collaborative filtering only the items are used. For example, in the user-based methods, if user X and user Z like the items they have in common, they are assumed to have similar taste in the items they do not have in common. In the item-based methods, if user X likes book A, and books B and C are very similar to book A, then user X is assumed to also like books B and C. Which variant is used depends on the context; as an example, many e-commerce websites use item-based methods, while streaming sites for films and TV series use both.

Memory-based methods are popular since they are often easy to implement and interpret. However, if few users have rated the same item, or a user has rated only items that almost nobody else has rated, the memory-based methods perform quite badly. In general, memory-based models do not perform well with sparse matrices (Aggarwal, 2016).

Model-based methods are often statistical models with some parameterization. Even though some models are non-parametric, most models used are parametric. This enables them to be estimated, or learned, beforehand, and the prediction of new ratings is done using these estimated models. The models used in collaborative filtering are models commonly used in other areas of statistics and machine learning, such as decision trees, rule-based models and support vector machines, but they often have to be modified to take into account the very sparse data found in recommender systems (Aggarwal, 2016).

2.3 Matrix factorization

One of the most popular model-based collaborative filtering methods is matrix factorization, or matrix decomposition. In the context of recommender systems, matrix factorization is a latent factor model that tries to find latent factors for both users and items inherent in the data. Latent factor models are believed to be state-of-the-art in this context (Aggarwal, 2016), and common models of this kind include neural networks and Latent Dirichlet Allocation, among many others.

The general idea of matrix factorization in recommender systems is to create a low-rank approximation of the rating matrix. The idea of low-rank matrix approximation was first proposed by Eckart and Young (1936) and has been used in information retrieval since the 1980s. Deerwester et al. (1990) used SVD as a method to find latent factors in documents; this is called latent semantic analysis. In Deerwester et al. (1990), both documents and search queries are represented as vectors of terms. Together these form a matrix in which highly correlated terms can be summarised in latent factors (Jannach et al., 2010). These matrix factorization methods, as a way to find latent factors, were then picked up by recommender systems research in the beginning of the 2000s (Jannach et al., 2010). The difference from latent semantic analysis is that in recommender systems the matrix is a lot more sparse.

The technique was popularized by the Netflix prize competition, where it was called SVD after the common matrix decomposition technique singular value decomposition (SVD). The main distinction is that true SVD is not feasible for the problem at hand in recommender systems, since SVD is not defined when information is missing (Koren, 2010b). In the beginning, one solution was to impute the missing values with some other model and perform conventional matrix decomposition on the resulting dense matrix (Koren, 2010b). This is problematic in two ways. First, it increases the amount of information by orders of magnitude and makes the computations a lot more expensive. Second, the imputations themselves might be biased and only add noise to the data. Therefore, the models evolved to use only the observed ratings. The main drawback of these models lies in the interpretation of the factors, and thereby of the predicted ratings.

Matrix factorization (MF) models became popular because of their scalability and their predictive performance (Koren et al. (2009); Koren (2010b)) and were widely used in the Netflix competition solutions (Jannach et al., 2010). A probabilistic foundation for these models was later given by Mnih and Salakhutdinov (2008).

The most famous and most widely used MF model is the one presented in Koren et al. (2009). This low-rank matrix factorization model tries to map users and items to a joint latent factor space. This factor space has dimension k, and the ratings are given by inner products in this space.

Item i is modelled with a latent factor vector q_i, and likewise user u is modelled with a latent factor vector p_u. The predicted rating for user u for item i is

\hat{r}_{ui} = p_u^T q_i. (1)

If this is done for all users and items, the full user-item rating matrix becomes

\begin{pmatrix} p_1^T \\ \vdots \\ p_n^T \end{pmatrix}
\begin{pmatrix} q_1 & \cdots & q_m \end{pmatrix}
=
\begin{pmatrix} \hat{r}_{11} & \cdots & \hat{r}_{1m} \\ \vdots & \ddots & \vdots \\ \hat{r}_{n1} & \cdots & \hat{r}_{nm} \end{pmatrix}, (2)

which can be written as

\hat{r} = p^T q, (3)

where \hat{r} is n × m, p is k × n and q is k × m. Here k is the number of latent features and can be chosen arbitrarily; the bigger k, the more information is retained in the approximation, but the heavier the computations. To estimate the latent factor vectors, the squared error over the observed ratings is minimized:

\min_{p, q} \sum_{(u,i) \in \kappa} (r_{ui} - p_u^T q_i)^2, (4)

where κ is the set of user-item pairs for which the rating is known.
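In code, the prediction in (1) and the matrix form in (2)-(3) are just inner products. The following NumPy sketch uses made-up dimensions and randomly initialized factors (the variable names are ours):

```python
import numpy as np

rng = np.random.default_rng(0)
k, n, m = 3, 4, 5                    # latent dimension, users, items

P = rng.normal(0, 0.1, size=(k, n))  # p_u vectors as columns, k x n
Q = rng.normal(0, 0.1, size=(k, m))  # q_i vectors as columns, k x m

# Predicted rating for a single user-item pair, equation (1)
u, i = 0, 2
r_hat_ui = P[:, u] @ Q[:, i]

# Full n x m matrix of predicted ratings, equations (2)-(3)
R_hat = P.T @ Q
assert np.isclose(R_hat[u, i], r_hat_ui)
print(R_hat.shape)                   # (4, 5)
```

In practice the factors are not random but are estimated by minimizing (4); the next paragraphs describe how.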

Research has shown (Koren et al., 2009) that these models are highly prone to overfitting if estimated as is, and they are therefore often estimated with regularization. The minimization problem then becomes

\min_{p, q} \sum_{(u,i) \in \kappa} (r_{ui} - p_u^T q_i)^2 + \lambda (\|q_i\|^2 + \|p_u\|^2), (5)

where \| \cdot \| is the Frobenius norm and λ is the regularization parameter. This has been shown to be very effective in improving the performance of the model (Koren et al. (2009); Koren (2010b)).


2.3.1 Bias

The model in (1) tries to capture the dynamics between the users and the items. However, a large part of the ratings is explained by user- and item-specific biases. For example, if user u in general gives ratings 0.5 higher than the average user, the user-specific bias for user u is 0.5. The ratings can be seen as having four parts: the global mean, the user-specific bias, the item-specific bias and the true interaction between users and items. For rating r_{ui} the bias is

b_{ui} = \mu + b_u + b_i, (6)

where μ is the global mean, b_u is the bias for user u and b_i is the bias for item i.

For the user-item rating matrix to contain only the interactions between users and items, the matrix has to be adjusted for these biases. The ratings can also be modelled with the bias terms alone, trying to capture the ratings using only the biases; this has been shown to yield good results on the Netflix data (Koren, 2010a).
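A small numeric sketch of the decomposition in (6), with made-up values:

```python
mu = 3.5     # global mean rating across all users and items
b_u = 0.5    # user u rates 0.5 above the average user
b_i = -0.3   # item i tends to be rated 0.3 below average

# Bias part of the rating r_ui, equation (6)
b_ui = mu + b_u + b_i
print(b_ui)  # 3.7
```

Whatever remains of the observed rating after subtracting b_ui is the user-item interaction that the latent factors are meant to capture.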

2.3.2 MF with bias

The MF model and the biases can be combined into an enhanced model (Koren et al., 2009) which takes into account both the biases and the interactions. It can be expressed as

\hat{r}_{ui} = \mu + b_u + b_i + p_u^T q_i, (7)

which by the same logic as above gives the minimization problem

\min \sum_{(u,i) \in \kappa} (r_{ui} - \mu - b_u - b_i - p_u^T q_i)^2 + \lambda (\|q_i\|^2 + \|p_u\|^2 + b_u^2 + b_i^2). (8)

2.4 Estimation

The estimation of the matrix factorization models is done with either stochastic gradient descent (SGD) or alternating least squares (ALS). In most cases SGD is faster and easier to implement, and it is also used in this study. SGD solves the minimization problem in (5) by looping through the observed ratings and predicting \hat{r}_{ui} (Koren et al., 2009). One iteration is one loop through the whole dataset. After the prediction is made, the prediction error e_{ui} is calculated:


e_{ui} \overset{\text{def}}{=} r_{ui} - p_u^T q_i. (9)

Using this, the vectors q_i and p_u are updated in proportion to the learning rate γ:

q_i ← q_i + γ (e_{ui} p_u − λ q_i),
p_u ← p_u + γ (e_{ui} q_i − λ p_u). (10)

This is repeated until the desired number of iterations is reached. The logic is the same when only the bias terms are estimated; the update rules are then:

b_i ← b_i + γ (e_{ui} − λ b_i),
b_u ← b_u + γ (e_{ui} − λ b_u). (11)

The bias terms are often initialized to 0. For the combined model in (8), the logic is once again the same. The difference is that the model now includes both latent factor vectors and bias terms, and the update rules are

q_i ← q_i + γ (e_{ui} p_u − λ q_i),
p_u ← p_u + γ (e_{ui} q_i − λ p_u),
b_i ← b_i + γ (e_{ui} − λ b_i),
b_u ← b_u + γ (e_{ui} − λ b_u). (12)


3 Method

3.1 Data

To evaluate the performance of the matrix factorization models under different levels of sparsity, a simulation study is performed. This requires a dense matrix, from which matrices of different levels of sparsity can be created. Since one does not in general have access to a dense matrix of rating data, we use an existing dataset, the MovieLens 100k dataset (MovieLens, 2016), as a basis for generating dense matrices. It is a common dataset for testing recommender systems, consisting of rating data collected from the MovieLens website for research purposes. It contains 100,004 ratings on 9,066 movies from 671 users, with ratings on a scale from 1 to 5.

3.1.1 Descriptive statistics

Table 1 shows the descriptive statistics of the MovieLens data. It contains 100,004 ratings on 9,066 movies from 671 users, so the total number of elements in the user-item rating matrix is 671 × 9066 = 6,083,286, which gives a sparsity of 98.4%. The mean rating is about 3.54, and the smallest number of ratings for a user is 20. The smallest number of ratings for an item is 1, while the largest is 341. The largest number of ratings given by a single user is 2,391. The marginal distributions of the number of ratings can be seen in figure 1. The structure and characteristics of the MovieLens data are what we try to capture when sampling sparse matrices, as described in section 3.3; we want the sparse matrices to be very similar in structure to the existing MovieLens data.

Descriptive statistics for an example of a structured sample are given in Appendix A.


Figure 1: Marginal distribution of the number of ratings, with users ordered by number of ratings (top panel) and items ordered by number of ratings (bottom panel).


Descriptive statistics of MovieLens 100k data

Statistic                  Value    Statistic               Value
Users                      671      Min # ratings, items    1
Items                      9066     Min # ratings, users    20
Mean rating                3.54     Max # ratings, items    341
Median # ratings, users    71       Max # ratings, users    2391
Median # ratings, items    3

Table 1: Descriptive statistics of the MovieLens 100k dataset

3.2 Design

To the MovieLens dataset we apply matrix factorization, as described in section 2, to obtain dense matrices. From these matrices we sample user-item pairs to create sparse matrices. This sampling is done in two ways. In the first, the creation of the sparse matrices tries to capture the structure of the existing MovieLens dataset in terms of how many ratings users and movies have. The second method draws ratings uniformly from the user-item rating matrix and serves as a robustness check, to see whether the results hold for different data structures in recommender systems. The sampling methods are described in section 3.3.

For each method, five dense matrices are generated, created with different values of k in the model: 20, 40, 60, 80 and 100. Each of these five dense matrices is then made sparse 100 times for each sparsity level, so in total 500 sparse matrices are generated per sparsity level. The matrix factorization models are estimated and evaluated on each matrix, and the presented results are means over the 500 iterations for each model.

3.3 Generating sparse matrices

As mentioned above, the dense matrices are made sparse in two ways. In the first, we try to preserve the structure of the matrices. Choosing observations purely at random gives the right sparsity but probably misses important dependencies inherent in data commonly encountered in recommender systems: some users rate very few items while others rate many, and some items are more popular and therefore have many ratings while less popular items are rated less. To capture this, we are careful in how the sparse matrices are generated; we want the generated sparse dataset to have the same characteristics as the existing base dataset. This is done in the following way.

First, we calculate the proportion of ratings for each user and item in the existing dataset. For example, if user A has rated 7 items and there are 1,000 ratings available, 0.7% of the ratings are from that user; in the generated sparse matrices we therefore want user A to have around 0.7% of the ratings. These proportions are calculated from the existing MovieLens dataset and used as probabilities when drawing users: draws are made from the available users and items, where the probability of being drawn equals the proportion of that user or item in the original dataset. The hypothetical user A will thus have a 0.7% chance of being drawn, and the same reasoning is applied to items. However, we do not want it to be the same user or item across all matrices. Therefore, before each sparse matrix is created, the probabilities/proportions are shuffled among the users and items, ensuring that the same users and items do not occur with the same proportions in all created matrices. We still want one user to have around 0.7% of the ratings, but it does not need to be user A; it could instead be user B. This makes sure that each created sparse matrix has a structure similar to the original dataset in terms of the proportions of ratings per user and item, while the users and items differ between matrices.

After a user and an item have been drawn with the procedure described above, this user-item pair is matched with the corresponding element in the dense matrix. For example, if user 3 and item 5 are drawn as the user-item pair, this corresponds to element (3, 5) in the dense rating matrix. Depending on the level of sparsity, a certain number of user-item pairs is drawn: for example, if the dense matrix has 1,000,000 elements and the desired sparsity is 99%, then 10,000 user-item pairs are drawn. For every matrix, we draw the right number of user-item pairs, which make up the sparse matrix used for evaluating the matrix factorization models. The matrices created this way are referred to as the structured sample, to distinguish them from the uniform sample.
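The structured sampling step can be sketched as follows (our own NumPy illustration with made-up proportions and a tiny matrix, not the thesis's code):

```python
import numpy as np

rng = np.random.default_rng(1)

# Hypothetical proportions of ratings per user and per item in the base
# dataset; each vector must sum to 1 to be used as probabilities.
user_props = np.array([0.5, 0.3, 0.2])
item_props = np.array([0.4, 0.3, 0.2, 0.1])

# Shuffle the proportions before each matrix is created, so the same
# user/item does not carry the same share of ratings in every matrix.
user_probs = rng.permutation(user_props)
item_probs = rng.permutation(item_props)

# For a 3 x 4 dense matrix and a target sparsity of 50%, draw 6 pairs.
n_pairs = 6
users = rng.choice(len(user_probs), size=n_pairs, p=user_probs)
items = rng.choice(len(item_probs), size=n_pairs, p=item_probs)
pairs = list(zip(users, items))  # these index into the dense rating matrix
print(len(pairs))                # 6
```

Note that independent draws can produce duplicate pairs; a full implementation would redraw duplicates (or draw pairs without replacement) so that exactly the desired number of distinct entries ends up observed.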


A short example with two matrices can be seen in Tables 2 and 3. Table 2 shows the proportions for some of the users in the MovieLens dataset. These proportions are used as probabilities for the respective users in the first matrix of user-item pairs, as seen in Table 3. For the second matrix the probabilities are shuffled and a new matrix is sampled: in the first matrix user 1 had a probability of 0.00247 of being drawn, but in the second matrix the probability of user 1 being drawn is 0.00001.

For the second method of creating sparse matrices, the users and items are drawn uniformly. This means that the created matrices are completely random and that all users and items have the same probability of being drawn. This is referred to as the uniform sample.

Proportions of Users

ID Number of ratings Proportion

1 20 0.00020

2 76 0.00076

... ... ...

670 31 0.00031

671 115 0.00115

Table 2: Number of ratings and proportions of first and last users in the original Movielens dataset

Example of draws

       Draw 1                 Draw 2
ID     Proportion      ID     Proportion
1      0.00247         1      0.00001
2      0.00107         2      0.00011
...    ...             ...    ...
670    0.00001         670    0.00107
671    0.00011         671    0.00247

Table 3: Example of draws of sparse matrices


3.4 Evaluation

Each sampled sparse dataset is split into a training set and a test set, with a 75/25 split. The models are estimated on the training set and evaluated on the test set, where predictions are made with the models. Based on these predictions we measure the accuracy of the models. Several accuracy metrics have been used for measuring the performance of recommender systems, but one of the most popular is RMSE (Ricci et al., 2011). The Netflix prize used RMSE, and it has been a popular metric for evaluating rating-based recommender systems ever since. The RMSE is calculated as

RMSE = \sqrt{ \frac{1}{|\hat{R}|} \sum_{(u,i) \in \hat{R}} (r_{ui} - \hat{r}_{ui})^2 }, (13)

where |\hat{R}| is the number of ratings in the test set and (u, i) are the user-item pairs in this set (Aggarwal, 2016). Using this, 500 RMSE measures are generated for each model.
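Equation (13) translates directly into code (a NumPy sketch; the function name is ours):

```python
import numpy as np

def rmse(r_true, r_pred):
    """Root mean squared error over the test-set ratings, equation (13)."""
    r_true = np.asarray(r_true, dtype=float)
    r_pred = np.asarray(r_pred, dtype=float)
    return np.sqrt(np.mean((r_true - r_pred) ** 2))

print(rmse([4, 3, 5], [3.5, 3.0, 4.5]))  # sqrt((0.25 + 0 + 0.25) / 3) ≈ 0.408
```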

3.5 Models

For both MF models, with and without bias, two different numbers of latent factors are used: k = 20 and k = 40. The values of k are chosen with respect to both performance and computational feasibility; Mnih and Salakhutdinov (2008) show that k = 30 still yields good results while not being too heavy computationally. Therefore k = 20 and k = 40 are chosen for evaluating the matrix factorization methods under different levels of sparsity. In total, four different MF models are tested. As a comparison, one model with only the bias terms is tested as well.

For the MF models the hyperparameters are important as well. The learning and regularization parameters were tuned with 5-fold cross-validation on the existing MovieLens dataset, and the results are used for all matrices: the learning rate γ is set to 0.005 and the regularization parameter λ to 0.06 for all models. A learning rate of 0.005 seems to work best in most cases, as seen in Mnih and Salakhutdinov (2008), Ricci et al. (2011) and Bell and Koren (2007).

The number of iterations in the stochastic gradient descent can have a big impact on the predictions. Earlier research has found that too many iterations will overfit the model; as shown in Mnih and Salakhutdinov (2008) and Jing et al. (2015), the error stops improving after around 20 iterations, so the number of iterations is set to 20. The vectors q_i and p_u are initialized with random values, generated from a normal distribution with μ = 0 and σ = 0.1.

4 Results

The matrix factorization models described in section 2 were estimated on 500 sparse matrices each, and the RMSE was measured. The results are presented in tables and figures below. The models are first and foremost compared with themselves across different levels of sparsity.

RMSE for simulations, structured sample

95% sparsity
       MF(k=20)  MF(k=40)  MF-Bias(k=20)  MF-Bias(k=40)  Bias
Mean   0.3181    0.3191    0.3222         0.3202         0.3380
SE     0.0053    0.0052    0.0047         0.0045         0.0052

97% sparsity
       MF(k=20)  MF(k=40)  MF-Bias(k=20)  MF-Bias(k=40)  Bias
Mean   0.3665    0.3705    0.3663         0.3651         0.3741
SE     0.0055    0.0054    0.0045         0.0043         0.0049

98% sparsity
       MF(k=20)  MF(k=40)  MF-Bias(k=20)  MF-Bias(k=40)  Bias
Mean   0.4284    0.4359    0.4072         0.4071         0.4100
SE     0.0064    0.0064    0.0048         0.0047         0.0052

99% sparsity
       MF(k=20)  MF(k=40)  MF-Bias(k=20)  MF-Bias(k=40)  Bias
Mean   0.6242    0.6384    0.4889         0.4913         0.4871
SE     0.0096    0.0095    0.0060         0.0059         0.0061

Sample size: 500 per model and sparsity level.

Table 4: Mean and standard error (SE) of the RMSE over the simulations for the structured sample


Table 4 shows the results for the structured sample. All models perform better at lower levels of sparsity, which is also clearly seen in figure 2. The two models without bias follow each other quite closely, and the same can be said for the two models with bias. The models with bias also follow the bias-only model quite closely, indicating that the bias terms capture a lot of the variation in the ratings. Interestingly, at the highest level of sparsity the bias-only model outperforms the matrix factorization models without bias: there the MF models without bias have RMSEs of 0.62 and 0.64, compared to about 0.49 for both the models with bias and the bias-only model. But as more information becomes available, the MF models improve. At the lowest level of sparsity, 95%, the MF models without bias perform better than both the MF models with bias and the bias-only model; at this level the models with and without bias have an RMSE of around 0.32, compared to 0.34 for the bias-only model. It seems that the more information is available, the better the pure MF models do compared to the models including bias and the bias-only model.


Figure 2: Mean of RMSE (log scale) against sparsity level (95–99%) for the structured sample. Above, the two MF models with k=20 compared to the bias-only model; below, the MF models with k=40 together with the bias-only model.

As seen in figure 2, the improvements in performance are largest at the highest sparsity levels: between 99% and 98% sparsity the performance increase is quite substantial. After that the models continue to improve, but the marginal improvement at each sparsity level becomes smaller and smaller.


RMSE for simulations, uniform sample

95% sparsity
       MF(k=20)  MF(k=40)  MF-Bias(k=20)  MF-Bias(k=40)  Bias
Mean   0.2683    0.2733    0.2932         0.2941         0.2935
SE     0.0011    0.0011    0.0015         0.0016         0.0011

97% sparsity
       MF(k=20)  MF(k=40)  MF-Bias(k=20)  MF-Bias(k=40)  Bias
Mean   0.3006    0.3258    0.3646         0.3697         0.3597
SE     0.0022    0.0027    0.0017         0.0019         0.0016

98% sparsity
       MF(k=20)  MF(k=40)  MF-Bias(k=20)  MF-Bias(k=40)  Bias
Mean   0.4480    0.5209    0.4622         0.4695         0.4550
SE     0.0107    0.0103    0.0022         0.0023         0.0022

99% sparsity
       MF(k=20)  MF(k=40)  MF-Bias(k=20)  MF-Bias(k=40)  Bias
Mean   1.5646    1.5488    0.6468         0.6532         0.6401
SE     0.0091    0.0095    0.0038         0.0038         0.0038

Sample size: 500 per model and sparsity level.

Table 5: Mean and standard error (SE) of the RMSE over the simulations for the uniform sample


[Figure omitted in extraction: mean RMSE (log scale) plotted against sparsity level, 95–99%.]

Figure 3: Mean of RMSE for the uniform sample. Above are the two MF models with k=20 compared to the bias-only model; below are the MF models with k=40 together with the bias-only model.

The results for the uniform sample are presented in table 5 and figure 3. The same patterns can be seen as in the structured sample: all models perform better at lower levels of sparsity than at higher levels. Just as in the structured sample, the models without bias follow each other quite closely, as do the models with bias. For the two highest levels of sparsity, all models perform worse in the uniform sample than in the structured sample. In particular, the two MF models without bias have an extremely high RMSE at 99% sparsity: over 1.5, which is very high. This is in contrast to the lower levels of sparsity, where all models seem to perform better in the uniform sample than in the structured sample. Also in this sample the improvement is largest at the highest levels of sparsity, and the effect on performance diminishes for each level.

5 Discussion and conclusions

In this paper the performance of matrix factorization (MF) methods under different levels of sparsity has been investigated. The study shows that matrix factorization methods are sensitive to the level of sparsity in the data. Both pure MF methods and variants enhanced with bias terms are evaluated. To do this, dense matrices are generated from an existing dataset and then made sparse in two different ways, and the methods are evaluated on the resulting sparse matrices.

First, a structured way of creating sparse matrices is assessed. Results from this show that the matrix factorization methods are sensitive to the level of sparsity, as the measured error is lower for matrices with lower sparsity than for matrices with higher sparsity, for all methods. The matrix factorization methods without bias perform worse than the methods with bias at higher levels of sparsity, but as the information level increases, the accuracy of the methods without bias improves.

It seems that the more information is available, the better the MF methods are at capturing the interactions in the sparse rating matrices. When the data is very sparse, the bias terms seem able to model most of the interactions, and the bias models perform better than the methods without bias.

Second, a uniform way of creating sparse matrices is evaluated. Here too the matrix factorization methods perform better at lower levels of sparsity than at higher levels, for all models. At 99% sparsity, the MF methods without bias perform very badly compared both to the structured sample and to the methods with bias. At that sparsity level the MF methods seem unable to capture much of the interactions, as the methods with bias perform much better. As the sparsity level decreases, the MF methods without bias improve, and at the 95% sparsity level these methods have a lower RMSE than the other methods. For the uniform sample the trend is clear: the MF methods without bias capture the interactions better than the other methods as the level of information increases.


It is also interesting that the bias-only model performs better than the MF methods at high levels of sparsity, while at lower levels of sparsity the MF methods perform better. It is hard to say why, but perhaps with little information the bias terms are better able to model the interactions between users and items than the MF factors.

Overall, the matrix factorization methods show sensitivity to the level of sparsity for both the structured and the uniform sample. This indicates that it can be of value to push users to rate more items, as the accuracy of the recommendations improves when the information level increases. This has of course only been tested on the MovieLens dataset, but the study strongly indicates that matrix factorization methods are sensitive to sparsity. A suggestion for further study would be to test this on more datasets, including datasets with other types of ratings, for example implicit ratings.


References

Aggarwal, C. C. (2016). Recommender systems. Springer.

Bell, R. M. and Koren, Y. (2007). Lessons from the Netflix prize challenge. ACM SIGKDD Explorations Newsletter, 9(2):75–79.

Deerwester, S., Dumais, S. T., Furnas, G. W., Landauer, T. K., and Harshman, R. (1990). Indexing by latent semantic analysis. Journal of the American Society for Information Science, 41(6):391.

Eckart, C. and Young, G. (1936). The approximation of one matrix by another of lower rank. Psychometrika, 1(3):211–218.

Jannach, D., Zanker, M., Felfernig, A., and Friedrich, G. (2010). Recommender systems: an introduction. Cambridge University Press.

Jing, L., Wang, P., and Yang, L. (2015). Sparse probabilistic matrix factorization by laplace distribution for collaborative filtering. In IJCAI, pages 1771–1777.

Koren, Y. (2010a). Collaborative filtering with temporal dynamics. Communications of the ACM, 53(4):89–97.

Koren, Y. (2010b). Factor in the neighbors: Scalable and accurate collaborative filtering. ACM Transactions on Knowledge Discovery from Data (TKDD), 4(1):1.

Koren, Y., Bell, R., and Volinsky, C. (2009). Matrix factorization techniques for recommender systems. Computer, 42(8).

Mnih, A. and Salakhutdinov, R. R. (2008). Probabilistic matrix factorization. In Advances in neural information processing systems, pages 1257–1264.

MovieLens (2016). Datasets. https://grouplens.org/datasets/movielens/ ; accessed 20-May- 2018.

Netflix (2005). Netflix prize. https://www.netflixprize.com/ ; accessed 20-May-2018.


Ricci, F., Rokach, L., and Shapira, B. (2011). Recommender systems handbook. Springer.

Sarwar, B., Karypis, G., Konstan, J., and Riedl, J. (2000). Analysis of recommendation algorithms for e-commerce. In Proceedings of the 2nd ACM conference on Electronic commerce, pages 158–167. ACM.


Appendix A

Descriptive statistics, example structured matrix

Statistic                  Value     Statistic                  Value
Users                      671       Min # ratings, items       1
Items                      9066      Min # ratings, users       8
Mean ratings               2.25      Max # ratings, items       341
Median # ratings, users    73        Max # ratings, users       2355
Median # ratings, items    4

Table 6: Descriptive statistics of structured sampled matrix with sparsity level 98.4%

The stochastic gradient descent (SGD) algorithms are described in more detail below:

SGD for MF without bias

Randomly initialize matrices p_u and q_i
while epoch < 20 do
    for each (u, i) in ratings do
        e_ui = r_ui − p_u^T q_i
        q_i ← q_i + γ(e_ui p_u − λ q_i)
        p_u ← p_u + γ(e_ui q_i − λ p_u)
    end for
end while
Final prediction: r̂_ui = p_u^T q_i
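The scheme above can be sketched in Python. This is a minimal NumPy illustration, not the exact code used in the study; the hyperparameter values (γ, λ, k, epochs) and the toy ratings are placeholders:

```python
import numpy as np

def sgd_mf(ratings, n_users, n_items, k=20, gamma=0.01, lam=0.1, epochs=20, seed=0):
    """SGD for matrix factorization without bias.

    ratings: iterable of (u, i, r) triples for the observed entries.
    Returns factor matrices P (n_users x k) and Q (n_items x k).
    """
    rng = np.random.default_rng(seed)
    P = 0.1 * rng.standard_normal((n_users, k))  # random initialization of p_u
    Q = 0.1 * rng.standard_normal((n_items, k))  # random initialization of q_i
    for _ in range(epochs):
        for u, i, r in ratings:
            e = r - P[u] @ Q[i]                  # prediction error e_ui
            Q[i] += gamma * (e * P[u] - lam * Q[i])
            P[u] += gamma * (e * Q[i] - lam * P[u])
    return P, Q

# Toy example (illustrative): two users with identical preferences.
ratings = [(0, 0, 4.0), (0, 1, 2.0), (1, 0, 4.0), (1, 1, 2.0)]
P, Q = sgd_mf(ratings, n_users=2, n_items=2, k=2, gamma=0.05, lam=0.01, epochs=200, seed=1)
pred = P[0] @ Q[0]  # should end up close to the observed rating 4.0
```

Note that, as in the pseudocode, q_i is updated before p_u within each rating.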


SGD for bias only

Initialize b_i and b_u
while epoch < 20 do
    for each (u, i) in ratings do
        e_ui = r_ui − µ − b_u − b_i
        b_i ← b_i + γ(e_ui − λ b_i)
        b_u ← b_u + γ(e_ui − λ b_u)
    end for
end while
Final prediction: r̂_ui = µ + b_u + b_i

SGD for MF with bias

Randomly initialize matrices p_u and q_i
Initialize b_i and b_u
while epoch < 20 do
    for each (u, i) in ratings do
        e_ui = r_ui − µ − b_u − b_i − p_u^T q_i
        q_i ← q_i + γ(e_ui p_u − λ q_i)
        p_u ← p_u + γ(e_ui q_i − λ p_u)
        b_i ← b_i + γ(e_ui − λ b_i)
        b_u ← b_u + γ(e_ui − λ b_u)
    end for
end while
Final prediction: r̂_ui = µ + b_u + b_i + p_u^T q_i
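The biased variant admits a similar sketch. Again, a minimal NumPy illustration under assumed placeholder hyperparameters and toy data, not the study's exact code:

```python
import numpy as np

def sgd_mf_bias(ratings, n_users, n_items, k=20, gamma=0.01, lam=0.1, epochs=20, seed=0):
    """SGD for matrix factorization with user and item bias terms.

    ratings: iterable of (u, i, r) triples; mu is the global mean rating.
    Prediction for (u, i): mu + b_u[u] + b_i[i] + P[u] @ Q[i].
    """
    rng = np.random.default_rng(seed)
    P = 0.1 * rng.standard_normal((n_users, k))
    Q = 0.1 * rng.standard_normal((n_items, k))
    b_u = np.zeros(n_users)
    b_i = np.zeros(n_items)
    mu = np.mean([r for _, _, r in ratings])  # global mean of observed ratings
    for _ in range(epochs):
        for u, i, r in ratings:
            e = r - (mu + b_u[u] + b_i[i] + P[u] @ Q[i])  # error e_ui
            Q[i] += gamma * (e * P[u] - lam * Q[i])
            P[u] += gamma * (e * Q[i] - lam * P[u])
            b_i[i] += gamma * (e - lam * b_i[i])
            b_u[u] += gamma * (e - lam * b_u[u])
    return mu, b_u, b_i, P, Q

# Toy example (illustrative): user and item effects dominate the ratings.
ratings = [(0, 0, 5.0), (0, 1, 3.0), (1, 0, 4.0), (1, 1, 2.0)]
mu, b_u, b_i, P, Q = sgd_mf_bias(ratings, n_users=2, n_items=2, k=2,
                                 gamma=0.05, lam=0.01, epochs=200, seed=1)
pred = mu + b_u[0] + b_i[0] + P[0] @ Q[0]  # should end up close to 5.0
```

With this toy data the bias terms alone can nearly reproduce the ratings, which mirrors the finding that the bias-only model does well at high sparsity.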
