A Comparison between Different Recommender System Approaches for a Book and an Author Recommender System

Linköpings universitet

2020 | LIU-IDA/LITH-EX-A--20/017--SE

Jesper Hedlund

Emma Nilsson Tengstrand

Supervisor: Marco Kuhlmann

Examiner: Arne Jönsson

Copyright

The publishers will keep this document online on the Internet - or its possible replacement - for a period of 25 years starting from the date of publication barring exceptional circumstances.

The online availability of the document implies permanent permission for anyone to read, to download, or to print out single copies for his/her own use and to use it unchanged for non-commercial research and educational purposes. Subsequent transfers of copyright cannot revoke this permission. All other uses of the document are conditional upon the consent of the copyright owner. The publisher has taken technical and administrative measures to assure authenticity, security and accessibility.

According to intellectual property law the author has the right to be mentioned when his/her work is accessed as described above and to be protected against infringement.

For additional information about the Linköping University Electronic Press and its procedures for publication and for assurance of document integrity, please refer to its www home page: http://www.ep.liu.se/.

Abstract

A recommender system is a popular tool used by companies to increase customer satisfaction and to increase revenue. Collaborative filtering and content-based filtering are the two most common approaches when implementing a recommender system, where the former provides recommendations based on user behaviour, and the latter uses the characteristics of the items that are recommended.

The aim of the study was to develop and compare different recommender system approaches, for both book and author recommendations, and their ability to predict user ratings in an e-book application. The models were evaluated by measuring Root Mean Square Error (RMSE) and Mean Absolute Error (MAE). Two pure models were developed, one based on collaborative filtering and one based on content-based filtering. In addition, three different hybrid models using a combination of the two pure approaches were developed and compared to the pure models. The study also explored how aggregation of book data to author level could be used to implement an author recommender system.

The results showed that the aggregated author data was more difficult to predict. However, it was difficult to draw any conclusions about the performance on author data due to the data aggregation, although it was clear that it was possible to derive author recommendations based on data from books. The study also showed that the collaborative filtering model performed better than the content-based filtering model according to RMSE but not according to MAE. The lowest RMSE and MAE, however, were achieved by combining the two approaches in a hybrid model.

Acknowledgments

We would like to thank our supervisors at Storytel, Erik Persson and Dave Clarke, for excellent guidance among the massive Storytel datasets and for providing valuable input throughout the work.

We also want to thank our supervisor at Linköping University, Marco Kuhlmann, for the patience, for keeping us structured, and for providing excellent feedback on every part of the work. Without Marco, the report would have been a mess. Also, thanks to our examiner Arne Jönsson for the valuable feedback.

Lastly, thanks to our seminar group at Linköping University for listening and coming up with new ideas for our work during milestone seminars. Especially thanks to our opponents

Acronyms

ALS Alternating Least Squares

BOW Bag-Of-Words

BRS Book Recommender System

BERT Bidirectional Encoder Representations from Transformers

CBOW Continuous Bag-Of-Words Model

CBCF Content-Boosted Collaborative Filtering

CF Collaborative Filtering

CN Content-based Filtering

HRS Hybrid Recommender System

IR Information Retrieval

KNN K-Nearest Neighbor

LDA Latent Dirichlet Allocation

LSA Latent Semantic Analysis

MAE Mean Absolute Error

MAP Mean Average Precision

ML Machine Learning

NDCG Normalized Discounted Cumulative Gain

NLP Natural Language Processing

POS part-of-speech

PCA Principal Component Analysis

PC1 principal component 1

PC2 principal component 2

PV-DBOW Distributed Bag of Words version of Paragraph Vector

PV-DM Distributed Memory Model of Paragraph Vector

RF Random Forest

RS Recommender System

SGD Stochastic Gradient Descent

S-BERT Sentence BERT

SVM Support Vector Machines

SVD Singular Value Decomposition

t-SNE t-Distributed Stochastic Neighbor Embedding

tf-idf term frequency-inverse document frequency

UMAP Uniform Manifold Approximation and Projection

Contents

Abstract

Acknowledgments

Acronyms

Contents

List of Figures

List of Tables

1 Introduction

1.1 Motivation and aim

1.2 Research questions

1.3 Delimitations

1.4 Thesis Outline

1.5 Author responsibility and contribution

2 Theory

2.1 Evaluation of Recommender Systems

2.2 Collaborative filtering (CF) and associated methods

2.3 Content-based Filtering (CN) and associated methods

2.4 Adding implicit ratings to a Recommender System (RS)

2.5 Advantages and drawbacks of CF and CN

2.6 Collaborative-filtering (CF) and content-based (CN) hybrid models

3 Method

3.1 Ensuring replicability

3.2 Datasets

3.3 Aggregation from book level to author level

3.4 Creating the dataset

3.5 Evaluation of the different models

3.6 Baseline models

3.7 Collaborative filtering (CF) model

3.8 Content-based filtering (CN) model

3.9 Collaborative-filtering (CF) and content-based (CN) hybrid models

3.10 Setup: frameworks and tools

4 Results

4.1 Performance for all models

4.2 Pure CF and pure CN results

4.3 Hybrid Collaborative Filtering (CF) Content-based Filtering (CN) models results

5.1 Result discussion

5.2 Method discussion

5.3 The work in a wider context

6 Conclusion

6.1 Future Work

A Stop words

B Libraries used

C Full results on implicit ratings experiment

List of Figures

1.1 How recommendations are made using item and user data.

2.1 How a recommendation is made with collaborative filtering.

2.2 A User–Item (U–I) matrix.

2.3 Matrix factorization using SVD.

2.4 Recommendations with content-based filtering.

2.5 System description of content-based filtering.

2.6 Graphical description of the relation between corpus, document, and word.

2.7 Architecture of Skip-gram and CBOW.

2.8 Architecture of PV-DM and PV-DBOW.

2.9 Classification tree example.

3.1 Venn diagram over ratings and books.

3.2 Distribution of ratings for books and authors.

3.3 Distribution of completion rate for unfinished books.

3.4 RMSE and MAE over k neighbors, CF KNN Baseline books.

3.5 Before and after text preprocessing.

3.6 RMSE and MAE over k neighbors, CN model books.

3.7 RMSE and MAE over k neighbors, CN model authors.

3.8 Graphical representation of the hybrid weighted model.

3.9 RMSE and MAE over weighting factor σ^books_CF, weighted hybrid model books.

3.10 RMSE and MAE over weighting factor σ^authors_CF, weighted hybrid model authors.

3.11 Graphical representation of the hybrid switching model.

3.12 Feature Augmentation CF CN model training data for hyper-parameter optimization.

3.13 Feature Augmentation CF CN model description.

3.14 Feature Augmentation CF CN model training data.

4.1 The prediction distribution for the CF and CN models for books and authors.

4.2 The prediction distribution for the hybrid models for books and authors.

5.1 An example of book ratings aggregated to author level.

5.2 RMSE and MAE for all models for various dense data.

5.3 Prediction distribution for each rating for the pure CF model and pure CN model on books and authors.

5.4 Residual distribution for the pure CF model and the pure CN model for each rating.

5.5 Distribution over how many times a user has rated.

5.6 RMSE and MAE for different types of user groups in terms of how many ratings a user has given.

5.7 Distribution over the variance of a user’s ratings.

5.8 RMSE and MAE for different users in terms of the variance of a user’s ratings.

5.9 Variance in the ratings of a user correlation with amount of ratings given by a user

5.11 RMSE and MAE for different users in terms of the mean of a user’s ratings.

5.12 The mean RMSE and MAE per category - on books and authors.

5.13 The prediction distribution for each rating for the hybrid models for books and authors.

5.14 Residual distribution for the predictions versus the true values for the hybrid models: weighted, switching and Feature Augmentation (FA).

5.15 Feature importance for RF in switching model.

5.16 Cumulative probability for the CN and the CF model to place a book with a true rating of 5, at a certain rank.

5.17 Uniform Manifold Approximation and Projection (UMAP) decomposition of Doc2vec vectors for books colored by associated category. Books labeled as “Fiction” and “Non-Fiction” excluded.

5.18 Principal Component Analysis (PCA) decomposition of Doc2vec vectors for books colored by associated category. Books labeled as “Fiction” and “Non-Fiction” excluded.

5.19 PCA decomposition of Doc2vec vectors for fiction and non-fiction books, colored by associated category.

5.20 The PCA representation of the authors Adam Smith, Aristotle, John Stuart Mill,

List of Tables

2.1 Potential implicit ratings information that can be acquired by a system.

2.2 Advantages and drawbacks with Collaborative Filtering approach respectively Content-based Filtering.

2.3 Hybridization methods.

3.1 Books metadata table.

3.2 User ratings table.

3.3 Completion rate of unfinished books table.

3.4 Aggregated author rating table.

3.5 Data sparsity for U–I matrix on books and on authors.

3.6 Model performance for CF models on user ratings on books dataset.

3.7 Parameter grid KNN baseline random search books.

3.8 Parameter grid KNN baseline grid search books.

3.9 Parameter grid SVD random search books.

3.10 Parameter grid SVD grid search books.

3.11 Parameter grid Co-clustering random search books.

3.12 Parameter grid Co-clustering grid books.

3.13 Validation set results for selected CF models.

3.14 Parameter grid SVD grid search authors.

3.15 Model performance of different dimensional reduction techniques.

3.16 Parameter grid of Random Forest (RF) parameters for hybrid switching random search books.

3.17 Parameter grid of RF parameters for hybrid switching grid search books.

3.18 Parameter grid of RF parameters for hybrid switching random search authors.

3.19 Parameter grid of RF parameters for hybrid switching grid search authors.

3.20 Parameter grid feature augmentation hybrid grid search books.

3.21 Parameter grid feature augmentation hybrid grid search authors.

4.1 Results for the predictions on the test set for each model.

4.2 Pure CF and pure CN model performance according to RMSE and MAE.

4.3 Pure CF and pure CN model performance with implicit ratings.

4.4 Hybrid models performance for the switching model, the weighted model and the feature augmentation (FA) model.

5.1 Rating distribution book data versus author data.

5.2 Minimal number of ratings required related to the sparsity of the resulting U–I matrix.

5.3 RMSE and MAE for CF and CN on new items and the difference compared to RMSE and MAE on the whole dataset.

5.4 Precision, recall and F1 score for Random forest.

1 Introduction

A Recommender System (RS) is a tool used to recommend an item to a user with the purpose of helping the user in their decision-making (Melville et al., 2002). RSs are widely used by companies to, for example, increase the number of items sold or increase user satisfaction. Historically, RSs emerged from the observation that users tend to choose what others choose or recommend (Ricci et al., 2002). For example, users often trust a book recommendation from a librarian, a movie recommendation from a film critic or a song recommended by a friend. From this observation, RS algorithms evolved within different markets, and as more data became available, increasingly advanced algorithms were developed.

There are several approaches that can be used when creating RSs, of which the two most common are Collaborative Filtering (CF) and Content-based Filtering (CN). CF uses user feedback as input data. This feedback can be explicit, such as how the user rated an item on a scale of 1–5, or implicit, such as whether the user purchased or viewed a certain item. Based on these ratings, similarity between either users or items is computed, where similar users are those who rated items similarly, and similar items are those that have received similar ratings from the same users. Based on these similarities, predictions for how a user will rate items can be made using different models. If a predicted rating is high, it is likely that the user is interested in the item, and the item will therefore be recommended to the user. To convert the predicted ratings into recommendations, a filtering component is used, see Figure 1.1. The filtering component takes the highest predicted ratings, filters out the items that the user has already consumed, and provides recommendations.

For the CN approach, instead of using user ratings to compute similarity, the content of the items is used. This content could be, for example, metadata of the item, such as the year a movie was released, or the full text of a book or a song. The content of the items is processed using text processing techniques, and from this the similarity between items is computed.
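The filtering component described above can be illustrated with a minimal Python sketch. The function name `recommend`, the item identifiers and the predicted ratings are hypothetical and only serve as an example; they are not taken from the Storytel data or from the implementation used in this study.

```python
def recommend(predicted, consumed, n=3):
    """Return the n unconsumed items with the highest predicted rating."""
    # Drop items the user has already consumed.
    candidates = {item: r for item, r in predicted.items() if item not in consumed}
    # Sort by predicted rating (descending); ties are broken by item id.
    ranked = sorted(candidates, key=lambda item: (-candidates[item], item))
    return ranked[:n]


# Hypothetical predicted ratings for one user.
predicted = {"book_a": 4.6, "book_b": 3.1, "book_c": 4.9, "book_d": 4.2}
consumed = {"book_c"}  # already read, so it must not be recommended again

top = recommend(predicted, consumed, n=2)
print(top)
```

Here `book_c` has the highest predicted rating but is excluded because the user has already consumed it.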

Both approaches have their advantages and drawbacks. The CF approach is considered to be the simplest to implement since no information about the items is needed, and it is the most commonly used approach for RSs. However, it is not able to recommend items that have not yet been rated, a problem called the first-rater issue (Melville et al., 2002). The CN approach, on the other hand, is able to provide recommendations with less user rating data (Lops et al., 2002). A drawback with the CN approach is, however, that it tends to over-specialize, meaning that the user will only be recommended items that are similar to the items the user already consumed, and there will be no novel items that might be of interest. Moreover, content data is not always available, and even if there is data available there might be insufficient features from the content to distinguish one item from another (Lops et al., 2002). To manage the disadvantages of different RS techniques, they can be combined to create hybrid models (Burke, 2002). For example, by combining the CF and CN techniques, one could explore whether it is possible to create an RS that takes the advantages of each approach and overcomes the drawbacks, such as the first-rater issue.

Figure 1.1: How recommendations are made using item and user data. (Flow: item and user data → model → rating predictions → filtering component → recommendations.)

This study explores the CF approach, the CN approach, as well as different hybrid approaches of CF and CN, when creating a book-RS and an author-RS. The different approaches, CF, CN and hybrid, are evaluated for different types of users, authors, books and types of ratings, to see how the approaches perform for different sets of data. There are, for example, users that have rated few items, authors whose books have not been rated at all, as well as very popular authors whose books have received many good ratings.

The study is done in collaboration with Storytel, which is the largest actor within the e-book and audio-book market in Sweden. E-books are electronic books that are read online and audio-books are books that are listened to. Storytel provided data such as user ratings on books, metadata, and the full text of the books. They also provided data on the completion rate of all unfinished books, that is, how much of a book a user read before abandoning it, for all books that the user did not finish. Both the book-RS and the author-RS were implemented using the CF approach and the CN approach, as well as the hybrid CF CN approach, in order to find the best performing approach for the recommendation engines.

1.1 Motivation and aim

The goal of implementing a pure CF RS and a pure CN RS is to identify how the different approaches perform for the entire dataset as well as for subsets of the data. Different hybrid implementations of a RS are also explored to determine whether the CF and CN approaches can complement each other so that a hybrid model outperforms both pure models.

The study also aims to understand how aggregating the user rating data on books, as well as the book content, to author level affects the performance.

1.2 Research questions

The thesis explores and answers the research questions below.

1. How does the aggregation of user ratings on books to author level affect the performance of a RS?

To be able to develop a RS for authors, the user rating data available on books and the book content need to be aggregated to author level. The performance of the different RSs is evaluated on book level as well as author level in order to see how the data aggregation affects the performance.

2. What performance can be achieved when predicting ratings using a pure CF approach and a pure CN approach for a book-RS and an author-RS?

Different RSs are created: one CF RS and one CN RS for books, as well as one CF RS and one CN RS for authors. Each RS is evaluated according to the metrics RMSE and MAE, and the different RSs are compared to each other.

3. Will the performance increase when including implicit ratings as input into the CF and CN book-RS and author-RS?

The dataset provided by Storytel on the completion rate of unfinished books is a kind of implicit rating which might provide further insight into how much the user liked, or did not like, the book and its author. These implicit ratings are incorporated into the existing models to see whether the performance improves.

4. What performance will a hybrid CF CN RS achieve compared to the pure CF RS and the pure CN RS, on book level and on author level?

A set of different hybrid models, combining the two pure approaches, CF and CN, is modelled using both user data and item content data.

1.3 Delimitations

The data used in this study is limited to data provided by Storytel on e-books. All audio-books, with their associated ratings, are excluded since they have no full-text data available.

Also, the authors used in the study are limited to those with e-books written in English or translated to English.

1.4 Thesis Outline

In Chapter 2, related work and theoretical concepts on RS techniques are presented and explained. This includes CF techniques, CN techniques as well as hybrid RS methods. In Chapter 3, the methodology of the experiments is explained. Chapter 4 presents all results from the experiments. In Chapter 5, the obtained results are discussed, and lastly, in Chapter 6, the findings are concluded, the research questions are answered and future work related to the thesis is suggested.

1.5 Author responsibility and contribution

Jesper Hedlund was mainly responsible for the theoretical concepts and methodology behind the CN model, whereas Emma Nilsson Tengstrand contributed with theory and methodology on the CF model. The aggregation of the dataset, the hybrid models and the evaluation of the models were implemented jointly. Further on, the discussion and conclusions were co-created.

2 Theory

In a Recommender System (RS), there is the item, i, which denotes what the RS recommends, such as a book or an author, as well as the user, u, that is the receiver of the recommended items, such as a user of the Storytel app. The user has associated behavioral data, for instance, ratings of items or number of purchases, whereas the item has associated metadata as well as content information such as full-text for a book or lyrics for a song (Ricci et al., 2002).

There are several techniques that are used for constructing a RS, which are distinguished by their knowledge source (the data used as input into the RS) (Burke, 2007). The two most common techniques are the Collaborative Filtering (CF) approach, which produces a recommendation based on user behaviour data, and the Content-based Filtering (CN) approach, where recommendations are based on deriving item similarity from the metadata and content of the items (Ricci et al., 2002). In addition, there are several hybrid techniques, combining the two approaches in order to overcome drawbacks of the different techniques.

All methods and relevant theories for each approach are described in the associated section. When an algorithm or a concept is not exclusively linked to a certain approach, it has been placed in the section where it is most relevant. For example, since Singular Value Decomposition (SVD) is used in the CF approach, it is described in the theory section for CF, even though it is not exclusively used for CF modelling. Models that are used in several approaches are described under the first-mentioned approach and then referred back to.

2.1 Evaluation of Recommender Systems

There are several ways to evaluate a RS. One common method used in several research papers is to measure the Root Mean Square Error (RMSE) and Mean Absolute Error (MAE) of predictions on ratings.

MAE computes the mean of the absolute errors between each prediction and its actual value, see Equation 2.1, where $\hat{y}_i$ are the predictions, $y_i$ are the actual values and $n$ is the number of predictions.

$$\text{MAE} = \frac{1}{n} \sum_{i=1}^{n} |\hat{y}_i - y_i| \tag{2.1}$$

RMSE computes the standard deviation of the residuals, and is a metric that describes how far the predictions $\hat{y}_i$ are from the true values $y_i$ in the same way as MAE, but penalizes predictions further away from the true values more than MAE, see Equation 2.2.

$$\text{RMSE} = \sqrt{\frac{1}{n} \sum_{i=1}^{n} (\hat{y}_i - y_i)^2} \tag{2.2}$$

Balasingam et al. (2019) state that both MAE and RMSE are useful when measuring the performance of a RS. MAE measures how effective the RS is: a lower MAE means better predictions. RMSE instead indicates the stability of the predictions, and a low RMSE means low variability of the predictions, since RMSE penalizes larger deviations more than MAE does.

There are other metrics that have been discussed in the literature. For instance, ranking metrics such as Mean Average Precision (MAP) or Normalized Discounted Cumulative Gain (NDCG) evaluate the recommendations of a model. Precision and recall of the recommendations that are derived from the predictions are also commonly used.
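The difference between the two error metrics in Equations 2.1 and 2.2 can be illustrated with a small Python sketch. The ratings and predictions are made up for illustration. Both sets of predictions have the same MAE, but the set with a single large error has a higher RMSE.

```python
import math


def mae(y_true, y_pred):
    """Mean Absolute Error, Equation 2.1."""
    return sum(abs(p - t) for t, p in zip(y_true, y_pred)) / len(y_true)


def rmse(y_true, y_pred):
    """Root Mean Square Error, Equation 2.2."""
    return math.sqrt(sum((p - t) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true))


y_true = [5, 3, 4, 1]          # hypothetical true ratings
small_errors = [4, 4, 3, 2]    # every prediction is off by 1
one_outlier = [5, 3, 4, 5]     # three exact predictions, one off by 4

print(mae(y_true, small_errors), rmse(y_true, small_errors))
print(mae(y_true, one_outlier), rmse(y_true, one_outlier))
```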

2.2 Collaborative filtering (CF) and associated methods

The Collaborative Filtering (CF) approach provides recommendations based on patterns in the behavioral data, such as how the users have rated or what items the users have consumed (Koren and Bell, 2002). It is based on the idea that two users are similar if their behaviours are similar, and users will be recommended items that have been consumed or liked by similar users. For example, if two users, $u_1$ and $u_2$, see Figure 2.1, both like item $i_1$ and item $i_2$, they have similar behavioral patterns and are therefore, according to the CF approach, considered similar users. Then, if user $u_1$ also likes item $i_3$, user $u_2$ would be given a recommendation of item $i_3$. Analogously, two items are similar if the same users have liked or consumed the items, which will be discussed more in Section 2.2.1.

Figure 2.1: How a recommendation is made with collaborative filtering.

The CF approach is considered to be the simplest approach when implementing a RS because it only needs user data as input and no knowledge about the items is required (Ricci et al., 2002). CF models are all based on patterns of feedback, either explicit or implicit, which are described below.

Implicit feedback: Represents an action of a user, such as whether a user has consumed an item or finished a book (Schafer et al., 2007). Implicit feedback could be used separately or as an additional input to complement a model of sparse explicit feedback (Koren and Bell, 2002).

The CF approach conventionally relies on the U–I matrix, where each user rating on an item is recorded, see Figure 2.2 (Shi et al., 2014). One row in the U–I matrix is called a user vector and represents a user with all of that user's ratings, whereas one column is called an item vector and represents an item and the ratings it has received. The matrix is typically very sparse, since no user rates every item, and some users only rate once or twice. Sparsity in a U–I matrix is typically computed as the ratio between the observed number of entries in the matrix and the total number of entries. The total number of entries is all possible ratings between users and items, that is, the number of unique users multiplied by the number of unique items, whereas the number of observed entries is the number of actual ratings that have been recorded. The sparsity of the U–I matrix is relevant to measure since it may affect the performance and computational requirements of a model. Again, the ratings in the U–I matrix could be of either explicit or implicit character and are measured in a suitable way according to the rating system.

Figure 2.2: A User–Item (U–I) matrix.
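As a minimal sketch of the U–I matrix and the ratio described above, the following Python example builds a small matrix from hypothetical (user, item, rating) triples using NumPy. Missing ratings are stored as NaN, which is an assumption made for this illustration only. Note that some sources instead report one minus this ratio as the sparsity.

```python
import numpy as np

# Hypothetical (user, item, rating) triples.
ratings = [("u1", "i1", 5), ("u1", "i2", 3), ("u2", "i1", 4), ("u3", "i3", 1)]

users = sorted({u for u, _, _ in ratings})
items = sorted({i for _, i, _ in ratings})

# Rows are user vectors, columns are item vectors; NaN marks a missing rating.
ui = np.full((len(users), len(items)), np.nan)
for u, i, r in ratings:
    ui[users.index(u), items.index(i)] = r

observed = int(np.count_nonzero(~np.isnan(ui)))  # recorded ratings
total = ui.size                                  # unique users * unique items
observed_ratio = observed / total

print(ui)
print(observed, total, observed_ratio)
```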

The models for the CF approach are mainly divided into two categories: neighborhood models and matrix factorization models (Koren and Bell, 2002). There are also less common approaches, for example, different kinds of clustering techniques.

2.2.1 Neighborhood models

Neighborhood models are based on finding correlations between either users or items, based on feedback from users on items. Measuring the correlation between users is called user–user similarity and the correlation between items is called item–item similarity. The user–user approach produces recommendations based on ratings from similar users. The item–item approach computes the similarity between items in terms of how users have rated the items and, from that similarity, recommends items to a user that are similar to the ones the user has previously rated (Koren and Bell, 2002). The item–item approach has gained popularity, as it scales better and has better accuracy than the original user–user approach (Sarwar et al., 2001).

Neighborhood models are the most commonly used algorithms for CF RSs, mainly because of their simplicity. They are simpler to implement than, for example, matrix factorization models, but have been shown to perform worse (Koren and Bell, 2002).

Ricci et al. (2002) state that the two most common similarity measures for CF RSs, to measure similarity between either users or items, are cosine similarity, see Equation 2.3, and Pearson correlation, see Equation 2.4. Similarity is computed either between user vectors $c_u$ and $c_v$ for the users u and v, or between item vectors $d_i$ and $d_j$ for the items i and j (described in Figure 2.2), depending on whether user–user similarity or item–item similarity is used. All following similarities and models will be described with the user notation u and v, for clarity. All similarities and neighborhood models are, however, applicable to item–item similarity too.

Cosine similarity between two users u and v is the dot product of the user vectors, $c_u \cdot c_v$, divided by the product of the norms of the vectors, $\|c_u\| \|c_v\|$.

$$\cos(u, v) = \frac{c_u \cdot c_v}{\|c_u\| \, \|c_v\|} \tag{2.3}$$

Pearson correlation between two users u and v is the covariance between the user vectors, $\text{cov}(c_u, c_v)$, divided by the product of the standard deviations of the vectors, $\sigma_{c_u} \sigma_{c_v}$.

$$\text{Pearson}(u, v) = \frac{\text{cov}(c_u, c_v)}{\sigma_{c_u} \sigma_{c_v}} \tag{2.4}$$

The study by Lathia et al. (2009) shows that, in the general case, the choice of similarity measure makes little difference. Even though cosine similarity and Pearson correlation have been used successfully in several studies, both can have problems measuring similarity when the data is very sparse, according to Ahn (2007).
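Equations 2.3 and 2.4 can be sketched in Python with NumPy as follows. The sketch assumes that the user vectors have already been restricted to items rated by both users, so that they contain no missing values; the vectors themselves are made up. It shows that two users with opposite preferences can still have a positive cosine similarity, since all ratings are positive, while their Pearson correlation is -1.

```python
import numpy as np


def cosine_sim(cu, cv):
    """Cosine similarity, Equation 2.3."""
    return float(np.dot(cu, cv) / (np.linalg.norm(cu) * np.linalg.norm(cv)))


def pearson_sim(cu, cv):
    """Pearson correlation, Equation 2.4 (population covariance and std)."""
    cov = np.mean((cu - cu.mean()) * (cv - cv.mean()))
    return float(cov / (cu.std() * cv.std()))


# Hypothetical ratings of three users on the same three items.
cu = np.array([5.0, 3.0, 1.0])
cv = np.array([4.0, 3.0, 2.0])  # same preference order as cu
cw = np.array([1.0, 3.0, 5.0])  # opposite preference order to cu

print(cosine_sim(cu, cv), pearson_sim(cu, cv))
print(cosine_sim(cu, cw), pearson_sim(cu, cw))
```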

2.2.1.1 K-Nearest Neighbor (KNN)

One neighborhood model is K-Nearest Neighbor (KNN), which computes the prediction as a weighted average of the values of the k nearest neighbors. The prediction, $\hat{r}_{ui}$, of a user u on an item i is estimated by first computing the similarity sim(u, v) to each of the k nearest neighbors, $v \in N_i^k(u)$. The similarities are multiplied with the neighbors' ratings on item i, $r_{vi}$. These values are summed and divided by the sum of the similarities of the k neighbors, see Equation 2.5.

$$\hat{r}_{ui} = \frac{\sum_{v \in N_i^k(u)} \text{sim}(u, v) \cdot r_{vi}}{\sum_{v \in N_i^k(u)} \text{sim}(u, v)} \tag{2.5}$$

The similarity sim(u, v) between two users u and v is calculated using either cosine similarity or Pearson correlation.
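A minimal Python sketch of Equation 2.5 is given below. It assumes that the similarities between the target user and all users who rated item i have already been computed and are positive; the function name `knn_predict` and the numbers are hypothetical.

```python
def knn_predict(neighbors, k):
    """Weighted average of the k most similar neighbors' ratings (Equation 2.5).

    neighbors: list of (similarity to user u, rating on item i) tuples for
    users who have rated item i. Similarities are assumed to be positive.
    """
    top_k = sorted(neighbors, key=lambda sr: sr[0], reverse=True)[:k]
    numerator = sum(sim * rating for sim, rating in top_k)
    denominator = sum(sim for sim, _ in top_k)
    return numerator / denominator


# Hypothetical neighbors of user u that have rated item i.
neighbors = [(0.9, 5), (0.6, 4), (0.3, 2), (0.1, 1)]

pred = knn_predict(neighbors, k=2)  # only the two most similar users are used
print(pred)
```

With k = 2, the prediction is (0.9 · 5 + 0.6 · 4) / (0.9 + 0.6) = 4.6.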

2.2.1.2 K-Nearest Neighbor (KNN) with means

This KNN variant takes into account the mean rating of each user, $\mu_u$ (or $\mu_i$, depending on whether the prediction is computed using user–user similarity or item–item similarity), as well as the mean rating of each of the k neighbors, $\mu_v$, when computing the predicted value $\hat{r}_{ui}$, according to Equation 2.6. This takes into account how the user usually rates, e.g. if it is a user that generally rates atypically high or low, which can help the model to account for general user behaviours.

$$\hat{r}_{ui} = \mu_u + \frac{\sum_{v \in N_i^k(u)} \text{sim}(u, v) \cdot (r_{vi} - \mu_v)}{\sum_{v \in N_i^k(u)} \text{sim}(u, v)} \tag{2.6}$$
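The effect of the mean-centering in Equation 2.6 can be sketched in Python as follows. The neighbors are assumed to be the k nearest neighbors already, and all numbers are hypothetical. The target user rates low on average, so the prediction ends up well below the neighbors' raw ratings.

```python
def knn_with_means(mu_u, neighbors):
    """KNN with means, Equation 2.6.

    neighbors: list of (similarity, neighbor's rating on item i,
    neighbor's mean rating) tuples for the k nearest neighbors.
    """
    numerator = sum(sim * (r_vi - mu_v) for sim, r_vi, mu_v in neighbors)
    denominator = sum(sim for sim, _, _ in neighbors)
    return mu_u + numerator / denominator


# Both neighbors rated item i above their own mean rating.
neighbors = [(0.8, 5, 4.0), (0.2, 4, 3.5)]

# The target user has a low mean rating of 2.0.
pred = knn_with_means(2.0, neighbors)
print(pred)
```

The weighted deviation is 0.8 · 1.0 + 0.2 · 0.5 = 0.9, so the prediction is 2.0 + 0.9 = 2.9, whereas the plain weighted average of the neighbors' ratings would be 4.8.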

2.2.1.3 K-Nearest Neighbor (KNN) with means and z-score

This KNN variant takes into account the mean rating of each user, and applies a z-score normalization of each user for user–user similarity (z-score normalization of each item for item–item similarity). Z-score normalization is done by subtracting the mean rating of each neighbor, as for KNN with means in Section 2.2.1.2, and also dividing by the standard deviation of that neighbor's ratings; the result is rescaled by the standard deviation of user u. The prediction $\hat{r}_{ui}$ is consequently computed according to Equation 2.7.

$$\hat{r}_{ui} = \mu_u + \sigma_u \frac{\sum_{v \in N_i^k(u)} \operatorname{sim}(u, v) \cdot (r_{vi} - \mu_v) / \sigma_v}{\sum_{v \in N_i^k(u)} \operatorname{sim}(u, v)} \tag{2.7}$$

Z-score normalization takes into account not only whether a user tends to rate particularly high or low, but also whether a user rates with a lot of variety or not. For example, if two users, $u_1$ and $u_2$, have the same mean rating, e.g. $\mu_{u_1} = \mu_{u_2} = 3$, their rating patterns can still be different. Suppose that user $u_1$ has rated the items i = 1, 2, 3 with the ratings $r_{11} = 1$, $r_{12} = 3$ and $r_{13} = 5$, whereas user $u_2$ has rated all three items the same, $r_{21} = r_{22} = r_{23} = 3$. The standard deviations of the users will then be different, with $\sigma_{u_1} = 2$ and $\sigma_{u_2} = 0$. By taking into account the variance with which each user rates, additional user patterns could be captured compared to, for example, ordinary KNN.
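The example above, together with Equation 2.7, can be checked with a short sketch. It assumes the sample standard deviation (which gives the value 2 for the ratings 1, 3, 5) and a neighbor set that is already known; the function name, the data layout, and the decision to skip neighbors with zero standard deviation are illustrative assumptions.

```python
from statistics import mean, stdev

# The two users from the example: same mean, different spread.
print(mean([1, 3, 5]), stdev([1, 3, 5]))  # mean 3, sample standard deviation 2.0
print(mean([3, 3, 3]), stdev([3, 3, 3]))  # mean 3, sample standard deviation 0.0

def knn_zscore_predict(ratings, sims, u, i):
    """Equation 2.7 for a fixed neighbor set given as {v: sim(u, v)}."""
    own = list(ratings[u].values())
    mu_u, sigma_u = mean(own), stdev(own)
    num = den = 0.0
    for v, s in sims.items():
        vals = list(ratings[v].values())
        mu_v, sigma_v = mean(vals), stdev(vals)
        if i not in ratings[v] or sigma_v == 0:
            continue  # a constant rater has no defined z-score (assumption: skip)
        num += s * (ratings[v][i] - mu_v) / sigma_v
        den += s
    return mu_u if den == 0 else mu_u + sigma_u * num / den

# Hypothetical toy data.
ratings = {"u": {"a": 2, "b": 4}, "v": {"a": 1, "b": 3, "c": 5}}
print(knn_zscore_predict(ratings, {"v": 1.0}, "u", "c"))
```

Neighbor "v" rates item "c" one standard deviation above its own mean, so the prediction lands one standard deviation of "u" above the mean of "u".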

2.2.1.4 K-Nearest Neighbor (KNN) with baseline

This KNN variant takes into account the baseline rating $b_{ui}$ for a user u on item i. The baseline rating $b_{ui}$ is usually predicted by taking the overall average $\mu$ and adding the observed biases $b_u$ of user u and $b_i$ of item i, see Equation 2.8.

$$b_{ui} = \mu + b_u + b_i \tag{2.8}$$

The biases $b_u$ and $b_i$ can be estimated by solving a least squares problem over all existing ratings, that is, over the set of user–item pairs $K = \{(u, i) \mid r_{ui} \text{ is known}\}$. Minimizing the term $\sum_{(u,i) \in K} (r_{ui} - \mu - b_u - b_i)^2$ aims to find the optimal $b_u$s and $b_i$s, denoted $b^{*}$. Moreover, a regularization term, $\lambda(\sum_u b_u^2 + \sum_i b_i^2)$, is included to avoid overfitting, see Equation 2.9 (Koren, Bell, and Volinsky, 2009).

$$\min_{b^{*}} \sum_{(u,i) \in K} (r_{ui} - \mu - b_u - b_i)^2 + \lambda\Big(\sum_u b_u^2 + \sum_i b_i^2\Big) \tag{2.9}$$

The final prediction for the rating $\hat{r}_{ui}$ is then set as:

$$\hat{r}_{ui} = b_{ui} + \frac{\sum_{v \in N_i^k(u)} \operatorname{sim}(u, v) \cdot (r_{vi} - b_{vi})}{\sum_{v \in N_i^k(u)} \operatorname{sim}(u, v)} \tag{2.10}$$

Incorporating the user and item bias into the prediction might help the model to capture the specific behaviour of users and characteristics of items.


2.2.2 Matrix factorization models

Matrix factorization models decompose the U–I matrix into a product of matrices with lower dimensionality. The idea is to represent the users with a user feature matrix and the items with an item feature matrix in a low dimensional latent feature space. This dimensional reduction approach aims to avoid the sparse U–I matrix problems, and to be able to scale RS problems well (Dheeraj et al., 2015).

Matrix factorization models have become increasingly common in RSs because they perform well on sparse data and scale well (Koren and Bell, 2002). For example, in the late 2000s Netflix started a competition with the goal of improving the RMSE of their RS by 10%. The winner of this competition won one million dollars (Bennett et al., 2007). Matrix factorization was shown to be a suitable approach in this competition, and the technique gained popularity afterwards (Koren, Bell, and Volinsky, 2009). Furthermore, Murphy (2012d) emphasises the sparsity and scale of the Netflix dataset, with approximately $8.6 \times 10^9$ entries in the U–I matrix but only 100,480,507 observed entries, meaning that only about 1% of the entries are observed. The author also discusses the imbalance of the dataset, with some users having rated fewer than 5 items and some having rated over 10,000. The baseline system of the Netflix RS engine had a RMSE of 0.9525 on the test set.

2.2.2.1 Singular Value Decomposition (SVD)

SVD is a matrix factorization model that identifies latent features by mapping both users and items to a joint latent factor space of dimensionality f. For example, when decomposing a U–I matrix, the matrix is factorized into three different matrices: a user feature matrix, a diagonal singular value matrix, and an item feature matrix, which can be seen in Figure 2.3. For the SVD model, each user is associated with a user feature vector $p_u \in \mathbb{R}^f$, represented by a row in the user feature matrix. Each element in this user feature vector describes how interested the user u is in items corresponding to the associated latent feature. Moreover, each item i is linked to an item feature vector $q_i \in \mathbb{R}^f$, represented by a column in the item feature matrix. Each element in the item feature vector describes how well the item i corresponds to a latent feature. Lastly, the diagonal singular value matrix contains singular values for the latent features of the U–I matrix. These singular values are used for scaling (Koren and Bell, 2002).


Figure 2.3: Matrix factorization using SVD.

The dot product of the vectors, $q_i^T p_u$, describes the interaction between a certain user u and an item i, that is, how interested the user u is in the characteristics of item i. This interaction is later used to approximate the user rating of an item in Equation 2.11.

Koren, Bell, and Volinsky (2009) state that this conventional way of performing SVD implies difficulties because it requires factorization of the U–I matrix. The conventional approach is not defined for missing values, and addressing only the few existing values typically leads to overfitting. This is often the case for a sparse U–I matrix. Alternatively, if the missing values are imputed, this is often computationally expensive and risks misrepresenting the data. Later work instead suggests using only the explicit ratings, meaning that no missing values are imputed, and including a regularization component in order to avoid overfitting the data (Koren and Bell, 2002). The rating is predicted by taking the baseline prediction, which is the overall mean plus the user and item biases, as described in Section 2.2.1.4. This baseline prediction is added to the user–item interaction $q_i^T p_u$, resulting in the prediction of $\hat{r}_{ui}$ in Equation 2.11. Murphy (2012d) further states that the advantage of incorporating the user bias and item bias into the matrix factorization rating estimation is that a lot of the variation in the data can often be explained by user-specific or item-specific causes. For example, in the Netflix problem, there are always some movies that are rated high no matter the type of user, and some users that always rate a movie low, no matter the movie.

$$\hat{r}_{ui} = \mu + b_i + b_u + q_i^T p_u \tag{2.11}$$

To learn the optimal user and item biases, as well as the user feature vectors and item feature vectors, denoted $b^{\star}$, $q^{\star}$, $p^{\star}$, the minimization problem described in Equation 2.12 is solved. Minimizing the term $\sum_{(u,i) \in K} (r_{ui} - \mu - b_u - b_i - q_i^T p_u)^2$ strives to find $b^{\star}$, $q^{\star}$, $p^{\star}$, and the regularization term $\lambda(b_u^2 + b_i^2 + \|q_i\|^2 + \|p_u\|^2)$ is added to avoid overfitting.

$$\min_{b^{\star}, q^{\star}, p^{\star}} \sum_{(u,i) \in K} (r_{ui} - \mu - b_u - b_i - q_i^T p_u)^2 + \lambda\big(b_u^2 + b_i^2 + \|q_i\|^2 + \|p_u\|^2\big) \tag{2.12}$$

The minimization problem is often solved using either Alternating Least Squares (ALS) or Stochastic Gradient Descent (SGD). Aberger (2002) states that SGD is faster and more accurate than ALS for this kind of baseline parameter optimization, although ALS could perform better when the data is extremely sparse. Moreover, Koren (2008) explores both ALS and SGD for neighborhood and matrix factorization CF models, concludes that SGD is faster than and equally accurate as ALS for the problem, and therefore uses it for the experiments in the study.

One way of optimizing the parameters using SGD gained attention after Simon Funk used it in his model for improving the Netflix algorithm (Funk, 2006), and it has since been used by several others. The algorithm starts by making a prediction $\hat{r}_{ui}$ for each rating $r_{ui}$ and computing the error $e_{ui} = r_{ui} - \hat{r}_{ui}$. The user bias $b_u$, item bias $b_i$, item feature vector $q_i$ and user feature vector $p_u$ are then iteratively modified by moving in the opposite direction of the gradient of the regularized error, using a learning rate $\gamma$ and a regularization term $\lambda$, looping over the steps below (Koren and Bell, 2002; Koren, Bell, and Volinsky, 2009).

• $b_u \leftarrow b_u + \gamma (e_{ui} - \lambda \cdot b_u)$

• $b_i \leftarrow b_i + \gamma (e_{ui} - \lambda \cdot b_i)$

• $q_i \leftarrow q_i + \gamma (e_{ui} \cdot p_u - \lambda \cdot q_i)$

• $p_u \leftarrow p_u + \gamma (e_{ui} \cdot q_i - \lambda \cdot p_u)$
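The update steps listed above can be sketched as a small training loop. This is a minimal illustration, not the implementation used in the study: the function names, the hyperparameter values, the initialization with small Gaussian noise, and the toy ratings are all assumptions.

```python
import random

def train_svd(ratings, n_factors=2, n_epochs=500, gamma=0.01, lam=0.02, seed=0):
    """ratings: list of (user, item, rating). Returns a predict(u, i) function."""
    rng = random.Random(seed)
    mu = sum(r for _, _, r in ratings) / len(ratings)
    users = sorted({u for u, _, _ in ratings})
    items = sorted({i for _, i, _ in ratings})
    b_u = {u: 0.0 for u in users}
    b_i = {i: 0.0 for i in items}
    p = {u: [rng.gauss(0, 0.1) for _ in range(n_factors)] for u in users}
    q = {i: [rng.gauss(0, 0.1) for _ in range(n_factors)] for i in items}

    def predict(u, i):
        # Equation 2.11: global mean + biases + user-item interaction.
        return mu + b_u[u] + b_i[i] + sum(qf * pf for qf, pf in zip(q[i], p[u]))

    for _ in range(n_epochs):
        for u, i, r in ratings:
            e = r - predict(u, i)
            b_u[u] += gamma * (e - lam * b_u[u])
            b_i[i] += gamma * (e - lam * b_i[i])
            for f in range(n_factors):
                puf, qif = p[u][f], q[i][f]  # use the old values in both updates
                q[i][f] += gamma * (e * puf - lam * qif)
                p[u][f] += gamma * (e * qif - lam * puf)
    return predict

def rmse(predict, ratings):
    return (sum((r - predict(u, i)) ** 2 for u, i, r in ratings) / len(ratings)) ** 0.5

# Hypothetical toy ratings.
ratings = [("u1", "a", 5), ("u1", "b", 3), ("u2", "a", 4),
           ("u2", "c", 1), ("u3", "b", 2), ("u3", "c", 1)]
predict = train_svd(ratings)
print(rmse(predict, ratings))
```

On this toy data the training RMSE falls below that of always predicting the global mean; a real evaluation would of course use held-out ratings.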

Koren and Bell (2002) and Koren, Bell, and Volinsky (2009) state that it is possible to set different learning rates, $\gamma$, and regularization terms, $\lambda$, for the user bias $b_u$, item bias $b_i$, item feature vector $q_i$ and user feature vector $p_u$ respectively, and that better accuracy can thus be achieved.

2.2.2.2 SVD++

The accuracy of a prediction from the SVD model could be improved using implicit feedback, such as whether a user has consumed, looked at or searched for an item. Koren (2008) describes the SVD++ model, which uses as implicit feedback whether or not a user has rated an item, denoted as "1" for rated and "0" for not rated, without taking the actual rating into account. Even though this type of implicit data is not typically independent of the explicit ratings, the author found that incorporating these implicit binary ratings into the model significantly improved its prediction accuracy.

Initially, another set of item features is introduced, where each item i is related to a feature vector $y_i \in \mathbb{R}^f$. $R(u)$ denotes the set of items for which the user has expressed an implicit rating. The user preferences $\varphi_u$ can then be expressed as in Equation 2.13 (Koren, Bell, and Volinsky, 2009).

$$\varphi_u = \sum_{i \in R(u)} y_i \tag{2.13}$$

To stabilize the variance across the observed values, this vector is normalized, resulting in a new definition of the user preferences in Equation 2.14 (Koren, Bell, and Volinsky, 2009).

$$\varphi_u = |R(u)|^{-\frac{1}{2}} \sum_{i \in R(u)} y_i \tag{2.14}$$

This normalized sum of implicit item vectors is added to the user vector $p_u$ in order to characterize the user based on the set of items it has rated. Equation 2.15 describes the rating estimate using SVD++.

$$\hat{r}_{ui} = \mu + b_i + b_u + q_i^T \Big( p_u + |R(u)|^{-\frac{1}{2}} \sum_{j \in R(u)} y_j \Big) \tag{2.15}$$

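The optimal values for $b_u$, $b_i$, $q_i$, $p_u$ and $y_j$ can be determined by minimizing the regularized squared error function using SGD, similar to the procedure described for SVD in Section 2.2.2.1. This is done by looping through the steps below. In addition to the SVD loop, the implicit item feature vector $y_j$ is updated, and two different regularization terms are used: $\lambda_1$ for the biases and $\lambda_2$ for the feature vectors.

• $b_u \leftarrow b_u + \gamma \cdot (e_{ui} - \lambda_1 \cdot b_u)$

• $b_i \leftarrow b_i + \gamma \cdot (e_{ui} - \lambda_1 \cdot b_i)$

• $q_i \leftarrow q_i + \gamma \cdot \big(e_{ui} \cdot (p_u + |R(u)|^{-\frac{1}{2}} \sum_{j \in R(u)} y_j) - \lambda_2 \cdot q_i\big)$

• $p_u \leftarrow p_u + \gamma \cdot (e_{ui} \cdot q_i - \lambda_2 \cdot p_u)$

• $\forall j \in R(u)$: $y_j \leftarrow y_j + \gamma \cdot (e_{ui} \cdot |R(u)|^{-\frac{1}{2}} \cdot q_i - \lambda_2 \cdot y_j)$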
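To make the role of the implicit term concrete, the rating estimate of Equation 2.15 can be sketched for a single user–item pair. The function name, the argument layout and the numbers are hypothetical; the sketch covers only the prediction, not the SGD updates.

```python
def svdpp_predict(mu, b_u, b_i, q_i, p_u, y_rated):
    """Equation 2.15. y_rated holds the implicit vectors y_j for all j in R(u)."""
    norm = len(y_rated) ** -0.5 if y_rated else 0.0
    implicit = [norm * sum(y[f] for y in y_rated) for f in range(len(p_u))]
    return mu + b_u + b_i + sum(q * (p + z) for q, p, z in zip(q_i, p_u, implicit))

# Hypothetical values: the user has rated four items, so |R(u)|^(-1/2) = 0.5.
y_rated = [[0.1, 0.0], [0.3, 0.0], [0.0, 0.0], [0.0, 0.0]]
estimate = svdpp_predict(mu=3.0, b_u=0.5, b_i=-0.2,
                         q_i=[1.0, 0.0], p_u=[0.2, 0.1], y_rated=y_rated)
print(estimate)  # approximately 3.7
```

With an empty `y_rated` the function reduces to the plain SVD estimate of Equation 2.11.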

2.2.3 Clustering models

Clustering methods assign users and items to clusters, and similarities between users or items are derived by comparing the clusters. George et al. (2005) present a clustering approach for RSs, called Co-Clustering, where users and items are assigned to clusters $C_u$ and $C_i$ and to co-clusters $C_{ui}$. The prediction $\hat{r}_{ui}$ is computed according to Equation 2.16, where $C_u$ is the average rating of the cluster of user u, $C_i$ is the average rating of the cluster of item i and $C_{ui}$ is the average rating of the co-cluster $C_{ui}$.

$$\hat{r}_{ui} = C_{ui} + (\mu_u - C_u) + (\mu_i - C_i) \tag{2.16}$$

The clusters are then assigned by minimizing the error in a way that, unlike SVD for example, enables incremental updates when new ratings arrive. George et al. (2005) state that this approach has similar accuracy to neighborhood techniques and matrix factorization models, with significantly lower computational effort.

2.3 Content-based Filtering (CN) and associated methods

Content-based Filtering (CN) aims to recommend items to a user that are similar to items the user previously liked, using the content of the items. The similarity is derived from the content of the items, such as the title or the full text of a book (Lops et al., 2002). If a user, u1, likes an item, i1, which is similar to another item, i2, then item i2 will be recommended to user u1, see Figure 2.4 for a graphical representation.

Figure 2.4: How a recommendation is made with Content-based Filtering (CN).

A CN RS consists of several components, which are visualized in Figure 2.5. The content analyzer performs the data preprocessing step, which takes the item content data and represents the items numerically for further analysis. There are different ways of preprocessing and structuring the content data depending on the type of data that is used as input. When the input data contains text, there are several processes that can be applied, which are described in Section 2.3.1. The processed text can also be combined with metadata of the item. The preprocessed item content is divided into a training set and a test set, where the training data is used as input to the profile learner and the test set is sent directly to the filtering component. Some items might occur only in the test set, so their representations are sent only to the filtering component and are thus unknown to the profile learner.

The user profiles are learned through the profile learner. As input, apart from the content input, the profile learner also utilizes user data such as consumption data or rating data. These user profiles could be learned using a machine learning algorithm such as linear regression or KNN.

Lastly, the filtering component uses the user profiles, all items processed by the profile learner and any items that have not been processed by the profile learner to produce a list of recommendations. This is done by evaluating the similarity of the items in the test data and comparing it with the user preferences given by the user profiles. The items are often sorted to deliver a ranked list of recommendations, where the item that is thought to be the best fit for a user is at the top of the list.

[Figure 2.5 components: Item content data, Content Analyzer, Item representations (items in training data; items only in test data), Profile Learner, User data, User Profiles, Filtering Component, List of Recommendations.]

Figure 2.5: System description of content-based filtering. Offline version adapted from Lops et al. (2002), Figure 3.1.

2.3.1 Natural Language Processing (NLP)

Natural Language Processing (NLP) is a research area focusing on developing techniques and tools that enable computer systems to understand human text or speech. Some early developments within NLP were made in the late 1940s in the application of machine translation (Liddy, 2001). It was suggested that existing techniques used in cryptography could be used to translate a text from one language to another with the help of machines. The techniques in this area have evolved significantly since then. There are now tools that can capture the semantics of different languages and represent whole documents in a vector space to compute similarity (Collobert et al., 2011; Le et al., 2014). Software such as Siri and Google Assistant in mobile phones, email spam filters, and chatbots are part of our daily life, and all use NLP to interpret human language. Another very popular example is the Google search engine, which uses NLP to a large extent for Information Retrieval (IR), a subject closely related to RSs (Ricci et al., 2002).

In NLP, the full dataset is called a corpus and consists of a set of documents, for example a set of books. Each book in the corpus is called a document and each document consists of words, see Figure 2.6 for a graphical representation.

[Figure 2.6; labels: Corpus, Document, Word.]

2.3.1.1 Preprocessing

An important part of NLP is preprocessing, which is the practice of preparing the text and representing it in such a way that it is possible to perform text mining. According to Srividhya et al. (2010), this is the most crucial and complex part of the process when incorporating a document into an IR system. The authors further state that preprocessing is a fundamental part of achieving good performance for NLP tasks. There are several types of preprocessing: lowercase conversion, tokenization, stemming and lemmatization, and stop word removal, among others. The objective of the preprocessing techniques is to remove noise and represent the text in a way that is easier for the computer to interpret.

Lowercase conversion: Converting all text to lowercase is one simple and straightforward type of preprocessing. The reason for this is that, for example, the words “Pear” and “pear” mean the same thing regardless of capitalization and should therefore be represented in the same way. A drawback of this method is that capitalization distinguishes some words. For example, “apple” and “Apple” would be represented in the same way even though the former refers to the fruit and the latter to the company.

Tokenization: Text usually comes in the form of sentences, paragraphs, or whole documents. Tokenization is the process of dividing the text into smaller pieces, such as a paragraph into a list of sentences or a sentence into a list of words (Webster et al., 1992). An example could be the sentence “I am typing” being divided into a list of the words [’I’, ’am’, ’typing’]. Each word in this list is called a token, hence the name of the process. The split is usually done on each white space, resulting in tokens the size of a word. When splitting a text into sentences, the sentences can be separated using punctuation. At first glance, tokenization could be seen as a simple and straightforward technique. However, as Webster et al. (ibid.) mention, there is often a complexity to it depending on the language of the text. In English, for example, there are difficulties in identifying and handling acronyms. Moreover, there are issues with how to handle the use of the apostrophe for possession and contractions. For example, in the sentence “It’s important that you do this ASAP!”, “It’s” actually contains two words and “ASAP” is an abbreviation of the four words “As Soon As Possible”. The challenge when creating a tokenizer is to preserve as much information as possible from the text without misrepresenting the data, in order to have a viable token representation.
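The whitespace and punctuation strategies described above can be sketched in a few lines. The function names and the sentence-splitting pattern are assumptions for this illustration; production tokenizers handle many more cases.

```python
import re

def tokenize_words(sentence):
    """Split on whitespace, the simple strategy described above."""
    return sentence.split()

def tokenize_sentences(text):
    """Split after '.', '!' or '?' when followed by whitespace."""
    return [s for s in re.split(r"(?<=[.!?])\s+", text.strip()) if s]

print(tokenize_words("I am typing"))
print(tokenize_words("It's important that you do this ASAP!"))
print(tokenize_sentences("I am typing. It's important!"))
```

The second call illustrates the difficulties mentioned in the text: the contraction "It's" stays a single token, and the exclamation mark remains attached to "ASAP!".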

Stemming and lemmatization: Both stemming and lemmatization aim to reduce the effect of inflectional forms¹ in the text. Stemming cuts words according to a rule set by looking for commonly used prefixes and suffixes. This is done in order to remove inflections of the words and keep only the root representation of each word, the stem. An example of a rule is “if the word ends with an s, remove the s”. So, for example, the word “books” would be cut to “book” because of the common suffix “s”. A drawback of this method is that there is no guarantee that it captures the actual root. For example, stemming the word “studies” removes the suffix “es”, resulting in “studi” instead of the real root word, “study”.

Lemmatization instead uses morphological analysis of words, removing inflectional endings and returning only the lemma, that is, the base form of the word. This can be done in several ways; one is simply to look the word up in a dictionary. Taking the previous example of the word “studies”, lemmatization would return the base form of the word, which is “study”. Stemming is the simpler method, while lemmatization is considered to be more accurate (Balakrishnan et al., 2014).
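The difference between the two approaches can be illustrated with a deliberately naive sketch. The two-rule suffix list and the tiny lookup dictionary are hypothetical stand-ins for a real stemmer rule set and a real lemma dictionary.

```python
def naive_stem(word):
    """Strip a common suffix; the rule list is illustrative only."""
    for suffix in ("es", "s"):
        if word.endswith(suffix) and len(word) > len(suffix) + 2:
            return word[: -len(suffix)]
    return word

# Hypothetical dictionary mapping inflected forms to their lemma.
LEMMA_DICT = {"studies": "study", "books": "book"}

def naive_lemmatize(word):
    """Dictionary lookup; unknown words are returned unchanged."""
    return LEMMA_DICT.get(word, word)

for w in ("books", "studies"):
    print(w, naive_stem(w), naive_lemmatize(w))
```

The rule-based stemmer produces "studi" for "studies", whereas the dictionary lookup returns "study".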


Stop word removal: In most English texts, some words are used more frequently than others, such as “the”, “a”, “an”, and “of”. These are called stop words. They usually carry little information since they are used in almost every text. Therefore, in the process of stop word removal, a list of stop words is created and the words are subsequently removed from the text. A common way to create a stop word list is to sort all words in the vocabulary of the data being processed by frequency of appearance and add the most common ones to the stop word list (Manning et al., 2009).

Even though stop word removal is often considered to increase the performance of a model, Schofield et al. (2017) state that a stop word list that is too extensive will be biased towards what its creator sees as important. According to Schofield et al. (ibid.), it is sufficient to remove only the extremely frequent words in the preprocessing and then perform a post-hoc removal after the model has been trained.
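The frequency-based way of building a stop word list can be sketched as follows. The function names, the cut-off parameter and the toy documents are assumptions for illustration.

```python
from collections import Counter

def build_stop_words(documents, n_most_common):
    """Collect the n most frequent lowercase tokens across all documents."""
    counts = Counter(tok for doc in documents for tok in doc.lower().split())
    return {word for word, _ in counts.most_common(n_most_common)}

def remove_stop_words(tokens, stop_words):
    return [t for t in tokens if t.lower() not in stop_words]

# Hypothetical toy corpus.
docs = ["the cat sat on the mat", "the dog ate the bone", "a cat and the dog"]
stop_words = build_stop_words(docs, n_most_common=1)
print(stop_words)
print(remove_stop_words("the cat sat".split(), stop_words))
```

Keeping `n_most_common` small corresponds to removing only the extremely frequent words, in line with the recommendation by Schofield et al. (2017).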

2.3.1.2 Text representation

An important part of NLP research is to develop techniques to represent text in a way that is understandable for computers. To be able to use text as input to, for example, a Machine Learning (ML) model, a numerical representation of the text is needed. This is usually done by representing each word as a fixed-length vector in a multidimensional vector space, called an embedding. One basic model that is commonly used is Bag-Of-Words (BOW). BOW represents the text as a vector with the same dimension as the number of unique words in the text, where each dimension represents the frequency of a certain word. Le et al. (2014) mention a couple of weaknesses of BOW: the technique does not capture the order of the words and ignores their semantics.
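A minimal BOW sketch makes the word-order weakness visible. The function name and the whitespace tokenization are assumptions for this illustration.

```python
from collections import Counter

def bag_of_words(documents):
    """Return the sorted vocabulary and one count vector per document."""
    tokenized = [doc.lower().split() for doc in documents]
    vocabulary = sorted({tok for doc in tokenized for tok in doc})
    vectors = []
    for doc in tokenized:
        counts = Counter(doc)
        vectors.append([counts[word] for word in vocabulary])
    return vocabulary, vectors

vocabulary, vectors = bag_of_words(["the dog bit the man", "the man bit the dog"])
print(vocabulary)
print(vectors)
```

The two sentences have opposite meanings but identical count vectors, since BOW discards word order.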

Le et al. (2014) describe an embedding technique called Word2vec that makes use of a two-layer neural network (one hidden layer, sometimes called the projection layer, and one output layer) to produce a vector representation of a word. It is one of the state-of-the-art techniques for word embedding that were developed during the 2010s. It can be based on two different architectures, the skip-gram model or the Continuous Bag-Of-Words (CBOW) model. The skip-gram model uses a word to predict its context, and the CBOW model uses the context of a word to predict the word. The difference in the actual architecture of the two approaches can be seen in Figure 2.7. Even though the models are trained to predict either the context of a word or the targeted word, the actual purpose of the neural network is to produce a hidden layer that captures the semantics of each word. It is in the hidden layer that the word vectors are produced, and these are later used as output from the Word2vec model.
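The difference between the two training objectives can be illustrated by the training examples each one would see. The sketch below only generates (input, target) examples from a token list; it does not train a neural network, and the function names and window size are assumptions.

```python
def context_indices(idx, n_tokens, window):
    """Indices within `window` positions of idx, excluding idx itself."""
    return [j for j in range(max(0, idx - window), min(n_tokens, idx + window + 1))
            if j != idx]

def skipgram_pairs(tokens, window=2):
    """Skip-gram: each word is used to predict each of its context words."""
    return [(center, tokens[j])
            for idx, center in enumerate(tokens)
            for j in context_indices(idx, len(tokens), window)]

def cbow_examples(tokens, window=2):
    """CBOW: the context words are used together to predict the center word."""
    return [([tokens[j] for j in context_indices(idx, len(tokens), window)], center)
            for idx, center in enumerate(tokens)]

tokens = ["a", "b", "c"]
print(skipgram_pairs(tokens, window=1))
print(cbow_examples(tokens, window=1))
```

Skip-gram yields one example per (word, context word) pair, while CBOW yields one example per word with the whole context as input.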

In the same paper, Le et al. (2014) also introduced an extension of Word2vec that is able to represent a document as a vector, whereas Word2vec only represents words. It is very similar in architecture and likewise outputs the vectors that are produced in the hidden layer. Instead of the vectors representing only words, the extended model is also able to create vector representations of documents. The model is called Doc2vec and can use both architecture types used in the Word2vec model, the skip-gram model and the CBOW model. Besides the words, Doc2vec also uses the document ID as input to the neural network and thereby learns the characteristics of each document. In Doc2vec, the skip-gram based model is referred to as the Distributed Bag of Words version of Paragraph Vector (PV-DBOW) and the CBOW based model is referred to as the Distributed Memory Model of Paragraph Vector (PV-DM). Examples of the Doc2vec architecture with the PV-DM and PV-DBOW approaches can be seen in Figure 2.8.

2.3.2 Profile learners

In order to be able to produce personalized recommendations, the profile (preferences) of each user has to be learned. This is usually done by applying a supervised machine learning algorithm, such as linear regression or decision trees, to create a predictive model (Lops et al., 2002). The predictions made by the model are later used by the filtering component in the RS to produce recommendations for the user.

[Figure 2.7 panels: Skip-gram (predicting neighboring words based on the single word) and CBOW (predicting word based on neighboring words). Each panel has Input, Projection and Output layers, with context words at offsets −2, −1, +1 and +2; CBOW sums the inputs.]

Figure 2.7: Architecture of Skip-gram and CBOW.

[Figure 2.8 panels: PV-DBOW (predicting neighboring word based on the single word and document ID) and PV-DM (predicting word based on neighboring words and document ID). Each panel has Input, Projection and Output layers, with the Document ID as an additional input.]

Figure 2.8: Architecture of PV-DM and PV-DBOW.

2.3.2.1 Linear regression

Linear regression is one of the most commonly used models in the area of regression (Murphy, 2012a). The output of the model is an affine function of the input. Equation 2.17 displays the mathematical formula for linear regression, describing the target y as a function of the input vector x, with $w_i$ being the weight on feature $x_i$ (Murphy, 2012a).

One advantage of using linear regression is that it is a very simple and interpretable model that rarely overfits (Burkov, 2019). A drawback that follows the simplicity is that it may be hard for the model to capture complex patterns.
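As a minimal sketch of how such a model is fitted, assume a single input feature, so that the model is the affine function $y = w_0 + w_1 x$, and fit the weights by ordinary least squares. The function name and the toy data are hypothetical.

```python
def fit_simple_linear(xs, ys):
    """Ordinary least squares for y = w0 + w1 * x with one feature."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    w1 = (sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
          / sum((x - mean_x) ** 2 for x in xs))
    w0 = mean_y - w1 * mean_x
    return w0, w1

# Toy data generated from y = 1 + 2x without noise.
w0, w1 = fit_simple_linear([0, 1, 2, 3], [1, 3, 5, 7])
print(w0, w1)
```

The interpretability mentioned above shows directly in the result: `w1` is the change in the predicted target per unit change in the feature.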

2.3.2.2 Decision trees

The basic idea of decision trees (also called classification and regression trees) is to split the data into different subgroups based on a set of criteria. This is done incrementally resulting in a set of leaf nodes, each representing a subgroup of the data. How the split is made is based on what gives the highest information gain, that is, splitting the data to separate the target values while keeping the data in each leaf node as homogeneous as possible. The data is split until the information gain of the possible splits is below a set threshold or until no more splits can be made.

An example of a decision tree could be when trying to predict whether a book is of the category “Fiction” or “Non-fiction”. The data could be split on whether the book contains the word “magic” and if it contains the word “dragon”. The resulting decision tree could look like the one in Figure 2.9.

[Figure 2.9 node labels: Contains "Dragon" (Yes/No), Contains "Magic" (Yes/No); leaf labels: Fiction, Non-fiction, Fiction.]

Figure 2.9: Classification tree for whether a book is fiction or non-fiction.

A decision tree can be used as a classifier, as in the example in Figure 2.9, where the target variable is categorical. It can also be used as a regressor, where the target variable is continuous. In that case, the mean of all the data points in each leaf node is calculated and set as the prediction (Murphy, 2012a). While decision trees are easy to interpret, a drawback is that they are prone to overfitting (ibid.).
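The split criterion can be illustrated by computing the information gain of a candidate split. The sketch assumes entropy as the impurity measure and boolean features; the four toy books and their labels are hypothetical.

```python
from collections import Counter
from math import log2

def entropy(labels):
    n = len(labels)
    return -sum((c / n) * log2(c / n) for c in Counter(labels).values())

def information_gain(rows, labels, feature):
    """Entropy of the labels minus the weighted entropy after splitting on a
    boolean feature. rows is a list of dicts mapping feature name to bool."""
    n = len(labels)
    gain = entropy(labels)
    for value in (True, False):
        subset = [lab for row, lab in zip(rows, labels) if row[feature] == value]
        if subset:
            gain -= len(subset) / n * entropy(subset)
    return gain

rows = [
    {"dragon": True, "magic": True},
    {"dragon": True, "magic": False},
    {"dragon": False, "magic": True},
    {"dragon": False, "magic": False},
]
labels = ["Fiction", "Fiction", "Fiction", "Non-fiction"]
print(information_gain(rows, labels, "dragon"))
print(information_gain(rows, labels, "magic"))
```

A tree learner would evaluate every candidate feature this way, split on the one with the highest gain, and repeat the procedure within each resulting subgroup.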

2.3.3 Dimensionality reduction techniques

The process of reducing dimensions is very common and exists in everyday life. Perhaps the most common example is a camera capturing a three-dimensional world and projecting it into two dimensions in the form of a picture or a video. The main goal of dimensionality reduction is to project the data into a lower-dimensional subspace while keeping as much information as possible from the original higher-dimensional data (Murphy, 2012a). These kinds of techniques can be used for visualization of high-dimensional data, feature selection, and also to increase the performance of a predictive model (Bishop, 2006; Murphy, 2012a).

In this section, three different dimensionality reduction techniques are described: Principal Component Analysis (PCA), t-Distributed Stochastic Neighbor Embedding (t-SNE) and Uniform Manifold Approximation and Projection (UMAP). PCA performs a linear projection onto a lower-dimensional subspace, whereas t-SNE and UMAP are both based on neighbor graph theory and perform more complex projections than PCA. The sections about t-SNE and UMAP are theoretically complex and may be skipped if one is not familiar with similar concepts. It is not vital for this study to understand them fully. The concept of both techniques is to look at the neighborhood of each data point in the original dimension and try to replicate the same neighborhood in a lower-dimensional subspace.

2.3.3.1 Principal Component Analysis (PCA)

One of the most commonly used methods for dimensionality reduction is PCA. It is often used for noise reduction and visualization (Murphy, 2012b). The concept of PCA is to perform an orthogonal projection into a lower-dimensional subspace in a way that maximizes the variance of the projected data (Bishop, 2006). The output is a set of principal components of order 1 to n, where n is the number of dimensions of the higher-dimensional data that is being projected. The first principal component captures the most variance in the data, the second captures the second most, and so on.

Maximizing the variance of the projected data points, represented by the matrix Z, of the original data $x_{1:N}$ is equivalent to minimizing the reconstruction error $J(W, Z)$ displayed in Equation 2.18, where $\hat{x}_i = W z_i$ and W contains L orthogonal basis vectors $w_j$ (Murphy, 2012b).

$$J(W, Z) = \frac{1}{N} \sum_{i=1}^{N} \|x_i - \hat{x}_i\|^2 \tag{2.18}$$

To calculate the optimal projection $\hat{W}$ to the lower-dimensional subspace that minimizes $J(W, Z)$, $\hat{W}$ is set equal to $V_L$, where $V_L$ contains the L eigenvectors of the empirical covariance matrix $\hat{\Sigma}$ (see Equation 2.19) with the largest eigenvalues (Murphy, 2012b).

$$\hat{\Sigma} = \frac{1}{N} \sum_{i=1}^{N} x_i x_i^T \tag{2.19}$$
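As a minimal sketch of the eigenvector computation, the first principal component can be approximated by power iteration on the empirical covariance matrix of Equation 2.19. The data is centered first, since the covariance formula presumes zero-mean data. Power iteration is chosen here only to keep the example free of external libraries; the function name and the toy data are assumptions.

```python
def first_principal_component(X, n_iter=100):
    """Approximate the eigenvector of the empirical covariance matrix with the
    largest eigenvalue using power iteration."""
    n, d = len(X), len(X[0])
    means = [sum(row[j] for row in X) / n for j in range(d)]
    Xc = [[row[j] - means[j] for j in range(d)] for row in X]
    # Equation 2.19 applied to the centered data.
    cov = [[sum(r[a] * r[b] for r in Xc) / n for b in range(d)] for a in range(d)]
    # The start vector must not be orthogonal to the leading eigenvector.
    w = [1.0] * d
    for _ in range(n_iter):
        w = [sum(cov[a][b] * w[b] for b in range(d)) for a in range(d)]
        norm = sum(v * v for v in w) ** 0.5
        w = [v / norm for v in w]
    return w

# Toy data that varies more along the first axis than along the second.
X = [[2.0, 0.0], [-2.0, 0.0], [0.0, 1.0], [0.0, -1.0]]
print(first_principal_component(X))
```

For this data the result is, up to numerical precision, the unit vector along the first axis, which is the direction of largest variance.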

2.3.3.2 t-Distributed Stochastic Neighbor Embedding (t-SNE)

Maaten et al. (2008) introduced another dimensionality reduction technique called t-SNE that is mainly used to visualize higher-dimensional data. The technique is based on the stochastic neighbor embedding (SNE) algorithm (a k-neighbors graph learning algorithm) presented by Hinton et al. (2003), where the focus is on projecting every data point in such a way that it optimally preserves its neighborhood identity. The improvement that t-SNE provides is that it uses a t-distribution to calculate similarities instead of the Gaussian distribution that is used in SNE. Because it uses a more heavy-tailed distribution, t-SNE prevents the data points from crowding together in the lower-dimensional space, and also makes the optimization problem easier to solve (Maaten et al., 2008). To find the optimal projection, the conditional probabilities $p_{j|i}$ and $q_{j|i}$ are calculated. $p_{j|i}$ is the similarity of data points $x_i$ and $x_j$, expressed as the conditional probability that $x_i$ would choose $x_j$ as a neighbor if the neighbors were chosen according to a Gaussian distribution centered around $x_i$ with the variance $\sigma^2$. The calculation of $p_{j|i}$ is displayed in Equation 2.20.

$$p_{j|i} = \frac{\exp(-\|x_i - x_j\|^2 / 2\sigma^2)}{\sum_{k \neq i} \exp(-\|x_i - x_k\|^2 / 2\sigma^2)} \tag{2.20}$$

Just as $p_{j|i}$ is the similarity of $x_i$ and $x_j$, $q_{j|i}$ is the corresponding similarity of $y_i$ and $y_j$, which represent the lower-dimensional counterparts of $x_i$ and $x_j$. Equation 2.21 displays how $q_{j|i}$ is calculated.

$$q_{j|i} = \frac{\exp(-\|y_i - y_j\|^2)}{\sum_{k \neq i} \exp(-\|y_i - y_k\|^2)} \tag{2.21}$$

After $p_{j|i}$ and $q_{j|i}$ are calculated for each data point, a projection is found by minimizing a loss function. The loss function is defined as the sum of the Kullback-Leibler divergences between $p_{j|i}$ and $q_{j|i}$ over all data points, as described by Equation 2.22.

$$KL(p_{j|i} \,\|\, q_{j|i}) = \sum_i \sum_j p_{j|i} \log \frac{p_{j|i}}{q_{j|i}} \tag{2.22}$$
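The quantities in Equations 2.20 to 2.22 can be computed directly for a handful of points. The sketch below follows the equations as written, with a single shared $\sigma$ and with Equation 2.21 obtained as the special case $2\sigma^2 = 1$; it does not perform the optimization of the projection, and the function names and toy points are assumptions.

```python
from math import exp, log

def sq_dist(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b))

def conditional_probs(points, sigma=1.0):
    """Matrix P with P[i][j] = p_{j|i} as in Equation 2.20.
    With sigma = 2 ** -0.5 the same formula gives Equation 2.21."""
    n = len(points)
    P = []
    for i in range(n):
        weights = [0.0 if j == i
                   else exp(-sq_dist(points[i], points[j]) / (2 * sigma ** 2))
                   for j in range(n)]
        total = sum(weights)
        P.append([w / total for w in weights])
    return P

def kl_loss(P, Q):
    """Equation 2.22; terms with p = 0 contribute nothing."""
    return sum(p * log(p / q)
               for P_i, Q_i in zip(P, Q)
               for p, q in zip(P_i, Q_i) if p > 0)

high = [[0.0, 0.0], [1.0, 0.0], [5.0, 0.0]]  # original points
low = [[0.0], [1.0], [2.0]]                   # a candidate projection
P = conditional_probs(high)
Q = conditional_probs(low, sigma=2 ** -0.5)
print(kl_loss(P, Q))
```

The loss is zero when the two sets of similarities agree exactly, and an optimizer would adjust the lower-dimensional points to reduce it.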

2.3.3.3 Uniform Manifold Approximation and Projection (UMAP)

UMAP was introduced by McInnes et al. (2018) as a dimensionality reduction technique both for visualization and for general-purpose use in machine learning. Because UMAP is based on manifold learning and topological theory, it scales well with large datasets compared to t-SNE (ibid.). Even though UMAP is different from t-SNE, it can also be considered a k-neighbors graph learning algorithm.

The first step in UMAP is to create a weighted k-neighbor graph. The input to UMAP is a dataset $X = \{x_1, \ldots, x_n\}$, a hyperparameter $k$ (the number of neighbors to use), and a dissimilarity measure $d$, where the dissimilarity between $x_i$ and $x_j$ is defined as $d(x_i, x_j)$. For each data point $x_i$, a set of $k$ nearest neighbors $\{x_{i_1}, \ldots, x_{i_k}\}$ is calculated with an arbitrary nearest neighbor algorithm. After the $k$ nearest neighbors have been calculated, $\rho_i$ (Equation 2.23) and $\sigma_i$ (Equation 2.24) are defined for each $x_i$; both are used to calculate the weights in the graph.

$$\rho_i = \min_j \{\, d(x_i, x_{i_j}) \mid 1 \leq j \leq k,\ d(x_i, x_{i_j}) > 0 \,\} \tag{2.23}$$

$$\sum_{j=1}^{k} \exp\left(\frac{-\max(0,\ d(x_i, x_{i_j}) - \rho_i)}{\sigma_i}\right) = \log_2(k) \tag{2.24}$$
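Equation 2.24 defines $\sigma_i$ only implicitly, so it has to be solved numerically. The sketch below computes $\rho_i$ directly from Equation 2.23 and finds $\sigma_i$ by bisection, which works because the left-hand side of Equation 2.24 grows monotonically with $\sigma_i$. Bisection, the search bounds, the function name and the toy distances are assumptions made for illustration; the text above does not prescribe how $\sigma_i$ is found.

```python
import numpy as np


def rho_and_sigma(knn_dists, n_iter=64, lower=1e-6, upper=1e6):
    """Compute rho_i (Equation 2.23) and sigma_i (Equation 2.24) for every point.

    knn_dists has shape (n, k); row i holds the distances d(x_i, x_{i_j})
    from x_i to its k nearest neighbors.
    """
    n, k = knn_dists.shape
    target = np.log2(k)
    rho = np.zeros(n)
    sigma = np.zeros(n)
    for i in range(n):
        positive = knn_dists[i][knn_dists[i] > 0]
        rho[i] = positive.min() if positive.size else 0.0
        shifted = np.maximum(0.0, knn_dists[i] - rho[i])
        lo, hi = lower, upper                    # assumed search interval for sigma_i
        for _ in range(n_iter):
            mid = 0.5 * (lo + hi)
            if np.sum(np.exp(-shifted / mid)) > target:
                hi = mid                         # sum too large: sigma_i must shrink
            else:
                lo = mid                         # sum too small: sigma_i must grow
        sigma[i] = 0.5 * (lo + hi)
    return rho, sigma


# Hypothetical distances from two points to their k = 4 nearest neighbors.
knn_dists = np.array([
    [0.5, 1.0, 1.5, 2.0],
    [0.2, 0.4, 0.9, 1.3],
])
rho, sigma = rho_and_sigma(knn_dists)
```

Note that an exact solution does not exist for every input: if the number of neighbors at distance $\rho_i$ or closer is already at least $\log_2(k)$, the search simply converges to the lower bound.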

The weights $w((x_i, x_j))$ of the edges in the graph are calculated as shown in Equation 2.25.

$$w((x_i, x_j)) = \exp\left(\frac{-\max(0,\ d(x_i, x_j) - \rho_i)}{\sigma_i}\right) \tag{2.25}$$

UMAP then uses a force-directed graph layout algorithm in a lower-dimensional space to generate a new graph. The layout is found by minimizing the cross-entropy $C_{UMAP}$ between the original graph representation and the lower-dimensional graph representation. The calculation of the cross-entropy can be seen in Equation 2.26, where $v_{ij}$ are the weights of the edges in the lower-dimensional graph.

$$C_{UMAP} = \sum_{i \neq j} \left[ v_{ij} \log\left(\frac{v_{ij}}{w_{ij}}\right) + (1 - v_{ij}) \log\left(\frac{1 - v_{ij}}{1 - w_{ij}}\right) \right] \tag{2.26}$$
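As an illustration, Equations 2.25 and 2.26 can be evaluated with a few lines of NumPy. This is a sketch that follows the equations as they are written above, not the implementation of McInnes et al. (2018). It assumes that $w_{ij}$ in Equation 2.26 are the edge weights of the original graph, and the function names, the clipping constant and the toy weight matrices are hypothetical.

```python
import numpy as np


def edge_weight(dist, rho_i, sigma_i):
    """Weight of an edge leaving x_i (Equation 2.25)."""
    return np.exp(-np.maximum(0.0, dist - rho_i) / sigma_i)


def cross_entropy(v, w, eps=1e-12):
    """Cross-entropy of Equation 2.26, summed over all pairs with i != j.

    v: (n, n) edge weights of the lower-dimensional graph.
    w: (n, n) edge weights of the original graph (assumed interpretation of w_ij).
    """
    v = np.clip(v, eps, 1.0 - eps)               # keep the logarithms finite
    w = np.clip(w, eps, 1.0 - eps)
    terms = v * np.log(v / w) + (1.0 - v) * np.log((1.0 - v) / (1.0 - w))
    off_diagonal = ~np.eye(v.shape[0], dtype=bool)
    return float(terms[off_diagonal].sum())


# Hypothetical weights for a graph with three nodes.
w = np.array([
    [0.0, 0.8, 0.1],
    [0.8, 0.0, 0.3],
    [0.1, 0.3, 0.0],
])
v = np.array([
    [0.0, 0.6, 0.2],
    [0.6, 0.0, 0.4],
    [0.2, 0.4, 0.0],
])
loss = cross_entropy(v, w)
```

The weight of an edge is 1 for the closest neighbor, where the distance equals $\rho_i$, and decays exponentially beyond that. The cross-entropy is zero when the two graphs have identical weights.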

2.4 Adding implicit ratings to a Recommender System (RS)

Adding implicit ratings, which are described in Section 2.2, to a RS could be useful to increase its performance. Nichols (1998) explains that explicit ratings have the drawback that they may not reflect actual customer behavior. For example, the ratings of a user of a news application might not reflect what the user reads if the user only rates articles that they do not like. Moreover, the author underlines the difficulties there might be in acquiring explicit ratings, and discusses further how implicit ratings could be the solution. There are several possible types of implicit ratings that can be captured by the system, which are listed in Table 2.1 (Nichols, 1998).

Konstan et al. (1997) carried out the GroupLens project, implementing a CF RS for news articles. The authors state that implicit ratings are a good way of obtaining more ratings, and they conclude that using the time spent on an article was nearly as accurate as using explicit ratings when predicting ratings.

Lastly, Nichols (1998) states that implicit ratings are a useful source with the potential of boosting a RS. The author further concludes that implicit ratings should preferably be used as a complement to explicit ratings rather than alone, since their effectiveness and stability are not fully explored.

Table 2.1: Potential implicit ratings information that can be acquired by a system.

Action                   Description

Assess                   evaluates or recommends
Repeated Use (Number)    e.g. multiple check out stamps
Save / Print             saves document to personal storage
Delete                   deletes an item
Refer                    cites or otherwise refers to item
Reply (Time)             replies to item
Mark                     add to a 'marked' or 'interesting' list
Examine / Read (Time)    looks at whole item
Consider (Time)          looks at abstract
Glimpse                  sees title / surrogate in list
Associate                returns in search but never glimpses
Query                    association of terms from queries

2.5 Advantages and drawbacks of CF and CN

There are several advantages and disadvantages with both the CF and the CN approach, which can be seen in Table 2.2 (Barragáns-Martínez et al., 2010). The major advantage of the CF approach is that there is no need to have content data available, or techniques in place to perform content analysis, since the recommendation is based solely on user ratings. Moreover, there is the possibility of serendipity, which is the chance of recommending something novel to the user. There are, however, several drawbacks with the CF approach. One of them is the first-rater problem. If an item is new, either because it was recently added to the system or because it has not yet received any explicit or implicit feedback, it is impossible to recommend this item with the CF approach because the system relies only on the preferences of the users. Additionally, there is the cold-start problem: when a user has not submitted any ratings, the system cannot derive the preferences of that user and is thus unable to predict any ratings for that user.

The major advantage of the CN approach is that user rating data is not needed for new items in the same way as for the CF approach, since item similarity is computed based on content, not on ratings. Another advantage is that there is often a wide range of information available to analyse, whether it is the year of a book, the singer of a song, the introduction of a novel, or the books of an author. The drawback, nonetheless, is the tendency towards over-specialization, which in the literature is also called the serendipity problem. Since the CN approach recommends items based on how similar their content is to the items that a user has rated, there is a risk that the user will be recommended items that are too similar to the ones already read. This could limit