Department of Science and Technology
Linköping University

LiU-ITN-TEK-A--17/031--SE

Implementing a scalable recommender system for social networks

Master's thesis in Computer Engineering
at the Institute of Technology, Linköping University

Alexander Cederblad

Supervisor: Pierangelo Dell'Acqua
Examiner: Camilla Forsell


Copyright

The publishers will keep this document online on the Internet - or its possible replacement - for a considerable time from the date of publication barring exceptional circumstances.

The online availability of the document implies a permanent permission for anyone to read, to download, to print out single copies for your own use and to use it unchanged for any non-commercial research and educational purpose. Subsequent transfers of copyright cannot revoke this permission. All other uses of the document are conditional on the consent of the copyright owner. The publisher has taken technical and administrative measures to assure authenticity, security and accessibility.

According to intellectual property law the author has the right to be mentioned when his/her work is accessed as described above and to be protected against infringement.

For additional information about the Linköping University Electronic Press and its procedures for publication and for assurance of document integrity, please refer to its WWW home page: http://www.ep.liu.se/

Abstract

Large numbers of items and users with different characteristics and preferences make personalized recommendation a difficult problem. Many companies employ recommender systems to solve the problem of discovery and information overload, where it is unreasonable for a user to go through all items to find something interesting. Recommender systems have become a popular field of research during the past two decades. For many companies, recommendations are an important aspect of their products, both for user experience and for revenue. This master's thesis describes the development and evaluation of a recommender system in the context of a social network for sports fishing called Fishbrain. It describes and evaluates several different approaches to recommender systems. It reasons about user characteristics, the user interface, and the feedback data provided by the users, all of which help inform the recommendations. The work aims to improve user experience in the given context. All this has been implemented and evaluated, with mixed results, considering the many variables taken into account that are important to Fishbrain.


Acknowledgments

I would like to thank Fishbrain for allowing me to finish my thesis under their stewardship. I express my humble gratitude to Niklas Andersson for helping me brainstorm ideas at the inception of the project, for the continuous support during the entire project, and for keeping me on track. Also from Fishbrain, I would like to thank Mattias Lundell for taking the time to review my work and give me valuable feedback, and for showing a special interest during my time working on the thesis. Thanks to Ariel Ekgren for challenging my critical thinking during the data analysis.

Finally, I would like to thank my supervisor from the university, Pierangelo Dell'Acqua, for valuable feedback and encouragement during meetings, without which I would not have been able to finish the work.

Contents

Abstract
Acknowledgments
Contents
List of Figures
List of Tables

1 Introduction
1.1 Motivation
1.2 Discovery in similar services
1.3 Aim
1.4 Research questions
1.5 Delimitations

2 Theory
2.1 Recommender systems
2.2 Content-based filtering
2.3 Collaborative filtering
2.4 Hybrid recommender systems
2.5 Non-traditional approaches
2.6 Cold start problem
2.7 Feedback data
2.8 Data sparsity
2.9 Explaining recommendations
2.10 Evaluation

3 Method
3.1 Feedback data exploration
3.2 Implementation
3.3 Evaluation in stages

4 Results
4.1 Dataset
4.2 Evaluation framework
4.3 Offline evaluation
4.4 Online evaluation

5 Discussion
5.1 Results
5.2 Method
5.3 The work in a wider context

6 Conclusion
6.1 Research questions
6.2 Aim
6.3 Future work

List of Figures

1.1 (a) Screenshot of the Explore view where users can explore content from specific geographical areas by browsing a world map while filtering on species, fishing methods, etc. (b) The current Discover view where users can search and follow users, fish species, and fishing methods that may interest them at their own volition. Below these options is a list of the latest catches globally.
1.2 Showcasing two similar services and their presentation of recommendations.
1.3 Showcasing two similar services and their presentation of recommendations.
3.1 Diagram of the recommender system infrastructure. Arrows indicate dependency, not flow of information.
3.2 Recommendations as presented on the two different platforms.

List of Tables

3.1 The number of users for each dataset is relative to the number of users in the sparse, long time frame dataset. A larger mean and median also indicate the amount of data for each user, and as stated in the previous chapter, more data implies better recommendations.
4.1 Offline evaluation results for the different approaches tested on two different datasets. Hyphenated cells are missing measurements for the given approach. Cells with bold numbers illustrate the argued best result. For P@10, the number is the fraction of recommended items that are in the test set. For RMSE, the number is the numerical error for the predicted ratings.
4.2 Online evaluation results for users globally. The numbers in the cells represent the percentage (%) change in the conversion rates measured. Cells with bold text indicate changes with a reasonable probability level (* 80 %, ** 95 %) to determine the results as statistically significant.
4.3 Online evaluation results for users in the USA.
4.4 Online evaluation results for users in Sweden.

1 Introduction

1.1 Motivation

Fishbrain is a company developing a social network and utility application (also called Fishbrain) for anglers. The application is available for iOS and Android. It features utility functionality for preparing fishing trips and enables socializing with other anglers by sharing fishing experiences. The social network has over 3,500,000 users. Fishbrain provides functionality similar to that of other social networks: a social feed where users can keep up with other users (and/or other entities) that they follow, a profile page, a few facilities for finding new friends and content, etc. In addition, the application also provides utility functionality such as weather (or rather fishing prospect) forecasts, and a map where anglers can explore new bodies of water to visit and fish.

Fishing is a highly seasonal activity. For the main target group, located in the United States, high season occurs around May through August. During this time users do most of their yearly fishing: time on the water, posting catches, and engaging with the utility side of the application. During this time the current geographical neighborhood of the users is especially important; users want to know when fish start to bite in the waters they usually fish. During off-season, users are more inclined to browse through content from the past season and explore fishing activity in their own vicinity but also around the world.

The most important type of content to Fishbrain is the catches that users share. They include information about the catch (location, time of day, species, weight, bait, etc.) accompanied by a photo and optionally a short text. Users may also share fishing-related status updates, images, and videos without them being logged catches. Users consuming the content provided by other users may provide feedback in the form of a "like" and/or comments. This information can be used to improve the user experience by providing relevant and personalized recommendations of content to users.

Arguably the most important thing for the social network, putting aside the utility side of the application, is that users interact with each other and engage in the community. Therefore it is important to make sure users have a good experience while using the application, and to keep them engaged so that they want to come back. The hypothesis of the thesis is that improving content discovery and exploration will improve overall user experience. The current facilities for exploration and discovery are presented in Figure 1.1. They currently lack the ability to produce personalized recommendations for users.

Figure 1.1: (a) Screenshot of the Explore view where users can explore content from specific geographical areas by browsing a world map while filtering on species, fishing methods, etc. (b) The current Discover view where users can search and follow users, fish species, and fishing methods that may interest them at their own volition. Below these options is a list of the latest catches globally.

1.2 Discovery in similar services

There are several social networks with similar user interaction that provide personalized recommendations for content discovery to their users.

Instagram allows their users to interact with each other similarly to Fishbrain. They use the follow/followed paradigm of connecting users and let users "like" and comment on posts that users make in the form of images and videos. They provide a "Discover" view (Figure 1.2a) with a grid of recommended posts. Each post is accompanied by an explanation as to why it was recommended (e.g. "based on people you like", "based on people you follow", etc.) as well as an option for users to mark it as not desired by opting in to see "fewer posts like this", which may be used as feedback to make better recommendations. The Instagram engineering team published an article in 2015 describing their approach to discovery [10].

YouTube previously had a five-star rating system for their videos; they have since switched to likes and dislikes as the main feedback users can provide. They also have comments enabled and the ability to mark videos as favourites. Aside from these explicit forms of feedback, they can also track implicit feedback in the form of how many views a video gets, how much of the video is watched, how many times a user replays a video, etc. Researchers at Google have published a few papers describing how they might make recommendations on YouTube [6, 8, 33].

Figure 1.2: Showcasing two similar services and their presentation of recommendations. (a) Instagram Discover. (b) YouTube recommendations.

Netflix has been involved in the recommender system community arguably more than any other company, notably since they sponsored the Netflix Prize, offering 1,000,000 USD for improving their then-current algorithm CineMatch. The winning solution, "BellKor's Pragmatic Chaos", included a collection of predictors, and its development has been featured in many research papers, the final solution being described in [18]. The competition sparked an interest in recommender systems and resulted in many contributions to the field. Netflix used to have a five-star rating system; however, they recently decided to change it to a "thumbs up"/"thumbs down" system, their motivation being that users were not being totally honest and that Netflix would be able to make better recommendations using this paradigm. This sparked controversy on social media amongst users who had meticulously kept ratings on their accounts, worried that recommendations would get worse. Amatriain et al. authored a case study about recommendations at Netflix [2].

Pinterest is a photo sharing website where users can catalog their ideas and inspiration for food recipes, interior design, fashion, etc. Users are recommended items that are similar to the content a user is interested in, as well as related "pins", as the items are called when collected by a user. Some of the company's approaches are described in [15].

The listed companies, and others, rely heavily on personalized recommended content, and so it is an important aspect from a business perspective [30], due to the significant chunk of revenue that is generated that way. YouTube has reported that 60 % of clicks from their homepage are recommended content [8]. Almost everything on Netflix is a recommendation, and it is therefore an important part of their business model [2]. In 2015 Gomez-Uribe et al. [11] claimed that two out of three hours of watched content came from recommendations. Matchmaking/dating websites, which rely heavily on recommender systems, are an enormous industry [16]. Music recommendation is another application, used by many music service providers, e.g. Spotify [4].

Figure 1.3: Showcasing two similar services and their presentation of recommendations. (a) Netflix recommendations. (b) Pinterest recommendations.

Recommender systems can be used to produce targeted ads; however, there are some differences that should be discussed. Many targeted ad systems rely heavily on a real-time bidding system where advertisers can bid on showing their ads to users by selecting a target group based on a variable number of characteristics: demographics, interests, etc. Recommender systems instead focus on providing items that users would not have found by themselves, items that are a total surprise or an unexpected coincidence. This concept is called serendipity and will be referred to as such henceforth. Targeted ads show relevant items coming from the highest bidder, or from some other metric maximizing revenue. For a recommender system, such bidding would be considered cheating the system and, through that, broken. An example from Netflix is described by Amatriain et al. [2], where they claim that the cost of streaming is similar for every item. This means that they can focus on recommending the best item for their users instead of having to directly account for revenue while making recommendations.

1.3 Aim

The aim of this project is to improve the user experience in the application by providing personalized recommendations to the users in the social network. These recommendations should help users to engage with other users by giving feedback and expanding their own social network. The recommender system should be able to recommend items to the users that are relevant for them, based on their social network and personal preference. The solution should also be able to scale as the user base grows without compromising user experience.

The project demands exploring the data available on user interaction in the Fishbrain application and how users are connected, to be able to reason and hypothesize about the best way to make recommendations.

Many companies and social networks have similar ways of recommending items and presenting recommendations to their users. However, they target different groups of people. Their users differ in behaviour, and their applications differ in user experience. These differences need to be taken into account when designing the recommender system.

1.4 Research questions

1. How does one evaluate different approaches to recommender systems, and more specifically different models: offline while developing, and online with real users?

2. How well do traditional recommender systems based on collaborative filtering perform when having only one-class, positive ratings, as opposed to other types of ratings such as five-star ratings and implicit feedback?

3. Which user characteristics need to be taken into account when developing a recommender system? How do different users respond to recommendations?

4. Can the recommender system help increase the use of the Fishbrain application?

1.5 Delimitations

There are many approaches to producing personalized content. Therefore, to narrow the scope of the thesis, a delimitation on the underlying data used to produce recommendations is set. User feedback will be used exclusively as the basis for making recommendations. Going further, the thesis will only consider explicit feedback that users give, not implicit feedback collected from user behaviour in the application.

2 Theory

2.1 Recommender systems

Recommender systems are systems that suggest items that may be interesting to a user. For example, they can be used to help make decisions about what to buy, what to listen to, and what to read. They solve the problem of information overload, where it is unreasonable for a user to work through all items to find something of interest or, in the case of search, to know what to actually search for.

There has been a lot of work in the field of recommender systems and it is an active field of research. There is an ACM conference series (RecSys) exclusively focused on the subject. In 2006 Netflix sponsored a contest that boosted the interest in, and research on, the subject.

Recommender systems are typically categorized as follows:

Content-based filtering Recommendations are based on the similarity between user preferences and item profiles. Item profiles consist of important characteristics of that item. Text documents, for example, could be represented in vector space by important keywords; movies could be represented by genre and actors.

Collaborative filtering The recommendations are calculated based on the similarity between different users considering their collective interactions with items (e.g. ratings, likes, and dislikes). This approach is therefore domain agnostic considering the items and users.

Hybrid recommender Combining different approaches has been shown to be successful for many cases. These types of recommender systems can be a mix of collaborative- and content-based filtering together with information about the explicit user preference (also known as knowledge-based recommender systems), demographic information, etc.

Even though the subject of recommender systems has been explored for many years, it is not trivial to implement a well-performing recommender system. Data needs to be collected and groomed, and models need to be chosen and tweaked for the specific domain. Recognizing the domain is important since the type of underlying data available differs between domains. It is also important to recognize that users are different, both as individuals and depending on the domain they find themselves in. Therefore it is not trivial to choose an approach without evaluating the recommender system properly to make sure it performs well. Evaluation can be conducted while developing, testing the hypothesis by measuring the error and prediction precision of the model. This can help steer the development in the right direction and improve the model. However, it is not until actually testing the system with real users that a conclusion can be drawn on the performance of the recommendations.

The recommender problem involves predicting the user rating for an item and ranking the predictions so that the system can provide recommendations based on these predicted ratings. Given the set of users U and the set of items I, the rating function can be expressed as:

r : \mathcal{U} \times \mathcal{I} \to \mathbb{R}    (2.1)

The goal is to make predictions for every user such that, for a user u, the best prediction from the set of items I is found (Equation 2.2), or rather to rank the predictions and produce an ordered top-N sequence of the best predictions.

\text{prediction} = \operatorname*{argmax}_{j \in \mathcal{I}} r(u, j)    (2.2)

2.2 Content-based filtering

Content-based filtering is based on item characteristics and user preferences [21]. Items recommended have similar characteristics to items preferred by the user in the past. These characteristics can be labels describing an item, like a genre or the director of a movie, or labels derived from the collection of items (e.g. TF-IDF [28]), etc. Thus, an item can be described by a vector of labels, called the feature vector. Similarly to items, user preferences can be described by preference vectors. It is then fairly straightforward to recommend items based on their similarity to each other by some measurement. One common measurement for comparing vectors is the angle between them. Selecting the items for which the vectors have the smallest angles gives the most similar items.
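As a concrete illustration of this vector-space matching, the sketch below ranks items by the cosine of the angle between item feature vectors and a user preference vector. The vectors, item names, and function names are assumptions for illustration, not the implementation used in the thesis.

```python
# Minimal sketch of content-based ranking by cosine similarity.
# Feature vectors are assumed to be pre-computed (e.g. TF-IDF weights).
import numpy as np

def cosine_similarity(a: np.ndarray, b: np.ndarray) -> float:
    """Cosine of the angle between two feature vectors."""
    denom = np.linalg.norm(a) * np.linalg.norm(b)
    return float(a @ b / denom) if denom else 0.0

def rank_items(user_profile: np.ndarray, item_features: dict, n: int = 10):
    """Return the n items whose feature vectors are most similar to the user profile."""
    scored = [(item_id, cosine_similarity(user_profile, vec))
              for item_id, vec in item_features.items()]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)[:n]

items = {"catch_1": np.array([1.0, 0.0, 0.5]),
         "catch_2": np.array([0.2, 0.9, 0.1])}
print(rank_items(np.array([0.8, 0.1, 0.4]), items, n=1))
```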

Another approach is to create decision trees for each user [28]. Decision tree learning is a predictive model used in machine learning. The decision tree is represented as a binary tree where each non-leaf node is a condition on some feature of the item. The leaf nodes then represent the decision, in the case of recommender systems, whether the user would prefer an item or not.

The main advantage of content-based filtering is that it is independent of other users. Only the relationship between the user in question and the items under consideration is taken into account. This helps eliminate the cold-start problem (described in more detail in section 2.6) for new items. There are also drawbacks to content-based filtering. Content labeling is a hard problem to solve such that recommendations are both unexpected (serendipitous) and interesting to the users. Consider a movie recommender system: a content-based recommender system would easily be able to recommend movies within the same genres, and with the same directors, that a user likes, but would fail to recommend something novel. The cold-start problem also remains for new users.

2.3 Collaborative filtering

Collaborative filtering takes advantage of users' collective ratings of items. It uses this information to predict ratings based on users that are similar with respect to their ratings. Given a set of users U and a set of items I, the ratings matrix R is a |U| × |I| matrix describing the ratings for items and users. The columns of R represent items and its rows represent users. A rating for an item i and a user u is written as r_ui.

Collaborative filtering is based on the assumption that similar users display similar rating patterns. Essentially, the assumption made for the prediction is that, based on users' similar past behaviour, users will have similar preferences in the future. Collaborative filtering can be divided into two different approaches: memory based and model based.

Memory based approach

The memory based approach [5], sometimes called the neighborhood approach, uses the entire set of ratings directly in the rating calculations. It is called the memory based approach since it uses all ratings to produce recommendations. Similar users are found by a similarity measurement sim(u, v), where u and v are two users. One commonly used similarity measurement is Pearson's correlation coefficient. It takes into account the users' ratings with respect to their average rating. This similarity measurement is then used in the prediction calculation. The prediction pred(u, i), where u is a user, i is an item, and \bar{v}_u is the average rating for a user u, is then often expressed as a weighted sum over the other users:

\mathrm{pred}(u, i) = \bar{v}_u + \frac{\sum_{v \in \mathcal{U}} \mathrm{sim}(u, v)\,(r_{vi} - \bar{v}_v)}{\sum_{v \in \mathcal{U}} \mathrm{sim}(u, v)}    (2.3)
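A minimal sketch of this neighborhood prediction, assuming a small dense toy ratings matrix; it follows Equation 2.3 with the common absolute-value normalization in the denominator. The data and helper names are illustrative only.

```python
# Sketch of memory-based (neighborhood) prediction with Pearson similarity.
import numpy as np

def pearson(u: np.ndarray, v: np.ndarray, mask: np.ndarray) -> float:
    """Pearson correlation over items rated by both users (mask marks co-rated items)."""
    if mask.sum() < 2:
        return 0.0
    du, dv = u[mask] - u[mask].mean(), v[mask] - v[mask].mean()
    denom = np.sqrt((du ** 2).sum() * (dv ** 2).sum())
    return float((du * dv).sum() / denom) if denom else 0.0

def predict(R: np.ndarray, rated: np.ndarray, u: int, i: int) -> float:
    """Weighted-sum prediction of user u's rating for item i (Equation 2.3)."""
    mean_u = R[u][rated[u]].mean()
    num = den = 0.0
    for v in range(R.shape[0]):
        if v == u or not rated[v, i]:
            continue
        s = pearson(R[u], R[v], rated[u] & rated[v])
        num += s * (R[v, i] - R[v][rated[v]].mean())
        den += abs(s)
    return mean_u + num / den if den else mean_u

R = np.array([[5.0, 3.0, 0.0], [4.0, 2.0, 5.0], [1.0, 5.0, 4.0]])
rated = R > 0  # boolean mask of observed ratings
print(predict(R, rated, u=0, i=2))
```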

The performance of the memory based approach decreases as the data grows and data sparsity increases, which is more often than not the case for recommendation problems (described in section 2.8). Since the entire ratings matrix needs to be considered, this approach is not suitable for applications with massive user bases.

There have been attempts, with mixed results, to improve the performance of memory based recommender systems by using compression techniques to overcome memory constraints [31].

Model based approach

The model based approach involves uncovering latent factors from the observed ratings. These latent factor models can be created in various ways, most commonly by matrix factorization, or singular value decomposition (SVD), which involves transforming the items and users into latent factor spaces. These latent factor spaces try to characterize the users and items as inferred by the collective ratings. Other methods for uncovering these latent factors include neural networks, Bayesian networks, probabilistic LSA, and latent Dirichlet allocation. Although memory based approaches are more precise in the theoretical sense, since the calculations use all the data available, they do, as previously mentioned, suffer from performance issues at scale. The model based approach produces well-performing results even for large, sparse datasets.

Some of the best latent factor models are based on matrix factorization [19]. Matrix factorization approaches are popular since they combine scalability and predictive accuracy. Considering the ratings matrix R, it can be factored into matrices of lesser dimensions:

R \approx \hat{R} = PQ^T    (2.4)

where P is a |U| × k matrix and Q is a |I| × k matrix, where k is the number of latent factors. Refer to Equation 2.5 for an example illustration of this approximation, with k = 2.

PQ^T =
\begin{pmatrix} p_{11} & p_{12} \\ p_{21} & p_{22} \\ p_{31} & p_{32} \end{pmatrix}
\times
\begin{pmatrix} q_{11} & q_{12} & q_{13} & q_{14} \\ q_{21} & q_{22} & q_{23} & q_{24} \end{pmatrix}
=
\begin{pmatrix} \hat{r}_{11} & \hat{r}_{12} & \hat{r}_{13} & \hat{r}_{14} \\ \hat{r}_{21} & \hat{r}_{22} & \hat{r}_{23} & \hat{r}_{24} \\ \hat{r}_{31} & \hat{r}_{32} & \hat{r}_{33} & \hat{r}_{34} \end{pmatrix}    (2.5)

In the most common and simple approach, these matrices are estimated by minimizing the error of the approximation in Equation 2.4.
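The low-rank approximation in Equations 2.4-2.5 can be illustrated with a plain truncated SVD; note that this toy sketch assumes a fully observed ratings matrix, whereas a real recommender fits P and Q only on the observed ratings (see Equation 2.6). The data is made up.

```python
# Sketch of the rank-k approximation R ≈ P Q^T using numpy's SVD.
import numpy as np

R = np.array([[5.0, 3.0, 0.0, 1.0],
              [4.0, 0.0, 0.0, 1.0],
              [1.0, 1.0, 0.0, 5.0]])
k = 2  # number of latent factors

U, s, Vt = np.linalg.svd(R, full_matrices=False)
P = U[:, :k] * s[:k]        # |U| x k user-factor matrix
Q = Vt[:k, :].T             # |I| x k item-factor matrix

R_hat = P @ Q.T             # rank-k approximation of the ratings matrix
print(np.round(R_hat, 2))
```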

For recommender systems, most elements in the ratings matrix will be unknown. Earlier systems used to rely on imputation to fill in the missing values and make the matrix dense. This decreases performance since it increases the amount of data, and inaccurate imputation is sensitive to overfitting. More common today is to model only the observed ratings in the ratings matrix. Overfitting the model is avoided by using a regularized model. Factors are thus learned by minimizing the regularized squared error (2.6), where \hat{r} is a predicted rating, P and Q are the latent factor matrices for users and items, R is the regularization function, and λ is a constant controlling the amount of regularization. Different approaches to regularization for SVD are described in [26].

\epsilon = \min_{P,Q} \sum_{(u,i) \in R} (r_{ui} - \hat{r}_{ui})^2 + \lambda R(P, Q)    (2.6)

There are a few learning algorithms to consider for solving the problem of minimizing the error. Two common ones are stochastic gradient descent (SGD) and alternating least squares (ALS). The former algorithm is the simpler one, which can easily be implemented and works reasonably well. The latter can be slower, but it has the advantage of being able to parallelize the computation and of working well in systems with implicit data [14], which will be further explained in the upcoming section 2.7.
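A minimal SGD sketch for learning P and Q on observed ratings only, minimizing the regularized squared error of Equation 2.6. The hyperparameters and toy data are assumptions for illustration.

```python
# Sketch of stochastic gradient descent over observed ratings (Equation 2.6).
import numpy as np

def fit_sgd(ratings, n_users, n_items, k=2, lr=0.01, reg=0.1, epochs=200, seed=0):
    rng = np.random.default_rng(seed)
    P = rng.normal(scale=0.1, size=(n_users, k))
    Q = rng.normal(scale=0.1, size=(n_items, k))
    for _ in range(epochs):
        for u, i, r in ratings:                      # iterate over observed ratings only
            err = r - P[u] @ Q[i]                    # prediction error r_ui - r̂_ui
            pu = P[u].copy()                         # keep old user factors for the item update
            P[u] += lr * (err * Q[i] - reg * P[u])   # gradient step with L2 regularization
            Q[i] += lr * (err * pu - reg * Q[i])
    return P, Q

observed = [(0, 0, 5.0), (0, 1, 3.0), (1, 0, 4.0), (2, 1, 1.0), (2, 2, 5.0)]
P, Q = fit_sgd(observed, n_users=3, n_items=3)
print(round(P[0] @ Q[2], 2))  # predicted rating for an unobserved (user, item) pair
```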

Matrix factorization enables flexibility and customized solutions by adding biases. These are values that describe variations in individual users and items: users giving higher ratings than others, items receiving higher ratings than others. This can be an effect of items collectively being considered better than others, rather than of users' true preference. That is why biases can help predict ratings better than user interactions alone. Biases can for example be used to handle temporal effects on ratings, such as users changing their rating behaviour over time, or items changing in perception and therefore in the ratings they are given.
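One common way to incorporate such biases, as typically formulated in the matrix factorization literature (it is not spelled out in this chapter), is to predict a rating as the sum of a global mean, a user bias, an item bias, and the factor interaction:

\hat{r}_{ui} = \mu + b_u + b_i + p_u^T q_i

where \mu is the global average rating, b_u and b_i are the user and item biases, and p_u, q_i are the latent factor vectors. The biases are then learned alongside P and Q by adding them, with regularization, to the objective in Equation 2.6.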

2.4 Hybrid recommender systems

Hybrid recommender systems combine techniques from different approaches, often from collaborative filtering and content-based filtering. They can also include other techniques for making recommendations, e.g. collecting demographic information from users or having users supply preferences explicitly to draw conclusions from. The two latter techniques can be used to overcome the cold-start problem, which is covered in more detail in section 2.6.

The key aspect is how to combine the different techniques: they can be weighted, preferring the better-performing parts of the hybrid, or mixed to provide diversity, etc.

2.5 Non-traditional approaches

Since the recommendations handled in this report are in a social network context, graphs are a natural fit for solving the recommendation problem.

Graph databases could be utilized to make recommendations based on social connections. Such a system could recommend people to follow based on existing relationships by executing a simple query language statement. There are some limitations to recommending items with graph databases, the most apparent being recommending an item when no relationship exists. It is easy to recommend a friend of a friend, but to recommend a distant user with similar preferences, the graph queries are prone to quickly grow in size and complexity. Graph databases could possibly be part of a hybrid recommender system.

2.6 Cold start problem

The cold start problem arises when users have not yet given any feedback, and its counterpart when items have not yet received any feedback [12]. In that case there is no clear way of deriving user preference or, respectively, item appeal.

There are a few approaches that can be considered for making recommendations when user feedback is not available. As with content-based filtering, labels may be derived from the items through content analysis, image recognition, location, etc., or users might be required to provide them. As for user preference, there may be things that can be derived from metadata, e.g. location, age, gender, etc. These features can then be used to recommend items similar to these user profiles. It is important to note that these labels and profiles do not describe explicit preference for individual items; therefore top rated items, items in geographical proximity, etc. can be weighed in to improve the recommendations.

The interface providing the recommendations to the user may feature a cold-start view forcing users to specify a starting point for their preference before serving recommendations. The downside to this solution is that there is an extra hurdle for the users before being able to interact with the recommendations.

2.7 Feedback data

User feedback comprises the different ways users interact with the underlying system, and the feedback data is the recorded interactions that can be used by the recommender system to produce recommendations. The most common type of feedback described in the literature is ratings, most often 5-star ratings, probably because of the Netflix Prize's prevalence in the field. However, there are countless types of feedback and ways of presenting them to the user. Feedback can be given as different types of ratings: one-class "likes", binary "thumbs up"/"thumbs down", numeric scales, etc., but also as comments, views, time on site, etc. The feedback must be extracted from the system while considering how users interact with the underlying system or application, and what the recommender system tries to achieve. The different types of feedback can be divided into two groups, explicit and implicit feedback:

Explicit feedback Explicit feedback is feedback that users directly provide to evaluate an item; it is most often presented as ratings. Ratings are good because they show the degree of preference. With explicit ratings, users may feel that they have control over the recommendations presented to them. However, it has been shown that explicit ratings can be biased towards positive feedback, which can result in misleading rating predictions [22].

Implicit feedback Implicit feedback is feedback that users provide indirectly, or conversely, feedback that the system records about the user. There are a few important characteristics to consider about implicit feedback [14]. A user's implicit feedback for an item is considered to express confidence about user preference, as opposed to the degree of user preference (as with explicit feedback). Implicit feedback is positive-only; it is therefore hard to infer what users did not like. Implicit feedback data is inherently noisy, since a user's preference can only be guessed. For example, a user may be purchasing a gift for someone else, or a user may interact with something recommended to them that they did not like. Implicit feedback is less intrusive for the users than explicit feedback, but the feedback data can be very noisy.

2.8 Data sparsity

Considering the ratings matrix R, it is unlikely that it will be entirely filled with ratings; that would mean that all users have rated all items in the dataset. On the contrary, it is most often the case in the domain of collaborative filtering that ratings matrices are very sparse. Consider an example: in a movie recommender, it is unlikely that every user has rated every movie. The sparsity of a matrix is defined as the number of zero-valued elements divided by the total number of elements. It is the complement of the density of the matrix, which is the number of non-zero elements w divided by the total number of elements. For the ratings matrix R, its sparsity is defined as:

1 - \frac{w}{|\mathcal{U}| \cdot |\mathcal{I}|}    (2.7)

A less sparse (or more dense) matrix is preferred as it contains more information, making it easier to produce good recommendations. Depending on the context it may be possible to make the ratings matrix less sparse by removing users, and maybe even items, with few ratings. The different approaches for making recommendations could be evaluated with different datasets of varying sparsity.

2.9 Explaining recommendations

Users are more likely to trust a recommendation when they know the reason why it was made [13]. An explanation can range from telling the user that a presented item is a personalized recommendation, to explaining in detail the reasoning behind why an individual item was recommended. Surveys show that the vast majority of users want their recommendations accompanied by an explanation. Recommender systems can be divided into two groups, considering their ability to explain recommendations:

Black-box A black-box recommender system is unable, or has difficulty, explaining its recommendations. From a technical perspective, this may stem from limitations in the model. From a user perspective it means that users are not presented with explanations as to why an item was recommended.

White-box A white-box recommender system can, in contrast to black-box recommender systems, explain why a particular item was recommended to a user. This enables users to justify their trust in the system. Explanations can empower users, making them feel in control of the content that is provided to them. If users can understand why an item was recommended to them, they may choose to interact with the system to align themselves with the system's reasoning, effectively improving the recommender system.

2.10 Evaluation

Evaluating recommender systems can be done numerically to ensure that the model performs well with respect to a set goal or hypothesis. It can be advantageous to state the problem to be solved so as to simulate user behaviour, which can be difficult. Several approaches and considerations for evaluating a recommender system are described comprehensively in [1]. This offline evaluation can assess how well a model can predict and rank items. There are several evaluation metrics that can be used to evaluate recommender systems. A common one is root mean square error (RMSE), which measures the error between the training and testing data after running it through the system. Another evaluation strategy is measuring the precision of recommendations, i.e. the rate at which the recommender recommends items in the test set. There are a few commonly used metrics for measuring precision; two common ones are "Precision at K" (P@K) and "R-Precision".

Offline evaluation can be used when testing different models, to help narrow in on a good model for the recommender system. However, to make sure that users appreciate the recommendations in practice, they need to be included in the evaluation process [12]. This can be done by conducting user studies, to get qualitative information from users. It can also be done with online evaluation, actually testing the recommender system on users, preferably without them knowing. Online evaluation involves measuring some metric important to the system for which recommendations are made; for example user ratings, click rates, and number of purchases. The online evaluation can be done by conducting A/B tests while experimenting with different approaches. A/B testing involves showing certain variations of a feature to certain users, while measuring some important metric and reasoning about which variation in the test performs best.
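For reference, the two offline metrics discussed above can be computed as in the sketch below; the data passed in is purely illustrative.

```python
# Sketch of RMSE over predicted ratings and precision at K over a ranked list.
import math

def rmse(predicted, actual):
    """Root mean square error between predicted and held-out ratings."""
    return math.sqrt(sum((p - a) ** 2 for p, a in zip(predicted, actual)) / len(actual))

def precision_at_k(recommended, test_items, k=10):
    """Fraction of the top-k recommended items that appear in the user's test set."""
    top_k = recommended[:k]
    return sum(1 for item in top_k if item in test_items) / k

print(rmse([4.2, 3.1, 1.9], [4.0, 3.5, 2.0]))
print(precision_at_k(["a", "b", "c", "d"], {"b", "x"}, k=4))
```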

3 Method

This chapter is divided into three parts: (i) the explorations and considerations made for the feedback data used to produce recommendations, (ii) the implementation concerning the different approaches in algorithms and the evaluation framework used to provide and evaluate recommendations, and (iii) the considerations made for the evaluation, both offline and online.

3.1 Feedback data exploration

Because of the strong influence of the Netflix challenge and open datasets like MovieLens, much of the published research handles ratings on a numeric scale, most commonly one to five stars. Another common type of user feedback presented in the literature is implicit feedback. Neither of these is a perfect match for Fishbrain. Considering "likes" as ratings limits the level of preference to one class, meaning one level of positive-only ratings. Considering likes as implicit feedback, the preference is only a singular degree of confidence that an item matches the user preference. By definition, likes are not implicit since they are explicitly expressed by users. Arguments for handling the user feedback (in the form of likes) as explicit feedback, and conversely as implicit feedback:

Explicit feedback Likes are explicitly expressed by the user. Likes only describe a singular level of preference.

Implicit feedback Likes are not explicitly expressed by the user. Likes describe a singular level of confidence for preference.

There are widely adopted schemes to mix the two types of feedback, e.g. SVD++ [17]. The work presented in this thesis does not account for implicit feedback data due to limitations in the data collection of the underlying system, and also due to scoping the master’s thesis work to a reasonable level. The work did however involve testing out the different implicit feedback approaches with the ratings as is, along with the approaches more suited for explicit feedback.

In comparison to services like Netflix, which can very well recommend movies from decades ago, social networking posts are more time sensitive in that users heavily favour newer posts over older ones. Picking a ratings dataset from a time frame that can be appreciated by the user is therefore a convenient way of limiting the number of items to be considered for recommendation. Picking such a time frame is sensitive to how users react and are willing to interact with the system. Studying similar services, e.g. Pinterest and Instagram, it is quite clear that they only recommend fairly new items. Instagram seems to favour posts from within 24 hours, but it can also feature week-old posts. Pinterest recommends older posts than Instagram; somewhere within a couple of months seems to be common (somewhat depending on which view in the application). Considering that Fishbrain, as a service, is more similar to Instagram, and listening to qualitative feedback from users, a smaller time frame is favourable. Users do most of their fishing during weekends, so the most interesting posts are posted during, or shortly after, weekends. Therefore it makes sense to limit the time frame to week cycles.

                       Sparse                              Dense
Time frame   Users  Sparsity (%)   Mean  Median   Users  Sparsity (%)   Mean  Median
4 weeks       1.00         99.95  24.69       3    0.66         99.93  37.08       6
2 weeks       0.64         99.93  18.12       3    0.40         99.89  27.42       6
1 week        0.42         99.89  12.74       2    0.26         99.83  19.44       5

Table 3.1: The number of users for each dataset is relative to the number of users in the sparse, long time frame dataset. A larger mean and median also indicate the amount of data for each user, and as stated in the previous chapter, more data implies better recommendations.

For every presented feedback dataset, users have rated at least one item. Considering the cold-start problem, these are the users for which the collaborative filtering model will be able to produce recommendations.

The different approaches have been evaluated against a number of permutations of the dataset of user feedback (Table 3.1). These datasets vary in time frame and in whether ratings are restricted to users (and potentially items) above a set lower limit, as an optimization for sparsity. Optimizing for sparsity can be done by only considering the users (and potentially items) with a minimum number of ratings, as sketched below.
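A sketch of this sparsity optimization, assuming the feedback is available as (user, item) like pairs; the thresholds and identifiers are illustrative assumptions.

```python
# Keep only feedback from users/items with at least a minimum number of ratings,
# then compute the resulting sparsity as in Equation 2.7.
from collections import Counter

def densify(feedback, min_user_ratings=3, min_item_ratings=1):
    """feedback is an iterable of (user_id, item_id) pairs."""
    by_user = Counter(u for u, _ in feedback)
    by_item = Counter(i for _, i in feedback)
    return [(u, i) for u, i in feedback
            if by_user[u] >= min_user_ratings and by_item[i] >= min_item_ratings]

likes = [("u1", "c1"), ("u1", "c2"), ("u1", "c3"), ("u2", "c1"), ("u3", "c2")]
dense = densify(likes, min_user_ratings=3, min_item_ratings=1)
sparsity = 1 - len(dense) / (len({u for u, _ in dense}) * len({i for _, i in dense}))
print(dense, round(sparsity, 2))
```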

From the largest to the smallest dataset, the density is more than doubled. However, the dataset is still very sparse. Offline evaluation can give an indication of the impact of that change. Other benchmark datasets are less sparse, at around 94-98 %. The number of users for which ratings are present in the dataset decreases almost linearly in relation to the density. A larger time frame must be taken into account to make a denser dataset that can still produce recommendations for a reasonable number of users.

Options arise in how to handle the recency of recommendations that is hypothesised to be better for Fishbrain. Biases can be added to the model to take into account users' varying preferences as a function of time. Another option is to consider ratings from a larger time frame and then, during post-processing, filter items from a desirable time frame.

3.2 Implementation

Recommendation system model

Different collaborative filtering models were tested, among them the baseline regularized singular value decomposition (RegSVD). Not all models tested were chosen for further exploration and evaluation, due to the considerations described below. The memory based approaches were quickly ruled out due to performance issues with larger datasets.


RegSVD (SGD) This regularized SVD approach is a well-established approach for identifying latent factors, and is commonly used as a baseline recommender system. The implementation and the software package used for this approach are straightforward and easy to use.

NMF/NNMF (ALS) Non-negative matrix factorization with alternating least squares. This type of matrix factorization is quite similar to SVD, and has also been used for recommender systems [34]. This approach is implemented in the same software package as the previous entry.

RegSVD (ALS) This approach is from a different software package than the stochastic gradient descent approach. It is built for heavy parallelization and enormous datasets. Due to its implementation complexity, this package is, in comparison to the previously described RegSVD implementation, less attractive for further evaluation.

BPR Bayesian personalized ranking is a method for implicit datasets that centers around iteratively sampling positive and negative items and comparing them, i.e. learning to rank [29]. It reportedly optimizes for the ROC (receiver operating characteristic) curve, which imposes some issues when comparing the different results offline. The chosen software package implementing this algorithm is easy to use and easy to integrate into an existing system.

WARP Weighted Approximate-Rank Pairwise loss is another learning-to-rank algorithm similar to BPR. It was first introduced by Weston et al. [32]. The authors claim that, in comparison to BPR, it optimizes for precision as opposed to ROC, or area under the curve (AUC). It is implemented in the same software package as the BPR approach.
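As a rough sketch of how these two learning-to-rank approaches might be fit with the lightfm package (listed under Infrastructure below), assuming one-class likes in a sparse interaction matrix; the toy data and hyperparameters are assumptions, not the configuration used in the thesis.

```python
# Sketch: fitting BPR and WARP rankers on one-class feedback with lightfm.
import numpy as np
from scipy.sparse import coo_matrix
from lightfm import LightFM
from lightfm.evaluation import precision_at_k

# One-class "like" feedback as a sparse users x items interaction matrix.
rows, cols = np.array([0, 0, 1, 2]), np.array([0, 1, 1, 2])
interactions = coo_matrix((np.ones(4), (rows, cols)), shape=(3, 3))

for loss in ("bpr", "warp"):
    model = LightFM(loss=loss, no_components=16, random_state=42)
    model.fit(interactions, epochs=20)
    p_at_k = precision_at_k(model, interactions, k=2).mean()
    print(loss, round(float(p_at_k), 3))
```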

Post-processing recommendations

Post-processing is done after the recommender has predicted and ranked the items. Due to the differences in approaches and software packages chosen for further evaluation, some post-processing is done to keep the presentation uniform. Decisions regarding the post-processing have mainly been motivated by qualitative user feedback and by researching similar services.

Users' own catches should not appear in their discover view. Naturally, there is no serendipity in presenting a post that a user has posted themself. A discover view could be argued to be a kind of top list for the like-minded. If users sharing some characteristics were presented similar recommendations, it could be considered a feat for a user to have their posts featured in such a view, and a way to get recognition for their contributions. However, these items are not suitable subjects for personalized recommendations.

Items that users have already seen should typically not be recommended to the user. The data needed to facilitate this type of filtering is the impressions users have made on the different items. If a lot of users are exposed to a lot of items, some data structure, like a Bloom filter, might be required to handle the amount of data and still perform. Because of limitations in the underlying system, this was not considered for implementation. It is also a design choice from a business perspective. User interviews or tests may give further insight into what users may appreciate in the given domain. Related to the previous statement, items should not be recommended multiple times. The recommender system could trivially implement such a filter to make sure that items are not recommended multiple times for a user. However, considering the relatively short time frame set for recommendations in this domain, it is unlikely that an item would be recommended multiple times. To summarize, the problem can be funneled into:

1. Not showing items that users have seen.
2. Not showing items that users have seen in the discover view.
3. Not showing items that users have explicitly interacted with.

These different considerations for post-processing constitute a filter on the ordered list of recommendations output from the prediction and ranking step of the recommender system (3.1).

\{\text{recommendations}\} = \{\text{ranked predictions}\} \setminus \{\text{user owned}\} \setminus \{\text{user interacted}\}    (3.1)

Since the implementation is limited to recommendations generated offline, the recommendations will, for each user, be the same for a short time period. However, the post-processing described will be applied online at the time of request.
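A minimal sketch of this filter as applied to the ordered list of ranked predictions; the item identifiers are illustrative.

```python
# Sketch of the post-processing filter in Equation 3.1: a set difference applied
# to the ranked predictions while preserving their order.
def post_process(ranked_predictions, user_owned, user_interacted):
    """Drop items the user posted or already interacted with, keep ranking order."""
    excluded = set(user_owned) | set(user_interacted)
    return [item for item in ranked_predictions if item not in excluded]

ranked = ["c7", "c3", "c9", "c1"]
print(post_process(ranked, user_owned={"c3"}, user_interacted={"c1"}))  # ['c7', 'c9']
```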

The follower/followee paradigm also opens the question of whether the recommender system should present items from users that are already followed. Depending on how the application is structured, users may already have seen such items in their feed, or they may have missed them due to information overflow. Filtering out items from followed users was not implemented, since the hypothesis was that users may miss out on items from users they follow because of the non-chronological feed, and also because of mixed qualitative user feedback.

Infrastructure

There are many free and open-source software packages available for producing recommendations. Many of these are created by researchers, implementing a number of different algorithms and approaches, some are created by open source communities, and a few are commercial. Amongst the reviewed software packages, many of the packages used in research are not suitable for production environments for various reasons, examples being performance and interoperability (especially from the company's point of view). There are a few seemingly robust software packages more suited for production environments; however, these can lack flexibility of implementation. For the implementation of the evaluation framework covered in this report, ease of implementation and use is prioritized.

A number of packages were evaluated offline but were excluded due to complexity of implementation, dead projects, company decisions, etc. Two packages were chosen for further evaluation; together they implement the three different approaches to recommendations that were chosen for further evaluation. Here is a list of the evaluated software packages, with short descriptions (in alphabetical order):

Apache Mahout [3] A collection of machine learning algorithms that can run on various distributed computation software.

Apache Spark (MLlib) [23] An engine for large-scale data processing, also bundling machine learning algorithms, amongst them a couple of collaborative filtering algorithms.

LensKit [9] Toolkit authored by Michael Ekstrand during his doctoral studies, in an attempt to help researchers and developers of recommender systems reduce the experimentation needed when developing recommender systems.

lightfm [20] Software package for producing recommendations with an approach called "learning to rank".

mlpack [7] Machine learning library in C++ bundling various machine learning algorithms, among them also matrix factorization algorithms.


Figure 3.1: Diagram of the recommender system infrastructure. Arrows indicate dependency, not flow of information.

Scikit-learn [27] A Python toolkit bundling various algorithms in machine learning; it can be used to create recommender systems with its matrix factorization facilities, etc.

To produce recommendations, an isolated and simple-to-use component of the entire infrastructure of the application was developed, providing the existing application with facilities for experimenting with recommender systems. An interface was developed with redundancies, to be able to fall back to providing items even if the service were entirely offline.

Since online testing with real users is important for the evaluation, the service needs to be able to provide recommendations from different providers. Since these are designed differently, in programming language and overall architecture, they must be able to run uniformly against a shared interface. For simplicity, plain text was chosen as the interface, since virtually all programming languages can read and write files and/or standard input streams.


Figure 3.1 illustrates the recommender system infrastructure. The variable number of providers to be tested are individually and independently wrapped by the runner. The input for the individual provider runners is the user feedback, and the output consists of all the recommendations. This is an offline process that can be run continuously or several times per day, filling a data store. The store can then be queried by other applications in the underlying infrastructure for recommendations through a single internal network interface. The application represents the underlying system of which the recommender system is part. The store is a persistent database caching the recommendations for online querying.
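A hypothetical provider runner honoring this plain-text contract might look as follows; the exact line formats are assumptions, since the thesis does not specify them.

```python
# Sketch of a provider runner: user feedback is read line by line from standard
# input and recommendations are written line by line to standard output.
import sys
from collections import defaultdict

def run_provider(recommend, top_n=10):
    """recommend(user_id, liked_items) -> ranked list of item ids."""
    likes = defaultdict(list)
    for line in sys.stdin:                     # input: "<user_id> <item_id>" per like
        if not line.strip():
            continue
        user_id, item_id = line.split()
        likes[user_id].append(item_id)
    for user_id, items in likes.items():       # output: "<user_id> <item_1> ... <item_n>"
        recs = recommend(user_id, items)[:top_n]
        sys.stdout.write(f"{user_id} {' '.join(recs)}\n")

if __name__ == "__main__":
    # Trivial placeholder provider: recommend the globally most recent items.
    run_provider(lambda user, liked: ["c42", "c41", "c40"])
```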

The different providers are then A/B tested, with the clients querying for recommendations from a specified provider determined by the variation assigned to them in the test. The solution can be extended to conduct further experiments with different sets of providers.
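One way such a variation assignment could be made deterministic is by hashing the user identifier, so that a user always queries the same provider; the variant names below are examples, not the actual experiment configuration.

```python
# Sketch of deterministic assignment of users to A/B test variations (providers).
import hashlib

VARIANTS = ["latest_catches", "regsvd_sgd", "warp"]  # control + two example providers

def assign_variant(user_id: str, experiment: str = "discover_recs") -> str:
    digest = hashlib.sha256(f"{experiment}:{user_id}".encode()).hexdigest()
    return VARIANTS[int(digest, 16) % len(VARIANTS)]

print(assign_variant("user-1337"))
```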

3.3 Evaluation in stages

Offline

The software packages presented, and their provided algorithms, were tested offline with the datasets presented in the previous sections. The algorithms have different evaluation methods, so the different solutions are sometimes difficult to compare. The actual results are presented in the following Results chapter. The main metrics used as a foundation for deciding which approaches to continue evaluating were RMSE and P@K.

Online

The online evaluation was conducted as an A/B test, where the control, or baseline, of the test is the current feature in the application. The current feature, as previously described, presents the users with the most recent catches globally. The variations in the experiment are the different algorithms (from the different providers), presenting recommendations of items. Although the feature could benefit from changes in the user interface (more screen real estate, further explanations, etc.), the initial experiment changes little in terms of how items are presented to the user. This is to make sure that the recommendations are tested on their own merits, compared to the older, already established feature. If successful, the evaluation framework could then be used to evaluate different approaches that exclusively make recommendations.

Figure 3.2: Recommendations as presented on the two different platforms. (a) Android. (b) iOS.

As shown in Figure 3.2, the experiment changes only the title describing the feature, from "Latest catches" to "Catches you may like", and the items presented. The hypothesis for the experiment is that users presented with the personalized recommendations will interact more with the items than users in the control group. The main disadvantage of this approach is that users familiar with the application may not notice the difference from the older Discover view. If they have already dismissed the older "Latest catches" view, they may never visit the newer view and/or notice the difference. The experiment would not suffer this limitation if all variations (including the control) were recommendations. Then the view would be less sensitive to changes in user interface, etc. This should be considered while evaluating the recommender system further, since explaining the feature further could help build trust with the user and have them accept the feature.

Also apparent from Figure 3.2 is the difference in user interface between the two platforms. The Android application features metadata in the form of the number of likes and comments, etc. Compared to the latest catches, it is likely that items that were just caught (i.e. items featured in "Latest catches") have close to zero user feedback. This might affect the results, due to people's tendency to interact with more or less popular items.

The online evaluation measures success based on three different business metrics that relate to the aim of the thesis:

Like Likes on posts are a low-threshold indication of user engagement and also an indication of preference. Likes are an indication of positive feedback, whereas comments might be negative. Increasing this metric can also, as a side effect, help improve the recommender system, since likes are the primary feedback data.

Comment Compared to likes, comments are more qualitative, can range across all types of sentiment, and indicate a deeper engagement.

Follow Users following each other is an important aspect of the work. This is a strong indication of preference since a user, from a recommended post, commits to following another user.

To see significant changes in the conversion rates of the enumerated metrics, both the sample size and the magnitude of the change need to be considered in order to draw a statistically significant conclusion.
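For reference, a common way to reason about this (an assumption here, not a description of the exact test used) is a two-proportion z-test on the conversion rates, where |z| above roughly 1.28 corresponds to the 80 % level and above 1.96 to the 95 % level marked in the result tables. A minimal sketch:

import math

def z_two_proportions(conversions_a, n_a, conversions_b, n_b):
    """Z-score for the difference between two conversion rates."""
    p_a = conversions_a / n_a
    p_b = conversions_b / n_b
    p_pool = (conversions_a + conversions_b) / (n_a + n_b)
    se = math.sqrt(p_pool * (1 - p_pool) * (1 / n_a + 1 / n_b))
    return (p_b - p_a) / se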


4 Results

The presentation of results is divided into four sections: decisions made for the datasets, the evaluation framework, the offline evaluation with a number of approaches, and the online evaluation with a subset of those approaches.

4.1 Dataset

The dataset parameters (time frame, sparsity, etc.) ultimately chosen for further evaluation were determined by the offline evaluation results, described in the following section. Because of the slight difference in performance between the approaches for each dataset, considerations about the user experience were also taken into account. Users favour more recent posts over older ones in their feed. This was determined by studying similar services and by qualitative input from users. The solution also respects week cycles, so that feedback and items are sampled from every day of the week.

4.2 Evaluation framework

An evaluation framework was developed as described in chapter 3. It runs continuously offline, filling a data store. These recommendations can then be requested by the application and presented to the users. The framework also enables developers to experiment with recommender systems and continuously improve recommendations for users. This is done by extending the framework with providers, which can implement virtually any conceivable approach as long as they are able to communicate through the common text interface.
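As an illustration of what such a provider could look like, here is a minimal sketch of a provider obeying a simple text interface; the exact line format (tab-separated user, item, and weight on stdin, one JSON line of recommendations per user on stdout) is an assumption for the example, not the actual interface specification.

import sys
import json

def main():
    """Read feedback lines from stdin and write one recommendation list per user."""
    feedback = {}
    for line in sys.stdin:
        user_id, item_id, weight = line.strip().split("\t")
        feedback.setdefault(user_id, []).append((item_id, float(weight)))

    for user_id, items in feedback.items():
        # Toy ranking: most heavily weighted items first.
        ranked = [item for item, _ in sorted(items, key=lambda x: x[1], reverse=True)]
        print(json.dumps({"user_id": user_id, "items": ranked[:10]}))

if __name__ == "__main__":
    main()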

4.3 Offline evaluation

The offline evaluation was conducted on multiple datasets. As previously mentioned, these results had an impact on choosing which feedback dataset to consider when making recommendations. Some evaluation metrics are not applicable to all approaches due to fundamental differences in algorithms and implementation. For precision, measuring the precision at 10 (P@10) makes sense because that is the first page served to the application when users browse to the Discover page. The RMSE gives an indication of the numerical error for the model. The results for the offline evaluation are presented in Table 4.1.

               Sparse            Dense
               P@10     RMSE     P@10     RMSE
RegSVD (SGD)   0.0020   0.0660   0.026    0.0505
RegSVD (ALS)   0.0010   0.0304   0.012    0.0257
BPR            0.0201   -        0.0231   -
WARP           0.0225   -        0.0246   -

Table 4.1: Offline evaluation results for the different approaches tested on two different datasets. Hyphenated cells are missing measurements for the given approach. Cells with bold numbers illustrate the argued best result. For P@10, the number is the fraction of recommended items that appear in the test set. For RMSE, the number is the numerical error for the predicted ratings.

The results show that the WARP approach has the best precision, which is argued to be the most important metric since it describes user behaviour better than RMSE. The results for RegSVD showed low (and sometimes noisy) precision, while producing a small RMSE in comparison to benchmarked datasets. For these reasons, both RegSVD and WARP were chosen for further evaluation.
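As an illustration of how a WARP model of this kind can be trained and evaluated offline, the sketch below uses LightFM, one publicly available package implementing both WARP and BPR losses (whether this particular package was the one actually used is not asserted here); the interaction matrices are random placeholders.

import numpy as np
from scipy.sparse import coo_matrix
from lightfm import LightFM
from lightfm.evaluation import precision_at_k

# Placeholder implicit-feedback matrices (users x items); real data would come
# from the like/comment/follow feedback described earlier.
train = coo_matrix(np.random.randint(0, 2, size=(50, 100)))
test = coo_matrix(np.random.randint(0, 2, size=(50, 100)))

model = LightFM(loss="warp", no_components=30)
model.fit(train, epochs=30)

print("P@10:", precision_at_k(model, test, k=10).mean())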

4.4 Online evaluation

Results for the online evaluation are presented in table 4.2 as the percentage change from the control group for the important metrics. They are first presented for all users globally, then for users in the USA and Sweden. The results are split up by platform (Android and iOS), because the platforms (and their users) differ in several ways. The distribution of iOS and Android users varies between markets. For the USA, the distribution is close to fifty-fifty when comparing the number of Android and iPhone users, whereas in Brazil, Android makes up the absolute majority of the market. The applications also differ in overall user interface and user experience, even though the features are close to identical.

Android (Global)
               Comment   Follow   Like
RegSVD (SGD)   2.95*     0.70     0.63
WARP           2.33*     1.13     -0.15

iOS (Global)
               Comment   Follow   Like
RegSVD (SGD)   -0.67     -0.21    -0.24
WARP           0.13      1.28*    0.02

Table 4.2: Online evaluation results for users globally. The numbers in the cells represent the percentage (%) change in conversion rates measured. Cells with bold text indicate changes with a reasonable probability level (* 80 %, ** 95 %) to determine the results as statistically significant.

Results are determined to be statistically significant based on the change in conversion rates and the sample size. Without statistical significance, there is no way of telling whether the results reflect a true effect. The metrics are measured holistically for the entire application. This is good because it measures the impact of having the feature in the application at all. However, it may be difficult to show any significant results if the usage of the feature is low in comparison to the rest of the application, or if there are only granular differences between the different variations.

Android (USA)
               Comment   Follow   Like
RegSVD (SGD)   3.90**    0.89     -0.31
WARP           3.09*     1.66     0.32

iOS (USA)
               Comment   Follow   Like
RegSVD (SGD)   0.07      0.88     0.49
WARP           0.08      1.48*    0.38

Table 4.3: Online evaluation results for users in the USA.

Table 4.3 shows the results for users in the USA. Notice that the results correspond with the global results. Also keep in mind that the majority of users are from the USA.

Android (Sweden)
               Comment   Follow    Like
RegSVD (SGD)   20.59*    8.21      8.32
WARP           10.85     27.98*    25.73**

iOS (Sweden)
               Comment   Follow    Like
RegSVD (SGD)   -6.30     -4.19     -4.69
WARP           -0.27     -1.91     -9.77*

Table 4.4: Online evaluation results for users in Sweden.

Results for users in Sweden are presented in table 4.4. Notice that the conversion rates need a larger change to show statistical significance. This is due to the sample size being smaller than, for example, that of users in the USA.

All three tables presenting results for the online evaluation show similar trends of differences between the two platforms (Android and iOS).


5 Discussion

5.1 Results

The offline evaluation results were almost consistently worse for each approach used, in comparison to benchmarks presented in papers and scientific surveys for datasets with similar user feedback and sparsity. Different datasets describe different users and how they interact with the application from which the data is collected. This makes it hard to compare datasets by benchmarking to get a sense of reasonable results. Datasets used for comparisons in these benchmarks were presented in various research papers; the most common ones, as described before, handle five-star ratings, with movies as items.

The offline evaluation was conducted without a validation set, which means that the algorithm could be optimized to the point where it overfits the test dataset. Regularization should make it less likely to overfit; however, it is not possible to know for sure without validating on held-out data.
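A straightforward remedy, sketched below under the assumption that feedback events can simply be shuffled (a time-based split would be more faithful to how the system is used), is to hold out a validation set in addition to the test set, so that hyper-parameters are tuned without touching the test data.

import random

def train_validation_test_split(interactions, val_frac=0.1, test_frac=0.1, seed=42):
    """Split feedback into train/validation/test so hyper-parameters can be
    tuned on the validation set instead of the test set."""
    rng = random.Random(seed)
    shuffled = list(interactions)
    rng.shuffle(shuffled)
    n = len(shuffled)
    n_test = int(n * test_frac)
    n_val = int(n * val_frac)
    return (shuffled[n_test + n_val:],        # train
            shuffled[n_test:n_test + n_val],  # validation
            shuffled[:n_test])                # test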

The decision to measure the online evaluation metrics holistically for the entire application was made both because of implementation limitations and because of the difference between baseline and variations. The experimentation was not only about comparing different algorithms for making recommendations, but also about having recommendations instead of listing the latest catches globally. Since only a small part of the measured metrics is recorded in the Discover view, it is difficult to know for sure whether changes to the feature caused an increase or decrease in the recorded metrics, or to what extent the changes were noise.

It is difficult at this stage to draw conclusions about the differing results per country and platform, since the results are measured holistically across the application. For Sweden especially, the results are more likely to be due to chance because of the comparatively small sample size. One significant difference in the user interface between the platforms may be an interesting piece of knowledge. On Android, the users are presented with meta data for the catches (likes and comments, etc.). This suggests that users are more likely to interact with an item that other users have already interacted with. For "Latest catches", it is likely that the comment and like counters are zero. The catch may therefore be considered worse from a user perspective, and thus the user may feel less inclined, and be less likely, to give feedback.



5.2 Method

It became apparent while reading the literature that, in order to measure success, the recommender system needs to be evaluated thoroughly online with real users. This evaluation takes time, as data needs to be collected about the usage. A part of the thesis work was spent developing the framework and evaluating the different recommender systems. There have been previous attempts to develop frameworks for this purpose (e.g. [9]), as described in chapter 3.2. However, these were not mature enough for use in this thesis.

No novel approach for the actual model or algorithm for making recommendations is presented. The algorithms were chosen from existing software packages. This enabled focusing on the development of a framework for experimenting with and evaluating recommender systems. It is hard to definitively say whether, or rather by how much, a better developed model for making recommendations would perform better. However, since the framework now exists and is easy to extend, further efforts can be put into experimentation. Many software packages and approaches were tested to make sure that they fit well with Fishbrain. Researching, trying out different solutions, implementing, and evaluating recommender systems are all important and time-consuming activities.

A decision had to be made whether to roll out the recommendations as a brand new feature or to develop them as an alternative to the existing feature. Changes to the user interface need to be carefully considered to make sure that the experiments are set up in a way that the data can be trusted and interpreted. The decision to roll out the feature in a careful way, keeping the old feature with latest catches as a baseline, was made since it aligns better with how the company conducts experiments (with A/B tests) and develops features in the application in general.

For the online evaluation, focus was put on evaluating the metrics while having users interact with the system, rather than on conducting user interviews. Some qualitative feedback was collected from users receiving recommendations; however, that was not the focus of the thesis.

The A/B test was rolled out evenly distributed across all users. Another approach would have been to conduct the test only with users new to the application. That way, the users would have no preconceived notion about the application, making the test fairer. However, this approach would have drastically reduced the sample sizes.

5.3 The work in a wider context

The term filter bubble was coined in 2011 by Eli Pariser in the eponymous book [25]. It is the phenomenon whereby personalized searches, recommendations, and other website algorithms guess and show users what they want to see, thus effectively not exposing users to viewpoints different from their own. Many companies, in particular Facebook and Google, have been heavily criticized for enabling this phenomenon in searches and recommendations, especially in the aftermath of political elections and discourse during 2016 and 2017.

Nguyen et al. published a paper in 2014 [24] exploring the filter bubble phenomenon, and its longitudinal impacts, in recommender systems for movies. They found evidence that recommender systems do recommend a slightly narrowing set of movies, but also that users consuming recommended movies experience lessened narrowing effects.

References
