
Personal news video recommendations based on implicit feedback

An evaluation of different recommender systems with sparse data

MORGAN ANDERSSON

KTH ROYAL INSTITUTE OF TECHNOLOGY


Personal news video recommendations based on implicit feedback

An evaluation of different recommender systems with sparse data

MORGAN ANDERSSON

Master in Computer Science
Date: August 31, 2018
Supervisor: Håkan Lane (KTH), Mats Ekholm (Newstag)
Examiner: Olov Engwall
Swedish title: Personliga rekommendationer av nyhetsvideor baserade på implicita data


Abstract

The amount of video content online will nearly triple in quantity by 2021 compared to 2016. The implementation of sophisticated filters is of paramount importance to manage this information flow. The research question of this thesis asks to what extent it is possible to generate personal recommendations based on the data that news videos imply. The objective is to evaluate how different recommender systems compare to a completely random baseline and to each other, and how they are received by users in a test environment.

This study was performed during the spring of 2018 and explores four different algorithms. These recommender systems include a content-based model, a collaborative filter, a hybrid model and a popularity model as a baseline. The dataset originates from a news media startup called Newstag, which provides video news on a global scale. The data is sparse and includes implicit feedback only.

Three offline experiments and a user test were performed. The metric that guided the algorithms' offline performance was their recall at 5 and 10, since the top of the list of recommended items is of most interest. A comparison was made between different amounts of metadata included during training. Another test explored each algorithm's performance as the density of the data increased. In the user test, a mean opinion score was calculated based on the quality of the recommendations that each of the algorithms generated for the test subjects. The user test also included randomly sampled news videos as a baseline for comparison.

The results indicate that for this specific setting and dataset, the content-based recommender system performed best in both recall at five and ten, as well as in the user test. All of the algorithms outperformed the random baseline.

Keywords: Information Filtering, Recommender Systems, News Videos,


Sammanfattning (Swedish abstract)

The amount of video available on the internet is expected to triple by 2021 compared to 2016. This implies a need for sophisticated filters to manage the information flow. This thesis aims to answer to what extent it is possible to generate personal recommendations based on the data that news videos imply. The purpose is to evaluate and compare different recommender systems and how they hold up in a user test.

The study was carried out during the spring of 2018 and evaluates four different algorithms. These recommender systems comprise content-based and collaborative-filtering techniques, a hybrid model, and a popularity model used as a baseline. The dataset used is sparse and has implicit attributes only. Three experiments were performed, as well as a user test.

The metric for the algorithms' performance was recall at 5 and recall at 10, i.e. measuring how well the algorithms manage to generate valuable recommendations within a top-five or top-ten list of video clips. This is because it is of interest to have the most relevant videos at the top of the result list. A comparison was made between different amounts of metadata included during training. Another test explored how the algorithms perform as the dataset becomes less sparse. The user test used an evaluation method called mean opinion score, calculated per algorithm by having test users rate each recommendation based on how interesting the video was to them. The user test also included randomly sampled videos to serve as a baseline for comparison.

The results indicate, for this dataset, that the content-based algorithm performs best with regard to both recall at 5 and 10 and the total score in the user test. All algorithms performed better than chance.


List of Figures

2.1 An overview of different recommender systems
2.2 Illustration of how content-based filtering works. Based on a previously seen item, a similar news item is recommended
2.3 This figure illustrates how different users and items are represented as vectors in a vector space model, where the items could be news articles and the different axes are news categories
2.4 Illustration of how the collaborative filter works. Two similar users will be recommended items that either one has not yet seen
2.5 Visualization of matrix factorization, where R is the rating matrix (also called the interaction matrix), P is the user-feature matrix and Q is the item-feature matrix. The highlighted element $r_{ui}$ can be calculated by $p_u q_i^T$
3.1 Illustration of the two phases in the user test
4.1 Recall@k for different input when constructing the vectorization model for the content-based recommender system
4.2 Performance of recall@10 for each algorithm
4.3 Performance of recall@5 for each algorithm
4.4 Performance on different density levels
4.5 Results of the mean opinion score user test for each algorithm


List of Tables

2.1 Absolute Category Rating
3.1 Dataset: Item metadata
3.2 Dataset: User interactions
3.3 CB - Varied input for the TF-IDF vectorization
3.4 Different levels of density
4.1 Algorithm improvement relative to random, in the MOS user test
A.1 Video clip recommendations from each algorithm and the rating from the test user


Contents

1 Introduction
1.1 Problem definition
1.2 Research question and aim
1.3 Limitations
1.4 Sustainability, ethics and societal aspects

2 Background
2.1 What is a recommender system?
2.2 Content-based filtering
2.2.1 TF-IDF and similarity measurement
2.2.2 Pros and cons
2.3 Collaborative filtering
2.3.1 Memory-based
2.3.2 Model-based
2.3.3 Pros and cons
2.4 Hybrid filtering
2.5 Challenges for news recommendations
2.5.1 The cold-start problem
2.5.2 Time-sensitive content
2.5.3 Shifting short-term trends
2.5.4 Global trends
2.5.5 Scalability
2.5.6 Modeling user-preference profiles
2.5.7 Gray sheep
2.5.8 Serendipity
2.6 Data gathering
2.7 Evaluation
2.7.1 Online evaluation
2.7.2 Offline evaluation
2.7.3 Mean opinion score
2.8 Related work
2.8.1 Analysis of the related work

3 Method
3.1 Data
3.1.1 Data collection
3.1.2 Data analysis
3.1.3 Data structure
3.1.4 Data sparsity
3.1.5 Train and test data
3.1.6 Hyper-parameter optimization
3.2 Hardware and software
3.3 Algorithms
3.3.1 Baseline: popularity
3.3.2 Content-based
3.3.3 Collaborative filtering
3.3.4 Hybrid
3.4 Evaluation
3.4.1 Recall at k
3.4.2 Different levels of density
3.4.3 Mean opinion score

4 Results
4.1 Offline experiments
4.2 User test

5 Discussion and conclusion
5.1 Offline experiments
5.2 User test
5.3 News value
5.4 Reflection and future research
5.5 Good news for the news

Bibliography


Introduction

Online video content is growing rapidly: according to Cisco's annual Visual Networking Index (VNI) forecast, the amount of overall IP traffic will increase three times by 2021 compared to 2016 [9]. Video will account for 82% of that transferred data, and the need for sophisticated filters to manage it is paramount. Recommender systems (RS) have become a scientific field of their own. State-of-the-art approaches to giving good recommendations based on user preferences and content currently rely on machine learning. There are several algorithms and ensemble methods to deal with personal recommendations, but the news media industry faces specific challenges compared to other industries. For example, few users register their credentials, and the content has a short lifespan. That means that the recommendation more or less has to work with the data that can be gathered from a first-time user during a temporary session, i.e. the time the user stays on the website. It is therefore rare to be able to infer a user's preference from historical data, at the same time as the content to be recommended is frequently refreshed. These aspects make the data that the algorithms should train on sparse.

This thesis evaluates different recommender systems for news and compares their performance in an offline setting as well as in a user test. The research has been done at a news media startup called Newstag, which provides a news platform offering video news in different languages globally.


1.1 Problem definition

The problem at hand is to evaluate different recommender systems on how well they can provide personal recommendations. This is done in two steps. First, the algorithms are trained, optimized and evaluated on a dataset provided by Newstag. The second step is to test the trained models' abilities to provide personal recommendations to a group of test subjects. In order to take into account the specific challenges that the news media domain implies, different state-of-the-art and baseline algorithms are compared.

1.2 Research question and aim

The specific research question in this thesis follows:

• To what extent is it possible to generate relevant recommendations based on the data that news videos imply?

To answer this, the following sub questions will be evaluated:

• Do the selected recommender systems perform better than random?

• How do they compare to a simple popularity baseline model?

The aim is to evaluate how different recommender systems compare to each other and how they are received by users in a test environment.

1.3 Limitations

Even though the content that Newstag supplies is video only, no image recognition will be involved in the evaluation of relevant recommender systems. This research will use the metadata and description text that come along with each video, which in turn makes this report valuable to those who deal with news articles as well.

The algorithms under evaluation will not be implemented into production at Newstag and will therefore not have the possibility to be evaluated on real users online on the platform. Other evaluation methods will instead be used, and they are described in section 2.7.


Another implication of the fact that these tests will be done on offline data is that time is no longer as relevant. This is because the users will likely have seen some of the content, which would generate a bias; therefore the experiments are focused on finding interesting and personal news regardless of time.

Due to the nature and size of the dataset that is provided by Newstag, deep learning algorithms will not be under evaluation. Deep structures are of interest for recommending news; however, they are not relevant at this stage for Newstag since they require a much broader set of features and data points to be of value.

1.4 Sustainability, ethics and societal aspects

One assumption for any recommender system to work is that there exists a certain level of correlation between users and items. For example, a person who is interested in food documentaries or culinary shows would probably prefer to be recommended another cooking show rather than a horror movie. On the other hand, one problem with this assumption and its integration into today's recommender systems is that it creates an echo chamber where the user becomes isolated from other content that potentially could have been of interest. Perhaps it is not a problem that the above-mentioned person missed out on a horror movie; however, when it is applied to news and other important sources of information, it can be. It becomes a matter of social interest when the automated information filtering in society isolates people from different perspectives and important general knowledge [34]. There is currently an ongoing debate regarding recommender systems and their amplification of filter bubbles and echo chambers in society [17].

Another aspect of the research and implementation of recommender systems is the vast amount of personal data that is handled. This needs to be collected and stored in a secure manner so as not to risk users' integrity. A new law called the General Data Protection Regulation (GDPR) was implemented on the 25th of May 2018 in the European Union, and it can result in large fines if personal data is not handled correctly [41].


Background

In this chapter, the information necessary to understand the theory behind the implemented algorithms is shared. Specific challenges for recommendations in the news domain are presented, followed by the different kinds of data that can be used to train recommender models. Relevant evaluation methods are explained, and towards the end of the chapter there is a walkthrough of related work.

2.1 What is a recommender system?

A recommender system (RS) is a subclass of information filtering systems that aims to predict a preference between a user and an item based on historical data, with the goal of suggesting relevant items in the future. One exception would be knowledge-based recommender systems, where suggestions are based on specified requirements from the user rather than historical data [3].

An RS can use various sources of data to learn these preferences. It can leverage explicit input such as active ratings or "likes", and implicit feedback such as items purchased or videos viewed. The explicit data may provide more insight into a user's preference; however, the collection of implicit data is usually more common since it requires no effort from the user to generate it. The more data that can be collected about a user and their preference for items, the more robust the prediction can become [3].

An RS that generates these predictions can be modeled in several ways. One approach is to find groups of users that share a similar "taste profile", i.e. they are interested in similar items. Once these clusters are identified, the model then finds items that a portion of the users within a cluster like and recommends these to other users in the same group who have not yet been exposed to them. This method belongs to a collection of algorithms called neighboring models, which in turn belong to a greater class of models called collaborative filtering (CF) [3].

Another method is to model the recommender system based on the properties of the item itself, rather than the collective preference of users. This means that attributes, normally referred to as features, are collected about the items and matched to the users' preferences. This strategy is called content-based (CB) recommendation and is built on the assumption that user preferences can be extracted from the nature of items.

The two aforementioned approaches are the most common ones and can also be combined into so-called hybrid models to leverage the strengths of both worlds. See figure 2.1 for an overview of the different recommender systems [3].

These algorithms can become more advanced by adding contextual data such as demographics and other social information, geographical data and temporal information, which is something that will be touched upon in this thesis [3].

2.2 Content-based filtering

Content-based recommender systems are domain-dependent, and the focus lies on the analysis of the item and its features to generate suggestions. The algorithm will look at an item that a user has shown interest in previously, either through an explicit rating or perhaps a read article. From there, the system will look at all the other articles and suggest those that are similar to the user's first preference [5]. The key here is to be able to compare items such as articles or metadata from the video in a meaningful way. Content-based filtering can use different types of models to find this similarity. One of them is the vector space model based on Term Frequency-Inverse Document Frequency (TF-IDF); alternatives are probabilistic methods such as the Naive Bayes classifier [18], decision trees [15] or neural networks [4] that model the relationships between different items in a larger collection. These techniques provide recommendations by learning the underlying model through either statistical analysis or machine learning.

Figure 2.1: An overview of different recommender systems.

2.2.1 TF-IDF and similarity measurement

Figure 2.2: Illustration of how content-based filtering works. Based on a previously seen item, a similar news item is recommended.

The term frequency-inverse document frequency (TF-IDF) is a statistical technique for allocating different weights of significance to the words in a document that belongs to a larger collection of items. The objective is to find the words that best represent each of the documents or news items: first by looking at the frequency of each term, and then at the inverse document frequency, to make sure that words occurring in several documents across the collection are considered less important [45]. The formula follows:

$$w_{i,j} = tf_{i,j} \times \log\left(\frac{N}{df_i}\right) \qquad (2.1)$$

where $w_{i,j}$ is the weight for word $i$ in document $j$, $N$ is the total number of documents in the collection, $tf_{i,j}$ is the term frequency of word $i$ in document $j$, and $df_i$ is the number of documents in the collection that contain word $i$.

Once the most significant words for each document or item are found, it is possible to create a distinct representation of it in a vector space. This enables the possibility of finding similar items through a similarity measurement, because vectors that are close to each other should be similar in nature. Users can also be represented as vectors in the same vector space, which is done by looking at past interactions with items, from which the user receives a combined feature vector. This vector space model allows queries for items that are similar to users, see figure 2.3, which in essence is the nature of content-based filtering.

A common similarity measurement to determine the distance between the TF-IDF vectors is the cosine similarity, which can be seen in equation 2.2:


Figure 2.3: This figure illustrates how different users and items are represented as vectors in a vector space model, where the items could be news articles and the different axes are news categories.

$$\text{sim}_{COS}(\vec{x}, \vec{y}) = \frac{\vec{x} \cdot \vec{y}}{|\vec{x}||\vec{y}|} = \frac{\sum_{i=1}^{n} x_i y_i}{\sqrt{\sum_{i=1}^{n} x_i^2} \sqrt{\sum_{i=1}^{n} y_i^2}} \qquad (2.2)$$

where the distance between vector $\vec{x}$ and vector $\vec{y}$ is calculated [35].
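To make this concrete, here is a minimal sketch of a TF-IDF similarity query in scikit-learn, one of the libraries used in this thesis (see section 3.2). The documents are invented, and scikit-learn's default IDF uses smoothing, so its weights differ slightly from equation 2.1:

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

# Invented news-clip texts (e.g. headline + description).
documents = [
    "royal wedding crowds gather in windsor",
    "prince harry and meghan markle marry in windsor",
    "stock markets fall amid trade war fears",
]

vectorizer = TfidfVectorizer()               # TF-IDF weights, cf. eq. 2.1
tfidf = vectorizer.fit_transform(documents)  # one vector per document

# Pairwise cosine similarity (eq. 2.2) between all item vectors.
similarities = cosine_similarity(tfidf)

# Items most similar to item 0, excluding item 0 itself.
ranking = similarities[0].argsort()[::-1][1:]
print(ranking, similarities[0][ranking])
```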

2.2.2 Pros and cons

The content-based filtering technique is especially good for recommending text-based items such as publications, web pages and news. The CB recommendations for a specific user are not affected by other users, which reduces the need for a large user base and many interactions.


It therefore overcomes the challenge of recommending new items that the collaborative filter has. Also, if the profile preferences of a user change, the CB system is capable of adjusting in a short amount of time. This also means that users can receive recommendations without having to worry about giving away personal information, and thus it complies well with the European GDPR law.

There are some drawbacks, however [1]. The CB system is dependent on rich item metadata to be able to create good document representations in the vector space. Another aspect is that it will suggest items that are similar and limited to the user profile, which can lead to over-specialization (see 2.5.8) and a lack of exploration [44].

2.3 Collaborative filtering

Collaborative filtering is a domain-independent collection of recommender systems that leverages the interests of several users in a "collaborative way". It is based on the assumption that if a user A has the same opinion as person B on a specific matter, then A is more likely to share B's opinion on another matter than a randomly chosen person is. The way this technique works is by collecting preferences from users about existing items in order to create a rating matrix, where the rows represent users and the columns represent the items. In this thesis those items will consist of news videos. Once this matrix is constructed, the algorithm groups people with similar ratings on similar items to locate so-called neighborhoods of users that share similar preference profiles, see figure 2.4. Unseen items are then recommended to users within their respective group [19]. The collaborative filtering method is usually divided into two different categories: memory-based and model-based [5].

2.3.1 Memory-based

The memory-based collaborative filter is defined by the importance of historical ratings from users to define similar user groups. This can in turn be done by either an item-based or user-based approach. The similarity measures are usually calculated with the Pearson correlation coefficient or the cosine similarity [23].

(19)

Figure 2.4: Illustration of how the collaborative filter works. Two similar users will be recommended items that either one has not yet seen.

2.3.2 Model-based

This technique tries to learn a model based on users' previous ratings through machine learning. Since they use pre-computed models, it is possible to quickly recommend a set of items. Another advantage of this kind of collaborative filter is that it can predict user ratings for certain items, which facilitates the recommendation of new items that have not received any ratings yet [3]. Examples of these model-based algorithms are Bayesian networks, clustering models, latent semantic models such as singular-value decomposition, probabilistic latent semantic analysis, multiple multiplicative factor, latent Dirichlet allocation and Markov decision process based models [40]. The matrix factorization technique, which belongs to the family of latent semantic models, was introduced during the Netflix recommendation competition in 2009 [25]. It is currently one of the most common collaborative-filtering techniques due to its superior performance, and it is also interesting to explore as it can include contextual features [26].

Matrix factorization

The matrix factorization algorithm maps both users and items to a joint latent factor space of dimensionality $f$, where user-item interactions are modeled as inner products in that space [26]. First of all, each user $u$ is associated with a vector $p_u \in \mathbb{R}^f$ and each item $i$ with a vector $q_i \in \mathbb{R}^f$. Given an item $i$, the elements of $q_i$ measure the extent to which the item possesses those factors. For a given user $u$, the elements of $p_u$ represent the extent of interest the user has in items that are high on the corresponding factors. By taking the dot product between these two vectors, it is possible to estimate user $u$'s interest in item $i$, i.e. calculate $u$'s rating of item $i$, denoted $\hat{r}_{ui}$, see formula 2.3:

$$\hat{r}_{ui} = q_i^T p_u \qquad (2.3)$$

See figure 2.5 for an overview of how the matrix factorization is executed. In order for the model to learn the feature vectors $p_u$ and $q_i$, different optimization methods can be used to minimize the error between the predicted ratings and the real ratings that are known beforehand, with respect to a loss function. One common loss function is the regularized squared error on the set of known ratings, see equation 2.4:

$$\min_{q^*, p^*} \sum_{(u,i) \in \kappa} (r_{ui} - q_i^T p_u)^2 + \lambda \left( \|q_i\|^2 + \|p_u\|^2 \right) \qquad (2.4)$$

where $\kappa$ is the set of $(u, i)$ pairs for which the rating $r_{ui}$ is known beforehand. The regularization terms are present to ensure that the model does not over-fit, since it only optimizes on the observed ratings, i.e. not the whole truth. The constant $\lambda$ is a hyper-parameter used to control the amount of regularization, and its optimal value can be determined by cross-validation. There are two common ways to approach the minimization of the loss function: stochastic gradient descent (SGD) or alternating least squares (ALS) [26].
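To illustrate the SGD option, the following is a minimal sketch of matrix factorization trained on the loss in equation 2.4. This is not the implementation used in the thesis (which relies on an SVD-based model, see section 3.3.3), and the hyper-parameter values and toy ratings are assumptions:

```python
import numpy as np

def factorize(ratings, f=10, lr=0.01, reg=0.1, epochs=20, seed=0):
    """ratings: list of (user_index, item_index, rating) triples."""
    rng = np.random.default_rng(seed)
    n_users = max(u for u, _, _ in ratings) + 1
    n_items = max(i for _, i, _ in ratings) + 1
    P = rng.normal(scale=0.1, size=(n_users, f))  # user-feature matrix
    Q = rng.normal(scale=0.1, size=(n_items, f))  # item-feature matrix
    for _ in range(epochs):
        for u, i, r in ratings:
            err = r - Q[i] @ P[u]                 # error in the eq. 2.3 prediction
            pu = P[u].copy()                      # keep old p_u for Q's update
            P[u] += lr * (err * Q[i] - reg * pu)  # gradient steps on eq. 2.4
            Q[i] += lr * (err * pu - reg * Q[i])
    return P, Q

# Binary implicit "ratings": user 0 watched items 0 and 1, user 1 items 1 and 2.
P, Q = factorize([(0, 0, 1), (0, 1, 1), (1, 1, 1), (1, 2, 1)])
print(Q[2] @ P[0])  # predicted preference of user 0 for item 2
```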


Figure 2.5: Visualization of matrix factorization, where R is the rating matrix (also called the interaction matrix), P is the user-feature matrix and Q is the item-feature matrix. The highlighted element $r_{ui}$ can be calculated by $p_u q_i^T$.

2.3.3 Pros and cons

Collaborative filtering is good for recommending items that are hard to represent, such as unstructured data like video or audio. It also has an advantage over CB in recommending serendipitous content, due to the fact that users help each other explore new material without having the specific items in their user preference profiles. However, these techniques are prone to the cold-start problem (see 2.5.1), as new items are hard to recommend due to the lack of ratings [39].

2.4 Hybrid filtering

In the hybrid recommendation technique, content-based and collaborative filtering algorithms are combined in various ways to leverage their respective strengths and at the same time mitigate some of their weaknesses [2]. There are several ways to combine these algorithms: implementing the algorithms separately and combining the results at the end, utilizing a content-based method with a collaborative approach or vice versa, or creating a unified recommendation system that brings together both approaches.

2.5 Challenges for news recommendations

The domain of news recommendations is different from other areas due to its nature of non-static content and rapid change of trends. To create relevant suggestions of news, the system needs to have access to information about the user, the item and the context affecting them [33]. When constructing a news recommender system the following challenges are present.

2.5.1 The cold-start problem

This is a general problem in information filtering and essentially describes the situation where a system can't draw any conclusions about preferences between users and items due to a lack of data [38]. It is specifically present in collaborative filtering, which is built on ratings given by users to items. In news, this becomes evident due to the large and constantly updated stream of content, where users may only interact with a small fraction of the items. This leads to a sparse dataset and can cause a decrease in the performance of the system.

2.5.2 Time-sensitive content

News items have short lifespans, since people want to be informed about the latest events regarding a subject. This means that the system needs to prioritize new content over old. On the other hand, some users might not be up to date with the story, so older, connected news might still be relevant [29].

2.5.3 Shifting short-term trends

It is a challenge to predict the future preferences of users when short-term trends in the news affect what is interesting to them. The rate at which users change their preferences is much slower in other domains, such as movie, music or book recommendations, compared to news [30]. Another aspect that can blur the algorithm's assumptions about user preference is that certain news might be read or viewed only because it is important, and not due to a specific interest.


2.5.4 Global trends

Another trend factor that affects what kind of news to recommend is contextual features that can be considered global trends for the users. For example, a study showed that people prefer certain types of news categories based on the time of day [31]. In the morning, politics and breaking news were consumed to a larger extent, whereas in the evening "leisure" related categories such as entertainment and technology were preferred. Other patterns could be extracted, where for example mobile devices were mostly used during rush hours and desktop devices during work hours and weekends. Another finding was that users were more prone to read articles about the same subject during late evenings compared to the rest of the day, which could indicate that it is better to recommend a variety of news during the day and allow for more related content at night. These findings could help improve recommendation systems, since a user's behavior and intent differ depending on the context.

2.5.5 Scalability

In order to serve relevant suggestions, recommender systems usually operate on large datasets and handle many user requests simultaneously. The system needs to be both robust and effective to provide user recommendations with fast response times. As a guideline, an upper limit of 200 ms on the response time was defined by the CLEF NewsREEL organization for any recommender system in live production [7]. In the dynamic environment of news, it is important that the algorithm has real-time processing capabilities [29].

2.5.6 Modeling user-preference profiles

It is common that few users log in or register when reading or watching news on different online news platforms. This leads to the problem of learning a user's preferences over time. One way to accumulate information about users is by tracking cookies or IP addresses; however, this is not completely reliable since IP addresses change, users may browse anonymously, cookies differ between devices, etc. With that said, a recommender system should still model user profiles to the best of its capabilities. There are different ways to do this, where one is to log what items the user has interacted with and infer a preference from this [36].

2.5.7 Gray sheep

The gray-sheep problem concerns the users who cannot be recommended any useful content due to the fact that they are not similar to any other users in the system. This problem is present in collaborative filters where data is sparse, which indeed is a common case in the context of news. It usually becomes less problematic as the number of users grows [40].

2.5.8 Serendipity

Over-specialization in news recommendations is generally not good, since studies show that users are indeed interested in news categories that they normally do not consider as their preference [28] [21]. The news domain is crowded with different items and content that cover the same topic. When a news story happens, different agencies and news providers describe the same event in slightly different terms, which can make the recommender system suggest a collection of content that all covers the same story. This usually leads to decreased performance of the recommender system, and it is a problem for both collaborative filtering [6] and content-based methods [21].

2.6 Data gathering

The data that can be collected and used for the recommender system can be divided into two groups: explicit and implicit feedback. Explicit data consists of feedback that the user has actively provided to the system; examples of this could be "like" buttons or a prompted five-star rating system. This kind of data is, however, harder to get because it demands effort from the user. It is more common to have implicit feedback, from which the system can infer ratings, such as the number of views of a video or articles read.


2.7 Evaluation

In order to understand whether a recommender system is doing well, certain metrics need to be evaluated. There are generally two ways to approach this: either by using online or offline evaluation methods. The former is most common in the industry, where recommender systems are live in production, and the latter is most common in the academic world, where research is conducted on a static dataset and evaluated with statistical methods.

2.7.1 Online evaluation

A/B tests could be done where users are divided into two different groups and exposed to different recommender models. A comparison would be made of different key performance indicators (KPIs), such as whether the session duration increased for either group or whether certain recommendations received higher click-through rates.

2.7.2 Offline evaluation

Another alternative is offline evaluation through statistical measures, where for example accuracy is calculated. In this instance, accuracy is defined as how well the algorithm manages to recommend videos that the user has previously seen. For example, if a user has watched videos x and y, the algorithm would be trained only on video x, and the algorithm would achieve good accuracy if it manages to recommend video y. It should be noted that this metric, and indeed all offline metrics, might not be a fair measurement, since the dataset is static and the videos that a user has consumed might be the result of a small selection of videos that were present at the time. It could be argued that the user would have been happy with a recommendation of video z; however, since there is only a true label for video y, it is not possible to measure that.

The following evaluation methods are common in practice:

Mean absolute error

The mean absolute error (MAE) metric is commonly used, and it measures the deviation between a user's predicted preference score for an item and the actual score (the ground truth). This ground truth comes from explicit feedback, where the actual score could be a numerical value between 1-5 if there were a rating system in place that gathered preference ratings. The formula for calculating the MAE is [10]:

$$MAE = \frac{1}{N} \sum_{u,i} \left| p_{u,i} - r_{u,i} \right| \qquad (2.5)$$

where $p_{u,i}$ is the predicted rating for user $u$ on item $i$, $r_{u,i}$ is the actual rating and $N$ is the total number of ratings in the item set. The lower the MAE, the better the recommender system is at successfully predicting user ratings.

Root mean square error

The root mean square error is another common metric similar to MAE, but it focuses on the larger absolute errors of the system; the lower this value is, the better the accuracy. RMSE is computed as follows:

$$RMSE = \sqrt{\frac{1}{N} \sum_{u,i} \left( p_{u,i} - r_{u,i} \right)^2} \qquad (2.6)$$

where the variables have the same meaning as in MAE.
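As a small worked example, here is a sketch of both metrics in NumPy (one of the libraries listed in section 3.2); the rating arrays are invented:

```python
import numpy as np

predicted = np.array([4.2, 3.1, 5.0, 2.4])  # p_ui: predicted ratings
actual    = np.array([4.0, 3.5, 4.0, 2.0])  # r_ui: ground-truth ratings

mae = np.abs(predicted - actual).mean()             # eq. 2.5
rmse = np.sqrt(((predicted - actual) ** 2).mean())  # eq. 2.6
print(mae, rmse)
```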

Precision and recall

This thesis does not deal with explicit preference ratings, but instead uses implicit features. This calls for the ability to evaluate binary results in the form of a list of items where each item is either relevant or not. In the context of recommender systems, where a set of items is to be proposed to a user, it is important that the top N items are as relevant as possible and that the models are evaluated with this objective in mind.

Precision, recall and the F-measure are more fitting for this purpose. Precision measures the fraction of relevant items among all recommendations made to the user. Recall measures the fraction of relevant items recommended out of the total number of relevant items. The formulas to calculate precision and recall are as follows [14]:

$$\text{Precision} = \frac{\text{Correctly recommended items}}{\text{Total recommended items}} \qquad (2.7)$$

$$\text{Recall} = \frac{\text{Correctly recommended items}}{\text{Total relevant items}} \qquad (2.8)$$

The F-measure is a combination of the Precision (P) and Recall (R) that makes it easier to compare different models in one single metric [37]. The F-measure is calculated like this:

$$\text{F-measure} = \frac{2PR}{P + R} \qquad (2.9)$$

More specifically, this thesis will utilize recall at k, often denoted recall@k, where k is the defined threshold of interest for how many items should be recommended.

2.7.3 Mean opinion score

To complement the offline metrics, a qualitative test was performed where the final recommender models were tested by real subjects. The mean opinion score (MOS) is normally used to evaluate the quality of video or sound; in this instance it is used to compare the recommender systems based on the quality of their recommendations.

MOS is performed by having the test subjects rate videos on a scale, normally with the Absolute Category Rating (ACR), which maps ratings between 1-5 to Bad-Excellent, see table 2.1. The MOS formula is the following:

$$MOS = \frac{\sum_{n=1}^{N} R_n}{N} \qquad (2.10)$$

where $N$ is the number of test subjects and $R_n$ is the rating by subject $n$.
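Equation 2.10 reduces to an arithmetic mean over the collected ratings. A minimal sketch with invented ACR scores, also computing the median alternative discussed below:

```python
import numpy as np

ratings = np.array([5, 4, 4, 3, 5, 2, 4])  # invented ACR ratings (1-5)
mos = ratings.mean()                       # eq. 2.10
median = np.median(ratings)                # robust alternative, cf. [22]
print(mos, median)
```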

There exists some criticism of how well this metric captures quality, due to certain mathematical properties and biases. Some argue that the median value should be used instead of the arithmetic mean, to capture the interval of individual ratings [22]. Another aspect is the bias of the different categorical values in the rating scales, as there may be a greater gap between Excellent and Good than between Good and Fair.

Important factors to consider when setting up this test include:

• Number of participants: according to the recommendation from the International Telecommunication Union (ITU), a minimum of 15 test subjects should be included.

• Composition of the test group: it should be representative of the target audience for the application. Demographics and education of the subjects are examples of factors that can affect the MOS [24].

• Expert versus non-expert subjects: usually experts agree on the quality of an item; however, they are also more critical and could potentially skew the ratings towards the lower end of the scale [27].

Table 2.1: Absolute Category Rating

Rating  Label
5       Excellent
4       Good
3       Fair
2       Poor
1       Bad

2.8 Related work

A lot of interesting research has been conducted in recent years on different recommender systems, and lately specifically regarding context-aware news recommendation. In the following sections a selection of relevant research will be presented, in addition to what has been mentioned so far. This section will outline this work, its success factors and how it relates to this thesis.

Surveys of recommender systems. Yang et al. [43] evaluate different algorithms in the collaborative filtering (CF) domain, in a mobile application setting. First, a framework is proposed for the main procedure of a typical CF recommender system, i.e. how to collect and process data and how to implement CF. Their work shows how implicit data can be translated to user ratings, where implicit features like "download, play count and share" are combined to create a user's preference for an item. Finally, two case studies were performed to validate their framework. Five different CF algorithms were used: user-based, item-based, item-average, singular-value decomposition (SVD), and a random recommendation algorithm as a baseline. In an offline evaluation with MAE, the results show that the SVD algorithm performs with the best accuracy. SVD is a matrix factorization (MF) algorithm and will be used in this thesis.

In another article, Koren [25] presents different matrix factorization techniques in regard to recommender systems. The conclusion is that MF is superior to traditional nearest-neighbor CF algorithms, not only due to its superior prediction accuracy, but also because it is a memory-efficient model that can naturally integrate contextual features such as temporal data.

In a survey by Bobadilla et al. [5], not only different collaborative filtering techniques are investigated, but also content-based filtering as well as hybrid recommender systems. The focus lies on the incorporation of social information as features in these algorithms, i.e. trusted or untrusted users, followed users and followers, friends lists, posts, blogs and tags. This social context aims to help the prediction accuracy of the recommender systems, and the use of contextual features is of interest. Social context is not used in this thesis, however, partly because these metrics are not available in the dataset at hand, but also since it implies the use of personal data, which is something that should be avoided if possible due to user integrity and data protection laws.

News recommender systems. Epure et al. [16] investigate news recommendations in short user sessions, i.e. where users do not authenticate themselves and thus cannot be modeled with a user history. Their algorithm was trained on 17 months' worth of data logs from a German news article publisher. The authors divide the users' reading interests into three levels: short-, medium- and long-term. Their work shows that different combinations of the levels generate different results from the recommender system. Recommendations based on short-term and long-term interests generate better prediction accuracy, while a combination of short-term and medium-term interests generates a higher news variety.

Another approach to recommending news was taken in a recent study by Wang et al. [42], where the recommender system generates news topics rather than news articles. The recommender system generates its suggestions through keyword extraction, and the authors' method has proved successful in prediction accuracy; however, their data was limited to political news articles, and they still struggle with the removal of repeatedly appearing words, such as nouns, in the process of extracting important keywords.

Maksai, Garcin, and Faltings [32] investigate how metrics that are used in offline evaluation could be used to predict the online performance of a recommender system, which could circumvent the need for online A/B tests. This is important to know, since the online setting, where the recommender system is in production, is dynamic and different from the setting with a static dataset. The authors used a regression model on the offline metrics and demonstrated that metrics such as coverage and serendipity play an important role when predicting online metrics such as click-through rate.

Interesting findings have been made regarding the impact of contextual features in recommender systems. In a study by Lommatzsch, Kille, and Albayrak [31], where different recommender systems were evaluated in an online setting, they showed that news items are most desirable within four hours of publication, and after that the interest declines rapidly. The same study came to the conclusion that in order to catch both short-term and global trends in recommendations, an ensemble of two different models that specialize in the two different kinds of trends is necessary.

Click-position bias in recommendation models. Another important finding, made by Craswell et al. [11], that impacts both evaluation and the processing of the training dataset is the effect of position bias in how news items are presented on the web. This study showed that for items presented in a horizontal feed, almost all engagement was isolated to the two leftmost items. This should also be kept in mind when choosing which item to present first to maximize user experience.

Recommender systems based on implicit feedback. Hu, Koren, and Volinsky [20] have done extensive analysis on how to approach a recommender system based solely on implicit feedback. One of their main conclusions was to transform the implicit user data into paired magnitudes of preferences and confidence levels. Since the algorithm only trains on implicit data, ratings can only be inferred, and by also having a confidence level paired with a predicted rating, the model can better weight different recommendations. The authors propose a latent-factor algorithm that deals with the preference-confidence paradigm and generates good prediction accuracy on a TV program dataset.

There is growing attention to the use of deep learning in the context of recommender systems. Deep learning is a type of machine learning that uses large numbers of layers in neural networks. Devooght and Bersini [12] demonstrated that the use of recurrent neural networks, a deep learning technique specialized in sequential data, can be competitive in movie recommendations. The deep structure can model more complex correlations in the data, and for news recommendations, where temporal data is an important factor, these recurrent neural networks are interesting to consider. This method, however, requires a large dataset with rich features, which is not the case in this master thesis.

2.8.1 Analysis of the related work

In general, research in the recommender system domain mostly deals with new algorithms and how to improve their performance. Less research is done on how the user actually receives and appreciates the recommended items. Another common theme in research regarding contextual features is the temporal aspect; however, further research needs to be conducted on other contextual features as well.

Another common aspect of previous research is that the algorithms are trained on news articles, which involve a different kind of data compared to this thesis, where news videos are the items in focus. The main difference is that articles consist of more text, which in turn offers a richer description of the item, compared to news videos where only a short description text is present. The item features are also different, as a video will have implicit data such as completion rates and features such as length and resolution.


Method

This chapter begins by describing how the dataset was acquired, what kind of properties it has and how it was preprocessed. The different algorithms are then presented, along with how they were implemented. Finally, the evaluation methods and experiments are described.

3.1 Data

3.1.1 Data collection

The item metadata, i.e. the information about each news clip, was fetched from Amazon S3 through Sumo Logic, which is a cloud-based log management system. The interaction log files, e.g. video requests and the percentage viewed of a video, were explored and obtained from ElasticSearch through Kibana, which is an open-source data visualization tool.

3.1.2 Data analysis

The users in the dataset were spread out over almost all countries around the globe. The majority of the users, however, came from the Middle East, North Africa, the USA and Sweden. Most of the users were aged between 25-55, and the gender distribution was about 60% male and 40% female. About 30% of the user interactions were created on a mobile device, 10% on tablets and the rest came from desktop. Regarding time, interactions are relatively spread out throughout the day, with a slight increase in the mornings between 07:00-09:00 local time in the respective country, and again in the evening between 18:00-21:00.

Regarding the videos in the dataset, the available languages of the news clips were Arabic and English. The average length of the video clips was 72 seconds, and a total of 16 different news providers produced the content.

3.1.3 Data structure

54 days of stored data were collected (between 23 February 2018 and 18 May 2018). Due to the nature of the available features, a decision was made to only include interactions between a user and a news video that had a view completion rate of 30 percent or more. In the absence of explicit ratings, this was taken to represent a positive interaction, i.e. if a user had watched at least 30 percent of a video, the user was assumed to like it. There was a trade-off between the amount of data that could be used and the certainty of users' positive preferences for the videos.

Since few users signed in to the service, it was impossible to track certain users over longer periods. This resulted in users being represented by their cookie/session ID. It also made the data appear more sparse than it actually was, since the same user was stored as two different cookies when, for example, browsing first from a mobile and later from a desktop.

Additional data cleaning had to be done, removing items and interactions where null values were present. Any users (sessions) with fewer than two interactions were removed, because they could otherwise not be divided into training and testing datasets. The final data consisted of two files, one for the items and one for the user interactions; see tables 3.1 and 3.2 for the specific features.

3.1.4 Data sparsity

The sparsity level of a dataset is best understood by picturing the interaction matrix mentioned in the matrix factorization section 2.3.2. The interaction matrix comprises users as rows and items as columns, and the binary values in the cells represent interactions. The density represents the share of 1's that can be found in this interaction matrix. Data sparsity is therefore 1 − density, which represents the share of 0's in the matrix. The data density is calculated by the following formula:

$$\text{density} = \frac{\text{interactions}}{\text{users} \times \text{items}} \qquad (3.1)$$

The density is simply the total number of unique interactions between users and items divided by the total possible number of interactions that could exist.

Table 3.1: Dataset: Item metadata (total unique videos: 7062)

Feature      Example
Clip Id      5721993
Headline     Windsor visitors snag selfies with Harry and M...
Tags         "Celebrations", "Celebrities", "Royalty", "Harr...
Description  Britain's Prince Harry and Meghan Markle make ...
Language     English
Duration     47 seconds
Provider     Reuters
Timestamp    1524950326021
Location     United Kingdom

The dataset used in this thesis has a density of 0.086%, which means there are 99.914% zeros in the interaction matrix used in the experiments.
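A minimal sketch of equation 3.1 on a pandas interaction log; the column names and rows are invented stand-ins for the real dataset:

```python
import pandas as pd

interactions = pd.DataFrame({
    "user_id": ["a", "a", "b", "c"],
    "clip_id": [1, 2, 2, 3],
})

n_users = interactions["user_id"].nunique()
n_items = interactions["clip_id"].nunique()
density = len(interactions.drop_duplicates()) / (n_users * n_items)
print(f"density = {density:.3%}, sparsity = {1 - density:.3%}")
```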

3.1.5 Train and test data

The data was split into a train and a test set, with an 80-20% distribution. This was done in order to train the algorithms on as much data as possible, while keeping a portion of the data away to be able to test whether the recommender system performs well on unseen data. The split between the train and test sets was done by hiding interactions from the training set.
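A minimal sketch of such a split, hiding a share of each user's interactions as test data; it reuses the invented "interactions" DataFrame from the density sketch above:

```python
def split_per_user(interactions, test_frac=0.2, seed=42):
    # Sample 20% of each user's interactions into the test set.
    test = (interactions
            .groupby("user_id", group_keys=False)
            .sample(frac=test_frac, random_state=seed))
    train = interactions.drop(test.index)
    return train, test

train, test = split_per_user(interactions)
```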


Table 3.2: Dataset: User interactions (total users: 12853, total interactions: 89767)

Feature    Example
Clip Id    5721993
User Id    EHGLJJSLK82319SLDKAJ
Country    Sweden
Timestamp  1524960326021

3.1.6 Hyper-parameter optimization

To find the optimal hyper-parameters, i.e. the parameters that can be altered in each algorithm, an exhaustive grid search was performed. This essentially runs different combinations of values for the adjustable parameters that each algorithm allows. Finally, the best settings for this particular dataset were found.
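A minimal sketch of such an exhaustive search over two assumed hyper-parameters (the number of factors f and the regularization constant of equation 2.4); evaluate() is a placeholder for the recall@k evaluation described in section 3.4.1:

```python
from itertools import product

def evaluate(f, reg):
    # Placeholder: a real implementation would train the model with these
    # parameters and return its recall@k on the test set.
    return 0.0

param_grid = {"f": [10, 20, 50], "reg": [0.01, 0.1, 1.0]}

best_score, best_params = -1.0, None
for f, reg in product(param_grid["f"], param_grid["reg"]):
    score = evaluate(f=f, reg=reg)
    if score > best_score:
        best_score, best_params = score, {"f": f, "reg": reg}
print(best_params, best_score)
```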

3.2 Hardware and software

Most of the algorithms and programs were built and implemented in Python, with the help of libraries called Numpy (version 1.13.3), Scipy (version 1.1.0) and Pandas (version 0.20.3) to manipulate the data. The library Scikit-learn (version 0.19.1) was used for machine learning and Matplotlib (version 2.1.0) coupled with Seaborn (version 0.8.0) was used to visualize the results.

The experiments were run on a MacBook Pro with the following specs:

• Model: MacBook Pro (Retina, 13-inch, Mid 2014)
• Processor: 2.6 GHz Intel Core i5
• Memory: 8 GB 1600 MHz DDR3
• Graphics: Intel Iris 1536 MB


3.3 Algorithms

Four different algorithms, "popularity", "content-based", "collaborative-filter" and "hybrid", were implemented, tested and optimized. They were then evaluated on real test subjects.

3.3.1 Baseline: popularity

The baseline algorithm was implemented to reflect the most popular content. This is a relatively strong baseline and is more competitive than complete randomness. It simply aggregates all the interactions per news clip and recommends the most viewed clips that the user has not yet seen.
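A minimal sketch of this baseline, assuming the interaction DataFrame layout from section 3.1.4 ("user_id", "clip_id" columns):

```python
def recommend_popular(interactions, user_id, k=10):
    # Clips the user has already seen are excluded from the result.
    seen = set(interactions.loc[interactions["user_id"] == user_id, "clip_id"])
    # value_counts() orders clips by number of interactions, descending.
    top = interactions["clip_id"].value_counts().index
    return [clip for clip in top if clip not in seen][:k]
```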

3.3.2 Content-based

To utilize this algorithm to its full potential, specific preprocessing of the data was necessary. In order to create vector representations of the items' metadata, TF-IDF was used, since it has proved to yield good representations [45]. The data was converted to lowercase, and item-specific tags were stemmed to their roots, i.e. suffixes and prefixes were removed where possible. Special characters and stopwords, i.e. words that do not bring any semantic value to the item, were removed from description and headline texts. N-grams of size one and two were included, which means that both single words and pairs of words were tokenized and included. Different settings of metadata, i.e. the input to the TF-IDF vectorization algorithm, were tested; each component of the input was concatenated into one text and then vectorized. The different settings in table 3.3 were evaluated with the recall@k metric (a configuration sketch follows the table).

Table 3.3: CB - Varied input for the TF-IDF vectorization

CB model  Input
CB0       Headline
CB1       Headline + Description
CB2       Headline + Description + Tags
CB3       Headline + Description + Tags + Language
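A minimal sketch of this preprocessing with scikit-learn's TfidfVectorizer; the metadata rows are invented, and the stemming step is omitted since it would require a separate library (e.g. an NLTK stemmer):

```python
from sklearn.feature_extraction.text import TfidfVectorizer

# Invented (headline, description, tags) triples standing in for table 3.1.
item_metadata = [
    ("royal wedding", "prince harry and meghan markle marry", "royalty"),
    ("markets fall", "stocks drop amid trade war fears", "economy"),
]

vectorizer = TfidfVectorizer(
    lowercase=True,        # convert all text to lowercase
    stop_words="english",  # drop words without semantic value
    ngram_range=(1, 2),    # tokenize single words and two-word n-grams
)

# Concatenate each item's fields into one text, as in the CB2 setting.
texts = [" ".join(fields) for fields in item_metadata]
item_vectors = vectorizer.fit_transform(texts)
```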


3.3.3 Collaborative filtering

The collaborative filter was implemented with a matrix factorization algorithm called singular-value decomposition (SVD), due to the good performance seen in the related work section 2.8; the theory behind it is explained in section 2.3.2. The interaction matrix of users (sessions) and items (news videos) was created from the "viewed" event. However, only view interactions with a minimum of 30 percent video clip completion were included. This meant that a user was considered to have a positive preference for a news clip when they viewed at least 30 percent of it.

3.3.4 Hybrid

The hybrid algorithm combines the prediction scores from the content-based and collaborative-filter systems. This was done by generating a list of 1000 items from each of the algorithms; the lists were then combined and reordered based on the new item scores.
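A minimal sketch of this combination step; cb_scores and cf_scores are assumed dicts mapping clip id to prediction score for the top-1000 list from each algorithm, and the equal weights are an assumption, since the thesis does not state the weighting:

```python
def hybrid_top_k(cb_scores, cf_scores, k=10, w_cb=0.5, w_cf=0.5):
    # Union of both candidate lists; a missing score counts as 0.
    candidates = set(cb_scores) | set(cf_scores)
    combined = {c: w_cb * cb_scores.get(c, 0.0) + w_cf * cf_scores.get(c, 0.0)
                for c in candidates}
    return sorted(combined, key=combined.get, reverse=True)[:k]

print(hybrid_top_k({"a": 0.9, "b": 0.4}, {"b": 0.8, "c": 0.7}, k=2))
```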

3.4 Evaluation

3.4.1 Recall at k

Recall at k (recall@k) was chosen as the metric for the offline tests, since it captures the main purpose of how the recommender system should perform. This is especially true in this thesis, where the algorithms are trained to achieve a good score on the top k items that are recommended, which can be compared with the user study and how well they perform in a live setting.

The implementation of recall@k was constructed so that for each positive item in a user's test set, 100 negative items were sampled (items that the user did not interact with, i.e. naively assumed to be irrelevant). The items in the resulting list of 101 items are then ranked by their recommendation scores in descending order. The next step is to look at the top k items to see if the positive item (the item known to be relevant from the test set) has actually ended up among the top k recommendations. For every instance where the algorithm successfully does this, it is rewarded 1 point, and 0 otherwise. Once all of the users' positive items have been evaluated, the final recall@k score is calculated as the number of successfully recommended positive test items over the total number of positive test items.
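A minimal sketch of this evaluation loop; score(user, item) is a placeholder for any of the trained models, and the data structures are assumptions:

```python
import random

def recall_at_k(test_pairs, all_items, interacted, score, k=10, n_neg=100, seed=0):
    rng, hits = random.Random(seed), 0
    for user, pos_item in test_pairs:      # each positive test interaction
        # Sample 100 items the user never interacted with.
        negatives = rng.sample(sorted(all_items - interacted[user]), n_neg)
        # Rank the 101 candidates by model score, best first.
        ranked = sorted(negatives + [pos_item],
                        key=lambda item: score(user, item), reverse=True)
        hits += pos_item in ranked[:k]     # 1 point if ranked in the top k
    return hits / len(test_pairs)
```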

3.4.2 Different levels of density

Each of the algorithms was trained and tested on different levels of density in the data. The objective was to see if there is any change in performance as the dataset becomes more dense. All algorithms were first trained on the original density of 0.086% and then compared at incremental instances of density. In order to reduce the sparsity of the data, users were simply removed based on how many total interactions they had. The experiment went from a minimum of two interactions per user up to a minimum of 10 interactions, see table 3.4.

Table 3.4: Different levels of density

Min. interactions  Density
2                  0.086%
3                  0.126%
4                  0.159%
5                  0.189%
6                  0.220%
7                  0.249%
8                  0.279%
9                  0.310%
10                 0.338%

3.4.3 Mean opinion score

In order to complement the offline metrics, a user test was designed and evaluated with the mean opinion score (MOS) method described in 2.7.3. Fifteen test subjects participated with zero loss, which fulfilled the minimum requirement regarding the number of participants, according to the recommendation by the International Telecommunication Union (ITU) [8]. The group consisted of a mix of males and females between the ages of 24-51. There was a mix of experts and non-experts in the news field, where a test subject was classified as an expert if they had been working with news media. The reason for this was to imitate the target audience, which in turn resembles the original dataset that the algorithms utilize for training.

The objective for the different recommender systems was to provide their top 5 news clips that would be of personal interest to each test subject. To be able to do that, data had to be collected about each test subject's news preferences. The user test can be divided into two phases: the first concerns the collection of data, and the second concerns the MOS evaluation of the resulting recommendations. To resemble how interactions were achieved originally, a list of 100 randomly sampled clips from the dataset was provided to each test subject. This list contained all the information about a news clip available on Newstag's platform, which consists of the item features displayed in table 3.1, except for the thumbnails. The test subjects were asked to select a minimum of five news clips that they found interesting, regardless of whether the news event was old and irrelevant in that sense. The list of 100 clips was presented in random order to counter any bias towards the items that were presented first.

Once these interactions were collected and formatted, they were merged with the original dataset and split into test and train data. Each algorithm was re-trained on the expanded dataset and provided its top 5 personal recommendations for each subject. This was the start of phase two, but before showing these recommendations to the test subjects, five negative news clips were randomly sampled and mixed into the list of recommendations. A clip was classified as a negative item if the test subject had not previously interacted with it. The reason for including random negative clips was first of all to be able to compare each algorithm against a random baseline. These negative samples would also remove some of the bias that could otherwise occur due to the frame of reference in a list comprised of only "good" clips.

The resulting list of recommendations was now 10 items long per algorithm, and the test subjects were asked to rate each item depending on how interesting they found it. They did not know that half of the items were generated at random.

The absolute category rating was used as the grading scale, spanning from 1 to 5, where "1" equals "Bad" and "5" equals "Excellent"; see table 2.1 for the full description. In total there were forty clips to rate per test subject, i.e. ten per algorithm, and to counter any bias towards which algorithms were evaluated first, their order was randomized per subject, as was the list of items. The visual representations of the clips in both phase 1 and phase 2 were created using Google Sheets and then exported to csv-files. See figure 3.1 for an overview of the process.

Figure 3.1: Illustration of the two phases in the user test.

Once these ratings were collected and formatted, the MOS score was calculated for each of the algorithms. The scores from the recommended items and the randomly sampled items were separated, and the MOS values were complemented with the median scores. The recommended items and their ratings for each test subject can be found in Appendix A.
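The aggregation is simple enough to sketch. The snippet below, with an assumed ratings layout that does not come from the thesis code, computes the MOS and median per algorithm for both the recommended and the randomly sampled clips.

```python
import numpy as np

def mos_summary(ratings):
    """ratings: dict mapping algorithm name -> (recommended, random) rating lists,
    each containing the 1-5 ACR scores collected from all test subjects."""
    for name, (recommended, random_clips) in ratings.items():
        print(f"{name}: MOS {np.mean(recommended):.2f} "
              f"(median {np.median(recommended):.1f}), "
              f"random baseline MOS {np.mean(random_clips):.2f}")
```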


4 Results

In this section, the results from each experiment are visualized. The offline tests are presented first, followed by the user test and its mean opinion score for each algorithm.

4.1 Offline experiments

In the first graph, figure 4.1, the result of adding more metadata when constructing the content-based algorithm is displayed. There is a leap in performance going from model CB0 (only headline) to CB1 (headline + description); see 3.3 for all model descriptions. The performance stagnates after CB1 and there is no significant difference in either recall at 5 or 10 when adding additional information, at least not on this static dataset. It may still make a difference in a live setting, since the more information that is included in the model, the more personalized the results can become. For this to be effective, however, the collection of data might need to be larger in order for these correlations to actually be picked up by the algorithm. Another thing to note is that the user tests were performed with the content-based model where all information was included (CB4), since there was no difference in offline performance and the hypothesis was that it might generate better results for the individual test subject. In order to actually know whether this was the case, however, the models CB0-CB4 would have to be evaluated on test subjects as well.
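For context, the construction of the CB0-CB4 variants can be sketched as follows. The field names and the use of scikit-learn's TfidfVectorizer are assumptions made for illustration; the actual fields follow the model descriptions in 3.3.

```python
from sklearn.feature_extraction.text import TfidfVectorizer

# Hypothetical metadata fields per variant (CB0 = headline only,
# CB4 = all available metadata).
VARIANTS = {
    "CB0": ["headline"],
    "CB1": ["headline", "description"],
    "CB4": ["headline", "description", "tags", "category", "provider"],
}

def fit_variant(items, fields):
    """Concatenate the chosen fields per clip and fit a TF-IDF model on them.
    `items` is assumed to be a list of dicts holding the metadata as strings."""
    corpus = [" ".join(str(clip.get(f, "")) for f in fields) for clip in items]
    vectorizer = TfidfVectorizer(stop_words="english")
    return vectorizer, vectorizer.fit_transform(corpus)
```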


Figure 4.1: Recall@k for different inputs when constructing the vectorization model for the content-based recommender system.

In both figure 4.2 and 4.3, the content-based algorithm shows the best performance when recommending video clips to users based on past interactions. The scores are overall higher for recall at 10, since it allows a recommended clip to appear in a larger list and still count as a successful recommendation. It is worth noting, however, that the content-based and hybrid models maintain a high score in recall at 5, whereas the popularity model and the collaborative filter perform less well. These results are based on the dataset at its sparsest, i.e. with a minimum of two interactions per user. In concrete terms, the content-based algorithm manages to successfully recommend over 90% of the test items within the top five list.


Figure 4.2: Performance of recall@10 for each algorithm.


The performance of recall at 5 on different sparsity levels is displayed in figure 4.4. It is possible to observe that all of the models except the collaborative filter decrease in performance as the data becomes more dense, i.e. as only users with several interactions are included. By making the dataset more dense, the total number of users and interactions also decreases, since users with fewer interactions are removed.


4.2 User test

The mean opinion score was calculated for each algorithm, based on the quality of the news video clips that they recommended to each test subject. The result can be seen in figure 4.5, where the blue bars represent the MOS for recommended items and the green bars represent the randomly sampled clips.

Figure 4.5: Results of the Mean Opinion Score user test for each algorithm.

All of the recommender systems performed relatively similarly, and they all beat the random baseline; see table 4.1 for the exact percentages. The content-based model got the best score, followed by the hybrid. It is worth noting that the simple popularity model beat the collaborative filter and was very close to the hybrid model. This indicates that users could be relatively satisfied with recommendations based on the most popular content, which could be a good substitute if there is not enough data to implement any of the other algorithms.


Table 4.1: Algorithm improvement relative to random, in the MOS user test.

    Algorithm               Improvement
    Popularity              26%
    Content-based           27%
    Collaborative filter    19%
    Hybrid                  27%


5 Discussion and conclusion

In this section, the results from the previous chapter are discussed and analyzed, and factors that could have had an impact on them are brought up. Further research is proposed and finally a conclusion is drawn.

5.1 Offline experiments

From section 2.1 it is known that recommender systems should improve in performance as more information about a user's taste profile is gathered. Judging by the results in figure 4.4, this does not seem to apply to the content-based model, which is also reflected in the hybrid model due to its construction. From the perspective of a content-based algorithm, the more interactions that are known for one user, the more that person's TF-IDF vector will try to reflect all of them at once. This means that a lot of different content could result in recommendations that reflect a middle ground of preference between different items. This phenomenon is compounded by the fact that 16 different news providers created the content in this dataset, which means that several videos cover the same news story but with slightly different angles and metadata. Since the performance is calculated on the ground truth from the static dataset, videos that are similar and actually of genuine interest to the user will still impact the performance negatively. That is one drawback of testing recommender systems offline.
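To make the "middle ground" effect concrete, consider a user profile built as the mean of the watched clips' TF-IDF vectors, a common content-based construction that is assumed here for illustration rather than taken from the thesis code:

```python
import numpy as np

def user_profile(item_vectors, watched_indices):
    """Taste profile as the mean of the TF-IDF vectors of the watched clips.
    With many diverse interactions, the mean drifts toward a middle ground
    that matches none of the individual clips particularly well."""
    return np.asarray(item_vectors[watched_indices].mean(axis=0)).ravel()
```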

The collaborative filter, on the other hand, seems to improve as the data becomes more dense. This might be explained by the algorithm being better able to distinguish similar users as they have more unique sets of interactions. For example, when a minimum of 2 interactions is allowed, two users are considered to have the same taste profile by having only one video in common, which might not model reality that well.

One major factor that impacts the performance of the matrix factorization model in this thesis is the implicit feature used to model the interaction matrix between users and videos. As mentioned and motivated in section 3.1.3, the feature selected to model a positive interaction with a video was a view count of at least 30%. Judging by the results in the offline and user tests this proved to work fairly well, but it might not be the best representation. For example, shorter clips will to a larger extent be considered to have positive interactions, since it takes less time to reach the 30% limit. The average length of the videos in the dataset, however, was 72 seconds, which means that a user would have to watch about 22 seconds for a view to count; that is relatively long to endure if the video is uninteresting.
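The binarization rule itself is a one-liner; the sketch below only assumes that watch time and clip length are available per view event.

```python
def is_positive_interaction(seconds_watched: float, clip_length_s: float,
                            threshold: float = 0.30) -> bool:
    """Binarize a view event: positive if at least 30% of the clip was watched.
    For the average 72-second clip this corresponds to roughly 22 seconds."""
    return seconds_watched / clip_length_s >= threshold
```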

Another downside with this implicit metric is the problem of knowing whether the user is actually viewing the video, or whether it is, for example, playing in another web browser tab. This was not possible to measure in this dataset, but by tracking mouse movement and similar activities it is possible to judge, to some extent, whether the user is actively watching.

5.2 User test

Some factors could have had an impact on the results of this user test. Thumbnails were not included when the videos were displayed in phase 1 and phase 2, which could have affected the level of interest that a test subject took in a news video. Another aspect that could affect the level of interest was a possible language barrier, as the provided clips were in English while the test subjects were all Swedish. The participants had good English skills, but there is a chance that some words did not attract interest due to unfamiliar wording. This also meant that the test group did not reflect the cultural diversity that existed in the original dataset; due to limitations in time and resources, it was not possible to include people abroad.


An interesting finding is that the completely randomly sampled video recommendations performed at an almost fair level. This implies that there is relevant news outside of the modeled taste profile, which is "good news", since the opposite would mean that people are content to stay within their existing space of preference. Expanding on this notion, the content-based model really encapsulates the individual preference, and allowing this algorithm to dominate information filtering could lead to severe echo chambers. The collaborative filter expands the boundary of what content can be recommended to the individual, but it is still limited to users that are similar to each other, which in a sense is the definition of a filter bubble. To counter this, each recommender system could perhaps let completely random samples account for a percentage of the list of recommended items.
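Such a scheme could be as simple as the following sketch, where a configurable share of the top list is swapped for uniformly sampled clips; the names and the 20% default are illustrative assumptions, not part of the evaluated systems.

```python
import random

def diversify(top_items, candidate_pool, random_share=0.2, seed=None):
    """Replace a share of the top-k list with uniformly sampled clips, so the
    user is also exposed to content outside the modeled taste profile."""
    rng = random.Random(seed)
    n_random = max(1, int(len(top_items) * random_share))
    kept = top_items[: len(top_items) - n_random]
    pool = [c for c in candidate_pool if c not in set(top_items)]
    return kept + rng.sample(pool, n_random)
```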

Another interesting idea to explore would be to use the defined taste profiles in a different manner. Instead of recommending items within similar groups, completely opposite taste profiles could be located, and items would then be recommended across these groups. This would give each group insights into what other groups find interesting, and could assist the ongoing battle against filter bubbles and echo chambers that became a heated topic after the outcome of the 2016 Brexit referendum in the U.K. and the presidential election in the U.S. [13].

5.3 News value

This thesis is of interest to those who seek to understand how recommender systems work and their implications in a news video domain. It is relevant both to those who deal with video and to those who work with pure text articles, since the methods apply to both. The thesis assists in understanding what type of recommender system would work in different settings and gives advice on how to improve them in live production. The evaluation method, a combination of offline metrics and a mean opinion score, differs from most other research and could also bring new insights.

Looking at the bigger picture, this thesis touches on the implications of recommender systems in society. It is important that stakeholders are aware of the problems with filter bubbles and echo chambers, and by reading this work, it should become evident what it is in

References
