Hellinger Distance-based Similarity Measures for Recommender Systems
One year master thesis, Umeå University
Recommender systems are used in online sales and e-commerce to recommend items or products that customers may want to buy, based on their previous buying preferences and related behaviours. Collaborative filtering is a popular computational technique that is widely used for such personalized recommendations. Of the two forms of collaborative filtering, neighbourhood-based and model-based, the neighbourhood-based form is the more popular and is relatively simple. It relies on the idea that a certain item might be of interest to a given customer (the active user) if either the customer appreciated similar items, or the item is appreciated by similar users (neighbours). Different kinds of similarity measures are used to implement this idea. This thesis compares different user-based similarity measures and defines new measures based on the Hellinger distance, which is a metric in the space of probability distributions. Data from the popular MovieLens database are used to assess the effectiveness of different Hellinger distance-based measures compared to other popular measures such as Pearson correlation (PC), cosine similarity, constrained PC and JMSD. The performance of the similarity measures is evaluated with mean absolute error, root mean squared error and F-score. From the results, no evidence was found that Hellinger distance-based measures perform better than the more popular similarity measures for the given dataset.
Title (Swedish): Hellinger distance-baserad similaritetsmått för rekomendationsystem

Recommender systems are most often used in e-commerce to recommend goods or products that a customer will be interested in buying, based on the customer's previous buying preferences and related behaviour. Collaborative filtering (CF) is a popular computational technique that has been used all over the world for these personalized recommendations. Of the two types of CF, neighbourhood-based and model-based, neighbourhood-based CF is the more popular and is relatively simple. It relies on the concept that a specific product may be of interest to a given user if either the user appreciated similar products, or the product was appreciated by similar users (neighbours). Different types of similarity measures are used to implement this concept. This thesis aims to compare different user-based similarity measures and to define meaningful similarity measures based on the Hellinger distance, which is a measure based on probability distributions. Data from the popular website MovieLens are used to show the effectiveness of the different Hellinger distance-based measures compared with other popular measures such as Pearson correlation (PC), cosine similarity, constrained PC and JMSD. The performance of the different similarity measures is then studied with the help of mean absolute error, root mean squared error and F-score. From the results, no evidence was found that Hellinger distance-based measures performed better than the more popular similarity measures for the given data.
Popular scientific summary
There are many product choices when browsing the internet for online purchases. It can become overwhelming for consumers to find and choose the products that interest them most. To make it easier to navigate this sea of products, companies have started to use so-called recommender systems. These systems provide the consumer with personalized recommendations based on data from, for example, their previous purchases. The use of recommendations has also been shown to increase revenues for companies, as customers tend to purchase more items.
Different methods have been created to optimize the recommendation system for different types of data. One of these methods is called collaborative filtering (CF) and is considered to be the most successful approach for personalized product or service recommendations. This method is separated into two main classes, neighborhood- and model-based CF. The model-based CF approach looks at the data from customers and calculates an equation that will predict what kind of items the customer will find worthwhile.
The neighborhood-based approach works on the idea that a product might be interesting to a customer if similar customers liked it as well, or if the customer liked a similar product before. Neighborhood-based CF is the focus of this paper. Its main component is the calculation, with statistical software, of a value that quantifies the similarity between different customers. This value is known as a similarity measure.
The similarity is quantified with a statistical distance: the smaller the statistical distance between two items or customers, the more similar they are.
There are several ways to calculate that distance. The purpose of this paper is to see how well popular calculation methods perform against each other, and to introduce a new method and compare it with them.
The data used in this paper come from one of the most popular datasets in the research field of CF. It is called MovieLens and is publicly available online. The dataset contains information on 600 different users, each of whom has rated movies (items) on a scale from 1 to 5. Some of the users have rated many movies and some just a few. This is one of the problems in using CF: not all of the items are rated by all of the users. In the comparison of the methods on this data, the newly introduced method did not perform better than the more popular methods.
1 Introduction
1.1 Data
2 Theory and methods
2.1 Popular CF measures
2.2 Bhattacharyya coefficient-based CF measure
2.3 Hellinger distance-based CF measure
3 Evaluation methods
4 Results
4.1 MAE
4.2 RMSE
4.3 F-score
5 Discussion
1 Introduction
Whether it is buying clothes or watching a new show on Netflix, e-commerce is something that most of the population has some experience with. There are often several options to choose from when looking for a product. Amazon offers its customers over 410 000 different e-books to choose from. In this sea of items, books in this example, it can be hard for the user to find the one item that would be of interest. With recommender system techniques it is possible to provide customers with personalized recommendations that suit their specific interests. So every time a user is recommended a new item when browsing Amazon's website, or is shown supplements to a product they are about to purchase, chances are that it was done with the help of a recommender system.
There are two main classes of recommender systems, content-based filtering and collaborative filtering. Content-based filtering uses the information in a user's profile, together with the characteristics of the purchased items, to find and recommend similar items. This paper focuses solely on collaborative filtering (CF), more specifically on neighborhood-based CF.
There is a saying: "Show me who your friends are, and I will tell you who you are". The basic concept of the neighborhood-based CF method is very similar to this saying. It uses the data to find the given user's "friends", that is, other users that have the same preferences. The k closest users are found with the use of a similarity measure. Information from the k closest users is then used to estimate the rating the user would give to a certain item. The purpose of this paper is to introduce a Hellinger distance-based similarity measure and evaluate its performance by comparing it with other, currently more popular similarity measures for a given dataset.
1.1 Data
This thesis is based on the dataset referred to as MovieLens, which was collected by GroupLens Research from the MovieLens web site. It was first released in 1998 and is one of the most popular datasets within the recommender systems field. Several datasets are provided on the GroupLens web page. The one used in this thesis is a small sample of the data collected over the years.
This sample is updated over time on the website, which can make it hard to reproduce the results of this thesis. The dataset used here is dated September 2018. MovieLens is built with four variables: ratings (100 836), users (600), movies (9 724) and timestamps that show when the movie was rated. Since the purpose is a statistical experiment comparing different methods, and the results are not intended for commercial use, the timestamps were considered immaterial and are not taken into account when calculating the results. The rating scale of the dataset differs over time: for ratings made prior to February 2003 a whole-star scale from 1 to 5 was used, and after February 2003 a half-star scale from 0.5 to 5 was used instead. For simplicity, all ratings on the half-star scale were rounded up to the whole-star scale, so that 0.5 becomes rating 1, 1.5 becomes rating 2, and so on. The number of rated movies varies heavily between users. The minimum number of rated items for a user to be included in the data was 20, which was the case for several users, while the user with the highest number of ratings had 2698 rated items.
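The rounding rule described above maps every half-star value to the next whole star. A minimal Python sketch of that rule follows; it is not the thesis's R code, and the function name `to_whole_star` is hypothetical. Rounding up is used because the text maps 0.5 to 1 and 1.5 to 2, which differs from Python's built-in `round`, which rounds halves to the nearest even integer.

```python
import math

def to_whole_star(rating: float) -> int:
    """Map a half-star rating (0.5, 1.0, ..., 5.0) to the 1-5 whole-star scale.

    Assumption: half-star values are rounded up, as in the examples
    0.5 -> 1 and 1.5 -> 2 given in the text.
    """
    return math.ceil(rating)

half_star = [0.5, 1.0, 1.5, 3.5, 4.0, 5.0]
whole_star = [to_whole_star(r) for r in half_star]
print(whole_star)  # [1, 1, 2, 4, 4, 5]
```

Whole-star ratings are left unchanged by this mapping, so ratings made before the scale change are not affected.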
2 Theory and methods
This thesis concentrates on user-based CF: finding neighbours for a given active user and using their information for rating predictions. With the information from MovieLens, it is possible to compute various statistical measures of similarity between the users. Once the similarities have been calculated for a given active user $u$ ("user $u$" is shortened to $U_u$), the $K$ closest users are selected from the set of all other users $U_p$, $p = 1, \dots, M$, $p \neq u$. The prediction of the rating on an item $i$ by the user $u$, denoted by $\hat{r}_{ui}$, is defined as

$$\hat{r}_{ui} = \bar{r}_u + \frac{\sum_{k=1}^{K} s(U_u, U_k)\,(r_{ki} - \bar{r}_{k_u})}{\sum_{k=1}^{K} |s(U_u, U_k)|} \qquad (1)$$

where $\bar{r}_u$ is the average of the ratings made by the user $u$, $s(U_u, U_k)$ is the similarity measure between the user $u$ and the user $k$ (the $k$th neighbour of $u$), $\bar{r}_{k_u}$ is the average of the ratings made by the $k$th neighbour of the user $u$, and $r_{ki}$ is the rating made by the $k$th user on item $i$. There are two additive parts in this equation. The first part is the information taken from the previous ratings made by the user $u$. The second part takes into account the ratings on item $i$ made by the $K$ closest users of the user $u$, through the similarity measure $s(\cdot, \cdot)$ and the ratings $r_{ki}$. Note that this thesis does not tackle the so-called cold-start problem, where a new user has not rated anything yet, but looks into different methods of calculating $s(\cdot, \cdot)$.
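The two additive parts of the prediction can be illustrated with a minimal Python sketch (not the thesis's R code). The function name `predict_rating`, the tuple layout of `neighbours` and the example numbers are hypothetical. Normalizing by the sum of absolute similarities, and falling back to the active user's mean when that sum is zero, are assumptions of this sketch.

```python
def predict_rating(active_mean, neighbours):
    """Predict a rating as the active user's mean plus a similarity-weighted
    average of the neighbours' mean-centred ratings on the item.

    neighbours: list of (similarity, neighbour_rating_on_item, neighbour_mean).
    Assumption: the weights are normalized by the sum of absolute similarities.
    """
    numerator = sum(s * (r - m) for s, r, m in neighbours)
    denominator = sum(abs(s) for s, _, _ in neighbours)
    if denominator == 0:
        # No informative neighbours: use only the first additive part.
        return active_mean
    return active_mean + numerator / denominator

# Three hypothetical neighbours of a user whose mean rating is 3.5.
neighbours = [(0.9, 5, 4.0), (0.5, 3, 3.5), (0.2, 2, 3.0)]
prediction = predict_rating(3.5, neighbours)
print(round(prediction, 2))  # about 3.78
```

The most similar neighbour rated the item one star above its own mean, so the prediction is pulled slightly above the active user's mean.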
2.1 Popular CF measures
There are several different measures for evaluating the similarity between users. Some of the measures that are typically used in the field of CF-based recommender systems are discussed in this thesis: cosine similarity, Pearson correlation (PC), constrained PC (CPC), mean squared difference (MSD), Jaccard (Jacc) and JMSD, which is a combination of Jacc and MSD. Formulas for these measures are presented in Table 1.
Table 1: Frequently used similarity measures

Cosine: $s_{cos}(u, v) = \dfrac{\sum_{i \in I} r_{ui}\, r_{vi}}{\sqrt{\sum_{i \in I} r_{ui}^2}\,\sqrt{\sum_{i \in I} r_{vi}^2}}$, where $I$ is the set of items that user $u$ and user $v$ both rated and $r_{ui}$ is the rating of user $u$ on item $i$.

Pearson correlation: $s_{PC}(u, v) = \dfrac{\sum_{i \in I} (r_{ui} - \bar{r}_u)(r_{vi} - \bar{r}_v)}{\sqrt{\sum_{i \in I} (r_{ui} - \bar{r}_u)^2}\,\sqrt{\sum_{i \in I} (r_{vi} - \bar{r}_v)^2}}$, where $\bar{r}_u$ is the average of the ratings made by user $u$.

Constrained Pearson correlation: $s_{CPC}(u, v) = \dfrac{\sum_{i \in I} (r_{ui} - r_{med})(r_{vi} - r_{med})}{\sqrt{\sum_{i \in I} (r_{ui} - r_{med})^2}\,\sqrt{\sum_{i \in I} (r_{vi} - r_{med})^2}}$, where $r_{med}$ is the median value of the rating scale.

Mean squared difference: $s_{MSD}(u, v) = 1 - \dfrac{\sum_{i \in I} (r_{ui} - r_{vi})^2}{|I|}$, where $|I|$ denotes the cardinality of the set $I$.

Jaccard: $s_{Jacc}(u, v) = \dfrac{|I_u \cap I_v|}{|I_u \cup I_v|}$, where $I_u$ is the set of items rated by user $u$.

JMSD: $s_{JMSD}(u, v) = s_{MSD}(u, v) \cdot s_{Jacc}(u, v)$

The cosine similarity between two users is measured on the items they have both rated. It is the cosine of the angle between the two users' rating vectors on the commonly rated items. When two vectors are close to each other, the angle between them is close to zero and the cosine is therefore close to 1. If the two vectors differ a lot from each other, the cosine similarity takes a value closer to 0. One of the drawbacks of this method is that it does not take into account the rating scale of different users. If, for example, user u has given rating 1 to five different movies while user v rated the same movies with 10, the similarity coefficient will still be 1 even though the ratings are complete opposites. This problem is offset in PC and CPC by introducing a scale into the formula. Another drawback of cosine similarity is that it depends on users having several co-rated items, that is, items that have been rated by both users in question. The formula for cosine similarity shows that if two users have only one co-rated item, the value will always be 1. This could lead to misleading results: even if the two customers have completely opposite opinions, the coefficient value will still be 1. Even if there is more than one co-rated item there is a risk of misleading results, known as the few co-rated items problem. The problem of one or few co-rated items recurs with several other methods as well.
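Both drawbacks of cosine similarity can be reproduced with a few lines of Python (a minimal sketch, not the thesis's R code). The function name and the dictionary layout (item to rating) are hypothetical, and returning 0 when there are no co-rated items is a convention adopted for this sketch.

```python
import math

def cosine_similarity(ratings_u, ratings_v):
    """Cosine similarity on co-rated items; inputs map item -> rating."""
    common = ratings_u.keys() & ratings_v.keys()
    if not common:
        return 0.0  # convention in this sketch: no co-rated items, no similarity
    dot = sum(ratings_u[i] * ratings_v[i] for i in common)
    norm_u = math.sqrt(sum(ratings_u[i] ** 2 for i in common))
    norm_v = math.sqrt(sum(ratings_v[i] ** 2 for i in common))
    return dot / (norm_u * norm_v)

# Drawback 1: the rating level is ignored (all ratings 1 versus all ratings 10).
low = {f"movie{n}": 1 for n in range(5)}
high = {f"movie{n}": 10 for n in range(5)}
same_direction = cosine_similarity(low, high)   # approximately 1.0

# Drawback 2: a single co-rated item always gives similarity 1.
single = cosine_similarity({"a": 1}, {"a": 5, "b": 3})
print(same_direction, single)
```

In both cases the value is 1 (up to floating-point rounding) although the users clearly disagree, because the two rating vectors point in the same direction.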
PC is the most commonly used method within user-based CF. By calculating the linear relationship between two vectors, PC outputs a value between -1 and 1. A negative value indicates a negative correlation between the users, that is, dissimilarity. A value of 1 indicates strong similarity between the users. By centring each user's ratings on that user's mean rating, PC addresses the first problem mentioned for the cosine measure. However, PC also suffers in cases of one or few co-rated items. As an example, let the ratings of two users on four co-rated items be (9, 8, 9, 10) for user u and (10, 10, 10, 8) for user v. The value of the PC coefficient is then approximately −0.82, even though both users rated the same items highly.
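The numerical example can be checked with a minimal Python sketch (not the thesis's R code). The function name `pearson` is hypothetical, the means are taken over the four co-rated items, and returning 0 for a zero denominator is an assumption of this sketch.

```python
import math

def pearson(ratings_u, ratings_v):
    """Pearson correlation between two equally long lists of co-rated ratings."""
    mean_u = sum(ratings_u) / len(ratings_u)
    mean_v = sum(ratings_v) / len(ratings_v)
    numerator = sum((a - mean_u) * (b - mean_v) for a, b in zip(ratings_u, ratings_v))
    denominator = (
        math.sqrt(sum((a - mean_u) ** 2 for a in ratings_u))
        * math.sqrt(sum((b - mean_v) ** 2 for b in ratings_v))
    )
    # Assumption: a user with constant ratings gets similarity 0.
    return numerator / denominator if denominator else 0.0

pc = pearson([9, 8, 9, 10], [10, 10, 10, 8])
print(round(pc, 2))  # -0.82
```

The deviations from the means are (0, -1, 0, 1) and (0.5, 0.5, 0.5, -1.5), so the coefficient is $-2 / (\sqrt{2}\sqrt{3}) \approx -0.82$.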
CPC is a variant of PC that centres the ratings on the median value of the rating scale instead of on each user's mean rating.
Mean squared difference (MSD) has not been used as much in the field of CF as cosine similarity and PC. This is because, unlike PC, MSD disregards some of the information from co-rated items and tends to have a hard time picking up similar rating patterns. Jaccard similarity calculates the proportion of co-rated items within the union of the items that the two users rated. This similarity measure is very basic and does not take into account the actual rating values. JMSD is then a combination of Jaccard and MSD, as shown in Table 1.
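A minimal Python sketch of Jaccard, MSD and JMSD follows (not the thesis's R code). All names are hypothetical. Dividing the rating differences by the width of the rating scale, so that the MSD similarity stays between 0 and 1, is an assumption of this sketch that the text does not state; setting `rating_range=1.0` gives the unscaled form.

```python
def jaccard(items_u, items_v):
    """Share of co-rated items within the union of the two users' rated items."""
    union = items_u | items_v
    return len(items_u & items_v) / len(union) if union else 0.0

def msd_similarity(ratings_u, ratings_v, rating_range=4.0):
    """One minus the mean squared difference on co-rated items.

    Assumption: differences are divided by rating_range (4.0 for a 1-5 scale).
    """
    common = ratings_u.keys() & ratings_v.keys()
    if not common:
        return 0.0
    msd = sum(((ratings_u[i] - ratings_v[i]) / rating_range) ** 2 for i in common) / len(common)
    return 1.0 - msd

def jmsd(ratings_u, ratings_v, rating_range=4.0):
    """Product of the MSD similarity and the Jaccard similarity."""
    return jaccard(set(ratings_u), set(ratings_v)) * msd_similarity(ratings_u, ratings_v, rating_range)

u = {"a": 5, "b": 3, "c": 4}
v = {"b": 3, "c": 2, "d": 1}
print(jaccard(set(u), set(v)), msd_similarity(u, v), jmsd(u, v))  # 0.5 0.875 0.4375
```

Here two of the four distinct items are co-rated (Jaccard 0.5), and the scaled squared differences on them are 0 and 0.25 (MSD similarity 0.875).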
2.2 Bhattacharyya coefficient-based CF measure
The Bhattacharyya coefficient (BC) is a statistical measure of the similarity between two probability distributions. For discrete probability distributions, say, $p_1$ and $p_2$ on the same sample space $X$, it is defined as

$$BC(p_1, p_2) = \sum_{x \in X} \sqrt{p_1(x)\, p_2(x)}$$
In this thesis, BC is used as a similarity measure where $p_1$ is the rating distribution of one item and $p_2$ is that of another. If, for two given items, the two probability distributions are close to each other, then BC takes a value close to 1. Notably, this coefficient does not require that the two items are co-rated; it only takes into account how the two items have been rated by users.
Note that, in order to estimate BC in a given context, all that is needed is the empirical rating distributions of the respective items. Usually, these empirical distributions are estimated non-parametrically, that is, without any distributional assumptions such as normality. In fact, since ratings are on a scale of, for example, 1 to 5, the non-parametric estimates are simply based on histograms. Therefore, there is some sampling variability in the estimated BC value. Further note that when there is a large number of individual ratings, these empirical estimates are consistent estimates (since they are the maximum likelihood estimates). BC can also be applied to the rating distributions of two users.
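The histogram-based estimate can be sketched in Python as follows (a minimal illustration, not the thesis's R code). The function names and the example ratings are hypothetical, and a whole-star scale from 1 to 5 is assumed.

```python
import math
from collections import Counter

SCALE = (1, 2, 3, 4, 5)  # assumed whole-star rating scale

def rating_distribution(ratings, scale=SCALE):
    """Empirical (histogram) distribution of a non-empty list of ratings."""
    counts = Counter(ratings)
    return [counts[r] / len(ratings) for r in scale]

def bhattacharyya(p, q):
    """Bhattacharyya coefficient of two discrete distributions on the same scale."""
    return sum(math.sqrt(a * b) for a, b in zip(p, q))

# Hypothetical ratings of two items; the raters need not be the same users.
p_i = rating_distribution([5, 5, 4, 4])
p_j = rating_distribution([5, 4, 4, 4, 3])
print(round(bhattacharyya(p_i, p_j), 3))  # 0.864
```

Identical histograms give a coefficient of 1, and histograms with no rating value in common give 0, regardless of which users produced the ratings.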
The idea behind the BC-based similarity measure is to combine the Bhattacharyya coefficient with some local similarity measure between two users, $loc(\cdot, \cdot)$. The Bhattacharyya coefficient-based CF similarity measure ($s_{BCF}$) between two users $u$ and $v$ is defined as

$$s_{BCF}(u, v) = \sum_{i \in I_u} \sum_{j \in I_v} BC(p_i, p_j)\, loc(r_{ui}, r_{vj}) \qquad (2)$$

where $I_u$ is the set of items that user $u$ rated, $p_i$ is the probability distribution of the ratings of item $i$, and $loc(r_{ui}, r_{vj})$ is a local similarity between the two ratings $r_{ui}$ and $r_{vj}$. The items rated by the users, $I_u$ and $I_v$, do not need to be co-rated. Note that $BC(p_i, p_j)$ provides the global rating information of the two items $i$ and $j$.
To give more importance to the sheer number of co-rated items between the users, the Jaccard similarity measure is added to $s_{BCF}$. The combined function is

$$s_{JaccBCF}(u, v) = Jacc(u, v) + \sum_{i \in I_u} \sum_{j \in I_v} BC(p_i, p_j)\, loc(r_{ui}, r_{vj})$$

Note that this way of calculating the similarity between two users takes into account all of the items rated by the users $u$ and $v$, so all of the information provided by the dataset is used, even if there are no co-rated items between the users. In this thesis the correlation-based $loc(\cdot, \cdot)$ of equation 3 is used.
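$$loc_{cor}(r_{ui}, r_{vj}) = \frac{(r_{ui} - \bar{r}_u)(r_{vj} - \bar{r}_v)}{\sigma_u \sigma_v} \qquad (3)$$

where $\sigma_u$ is the standard deviation of the ratings by the user $u$, $\bar{r}_u$ is the average rating of user $u$, and $\sigma_v$ and $\bar{r}_v$ are the corresponding quantities for user $v$.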
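Putting the pieces together, the Jaccard-augmented BC measure with the correlation-based local similarity can be sketched in Python (a minimal illustration, not the thesis's R code). All names and the toy data are hypothetical. The use of the population standard deviation and the fallback to the Jaccard term alone for a user with constant ratings are assumptions of this sketch.

```python
import math
from collections import Counter
from statistics import mean, pstdev

SCALE = (1, 2, 3, 4, 5)  # assumed whole-star rating scale

def item_distributions(ratings):
    """Histogram rating distribution per item, from ratings[user][item]."""
    per_item = {}
    for user_ratings in ratings.values():
        for item, r in user_ratings.items():
            per_item.setdefault(item, []).append(r)
    dists = {}
    for item, rs in per_item.items():
        counts = Counter(rs)
        dists[item] = [counts[x] / len(rs) for x in SCALE]
    return dists

def bc(p, q):
    return sum(math.sqrt(a * b) for a, b in zip(p, q))

def jacc_bcf(u, v, ratings, dists):
    """Jaccard term plus the double sum of BC(p_i, p_j) * loc_cor(r_ui, r_vj)."""
    ru, rv = ratings[u], ratings[v]
    jacc = len(ru.keys() & rv.keys()) / len(ru.keys() | rv.keys())
    mu_u, mu_v = mean(list(ru.values())), mean(list(rv.values()))
    sd_u, sd_v = pstdev(list(ru.values())), pstdev(list(rv.values()))
    if sd_u == 0 or sd_v == 0:
        return jacc  # assumption: loc_cor is undefined, keep only the Jaccard term
    total = 0.0
    for i, r_ui in ru.items():          # every item rated by u ...
        for j, r_vj in rv.items():      # ... paired with every item rated by v
            loc = (r_ui - mu_u) * (r_vj - mu_v) / (sd_u * sd_v)
            total += bc(dists[i], dists[j]) * loc
    return jacc + total

ratings = {
    "u": {"a": 5, "b": 3, "c": 4},
    "v": {"b": 4, "c": 2, "d": 5},
    "w": {"a": 3, "d": 3},
}
dists = item_distributions(ratings)
sim_uv = jacc_bcf("u", "v", ratings, dists)
print(sim_uv, jacc_bcf("u", "w", ratings, dists))
```

The double loop visits every pair of items rated by the two users, which is why this family of measures is much more expensive to compute than measures restricted to co-rated items.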
2.3 Hellinger distance-based CF measure
The Hellinger distance between two discrete probability distributions, say, $p_1$ and $p_2$, denoted $H(p_1, p_2)$, measures how far apart the two distributions lie in the space of probability distributions common to them. The Bhattacharyya coefficient between $p_1$ and $p_2$, $BC(p_1, p_2)$, is related to $H(p_1, p_2)$ as follows:

$$H(p_1, p_2) = \sqrt{1 - BC(p_1, p_2)}$$
The Hellinger distance is a metric in the space of probability distributions that takes values between 0 and 1. However, for a given probability distribution, the distribution that lies furthest from it is not necessarily at a distance of 1. The Hellinger distance can be used to measure the degree of similarity between two probability distributions: when the distance is 0 the two distributions are identical, and when it is 1 they are as far apart as possible.
Note that a distance measure $d$ is considered a metric if, for entities $p_1$, $p_2$ and $p_3$, the following criteria are met:
1. positivity condition: $d(p_1, p_2) \geq 0$,
2. symmetry property: $d(p_1, p_2) = d(p_2, p_1)$,
3. identity property: $d(p_1, p_2) = 0$ if and only if $p_1 = p_2$,
4. triangle inequality: $d(p_1, p_2) \leq d(p_1, p_3) + d(p_3, p_2)$.
The reason for introducing the Hellinger distance is that BC does not necessarily obey the triangle inequality; a further development of equation 2, with the Hellinger distance as a replacement for BC, is therefore suggested. By definition, the Bhattacharyya coefficient is not a metric, but the Hellinger distance is. Figure 1 shows that for each value of the Hellinger distance (HC) there are several values of BC; for example, an HC value of 0.7 corresponds to several BC values. One can also read the figure the other way round, but since the Hellinger distance is a metric, it is intuitively safe to assume that if two pairs of users have the same value of HC, then both pairs have the same similarity. This is not necessarily the case for BC, since it is not a metric. The use of BC could add more randomness to the results, which is why the Hellinger distance is suggested as a replacement.
Figure 1: Plotted values of BC and HC from the 100 users that were used in this study.
Based on earlier work that defined a statistical measure of dependence, the similarity between two probability distributions $p_1$ and $p_2$ is then defined as

$$HS(p_1, p_2) = 1 - \frac{H(p_1, p_2)}{\sqrt{H(p_1, p_{m2})\, H(p_2, p_{m1})}} \qquad (4)$$

where $p_{m1}$ is the probability distribution that is furthest from $p_1$ in Hellinger distance, and similarly $p_{m2}$ for $p_2$. Note that the second term of equation 4 is simply a normalized Hellinger distance between $p_1$ and $p_2$, where the normalizing constant is the geometric mean of two distances, one between $p_1$ and $p_{m2}$ and the other between $p_2$ and $p_{m1}$, which are the largest possible distances for $p_1$ and $p_2$ respectively. This normalization allows any $p_1$ to have its furthest distribution at a distance of 1, and likewise for $p_2$. The main purpose of this thesis is to see how the Hellinger distance performs in place of BC.
3 Evaluation methods
The MovieLens data was used to evaluate the different similarity measures. Because of the time limitations of this thesis, it was not feasible to use all of the 600 users, as some of the methods are not computationally efficient.
Therefore, a sample of 100 users, with 19 134 ratings on 5586 unique movies, was randomly selected. All of the results are based on this sample. After that, 25 users were randomly selected, and 20 percent of all the unique movies they rated were set aside, without replacement, as the test data. Restricting the test data to 25 users made the computations four times faster, since for each test user the similarity coefficients to the remaining 99 users must be calculated before the k closest neighbours are assigned. Rating values not included in the test subset were used as the training data. The training data consisted of 5560 unique movies and 18 532 rating values. The test data consisted of 225 unique movies and 602 ratings. For each of the 25 users in the test data, the k closest neighbours were selected from the remaining 99 users using the training data.
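A split of this kind can be sketched in Python as follows (a minimal illustration, not the thesis's R code). The function name, the seed and the toy data are hypothetical. The sketch follows one possible reading of the description: the unique movies rated by the selected test users are pooled, a fraction of those movies is drawn without replacement, and the test users' ratings on the drawn movies form the test data.

```python
import random

def split_train_test(ratings, n_test_users, test_fraction, seed=0):
    """Split ratings[user][movie] into train and test parts.

    Assumption: held-out movies are drawn from the pooled set of unique
    movies rated by the selected test users, not separately per user.
    """
    rng = random.Random(seed)
    test_users = rng.sample(sorted(ratings), n_test_users)
    pooled_movies = sorted({m for u in test_users for m in ratings[u]})
    n_held_out = round(test_fraction * len(pooled_movies))
    held_out = set(rng.sample(pooled_movies, n_held_out))  # without replacement
    train = {u: dict(r) for u, r in ratings.items()}
    test = {}
    for u in test_users:
        # Move the test user's ratings on held-out movies from train to test.
        test[u] = {m: train[u].pop(m) for m in list(train[u]) if m in held_out}
    return train, test

# Toy data: user n rates movies n .. n+5 with ratings cycling through 1-5.
ratings = {f"user{n}": {f"movie{m}": (m % 5) + 1 for m in range(n, n + 6)} for n in range(8)}
train, test = split_train_test(ratings, n_test_users=2, test_fraction=0.2)
print(sum(len(r) for r in train.values()), sum(len(r) for r in test.values()))
```

Users outside the test group keep all of their ratings in the training data, so they remain available as neighbours.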
Mean absolute error (MAE) and root mean squared error (RMSE) are the evaluation metrics that were used to calculate the predictive (quantitative) accuracy of a given method. The formulas for MAE and RMSE are shown in equations 5 and 6 respectively.

$$MAE = \frac{1}{N} \sum_{i=1}^{N} |r_i - \hat{r}_i| \qquad (5)$$

where $N$ is the number of predictions made over all of the users in the test data, $r_i$ is the actual rating of item $i$ and $\hat{r}_i$ is the predicted value.

$$RMSE = \sqrt{\frac{1}{N} \sum_{i=1}^{N} (r_i - \hat{r}_i)^2} \qquad (6)$$
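Both error measures are straightforward to compute. The following is a minimal Python sketch (not the thesis's R code) with hypothetical function names and made-up ratings.

```python
import math

def mae(actual, predicted):
    """Mean absolute error over paired actual and predicted ratings."""
    return sum(abs(a - p) for a, p in zip(actual, predicted)) / len(actual)

def rmse(actual, predicted):
    """Root mean squared error over paired actual and predicted ratings."""
    return math.sqrt(sum((a - p) ** 2 for a, p in zip(actual, predicted)) / len(actual))

actual = [4, 3, 5, 2]
predicted = [3, 3, 4, 4]
print(mae(actual, predicted), round(rmse(actual, predicted), 3))  # 1.0 1.225
```

The absolute errors are 1, 0, 1 and 2. RMSE is larger than MAE here because squaring gives the single two-star error more weight.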
To evaluate the classification accuracy (qualitative performance) of the recommender system, the F-score is used. The F-score is a combination of two evaluation measures that are regarded as complementary, precision and recall. Let the items that will be recommended to the user, $L_r$, be defined as those that are predicted to be rated 4 or higher, and let $L_{rev}$ be the list of items to which the user has actually given a rating of 4 or higher. These thresholds are used in testing. Precision is the fraction of the items recommended to a given user that are relevant, while recall is the fraction of relevant items that were recommended to the given user. The formulas for precision and recall are shown in equation 7.

$$Precision = \frac{|L_r \cap L_{rev}|}{|L_r|}, \qquad Recall = \frac{|L_r \cap L_{rev}|}{|L_{rev}|} \qquad (7)$$
To obtain as good an assessment as possible, the F-score is preferable, as both precision and recall are needed. Recall on its own is very easy to manipulate by increasing the set of recommended items $L_r$ to all available items. The intersection $|L_r \cap L_{rev}|$ would then be equal to the length of $L_{rev}$, and recall would take the value 1. Using the F-score to evaluate accuracy prevents this problem. The formula for the F-score is shown in equation 8.

$$F\text{-}score = \frac{2 \cdot Precision \cdot Recall}{Precision + Recall} \qquad (8)$$
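The classification measures can be sketched in Python as follows (a minimal illustration, not the thesis's R code). The function name, the dictionary layout (item to rating) and the example values are hypothetical. Returning 0 when a denominator is empty is an assumption of this sketch.

```python
def precision_recall_f(predicted, actual, threshold=4):
    """Precision, recall and F-score with 'rating >= threshold' as relevance."""
    recommended = {i for i, r in predicted.items() if r >= threshold}  # L_r
    relevant = {i for i, r in actual.items() if r >= threshold}        # L_rev
    hits = len(recommended & relevant)
    precision = hits / len(recommended) if recommended else 0.0
    recall = hits / len(relevant) if relevant else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    return precision, recall, 2 * precision * recall / (precision + recall)

actual = {"a": 5, "b": 4, "c": 2, "d": 4}
predicted = {"a": 4.2, "b": 3.1, "c": 4.8, "d": 4.0}
print(precision_recall_f(predicted, actual))

# Recommending every item drives recall to 1 but lowers precision.
recommend_all = {item: 5 for item in actual}
print(precision_recall_f(recommend_all, actual))
```

In the first call two of the three recommended items are relevant and two of the three relevant items are recommended, so precision, recall and F-score are all 2/3. In the second call recall is 1 while precision drops to 0.75, and the F-score of about 0.86 reflects both.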
All of the calculations were done in the software R without the use of any CF-specific packages. Note that the code was adjusted so that the results can be calculated without relying on items having been rated by several users. If there are no co-rated items between two users, the value of the similarity measure is set to 0. In these cases, the prediction for an item is the rounded mean of the ratings of the given user.

4 Results
The calculations were done with k = 5, 10, 15, 20, 25, 50, 75, 95. With this range of k values it is possible to see whether users that are ranked lower in similarity to a given user, and are potentially dissimilar to that user, still make a difference. Rating predictions were then made with the information provided by the k neighbours. The qualitative and quantitative accuracy of the predictions are shown in this chapter.
Notation for the figures below:
BC is $s_{JaccBCF}(u, v)$ with $loc_{cor}$.
HC is $s_{JaccBCF}(u, v)$ combined with $H(p_1, p_2)$.
SHC is $s_{JaccBCF}(u, v)$ combined with $HS(p_1, p_2)$.
4.1 MAE
Figure 2 displays the mean absolute error of each method. The results are somewhat lackluster for the methods of interest, those based on the Hellinger distance. The similarity coefficients BC, HC and SHC have almost identical MAE values and show no substantial improvement as k increases beyond 25. The best performing methods for the given dataset seem to be PC and cosine similarity. As k increases, the error rate of both seems to decrease. CPC shows the biggest difference between k values: it starts off well behind every other method but rapidly lowers its MAE.
4.2 RMSE
Figure 3 shows a similar trend for BC, HC and SHC: the results of the three are almost identical and they are the worst performing methods. Cosine similarity is the method that benefits the most from the information gained by increasing the number of closest neighbours k, while JMSD peaks at 75 neighbours and drops off at 95.
Figure 2: Mean absolute error
Figure 3: Root Mean Squared Error
4.3 F-score
The classification accuracy of the different methods is shown in Figure 4. The best performing methods are BC, HC and SHC, whose results are, once again, practically identical. All three methods peak at 20 users and stagnate afterwards. Cosine similarity does not do too well compared to the others in the beginning. However, as k increases, so does its accuracy. The only method that struggles considerably is JMSD, as its accuracy drops off heavily as k increases.
Figure 4: F-score
5 Discussion
From the results, there was no visible difference between the BC- and HC-based methods. As noted in the results, the curve stagnated after k = 25. For the Hellinger distance and Bhattacharyya coefficient-based similarity measures, the five users found to be the most similar to the 25 users in the test data were the ones with the most items rated. There were a bit over 2000 rating values for the user that rated the most, while the user with the lowest number of rated items had only 17 (20 from the beginning, but 3 were moved to the test data). It is reasonable to assume that most of the weight in the predictions came from those five users. So, the higher the value of k, the less impact the added information had, until it eventually did not matter whether more users were taken into consideration.
There was no clear trend for the cosine similarity in which users were measured as most similar. Some of the users had many co-rated items with their closest neighbours, while others had only one, since in that case the similarity measure takes the value 1 (the one co-rated item problem). The same can be said about PC. However, this was not necessarily the case for CPC, as the value of this measure can be higher than 1, which would be higher than the value for one co-rated item.
No specific trend was found for the neighbours based on the JMSD similarity measure.
Overall, the results showed that there was no clear advantage in using the Hellinger distance-based method over the others. More popular methods, like cosine similarity, had better predictive accuracy than HC while being a lot more computationally efficient. While it took over 48 hours for HC to run its calculations, the methods based on co-rated items finished in just a couple of minutes. If the data were scaled up to the full MovieLens dataset containing 27 000 000 ratings, the difference in computation time would be too big. Even though HC showed some superiority in classification accuracy, which is the main point of a recommender system (recommending items to users that they will like), it still does not seem worthwhile to replace other methods with it. When discussing the results, it should be noted that some randomness might be present because a smaller dataset was used (the sampling error of the probability distribution estimates can be higher).
For future studies, it would be interesting to try out the SHC-based method on the full dataset of 600 users. It would also be interesting to test HC and SHC on a sparse dataset, as Bhattacharyya coefficient-based CF has already shown good results in that scenario. As noted earlier, HC takes away some of the randomness that comes with BC not being a metric. If equation 2 is to be used as a method, there is an argument for using the Hellinger distance instead.
References
M Ekstrand, J Riedl, and J Konstan. Collaborative filtering recommender systems. Foundations and Trends in Human-Computer Interaction, 4(2):81–
G Greegar and CS Manohar. Global response sensitivity analysis using probability distance measures and generalization of Sobol's analysis. Probabilistic Engineering Mechanics, 41:21–33, 2015.
R Guns, C Lioma, and B Larsen. The tipping point: F-score as a function of the number of retrieved items. Information Processing & Management, 48(6):1171–1180, 2012.
M Harper and J Konstan. The MovieLens datasets: History and context. ACM Transactions on Interactive Intelligent Systems (TiiS), 5(4):1–19, 2015.
T Kailath. The divergence and Bhattacharyya distance measures in signal selection. IEEE Transactions on Communication Technology, 15(1):52–60, 1967.
B Patra, R Launonen, V Ollikainen, and N Sukumar. A new similarity measure using Bhattacharyya coefficient for collaborative filtering in sparse data. Knowledge-Based Systems, 82:163–177, 2015.
P Wijayatunga. A geometric view on Pearson's correlation coefficient and a generalization of it to non-linear dependencies. Ratio Mathematica, 30:3–21, 2016.